Practical Data Science

Foundations

CSCI 1109 is a hands-on introduction to practical data science. Students learn how to acquire, clean, analyze, and visualize data in Python, while developing a critical understanding of uncertainty, modeling, ethics, and data-driven decision-making.

The course combines programming, mathematics, statistics, and applied case studies. It begins with Python and pandas, then moves through data cleaning and preprocessing, descriptive statistics and visualization, probability and causal thinking, introductory machine learning, clustering and network analysis, and responsible data science.

What students learn

By the end of the course, students will be able to:

  • write and debug Python programs for data analysis using tools such as pandas, NumPy, and Matplotlib
  • formulate data-driven questions and design analyses to investigate them
  • apply core ideas from descriptive and inferential statistics
  • build and evaluate introductory machine learning models for regression, classification, and clustering
  • create and critique data visualizations for exploration and communication
  • analyze simple networks using NetworkX
  • identify limitations, biases, and ethical risks in data science workflows
  • communicate reproducible analyses clearly

Course structure

The course is organized around a progression from foundations to applications:

  • Python and tabular data: Python basics, notebooks, pandas, tidy data, grouping, and reshaping
  • Cleaning and preparation: missingness, duplicates, outliers, feature creation, and preprocessing pipelines
  • Math, statistics, and visualization: vectors and matrices, descriptive statistics, bootstrapping, plotting, and storytelling
  • Inference and causality: probability, p-values, A/B testing, confounding, and correlation vs. causation
  • Machine learning: train/validation/test splits, k-NN, regression, classification, clustering, and evaluation
  • Networks, ethics, and communication: graph analysis, fairness, privacy, responsible data science, and communicating results
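
To give a flavor of how the early units connect, here is a minimal sketch that drops missing values, computes a descriptive statistic, and uses bootstrapping to express uncertainty about it. It is an illustration only, not course material: the data values, the function name bootstrap_mean_ci, and its parameters are all hypothetical, and it uses only the Python standard library so that it runs anywhere, whereas the course itself works with pandas and NumPy.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(values, k=n)
        means.append(statistics.fmean(resample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up measurements; None marks a missing entry.
raw = [12.0, None, 15.5, 14.0, None, 13.2, 16.8, 12.9, 15.1, 14.4]

# Cleaning step: drop missing values (one simple strategy among several).
clean = [x for x in raw if x is not None]

# Descriptive statistic plus a bootstrap interval around it.
sample_mean = statistics.fmean(clean)
low, high = bootstrap_mean_ci(clean)
print(f"mean = {sample_mean:.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

Dropping missing values is only one way to handle missingness, and the percentile interval is only one kind of bootstrap interval; both were chosen here because they are the shortest to write down.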