Python for Data Science

Why Python?

Scientific Python (SciPy) Ecosystem

Libraries are an important characteristic of how Python works: each application area has its own libraries. For scientific computing, one can combine the following packages as needed.

Package            Description
NumPy              Numerical arrays
SciPy              User-friendly and efficient numerical routines: numerical integration, interpolation, optimization, linear algebra, and statistics
Jupyter Notebook   Interactive programming environment
Matplotlib         Plotting
Pandas             Data analytics; R-like data frames
Scikit-learn       Machine learning

All the packages above are included in the Anaconda distribution of Python. If you download Anaconda, it comes with all the useful scientific programming packages. Packages used in this workshop are in bold.

Before we move on...

We have limited resources for each user on the cloud. Don't forget to shut down the kernels you are not using.

Pandas

Pandas is a Python library for working with datasets. It provides a data frame structure like the one in R.

Loading data

Let's load the icustays.csv.gz file as a pandas data frame. We need to specify in advance which columns hold date-time values.

You may press Shift-Tab to see the function documentation interactively, or type ?pd.read_csv in a code cell.
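As a minimal sketch of the loading step, the example below reads a small in-memory CSV instead of the real file, so it runs anywhere. The column names (intime, outtime and the others) and all values are assumptions made for illustration; check them against the actual header of icustays.csv.gz.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for icustays.csv.gz (made-up values,
# assumed column names).
csv_text = """subject_id,stay_id,first_careunit,intime,outtime
1,101,MICU,2150-01-01 10:00:00,2150-01-03 12:30:00
2,102,SICU,2151-06-15 08:15:00,2151-06-16 09:00:00
"""

# parse_dates lists the columns to convert to datetime values on load.
icustays_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["intime", "outtime"],
)

print(icustays_df.dtypes)
```

With the real file, passing the path "icustays.csv.gz" in place of the StringIO object should work the same way, since read_csv infers gzip compression from the .gz extension by default.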

The variable read in is an instance of DataFrame. Let's talk a little bit about what this means.

Note: Object-Oriented Programming

Python has built-in object-oriented programming (OOP) support. The OOP paradigm is based on "objects", which bundle data representing the properties of the object with code in the form of methods.
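To make the idea concrete, here is a toy class. Patient, subject_id, age, and is_adult are hypothetical names invented for this sketch and are not part of pandas or of the workshop data.

```python
class Patient:
    """A toy class: data (attributes) bundled with code (methods)."""

    def __init__(self, subject_id, age):
        # Attributes hold the data describing this particular object.
        self.subject_id = subject_id
        self.age = age

    def is_adult(self):
        # A method is a function attached to the object; it can use
        # the object's own data through self.
        return self.age >= 18


patient = Patient(subject_id=1, age=67)  # create an instance of the class
print(patient.age)         # attribute access: no parentheses
print(patient.is_adult())  # method call: parentheses
```

A DataFrame works the same way: it carries attributes such as shape and columns, and methods such as head() and describe().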

Now admissions.csv.gz:

And patients.csv.gz:

For chartevents_filtered_itemid.csv.gz, we learn how to read in only selected columns.
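A minimal sketch of reading only selected columns, again on an in-memory stand-in for the real file. The column names and the itemid values are assumptions for illustration only.

```python
import io

import pandas as pd

# Stand-in for chartevents_filtered_itemid.csv.gz (made-up values,
# assumed column names).
csv_text = """subject_id,stay_id,charttime,itemid,valuenum,valueuom
1,101,2150-01-01 10:05:00,220045,88,bpm
1,101,2150-01-01 10:05:00,223761,98.6,F
"""

# usecols keeps only the listed columns; the others are never loaded,
# which saves memory on large files.
chartevents_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["subject_id", "stay_id", "charttime", "itemid", "valuenum"],
    parse_dates=["charttime"],
)

print(chartevents_df.columns.tolist())
```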

For filtering, we can use the query method.
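A short sketch of query on made-up data. The itemid code 220045 and the threshold are illustrative assumptions, not values taken from the workshop.

```python
import pandas as pd

# Made-up measurements for illustration.
chartevents_df = pd.DataFrame({
    "stay_id": [101, 101, 102],
    "itemid": [220045, 223761, 220045],
    "valuenum": [88.0, 98.6, 120.0],
})

# The filter condition is written as a string that refers to column names.
one_item_df = chartevents_df.query("itemid == 220045")

# A Python variable can be referenced inside the string with the @ prefix.
threshold = 100
high_df = chartevents_df.query("itemid == 220045 and valuenum > @threshold")

print(one_item_df)
print(high_df)
```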

And for plotting, we use the package matplotlib.

Note: Method Chaining

One may use method chaining to linearize method calls, as above. Because the dot operator (.) is evaluated from left to right, one may "chain" another method call or attribute access right after obtaining the result of the previous one. This is a "pythonic" way of avoiding nested calls.

One limitation is that we can only do this for methods or attributes of a class. In this case, print() is not a method of DataFrame, so we cannot chain print as we did in R. One may use the pandas.DataFrame.pipe() method for such operations. There are also packages that implement a pipe by overloading another operator (e.g., the bitwise or (|) operator).
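The sketch below shows a chain in which pipe() slots an ordinary function into the middle of the method calls. The helper print_and_pass and the data are hypothetical, written only for this example.

```python
import pandas as pd


def print_and_pass(df, label):
    """Hypothetical helper: print the shape, then hand the frame on."""
    print(label, df.shape)
    return df


df = pd.DataFrame({
    "stay_id": [101, 101, 102, 103],
    "valuenum": [88.0, 92.0, 120.0, 75.0],
})

# Wrapping the chain in parentheses lets each step sit on its own line.
result = (
    df
    .query("valuenum > 80")                 # keep rows above 80
    .pipe(print_and_pass, "after filter:")  # calls print_and_pass(df, label)
    .groupby("stay_id", as_index=False)
    .agg(mean_value=("valuenum", "mean"))   # one mean per stay
)

print(result)
```

pipe() passes the data frame as the first argument of the function, so any function that takes a data frame and returns one can join the chain.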

Target cohort (from R section)

Let's continue with the task we did in R. We aim to develop a predictive model that computes the probability of dying within 30 days of the ICU stay intime, based on baseline features.

We restrict to the first ICU stays of each unique patient.

Wrangling and merging data frames

Our strategy is:

  1. Identify and keep the first ICU stay of each patient.

  2. Identify and keep the first vital measurements during the first ICU stay of each patient.

  3. Join four data frames into a single data frame.

Important data wrangling concepts: group_by, sort, slice, joins, and pivot.

Step 1: restrict to the first ICU stay of each patient

icustays_df has 76,540 rows, which is reduced to 53,150 unique ICU stays.
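One possible way to do this step is sort, group, then keep the first row of each group. The sketch assumes the columns subject_id, stay_id, and intime and uses made-up rows; it is not necessarily the exact code used in the workshop.

```python
import pandas as pd

# Made-up stays: patient 1 has two stays, patient 2 has one.
icustays_df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "stay_id": [101, 102, 201],
    "intime": pd.to_datetime(["2150-03-01", "2150-01-01", "2151-06-15"]),
})

icustays_first_df = (
    icustays_df
    .sort_values(["subject_id", "intime"])  # earliest stay first per patient
    .groupby("subject_id")
    .head(1)                                # keep the first row of each group
    .reset_index(drop=True)
)

print(icustays_first_df)
```

This mirrors the R pattern of group_by, arrange, and slice.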

Step 2: restrict to the first vital measurements during the ICU stay

Key data wrangling concepts: selecting columns, left_join, right_join, group_by, sort, pivot
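A hedged sketch of how these concepts could fit together: join to the retained stays, keep the earliest measurement per stay and item, then pivot from long to wide. The column names, the itemid codes, and the mapping of those codes to heart_rate and temp_f are assumptions for illustration.

```python
import pandas as pd

# Made-up first stays and measurements; stay 999 is not in the cohort.
icustays_first_df = pd.DataFrame({"subject_id": [1, 2], "stay_id": [102, 201]})
chartevents_df = pd.DataFrame({
    "stay_id": [102, 102, 102, 201, 999],
    "itemid": [220045, 220045, 223761, 220045, 220045],
    "charttime": pd.to_datetime([
        "2150-01-01 11:00", "2150-01-01 10:00", "2150-01-01 10:30",
        "2151-06-15 09:00", "2152-01-01 00:00",
    ]),
    "valuenum": [90.0, 85.0, 98.6, 110.0, 70.0],
})

first_vitals_df = (
    chartevents_df
    # join: keep only measurements belonging to the retained stays
    .merge(icustays_first_df[["stay_id"]], on="stay_id", how="inner")
    # earliest measurement first within each stay and item
    .sort_values(["stay_id", "itemid", "charttime"])
    .groupby(["stay_id", "itemid"])
    .head(1)
    # long to wide: one row per stay, one column per item
    .pivot(index="stay_id", columns="itemid", values="valuenum")
    .rename(columns={220045: "heart_rate", 223761: "temp_f"})
    .reset_index()
)

print(first_vitals_df)
```

In pandas, merge covers all join types through its how argument ("left", "right", "inner", "outer").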

Step 3: merge DataFrames

New data wrangling concept: mutate. Pandas equivalent is assign.
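A small sketch of merge followed by assign. The dod column (date of death), the intermediate days_to_death column, and the way thirty_day_mort is derived here are assumptions for illustration and may differ from the workshop's definition.

```python
import pandas as pd

# Made-up data; a missing dod means no recorded death.
icustays_first_df = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "intime": pd.to_datetime(["2150-01-01", "2151-06-15", "2152-01-01"]),
})
patients_df = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "dod": pd.to_datetime(["2150-01-10", None, "2152-12-01"]),
})

cohort_df = (
    icustays_first_df
    .merge(patients_df, on="subject_id", how="left")
    # assign adds new columns; a lambda receives the frame built so far,
    # so later columns can use earlier ones.
    .assign(
        days_to_death=lambda d: (d["dod"] - d["intime"]).dt.days,
        thirty_day_mort=lambda d: d["days_to_death"] <= 30,
    )
)

print(cohort_df)
```

A missing date of death gives a missing days_to_death, and the comparison then evaluates to False.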

Data visualization

Remember we want to model:

thirty_day_mort ~ first_careunit + age_intime + gender + ethnicity + heart_rate + temp_f

Let's start with a numerical summary of variables of interest.

For a numerical column, we can obtain the mean, standard deviation, and quartiles using the describe() method. For a categorical column, we obtain the number of unique values, the most frequent value, and its frequency.
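A minimal sketch of both kinds of summary on made-up values:

```python
import pandas as pd

# Made-up values for illustration.
cohort_df = pd.DataFrame({
    "age_intime": [65, 72, 48, 55],
    "first_careunit": ["MICU", "SICU", "MICU", "CCU"],
})

# Numerical column: count, mean, std, min, quartiles, max.
num_summary = cohort_df["age_intime"].describe()

# Categorical (string) column: count, unique, top, freq.
cat_summary = cohort_df["first_careunit"].describe()

print(num_summary)
print(cat_summary)
```

Calling describe(include="all") on the whole data frame summarizes both kinds of columns at once.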

Do you spot anything unusual?

To obtain the count of each value in a categorical column, we use the value_counts() method.
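For example, on a small made-up column:

```python
import pandas as pd

# Made-up values for illustration.
first_careunit = pd.Series(["MICU", "SICU", "MICU", "CCU"], name="first_careunit")

# Counts per value, sorted from most to least frequent.
counts = first_careunit.value_counts()

# normalize=True returns proportions instead of counts.
props = first_careunit.value_counts(normalize=True)

print(counts)
print(props)
```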

Univariate summaries

Before we start, let's import the seaborn package for styling the figures a little bit (looking like ggplot2). This package is for statistical data visualization.

Bar plot of first_careunit

Histogram and boxplot of age_intime

Exercises

  1. Summarize discrete variables: gender, ethnicity.
  2. Summarize continuous variables: heart_rate, temp_f.
  3. Is there anything unusual about temp_f?

Bivariate summaries

Tally of thirty_day_mort vs first_careunit.

Plotting frequencies in a stacked bar plot takes a little more code in Python.
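One way to prepare the numbers behind such a plot is a cross-tabulation. The sketch below uses made-up rows and codes thirty_day_mort as 0/1, which is an assumption made for illustration.

```python
import pandas as pd

# Made-up cohort; thirty_day_mort is coded 0 (survived) or 1 (died).
cohort_df = pd.DataFrame({
    "first_careunit": ["MICU", "MICU", "SICU", "SICU", "SICU"],
    "thirty_day_mort": [1, 0, 0, 0, 1],
})

# Counts of each outcome within each care unit.
tally = pd.crosstab(cohort_df["first_careunit"], cohort_df["thirty_day_mort"])

# Row-wise proportions: each care unit sums to 1.
row_props = pd.crosstab(
    cohort_df["first_careunit"],
    cohort_df["thirty_day_mort"],
    normalize="index",
)

print(tally)
print(row_props)
```

With matplotlib installed, calling row_props.plot(kind="bar", stacked=True) on the result draws the stacked bars.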

Tally of thirty_day_mort vs gender

Exercises

  1. Graphical summaries of thirty_day_mort vs other predictors.

Pros and Cons of Python

Pros:

Cons: