Introduction to Data Science

Data science

A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.

  • Glassdoor consistently ranks data scientist among the top 3 best jobs in America.

Who is hiring?

The following tables are based on a survey of 403 students who earned a master’s degree in statistics, biostatistics, or a related field (actuarial science, data science, informatics, math with stats focus) during the 2019–2020 academic year.

Source: AmStat News (2021 Nov).

There were more than 109 unique, although similar, job titles. The most common were data scientist (20), biostatistician (18), data analyst (9), biostatistician I (7), and statistician (5).

A typical data scientist on LinkedIn

A position posted by Genentech.

Why Jupyter (Julia/Python/R)?

Julia: Walk Like Python; Run Like C.

What's this workshop about?

Source: R for Data Science.

  • Hands-on experience with a typical data science workflow using Julia, Python, and R.

data ingestion (from text files or databases) -> data wrangling (filtering, selecting, merging, pivoting) -> data visualization (static, interactive) -> data analytics

  • Module 1 practices the workflow starting from text files.

  • Module 2 practices the workflow starting from genomic data.

  • The point is not to memorize every command, but to gain a high-level understanding of the workflow and appreciate how easy data manipulation is in these languages.
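To make the workflow stages above concrete, here is a minimal Python sketch using only the standard library. The CSV content, column names (patient_id, age, unit, heart_rate), and helper function names are all made up for illustration; they are not taken from the workshop material, and the workshop's own tutorials may use different tools.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical toy data standing in for a text file; not real patient records.
RAW_CSV = """patient_id,age,unit,heart_rate
1,64,MICU,88
2,71,SICU,102
3,17,MICU,95
4,58,SICU,76
5,80,MICU,110
"""


def ingest(text):
    """Data ingestion: parse CSV text into a list of typed records."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": int(row["patient_id"]),
            "age": int(row["age"]),
            "unit": row["unit"],
            "heart_rate": float(row["heart_rate"]),
        })
    return rows


def wrangle(rows, min_age=18):
    """Data wrangling: keep adults only, then group heart rates by unit."""
    groups = defaultdict(list)
    for r in rows:
        if r["age"] >= min_age:
            groups[r["unit"]].append(r["heart_rate"])
    return dict(groups)


def summarize(groups):
    """Data analytics: mean heart rate per unit."""
    return {unit: mean(values) for unit, values in groups.items()}


def text_bar_chart(summary, scale=10):
    """Data visualization (static, text-only): one bar per unit."""
    return [
        f"{unit:<5} {'#' * round(value / scale)} {value:.1f}"
        for unit, value in sorted(summary.items())
    ]


rows = ingest(RAW_CSV)
summary = summarize(wrangle(rows))
for line in text_bar_chart(summary):
    print(line)
```

Each function corresponds to one arrow-separated stage, so a stage can be swapped out (for example, reading from a database instead of text) without touching the others.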

MIMIC data

Source: URL

  • Tutorials in Module 1 use MIMIC-IV, an intensive care electronic health record (EHR) dataset curated from over 40,000 patients at the Beth Israel Deaconess Medical Center (BIDMC) in Boston.

  • Suppose you are the chief data scientist at BIDMC and are tasked with developing a predictive model to improve the 30-day mortality rate, a key factor when ranking hospitals.

  • Given the basic characteristics (e.g., demographics), vital signs, and initial lab tests of a newly admitted ICU patient, can we predict the chance the patient dies within 30 days of the admission?

    How do we create a meaningful cohort for this predictive modeling task?
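One possible way to think about the cohort question is sketched below in Python. It assumes, purely for illustration, that the cohort is each adult patient's first admission and that the outcome is death within 30 days of that admission. The records, field names (patient_id, age, admit_time, death_time), dates, and inclusion rules are hypothetical and are not the actual MIMIC-IV schema or the workshop's chosen cohort definition.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical admissions; dates are arbitrary and not real patient data.
admissions = [
    {"patient_id": 1, "age": 64, "admit_time": "2150-01-01 08:00:00",
     "death_time": "2150-01-20 13:00:00"},
    {"patient_id": 1, "age": 64, "admit_time": "2150-01-10 10:00:00",
     "death_time": "2150-01-20 13:00:00"},
    {"patient_id": 2, "age": 71, "admit_time": "2150-03-05 22:15:00",
     "death_time": None},
    {"patient_id": 3, "age": 58, "admit_time": "2150-06-10 09:30:00",
     "death_time": "2150-08-01 00:00:00"},
    {"patient_id": 4, "age": 16, "admit_time": "2150-02-02 12:00:00",
     "death_time": None},
]


def parse_time(text):
    """Convert a timestamp string into a datetime object."""
    return datetime.strptime(text, TIME_FORMAT)


def build_cohort(records, min_age=18):
    """Keep only the earliest admission of each adult patient (assumed rule)."""
    first_admission = {}
    for record in sorted(records, key=lambda r: parse_time(r["admit_time"])):
        if record["age"] >= min_age and record["patient_id"] not in first_admission:
            first_admission[record["patient_id"]] = record
    return list(first_admission.values())


def died_within(record, days=30):
    """Outcome label: True if death occurred within `days` of admission."""
    if record["death_time"] is None:
        return False
    elapsed = parse_time(record["death_time"]) - parse_time(record["admit_time"])
    return timedelta(0) <= elapsed <= timedelta(days=days)


cohort = build_cohort(admissions)
labels = {record["patient_id"]: died_within(record) for record in cohort}
print(labels)
```

Restricting to the first admission keeps one row per patient, so repeated admissions of the same person do not leak into both training and evaluation data; other choices (for example, all ICU stays) are equally defensible depending on the question.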

How to get started

Follow the instructions on the workshop website.