Overview of Git and GitHub

Workshop: Data Science Tools

Tyler George, Cornell College

Why Git/GitHub?

What do I name my fileeeeee?

  • Have your students ever named files in strange ways?
  • 3 years ago I oversaw 3 students doing research with COVID-19 data…
  • Their file names included…

What is wrOOng with this?

  • In the chat, guess what problems arose later in the project

Funny, but what now?

I was slightly amused at the time, but it led to bad outcomes

  • Team members didn’t know which script file contained what
  • They confused versions of data, scripts, and papers
  • They built models on outdated versions of the data
  • In the end, their work was not trustworthy

(Some) Important Components of Good Analysis

  • Reproducibility

  • Collaboration

Git and GitHub

What is Git?

“Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.” (git, https://git-scm.com/)

  • a piece of software

  • version control

  • operates in a terminal

Google Docs Version Control (1/2)

Google Docs Version Control (2/2)

What is GitHub?

“GitHub is a developer platform that allows developers to create, store, manage and share their code. It uses Git software” (Wikipedia, 2024, https://en.wikipedia.org/wiki/GitHub)

A service that hosts your repository (files/folders) online

A collaboration tool (1/2)

Source: Mozilla Science Lab, A Friendly GitHub Intro Workshop

A collaboration tool (2/2)

Source: Mozilla Science Lab, A Friendly GitHub Intro Workshop

An app (GitHub Desktop)

Research on Using GitHub in the Classroom

  • Dogucu, M., & Çetinkaya-Rundel, M. (2022). Tools and Recommendations for Reproducible Teaching. Journal of Statistics and Data Science Education, 0(ja), 1–25. https://doi.org/10.1080/26939169.2022.2138645

  • Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J., & Tackett, M. (2021). Implementing Version Control With Git and GitHub as a Learning Objective in Statistics and Data Science Courses. Journal of Statistics and Data Science Education, 29(sup1), S132–S144. https://doi.org/10.1080/10691898.2020.1848485

  • Çetinkaya-Rundel, M., & Rundel, C. (2018). Infrastructure and Tools for Teaching Computing Throughout the Statistical Curriculum. The American Statistician, 72(1), 58–65. https://doi.org/10.1080/00031305.2017.1397549

Opinionated reasons to use Git/GitHub

  • Free
  • Coding (I use R)
  • Forced collaboration
  • Students make more deliberate choices
  • Integration in RStudio
  • GitHub Pages
  • Student job preparedness

What to expect in the GitHub Workshop?

  • Git and GitHub terminology
  • Hands-on collaboration practice: fork a repo and complete a pull request
  • Help me fix my typos
  • Additional options for educators