Version Control with Git & GitHub

Version Control

Hands up - who has heard of version control software?
What do you think it does?


What is Version control? 🤔

The management of changes to documents, computer programs, large web sites, and other collections of information.

Examples:

  • Numbering of book editions
  • Wikipedia’s Page history


Where did it come from?

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and more complicated, when the era of computing began.

Elements of a Version Control system

  • Changes are usually identified by a number or letter code, termed the “revision number”

  • Each revision is associated with a timestamp and the person making the change.

  • Only changes to a file are recorded rather than saving a whole new copy.

  • Revisions can be compared, restored, and with some types of files, merged.
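As a rough illustration of these elements in git (introduced below), the following sketch builds a throwaway repository with two revisions of one file, then lists, compares and restores them. The file name, identity and commit messages are made up for the example, and git identifies revisions by a hash rather than a running number.

```shell
# Throwaway repository so nothing in your real projects is touched.
demo_log=$(mktemp -d)
cd "$demo_log"
git init -q
git config user.name "Jane"               # hypothetical identity, local to this repo
git config user.email "jane@example.org"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "second draft" > notes.txt
git commit -q -am "Revise notes"

# Each revision: abbreviated hash, author, date and message.
git --no-pager log --format='%h | %an | %ad | %s'

# Compare the two revisions of the file.
git --no-pager diff HEAD~1 HEAD -- notes.txt

# Restore the older revision of the file into the working directory.
git checkout HEAD~1 -- notes.txt
cat notes.txt
```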


What is git? 🤔

Open source (free to use) Version control software. Usually accessed via the command line, or a client program.

Where did it come from?

Git development began in 2005, after many developers of the Linux kernel gave up access to BitKeeper (a proprietary system that was, at the time, the best available).

Why use it in research?

Exhibit A

Image: xkcd CC BY-NC 2.5

Exhibit B

Image: “Piled Higher and Deeper” by Jorge Cham www.phdcomics.com

Image: “Piled Higher and Deeper” by Jorge Cham www.phdcomics.com

What is GitHub? 🤔

A website that allows you to store your Git repositories online and makes it easy to collaborate with others. It also provides other services like issue (bug) tracking and wikis. Similar services are GitLab and Bitbucket.

Why use it in research:

To enable collaboration and track contributions

images: Mozilla Science Lab CC-BY 4.0


Anatomy of a GitHub Repo

  • Readme files. Create a README.md file to explain what your project is, and how to install and use it. README.md is the file that is automatically displayed when you open a GitHub repo.

  • License. Without some sort of licence, the contents of the repository are technically closed. Some licences allow users to do almost anything they like with the code - these are known as permissive licences. Examples are the MIT License and the Apache License 2.0.

  • Contributing guide - make a file called CONTRIBUTING.md and add guidelines for contributors so they know what they should do if they want to help you out.

  • Code of Conduct - good projects have codes of conduct to make sure that people are treated well. GitHub has a Code of Conduct wizard to make it easy to add one.

  • Issues - use GitHub issues to record and discuss tasks.


Git, GitHub & RStudio

Before: git only through the terminal

RStudio & usethis to the rescue!

  • Initialise an RStudio project with Git by just checking a box!

  • visual panel to easily see the status of all your files

  • interactive navigation through file version history

Version Control Demo

How does Git work?

When a local directory is initialised with git, a hidden .git folder is added to it.

It’s now called a repository.

  • New copies of files you tell git to track will be added to that .git folder.

  • After adding, git will track any modifications to those files
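For the curious, this is roughly what happens on the command line when a directory is initialised. It is a minimal sketch in a temporary folder standing in for your project; the variable name is arbitrary.

```shell
project=$(mktemp -d)    # stand-in for your project folder
cd "$project"
ls -A                   # nothing yet: a plain, empty directory

git init -q             # turn the directory into a repository
ls -A                   # the hidden .git folder is now listed

# Git keeps its history and settings inside .git
ls .git
```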

first commit - whole file added

Any file unknown to git will have a yellow ? box next to it.

The first time you commit a file you are adding it to .git, effectively telling it to start tracking the file

second commit - only difference highlighted

On later commits, only the changes to a file are shown, and any file that has uncommitted modifications is shown with a blue M.

When all changes have been committed, the git panel is clear.
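The same states can be seen in the terminal with git status. The sketch below assumes a throwaway repository and a made-up analysis.R file: the short status prints ?? for a file unknown to git and M for a tracked file with uncommitted modifications, mirroring the yellow ? and blue M in the RStudio panel.

```shell
status_demo=$(mktemp -d)
cd "$status_demo"
git init -q
git config user.name "Jane"               # hypothetical identity, local to this repo
git config user.email "jane@example.org"

echo "x <- 1" > analysis.R
git status --short        # "?? analysis.R": the file is unknown to git

git add analysis.R
git commit -q -m "Add analysis script"
git status --short        # no output: everything is committed

echo "y <- 2" >> analysis.R
git status --short        # " M analysis.R": tracked file with uncommitted changes
git --no-pager diff       # only the added line is shown, not the whole file
```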

Enough theory, how about in practice!

Configure git & GitHub

Configure git

First, git needs to know who you are so your commits can be attributed to you. usethis to the rescue again!

Check your configuration

usethis::git_sitrep()
── Git global (user) 
• Name: <unset>
• Email: <unset>
✖ Git user's name and email are not set. Configure using `use_git_config(user.name = "<your name>", user.email = "<your email>")`.
• Global (user-level) gitignore file: <unset>
• Vaccinated: FALSE
ℹ See `?git_vaccinate` to learn more
ℹ Defaulting to 'https' Git protocol
• Default Git protocol: 'https'
• Default initial branch name: <unset>

── GitHub user 
• Default GitHub host: 'https://github.com'
• Personal access token for 'https://github.com': <unset>
• To create a personal access token, call `create_github_token()`
• To store a token for current and future use, call `credentials::set_github_pat()`
ℹ Read more in the 'Managing Git(Hub) Credentials' article:
  https://usethis.r-lib.org/articles/articles/git-credentials.html

── Active usethis project: '/cloud/project' ──

ℹ Active project is not a Git repo

Set your configuration

Use your GitHub username and the email you used to sign up on GitHub.

usethis::use_git_config(
    user.name = "Jane",
    user.email = "jane@example.org")
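For comparison, the underlying git setting can be sketched on the command line. To keep the example harmless it writes to a throwaway repository with --local; replacing --local with --global would set the identity for every repository on the machine. The name and email are the same placeholders as above.

```shell
config_demo=$(mktemp -d)
cd "$config_demo"
git init -q

# --local confines the setting to this throwaway repository.
git config --local user.name "Jane"
git config --local user.email "jane@example.org"

# Read the values back to confirm what git will attach to commits made here.
git config --get user.name     # Jane
git config --get user.email    # jane@example.org
```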

Create a GitHub PAT

To authenticate with GitHub, you’ll also need a personal access token (PAT). GitHub no longer accepts password-based authentication for Git operations, i.e. you cannot send your username and password every time you push or pull. Here, I’m referring to the username and password you would use to log in to GitHub in the browser.

What should you do instead?

Get a personal access token (PAT) and use that as your credential for HTTPS operations. (The PAT will actually be sent as the password and the username is somewhat artificial, consulted only for credential lookup.)

How to get a PAT?

GitHub offers instructions for creating a personal access token.

The usethis package has a helper function, usethis::create_github_token(), that takes you to the web form to create a PAT, with the added benefit that it pre-selects the recommended scopes:

usethis::create_github_token()
● Call `credentials::set_github_pat()` to register this token in the local Git credential store
  It is also a great idea to store this token in any password-management software that you use
✔ Opening URL 'https://github.com/settings/tokens/new?scopes=repo,user,gist,workflow&description=DESCRIBE THE TOKEN\'S USE CASE'

This will open the GitHub page where you generate your PAT.

Give your PAT a descriptive name. I’m calling mine POSIT_CLOUD as that’s the only place I’m going to use it and, for security reasons, I’m setting an expiration date of 7 days, just long enough to complete the course.

Once you are happy with the selected scopes, scroll to the bottom and click “Generate token”.

You will be taken to a page that shows your token. Make sure you copy it to your clipboard!

As the page says, you must store this token somewhere, because you’ll never be able to see it again, once you leave that page or close the window.

Warning

Do not ever hard-wire your PAT into your code! A PAT should always be retrieved implicitly, for example, from the Git credential store or from an environment variable.
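To make the idea of implicit retrieval concrete, here is a small sketch of a script that reads the token from the GITHUB_PAT environment variable and refuses to continue when it is missing. The function name require_pat is hypothetical and the token value is an obvious fake.

```shell
# Hypothetical helper: fail early if no token is available,
# rather than pasting the token into the script itself.
require_pat() {
  if [ -z "${GITHUB_PAT:-}" ]; then
    echo "GITHUB_PAT is not set" >&2
    return 1
  fi
  # Never print the token itself; only report that one was found.
  echo "Found a token of ${#GITHUB_PAT} characters"
}

# Demo in a subshell with a fake placeholder value.
( GITHUB_PAT="not-a-real-token"; require_pat )
```

The token never appears in the file, so the script can be committed and shared safely.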

Store your credential

Below, we will add your PAT to the Posit Cloud Git credential cache as a semi-persistent convenience, sort of like “remember me” on a website.

YOU MUST ALSO STORE A SECURE COPY SOMEWHERE, either in a password manager or locally on your computer for now. WE WILL NEED IT IN FUTURE SESSIONS.

Add to the file and save.

Store with credentials package

As of March 2024, the R package recommended here for accessing the Git credential store is credentials.

If you don’t have credentials installed, install via install.packages("credentials").

Then call credentials::set_github_pat():

credentials::set_github_pat()

credentials::set_github_pat() is a very handy function, since it populates the GITHUB_PAT environment variable using the Git credential manager, which git itself uses for storing passwords. The credential manager returns stored credentials if available, and securely prompts the user for credentials when needed.

Respond to the prompt with your personal access token (PAT).

You can check that you’ve stored a credential with credentials::set_github_pat():

credentials::set_github_pat()
If prompted for GitHub credentials, enter your PAT in the password field
Using GITHUB_PAT from Anna Krystalli (credential helper: cache)

TODO store your PAT with a password manager

Treat this PAT like a password! Currently, we’ve only stored a copyable form of the PAT in the Posit Cloud credential cache. This cache is cleared regularly, so if we don’t also store the PAT somewhere else securely, it will be lost!

If you use a password management app, such as 1Password or LastPass (which you should), it is highly recommended to store this PAT in it as a password. This way, you can easily retrieve it when needed but also manage it securely.

If for some reason you ever generate a PAT but fail to capture it on your system, you’ll have to generate another one.

This is not the end of the world, but you should delete the “lost” PAT on GitHub. If you aren’t disciplined about labelling PATs and deleting lost PATs, you will find yourself in an unsettling situation where you can’t be sure which PAT(s) are in use.

Caution

For security reasons, Posit Cloud only caches the following configurations temporarily. While more secure, it does mean we’ll need to perform these steps in new projects and every 12 hours in older projects.

On your local system, however, you will only need to do this once.

To allow your local RStudio application to communicate seamlessly with GitHub when pushing and pulling through the GUI, I also recommend you go a step further and set up an SSH key. This will allow you to authenticate with GitHub without having to enter your username and password every time.

See Set up keys for SSH in Happy Git with R as well as the credentials package vignette for more information.

Version Controlling projects

Turn our project into a repository

If you didn’t initialise git at the beginning of your project, you can do so now with usethis::use_git():

Warning

This will try to commit everything in the repo so far in one go!

Override that behaviour by SELECTING A NEGATIVE RESPONSE WHEN ASKED.

usethis::use_git()
✔ Initialising Git repo
✔ Adding '.Rhistory', '.RData' to '.gitignore'
There are 9 uncommitted files:
* '.gitignore'
* '.Rbuildignore'
* 'analysis.R'
* 'data-raw/'
* 'data/'
* 'report.html'
* 'report.Rmd'
* 'R/'
* 'project.Rproj'
Is it ok to commit them?

1: Negative
2: Nope
3: I agree

Selection: 

Next, allow RStudio to restart when asked:

● A restart of RStudio is required to activate the Git pane
Restart now?

1: Absolutely
2: Absolutely not
3: Negative

The git panel should now be visible in the top right of your RStudio window.

Let’s now check our git setup again:

usethis::git_sitrep()
── Git global (user) 
• Name: 'Anna Krystalli'
• Email: 'annakrystalli@googlemail.com'
• Global (user-level) gitignore file: <unset>
• Vaccinated: FALSE
ℹ See `?git_vaccinate` to learn more
ℹ Defaulting to 'https' Git protocol
• Default Git protocol: 'https'
• Default initial branch name: <unset>

── GitHub user 
• Default GitHub host: 'https://github.com'
• Personal access token for 'https://github.com': <unset>
• To create a personal access token, call `create_github_token()`
• To store a token for current and future use, call `credentials::set_github_pat()`
ℹ Read more in the 'Managing Git(Hub) Credentials' article:
  https://usethis.r-lib.org/articles/articles/git-credentials.html

── Active usethis project: '/cloud/project' ──

── Git local (project) 
• Name: 'Anna Krystalli'
• Email: 'annakrystalli@googlemail.com'
✖ Default branch cannot be determined.
• Current local branch -> remote tracking branch:
  '<unset>' -> '<unset>'

── GitHub project 
ℹ Project does not use GitHub

Review git status

In our project, let’s have a look at the RStudio Git tab. It shows all the files currently in the root of our project that git is aware of. The yellow ? indicates that none of the files have been added to git yet, i.e. git is aware of them but none are being formally tracked for changes.

Ignoring files through .gitignore

There may be files that we don’t want to commit to git, e.g.

  • data files that are too large

  • documents with sensitive information (e.g. authorisation tokens)

  • intermediate files that we don’t need to save copies of.

We can tell git to ignore them by adding them to the .gitignore file. By doing this upfront, we minimise the danger of committing them to our project’s history by accident.

Open .gitignore

When we open .gitignore we see there are a number of standard files already added.

.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store
.quarto

These are standard files it’s best practice to never commit to version control, especially if we are going to share our repository online.

Let’s go ahead and add the rest of the files we want to ignore.

Add files and folders to .gitignore

.gitignore patterns

You can use glob patterns (shell-style wildcards such as *, rather than full regular expressions) in .gitignore files to ignore files according to a pattern.

  • directoryname/* will ignore all files in a directory.

  • *.html will ignore any file ending in .html

  • a “!” prefix negates the pattern, re-including files that an earlier pattern excluded

So let’s use patterns to:

  • ignore all files in attic/

  • ignore all files in data-raw/wood-survey-data-master/ apart from the NEON logo (neon-logo.png) and references.bib, which are required to successfully render our report!

To do so, add the following to the bottom of .gitignore

attic/*
data-raw/wood-survey-data-master/*
!data-raw/wood-survey-data-master/neon-logo.png
!data-raw/wood-survey-data-master/references.bib

You should notice that once the .gitignore file is saved, the files and directories we’ve added to it are no longer visible in the git panel as options to add to a commit.
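If you want to check the effect of these patterns outside the git panel, git can list which untracked files it ignores and which it would still offer for a commit. The sketch below recreates the folder layout in a temporary directory; old-script.R and raw-table.csv are made-up file names standing in for whatever your folders contain.

```shell
ignore_demo=$(mktemp -d)
cd "$ignore_demo"
git init -q

mkdir -p attic data-raw/wood-survey-data-master
touch attic/old-script.R                                # hypothetical file
touch data-raw/wood-survey-data-master/raw-table.csv    # hypothetical file
touch data-raw/wood-survey-data-master/neon-logo.png
touch data-raw/wood-survey-data-master/references.bib

cat > .gitignore <<'EOF'
attic/*
data-raw/wood-survey-data-master/*
!data-raw/wood-survey-data-master/neon-logo.png
!data-raw/wood-survey-data-master/references.bib
EOF

# Untracked files that the patterns exclude.
git ls-files --others --ignored --exclude-standard

# Untracked files git would still offer for a commit
# (includes .gitignore itself and the two re-included files).
git ls-files --others --exclude-standard
```

Note that the negation only works because the pattern ignores the files inside the folder (folder/*) rather than the folder itself.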

Commit project files

Add files

The first step of a commit is to add the files we want to be part of the commit to the staging area.

To do so, just tick the box next to any files and/or directories in the git pane.

Given we want to commit everything, as git is ignoring files we are not interested in, it is safe to add all files to the staging area!

Tip

Note how when the data-raw folder is selected, only 3 files are added. That’s .gitignore in action!

Commit changes

Write an appropriate commit message (e.g. “Initial commit”) and click the commit button:

Once the commit is complete, the git panel should be clear:
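Ticking the boxes and clicking commit correspond roughly to git add and git commit in the terminal. A minimal sketch, assuming a throwaway repository with two made-up files:

```shell
commit_demo=$(mktemp -d)
cd "$commit_demo"
git init -q
git config user.name "Jane"               # hypothetical identity, local to this repo
git config user.email "jane@example.org"
echo "x <- 1" > analysis.R
echo "# notes" > README.md

git add --all                       # stage everything that is not ignored
git status --short                  # "A " marks files staged for their first commit
git commit -q -m "Initial commit"   # record the staged snapshot
git status --short                  # no output: nothing left to commit
git --no-pager log --oneline        # the history now holds one commit
```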

Create a README

Our repository also needs a README. We only need a simple plain markdown (.md) file for our README.

We can create a template using usethis::use_readme_md()

usethis::use_readme_md()

Edit README

Adapt the template, adding a short description about your project, including local links to your report and dataspice information page.

I’ve added the following:

# wood-survey

<!-- badges: start -->

<!-- badges: end -->

The goal of wood-survey is to analyse the relationship between stem diameter and height in NEON vegetation survey data.

A report of our analysis can be found [here](report.html)

Details about the data used can be found [here](data/index.html)

Add and commit your new README

Create repository on GitHub

So far we’ve only been working with a local version (i.e. only available to us in Posit Cloud) of our project. Now we want to share it with the world through GitHub!

Because we have set up a GITHUB_PAT, we can use the function usethis::use_github() to create a GitHub repository for our project:

usethis::use_github(protocol = "https")
✔ Creating GitHub repository 'annakrystalli/project'
✔ Setting remote 'origin' to 'https://github.com/annakrystalli/project.git'
✔ Pushing 'master' branch to GitHub and setting 'origin/master' as upstream branch
✔ Opening URL 'https://github.com/annakrystalli/project'

Once the repo is created and any committed files pushed, the repo is launched in the browser:
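The "setting remote" and "pushing" steps in that output have plain git counterparts. The sketch below imitates them offline, using a local bare repository as a stand-in for GitHub; it does not create anything on GitHub itself, which usethis does for you through the GitHub API. All paths are temporary and hypothetical.

```shell
work=$(mktemp -d)
remote="$work/remote.git"       # local stand-in for the GitHub repository
git init -q --bare "$remote"

mkdir "$work/project"
cd "$work/project"
git init -q
git config user.name "Jane"               # hypothetical identity, local to this repo
git config user.email "jane@example.org"
echo "# wood-survey" > README.md
git add README.md
git commit -q -m "Initial commit"

# With GitHub, the URL would look like https://github.com/<username>/<repo-name>.git
git remote add origin "$remote"
git push -q -u origin HEAD      # push the current branch and set it as upstream

git remote -v                   # where "origin" points
git status --short --branch     # local branch and the remote branch it tracks
```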

Host HTML content on GitHub

Let’s head to the repo and have a look at what we’ve shared. To host our html content on GitHub, we need to enable gh-pages in our repository.

Enable gh-pages

Go to repo Settings and navigate to the Pages panel on the left hand side and change your settings to the following configuration:

- Source: Deploy from a branch.

- Branch: master

- Folder: / (root)

- Ensure the Enforce HTTPS option is selected.

and click Save.

Check Deployed Site

Once the site has been successfully deployed, you will see a message like this (you might need to refresh the page):

Click on the link displayed and go check out your work!

Note

Note that my URL looks a bit different to yours. That’s because I’m redirecting all my GitHub pages to my personal domain. Yours will be something like https://<username>.github.io/<repo-name>

Landing page

The landing page of your site will be the README.md file (although you could override this if the root of your project contained an index.md or index.html file).

From there you can click through to the rendered report page (or you can add /report.html to the end of the URL to go directly to the report) and the data documentation page.

Report

Scroll through and check out the interactive features like the table of contents, code folding and table, figure and citation previews when hovering over their mentions!

Add URL to repo About panel

A useful feature of GitHub is the ability to include a link to any hosted content in the About panel of your repository. This is a great way to direct visitors to the rendered content of your project.

To add the URL to the About panel, click on the button next to About and check the Use your GitHub Pages website box:

Once added, visitors can easily navigate to the hosted content from the repo landing page:

Git tips

  • commit early, commit often
  • commit logical bits of work together
  • write meaningful messages

Further Resources

Never forget
