Software

R

Most computing in the lab is done in R, which requires a few downloads to use effectively.

Why R?

R is a widely used, powerful programming language with a large community of bioinformatics researchers regularly putting out new, open source packages for analyses. R is particularly powerful when it comes to statistics and graphics.

Setting up R involves three steps:

  1. Download R from its official website. You'll want to download it from the nearest mirror, which is at Duke.

  2. Download RStudio. While R comes with its own IDE (Integrated Development Environment), the version put out by the people at RStudio is better in basically every way.

  3. Learn how to use R. There are a lot of different R tutorials, but my best advice for learning is to start with a problem you want to solve. The examples found in a tutorial need to be reapplied to your own problem before you can really start to absorb them. For example, I learned R when the postdoc I was working with handed me some code and told me to generalize the functions to work on any dataset. Talk to me if you need help brainstorming.

Longleaf (UNC Cluster Computing)

(This is largely borrowed from Mike Love's instructions from his own lab manual)

We have access to the Longleaf cluster at UNC for computing. While you may often find yourself able to do your work locally on your laptop or on a lab machine, you will sometimes need either the computing power or the security of the Longleaf cluster to do your work.

To work on Longleaf, you will need to:

  • Use OnDemand (with the VPN) for interactive work with RStudio or when you expect visual outputs.

  • Have some way of editing and running code on the cluster (RStudio for R; Notepad++, Emacs, or vim for most everything else).

  • Know how to submit jobs to the queue.

  • Think about version control.

OnDemand for Interactive Work

As of 2020, Research Computing at UNC offers a very nice solution for interactive work on the cluster, OnDemand, which makes the older approach of X11 forwarding with ESS unnecessary. For any data science application, first check whether it is supported in OnDemand, as this will be a much easier interface for most students. If you are off campus, you will need to connect via VPN first.

Submitting Jobs

WIP

Version control using git

Whether you are editing data analysis R scripts or working on a new method, you should be saving your code in git repositories, and typically also syncing them with a remote on BitBucket or GitHub.

You will have to set up SSH keys on the cluster to sync git repositories on the cluster with GitHub or BitBucket. You can follow the steps described on the git page.
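As a rough sketch of what setting up a key involves (the git page remains the reference), the commands below generate an ed25519 key pair and print the public half, which is the part you paste into your GitHub or BitBucket account settings. The scratch directory, the comment string, and the empty passphrase are assumptions chosen so the sketch runs without prompts and cannot overwrite an existing key; for real use you would write to ~/.ssh/id_ed25519 on the cluster and set a passphrase.

```shell
#!/bin/sh
set -eu

# Assumption: the key is written to a scratch directory so that running this
# sketch cannot overwrite a real key. For real use, point key_path at
# "$HOME/.ssh/id_ed25519" on the cluster instead.
key_dir=$(mktemp -d)
key_path="$key_dir/id_ed25519"

# Generate the key pair. -C attaches a comment (a hypothetical label here)
# that helps you recognize the key later. -N "" sets an empty passphrase
# only to keep the sketch non-interactive; use a real passphrase in practice.
ssh-keygen -q -t ed25519 -C "your_onyen@longleaf" -f "$key_path" -N ""

# The public half (.pub) is what gets registered with GitHub or BitBucket.
# The private half (no extension) never leaves the machine.
cat "$key_path.pub"

# Print the fingerprint, which can be compared with what the hosting
# service displays after the key is added.
ssh-keygen -l -f "$key_path.pub"
```

Once the public key is registered with GitHub, running ssh -T git@github.com from the cluster is a quick way to check that authentication works.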

In the end, the ideal setup is to have the same repo on your laptop and on the cluster, both synced with GitHub, and to use git pull to keep the code up to date in every location. You should commit and push your code daily to avoid losing work.
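The commit, push, and pull cycle described above can be tried out safely in a sandbox. In the sketch below, a local bare repository stands in for the GitHub remote and two clones stand in for the laptop and the cluster; the directory names, file name, and commit identity are all hypothetical. With a real remote, only the clone URL would differ.

```shell
#!/bin/sh
set -eu

# Assumption: a local bare repository plays the role of the GitHub/BitBucket
# remote, and two clones play the roles of the laptop and the cluster.
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/laptop"

# Identity used for the demo commits (hypothetical values).
git -C "$work/laptop" config user.name "Lab Member"
git -C "$work/laptop" config user.email "lab.member@example.org"

# On the laptop: write some code, commit it, and push it to the remote.
echo 'x <- rnorm(10)' > "$work/laptop/analysis.R"
git -C "$work/laptop" add analysis.R
git -C "$work/laptop" commit --quiet -m "Add analysis script"
git -C "$work/laptop" push --quiet -u origin HEAD

# On the cluster: clone the repository once.
git clone --quiet "$work/remote.git" "$work/cluster"

# Back on the laptop: the day's work ends with a commit and a push.
echo 'print(mean(x))' >> "$work/laptop/analysis.R"
git -C "$work/laptop" commit --quiet -am "Print the mean"
git -C "$work/laptop" push --quiet

# On the cluster: pull to fast-forward to the latest pushed commit.
git -C "$work/cluster" pull --quiet --ff-only

# Both locations now share the same history.
git -C "$work/cluster" log --oneline
```

Using --ff-only on the pull is a design choice for this sketch: it refuses to merge if the two copies have diverged, which is a useful signal that something was edited on the cluster without being pushed.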

Other
