Persistent RStudio on Google Compute Engine

A new Docker image is available that installs tools on top of the default rocker/tidyverse image to help persist files across Docker containers. It is one of the public Docker images built for use with googleComputeEngineR.

With this image, there are three ways to save files between Docker sessions:

  1. Write files to the host VM file system. This is done via the -v flag at Docker startup.
  2. Use Git to pull/push from Git repositories.
  3. Use googleCloudStorageR to save and read R working directories between machines, including your GitHub/SSH configurations.

Use whichever combination of the above best fits your workflow.

Using base VM file system

Files written to the host VM disk will disappear if you delete the VM, so if they are important it is recommended to write them somewhere else as well.

If relying on this, you will probably want to create a larger VM disk than the default 10GB using the disk_size_gb argument:

vm <- gce_vm("vm-larger-disk", 
             predefined_type = "n1-standard-1", 
             template = "rstudio", 
             username = "mark", password = "blah",
             disk_size_gb = 100)

GitHub

Generally Git is the best place for code under version control across many computers. The below details how you can pull code to your Docker container on each restart without needing to resupply your GitHub SSH keys.

The below assumes you have started a VM using the persistent-rstudio image, which includes SSH tools:

vm <- gce_vm("vm-ssh", 
             predefined_type = "n1-standard-1", 
             template = "rstudio", 
             username = "mark", password = "blah", 
             dynamic_image = "gcr.io/gcer-public/persistent-rstudio")

First time you launch a VM:

  1. Once the VM is launched, log in to RStudio Server at the IP provided by the script
  2. Go to Tools > Global Options > Git/SVN > Create RSA Key
  3. Click on “View public key” then add it to GitHub here: https://github.com/settings/keys
  4. Open the terminal in RStudio via Tools > Shell..., and configure your GitHub email and username:
git config --global user.email "your@githubemail.com"
git config --global user.name "GitHubUserName"
  5. Check it works: cat .gitconfig should show your GitHub details, ls .ssh should list your SSH keys, and ssh -T git@github.com should succeed.

A new GitHub project

Do the below for each new RStudio Project to download from GitHub:

  1. On GitHub, click the green Clone or download button and copy the Clone with SSH URI. Do not copy the browser URL: it won’t work
  2. Put the URI on RStudio Server via New Project > Version Control > Git > Repository URL
  3. The first time you connect, you may need to enter “yes” in the scary-looking prompt
  4. Make changes, push to GitHub via the RStudio Git pane

Restarting the VM/Docker

This configuration should now persist across Docker sessions e.g. you can stop/start the VM and still have GitHub configured.

  1. Stop the RStudio server via the Web UI or gce_vm_stop()
  2. Restart it via the Web UI or gce_vm_start()
  3. Log in to RStudio via the URL, then open the terminal and check that your earlier configuration is still there: cat .gitconfig should show your GitHub details, ls .ssh should show your SSH keys, and ssh -T git@github.com should work
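The file checks in step 3 can also be run from the R console. Below is a minimal sketch in base R, assuming the default locations ~/.gitconfig and ~/.ssh. The helper name check_git_setup is hypothetical and not part of any package.

```r
# Hypothetical helper: report whether Git/SSH configuration files are present.
# `home` defaults to the current user's home directory.
check_git_setup <- function(home = path.expand("~")) {
  ssh_dir <- file.path(home, ".ssh")
  list(
    gitconfig_present = file.exists(file.path(home, ".gitconfig")),
    # all.files = TRUE so dotfiles are listed; no.. = TRUE drops "." and ".."
    ssh_files = if (dir.exists(ssh_dir)) {
      list.files(ssh_dir, all.files = TRUE, no.. = TRUE)
    } else {
      character(0)
    }
  )
}

str(check_git_setup())
```

This only confirms that the files survived the restart. Running ssh -T git@github.com in the terminal is still needed to confirm that GitHub accepts the key.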

Using googleCloudStorageR

This can be combined with the above GitHub settings to persist the GitHub settings over VMs.

The authentication for the googleCloudStorageR backups reuses the credentials you used to launch the VM.

It is not intended as a replacement for Git - it only adds files if they are not present locally. I use it to copy projects over to more powerful VMs as required.

On local computer

  1. Create a Google Cloud Storage bucket to save your R sessions to. You can do this via the web UI or using googleCloudStorageR's gcs_create_bucket() function.

Choose a bucket region that is closest to you and your VM for best performance.

  2. Add that bucket to your .Renviron as the GCS_SESSION_BUCKET argument:

    GCS_SESSION_BUCKET=gcer-bucket-name

The .Renviron usually sits in your computer home directory, see ?Startup for details.

  3. Add the gcs_first() and gcs_last() functions to your .Rprofile file like so:
.First <- function(){

  cat("\n# Welcome Mark! Today is ", date(), "\n")
  cat("\n# Loading .Rprofile from", path.expand("~"))

  googleCloudStorageR::gcs_first()
}

.Last <- function(){
  # will only upload if a _gcssave.yaml in directory with bucketname
  googleCloudStorageR::gcs_last()
  message("\nGoodbye Mark at ", date(), "\n")
}
  4. Create an RStudio project
  5. Make R stuff
  6. Add a _gcssave.yaml file specifying the GCS bucket to save to.

It can carry various settings shown below:

## The GCS bucket to save/load R workspace from step 1
bucket: my-bucket

## set to FALSE if you don't want to load on R session startup
load_on_startup: FALSE

## on first load, whether to look for a different directory on GCS than present getwd()
loaddir:

## regex to only save these files to GCS
pattern:
  7. Exit the RStudio project. You should see a message similar to:

    Saving data to Google Cloud Storage:
    your-gcs-bucket
    2017-08-18 23:25:43 -- File size detected as 1.3 Mb

When you start up that project again you should see:

[Workspace loaded from: 
gs://your-gcs-bucket/Users/the-rproject-folder]
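The _gcssave.yaml file can also be written from the R console rather than by hand. Below is a minimal sketch using only base R. The bucket name is a placeholder and the helper write_gcssave is hypothetical, not a googleCloudStorageR function.

```r
# Hypothetical helper: write a minimal _gcssave.yaml into a project folder.
write_gcssave <- function(bucket, dir = getwd(), load_on_startup = TRUE) {
  path <- file.path(dir, "_gcssave.yaml")
  writeLines(c(
    "## The GCS bucket to save/load R workspace from",
    paste("bucket:", bucket),
    "## set to FALSE if you don't want to load on R session startup",
    paste("load_on_startup:", if (load_on_startup) "TRUE" else "FALSE")
  ), path)
  invisible(path)
}

# Example: write into a temporary folder and print the result
demo_dir <- tempfile("rproject")
dir.create(demo_dir)
cat(readLines(write_gcssave("gcer-bucket-name", demo_dir)), sep = "\n")
```

In a real project you would call write_gcssave("your-bucket") with the project as the working directory, then add loaddir or pattern entries by hand if needed.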

Summary

There are three files to configure:

  • .Renviron - environment arguments, including GCS_SESSION_BUCKET=gcer-bucket-name, which names the bucket where your session files are stored
  • .Rprofile - general R startup behaviour, which carries the googleCloudStorageR::gcs_first() and googleCloudStorageR::gcs_last() functions
  • _gcssave.yaml - per-folder settings that specify which files to save and where
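To confirm that R picks up the bucket name, the .Renviron entry can be checked with base R. Below is a minimal sketch that uses a temporary file as a stand-in for your real ~/.Renviron. The bucket name is a placeholder.

```r
# Write a throwaway .Renviron-style file (stand-in for ~/.Renviron)
renviron <- tempfile(fileext = ".Renviron")
writeLines("GCS_SESSION_BUCKET=gcer-bucket-name", renviron)

# readRenviron() loads the variables into the current session;
# R does this automatically for ~/.Renviron at startup
readRenviron(renviron)

bucket <- Sys.getenv("GCS_SESSION_BUCKET", unset = NA)
if (is.na(bucket)) {
  message("GCS_SESSION_BUCKET is not set")
} else {
  message("Session bucket: ", bucket)
}
```

Note that running this overrides any GCS_SESSION_BUCKET already set in the current session. In normal use, only the Sys.getenv() check is needed after restarting R.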

On cloud RStudio server

Now the R data is saved to GCS under the local folder name. We can load this data in an RStudio server cloud instance via:

  1. Launch the RStudio Server image gcr.io/gcer-public/persistent-rstudio that has appropriate libraries loaded.
vm <- gce_vm("mark-rstudio",
             template = "rstudio",
             username = "mark", password = 'mypassword',
             predefined_type = "n1-standard-2",
             dynamic_image = "gcr.io/gcer-public/persistent-rstudio")
  2. Add GCS_SESSION_BUCKET metadata, either via the web UI or via:
gce_set_metadata(list(GCS_SESSION_BUCKET = "your-session-bucket"), vm)
  3. Log in to RStudio Server and create an RStudio project

  4. As you did on your local machine, you need to create an .Rprofile so googleCloudStorageR can load and save data. For example:

.First <- function(){
  cat("\n# Welcome Ignacio! Today is ", date(), "\n")

  ## will look for download if GCS_SESSION_BUCKET env arg set
  googleCloudStorageR::gcs_first()
}


.Last <- function(){
  # will only upload if a _gcssave.yaml in directory with bucketname
  googleCloudStorageR::gcs_last()
  message("\nGoodbye Ignacio at ", date(), "\n")
}

message("\n*** Successfully loaded .Rprofile ***\n")
  5. Transfer the local RStudio project to this cloud VM by creating a _gcssave.yaml file at the root of the project with these entries:
bucket: your-gcs-bucket
loaddir: your-local-directory-name
  6. Close and re-open the RStudio project. Your local files should now load from GCS
  7. Do work, then exit the project. It will be saved to a new folder on GCS

Persisting GitHub with googleCloudStorageR

You can also use the above in conjunction with the GitHub setup to persist over VMs.

To do so, you need to:

  1. Keep the same RStudio login username
  2. Use the same bucket for GCS_SESSION_BUCKET or in the _gcssave.yaml
  3. Use this Dockerfile’s image - gcr.io/gcer-public/persistent-rstudio

The GitHub configurations saved in the .ssh folder and the .gitconfig file in your home directory will be backed up to Google Cloud Storage.

Saving GitHub configurations

  1. Add a _gcssave.yaml file to your home folder that will download/upload the configurations.
## The GCS bucket to save/load R workspace from
bucket: gcer-store-my-rstudio-files

## regex to only save these files to GCS
pattern: "id_rsa|.gitconfig"
  2. With no project open and your working directory set to your home folder (e.g. getwd() is /home/you), save the yaml file and quit the R session:
q(save = "no")

You should see a message saying it is saving the home folder. Upon restart, that folder will load from the bucket.
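The pattern entry is a regular expression. Assuming it is matched against file names the way base R's grepl() does (an assumption, since the matching inside googleCloudStorageR is not shown here), the sketch below shows which example file names the pattern above would select. The file names are illustrative only.

```r
pattern <- "id_rsa|.gitconfig"
files <- c(".ssh/id_rsa", ".ssh/id_rsa.pub", ".ssh/known_hosts",
           ".gitconfig", ".Rhistory")

# TRUE where the file name matches the regular expression
matched <- grepl(pattern, files)
data.frame(file = files, matched = matched)
```

Both the private and public key match because each name contains id_rsa. The unescaped . in the pattern matches any single character, which is harmless here but worth knowing when writing tighter patterns.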

Loading GitHub configurations

  1. Start another VM, with the same details as before:
vm2 <- gce_vm("mark-rstudio",
             template = "rstudio",
             username = "mark", password = 'mypassword',
             predefined_type = "n1-standard-2",
             dynamic_image = "gcr.io/gcer-public/persistent-rstudio")

gce_set_metadata(list(GCS_SESSION_BUCKET = "your-session-bucket"), vm2)
  2. Upon logging in, you should see a message saying it is loading data from GCS:
[Workspace loaded from: 
gs://your-session-bucket/home/you]
  3. You should now be able to run ssh -T git@github.com successfully
  4. Pull/push (private) GitHub repos via the steps outlined in the GitHub section above.

You can now delete VMs and start up new ones using RStudio Docker, and the GitHub configurations will persist so long as you follow the steps above.

Running RStudio Server on App Engine

Since the compute and the data are now separated, you can become fully cloud native by running RStudio Server on App Engine. This means you don’t need to worry about servers at all. Each time you visit your RStudio Server App Engine URL, a new instance will start, loading your data from your last session. When you finish, close the browser and the VM will tear itself down.

See more at GitHub: RStudio on App Engine.

Running on App Engine has many advantages, including:

  • No servers to set up, stop or start
  • Only pay while you are using the VM
  • Auto auth via the App Engine service account
  • Load balancing and OAuth2 login options

Details on how the above is working

This build includes the newest versions of googleCloudStorageR and googleComputeEngineR, which have had functions added to help with the workflow above.

Data is stored in Google Cloud Storage via googleCloudStorageR's gcs_first() and gcs_last() functions. This Docker build puts the functions into a custom .Rprofile file that saves a project's workspace data to its own bucket if the project folder has a _gcssave.yaml file, or if the directory matches one already saved.

The .yaml tells googleCloudStorageR which bucket to save the folder to. If it is not present, the environment argument GCS_SESSION_BUCKET is used instead, which is what happens on first load when no .yaml file exists yet.
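The lookup order described above can be sketched as follows. This is an illustration in base R, not googleCloudStorageR's actual code. The function name resolve_bucket and its argument are hypothetical.

```r
# Hypothetical illustration of the lookup order: a bucket named in
# _gcssave.yaml wins; otherwise fall back to GCS_SESSION_BUCKET.
resolve_bucket <- function(yaml_bucket = NULL) {
  if (!is.null(yaml_bucket) && nzchar(yaml_bucket)) {
    return(yaml_bucket)
  }
  env_bucket <- Sys.getenv("GCS_SESSION_BUCKET")
  if (nzchar(env_bucket)) {
    return(env_bucket)
  }
  NULL  # nothing configured
}
```

For example, resolve_bucket("my-bucket") returns the yaml value regardless of the environment, while resolve_bucket() returns whatever GCS_SESSION_BUCKET holds, or NULL if it is unset.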

Thus, you can save an RStudio project from your local computer, then launch an RStudio Server in the cloud with the loaddir: argument set to that directory name to load the files onto your cloud server. When you quit the R session, your work is saved to its own new folder on GCS. That folder is loaded automatically when you stop/start a Docker container running RStudio and create a project with the same name.

It will only download files that don’t already exist in your folder, so local changes won’t be overwritten. It is not Git; treat it more as a backup that will load if the files are not already present (such as when you relaunch a Docker container).
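The "only fill in what is missing" behaviour can be illustrated with base R file operations. This sketch copies between two local folders and is not how googleCloudStorageR is implemented. The function name restore_missing is hypothetical.

```r
# Hypothetical illustration: copy files from a backup folder into a
# working folder, skipping any file that already exists locally.
restore_missing <- function(backup_dir, local_dir) {
  files <- list.files(backup_dir, recursive = TRUE, all.files = TRUE)
  missing <- files[!file.exists(file.path(local_dir, files))]
  for (f in missing) {
    dest <- file.path(local_dir, f)
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    file.copy(file.path(backup_dir, f), dest, overwrite = FALSE)
  }
  invisible(missing)  # relative paths of the files that were restored
}
```

A file edited locally is left untouched even when the backup holds a different version, which is why this approach complements Git rather than replacing it.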

If you upload to GCS, make sure it is the directory and files you want. To stop backups, delete the GCS folder via gcs_delete_all().

Example _gcssave.yaml:

## The GCS bucket to save/load R workspace from
bucket: gcer-store-my-rstudio-files

## set to FALSE if you don't want to load on R session startup
load_on_startup: TRUE

## on first load and init, whether to look for a different directory on GCS than present getwd()
loaddir: /Users/mark/the/folder/on/local

## regex to only save these files to GCS
pattern:

An advantage of using R on a GCE instance is that you can reuse the authentication used to launch the VM for other cloud services, via googleAuthR::gar_gce_auth(), so you don’t need to supply your own auth file.

To use this, the VM needs to be supplied with a bucket name. Using a separate bucket means the same files can be transferred across Docker RStudio stop/starts and across VMs. The bucket name is set in the metadata of the instance running Docker, which gets copied over to an environment argument that R can see.