persistent-rstudio.Rmd
A new Docker image is available which installs tools on top of the default rocker/tidyverse to help persist files over Docker containers. This image is part of the public Docker images built on top of googleComputeEngineR.
With this image, there are three ways to save files between Docker sessions:
Stop and restart the same VM, so that files persist on the VM’s disk.
Use Git to push and pull code to and from GitHub.
Use googleCloudStorageR to save and read R working directories between machines, including your GitHub/SSH configurations.
A combination of the above should be used for what best fits your workflow.
Files saved on the VM’s disk survive stopping and restarting the VM, but they will disappear if you delete the VM, so it is recommended to also write important files somewhere else.
If relying on this, you will probably want to create a VM disk larger than the default 10 GB by using the disk_size_gb argument when you create the VM.
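As a sketch of the disk_size_gb argument in use, the call below wraps gce_vm() in a small helper. The helper name launch_big_disk_vm, the VM name and the 100 GB figure are illustrative assumptions, not values from the package, and the helper assumes googleComputeEngineR is installed and authenticated before it is called.

```r
# Hypothetical helper (not part of googleComputeEngineR): launch the
# persistent-rstudio image with a boot disk larger than the default.
launch_big_disk_vm <- function(name, username, password, disk_size_gb = 100) {
  googleComputeEngineR::gce_vm(
    name,
    template = "rstudio",
    username = username,
    password = password,
    predefined_type = "n1-standard-1",
    dynamic_image = "gcr.io/gcer-public/persistent-rstudio",
    disk_size_gb = disk_size_gb  # illustrative size; pick what your files need
  )
}

# Usage (needs Google Cloud credentials, so it is left commented out):
# vm <- launch_big_disk_vm("vm-big-disk", "mark", "blah")
```

Defining the launch as a function keeps the disk size visible as a named default without contacting Google Cloud until you actually call it.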
Generally, Git is the best place for code under version control across many computers. The steps below show how to pull code into your Docker container on each restart without needing to resupply your GitHub SSH keys.
See also these references:
The below assumes you have started a VM using the persistent-rstudio image, which includes SSH tools:
library(googleComputeEngineR)

# launch an RStudio Server VM from the persistent-rstudio image
vm <- gce_vm("vm-ssh",
             predefined_type = "n1-standard-1",
             template = "rstudio",
             username = "mark", password = "blah",
             dynamic_image = "gcr.io/gcer-public/persistent-rstudio")
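Log in to RStudio Server and generate an SSH key pair via Tools > Global Options > Git/SVN > Create RSA Key.
Add the public key to your GitHub account.
Open a terminal via Tools > Shell..., and configure your GitHub email and username:
git config --global user.email "your@githubemail.com"
git config --global user.name "GitHubUserName"
You should now be able to see the configuration via cat .gitconfig and the SSH keys via ls .ssh, and ssh -T git@github.com should succeed.
Do the below for each new RStudio Project to download from GitHub: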
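On the repository’s GitHub page, click the green Clone or download button and copy the Clone with SSH URI. Do not copy the browser URL; it won’t work.
In RStudio, create the project via New Project > Version Control > Git > Repository URL, pasting in the URI you copied.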
This configuration should now persist across Docker sessions e.g. you can stop/start the VM and still have GitHub configured.
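Stop the VM with gce_vm_stop().
Start it again with gce_vm_start().
Log back in and check that the configuration is still visible via cat .gitconfig, that the SSH keys are listed by ls .ssh, and that ssh -T git@github.com works.
Backups via googleCloudStorageR can be combined with the above GitHub settings to persist the GitHub settings over VMs.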
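The check that the Git settings survived a stop/start can also be run from the R console. The following is a minimal sketch in base R; git_settings_present is a hypothetical helper name, and it only looks for a .gitconfig file and a .ssh folder in the home directory rather than testing the connection to GitHub.

```r
# Hypothetical helper: report whether the Git/SSH settings are present
# in a home directory (defaults to the current user's home).
git_settings_present <- function(home = path.expand("~")) {
  c(
    gitconfig = file.exists(file.path(home, ".gitconfig")),
    ssh_dir   = dir.exists(file.path(home, ".ssh"))
  )
}

# Returns a named logical vector, one entry per setting checked.
git_settings_present()
```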
The authentication for the googleCloudStorageR backups reuses the credentials you used to launch the VM.
It is not intended as a replacement for Git - it only adds files if they are not present locally. I use it to copy projects over to more powerful VMs as required.
Create a bucket, for example with googleCloudStorageR’s gcs_create_bucket() function. Choose a bucket region that is closest to you and your VM for best performance.
Save the name of the bucket to your .Renviron as the GCS_SESSION_BUCKET argument:
GCS_SESSION_BUCKET=gcer-bucket-name
The .Renviron usually sits in your computer’s home directory; see ?Startup for details.
Add the gcs_first() and gcs_last() functions to your .Rprofile file like so:
.First <- function(){
  cat("\n# Welcome Mark! Today is ", date(), "\n")
  cat("\n# Loading .Rprofile from", path.expand("~"))
  googleCloudStorageR::gcs_first()
}
.Last <- function(){
  # will only upload if a _gcssave.yaml in directory with bucketname
  googleCloudStorageR::gcs_last()
  message("\nGoodbye Mark at ", date(), "\n")
}
For each project folder you want to save, add a _gcssave.yaml file specifying the GCS bucket to save to.
It can carry various settings shown below:
## The GCS bucket to save/load R workspace from step 1
bucket: my-bucket
## set to FALSE if you don't want to load on R session startup
load_on_startup: FALSE
## on first load, whether to look for a different directory on GCS than present getwd()
loaddir:
## regex to only save these files to GCS
pattern:
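Before relying on the save and load hooks, it can help to confirm that the pieces described above are in place. The following is a minimal sketch in base R; gcs_setup_status is a hypothetical helper name, not a googleCloudStorageR function, and it only checks that GCS_SESSION_BUCKET is set and that a _gcssave.yaml exists in the given folder.

```r
# Hypothetical helper: check the local pieces the setup relies on.
gcs_setup_status <- function(project_dir = getwd()) {
  c(
    # TRUE if the environment argument from .Renviron is visible to R
    bucket_env_set = nzchar(Sys.getenv("GCS_SESSION_BUCKET")),
    # TRUE if the folder carries its own _gcssave.yaml
    yaml_present   = file.exists(file.path(project_dir, "_gcssave.yaml"))
  )
}

gcs_setup_status()
```

If bucket_env_set is FALSE after editing .Renviron, restart the R session, since .Renviron is only read at startup.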
When you quit the R session you should see output like:
Saving data to Google Cloud Storage:
your-gcs-bucket
2017-08-18 23:25:43 -- File size detected as 1.3 Mb
When you startup that project again you should see:
There are three files to configure:
.Renviron - environment arguments, including GCS_SESSION_BUCKET=gcer-bucket-name, which names the bucket where your session files are looked for
.Rprofile - general R startup behaviour, which carries the googleCloudStorageR::gcs_last() and googleCloudStorageR::gcs_first() functions
_gcssave.yaml - per-folder settings that specify which files to save in which folder
Now the R data is saved to GCS under the local folder name. We can load this data in an RStudio server cloud instance via:
Launch a VM using the image gcr.io/gcer-public/persistent-rstudio, which has the appropriate libraries loaded:
# launch RStudio Server from the persistent-rstudio image
vm <- gce_vm("mark-rstudio",
             template = "rstudio",
             username = "mark", password = 'mypassword',
             predefined_type = "n1-standard-2",
             dynamic_image = "gcr.io/gcer-public/persistent-rstudio")
Log in to RStudio Server and create an RStudio project.
As you did on your local machine, you need to create an .Rprofile so googleCloudStorageR can save and load data. For example:
.First <- function(){
cat("\n# Welcome Ignacio! Today is ", date(), "\n")
## will look for download if GCS_SESSION_BUCKET env arg set
googleCloudStorageR::gcs_first()
}
.Last <- function(){
# will only upload if a _gcssave.yaml in directory with bucketname
googleCloudStorageR::gcs_last()
message("\nGoodbye Ignacio at ", date(), "\n")
}
message("\n*** Successfully loaded .Rprofile ***\n")
Create a _gcssave.yaml file at the root of the project that sets bucket and, to load the files saved from your local machine, loaddir.
You can also use the above in conjunction with the GitHub setup to persist over VMs.
To do so, you need to:
Set the bucket, either in GCS_SESSION_BUCKET or in the _gcssave.yaml.
Use the gcr.io/gcer-public/persistent-rstudio image.
The GitHub configuration saved in the .ssh folder and the .gitconfig file in your home directory will be backed up to Google Cloud Storage.
Add a _gcssave.yaml file to your home folder that will download/upload the configurations:
## The GCS bucket to save/load R workspace from
bucket: gcer-store-my-rstudio-files
## regex to only save these files to GCS
pattern: "id_rsa|.gitconfig"
With your home folder as the working directory (getwd() is /home/you), save the yaml file and quit the R session. You should see a message saying it’s saving the home folder. Upon restart, that folder will load from the bucket.
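To see which files a pattern such as the one in the home-folder _gcssave.yaml would pick out, you can try it against a list of names in base R. This is only a sketch: the file names are made up, and it assumes the pattern is applied to file names as an ordinary regular expression, which is what the yaml comment suggests but is not checked against the googleCloudStorageR source here.

```r
# The pattern from the home-folder _gcssave.yaml, used as a regular expression.
pattern <- "id_rsa|.gitconfig"

# Hypothetical file names that might sit in or under a home directory.
files <- c("id_rsa", "id_rsa.pub", "known_hosts", ".gitconfig", ".Rhistory")

# Keep only the names the pattern matches.
selected <- files[grepl(pattern, files)]
print(selected)
```

Note that an unescaped dot in a regular expression matches any character, so ".gitconfig" would also match a name like "xgitconfig"; writing "\\.gitconfig" in R restricts it to a literal dot.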
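Launch a new VM from the same image and set the session bucket in its metadata:
# launch a fresh VM from the persistent-rstudio image
vm2 <- gce_vm("mark-rstudio",
              template = "rstudio",
              username = "mark", password = 'mypassword',
              predefined_type = "n1-standard-2",
              dynamic_image = "gcr.io/gcer-public/persistent-rstudio")
# tell the VM which bucket holds the saved session files
gce_set_metadata(list(GCS_SESSION_BUCKET = "your-session-bucket"), vm2)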
Log in to RStudio Server on the new VM; you should be able to run ssh -T git@github.com successfully.
You can now delete VMs and start up new ones using RStudio Docker, and the GitHub configurations will persist so long as you follow the steps above.
Since the compute and the data are now separated, you can now become fully cloud native by running RStudio Server on App Engine. This means you don’t need to worry about servers at all. Each time you visit your RStudio Server App Engine URL, a new instance will start, loading your data from your last session. When you finish, close the browser and the VM will tear itself down.
See more at GitHub: RStudio on App Engine.
Running on App Engine has many advantages, including:
This build includes the newest versions of googleCloudStorageR and googleComputeEngineR, which have had functions added to help with the workflow above.
The functions can store data to Google’s dedicated store via googleCloudStorageR’s gcs_first and gcs_last functions. This Docker build puts the functions into a custom .Rprofile file that will save a project’s workspace data to its own bucket if the project folder contains a _gcssave.yaml file, or if the directory matches one already saved.
The .yaml file tells googleCloudStorageR which bucket to save the folder to. If it is not present, the GCS_SESSION_BUCKET environment argument is used instead; this is what happens on first load, when no .yaml file is present.
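The lookup order described here can be illustrated with a short base R sketch. resolve_bucket is a hypothetical helper written for this example; the real logic lives inside googleCloudStorageR and may differ, and the line-based parsing below is a simplification that only handles a plain, unquoted "bucket: name" entry.

```r
# Hypothetical helper: pick the bucket from _gcssave.yaml if the file has a
# "bucket:" entry, otherwise fall back to GCS_SESSION_BUCKET.
resolve_bucket <- function(dir = getwd()) {
  yaml_file <- file.path(dir, "_gcssave.yaml")
  if (file.exists(yaml_file)) {
    lines <- readLines(yaml_file, warn = FALSE)
    hit <- grep("^bucket:", lines, value = TRUE)
    if (length(hit) > 0) {
      bucket <- trimws(sub("^bucket:", "", hit[1]))
      if (nzchar(bucket)) return(bucket)
    }
  }
  # No usable yaml entry: use the environment argument, or NULL if unset.
  env_bucket <- Sys.getenv("GCS_SESSION_BUCKET")
  if (nzchar(env_bucket)) env_bucket else NULL
}
```

Calling resolve_bucket() in a project folder returns the bucket name as a string, or NULL when neither source provides one.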
Thus, you can save an RStudio project from your local computer, then launch an RStudio server in the cloud with the loaddir: argument set to that directory name to load the files onto your cloud server. When you quit the R session there, it saves your work to its own new folder. If you later stop/start a Docker container running RStudio and create a project with the same name, that folder loads automatically.
It will only download files that don’t already exist in your folder, so local changes won’t be overwritten. It is not Git; treat it more as a backup that will load if the files are not already present (such as when you relaunch a Docker container).
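The "only download what is missing" behaviour can be pictured as a set difference between the names stored remotely and the names already on disk. The sketch below is base R only; files_to_download is a hypothetical helper, and the remote file names are passed in as a plain character vector rather than fetched from a bucket.

```r
# Hypothetical helper: given the file names stored remotely and a local
# folder, return the names that would be downloaded (those missing locally).
files_to_download <- function(remote_files, local_dir) {
  local_files <- list.files(local_dir, all.files = TRUE, recursive = TRUE)
  setdiff(remote_files, local_files)
}

# Example with a temporary folder that already holds one of two files.
demo_dir <- tempfile("project")
dir.create(demo_dir)
file.create(file.path(demo_dir, "analysis.R"))
files_to_download(c("analysis.R", "data.csv"), demo_dir)
```

Because existing names are filtered out before any download, a locally edited analysis.R in this example would be left untouched.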
If you upload to GCS, make sure to load the directory and files you want. To stop backups, delete the GCS folder via gcs_delete_all().
Example _gcssave.yaml:
## The GCS bucket to save/load R workspace from
bucket: gcer-store-my-rstudio-files
## set to FALSE if you don't want to load on R session startup
load_on_startup: TRUE
## on first load and init, whether to look for a different directory on GCS than present getwd()
loaddir: /Users/mark/the/folder/on/local
## regex to only save these files to GCS
pattern:
An advantage of using R on a GCE instance is that you can reuse the authentication used to launch the VM for other cloud services via googleAuthR::gar_gce_auth(), so you don’t need to supply your own auth file.
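As a hedged illustration of reusing the VM’s credentials, the helper below authenticates with googleAuthR::gar_gce_auth() and then lists the objects in a bucket. list_bucket_on_gce is a hypothetical name, the default of reading GCS_SESSION_BUCKET is an assumption made for this example, and the call only works when run on a GCE instance whose service account is allowed to read the bucket.

```r
# Hypothetical helper: authenticate with the VM's own credentials, then list
# the objects in a bucket. Only meaningful when run on a GCE instance.
list_bucket_on_gce <- function(bucket = Sys.getenv("GCS_SESSION_BUCKET")) {
  # Fail early if no bucket name was supplied or found in the environment.
  stopifnot(nzchar(bucket))
  googleAuthR::gar_gce_auth()
  googleCloudStorageR::gcs_list_objects(bucket)
}

# Usage on the VM (left commented out because it needs GCE credentials):
# list_bucket_on_gce("your-session-bucket")
```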
To use this, the VM needs to be supplied with a bucket name via an environment argument. Using a separate bucket means the same files can be transferred across Docker RStudio stop/starts and VMs. The bucket name is set in the metadata of the instance running Docker, and is copied over to an environment argument that R can see.