The workflow below lets you run any of your R scripts on a dedicated VM on a schedule. It takes advantage of `containeRit` to bundle up all the dependencies your R script may need.
In brief, you need to:

1. Use `containeRit` to create a Dockerfile that will run your R script with all its dependencies
2. Create a scheduler VM via `gce_vm_scheduler` - keep this live
3. Schedule the Docker image to run via `gce_schedule_docker`
The example below uses a demo script that comes with the package, which you can use as a template for something you want to schedule:
```r
library(googleAuthR)          ## authentication
library(googleCloudStorageR)  ## Google Cloud Storage

## set authentication details for non-cloud services
# options(googleAuthR.scopes.selected = "XXX",
#         googleAuthR.client_id = "",
#         googleAuthR.client_secret = "")

## download or do something
something <- tryCatch({
    gcs_get_object("schedule/test.csv",
                   bucket = "mark-edmondson-public-files")
  }, error = function(ex) {
    NULL
  })

something_else <- data.frame(X1 = 1,
                             time = Sys.time(),
                             blah = paste(sample(letters, 10, replace = TRUE), collapse = ""))
something <- rbind(something, something_else)

## authenticate on GCE for Google Cloud services
googleAuthR::gar_gce_auth()

tmp <- tempfile(fileext = ".csv")
on.exit(unlink(tmp))
write.csv(something, file = tmp, row.names = FALSE)

## upload something
gcs_upload(tmp,
           bucket = "mark-edmondson-public-files",
           name = "schedule/test.csv")
```
Note that it is best not to save data onto the scheduler VM itself - instead, load and save data to external storage such as Google Cloud Storage.
The above script can then be scheduled as shown below. We first create a `Dockerfile` that holds all of the script's dependencies. This is the magic of `containeRit`:
```r
if(!require(containeRit)){
  devtools::install_github("MarkEdmondson1234/containerit") # use my fork until fix merged
  library(containeRit)
}

script <- system.file("schedulescripts", "schedule.R", package = "googleComputeEngineR")

## put the "schedule.R" script in the working directory
file.copy(script, getwd())

## it will run the script whilst making the Dockerfile
container <- dockerfile("schedule.R",
                        copy = "script_dir",
                        cmd = CMD_Rscript("schedule.R"),
                        soft = TRUE)
write(container, file = "Dockerfile")
```
Now you have the `Dockerfile`, it can be used to create Docker images. There are several options here, including the `docker_build` function to build on another VM or locally, but the easiest is to use a code repository such as GitHub with Google Container Registry's Build Triggers service. This builds the Docker image for you on every GitHub push, and makes it available either publicly or privately to you.

The example here uses a public Container Registry set up for `googleComputeEngineR` via Build Triggers, which holds the above script under the name `demo-docker-scheduler`.
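If you prefer not to use Build Triggers, a minimal sketch of building the image yourself with `docker_build` could look like the below; it assumes `vm` is an existing Docker-enabled instance and that the folder holding the `Dockerfile` is your working directory, and the image name is illustrative:

```r
## a sketch of building the image on a running VM instead of via Build Triggers
## (assumes `vm` is a running Docker-enabled instance; image name is illustrative)
docker_build(vm,
             dockerfolder = getwd(),
             new_image = "demo-docker-scheduler",
             wait = TRUE)
```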
After the image has built, you can schedule it to be called via a cron task using the new dedicated functions `gce_vm_scheduler` and `gce_schedule_docker`:
```r
## create a VM to run the schedule
vm <- gce_vm_scheduler("my_scheduler")

## set up any SSH settings if not using defaults
vm <- gce_ssh_setup(vm, username = "mark")

## get the name of the just-built Docker image that runs your script
docker_tag <- gce_tag_container("demo-docker-scheduler", project = "gcer-public")

## schedule the docker_tag to run every day at 04:53 AM
## schedule uses crontab syntax
gce_schedule_docker(docker_tag, schedule = "53 4 * * *", vm = vm)
```
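To check the cron entry was written, you can SSH in and list the crontab; a small sketch, assuming the entry was installed for your SSH login user:

```r
## inspect the installed cron entries on the scheduler VM
## (assumes the entry was installed for the SSH login user)
gce_ssh(vm, "crontab -l")
```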
The Docker image is downloaded the first time it runs, then runs on the schedule (04:53 AM in the above example).

The `Dockerfile` used need not be an R-related one; any Docker image can be scheduled, as in the sketch below.
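For example, a non-R image can be passed in exactly the same way; here the image name is hypothetical, and its default CMD is assumed to perform the task:

```r
## schedule any Docker image whose default CMD does the work -
## the image name here is hypothetical
gce_schedule_docker("gcr.io/my-project/my-etl-job",
                    schedule = "0 6 * * *",
                    vm = vm)
```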
For bigger jobs or more separation, you can launch entire VMs dedicated to your scheduled task. This lets you tailor each VM individually. Cost: ~$4.09 a month for the master + ~$1.52 a month per slave (a daily 30-minute cron job on a 7.5GB RAM instance).
These have been set up via public Google Container Registry Build Triggers, tied to `googleComputeEngineR`'s repository on GitHub. You can see the `Dockerfile`s used at `system.file("dockerfiles", "gceScheduler", package = "googleComputeEngineR")`. Each time the GitHub repository is pushed to, these Docker images are rebuilt, allowing for easy changes and versioning.
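You can list those template `Dockerfile`s from your local installation:

```r
## inspect the template Dockerfiles shipped with the package
list.files(system.file("dockerfiles", "gceScheduler", package = "googleComputeEngineR"),
           recursive = TRUE)
```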
Now that the templates are saved to Container Registry, make a 'Master' VM that is small and will stay on 24/7 to run cron. This costs ~$4.09 a month. Give it a strong password.
```r
library(googleComputeEngineR)

username <- "mark"

## make the cron-master
master <- gce_vm("cron-master",
                 predefined_type = "g1-small",
                 template = "rstudio",
                 dynamic_image = gce_tag_container("gce-master-scheduler", project = "gcer-public"),
                 username = username,
                 password = "mark1234")

## set up SSH from master to slaves with username 'master'
gce_ssh(master, "ssh-keygen -t rsa -f ~/.ssh/google_compute_engine -C master -N ''")

## copy SSH keys into the Docker container
## (probably more secure than keeping keys in the Docker image itself)
docker_cmd(master, cmd = "cp", args = sprintf("~/.ssh/ rstudio:/home/%s/.ssh/", username))
docker_cmd(master, cmd = "exec", args = sprintf("rstudio chown -R %s /home/%s/.ssh/", username, username))
```
Create the larger slave instance, which can then be stopped, ready for the master to start as needed. These will cost in total ~$1.52 a month if they run every day for 30 minutes. Here it is called `slave-1`, but a more descriptive name helps, such as a client name.
```r
slave <- gce_vm("slave-1",
                predefined_type = "n1-standard-2",
                template = "rstudio",
                dynamic_image = gce_tag_container("gce-slave-scheduler", project = "gcer-public"),
                username = "mark",
                password = "mark1234")

## wait for it to all install (e.g. RStudio login screen available)

## stop it, ready to be started by the master VM
gce_vm_stop(slave)
```
If you want to use the latest version of the built Docker image, you need to recreate the instance - keeping instances pinned to older images gives you a form of versioning.
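A sketch of that refresh, assuming the same settings used when the slave was first created:

```r
## recreate the slave so it pulls the latest built image
## (a sketch - wait for the deletion to complete before recreating)
gce_vm_delete(slave)

slave <- gce_vm("slave-1",
                predefined_type = "n1-standard-2",
                template = "rstudio",
                dynamic_image = gce_tag_container("gce-slave-scheduler", project = "gcer-public"),
                username = "mark",
                password = "mark1234")
gce_vm_stop(slave)
```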
Create the R script you want to schedule. Make sure it is self-sufficient: it should authenticate, do its work, then upload the results to a safe repository such as Google Cloud Storage.

This script will in turn be uploaded to Google Cloud Storage itself, so the slave instance can call it via a handy `googleCloudStorageR` function, `gcs_source()`, that runs a script sourced from a Cloud Storage file.

The example script below authenticates with Google Cloud Storage, downloads a `ga.httr-oauth` file that carries the Google Analytics authentication, runs the download, then re-authenticates with Google Cloud Storage to upload the results. Modify it for your own expensive operation.
```r
## download.R - called from slave VM
library(googleComputeEngineR)  ## needed for the gce_global_* defaults below
library(googleCloudStorageR)
library(googleAnalyticsR)

## set defaults
gce_global_project("my-project")
gce_global_zone("europe-west1-b")
gcs_global_bucket("your-gcs-bucket")

## gcs can authenticate via GCE auth keys
googleAuthR::gar_gce_auth()

## use GCS to download the auth key (that you have previously uploaded)
gcs_get_object("ga.httr-oauth", saveToDisk = "ga.httr-oauth")

auth_token <- readRDS("ga.httr-oauth")
options(googleAuthR.scopes.selected = c("https://www.googleapis.com/auth/analytics",
                                        "https://www.googleapis.com/auth/analytics.readonly"),
        googleAuthR.httr_oauth_cache = "ga.httr-oauth")
googleAuthR::gar_auth(auth_token)

## fetch Google Analytics data
gadata <- google_analytics_4(81416156,
                             date_range = c(Sys.Date() - 8, Sys.Date() - 1),
                             dimensions = c("medium", "source", "landingPagePath"),
                             metrics = "sessions",
                             max = -1)

## back to Cloud Storage
googleAuthR::gar_gce_auth()
gcs_upload(gadata, name = "uploads/gadata_81416156.csv")
gcs_upload("ga.httr-oauth")

message("Upload complete ", Sys.time())
```
Create the script that will run on the master VM. This will start the slave instance, run your scheduled script, then stop the slave instance again.
```r
## intended to be run on a small instance via cron
## use this script to launch other VMs with more expensive tasks
library(googleComputeEngineR)
library(googleCloudStorageR)

gce_global_project("my-project")
gce_global_zone("europe-west1-b")
gcs_global_bucket("your-gcs-bucket")

## auth to the same project we're on
googleAuthR::gar_gce_auth()

## launch the premade VM
vm <- gce_vm("slave-1")

## set SSH to use the 'master' username as configured before
vm <- gce_ssh_setup(vm, username = "master", ssh_overwrite = TRUE)

## run the script on the VM, which will source it from GCS
runme <- "Rscript -e \"googleAuthR::gar_gce_auth();googleCloudStorageR::gcs_source('download.R', bucket = 'your-gcs-bucket')\""
out <- docker_cmd(vm,
                  cmd = "exec",
                  args = c("rstudio", runme),
                  wait = TRUE)

## once finished, stop the VM
gce_vm_stop(vm)
```
Log in to the master VM and save the script there, then schedule it via the `cronR` RStudio addin.
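If you prefer the console to the addin, a sketch of the equivalent `cronR` calls, assuming the master script above is saved as `master.R` in the home directory:

```r
## schedule master.R daily at 04:53 via cronR from the console
## (a sketch - assumes the master script is saved at ~/master.R)
library(cronR)
cmd <- cron_rscript("~/master.R")
cron_add(cmd, frequency = "daily", at = "04:53", id = "slave-scheduler")
```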