
Part 3 — Working with datasets

Datasets are Nerd’s way to manage files. A dataset is a collection of files, like a folder on a computer, that’s stored on the Nerdalize cloud. Datasets can be used as input for a job and, when an application creates output files, these can be automatically stored in a new dataset.

Video transcoding is a good example: the original videos are provided as an input dataset, the transcoder compresses the videos and the compressed videos are stored in a new dataset.

To show you how datasets work, we’ll continue using our CO2 calculator.

Using an input dataset

Before you can use files as input for a job, you need to upload them as a dataset. You’ll have to provide the location of the files on your computer and it’s helpful to give the dataset a name, so you can find it easily.

Our CO2 calculator allows you to provide a CSV file with a list of flights. You can download it and add your own flights if you’d like. Then upload it by executing:

$ nerd dataset upload --name=flights path-to-data-folder 
Archiving (Step 1/2): 78 B / 78 B [===============] 100.00% 0s
Uploading (Step 2/2): 2.05 KB / 1.02 KB [=========] 200.00% 0s
Uploaded dataset: 'flights'
To run a job with a dataset, use: 'nerd job run'
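If you upload datasets from a script, it can help to check the folder first. The following is a minimal sketch, not part of the nerd CLI: the upload_dataset function and the DRY_RUN variable are hypothetical helpers, and the only nerd command it uses is the nerd dataset upload call shown above. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: check a local folder before uploading it as a dataset.
# upload_dataset and DRY_RUN are hypothetical helpers, not part of the nerd CLI.
upload_dataset() {
  name="$1"   # name to give the dataset
  dir="$2"    # local folder holding the input files

  # Refuse to continue if the folder does not exist.
  if [ ! -d "$dir" ]; then
    echo "error: '$dir' is not a folder" >&2
    return 1
  fi

  # Refuse to continue if the folder is empty.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "error: '$dir' is empty" >&2
    return 1
  fi

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default in this sketch: print the command instead of running it.
    echo "nerd dataset upload --name=$name $dir"
  else
    nerd dataset upload --name="$name" "$dir"
  fi
}
```

Calling upload_dataset flights path-to-data-folder prints the upload command; set DRY_RUN=0 to actually run it.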

You can use any dataset as input for a job by providing the dataset’s name and the location where the application expects its input data.

Our CO2 calculator expects the input in /input. To use your dataset as input for the CO2 calculator, run:

$ nerd job run \
  --input=flights:/input \
  --name=flights-co2-calc \
  nerdalize/co2-calculator
Submitted job: 'flights-co2-calc'
To see whats happening, use: 'nerd job list'

You can also provide a location on your computer instead of a dataset name in the --input option. This creates a new dataset from that location and uses it as input for the job in one go.
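As a rough illustration, the command for that shortcut might be put together as below. This sketch assumes the local path simply takes the place of the dataset name before the colon; the folder ./flights-data and the job name local-input-co2-calc are hypothetical. It only prints the command, so you can check it before running it yourself.

```shell
#!/bin/sh
# Hypothetical local folder holding the input files.
data_dir="./flights-data"

# Assumed form: the local path replaces the dataset name before the colon.
cmd="nerd job run --input=${data_dir}:/input --name=local-input-co2-calc nerdalize/co2-calculator"

# Print the command rather than running it.
echo "$cmd"
```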

Storing output files in a dataset

Most applications, including the CO2 calculator, also generate output files. To have these files automatically stored in a dataset, provide the location in which the application creates them. It’s helpful to also give the new dataset a name, so you’ll be able to find it easily.

Our CO2 calculator stores its output in /output. To reuse the flights dataset as input and store the output in a new dataset, run:

$ nerd job run \
  --input=flights:/input \
  --output=co2-result:/output \
  --name=output-co2-calc \
  nerdalize/co2-calculator
Submitted job: 'output-co2-calc'
To see whats happening, use: 'nerd job list'

Downloading datasets

You can download any of your datasets. You’ll only have to provide its name and the location on your computer that you want to download it to.

To download the output of your job with custom input data, run:

$ nerd dataset download co2-result ~/my-first-nerd-output
Downloading (Step 1/2): 25.09 KB / 25.09 KB [=====] 100.00% 0s
Unarchiving (Step 2/2): 25.09 KB / 25.09 KB [=====] 100.00% 0s
Downloaded dataset: 'co2-result'
To delete the dataset from the cloud, use: `nerd dataset delete co2-result`

That’s how to use datasets with your jobs

You’ve used the nerd dataset upload, nerd job run and nerd dataset download commands to run a job with input and output data. Perfect! These three commands cover most workflows that read and write files. Let’s continue to the final part to find out where and how to get images, so you can start running your own jobs!
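The steps from this part can be collected in a small script. The sketch below is one possible arrangement, not an official workflow: run, submit_job, fetch_output and DRY_RUN are hypothetical helpers, while the nerd commands and names are the ones used above. Submitting and downloading are kept separate because the job has to finish before its output dataset is complete.

```shell
#!/bin/sh
# Sketch of the workflow from this part.
# run, submit_job, fetch_output and DRY_RUN are hypothetical helpers,
# not part of the nerd CLI.

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the input folder and submit the job.
submit_job() {
  data_dir="$1"   # local folder with the input files
  run nerd dataset upload --name=flights "$data_dir"
  run nerd job run --input=flights:/input --output=co2-result:/output \
    --name=output-co2-calc nerdalize/co2-calculator
}

# Download the output dataset. Call this only after the job has finished;
# check its progress with 'nerd job list'.
fetch_output() {
  out_dir="$1"    # local folder to download the results into
  run nerd dataset download co2-result "$out_dir"
}
```

With the default DRY_RUN=1, submit_job path-to-data-folder and fetch_output ~/my-first-nerd-output only print the commands; set DRY_RUN=0 to execute them.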