Downloading Datasets¶
Download NEMAR datasets using git-annex for efficient large file handling.
Quick Download¶
# Download dataset (includes all data files)
nemar dataset download nm000104
This clones the dataset and downloads all data files from S3.
Download Options¶
# Download to specific directory
nemar dataset download nm000104 -o ./datasets/
# Clone metadata only (skip large data files)
nemar dataset download nm000104 --no-data
# Parallel downloads for large datasets
nemar dataset download nm000104 -j 8
Resume an Interrupted Download¶
If a download is interrupted, rerun with --resume instead of deleting the
partial clone:
nemar dataset download nm000104 --resume
--resume validates that the existing directory is a git-annex clone of the
same dataset, refuses to proceed when the working tree is dirty, and refuses
when the local DatasetVersion has fallen behind the remote (use --update
instead). It then re-runs git annex get so only missing files are pulled.
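For unattended downloads over an unreliable connection, the resume step can be wrapped in a retry loop. The following is a minimal sketch, not part of the nemar CLI: the retry helper, the attempt count, and the delay are all illustrative assumptions.

```shell
# Hypothetical helper (not provided by nemar): run a command until it
# succeeds, up to MAX_ATTEMPTS tries, sleeping DELAY_SECONDS between tries.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

A possible invocation, once a first download attempt has left a partial clone behind, would be `retry 5 30 nemar dataset download nm000104 --resume`. Because --resume refuses to run on a dirty working tree or an outdated local version, retrying only helps with transient network failures.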
Update to a Newer Version¶
When upstream publishes a new version, pull only the diff:
nemar dataset download nm000104 --update # pulls just the changed files
nemar dataset download nm000104 --update --prune # also drops orphaned annex objects
--update reads the local and remote DatasetVersion, fast-forwards to the
remote HEAD, and runs git annex get only on the annex keys that changed
between the two manifests. For a 5 GB dataset with a 20 MB metadata bump, this
typically transfers ~20 MB instead of the whole dataset. Non-fast-forward
merges (which arise when you have local commits) are refused; push those
commits first with nemar dataset update (the PR workflow).
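To see in advance whether a clone has local commits that would block a fast-forward, plain git can answer the question. This is a sketch under stated assumptions: the can_fast_forward name is hypothetical, the current branch is assumed to track an upstream branch (as it does after a normal clone), and the check is not necessarily the one --update performs internally.

```shell
# Hypothetical pre-check, run from inside the dataset clone.
# Exit status 0: every local commit is already in the upstream branch,
#                so a fast-forward is possible.
# Exit status 1: there are local commits the upstream does not have.
# Exit status 2: the fetch itself failed.
can_fast_forward() {
  git fetch --quiet || return 2
  git merge-base --is-ancestor HEAD '@{upstream}'
}
```

Inside the clone, `can_fast_forward && nemar dataset download nm000104 --update` would then only attempt the update when no local commits stand in the way.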
BIDS Entity Filters¶
Pull only the parts of the dataset you need. The clone retains the full
git-annex tree (so the result is still a structurally valid BIDS dataset),
but only matching files have content locally. You can run git annex get <path>
later to pull more.
# Specific subjects only (auto-prefix; "01" == "sub-01")
nemar dataset download nm000104 --subjects sub-01,02
# A single task across all subjects
nemar dataset download nm000104 --tasks rest
# Subjects, tasks, and datatypes intersected
nemar dataset download nm000104 \
--subjects 01,02 --tasks rest --datatypes eeg
# Runs (unpadded 1-9 match both run-1 and run-01)
nemar dataset download nm000104 --runs 1,2
# Sessions
nemar dataset download nm000104 --sessions ses-pre,post
# Raw glob pass-through
nemar dataset download nm000104 --include 'sub-01/eeg/*.edf,*.json'
nemar dataset download nm000104 --exclude 'derivatives/**,sourcedata/**'
| Flag | Comma-list values | Maps to |
|---|---|---|
| --subjects | sub-01,02 | sub-01/**, sub-02/** |
| --sessions | ses-pre,post | **/ses-pre/**, **/ses-post/** |
| --tasks | rest,nback | **/*_task-rest_*, **/*_task-nback_* |
| --runs | 1,2 | **/*_run-1_*, **/*_run-01_*, ... |
| --datatypes | eeg,emg | **/eeg/**, **/emg/** |
| --include | raw glob list | --include pass-through |
| --exclude | raw glob list | --exclude pass-through |
Filters compose with --update (only changed files inside the filter scope
are pulled). They cannot be combined with --no-data, since filters imply
data download.
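To make the --subjects row of the table concrete, the sketch below reproduces the auto-prefix expansion in plain shell. It only illustrates the documented mapping; the subjects_to_globs function is hypothetical and is not how the CLI is necessarily implemented.

```shell
# Hypothetical illustration of the --subjects mapping: split a comma-separated
# list, add the "sub-" prefix where it is missing, and append "/**".
subjects_to_globs() {
  local rest="$1" item out=""
  while [ -n "$rest" ]; do
    item="${rest%%,*}"            # text before the first comma
    if [ "$item" = "$rest" ]; then
      rest=""                     # no comma left: this was the last item
    else
      rest="${rest#*,}"           # drop the item and its comma
    fi
    case "$item" in
      sub-*) ;;                   # already prefixed
      *) item="sub-$item" ;;      # auto-prefix, so "02" becomes "sub-02"
    esac
    out="${out:+$out, }$item/**"
  done
  printf '%s\n' "$out"
}
```

Calling `subjects_to_globs 'sub-01,02'` prints `sub-01/**, sub-02/**`, matching the table entry.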
Clone vs Download¶
For large datasets, you may want to clone first and get files selectively:
# Clone metadata only
nemar dataset clone nm000104
# Get specific files later
cd nm000104
nemar dataset get sub-01/
# Get specific modality
nemar dataset get sub-01/eeg/
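When the same datatype is needed for several subjects, the get calls above can be looped. The following is a sketch only: the get_datatype helper is hypothetical, and accepting both "01" and "sub-01" is a convenience of this example rather than documented behaviour of nemar dataset get.

```shell
# Hypothetical helper, run from inside an existing clone.
# Usage: get_datatype DATATYPE SUBJECT [SUBJECT...]
get_datatype() {
  local datatype="$1" label
  shift
  for label in "$@"; do
    # Strip an optional "sub-" prefix, then rebuild the path as
    # sub-<label>/<datatype>/ and stop at the first failed fetch.
    nemar dataset get "sub-${label#sub-}/${datatype}/" || return 1
  done
}
```

For example, `get_datatype eeg 01 02` would run nemar dataset get on sub-01/eeg/ and then on sub-02/eeg/.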
How It Works¶
NEMAR uses git-annex for efficient data management:
- Metadata stored in Git (GitHub)
- Large files stored in S3 (retrieved on demand)
- Versioning tracked automatically
This means:
- Quick initial clone (just metadata)
- Download only files you need
- Automatic deduplication
- Version history preserved
Working with Downloaded Data¶
Check What's Available¶
# See what files exist but aren't downloaded
git annex find --not --in here
# See what's downloaded
git annex find --in here
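The two queries above can be combined into a one-line summary of how much of the dataset is present locally. This is a small sketch; the annex_counts name is hypothetical and the function simply counts the lines each git annex find call prints.

```shell
# Hypothetical helper, run from inside the dataset clone: count annexed files
# whose content is present locally and those whose content is still remote.
annex_counts() {
  local present missing
  # tr strips the padding some wc implementations add around the number.
  present="$(git annex find --in here | wc -l | tr -d ' ')"
  missing="$(git annex find --not --in here | wc -l | tr -d ' ')"
  echo "present=$present missing=$missing"
}
```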
Free Space¶
Drop files you no longer need locally:
# Drop specific files (keeps remote copies)
nemar dataset drop sub-01/eeg/sub-01_task-rest_eeg.edf
# Drop all local copies
nemar dataset drop
Troubleshooting¶
"Permission denied" Error¶
Ensure you're logged in:
nemar auth status --refresh
Slow Download¶
Data files are downloaded from S3, so large datasets can take a while. Check
your connection and try increasing parallelism, for example with -j 8.
"Content not available" Error¶
The file may have been removed or moved. Try pulling the latest changes:
git pull
nemar dataset get <file>