Downloading Datasets¶
Download NEMAR datasets using git-annex for efficient large file handling.
Public vs Private Datasets¶
- Public datasets (published) can be downloaded by anyone without authentication.
- Private datasets (pre-publication) can only be downloaded by the owner and designated collaborators, and only through the NEMAR CLI. Direct git-annex commands (git annex get, git annex pull) will not work for private data.
Once a dataset is published, it becomes publicly available to everyone.
Quick Download¶
# Download a public dataset (no account needed)
nemar dataset download nm000104
# Download a private dataset (requires login)
nemar auth login
nemar dataset download nm000115
This clones the dataset and downloads all data files from S3. The CLI shows real-time progress during download, including transfer speed, estimated time remaining, and per-file status.
Download Options¶
# Download to specific directory
nemar dataset download nm000104 -o ./datasets/
# Clone metadata only (skip large data files)
nemar dataset download nm000104 --no-data
# Parallel downloads for large datasets
nemar dataset download nm000104 -j 8
Clone vs Download¶
For large datasets, you may want to clone first and get files selectively:
# Clone metadata only
nemar dataset clone nm000104
# Get specific files later
cd nm000104
nemar dataset get sub-01/
# Get specific modality
nemar dataset get sub-01/eeg/
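When the same modality is needed for several subjects, the selective `nemar dataset get` calls above can be wrapped in a small loop. This is only a sketch: `fetch_modality` is a hypothetical helper name, the subject IDs are placeholders, and it assumes the current directory is the cloned dataset.

```shell
# Hypothetical helper (not part of the NEMAR CLI): fetch one modality
# for each subject ID given on the command line.
# Assumes the current directory is a dataset cloned with
# `nemar dataset clone`.
fetch_modality() {
  modality="$1"
  shift
  for sub in "$@"; do
    # Stop at the first failure so a missing subject is noticed early.
    nemar dataset get "${sub}/${modality}/" || return 1
  done
}

# Example (placeholder subject IDs):
#   fetch_modality eeg sub-01 sub-02
```

Stopping at the first failed transfer is a design choice for this sketch; a longer batch job might prefer to log the failure and continue.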
Download Progress¶
Downloads display real-time progress with:
- Transfer speed - current download rate (e.g., 2.4 MB/s)
- Estimated time remaining - based on sliding-window speed averaging
- File progress - tracks completed vs total files
- Per-file status - shows which file is currently downloading
The CLI automatically detects whether git-annex is installed. If available, it uses git-annex for efficient content-addressed downloads. Otherwise, it falls back to direct S3 downloads via pre-signed URLs.
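To see in advance which path a download is likely to take on a given machine, the detection described above can be approximated by hand. The sketch below only mirrors the documented behaviour; it is not the CLI's actual code, and `annex_backend` and the two labels it prints are hypothetical names.

```shell
# Hypothetical helper: report which transfer path this machine would allow.
# "git-annex" and "direct-s3" are labels chosen for this sketch only.
annex_backend() {
  if command -v git-annex >/dev/null 2>&1; then
    echo "git-annex"
  else
    echo "direct-s3"
  fi
}

annex_backend
```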
Prerequisites¶
The download command requires the GitHub CLI (gh) to be installed and authenticated. The CLI checks for required tools before starting and provides platform-specific install guidance if anything is missing.
For private datasets, you must also be logged in with nemar auth login.
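The CLI performs its own checks, but in a script it can help to verify the GitHub CLI requirement before starting a long job. A minimal sketch, assuming `gh auth status` exits with a non-zero status when no account is authenticated; `preflight` is a hypothetical function name, not a NEMAR command.

```shell
# Hypothetical pre-flight check: confirm gh is installed and authenticated.
preflight() {
  if ! command -v gh >/dev/null 2>&1; then
    echo "gh not found: install the GitHub CLI first" >&2
    return 1
  fi
  if ! gh auth status >/dev/null 2>&1; then
    echo "gh is installed but not authenticated: run 'gh auth login'" >&2
    return 1
  fi
  echo "ok"
}

# Example:
#   preflight && nemar dataset download nm000104
```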
How It Works¶
NEMAR uses git-annex for efficient data management:
- Metadata stored in Git (GitHub)
- Large files stored in S3 (retrieved on demand)
- Versioning tracked automatically
This means:
- Quick initial clone (just metadata)
- Download only files you need
- Automatic deduplication
- Version history preserved
Working with Downloaded Data¶
Check What's Available¶
# See what files exist but aren't downloaded
git annex find --not --in here
# See what's downloaded
git annex find --in here
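For a quick overview, the two listings above can be reduced to counts. This sketch assumes git-annex is installed and that it is run from inside the dataset directory; `annex_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: count annexed files whose content is present
# locally versus files that are tracked but not yet downloaded.
annex_summary() {
  # tr strips the padding some wc implementations add to their output.
  present=$(git annex find --in here | wc -l | tr -d ' ')
  missing=$(git annex find --not --in here | wc -l | tr -d ' ')
  echo "present=${present} missing=${missing}"
}

# Example (run inside the dataset directory):
#   annex_summary
```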
Free Space¶
Drop files you no longer need locally:
# Drop specific files (keeps remote copies)
nemar dataset drop sub-01/eeg/sub-01_task-rest_eeg.edf
# Drop all local copies
nemar dataset drop
Troubleshooting¶
"Permission denied" Error¶
Ensure you're logged in:
nemar auth status --refresh
Slow Download¶
Data files are served from S3, so large datasets can take a while. Check your network
connection and try increasing parallelism, for example with -j 8.
"Content not available" Error¶
The file may have been removed or moved. Try pulling the latest changes:
git pull
nemar dataset get <file>
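The two recovery steps can be chained so that the fetch is only retried when the pull succeeds. This is a small sketch; `refresh_and_get` is a hypothetical name, and the path passed to it stands in for whichever file reported the error.

```shell
# Hypothetical helper: update the clone, then retry fetching one path.
# Run it from inside the dataset directory.
refresh_and_get() {
  # If the pull fails (for example on a network error), do not retry the get.
  git pull || return 1
  nemar dataset get "$1"
}

# Example:
#   refresh_and_get <file>
```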