Queries

Queries are useful for exploring the available imaging data. The general hierarchy of The Cancer Imaging Archive (TCIA) is:

Collection -> PatientID -> StudyInstanceUID -> SeriesInstanceUID -> SOPInstanceUID

To download images, the SeriesInstanceUID and/or SOPInstanceUID must be known. The query functions are meant to help identify the relevant unique identifiers (UIDs).

Detailed information is available in TCIA's user guide, which lists the available query endpoints and the type of information returned by each query.

Note

As mentioned in the Formats section, each query returns either a DataFrame or a Dictionary Array. This section uses the DataFrame output exclusively. A dictionary array can always be obtained from any of these functions by passing format = "json" as an input argument.

All collections

The names of all available collections on TCIA are obtained by:

julia> tcia_collections()
106×1 DataFrame
│ Row │ Collection              │
│     │ String                  │
├─────┼─────────────────────────┤
│ 1   │ TCGA-GBM                │
│ 2   │ LIDC-IDRI               │
│ 3   │ BREAST-DIAGNOSIS        │
│ 4   │ PROSTATE-MRI            │
│ 5   │ PROSTATE-DIAGNOSIS      │
│ 6   │ NaF PROSTATE            │
│ 7   │ CT COLONOGRAPHY         │
⋮
│ 99  │ Pelvic-Reference-Data   │
│ 100 │ HEAD-NECK-RADIOMICS-HN1 │
│ 101 │ PDMR-292921-168-R       │
│ 102 │ GBM-DSC-MRI-DRO         │
│ 103 │ Prostate-MRI-US-Biopsy  │
│ 104 │ DRO-Toolkit             │
│ 105 │ COVID-19                │
│ 106 │ COVID-19-AR             │

Imaging modalities

The imaging modalities in a specific collection and/or anatomy are listed by:

julia> tcia_modalities(collection = "TCGA-KIRP")
3×1 DataFrame
│ Row │ Modality │
│     │ String   │
├─────┼──────────┤
│ 1   │ MR       │
│ 2   │ CT       │
│ 3   │ PT       │

julia> tcia_modalities(bodypart = "BRAIN")
5×1 DataFrame
│ Row │ Modality │
│     │ String   │
├─────┼──────────┤
│ 1   │ MR       │
│ 2   │ CT       │
│ 3   │ PT       │
│ 4   │ DX       │
│ 5   │ SEG      │

julia> tcia_modalities(collection = "CPTAC-HNSCC", bodypart = "HEAD")
2×1 DataFrame
│ Row │ Modality │
│     │ String   │
├─────┼──────────┤
│ 1   │ CT       │
│ 2   │ MR       │

Note

Capitalization matters when passing in arguments: bodypart = "BRAIN" works, but bodypart = "brain" returns an empty object. In some cases, however, several spellings are valid. For example, bodypart = "Kidney" and bodypart = "KIDNEY" both return valid (but different!) results. So although fully capitalized body part names work most of the time, double-check whether alternative spellings exist when using the bodypart argument (see the next section for names).
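One way to guard against this is to compare names with case and punctuation ignored. The sketch below is illustrative only: normalize_name and matching_bodyparts are hypothetical helpers, not part of the package, and the list of names is copied from the tcia_bodyparts(collection = "CPTAC-HNSCC") output shown in the next section.

```julia
# Hypothetical helpers (not part of the package).
# Reduce a name to its letters only, upper-cased, so that
# "Head-Neck" and "HEADNECK" compare equal.
normalize_name(s::AbstractString) = uppercase(filter(isletter, s))

# Return every spelling in `names` that matches `query` after normalization.
function matching_bodyparts(names, query::AbstractString)
    target = normalize_name(query)
    return [n for n in skipmissing(names) if normalize_name(n) == target]
end

# Names copied from the tcia_bodyparts(collection = "CPTAC-HNSCC") output
bodypart_names = [missing, "Head-Neck", "NECK", "HEAD", "HEADNECK",
                  "CHEST", "ABDOMEN", "CHEST_TO_PELVIS"]

spellings = matching_bodyparts(bodypart_names, "headneck")
```

Each spelling found this way can then be passed to the bodypart argument in turn, so that no variant is missed.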

Anatomy/body parts

The body parts scanned in a specific collection and/or modality are listed by:

julia> tcia_bodyparts(collection = "CPTAC-HNSCC")
8×1 DataFrame
│ Row │ BodyPartExamined │
│     │ String?          │
├─────┼──────────────────┤
│ 1   │ missing          │
│ 2   │ Head-Neck        │
│ 3   │ NECK             │
│ 4   │ HEAD             │
│ 5   │ HEADNECK         │
│ 6   │ CHEST            │
│ 7   │ ABDOMEN          │
│ 8   │ CHEST_TO_PELVIS  │

julia> tcia_bodyparts(modality = "CT")
65×1 DataFrame
│ Row │ BodyPartExamined │
│     │ String?          │
├─────┼──────────────────┤
│ 1   │ BRAIN            │
│ 2   │ COLON            │
│ 3   │ CHEST            │
│ 4   │ HEADNECK         │
│ 5   │ LIVER            │
│ 6   │ OVARY            │
│ 7   │ STOMACH          │
⋮
│ 58  │ ABDOMENPELVIS    │
│ 59  │ THORAXABD        │
│ 60  │ CT 3PHASE REN    │
│ 61  │ WO INTER         │
│ 62  │ ABD PELV         │
│ 63  │ ABD PEL          │
│ 64  │ CAP              │
│ 65  │ WHOLEBODY        │

julia> tcia_bodyparts(collection = "CPTAC-SAR", modality = "MR")
3×1 DataFrame
│ Row │ BodyPartExamined │
│     │ String?          │
├─────┼──────────────────┤
│ 1   │ missing          │
│ 2   │ EXTREMITY        │
│ 3   │ Pelvis           │

julia> tcia_bodyparts(collection = "CPTAC-SAR", modality = "CT")
5×1 DataFrame
│ Row │ BodyPartExamined │
│     │ String?          │
├─────┼──────────────────┤
│ 1   │ missing          │
│ 2   │ ABDOMEN          │
│ 3   │ EXTREMITY        │
│ 4   │ CHEST            │
│ 5   │ WHOLEBODY        │

Manufacturers

A list of scanner manufacturers for a specific collection/modality/anatomy is obtained by:

julia> tcia_manufacturers(collection = "TCGA-KICH")
3×1 DataFrame
│ Row │ Manufacturer       │
│     │ String             │
├─────┼────────────────────┤
│ 1   │ GE MEDICAL SYSTEMS │
│ 2   │ SIEMENS            │
│ 3   │ TOSHIBA            │

julia> tcia_manufacturers(modality = "CT")
39×1 DataFrame
│ Row │ Manufacturer                   │
│     │ Union{Missing, String}         │
├─────┼────────────────────────────────┤
│ 1   │ missing                        │
│ 2   │ GE MEDICAL SYSTEMS             │
│ 3   │ SIEMENS                        │
│ 4   │ TOSHIBA                        │
│ 5   │ Philips                        │
│ 6   │ Vital Images, Inc              │
│ 7   │ Posda RTOG Converter           │
⋮
│ 32  │ 004                            │
│ 33  │ 001                w\\/.419501 │
│ 34  │ 003                            │
│ 35  │ 003                 .419501    │
│ 36  │ 004     313619.2.55.3.419501   │
│ 37  │ 002     es scanned.ration      │
│ 38  │ 005                            │
│ 39  │ 005                         9  │

julia> tcia_manufacturers(bodypart = "BREAST")
10×1 DataFrame
│ Row │ Manufacturer                │
│     │ Union{Missing, String}      │
├─────┼─────────────────────────────┤
│ 1   │ GE MEDICAL SYSTEMS          │
│ 2   │ missing                     │
│ 3   │ Philips Medical Systems     │
│ 4   │ LORAD                       │
│ 5   │ Lorad, A Hologic Company    │
│ 6   │ SIEMENS                     │
│ 7   │ Confirma Inc.               │
│ 8   │ Siemens                     │
│ 9   │ GE MEDICAL SYSTEMS, NUCLEAR │
│ 10  │ VICTRE                      │

Note that the same manufacturer can appear under different names, e.g. Philips/Philips Medical Systems and SIEMENS/Siemens.
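When counting or filtering by vendor, it may help to group such variants first. The following is a rough sketch rather than a feature of the package: vendor_key and group_vendors are hypothetical helpers, and grouping on the first word is a crude heuristic that could merge unrelated vendors sharing a first word. The names are copied from the tcia_manufacturers(bodypart = "BREAST") output above.

```julia
# Hypothetical helpers (not part of the package).
# Crude grouping key: the first word of the name, upper-cased,
# with trailing punctuation removed.
vendor_key(name::AbstractString) = uppercase(rstrip(first(split(name)), [',', '.']))

# Collect all spellings that share the same key, skipping missing entries.
function group_vendors(names)
    groups = Dict{String,Vector{String}}()
    for name in skipmissing(names)
        push!(get!(groups, vendor_key(name), String[]), name)
    end
    return groups
end

# Names copied from the tcia_manufacturers(bodypart = "BREAST") output
manufacturers = ["GE MEDICAL SYSTEMS", missing, "Philips Medical Systems", "LORAD",
                 "Lorad, A Hologic Company", "SIEMENS", "Confirma Inc.", "Siemens",
                 "GE MEDICAL SYSTEMS, NUCLEAR", "VICTRE"]

groups = group_vendors(manufacturers)
siemens_spellings = groups["SIEMENS"]   # both spellings of Siemens share one key
```

Every spelling in a group would still need to be queried separately, since the manufacturer argument is matched against the stored name.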

Patients

The patients in a given collection are listed by:

julia> tcia_patients(collection = "TCGA-SARC")
5×4 DataFrame
│ Row │ PatientID    │ PatientName  │ PatientSex │ Collection │
│     │ String       │ String       │ String     │ String     │
├─────┼──────────────┼──────────────┼────────────┼────────────┤
│ 1   │ TCGA-QQ-A5V2 │ TCGA-QQ-A5V2 │ M          │ TCGA-SARC  │
│ 2   │ TCGA-QQ-A5VC │ TCGA-QQ-A5VC │ F          │ TCGA-SARC  │
│ 3   │ TCGA-QQ-A8VF │ TCGA-QQ-A8VF │ M          │ TCGA-SARC  │
│ 4   │ TCGA-QQ-A8VH │ TCGA-QQ-A8VH │ F          │ TCGA-SARC  │
│ 5   │ TCGA-QQ-A8VG │ TCGA-QQ-A8VG │ M          │ TCGA-SARC  │

Patients for specific modality

To list the patients for which a specific modality was used, a slightly different function is used:

julia> tcia_patients_by_modality(collection = "TCGA-SARC", modality = "CT")
4×3 DataFrame
│ Row │ PatientID    │ Collection │ Modality │
│     │ String       │ String     │ String   │
├─────┼──────────────┼────────────┼──────────┤
│ 1   │ TCGA-QQ-A8VG │ TCGA-SARC  │ CT       │
│ 2   │ TCGA-QQ-A5V2 │ TCGA-SARC  │ CT       │
│ 3   │ TCGA-QQ-A5VC │ TCGA-SARC  │ CT       │
│ 4   │ TCGA-QQ-A8VF │ TCGA-SARC  │ CT       │

julia> tcia_patients_by_modality(collection = "TCGA-SARC", modality = "MR")
2×3 DataFrame
│ Row │ PatientID    │ Collection │ Modality │
│     │ String       │ String     │ String   │
├─────┼──────────────┼────────────┼──────────┤
│ 1   │ TCGA-QQ-A8VF │ TCGA-SARC  │ MR       │
│ 2   │ TCGA-QQ-A8VH │ TCGA-SARC  │ MR       │

Note

Although the functionality of tcia_patients_by_modality() could be combined into the tcia_patients() function, they use a different query endpoint so the two functions were given different names to keep that difference explicit.
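Since each call filters on a single modality, finding patients imaged with more than one modality takes a second step. A minimal sketch using plain vectors is shown below; the IDs are copied from the two outputs above, and with live results they would presumably come from the PatientID column of each returned DataFrame.

```julia
# PatientID values copied from the two tcia_patients_by_modality outputs above
ct_patients = ["TCGA-QQ-A8VG", "TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF"]
mr_patients = ["TCGA-QQ-A8VF", "TCGA-QQ-A8VH"]

# Patients that have both CT and MR data
both = intersect(ct_patients, mr_patients)

# Patients that have CT data but no MR data
ct_only = setdiff(ct_patients, mr_patients)
```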

Patients added after specific date

In large collections, it can be useful to query patients that were added after a date specified as YYYY-MM-DD. This is accomplished by:

julia> tcia_newpatients(collection = "TCGA-GBM", date = "2015-01-01")
104×2 DataFrame
│ Row │ PatientID    │ Collection │
│     │ String       │ String     │
├─────┼──────────────┼────────────┤
│ 1   │ TCGA-06-1806 │ TCGA-GBM   │
│ 2   │ TCGA-02-0060 │ TCGA-GBM   │
│ 3   │ TCGA-02-0006 │ TCGA-GBM   │
│ 4   │ TCGA-02-0009 │ TCGA-GBM   │
│ 5   │ TCGA-02-0011 │ TCGA-GBM   │
│ 6   │ TCGA-02-0027 │ TCGA-GBM   │
│ 7   │ TCGA-02-0033 │ TCGA-GBM   │
⋮
│ 97  │ TCGA-76-6282 │ TCGA-GBM   │
│ 98  │ TCGA-76-6285 │ TCGA-GBM   │
│ 99  │ TCGA-76-6656 │ TCGA-GBM   │
│ 100 │ TCGA-76-6657 │ TCGA-GBM   │
│ 101 │ TCGA-76-6661 │ TCGA-GBM   │
│ 102 │ TCGA-76-6662 │ TCGA-GBM   │
│ 103 │ TCGA-76-6663 │ TCGA-GBM   │
│ 104 │ TCGA-76-6664 │ TCGA-GBM   │
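The date string does not have to be typed by hand. As a small sketch, the Dates standard library can produce the YYYY-MM-DD form, either from a fixed date or relative to today; the variable names here are illustrative only.

```julia
using Dates

# Format a fixed Date as the YYYY-MM-DD string expected by the date argument
cutoff = Dates.format(Date(2015, 1, 1), "yyyy-mm-dd")

# A relative cutoff, e.g. six months before today
recent_cutoff = Dates.format(today() - Month(6), "yyyy-mm-dd")
```

The resulting string can then be passed as date = cutoff in place of a literal.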

Patient studies

A list of visits/studies for a given collection/patient is obtained by:

julia> tcia_studies(collection = "TCGA-THCA")
7×9 DataFrame. Omitted printing of 5 columns
│ Row │ Collection │ PatientID    │ PatientName  │ PatientSex │
│     │ String     │ String       │ String       │ String     │
├─────┼────────────┼──────────────┼──────────────┼────────────┤
│ 1   │ TCGA-THCA  │ TCGA-DE-A4MD │ TCGA-DE-A4MD │ M          │
│ 2   │ TCGA-THCA  │ TCGA-DE-A4MA │ TCGA-DE-A4MA │ F          │
│ 3   │ TCGA-THCA  │ TCGA-DE-A4MA │ TCGA-DE-A4MA │ F          │
│ 4   │ TCGA-THCA  │ TCGA-DE-A4MC │ TCGA-DE-A4MC │ F          │
│ 5   │ TCGA-THCA  │ TCGA-DE-A4MB │ TCGA-DE-A4MB │ F          │
│ 6   │ TCGA-THCA  │ TCGA-E3-A3DZ │ TCGA-E3-A3DZ │ F          │
│ 7   │ TCGA-THCA  │ TCGA-E3-A3E5 │ TCGA-E3-A3E5 │ M          │

julia> tcia_studies(patient = "TCGA-QQ-A8VF")
2×9 DataFrame. Omitted printing of 5 columns
│ Row │ Collection │ PatientID    │ PatientName  │ PatientSex │
│     │ String     │ String       │ String       │ String     │
├─────┼────────────┼──────────────┼──────────────┼────────────┤
│ 1   │ TCGA-SARC  │ TCGA-QQ-A8VF │ TCGA-QQ-A8VF │ M          │
│ 2   │ TCGA-SARC  │ TCGA-QQ-A8VF │ TCGA-QQ-A8VF │ M          │

If the unique identifier (UID) of a study is known (a.k.a. StudyInstanceUID), it can also be used:

julia> tcia_studies(study = "1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806")
1×9 DataFrame. Omitted printing of 5 columns
│ Row │ Collection │ PatientID    │ PatientName  │ PatientSex │
│     │ String     │ String       │ String       │ String     │
├─────┼────────────┼──────────────┼──────────────┼────────────┤
│ 1   │ TCGA-SARC  │ TCGA-QQ-A8VF │ TCGA-QQ-A8VF │ M          │

Patient studies added after specific date

A list of visits/studies that were added after some date, formatted as YYYY-MM-DD, can be obtained by:

julia> tcia_newstudies(collection="TCGA-GBM", date="2015-01-01")
106×3 DataFrame. Omitted printing of 2 columns
│ Row │ StudyInstanceUID                                                 │
│     │ String                                                           │
├─────┼──────────────────────────────────────────────────────────────────┤
│ 1   │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.308657922884262137750855956361 │
│ 2   │ 1.3.6.1.4.1.14519.5.2.1.1706.4001.247522006211308616726493960307 │
│ 3   │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.207566576057862441493246578379 │
│ 4   │ 1.3.6.1.4.1.14519.5.2.1.1706.4001.149500105036523046215258942545 │
│ 5   │ 1.3.6.1.4.1.14519.5.2.1.1706.4001.743358002952086773602945013452 │
│ 6   │ 1.3.6.1.4.1.14519.5.2.1.1706.4001.338073323505507625300877831709 │
│ 7   │ 1.3.6.1.4.1.14519.5.2.1.1706.4001.190188151913002985587952372782 │
⋮
│ 99  │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.623989292006918600441736922866 │
│ 100 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.440169796153998480809455332999 │
│ 101 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.313245752188924211777085602901 │
│ 102 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.114621594146207945121756272697 │
│ 103 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.102058737511198476066014834840 │
│ 104 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.313762558732585076631143086043 │
│ 105 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.461523921338830081291431565499 │
│ 106 │ 1.3.6.1.4.1.14519.5.2.1.1188.4001.280508857811965887839758381790 │

Imaging series

Each patient study consists of one or more imaging series, which can be obtained by:

julia> tcia_series(collection = "TCGA-THCA")
28×16 DataFrame. Omitted printing of 15 columns
│ Row │ PatientID    │
│     │ String       │
├─────┼──────────────┤
│ 1   │ TCGA-DE-A4MA │
│ 2   │ TCGA-DE-A4MA │
│ 3   │ TCGA-DE-A4MD │
│ 4   │ TCGA-DE-A4MD │
│ 5   │ TCGA-DE-A4MC │
│ 6   │ TCGA-DE-A4MC │
│ 7   │ TCGA-DE-A4MC │
⋮
│ 21  │ TCGA-DE-A4MA │
│ 22  │ TCGA-DE-A4MA │
│ 23  │ TCGA-DE-A4MA │
│ 24  │ TCGA-DE-A4MA │
│ 25  │ TCGA-DE-A4MA │
│ 26  │ TCGA-DE-A4MA │
│ 27  │ TCGA-E3-A3E5 │
│ 28  │ TCGA-E3-A3E5 │

julia> tcia_series(patient = "TCGA-QQ-A8VF")
15×16 DataFrame. Omitted printing of 15 columns
│ Row │ PatientID    │
│     │ String       │
├─────┼──────────────┤
│ 1   │ TCGA-QQ-A8VF │
│ 2   │ TCGA-QQ-A8VF │
│ 3   │ TCGA-QQ-A8VF │
│ 4   │ TCGA-QQ-A8VF │
│ 5   │ TCGA-QQ-A8VF │
│ 6   │ TCGA-QQ-A8VF │
│ 7   │ TCGA-QQ-A8VF │
│ 8   │ TCGA-QQ-A8VF │
│ 9   │ TCGA-QQ-A8VF │
│ 10  │ TCGA-QQ-A8VF │
│ 11  │ TCGA-QQ-A8VF │
│ 12  │ TCGA-QQ-A8VF │
│ 13  │ TCGA-QQ-A8VF │
│ 14  │ TCGA-QQ-A8VF │
│ 15  │ TCGA-QQ-A8VF │

julia> tcia_series(study = "1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806")
6×16 DataFrame. Omitted printing of 15 columns
│ Row │ PatientID    │
│     │ String       │
├─────┼──────────────┤
│ 1   │ TCGA-QQ-A8VF │
│ 2   │ TCGA-QQ-A8VF │
│ 3   │ TCGA-QQ-A8VF │
│ 4   │ TCGA-QQ-A8VF │
│ 5   │ TCGA-QQ-A8VF │
│ 6   │ TCGA-QQ-A8VF │

julia> tcia_series(modality = "CT", manufacturer = "TOSHIBA")
1214×16 DataFrame. Omitted printing of 15 columns
│ Row  │ PatientID            │
│      │ String               │
├──────┼──────────────────────┤
│ 1    │ LIDC-IDRI-0334       │
│ 2    │ LIDC-IDRI-0354       │
│ 3    │ LIDC-IDRI-0359       │
│ 4    │ LIDC-IDRI-0365       │
│ 5    │ LIDC-IDRI-0368       │
│ 6    │ LIDC-IDRI-0378       │
│ 7    │ LIDC-IDRI-0395       │
⋮
│ 1207 │ COVID-19-AR-16445168 │
│ 1208 │ COVID-19-AR-16445168 │
│ 1209 │ COVID-19-AR-16445168 │
│ 1210 │ COVID-19-AR-16445168 │
│ 1211 │ COVID-19-AR-16445168 │
│ 1212 │ COVID-19-AR-16445168 │
│ 1213 │ COVID-19-AR-16445168 │
│ 1214 │ COVID-19-AR-16445168 │

julia> tcia_series(bodypart = "EXTREMITY")
621×16 DataFrame. Omitted printing of 15 columns
│ Row │ PatientID │
│     │ String    │
├─────┼───────────┤
│ 1   │ STS_010   │
│ 2   │ STS_010   │
│ 3   │ STS_012   │
│ 4   │ STS_012   │
│ 5   │ STS_045   │
│ 6   │ STS_045   │
│ 7   │ STS_005   │
⋮
│ 614 │ C3L-02846 │
│ 615 │ C3L-02846 │
│ 616 │ C3L-02846 │
│ 617 │ C3L-02846 │
│ 618 │ C3L-02846 │
│ 619 │ C3L-02846 │
│ 620 │ C3L-02846 │
│ 621 │ C3N-00875 │

The importance of this query is hinted at by the smorgasbord of parameters it accepts: it returns the SeriesInstanceUID, which is needed to download images. Although the above examples only show PatientID, the query returns more columns, which are omitted because of limited screen space. The complete list of columns is:

julia> series_dataframe = tcia_series(patient = "TCGA-QQ-A8VF");

julia> names(series_dataframe)
16-element Array{String,1}:
 "PatientID"
 "StudyInstanceUID"
 "SeriesInstanceUID"
 "Modality"
 "ProtocolName"
 "SeriesDate"
 "SeriesDescription"
 "BodyPartExamined"
 "SeriesNumber"
 "AnnotationsFlag"
 "Collection"
 "Manufacturer"
 "ManufacturerModelName"
 "SoftwareVersions"
 "Visibility"
 "ImageCount"

Note

The entire table could have been printed by:

show(series_dataframe, allrows = true, allcols = true)

Warning

Passing format = "json" will result in one fewer column. This is because the AnnotationsFlag field is returned for CSV output but not for JSON.
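Code that handles both formats may therefore want to read such fields defensively. The sketch below uses a mock dictionary array rather than a live query. It assumes each entry is a Dict keyed by column name, and the UID strings are placeholders, not real identifiers.

```julia
# Mock dictionary array standing in for tcia_series(...; format = "json").
# Assumption: each entry is a Dict keyed by column name.
# The SeriesInstanceUID values are placeholders, not real UIDs.
series_array = [
    Dict("PatientID" => "TCGA-QQ-A8VF", "Modality" => "CT",
         "SeriesInstanceUID" => "placeholder-uid-1"),
    Dict("PatientID" => "TCGA-QQ-A8VF", "Modality" => "MR",
         "SeriesInstanceUID" => "placeholder-uid-2"),
]

# get() with a default avoids a KeyError for fields the JSON output omits
flags = [get(entry, "AnnotationsFlag", missing) for entry in series_array]

# Fields present in both formats can be indexed directly
ct_uids = [entry["SeriesInstanceUID"] for entry in series_array
           if entry["Modality"] == "CT"]
```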

Imaging series size

The size (in bytes) and number of images for a given imaging series are given by:

julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")
1×2 DataFrame
│ Row │ TotalSizeInBytes │ ObjectCount │
│     │ Float64          │ Int64       │
├─────┼──────────────────┼─────────────┤
│ 1   │ 1.49149e8        │ 1120        │

Warning

It is recommended that tcia_series_size() not be used with format = "json", because the JSON output represents TotalSizeInBytes as a string/text rather than a number.
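If the JSON output is used anyway, the size has to be converted before any arithmetic. A minimal sketch with a mock entry is given below; the Dict layout is an assumption, and the value simply mirrors the rounded number in the DataFrame output above.

```julia
# Mock entry standing in for one element of tcia_series_size(...; format = "json").
# Assumption: TotalSizeInBytes arrives as text; the value mirrors the table above.
size_entry = Dict("TotalSizeInBytes" => "1.49149e8", "ObjectCount" => 1120)

# Convert the text to a number before doing arithmetic with it
size_bytes = parse(Float64, size_entry["TotalSizeInBytes"])

# Size in mebibytes (1 MiB = 2^20 bytes)
size_mib = size_bytes / 2^20
```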

Service-Object Pairs (SOP)

Each imaging series consists of one or more images, each of which has a service-object-pair unique identifier (SOPInstanceUID). These can be listed by:

julia> tcia_sop(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")
1120×1 DataFrame
│ Row  │ SOPInstanceUID                                                   │
│      │ String                                                           │
├──────┼──────────────────────────────────────────────────────────────────┤
│ 1    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.144018262636458572930176764010 │
│ 2    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.154324108703255968081619601090 │
│ 3    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.599091131190480435140295467926 │
│ 4    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.188777989493136421598164072645 │
│ 5    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.320926776031572167602265814085 │
│ 6    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.112724040783164800754854122892 │
│ 7    │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.191428415955329981952616210493 │
⋮
│ 1113 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.216754588853997343696538450924 │
│ 1114 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.319332759706029416400995097296 │
│ 1115 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.165979465460040197695092146160 │
│ 1116 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.175785847532920059054266261284 │
│ 1117 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.272387938688710785108532808119 │
│ 1118 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.195096450043378492624813997806 │
│ 1119 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.849941981930994444945059747092 │
│ 1120 │ 1.3.6.1.4.1.14519.5.2.1.4591.4001.121399683271444137920842834255 │

These identifiers are useful for accessing a specific image without having to download the entire imaging series.