Formats

Most functions in this module return either a DataFrame (default) or a Dictionary Array. Each format has its own advantages and disadvantages.

DataFrames/CSV

DataFrames display the results in a table and this is the default behaviour. The table is useful for visually analysing the results. For example, the studies in the TCGA-SARC collection can be obtained by:

studies_df = tcia_studies(collection = "TCGA-SARC")
6×15 DataFrame
| Row | StudyInstanceUID | StudyDate | StudyDescription | AdmittingDiagnosesDescription | StudyID | PatientAge | PatientID | PatientName | PatientBirthDate | PatientSex | EthnicGroup | Collection | SeriesCount | LongitudinalTemporalEventType | LongitudinalTemporalOffsetFromEvent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | String31 | String31 | Missing | Missing | String7 | String15 | String15 | Missing | String1 | Missing | String15 | Int64 | Missing | Missing |
| 1 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994 | 1998-08-03 00:00:00.0 | CT ANGIOGRAPHY PELVIS W/WO CON | missing | missing | 042Y | TCGA-QQ-A5V2 | TCGA-QQ-A5V2 | missing | M | missing | TCGA-SARC | 3 | missing | missing |
| 2 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133 | 2003-06-01 00:00:00.0 | CT CHEST ABD PELVIS W/ CONTR | missing | missing | 064Y | TCGA-QQ-A5VC | TCGA-QQ-A5VC | missing | F | missing | TCGA-SARC | 2 | missing | missing |
| 3 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748 | 1997-11-26 00:00:00.0 | MRI LOWER EXTREMITY W/WO CONT | missing | missing | 070Y | TCGA-QQ-A8VF | TCGA-QQ-A8VF | missing | M | missing | TCGA-SARC | 9 | missing | missing |
| 4 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485 | 2002-03-15 00:00:00.0 | MRI THORACIC W/WO CONTRAST | missing | missing | 031Y | TCGA-QQ-A8VH | TCGA-QQ-A8VH | missing | F | missing | TCGA-SARC | 11 | missing | missing |
| 5 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806 | 1997-11-29 00:00:00.0 | CT CHEST W/ CONTRAST | missing | missing | 070Y | TCGA-QQ-A8VF | TCGA-QQ-A8VF | missing | M | missing | TCGA-SARC | 6 | missing | missing |
| 6 | 1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548 | 2001-08-05 00:00:00.0 | CT CHEST ABD PELVIS W/ CONTR | missing | missing | 052Y | TCGA-QQ-A8VG | TCGA-QQ-A8VG | missing | M | missing | TCGA-SARC | 2 | missing | missing |

Manipulating the DataFrame object

The table can be manipulated using tools from the DataFrames.jl and CSV.jl packages.

Individual columns in a DataFrame object (suppose it is named data_frame) can be accessed by data_frame.column_name, where the available column names are listed by names(data_frame).

julia> names(studies_df)
15-element Vector{String}:
 "StudyInstanceUID"
 "StudyDate"
 "StudyDescription"
 "AdmittingDiagnosesDescription"
 "StudyID"
 "PatientAge"
 "PatientID"
 "PatientName"
 "PatientBirthDate"
 "PatientSex"
 "EthnicGroup"
 "Collection"
 "SeriesCount"
 "LongitudinalTemporalEventType"
 "LongitudinalTemporalOffsetFromEvent"
julia> studies_df.StudyDate
6-element Vector{InlineStrings.String31}:
 "1998-08-03 00:00:00.0"
 "2003-06-01 00:00:00.0"
 "1997-11-26 00:00:00.0"
 "2002-03-15 00:00:00.0"
 "1997-11-29 00:00:00.0"
 "2001-08-05 00:00:00.0"

The table can also be filtered and sorted. As an example, the following lines will sort the previous table by the number of series and then remove the StudyInstanceUID and PatientName columns:

using DataFrames
studies_sorted_by_count = sort(studies_df, :SeriesCount)
select!(studies_sorted_by_count, Not([:StudyInstanceUID, :PatientName]))
6×13 DataFrame
| Row | StudyDate | StudyDescription | AdmittingDiagnosesDescription | StudyID | PatientAge | PatientID | PatientBirthDate | PatientSex | EthnicGroup | Collection | SeriesCount | LongitudinalTemporalEventType | LongitudinalTemporalOffsetFromEvent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String31 | String31 | Missing | Missing | String7 | String15 | Missing | String1 | Missing | String15 | Int64 | Missing | Missing |
| 1 | 2003-06-01 00:00:00.0 | CT CHEST ABD PELVIS W/ CONTR | missing | missing | 064Y | TCGA-QQ-A5VC | missing | F | missing | TCGA-SARC | 2 | missing | missing |
| 2 | 2001-08-05 00:00:00.0 | CT CHEST ABD PELVIS W/ CONTR | missing | missing | 052Y | TCGA-QQ-A8VG | missing | M | missing | TCGA-SARC | 2 | missing | missing |
| 3 | 1998-08-03 00:00:00.0 | CT ANGIOGRAPHY PELVIS W/WO CON | missing | missing | 042Y | TCGA-QQ-A5V2 | missing | M | missing | TCGA-SARC | 3 | missing | missing |
| 4 | 1997-11-29 00:00:00.0 | CT CHEST W/ CONTRAST | missing | missing | 070Y | TCGA-QQ-A8VF | missing | M | missing | TCGA-SARC | 6 | missing | missing |
| 5 | 1997-11-26 00:00:00.0 | MRI LOWER EXTREMITY W/WO CONT | missing | missing | 070Y | TCGA-QQ-A8VF | missing | M | missing | TCGA-SARC | 9 | missing | missing |
| 6 | 2002-03-15 00:00:00.0 | MRI THORACIC W/WO CONTRAST | missing | missing | 031Y | TCGA-QQ-A8VH | missing | F | missing | TCGA-SARC | 11 | missing | missing |
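
Row filtering follows the same pattern as sorting and column selection. The sketch below is only an illustration: instead of calling tcia_studies, it builds a small stand-in table (studies_demo, a hypothetical name) from a few of the values shown above, then keeps the studies with more than five series using subset and filter from DataFrames.jl.

```julia
using DataFrames

# Stand-in for a few columns of studies_df (values copied from the table above)
studies_demo = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF", "TCGA-QQ-A8VH"],
    PatientSex = ["M", "F", "M", "F"],
    SeriesCount = [3, 2, 9, 11],
)

# Keep only the rows whose SeriesCount exceeds 5; ByRow applies the test to each row
large_studies = subset(studies_demo, :SeriesCount => ByRow(>(5)))

# filter with a column => predicate pair selects the same rows
large_studies_alt = filter(:SeriesCount => >(5), studies_demo)
```

Both calls return a new DataFrame and leave studies_demo untouched; the mutating variants subset! and filter! change the table in place, as select! does in the example above.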

Saving DataFrame as CSV

The contents of the table can be written to a CSV file by:

julia> dataframe_to_csv(dataframe = studies_df, file = "output_file.csv")
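
A file saved this way can be read back with CSV.jl. The following is a minimal sketch that uses CSV.write directly on a small stand-in table; whether dataframe_to_csv calls CSV.write internally is an assumption, and the table contents and temporary path are illustrative only.

```julia
using CSV, DataFrames

# Stand-in table; in practice this would be the DataFrame returned by a query
demo = DataFrame(PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC"], SeriesCount = [3, 2])

# Write to a temporary directory so the example leaves no files behind
path = joinpath(mktempdir(), "output_file.csv")
CSV.write(path, demo)

# Read the file back into a DataFrame; column types are inferred again on load
restored = CSV.read(path, DataFrame)
```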

DictionaryArray/JSON

Instead of a table, an array of dictionaries can be obtained by passing format = "json" as an argument when calling the query function. For example, the DataFrame from the previous example could have been obtained as an array by:

studies_array = tcia_studies(collection = "TCGA-SARC", format = "json")
6-element Vector{Any}:
 Dict{String, Any}("StudyDescription" => "CT ANGIOGRAPHY PELVIS W/WO CON", "PatientName" => "TCGA-QQ-A5V2", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A5V2", "SeriesCount" => 3, "PatientAge" => "042Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994", "Collection" => "TCGA-SARC", "StudyDate" => "1998-08-03 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A5VC", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A5VC", "SeriesCount" => 2, "PatientAge" => "064Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133", "Collection" => "TCGA-SARC", "StudyDate" => "2003-06-01 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI LOWER EXTREMITY W/WO CONT", "PatientName" => "TCGA-QQ-A8VF", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VF", "SeriesCount" => 9, "PatientAge" => "070Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748", "Collection" => "TCGA-SARC", "StudyDate" => "1997-11-26 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI THORACIC W/WO CONTRAST", "PatientName" => "TCGA-QQ-A8VH", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A8VH", "SeriesCount" => 11, "PatientAge" => "031Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485", "Collection" => "TCGA-SARC", "StudyDate" => "2002-03-15 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST W/ CONTRAST", "PatientName" => "TCGA-QQ-A8VF", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VF", "SeriesCount" => 6, "PatientAge" => "070Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806", "Collection" => "TCGA-SARC", "StudyDate" => "1997-11-29 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A8VG", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VG", "SeriesCount" => 2, "PatientAge" => "052Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548", "Collection" => "TCGA-SARC", "StudyDate" => "2001-08-05 00:00:00.0")

Manipulating the Dictionary Array

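The array can be manipulated by iterating over its elements. As an example, the following lines collect the studies of patients who are less than 60 years old. Note that PatientAge is a string, so the comparison is lexicographic; it works here because the ages are zero-padded to three digits: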

patients_below_60Y = []
for patient in studies_array
  if patient["PatientAge"] < "060Y"
    push!(patients_below_60Y, patient)
  end
end
# Print the new array:
patients_below_60Y
3-element Vector{Any}:
 Dict{String, Any}("StudyDescription" => "CT ANGIOGRAPHY PELVIS W/WO CON", "PatientName" => "TCGA-QQ-A5V2", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A5V2", "SeriesCount" => 3, "PatientAge" => "042Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994", "Collection" => "TCGA-SARC", "StudyDate" => "1998-08-03 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI THORACIC W/WO CONTRAST", "PatientName" => "TCGA-QQ-A8VH", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A8VH", "SeriesCount" => 11, "PatientAge" => "031Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485", "Collection" => "TCGA-SARC", "StudyDate" => "2002-03-15 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A8VG", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VG", "SeriesCount" => 2, "PatientAge" => "052Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548", "Collection" => "TCGA-SARC", "StudyDate" => "2001-08-05 00:00:00.0")

The available keys for each dictionary in the array are listed by:

julia> keys(studies_array[1])
KeySet for a Dict{String, Any} with 9 entries. Keys:
  "StudyDescription"
  "PatientName"
  "PatientSex"
  "PatientID"
  "SeriesCount"
  "PatientAge"
  "StudyInstanceUID"
  "Collection"
  "StudyDate"
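
The age filter shown earlier in this section can also be written with filter and a numeric comparison, which does not depend on the zero-padding of the age strings. This is a sketch under stated assumptions: studies_demo is a stand-in for studies_array with only two keys, age_in_years is a hypothetical helper, and every age is assumed to be a number followed by "Y".

```julia
# Stand-in for studies_array with only the keys needed here
studies_demo = [
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5V2", "PatientAge" => "042Y"),
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5VC", "PatientAge" => "064Y"),
    Dict{String, Any}("PatientID" => "TCGA-QQ-A8VH", "PatientAge" => "031Y"),
]

# Hypothetical helper: "042Y" -> 42 (assumes the format is digits followed by "Y")
age_in_years(study) = parse(Int, rstrip(study["PatientAge"], 'Y'))

# Keep the studies whose patient is younger than 60
below_60 = filter(study -> age_in_years(study) < 60, studies_demo)

# Collect one field from each remaining dictionary
ids = [study["PatientID"] for study in below_60]
```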

Saving Dictionary Array as JSON

The array can be written to a JSON file by:

julia> dictionary_to_json(dictionary = studies_array, file = "output_file.json")
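
A saved JSON file can be loaded again with the JSON.jl package. The sketch below writes a small stand-in array with JSON.json and reads it back with JSON.parsefile; whether dictionary_to_json relies on JSON.jl is an assumption, and the data and temporary path are illustrative only.

```julia
using JSON

# Stand-in for studies_array with only two keys per study
studies_demo = [
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5V2", "SeriesCount" => 3),
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5VC", "SeriesCount" => 2),
]

# Write to a temporary directory so the example leaves no files behind
path = joinpath(mktempdir(), "output_file.json")

# Serialize the array to a JSON string and write it to disk
write(path, JSON.json(studies_demo))

# Parse the file back into an array of key-value collections
restored = JSON.parsefile(path)
```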

Note on types

A DataFrame tries to infer the column types from the input, while the Dictionary Array accepts whatever the API returns. As a practical example, suppose we want to know the size of an imaging series; the DataFrame version will be

tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")
1×2 DataFrame
| Row | TotalSizeInBytes | ObjectCount |
|---|---|---|
| | Int64 | Int64 |
| 1 | 24012211806228 | 124 |

while the JSON version will be

julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948", format="json")[1]
Dict{String, Any} with 2 entries:
  "ObjectCount"      => 124
  "TotalSizeInBytes" => 24012211806228

The difference between the two is that the DataFrame version parses TotalSizeInBytes into a number (Int64), whereas the Dictionary Array keeps each value exactly as the API returns it, so a value that the API sends as a string stays a string.
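
When a numeric value does arrive as a string in a dictionary, it can be converted explicitly. The following sketch uses a hypothetical response in which TotalSizeInBytes is a string; to_int is an illustrative helper, not part of this module, and it returns an integer whether the value is already a number or a numeric string.

```julia
# Hypothetical API response in which the size arrives as a string
series_size = Dict{String, Any}(
    "ObjectCount" => 124,
    "TotalSizeInBytes" => "24012211806228",
)

# Hypothetical helper: pass integers through, parse numeric strings
to_int(value::Integer) = value
to_int(value::AbstractString) = parse(Int, value)

total_bytes = to_int(series_size["TotalSizeInBytes"])
object_count = to_int(series_size["ObjectCount"])

# Convert bytes to GiB for display
total_gib = total_bytes / 2^30
```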

DataFrames' ability to recognize types is usually helpful, but sometimes it can fail. For example, in an anonymized dataset where patient names are replaced by numbers, the DataFrames object will incorrectly treat the names as numbers.
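
If a column has been inferred as numeric when it should be text, it can be converted back after the fact. The sketch below uses a stand-in table (anonymized, a hypothetical name) in which the patient names were parsed as integers, and broadcasts string over the column.

```julia
using DataFrames

# Stand-in table where anonymized patient names were parsed as integers
anonymized = DataFrame(PatientName = [1001, 1002], SeriesCount = [3, 2])

# Replace the column with its string representation
anonymized.PatientName = string.(anonymized.PatientName)
```

This restores the string type but not any formatting lost during inference: leading zeros in the original names, for instance, cannot be recovered this way.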

These differences are unlikely to cause problems in practice, so they are not something to be actively concerned about.