Formats

Most functions in this module return either a DataFrame (the default) or a Dictionary Array (an array of dictionaries). Each format has its own advantages and disadvantages.

DataFrames/CSV

DataFrames display the results in a table and this is the default behaviour. The table is useful for visually analysing the results. For example, the studies in the TCGA-SARC collection can be obtained by:

studies_df = tcia_studies(collection = "TCGA-SARC")

6 rows × 9 columns

Row | Collection | PatientID    | PatientName  | PatientSex | StudyInstanceUID                                                 | StudyDate  | StudyDescription                | PatientAge | SeriesCount
    | String     | String       | String       | String     | String                                                           | Date       | String                          | String     | Int64
----+------------+--------------+--------------+------------+------------------------------------------------------------------+------------+---------------------------------+------------+------------
1   | TCGA-SARC  | TCGA-QQ-A5V2 | TCGA-QQ-A5V2 | M          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994 | 1998-08-03 | CT ANGIOGRAPHY PELVIS W\\/WO CON | 042Y       | 3
2   | TCGA-SARC  | TCGA-QQ-A5VC | TCGA-QQ-A5VC | F          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133 | 2003-06-01 | CT CHEST ABD PELVIS W\\/ CONTR  | 064Y       | 2
3   | TCGA-SARC  | TCGA-QQ-A8VF | TCGA-QQ-A8VF | M          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748 | 1997-11-26 | MRI LOWER EXTREMITY W\\/WO CONT  | 070Y       | 9
4   | TCGA-SARC  | TCGA-QQ-A8VF | TCGA-QQ-A8VF | M          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806 | 1997-11-29 | CT CHEST W\\/ CONTRAST          | 070Y       | 6
5   | TCGA-SARC  | TCGA-QQ-A8VH | TCGA-QQ-A8VH | F          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485 | 2002-03-15 | MRI THORACIC W\\/WO CONTRAST    | 031Y       | 11
6   | TCGA-SARC  | TCGA-QQ-A8VG | TCGA-QQ-A8VG | M          | 1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548 | 2001-08-05 | CT CHEST ABD PELVIS W\\/ CONTR  | 052Y       | 2

Manipulating the DataFrame object

The table can be manipulated using tools from the DataFrames.jl and CSV.jl packages.

Individual columns in a DataFrame object (suppose it is named data_frame) can be accessed as data_frame.column_name, where the available column names are given by names(data_frame).

julia> names(studies_df)
9-element Array{String,1}:
 "Collection"
 "PatientID"
 "PatientName"
 "PatientSex"
 "StudyInstanceUID"
 "StudyDate"
 "StudyDescription"
 "PatientAge"
 "SeriesCount"

julia> studies_df.StudyDate
6-element Array{Dates.Date,1}:
 1998-08-03
 2003-06-01
 1997-11-26
 1997-11-29
 2002-03-15
 2001-08-05
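
Since each column is an ordinary vector, it can be passed to the usual functions. The following is a minimal sketch under the assumption that DataFrames.jl is installed; the small table is built by hand from two columns of the output above so that the snippet runs without a network request, and the variable names are only illustrative.

```julia
using DataFrames

# Hand-built stand-in for two columns of studies_df (values copied from the table above).
studies = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF",
                 "TCGA-QQ-A8VF", "TCGA-QQ-A8VH", "TCGA-QQ-A8VG"],
    SeriesCount = [3, 2, 9, 6, 11, 2],
)

# Dot syntax and indexing by column name refer to the same column.
counts = studies.SeriesCount
same_counts = studies[!, :SeriesCount]

# Columns are ordinary vectors, so the usual functions apply.
total_series = sum(counts)                        # total number of series
n_patients = length(unique(studies.PatientID))    # number of distinct patients
```

Both access forms return the stored column without copying it, so modifying counts in place would also modify the table.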

The table can also be filtered and sorted. As an example, the following lines will sort the previous table by the number of series and then remove the StudyInstanceUID and PatientName columns:

using DataFrames
studies_sorted_by_count = sort(studies_df, :SeriesCount)
select!(studies_sorted_by_count, Not([:StudyInstanceUID, :PatientName]))

6 rows × 7 columns

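Row | Collection | PatientID    | PatientSex | StudyDate  | StudyDescription                | PatientAge | SeriesCount
    | String     | String       | String     | Date       | String                          | String     | Int64
----+------------+--------------+------------+------------+---------------------------------+------------+------------
1   | TCGA-SARC  | TCGA-QQ-A5VC | F          | 2003-06-01 | CT CHEST ABD PELVIS W\\/ CONTR  | 064Y       | 2
2   | TCGA-SARC  | TCGA-QQ-A8VG | M          | 2001-08-05 | CT CHEST ABD PELVIS W\\/ CONTR  | 052Y       | 2
3   | TCGA-SARC  | TCGA-QQ-A5V2 | M          | 1998-08-03 | CT ANGIOGRAPHY PELVIS W\\/WO CON | 042Y       | 3
4   | TCGA-SARC  | TCGA-QQ-A8VF | M          | 1997-11-29 | CT CHEST W\\/ CONTRAST          | 070Y       | 6
5   | TCGA-SARC  | TCGA-QQ-A8VF | M          | 1997-11-26 | MRI LOWER EXTREMITY W\\/WO CONT  | 070Y       | 9
6   | TCGA-SARC  | TCGA-QQ-A8VH | F          | 2002-03-15 | MRI THORACIC W\\/WO CONTRAST    | 031Y       | 11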
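
The example above sorts rows and drops columns; filtering rows works along the same lines. The following is a minimal sketch using filter and sort from DataFrames.jl. The table is a hand-built stand-in for a few columns of studies_df (values copied from the output above), and the threshold of five series is an arbitrary choice for illustration.

```julia
using DataFrames

# Hand-built stand-in for a few columns of studies_df (values copied from the table above).
studies = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF",
                 "TCGA-QQ-A8VF", "TCGA-QQ-A8VH", "TCGA-QQ-A8VG"],
    PatientSex = ["M", "F", "M", "M", "F", "M"],
    SeriesCount = [3, 2, 9, 6, 11, 2],
)

# Keep only the studies with more than five series.
large_studies = filter(row -> row.SeriesCount > 5, studies)

# Sort the remaining rows from the largest to the smallest series count.
large_studies_sorted = sort(large_studies, :SeriesCount, rev = true)
```

Both filter and sort return new tables here, so the original studies table is left unchanged.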

Saving DataFrame as CSV

The contents of the table can be written to a CSV file by:

julia> dataframe_to_csv(dataframe = studies_df, file = "output_file.csv")

DictionaryArray/JSON

Instead of a table, an array of dictionaries can be obtained by passing format = "json" as an argument when calling the query function. For example, the DataFrame from the previous example could have been obtained as an array by:

studies_array = tcia_studies(collection = "TCGA-SARC", format = "json")
6-element Array{Any,1}:
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5VC","PatientSex"=>"F","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A5VC","SeriesCount"=>2,"PatientAge"=>"064Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133","Collection"=>"TCGA-SARC","StudyDate"=>"2003-06-01")  
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"MRI LOWER EXTREMITY W/WO CONT","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>9,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-26") 
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"CT CHEST W/ CONTRAST","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>6,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-29")          
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")   
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")  

Manipulating the Dictionary Array

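The array can be manipulated by iterating over its elements. As an example, the following lines collect the studies of patients who are less than 60 years old. The ages are compared as strings, which works here because every age is a zero-padded number of years (for example "042Y"):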

patients_below_60Y = []
for patient in studies_array
  if patient["PatientAge"] < "060Y"
    push!(patients_below_60Y, patient)
  end
end
# Print the new array:
patients_below_60Y
3-element Array{Any,1}:
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")   
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")  
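
Comparing the ages as strings relies on every value having the same zero-padded form. A slightly more defensive sketch parses the age into an integer first. The helper age_in_years is hypothetical (it is not part of this module), and skipping ages that do not end in "Y" is an assumption made only for this illustration. The array is a hand-copied subset of the output above.

```julia
# Hypothetical helper: convert an age string such as "042Y" to an integer number of years.
# Returns nothing when the string is not of the form "<digits>Y".
function age_in_years(age::AbstractString)
    m = match(r"^(\d+)Y$", age)
    return m === nothing ? nothing : parse(Int, m[1])
end

# Stand-in for studies_array, reduced to two keys (values copied from the output above).
studies = [
    Dict("PatientID" => "TCGA-QQ-A5V2", "PatientAge" => "042Y"),
    Dict("PatientID" => "TCGA-QQ-A5VC", "PatientAge" => "064Y"),
    Dict("PatientID" => "TCGA-QQ-A8VH", "PatientAge" => "031Y"),
]

# Keep the studies whose patient age parses and is below 60 years.
below_60 = filter(studies) do study
    years = age_in_years(study["PatientAge"])
    years !== nothing && years < 60
end
```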

The available keys for each dictionary in the array are listed by:

julia> keys(studies_array[1])
Base.KeySet for a Dict{String,Any} with 9 entries. Keys:
  "PatientName"
  "PatientSex"
  "StudyDescription"
  "PatientID"
  "SeriesCount"
  "PatientAge"
  "StudyInstanceUID"
  "Collection"
  "StudyDate"

Saving Dictionary Array as JSON

The array can be written to a JSON file by:

julia> dictionary_to_json(dictionary = studies_array, file = "output_file.json")

Note on types

The DataFrame format tries to infer the column types from the input, while the Dictionary Array simply keeps whatever the API returns. For a practical example of this, suppose we want to know the size of an imaging series; the DataFrame version will be

tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")

1 rows × 2 columns

Row | TotalSizeInBytes | ObjectCount
    | Float64          | Int64
----+------------------+------------
1   | 1.49149e8        | 1120

while the JSON version will be

julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948", format="json")[1]
Dict{String,Any} with 2 entries:
  "ObjectCount"      => 1120
  "TotalSizeInBytes" => "149149266.000000"

The difference between the two is that the DataFrame version recognizes TotalSizeInBytes as a number, whereas the Dictionary Array keeps it as a string (because the API returns it as a string).
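
With the JSON format, such values can be converted by hand when a number or a date is needed. The following is a minimal sketch using only parse from Base and the Dates standard library. The two dictionaries are hand-copied stand-ins for results shown earlier on this page, reduced to the relevant keys.

```julia
using Dates

# Stand-in for the dictionary returned by tcia_series_size with format = "json".
series_size = Dict{String,Any}(
    "ObjectCount" => 1120,
    "TotalSizeInBytes" => "149149266.000000",
)

# Convert the string to a floating-point number of bytes.
total_bytes = parse(Float64, series_size["TotalSizeInBytes"])

# Derived quantity: size in mebibytes (1 MiB = 1024^2 bytes).
total_mib = total_bytes / 1024^2

# Stand-in for one element of studies_array; StudyDate is a string in the JSON format.
study = Dict{String,Any}("PatientID" => "TCGA-QQ-A5V2", "StudyDate" => "1998-08-03")

# Convert the ISO-formatted string to a Date value.
study_date = Date(study["StudyDate"])
```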

The DataFrame's ability to recognize types is usually helpful, but it can sometimes fail. For example, in an anonymized dataset where patient names are replaced by numbers, the DataFrame will incorrectly treat the names as numbers.
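
If a column has been inferred as numeric when it should be text, it can be converted back by broadcasting string over it. The following is a minimal sketch, assuming DataFrames.jl; the table and its values are hypothetical and only mimic an anonymized dataset. Note that anything lost during inference, such as leading zeros, cannot be recovered this way.

```julia
using DataFrames

# Hypothetical anonymized table in which the names were inferred as integers.
anonymized = DataFrame(PatientName = [1001, 1002, 1003], SeriesCount = [3, 2, 9])

# Replace the column with its string representation.
anonymized.PatientName = string.(anonymized.PatientName)
```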

These differences rarely cause problems in practice, so they need not be a major concern.