Formats
Most functions in this module return either a DataFrame (default) or a Dictionary Array. Both formats have their respective advantages and disadvantages.
DataFrames/CSV
DataFrames display the results in a table and this is the default behaviour. The table is useful for visually analysing the results. For example, the studies in the TCGA-SARC collection can be obtained by:
studies_df = tcia_studies(collection = "TCGA-SARC")
| | Collection | PatientID | PatientName | PatientSex | StudyInstanceUID | StudyDate | StudyDescription | PatientAge | SeriesCount |
|---|---|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | Date… | String | String | Int64 |
1 | TCGA-SARC | TCGA-QQ-A5V2 | TCGA-QQ-A5V2 | M | 1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994 | 1998-08-03 | CT ANGIOGRAPHY PELVIS W\\/WO CON | 042Y | 3 |
2 | TCGA-SARC | TCGA-QQ-A5VC | TCGA-QQ-A5VC | F | 1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133 | 2003-06-01 | CT CHEST ABD PELVIS W\\/ CONTR | 064Y | 2 |
3 | TCGA-SARC | TCGA-QQ-A8VF | TCGA-QQ-A8VF | M | 1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748 | 1997-11-26 | MRI LOWER EXTREMITY W\\/WO CONT | 070Y | 9 |
4 | TCGA-SARC | TCGA-QQ-A8VF | TCGA-QQ-A8VF | M | 1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806 | 1997-11-29 | CT CHEST W\\/ CONTRAST | 070Y | 6 |
5 | TCGA-SARC | TCGA-QQ-A8VH | TCGA-QQ-A8VH | F | 1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485 | 2002-03-15 | MRI THORACIC W\\/WO CONTRAST | 031Y | 11 |
6 | TCGA-SARC | TCGA-QQ-A8VG | TCGA-QQ-A8VG | M | 1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548 | 2001-08-05 | CT CHEST ABD PELVIS W\\/ CONTR | 052Y | 2 |
Manipulating the DataFrame object
The table can be manipulated using tools from the DataFrames.jl and CSV.jl packages.
Individual columns in a DataFrame object (suppose it is named data_frame) can be accessed by data_frame.column_name, where the available column names are listed by names(data_frame).
julia> names(studies_df)
9-element Array{String,1}:
"Collection"
"PatientID"
"PatientName"
"PatientSex"
"StudyInstanceUID"
"StudyDate"
"StudyDescription"
"PatientAge"
"SeriesCount"
julia> studies_df.StudyDate
6-element Array{Dates.Date,1}:
1998-08-03
2003-06-01
1997-11-26
1997-11-29
2002-03-15
2001-08-05
The table can also be filtered and sorted. As an example, the following lines sort the previous table by the number of series and then remove the StudyInstanceUID and PatientName columns:
using DataFrames
studies_sorted_by_count = sort(studies_df, :SeriesCount)
select!(studies_sorted_by_count, Not([:StudyInstanceUID, :PatientName]))
| | Collection | PatientID | PatientSex | StudyDate | StudyDescription | PatientAge | SeriesCount |
|---|---|---|---|---|---|---|---|
| | String | String | String | Date… | String | String | Int64 |
1 | TCGA-SARC | TCGA-QQ-A5VC | F | 2003-06-01 | CT CHEST ABD PELVIS W\\/ CONTR | 064Y | 2 |
2 | TCGA-SARC | TCGA-QQ-A8VG | M | 2001-08-05 | CT CHEST ABD PELVIS W\\/ CONTR | 052Y | 2 |
3 | TCGA-SARC | TCGA-QQ-A5V2 | M | 1998-08-03 | CT ANGIOGRAPHY PELVIS W\\/WO CON | 042Y | 3 |
4 | TCGA-SARC | TCGA-QQ-A8VF | M | 1997-11-29 | CT CHEST W\\/ CONTRAST | 070Y | 6 |
5 | TCGA-SARC | TCGA-QQ-A8VF | M | 1997-11-26 | MRI LOWER EXTREMITY W\\/WO CONT | 070Y | 9 |
6 | TCGA-SARC | TCGA-QQ-A8VH | F | 2002-03-15 | MRI THORACIC W\\/WO CONTRAST | 031Y | 11 |
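The example above covers sorting and column removal; filtering rows works along the same lines. The snippet below is a minimal sketch that assumes DataFrames.jl is installed. The table `studies_demo` is a hand-built stand-in for `studies_df` with only three of its columns, and the threshold of five series is arbitrary.

```julia
using DataFrames

# Hand-built stand-in for studies_df (an assumption made for illustration).
studies_demo = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF",
                 "TCGA-QQ-A8VF", "TCGA-QQ-A8VH", "TCGA-QQ-A8VG"],
    PatientSex = ["M", "F", "M", "M", "F", "M"],
    SeriesCount = [3, 2, 9, 6, 11, 2],
)

# Keep only the studies that contain more than five series.
large_studies = filter(:SeriesCount => >(5), studies_demo)

# Sort the filtered table in place, largest study first.
sort!(large_studies, :SeriesCount, rev = true)
```

Here `filter` returns a new table and leaves `studies_demo` untouched, whereas `sort!` modifies `large_studies` in place.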
Saving DataFrame as CSV
The contents of the table can be written to a CSV file by:
julia> dataframe_to_csv(dataframe = studies_df, file = "output_file.csv")
DictionaryArray/JSON
Instead of a table, an array of dictionaries can be obtained by passing format = "json" as an argument when calling the query function. For example, the results from the previous example could have been obtained as an array by:
studies_array = tcia_studies(collection = "TCGA-SARC", format = "json")
6-element Array{Any,1}:
Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A5VC","PatientSex"=>"F","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A5VC","SeriesCount"=>2,"PatientAge"=>"064Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133","Collection"=>"TCGA-SARC","StudyDate"=>"2003-06-01")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"MRI LOWER EXTREMITY W/WO CONT","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>9,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-26")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"CT CHEST W/ CONTRAST","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>6,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-29")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")
Manipulating the Dictionary Array
The array can be manipulated by iterating over its elements. As an example, the following lines collect the studies of patients who are less than 60 years old:
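patients_below_60Y = []
for patient in studies_array
    # Ages are zero-padded strings such as "042Y", so comparing them as
    # strings gives the right order as long as every age is given in years.
    if patient["PatientAge"] < "060Y"
        push!(patients_below_60Y, patient)
    end
end
# Print the new array:
patients_below_60Y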
3-element Array{Any,1}:
Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")
Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")
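The comparison above works on the age strings directly. A more explicit variant converts the age to an integer first, as in the sketch below. It uses only Base Julia. `studies_demo` is a hand-built stand-in for `studies_array` containing just two keys, and `age_in_years` is a hypothetical helper, not a function of this module; it assumes that ages are given in years (a trailing `Y`).

```julia
# Hand-built stand-in for the array returned with format = "json";
# only the keys needed for the example are included.
studies_demo = [
    Dict{String,Any}("PatientID" => "TCGA-QQ-A5V2", "PatientAge" => "042Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A5VC", "PatientAge" => "064Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A8VH", "PatientAge" => "031Y"),
]

# Hypothetical helper: convert an age such as "042Y" to the integer 42.
# Assumes the age is expressed in years; any other unit raises an error.
function age_in_years(age::AbstractString)
    endswith(age, "Y") || error("age is not given in years: $age")
    return parse(Int, chop(age))  # chop drops the trailing unit character
end

# Keep the studies whose patient is younger than 60.
younger_than_60 = filter(study -> age_in_years(study["PatientAge"]) < 60, studies_demo)
```

Using `filter` with an anonymous function replaces the explicit loop and `push!`, and the numeric comparison fails loudly if an age is not expressed in years.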
The available keys for each dictionary in the array are listed by:
julia> keys(studies_array[1])
Base.KeySet for a Dict{String,Any} with 9 entries. Keys:
"PatientName"
"PatientSex"
"StudyDescription"
"PatientID"
"SeriesCount"
"PatientAge"
"StudyInstanceUID"
"Collection"
"StudyDate"
Saving Dictionary Array as JSON
The array can be written to a JSON file by:
julia> dictionary_to_json(dictionary = studies_array, file = "output_file.json")
Note on types
The DataFrame format tries to infer the column types from the input, while the Dictionary Array just accepts whatever the API returns. As a practical example, suppose we want to know the size of an imaging series; the DataFrame version will be
tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")
| | TotalSizeInBytes | ObjectCount |
|---|---|---|
| | Float64 | Int64 |
1 | 1.49149e8 | 1120 |
while the JSON version will be
julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948", format="json")[1]
Dict{String,Any} with 2 entries:
"ObjectCount" => 1120
"TotalSizeInBytes" => "149149266.000000"
The difference between the two is that the DataFrame version recognizes that TotalSizeInBytes is a number, whereas the Dictionary Array displays it as a string (because the API returns it as a string).
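When working with the JSON format, the string can be converted by hand before doing any arithmetic. The following is a minimal sketch in Base Julia; `series_size` is a hand-built copy of the dictionary shown above rather than a live API result.

```julia
# Hand-built copy of the dictionary returned by the JSON query above.
series_size = Dict{String,Any}(
    "ObjectCount" => 1120,
    "TotalSizeInBytes" => "149149266.000000",
)

# Convert the string to a number before doing arithmetic with it.
total_bytes = parse(Float64, series_size["TotalSizeInBytes"])

# Size in megabytes (10^6 bytes), rounded to one decimal place.
total_megabytes = round(total_bytes / 1e6, digits = 1)
```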
The ability of the DataFrame format to recognize types is usually helpful, but it can sometimes fail. For example, in an anonymized dataset where patient names are replaced by numbers, the DataFrame will incorrectly treat the names as numbers.
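One situation where this could matter is when such numeric names carry leading zeros. The sketch below illustrates the effect in Base Julia. The names are hypothetical and not taken from any real collection, and `parse` merely stands in for whatever type inference the table reader performs.

```julia
# Hypothetical anonymized patient names; not taken from any real collection.
anonymized_names = ["00123", "00456"]

# Stand-in for numeric type inference: the names become integers.
as_numbers = parse.(Int, anonymized_names)

# Converting back to strings does not restore the original names.
round_trip = string.(as_numbers)

# If the original width is known (five characters here), zero-padding recovers them.
restored = lpad.(as_numbers, 5, '0')
```

If the original names are needed exactly as stored, requesting the JSON format avoids the conversion altogether.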
These differences are unlikely to cause problems in practice, so they rarely need special handling.