Formats

Most functions in this module return either a DataFrame (the default) or an array of dictionaries (a DictionaryArray). Each format has its own advantages and disadvantages.

DataFrames/CSV

A DataFrame displays the results as a table; this is the default behaviour. The table is useful for visually analysing the results. For example, the studies in the TCGA-SARC collection can be obtained by:

studies_df = tcia_studies(collection = "TCGA-SARC")

6 rows × 9 columns

	Collection	PatientID	PatientName	PatientSex	StudyInstanceUID	StudyDate	StudyDescription	PatientAge	SeriesCount
	String	String	String	String	String	Date…	String	String	Int64
1	TCGA-SARC	TCGA-QQ-A5V2	TCGA-QQ-A5V2	M	1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994	1998-08-03	CT ANGIOGRAPHY PELVIS W\\/WO CON	042Y	3
2	TCGA-SARC	TCGA-QQ-A5VC	TCGA-QQ-A5VC	F	1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133	2003-06-01	CT CHEST ABD PELVIS W\\/ CONTR	064Y	2
3	TCGA-SARC	TCGA-QQ-A8VF	TCGA-QQ-A8VF	M	1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748	1997-11-26	MRI LOWER EXTREMITY W\\/WO CONT	070Y	9
4	TCGA-SARC	TCGA-QQ-A8VF	TCGA-QQ-A8VF	M	1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806	1997-11-29	CT CHEST W\\/ CONTRAST	070Y	6
5	TCGA-SARC	TCGA-QQ-A8VH	TCGA-QQ-A8VH	F	1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485	2002-03-15	MRI THORACIC W\\/WO CONTRAST	031Y	11
6	TCGA-SARC	TCGA-QQ-A8VG	TCGA-QQ-A8VG	M	1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548	2001-08-05	CT CHEST ABD PELVIS W\\/ CONTR	052Y	2

Manipulating the DataFrame object

The table can be manipulated using tools from the DataFrames.jl and CSV.jl packages.

Individual columns in a DataFrame object (suppose it is named data_frame) can be accessed by data_frame.column_name, where the available column names are listed by names(data_frame).

julia> names(studies_df)
9-element Array{String,1}:
 "Collection"
 "PatientID"
 "PatientName"
 "PatientSex"
 "StudyInstanceUID"
 "StudyDate"
 "StudyDescription"
 "PatientAge"
 "SeriesCount"

julia> studies_df.StudyDate
6-element Array{Dates.Date,1}:
 1998-08-03
 2003-06-01
 1997-11-26
 1997-11-29
 2002-03-15
 2001-08-05

The table can also be filtered and sorted. As an example, the following lines will sort the previous table by the number of series and then remove the StudyInstanceUID and PatientName columns:

using DataFrames
studies_sorted_by_count = sort(studies_df, :SeriesCount)
select!(studies_sorted_by_count, Not([:StudyInstanceUID, :PatientName]))

6 rows × 7 columns

	Collection	PatientID	PatientSex	StudyDate	StudyDescription	PatientAge	SeriesCount
	String	String	String	Date…	String	String	Int64
1	TCGA-SARC	TCGA-QQ-A5VC	F	2003-06-01	CT CHEST ABD PELVIS W\\/ CONTR	064Y	2
2	TCGA-SARC	TCGA-QQ-A8VG	M	2001-08-05	CT CHEST ABD PELVIS W\\/ CONTR	052Y	2
3	TCGA-SARC	TCGA-QQ-A5V2	M	1998-08-03	CT ANGIOGRAPHY PELVIS W\\/WO CON	042Y	3
4	TCGA-SARC	TCGA-QQ-A8VF	M	1997-11-29	CT CHEST W\\/ CONTRAST	070Y	6
5	TCGA-SARC	TCGA-QQ-A8VF	M	1997-11-26	MRI LOWER EXTREMITY W\\/WO CONT	070Y	9
6	TCGA-SARC	TCGA-QQ-A8VH	F	2002-03-15	MRI THORACIC W\\/WO CONTRAST	031Y	11
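Rows can be filtered in a similar way. The following is a minimal sketch rather than part of this module: it assumes DataFrames.jl is installed and builds a small stand-in table with a few of the columns and values shown above, so that it runs without querying the API. The name large_studies is only illustrative.

```julia
using DataFrames

# Stand-in for studies_df, with values copied from the table above
studies_df = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF",
                 "TCGA-QQ-A8VF", "TCGA-QQ-A8VH", "TCGA-QQ-A8VG"],
    PatientSex = ["M", "F", "M", "M", "F", "M"],
    SeriesCount = [3, 2, 9, 6, 11, 2],
)

# Keep only the studies with more than five series
large_studies = filter(:SeriesCount => n -> n > 5, studies_df)

# Sort the remaining rows from most to fewest series
large_studies = sort(large_studies, :SeriesCount, rev = true)
```

Both filter and sort return new tables here, so studies_df itself is left unchanged.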

Saving DataFrame as CSV

The contents of the table can be written to a CSV file by:

julia> dataframe_to_csv(dataframe = studies_df, file = "output_file.csv")

DictionaryArray/JSON

Instead of a table, an array of dictionaries can be obtained by passing format = "json" as an argument when calling the query function. For example, the DataFrame from the previous example could have been obtained as an array by:

studies_array = tcia_studies(collection = "TCGA-SARC", format = "json")

6-element Array{Any,1}:
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5VC","PatientSex"=>"F","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A5VC","SeriesCount"=>2,"PatientAge"=>"064Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133","Collection"=>"TCGA-SARC","StudyDate"=>"2003-06-01")  
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"MRI LOWER EXTREMITY W/WO CONT","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>9,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-26") 
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VF","PatientSex"=>"M","StudyDescription"=>"CT CHEST W/ CONTRAST","PatientID"=>"TCGA-QQ-A8VF","SeriesCount"=>6,"PatientAge"=>"070Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806","Collection"=>"TCGA-SARC","StudyDate"=>"1997-11-29")          
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")   
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")

Manipulating the Dictionary Array

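The array can be manipulated by iterating over the elements. As an example, the following lines collect the studies of patients who are less than 60 years old. Because PatientAge is a zero-padded string such as "042Y", the comparison below is a string (lexicographic) comparison: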

patients_below_60Y = []
for patient in studies_array
  if patient["PatientAge"] < "060Y"
    push!(patients_below_60Y, patient)
  end
end
# Print the new array:
patients_below_60Y

3-element Array{Any,1}:
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A5V2","PatientSex"=>"M","StudyDescription"=>"CT ANGIOGRAPHY PELVIS W/WO CON","PatientID"=>"TCGA-QQ-A5V2","SeriesCount"=>3,"PatientAge"=>"042Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994","Collection"=>"TCGA-SARC","StudyDate"=>"1998-08-03")
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VH","PatientSex"=>"F","StudyDescription"=>"MRI THORACIC W/WO CONTRAST","PatientID"=>"TCGA-QQ-A8VH","SeriesCount"=>11,"PatientAge"=>"031Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485","Collection"=>"TCGA-SARC","StudyDate"=>"2002-03-15")   
 Dict{String,Any}("PatientName"=>"TCGA-QQ-A8VG","PatientSex"=>"M","StudyDescription"=>"CT CHEST ABD PELVIS W/ CONTR","PatientID"=>"TCGA-QQ-A8VG","SeriesCount"=>2,"PatientAge"=>"052Y","StudyInstanceUID"=>"1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548","Collection"=>"TCGA-SARC","StudyDate"=>"2001-08-05")
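The string comparison above works because every age in this collection is zero-padded and expressed in years. A more defensive variant, sketched below, converts the age to an integer first. It assumes the DICOM age-string layout (three digits followed by a unit letter); age_in_years is a hypothetical helper, not a function of this module, and the array is a stand-in containing only the keys used here.

```julia
# Hypothetical helper: convert an age string such as "042Y" to whole years.
# Returns nothing when the string is not three digits followed by "Y"
# (for example, an age given in months).
function age_in_years(age::AbstractString)
    m = match(r"^(\d{3})Y$", age)
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

# Stand-in for studies_array, with values copied from the output above
studies_array = [
    Dict{String,Any}("PatientID" => "TCGA-QQ-A5V2", "PatientAge" => "042Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A5VC", "PatientAge" => "064Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A8VF", "PatientAge" => "070Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A8VH", "PatientAge" => "031Y"),
    Dict{String,Any}("PatientID" => "TCGA-QQ-A8VG", "PatientAge" => "052Y"),
]

# Keep the entries whose age could be parsed and is below 60 years
patients_below_60Y = filter(studies_array) do patient
    years = age_in_years(patient["PatientAge"])
    years !== nothing && years < 60
end
```

Entries whose age is not expressed in years are skipped rather than compared as text, which avoids silently misordering values such as "006M".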

The available keys for each dictionary in the array are listed by:

julia> keys(studies_array[1])
Base.KeySet for a Dict{String,Any} with 9 entries. Keys:
  "PatientName"
  "PatientSex"
  "StudyDescription"
  "PatientID"
  "SeriesCount"
  "PatientAge"
  "StudyInstanceUID"
  "Collection"
  "StudyDate"

Saving Dictionary Array as JSON

The array can be written to a JSON file by:

julia> dictionary_to_json(dictionary = studies_array, file = "output_file.json")

Note on types

The DataFrame format infers the column types from the input, whereas the DictionaryArray keeps whatever types the API returns. As a practical example, suppose we want to know the size of an imaging series; the DataFrame version will be

tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")

1 rows × 2 columns

	TotalSizeInBytes	ObjectCount
	Float64	Int64
1	1.49149e8	1120

while the JSON version will be

julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948", format="json")[1]
Dict{String,Any} with 2 entries:
  "ObjectCount"      => 1120
  "TotalSizeInBytes" => "149149266.000000"

The difference is that the DataFrame version recognizes that TotalSizeInBytes is a number, whereas the DictionaryArray keeps it as a string (because the API returns it as a string).
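When working with the JSON format, the string can be converted explicitly before doing arithmetic. The sketch below uses a stand-in dictionary with the values shown above instead of calling the API; the variable names are illustrative only.

```julia
# Stand-in for the dictionary returned with format = "json"
series_size = Dict{String,Any}(
    "ObjectCount" => 1120,
    "TotalSizeInBytes" => "149149266.000000",
)

# Convert the string to a number before using it in calculations
size_in_bytes = parse(Float64, series_size["TotalSizeInBytes"])
size_in_megabytes = size_in_bytes / 1_000_000
```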

This type inference is usually helpful, but it can occasionally misfire. For example, in an anonymized dataset where patient names are replaced by numbers, the DataFrame will treat the names as numbers rather than as strings.
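If such a column is needed as text, it can be converted back with broadcasting. This is only a sketch on a hypothetical vector of numeric patient names; for a DataFrame column the same call would be applied to that column. Any leading zeros dropped during type inference cannot be recovered this way.

```julia
# Hypothetical anonymized patient names that were inferred as integers
patient_names = [1001, 1002, 1003]

# Convert every element back to a string
patient_names_as_text = string.(patient_names)
```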

These differences are unlikely to cause problems in practice, so they are not something to be actively concerned about.