Formats

Most functions in this module return either a DataFrame (default) or a Dictionary Array. Each format has its own advantages and disadvantages.

DataFrames/CSV

DataFrames display the results in a table and this is the default behaviour. The table is useful for visually analysing the results. For example, the studies in the TCGA-SARC collection can be obtained by:

studies_df = tcia_studies(collection = "TCGA-SARC")

6×15 DataFrame

Row	StudyInstanceUID	StudyDate	StudyDescription	AdmittingDiagnosesDescription	StudyID	PatientAge	PatientID	PatientName	PatientBirthDate	PatientSex	EthnicGroup	Collection	SeriesCount	LongitudinalTemporalEventType	LongitudinalTemporalOffsetFromEvent
	String	String31	String31	Missing	Missing	String7	String15	String15	Missing	String1	Missing	String15	Int64	Missing	Missing
1	1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994	1998-08-03 00:00:00.0	CT ANGIOGRAPHY PELVIS W/WO CON	missing	missing	042Y	TCGA-QQ-A5V2	TCGA-QQ-A5V2	missing	M	missing	TCGA-SARC	3	missing	missing
2	1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133	2003-06-01 00:00:00.0	CT CHEST ABD PELVIS W/ CONTR	missing	missing	064Y	TCGA-QQ-A5VC	TCGA-QQ-A5VC	missing	F	missing	TCGA-SARC	2	missing	missing
3	1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748	1997-11-26 00:00:00.0	MRI LOWER EXTREMITY W/WO CONT	missing	missing	070Y	TCGA-QQ-A8VF	TCGA-QQ-A8VF	missing	M	missing	TCGA-SARC	9	missing	missing
4	1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485	2002-03-15 00:00:00.0	MRI THORACIC W/WO CONTRAST	missing	missing	031Y	TCGA-QQ-A8VH	TCGA-QQ-A8VH	missing	F	missing	TCGA-SARC	11	missing	missing
5	1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806	1997-11-29 00:00:00.0	CT CHEST W/ CONTRAST	missing	missing	070Y	TCGA-QQ-A8VF	TCGA-QQ-A8VF	missing	M	missing	TCGA-SARC	6	missing	missing
6	1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548	2001-08-05 00:00:00.0	CT CHEST ABD PELVIS W/ CONTR	missing	missing	052Y	TCGA-QQ-A8VG	TCGA-QQ-A8VG	missing	M	missing	TCGA-SARC	2	missing	missing

Manipulating the DataFrame object

The table can be manipulated using tools from the DataFrames.jl and CSV.jl packages.

Individual columns in a DataFrame object (suppose it is named data_frame) can be accessed by data_frame.column_name, where the available column names are listed by names(data_frame).

julia> names(studies_df)
15-element Vector{String}:
 "StudyInstanceUID"
 "StudyDate"
 "StudyDescription"
 "AdmittingDiagnosesDescription"
 "StudyID"
 "PatientAge"
 "PatientID"
 "PatientName"
 "PatientBirthDate"
 "PatientSex"
 "EthnicGroup"
 "Collection"
 "SeriesCount"
 "LongitudinalTemporalEventType"
 "LongitudinalTemporalOffsetFromEvent"

julia> studies_df.StudyDate
6-element Vector{InlineStrings.String31}:
 "1998-08-03 00:00:00.0"
 "2003-06-01 00:00:00.0"
 "1997-11-26 00:00:00.0"
 "2002-03-15 00:00:00.0"
 "1997-11-29 00:00:00.0"
 "2001-08-05 00:00:00.0"

The table can also be filtered and sorted. As an example, the following lines will sort the previous table by the number of series and then remove the StudyInstanceUID and PatientName columns:

using DataFrames
studies_sorted_by_count = sort(studies_df, :SeriesCount)
select!(studies_sorted_by_count, Not([:StudyInstanceUID, :PatientName]))

6×13 DataFrame

Row	StudyDate	StudyDescription	AdmittingDiagnosesDescription	StudyID	PatientAge	PatientID	PatientBirthDate	PatientSex	EthnicGroup	Collection	SeriesCount	LongitudinalTemporalEventType	LongitudinalTemporalOffsetFromEvent
	String31	String31	Missing	Missing	String7	String15	Missing	String1	Missing	String15	Int64	Missing	Missing
1	2003-06-01 00:00:00.0	CT CHEST ABD PELVIS W/ CONTR	missing	missing	064Y	TCGA-QQ-A5VC	missing	F	missing	TCGA-SARC	2	missing	missing
2	2001-08-05 00:00:00.0	CT CHEST ABD PELVIS W/ CONTR	missing	missing	052Y	TCGA-QQ-A8VG	missing	M	missing	TCGA-SARC	2	missing	missing
3	1998-08-03 00:00:00.0	CT ANGIOGRAPHY PELVIS W/WO CON	missing	missing	042Y	TCGA-QQ-A5V2	missing	M	missing	TCGA-SARC	3	missing	missing
4	1997-11-29 00:00:00.0	CT CHEST W/ CONTRAST	missing	missing	070Y	TCGA-QQ-A8VF	missing	M	missing	TCGA-SARC	6	missing	missing
5	1997-11-26 00:00:00.0	MRI LOWER EXTREMITY W/WO CONT	missing	missing	070Y	TCGA-QQ-A8VF	missing	M	missing	TCGA-SARC	9	missing	missing
6	2002-03-15 00:00:00.0	MRI THORACIC W/WO CONTRAST	missing	missing	031Y	TCGA-QQ-A8VH	missing	F	missing	TCGA-SARC	11	missing	missing
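
The example above sorts the table and drops columns. Row filtering can be sketched with filter from DataFrames.jl. The table below is a hand-built stand-in for a few columns of studies_df (values copied from the output above) so that the snippet runs without querying the API; the name studies_small and the threshold of 5 series are arbitrary choices made for illustration.

```julia
using DataFrames

# Hand-built stand-in for three columns of studies_df (values from the table above)
studies_small = DataFrame(
    PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC", "TCGA-QQ-A8VF",
                 "TCGA-QQ-A8VH", "TCGA-QQ-A8VF", "TCGA-QQ-A8VG"],
    PatientSex = ["M", "F", "M", "F", "M", "M"],
    SeriesCount = [3, 2, 9, 11, 6, 2],
)

# Keep only the studies with more than 5 series (returns a new DataFrame)
large_studies = filter(:SeriesCount => >(5), studies_small)

# Sort the result in place, from most to fewest series
sort!(large_studies, :SeriesCount, rev = true)
```

Because filter returns a new table, studies_small is left unchanged; filter! would modify it in place instead.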

Saving DataFrame as CSV

The contents of the table can be written to a CSV file by:

julia> dataframe_to_csv(dataframe = studies_df, file = "output_file.csv")
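
A CSV file can be loaded back into a DataFrame with CSV.jl. The following is a minimal round-trip sketch that uses CSV.write and CSV.read directly on a small hand-built table; it does not show how dataframe_to_csv is implemented, and the temporary path is only there so that the snippet does not overwrite an existing file.

```julia
using CSV, DataFrames

# Small hand-built table standing in for a query result
table = DataFrame(PatientID = ["TCGA-QQ-A5V2", "TCGA-QQ-A5VC"], SeriesCount = [3, 2])

# Write the table to a temporary CSV file, then read it back
path = joinpath(mktempdir(), "output_file.csv")
CSV.write(path, table)
table_read_back = CSV.read(path, DataFrame)
```

Column types are inferred again when the file is read, so they may differ from those of the table that was written.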

DictionaryArray/JSON

Instead of a table, an array of dictionaries can be obtained by passing format = "json" as an argument when calling the query function. For example, the DataFrame from the previous example could have been obtained as an array by:

studies_array = tcia_studies(collection = "TCGA-SARC", format = "json")

6-element Vector{Any}:
 Dict{String, Any}("StudyDescription" => "CT ANGIOGRAPHY PELVIS W/WO CON", "PatientName" => "TCGA-QQ-A5V2", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A5V2", "SeriesCount" => 3, "PatientAge" => "042Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994", "Collection" => "TCGA-SARC", "StudyDate" => "1998-08-03 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A5VC", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A5VC", "SeriesCount" => 2, "PatientAge" => "064Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.201397008342044883871995521133", "Collection" => "TCGA-SARC", "StudyDate" => "2003-06-01 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI LOWER EXTREMITY W/WO CONT", "PatientName" => "TCGA-QQ-A8VF", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VF", "SeriesCount" => 9, "PatientAge" => "070Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.150353595775386960889448399748", "Collection" => "TCGA-SARC", "StudyDate" => "1997-11-26 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI THORACIC W/WO CONTRAST", "PatientName" => "TCGA-QQ-A8VH", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A8VH", "SeriesCount" => 11, "PatientAge" => "031Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485", "Collection" => "TCGA-SARC", "StudyDate" => "2002-03-15 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST W/ CONTRAST", "PatientName" => "TCGA-QQ-A8VF", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VF", "SeriesCount" => 6, "PatientAge" => "070Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.298690116465423805879206377806", "Collection" => "TCGA-SARC", "StudyDate" => "1997-11-29 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A8VG", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VG", "SeriesCount" => 2, "PatientAge" => "052Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548", "Collection" => "TCGA-SARC", "StudyDate" => "2001-08-05 00:00:00.0")

Manipulating the Dictionary Array

The array can be manipulated by iterating over its elements. As an example, the following lines collect the entries whose patient is less than 60 years old:

patients_below_60Y = []
for patient in studies_array
  if patient["PatientAge"] < "060Y"
    push!(patients_below_60Y, patient)
  end
end
# Print the new array:
patients_below_60Y

3-element Vector{Any}:
 Dict{String, Any}("StudyDescription" => "CT ANGIOGRAPHY PELVIS W/WO CON", "PatientName" => "TCGA-QQ-A5V2", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A5V2", "SeriesCount" => 3, "PatientAge" => "042Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.194945893650500105451818109994", "Collection" => "TCGA-SARC", "StudyDate" => "1998-08-03 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "MRI THORACIC W/WO CONTRAST", "PatientName" => "TCGA-QQ-A8VH", "PatientSex" => "F", "PatientID" => "TCGA-QQ-A8VH", "SeriesCount" => 11, "PatientAge" => "031Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.215308722288168917637555384485", "Collection" => "TCGA-SARC", "StudyDate" => "2002-03-15 00:00:00.0")
 Dict{String, Any}("StudyDescription" => "CT CHEST ABD PELVIS W/ CONTR", "PatientName" => "TCGA-QQ-A8VG", "PatientSex" => "M", "PatientID" => "TCGA-QQ-A8VG", "SeriesCount" => 2, "PatientAge" => "052Y", "StudyInstanceUID" => "1.3.6.1.4.1.14519.5.2.1.3023.4024.122712175454295553928844869548", "Collection" => "TCGA-SARC", "StudyDate" => "2001-08-05 00:00:00.0")
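
The comparison against "060Y" works because the ages shown here are zero-padded strings ending in Y, so string order matches numeric order. A slightly more defensive sketch parses the number first. The helper age_in_years is hypothetical (it is not part of this module) and assumes the "NNNY" form seen in the output above; the two dictionaries are hand-copied stand-ins for entries of studies_array.

```julia
# Hypothetical helper: convert an age string such as "042Y" to a number of years.
# Returns nothing when the string is not in the assumed "NNNY" form.
function age_in_years(age::AbstractString)
    m = match(r"^(\d+)Y$", age)
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

# Stand-ins for two entries of studies_array
studies_sample = [
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5V2", "PatientAge" => "042Y"),
    Dict{String, Any}("PatientID" => "TCGA-QQ-A5VC", "PatientAge" => "064Y"),
]

# Keep the entries whose age can be parsed and is below 60
below_60 = filter(studies_sample) do study
    age = age_in_years(study["PatientAge"])
    age !== nothing && age < 60
end
```

Entries whose age is not in the expected form are dropped here rather than raising an error; whether that is the right choice depends on the analysis.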

The available keys for each dictionary in the array are listed by:

julia> keys(studies_array[1])
KeySet for a Dict{String, Any} with 9 entries. Keys:
  "StudyDescription"
  "PatientName"
  "PatientSex"
  "PatientID"
  "SeriesCount"
  "PatientAge"
  "StudyInstanceUID"
  "Collection"
  "StudyDate"

Saving Dictionary Array as JSON

The array can be written to a JSON file by:

julia> dictionary_to_json(dictionary = studies_array, file = "output_file.json")

Note on types

The DataFrame tries to infer the type of each column from the input, while the DictionaryArray just accepts whatever the API returns. For a practical example of this, suppose we want to know the size of an imaging series; the DataFrame version will be

tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948")

1×2 DataFrame

Row	TotalSizeInBytes	ObjectCount
	Int64	Int64
1	24012211806228	124

while the JSON version will be

julia> tcia_series_size(series = "1.3.6.1.4.1.14519.5.2.1.4591.4001.241972527061347495484079664948", format="json")[1]
Dict{String, Any} with 2 entries:
  "ObjectCount"      => 124
  "TotalSizeInBytes" => 24012211806228

The difference between the two is that the DataFrame version stores TotalSizeInBytes as a number (Int64), whereas the DictionaryArray keeps the value exactly as the API returns it: if the API returns it as a string, it stays a string.
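
When working with the JSON format, a value that may arrive either as a string or as a number can be normalised before doing arithmetic on it. The following is a minimal sketch using only Base Julia; as_int is a hypothetical helper, not a function of this module, and the dictionary is a hand-written stand-in in which the size is stored as a string.

```julia
# Hypothetical helper: accept either an integer or a string of digits
as_int(x::Integer) = Int64(x)
as_int(x::AbstractString) = parse(Int64, x)

# Stand-in for one entry of a JSON result, with the size stored as a string
series_size = Dict{String, Any}(
    "ObjectCount" => 124,
    "TotalSizeInBytes" => "24012211806228",
)

# Both values are now integers, whatever form they arrived in
total_bytes = as_int(series_size["TotalSizeInBytes"])
object_count = as_int(series_size["ObjectCount"])
```

Using two methods of the same function lets dispatch pick the conversion, so the calling code does not need to check the type itself.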

The DataFrame's ability to infer types is usually helpful, but sometimes it can fail. For example, in an anonymized dataset where patient names are replaced by numbers, the DataFrame will incorrectly treat the names as numbers.
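
If a column of identifiers has been inferred as numbers, it can be converted back to strings with broadcasting. The following is a minimal sketch on a hand-built table standing in for such a result; the name anonymized and its values are made up for illustration.

```julia
using DataFrames

# Stand-in for a result in which anonymized patient names were inferred as integers
anonymized = DataFrame(PatientName = [1001, 1002], SeriesCount = [3, 2])

# Replace the column with its string representation
anonymized.PatientName = string.(anonymized.PatientName)
```

Any leading zeros that were lost when the values were read as numbers are not recovered by this conversion; if they matter, the JSON format keeps the values as the API returned them.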

These differences are unlikely to cause problems in practice, so they are not something to be actively concerned about.