How to perform an exploratory data analysis
To explore preprocessing, encodings and/or reports without running a machine learning algorithm, the ExploratoryAnalysis instruction should be used. The components in the definitions section are defined in the same manner as for all other instructions (see How to specify an analysis with YAML; for importing a dataset, see How to import data into immuneML).
The instruction consists of a list of analyses to be performed. Each analysis should contain at least a dataset and a reports section. Optionally, the analysis may also contain an encoding, dim_reduction, and labels, if applicable.
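The rule above can be sketched as a small check on one analysis entry, represented here as a plain Python dictionary. This is only an illustration and not immuneML's own validation: the function name check_analysis and the two key sets are assumptions read off the paragraph above, and immuneML itself may accept further keys.

```python
# Keys named in the text above; illustrative only, not immuneML's validator.
REQUIRED_KEYS = {"dataset", "reports"}
OPTIONAL_KEYS = {"encoding", "dim_reduction", "labels"}


def check_analysis(name, analysis):
    """Raise ValueError if a required key is missing.

    Returns the sorted list of optional keys that the analysis uses.
    """
    missing = sorted(REQUIRED_KEYS - set(analysis))
    if missing:
        raise ValueError(f"analysis '{name}' is missing required key(s): {missing}")
    return sorted(OPTIONAL_KEYS & set(analysis))


# An analysis with only the required keys uses no optional ones.
print(check_analysis("my_analysis_1", {"dataset": "my_dataset",
                                       "reports": ["my_seq_lengths"]}))
```

An entry that omits reports, for instance, would raise a ValueError naming the missing key.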
In the example below, three analyses are done:
- my_analysis_1 runs report my_seq_lengths directly on dataset my_dataset.
- my_analysis_2 first encodes my_dataset using my_regex_matches before running report my_matches.
- my_dim_red_analysis first encodes my_dataset with 3-mer frequencies, performs dimensionality reduction using t-SNE, and then runs report dim_red_report.
definitions:
  datasets:
    # imported datasets
    my_dataset: # user-defined dataset name
      format: AIRR
      params:
        metadata_file: path/to/metadata.csv
        path: path/to/data/
  encodings:
    my_regex_matches:
      MatchedRegex:
        motif_filepath: path/to/regex_file.tsv
    3mer_freq: KmerFrequency
  ml_methods:
    my_dim_red_method: # user-defined method name
      TSNE:
        n_components: 2
        init: pca
  reports:
    my_seq_lengths: SequenceLengthDistribution # reports without parameters
    my_matches: Matches
    dim_red_report:
      DimensionalityReduction:
        labels: [disease_label] # a list of labels to be used for coloring the points in the plot (a separate plot will be made for each label)
instructions:
  my_instruction: # user-defined instruction name
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1: # user-defined analysis name
        dataset: my_dataset
        reports: [my_seq_lengths]
      my_analysis_2:
        dataset: my_dataset
        encoding: my_regex_matches
        reports: [my_matches]
      my_dim_red_analysis:
        dataset: my_dataset
        encoding: 3mer_freq
        dim_reduction: my_dim_red_method
        reports: [dim_red_report]
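Every name used inside an analysis has to match a name given under definitions. The sketch below checks this for a specification that has already been loaded into a Python dictionary. It is a hypothetical helper, not part of immuneML: the function undefined_references and the key-to-section mapping are assumptions inferred from the example above, and the definitions are abbreviated to their names.

```python
# Assumed mapping from an analysis key to the definitions section it refers to.
SECTION_FOR_KEY = {
    "dataset": "datasets",
    "encoding": "encodings",
    "dim_reduction": "ml_methods",
    "reports": "reports",
}


def undefined_references(spec):
    """Return (analysis, key, name) triples whose name is not in definitions."""
    definitions = spec["definitions"]
    missing = []
    for instruction in spec["instructions"].values():
        for analysis_name, analysis in instruction.get("analyses", {}).items():
            for key, section in SECTION_FOR_KEY.items():
                if key not in analysis:
                    continue
                value = analysis[key]
                # "reports" holds a list of names; the other keys hold one name.
                names = value if isinstance(value, list) else [value]
                for name in names:
                    if name not in definitions.get(section, {}):
                        missing.append((analysis_name, key, name))
    return missing


# The example specification, with the parameters of each definition left out.
spec = {
    "definitions": {
        "datasets": {"my_dataset": {}},
        "encodings": {"my_regex_matches": {}, "3mer_freq": {}},
        "ml_methods": {"my_dim_red_method": {}},
        "reports": {"my_seq_lengths": {}, "my_matches": {}, "dim_red_report": {}},
    },
    "instructions": {
        "my_instruction": {
            "type": "ExploratoryAnalysis",
            "analyses": {
                "my_analysis_1": {"dataset": "my_dataset",
                                  "reports": ["my_seq_lengths"]},
                "my_analysis_2": {"dataset": "my_dataset",
                                  "encoding": "my_regex_matches",
                                  "reports": ["my_matches"]},
                "my_dim_red_analysis": {"dataset": "my_dataset",
                                        "encoding": "3mer_freq",
                                        "dim_reduction": "my_dim_red_method",
                                        "reports": ["dim_red_report"]},
            },
        }
    },
}

print(undefined_references(spec))  # prints []
```

A misspelled report name in one of the analyses would show up in the returned list together with the analysis and key that use it.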
Note that for analysis 2, the file regex_file.tsv must be a tab-separated file, which may contain the following contents:
id | TRB_regex
---|---
1 | ACG
2 | EDNA
3 | DFWG
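A file with these contents could be produced with Python's standard csv module, as in the minimal sketch below. The helper write_regex_file and the temporary output directory are illustrative assumptions; only the column names and the three rows come from the table above.

```python
import csv
import tempfile
from pathlib import Path


def write_regex_file(path, rows):
    """Write (id, regex) pairs as a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "TRB_regex"])  # header taken from the table above
        writer.writerows(rows)


# Write the example rows to a temporary directory (hypothetical location).
out_dir = Path(tempfile.mkdtemp())
regex_path = out_dir / "regex_file.tsv"
write_regex_file(regex_path, [(1, "ACG"), (2, "EDNA"), (3, "DFWG")])
print(regex_path.read_text())
```

The path of the resulting file is what motif_filepath in the encoding definition should point to.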