How to perform an exploratory data analysis

To explore preprocessing, encodings and/or reports without running a machine learning algorithm, use the ExploratoryAnalysis instruction. The components in the definitions section are defined in the same manner as for all other instructions (see How to specify an analysis with YAML; for importing a dataset, see How to import data into immuneML).

The instruction consists of a list of analyses to be performed. Each analysis should contain at least a dataset and a reports section. Optionally, the analysis may also contain an encoding, dim_reduction, and labels if applicable. In the example below, three analyses are done:

  • my_analysis_1 runs report my_seq_lengths directly on dataset my_dataset

  • my_analysis_2 first encodes my_dataset using my_regex_matches before running report my_matches.

  • my_dim_red_analysis first encodes my_dataset with 3-mer frequencies (3mer_freq), performs dimensionality reduction using tSNE (my_dim_red_method), and then runs report dim_red_report.

definitions:
  datasets:
    # imported datasets
    my_dataset: # user-defined dataset name
      format: AIRR
      params:
        metadata_file: path/to/metadata.csv
        path: path/to/data/

  encodings:
    my_regex_matches:
      MatchedRegex:
        motif_filepath: path/to/regex_file.tsv
    3mer_freq: KmerFrequency

  ml_methods:
    my_dim_red_method: # user-defined method name
      TSNE:
        n_components: 2
        init: pca

  reports:
    my_seq_lengths: SequenceLengthDistribution # reports without parameters
    my_matches: Matches
    dim_red_report:
      DimensionalityReduction:
        labels: [disease_label] # a list of labels to be used for coloring the points in the plot (a separate plot will be made for each label)

instructions:
  my_instruction: # user-defined instruction name
    type: ExploratoryAnalysis
    analyses:
      my_analysis_1: # user-defined analysis name
        dataset: my_dataset
        reports: [my_seq_lengths]
      my_analysis_2:
        dataset: my_dataset
        encoding: my_regex_matches
        reports: [my_matches]
      my_dim_red_analysis:
        dataset: my_dataset
        encoding: 3mer_freq
        dim_reduction: my_dim_red_method
        reports: [dim_red_report]
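
The rules described above (each analysis names at least a dataset and a list of reports, and every name it references is declared under definitions) can be checked before starting a run. The following is a minimal sketch in plain Python and is not part of immuneML: the helper name check_analyses is hypothetical, the specification is assumed to be already loaded into a dictionary, and the mapping of dim_reduction to ml_methods simply follows the example above.

```python
# Hypothetical helper, not part of immuneML: a sanity check for the
# `analyses` section of an ExploratoryAnalysis instruction.

# Maps each analysis key to the definitions section it refers to,
# following the example specification above (an assumption of this sketch).
KEY_TO_SECTION = {
    "dataset": "datasets",
    "encoding": "encodings",
    "dim_reduction": "ml_methods",
}


def check_analyses(specs: dict, instruction_name: str) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    definitions = specs.get("definitions", {})
    analyses = specs["instructions"][instruction_name]["analyses"]
    problems = []
    for name, analysis in analyses.items():
        # Every analysis needs at least a dataset and a reports section.
        for required in ("dataset", "reports"):
            if required not in analysis:
                problems.append(f"{name}: missing '{required}'")
        # Single-valued references must be declared under definitions.
        for key, section in KEY_TO_SECTION.items():
            if key in analysis and analysis[key] not in definitions.get(section, {}):
                problems.append(f"{name}: unknown {key} '{analysis[key]}'")
        # Reports are given as a list of names.
        for report in analysis.get("reports", []):
            if report not in definitions.get("reports", {}):
                problems.append(f"{name}: unknown report '{report}'")
    return problems


# The example specification, reduced to the names that matter for the check.
specs = {
    "definitions": {
        "datasets": {"my_dataset": {"format": "AIRR"}},
        "encodings": {"my_regex_matches": {}, "3mer_freq": "KmerFrequency"},
        "ml_methods": {"my_dim_red_method": {"TSNE": {"n_components": 2}}},
        "reports": {
            "my_seq_lengths": "SequenceLengthDistribution",
            "my_matches": "Matches",
            "dim_red_report": {"DimensionalityReduction": {}},
        },
    },
    "instructions": {
        "my_instruction": {
            "type": "ExploratoryAnalysis",
            "analyses": {
                "my_analysis_1": {"dataset": "my_dataset", "reports": ["my_seq_lengths"]},
                "my_analysis_2": {
                    "dataset": "my_dataset",
                    "encoding": "my_regex_matches",
                    "reports": ["my_matches"],
                },
                "my_dim_red_analysis": {
                    "dataset": "my_dataset",
                    "encoding": "3mer_freq",
                    "dim_reduction": "my_dim_red_method",
                    "reports": ["dim_red_report"],
                },
            },
        }
    },
}

print(check_analyses(specs, "my_instruction"))  # → []
```

A check like this catches a mistyped report or encoding name early, since each analysis refers to its components only by their user-defined names.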

Note that for my_analysis_2, the file regex_file.tsv must be a tab-separated file, for example with the following contents:

id    TRB_regex
1     ACG
2     EDNA
3     DFWG
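
Such a file can be written by hand or generated. As a minimal sketch, the example contents can be produced with Python's standard csv module; the helper name write_regex_file is hypothetical, and the output path regex_file.tsv in the working directory is an assumption that should be adjusted to match motif_filepath in the specification.

```python
import csv


def write_regex_file(path, rows):
    """Write (id, TRB_regex) pairs as a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["id", "TRB_regex"])  # column names from the example
        writer.writerows(rows)


# Rows taken from the example contents; the output location is an assumption.
write_regex_file("regex_file.tsv", [(1, "ACG"), (2, "EDNA"), (3, "DFWG")])
```

Writing the file through the csv module with an explicit tab delimiter avoids accidentally saving it with spaces or commas between the columns.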