How to combine multiple encodings to represent a dataset

It can be useful to combine multiple encodings to represent a dataset, for example k-mer frequencies with V or J gene frequencies, or k-mer frequencies with certain metadata fields to try to control for differences, e.g., in HLA types. immuneML supports combining multiple encodings through the Composite encoder, and this tutorial illustrates how to do this.

To illustrate this usage, we will:

  • create a random repertoire dataset and assign some random metadata values to each repertoire

  • create a composite encoding that combines k-mer frequencies with the metadata values

  • combine the composite encoding with a logistic regression classifier to create a simple ML pipeline

Additionally, we will illustrate how to use LogRegressionCustomPenalty to assign different penalties to the different parts of the composite encoding. This might be of interest when the parts of the encoding have different numbers of features or large differences in value ranges.

To create a random dataset with randomly assigned HLA metadata values, we can use the following YAML specification:

datasets:
  dataset:
    format: RandomRepertoireDataset
    params:
      repertoire_count: 100 # number of repertoires to generate
      sequence_count_probabilities:
        10: 0.5 # probability that any repertoire would have 10 receptor sequences
        20: 0.5 # probability that any repertoire would have 20 receptor sequences
      sequence_length_probabilities:
        10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
        12: 0.5 # probability that any sequence in a repertoire is 12 a.a. long
      labels: # labels which can be used for machine learning
        disease: # the name of the label corresponding to an immune event
          True: 0.5 # probability that a repertoire is positive w.r.t. the label
          False: 0.5 # probability that a repertoire is negative w.r.t. the label
        hla:
          A1: 0.5
          A2: 0.5
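
To make the meaning of these probabilities concrete, the following is a minimal Python sketch of how such a specification could be sampled. It is not immuneML's implementation: the function name sample_repertoire_spec is hypothetical, and the sketch assumes that the sequence count, each sequence length and each label value are drawn independently from the listed probabilities.

```python
import random


def sample_repertoire_spec(rng, sequence_count_probabilities,
                           sequence_length_probabilities, labels):
    """Draw the size, sequence lengths and label values of one random repertoire.

    Hypothetical helper for illustration only; every value is drawn
    independently from the probabilities given in the specification.
    """
    def draw(probabilities):
        values = list(probabilities.keys())
        weights = list(probabilities.values())
        return rng.choices(values, weights=weights, k=1)[0]

    sequence_count = draw(sequence_count_probabilities)
    sequence_lengths = [draw(sequence_length_probabilities)
                        for _ in range(sequence_count)]
    label_values = {name: draw(probs) for name, probs in labels.items()}
    return {"sequence_count": sequence_count,
            "sequence_lengths": sequence_lengths,
            "labels": label_values}


rng = random.Random(42)  # fixed seed so the sketch is reproducible
specs = [
    sample_repertoire_spec(
        rng,
        sequence_count_probabilities={10: 0.5, 20: 0.5},
        sequence_length_probabilities={10: 0.5, 12: 0.5},
        labels={"disease": {True: 0.5, False: 0.5},
                "hla": {"A1": 0.5, "A2": 0.5}},
    )
    for _ in range(100)  # repertoire_count: 100
]
print(specs[0]["sequence_count"], specs[0]["labels"])
```

Under this assumption the disease label and the HLA value are unrelated to each other and to the sequences, so a classifier trained on this dataset is not expected to find any real signal.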

To create a composite encoder that combines k-mer frequencies with the HLA metadata values, we can use the following YAML specification:

encodings:
  kmer_freq_hla:
    Composite:
      encoders:
      - KmerFrequency:
          k: 3
      - Metadata:
          metadata_fields: [hla]

The resulting feature vector will contain the k-mer frequencies followed by the one-hot encoded HLA metadata values.
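
The layout of such a feature vector can be sketched in plain Python as follows. This is an illustration of the idea rather than immuneML's code: the helper names are hypothetical, the k-mer part uses simple relative frequencies (the KmerFrequency encoder has its own normalization options, which are not reproduced here), and the k-mer vocabulary is assumed to be the sorted set of k-mers seen in the toy sequences.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Relative frequency of each overlapping k-mer across the given sequences."""
    counts = Counter(seq[i:i + k]
                     for seq in sequences
                     for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def one_hot(value, categories):
    """One-hot encode a single categorical value."""
    return [1.0 if value == category else 0.0 for category in categories]


def composite_vector(sequences, hla, kmer_vocabulary, hla_categories, k=3):
    """Concatenate the k-mer frequencies with the one-hot encoded HLA value."""
    freqs = kmer_frequencies(sequences, k)
    kmer_part = [freqs.get(kmer, 0.0) for kmer in kmer_vocabulary]
    return kmer_part + one_hot(hla, hla_categories)


# Toy repertoire with two made-up sequences and HLA type A1
sequences = ["CASSL", "CASSE"]
kmer_vocabulary = sorted(kmer_frequencies(sequences))  # ['ASS', 'CAS', 'SSE', 'SSL']
hla_categories = ["A1", "A2"]

vector = composite_vector(sequences, "A1", kmer_vocabulary, hla_categories)
feature_names = kmer_vocabulary + [f"hla_{c}" for c in hla_categories]
print(dict(zip(feature_names, vector)))
```

The first four entries of the vector are the k-mer frequencies and the last two are the one-hot encoded HLA value, which is the ordering described above.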

To create a logistic regression classifier that assigns different penalties to the k-mer frequencies and the HLA metadata values, we can use the following YAML specification:

ml_methods:
  log_reg_custom_penalty:
    LogRegressionCustomPenalty:
      alpha: 1
      n_lambda: 100
      non_penalized_encodings: ['Metadata']
      random_state: 42

In this case, we do not penalize the coefficients corresponding to the HLA metadata values: there are only a few of them and they are one-hot encoded, whereas the k-mer frequencies are numerous and continuous-valued.
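
The effect of leaving one part of the encoding unpenalized can be sketched with per-feature penalty factors. The code below is a simplified illustration and not immuneML's implementation: it assumes a glmnet-style elastic-net penalty, where alpha = 1 corresponds to a pure L1 (lasso) penalty, and it assumes that each feature can be traced back to the encoder that produced it. The function names and the coefficient values are made up for the example.

```python
def penalty_factors(feature_encodings, non_penalized_encodings):
    """Return 0.0 for features from non-penalized encoders and 1.0 otherwise."""
    return [0.0 if encoding in non_penalized_encodings else 1.0
            for encoding in feature_encodings]


def elastic_net_penalty(coefficients, factors, lam, alpha=1.0):
    """Penalty term: lam * sum_j factor_j * (alpha*|b_j| + (1 - alpha)/2 * b_j^2)."""
    return lam * sum(
        factor * (alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b)
        for b, factor in zip(coefficients, factors)
    )


# Four k-mer features followed by two one-hot HLA features (assumed layout)
feature_encodings = ["KmerFrequency"] * 4 + ["Metadata"] * 2
factors = penalty_factors(feature_encodings, non_penalized_encodings=["Metadata"])

# Made-up coefficients: the large HLA coefficients do not add to the penalty
coefficients = [0.5, -0.25, 0.0, 1.0, 2.0, -3.0]
penalty = elastic_net_penalty(coefficients, factors, lam=0.1, alpha=1.0)
print(factors)            # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(round(penalty, 6))  # 0.175
```

Because the factors for the Metadata features are zero, their coefficients are free to take any value without increasing the penalty, while the k-mer coefficients are still shrunk towards zero.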

Here is the complete YAML specification that combines all of the above to create a simple ML pipeline:

definitions:
  datasets:
    dataset:
      format: RandomRepertoireDataset
      params:
        repertoire_count: 100 # number of repertoires to generate
        sequence_count_probabilities:
          10: 0.5 # probability that any repertoire would have 10 receptor sequences
          20: 0.5 # probability that any repertoire would have 20 receptor sequences
        sequence_length_probabilities:
          10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
          12: 0.5 # probability that any sequence in a repertoire is 12 a.a. long
        labels: # labels which can be used for machine learning
          disease: # the name of the label corresponding to an immune event
            True: 0.5 # probability that a repertoire is positive w.r.t. the label
            False: 0.5 # probability that a repertoire is negative w.r.t. the label
          hla:
            A1: 0.5
            A2: 0.5

  encodings:
    kmer_freq_hla:
      Composite:
        encoders:
        - KmerFrequency:
            k: 3
        - Metadata:
            metadata_fields: [hla]

  ml_methods:
    log_reg_custom_penalty:
      LogRegressionCustomPenalty:
        alpha: 1
        n_lambda: 100
        non_penalized_encodings: ['Metadata']
        random_state: 42

  reports:
    coefficients:
      Coefficients:
        coefs_to_plot:
          - all
          - nonzero
          - n_largest
        n_largest:
          - 30
    performance_per_hla:
      PerformancePerLabel:
        alternative_label: hla
        metric: balanced_accuracy
        compute_for_selection: false
        compute_for_assessment: true

instructions:
  train_classifier:
    type: TrainMLModel

    dataset: dataset
    labels: [disease]

    settings:
      - encoding: kmer_freq_hla
        ml_method: log_reg_custom_penalty

    assessment:
      split_strategy: random
      split_count: 1
      reports:
        models:                # plot the coefficients of the trained models
        - coefficients

    selection:
      split_strategy: k_fold
      split_count: 3

    optimization_metric: balanced_accuracy # the metric used for optimization
    metrics: # other metrics to compute
    - precision
    - recall
    - auc

    reports:
    - performance_per_hla

To run this example, save the above YAML specification to a file named composite_encoding_example.yaml. We assume you have installed immuneML (in a virtual environment) and can run it from the console. If you haven’t set it up yet, see Install immuneML with a package manager. With immuneML installed and the environment activated, run the following command in your terminal:

immune-ml composite_encoding_example.yaml composite_encoding_example_output

The results can be explored from composite_encoding_example_output/index.html.