How to combine multiple encodings to represent a dataset
============================================================

Sometimes it is useful to combine multiple encodings to represent a dataset, e.g., k-mer frequencies with V or J gene
frequencies, or k-mer frequencies with certain metadata fields to control for differences such as HLA type.
immuneML supports combining multiple encodings through the :ref:`Composite` encoder, and this tutorial illustrates
how to use it.

To illustrate this usage, we will:

- create a random repertoire dataset and assign some random metadata values to each repertoire
- create a composite encoding that combines k-mer frequencies with the metadata values
- combine the composite encoding with a logistic regression classifier to create a simple ML pipeline

Additionally, we will illustrate how to use :ref:`LogRegressionCustomPenalty` to assign different penalties to the
different parts of the composite encoding. This is useful when the parts of the encoding differ in their
number of features or have very different value ranges.

To create a random dataset with randomly assigned HLA metadata values, we can use the following YAML specification:

.. code-block:: yaml

    datasets:
      dataset:
        format: RandomRepertoireDataset
        params:
          repertoire_count: 100 # number of repertoires to generate
          sequence_count_probabilities:
            10: 0.5 # probability that any repertoire would have 10 receptor sequences
            20: 0.5 # probability that any repertoire would have 20 receptor sequences
          sequence_length_probabilities:
            10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
            12: 0.5 # probability that any sequence in a repertoire is 12 a.a. long
          labels: # labels which can be used for machine learning
            disease: # the name of the label corresponding to an immune event
              True: 0.5 # probability that a repertoire is positive w.r.t. the label
              False: 0.5 # probability that a repertoire is negative w.r.t. the label
            hla:
              A1: 0.5
              A2: 0.5

To create a composite encoder that combines k-mer frequencies with the HLA metadata values, we can use the following YAML
specification:

.. code-block:: yaml

    encodings:
      kmer_freq_hla:
        Composite:
          encoders:
          - KmerFrequency:
              k: 3
          - Metadata:
              metadata_fields: [hla]

The resulting feature vector will contain the k-mer frequencies followed by the one-hot encoded HLA metadata values.
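
The layout of the composite feature vector can be sketched in plain Python. This is an illustration of the idea only,
not the code immuneML runs: the function names are hypothetical, and the sketch ignores any normalization or scaling
that the real encoders may apply.

.. code-block:: python

    from collections import Counter


    def kmer_frequencies(sequences, k=3):
        """Relative frequency of each overlapping k-mer over all sequences."""
        counts = Counter(
            seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
        )
        total = sum(counts.values())
        if total == 0:  # all sequences are shorter than k
            return {}
        return {kmer: count / total for kmer, count in counts.items()}


    def one_hot(value, categories):
        """One-hot encode a single categorical value."""
        return [1.0 if value == category else 0.0 for category in categories]


    def composite_vector(sequences, hla, kmer_vocabulary, hla_categories, k=3):
        """Concatenate the k-mer part and the one-hot metadata part."""
        freqs = kmer_frequencies(sequences, k)
        kmer_part = [freqs.get(kmer, 0.0) for kmer in kmer_vocabulary]
        return kmer_part + one_hot(hla, hla_categories)


    # Toy repertoire with two sequences and HLA value A2.
    repertoire = ["CASSL", "CASSF"]
    vocabulary = ["ASS", "CAS", "SSF", "SSL"]
    vector = composite_vector(repertoire, "A2", vocabulary, ["A1", "A2"])
    print(vector)

In this toy example the first four entries are k-mer frequencies and the last two are the one-hot encoded HLA value, so
a downstream model can tell the two feature groups apart by position.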

To create a logistic regression classifier that assigns different penalties to the k-mer frequencies and the HLA metadata
values, we can use the following YAML specification:

.. code-block:: yaml

  ml_methods:
    log_reg_custom_penalty:
      LogRegressionCustomPenalty:
        alpha: 1
        n_lambda: 100
        non_penalized_encodings: ['Metadata']
        random_state: 42

In this case, we do not penalize the coefficients corresponding to the HLA metadata values because they will be
few and one-hot encoded, while the k-mer frequencies will be many and continuous-valued.
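
One common way to leave some features unpenalized is a per-feature penalty factor, as in glmnet-style elastic net.
The sketch below assumes that ``LogRegressionCustomPenalty`` behaves along these lines and that ``alpha`` plays its
usual elastic-net role (``alpha: 1`` being a pure L1 penalty); neither assumption is confirmed by this tutorial, and
the function names are hypothetical.

.. code-block:: python

    def penalty_factors(feature_encoders, non_penalized_encodings):
        """Return 0.0 for features from non-penalized encoders, 1.0 otherwise."""
        return [
            0.0 if encoder in non_penalized_encodings else 1.0
            for encoder in feature_encoders
        ]


    def elastic_net_penalty(coefs, factors, lam, alpha=1.0):
        """Elastic-net penalty with per-feature factors (assumed glmnet-style form)."""
        return lam * sum(
            factor * (alpha * abs(coef) + (1 - alpha) * 0.5 * coef ** 2)
            for coef, factor in zip(coefs, factors)
        )


    # Four k-mer features followed by two one-hot HLA features.
    feature_encoders = ["KmerFrequency"] * 4 + ["Metadata"] * 2
    factors = penalty_factors(feature_encoders, ["Metadata"])
    print(factors)

    # The large HLA coefficients (2.0 and -2.0) do not contribute to the penalty.
    coefs = [0.5, -0.25, 0.0, 0.0, 2.0, -2.0]
    penalty = elastic_net_penalty(coefs, factors, lam=0.1, alpha=1.0)
    print(penalty)

Under these assumptions, only the k-mer coefficients are shrunk towards zero, while the HLA coefficients are fitted
freely.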

Here is the complete YAML specification that combines all of the above to create a simple ML pipeline:

.. code-block:: yaml

  definitions:
    datasets:
      dataset:
        format: RandomRepertoireDataset
        params:
          repertoire_count: 100 # number of repertoires to generate
          sequence_count_probabilities:
            10: 0.5 # probability that any repertoire would have 10 receptor sequences
            20: 0.5 # probability that any repertoire would have 20 receptor sequences
          sequence_length_probabilities:
            10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
            12: 0.5 # probability that any sequence in a repertoire is 12 a.a. long
          labels: # labels which can be used for machine learning
            disease: # the name of the label corresponding to an immune event
              True: 0.5 # probability that a repertoire is positive w.r.t. the label
              False: 0.5 # probability that a repertoire is negative w.r.t. the label
            hla:
              A1: 0.5
              A2: 0.5

    encodings:
      kmer_freq_hla:
        Composite:
          encoders:
          - KmerFrequency:
              k: 3
          - Metadata:
              metadata_fields: [hla]

    ml_methods:
      log_reg_custom_penalty:
        LogRegressionCustomPenalty:
          alpha: 1
          n_lambda: 100
          non_penalized_encodings: ['Metadata']
          random_state: 42

    reports:
      coefficients:
        Coefficients:
          coefs_to_plot:
            - all
            - nonzero
            - n_largest
          n_largest:
            - 30
      performance_per_hla:
        PerformancePerLabel:
          alternative_label: hla
          metric: balanced_accuracy
          compute_for_selection: false
          compute_for_assessment: true

  instructions:
    train_classifier:
      type: TrainMLModel

      dataset: dataset
      labels: [disease]

      settings:
        - encoding: kmer_freq_hla
          ml_method: log_reg_custom_penalty

      assessment:
        split_strategy: random
        split_count: 1
        reports:
          models:                # plot the coefficients of the trained models
          - coefficients

      selection:
        split_strategy: k_fold
        split_count: 3

      optimization_metric: balanced_accuracy # the metric used for optimization
      metrics: # other metrics to compute
      - precision
      - recall
      - auc

      reports:
      - performance_per_hla

To run this example, save the above YAML specification to a file named ``composite_encoding_example.yaml``. We assume you
have installed immuneML (in a virtual environment) and can run it from the console. If you haven't set it up yet,
see :ref:`Install immuneML with a package manager`.
Once immuneML is installed and the environment is activated, run the following command in your terminal:

.. code-block:: console

    immune-ml composite_encoding_example.yaml composite_encoding_example_output

The results can be explored from ``composite_encoding_example_output/index.html``.
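
If you prefer to start the analysis from a Python script, the same console command can be wrapped with the standard
library. This is only a convenience sketch: ``build_command`` and ``run_immuneml`` are hypothetical helpers, and the
sketch assumes the ``immune-ml`` executable is on the ``PATH`` of the active environment.

.. code-block:: python

    import shutil
    import subprocess
    from pathlib import Path


    def build_command(specification, output_dir):
        """Build the same command line as shown in the console example."""
        return ["immune-ml", str(specification), str(output_dir)]


    def run_immuneml(specification, output_dir):
        """Run immuneML and return the path where the HTML overview is expected."""
        if shutil.which("immune-ml") is None:
            raise RuntimeError("immune-ml not found on PATH; activate the environment first")
        # check=True raises CalledProcessError if immuneML exits with an error.
        subprocess.run(build_command(specification, output_dir), check=True)
        return Path(output_dir) / "index.html"

Calling ``run_immuneml("composite_encoding_example.yaml", "composite_encoding_example_output")`` would then run the
example and return the location of ``index.html``.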