How to combine multiple encodings to represent a dataset¶
Sometimes it might be of interest to combine multiple encodings to represent a dataset, e.g., by combining k-mer frequencies with V or J gene frequencies, or k-mer frequencies with certain metadata fields to try to control for differences, e.g., in HLA types. immuneML support combining multiple encodings by using the Composite encoder and this tutorial illustrates how to do this.
To illustrate this usage, we will:
create a random repertoire dataset and assign some random metadata values to each repertoire
create a composite encoding that combines k-mer frequencies with the metadata values
combine the composite encoding with a logistic regression classifier to create a simple ML pipeline
Additionally, we will illustrate how to use LogRegressionCustomPenalty to assign different penalties to the different parts of the composite encoding. This might be of interest when the different parts of the encoding have different number of features or larger differences in value ranges.
To create a random dataset with randomly assigned HLA metadata values, we can use the following YAML specification:
datasets:
dataset:
format: RandomRepertoireDataset
params:
repertoire_count: 100 # number of repertoires to generate
sequence_count_probabilities:
10: 0.5 # probability that any repertoire would have 10 receptor sequences
20: 0.5 # probability that any repertoire would have 20 receptor sequences
sequence_length_probabilities:
10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
12: 0.5 # probability that any sequences in a repertoire is 12 a.a. long
labels: # labels which can be used for machine learning
disease: # the name of the label corresponding to an immune event
True: 0.5 # probability that a repertoire is positive w.r.t. the label
False: 0.5 # probability that a repertoire is negative w.r.t. the label
hla:
A1: 0.5
A2: 0.5
To create a composite encoder that combines k-mer frequencies with the HLA metadata values, we can use the following YAML specification:
encodings:
kmer_freq_hla:
Composite:
encoders:
- KmerFrequency:
k: 3
- Metadata:
metadata_fields: [hla]
The resulting feature vector will contain the k-mer frequencies followed by the one-hot encoded HLA metadata values.
To create a logistic regression classifier that assigns different penalties to the k-mer frequencies and the HLA metadata values, we can use the following YAML specification:
ml_methods:
log_reg_custom_penalty:
LogRegressionCustomPenalty:
alpha: 1
n_lambda: 100
non_penalized_encodings: ['Metadata']
random_state: 42
In this case, we do not penalize the coefficients corresponding to the HLA metadata values because they will be few and one-hot encoded, while the k-mer frequencies will be many and continuous-valued.
Here is the complete YAML specification that combines all of the above to create a simple ML pipeline:
definitions:
datasets:
dataset:
format: RandomRepertoireDataset
params:
repertoire_count: 100 # number of repertoires to generate
sequence_count_probabilities:
10: 0.5 # probability that any repertoire would have 10 receptor sequences
20: 0.5 # probability that any repertoire would have 20 receptor sequences
sequence_length_probabilities:
10: 0.5 # probability that any sequence in a repertoire is 10 a.a. long
12: 0.5 # probability that any sequences in a repertoire is 12 a.a. long
labels: # labels which can be used for machine learning
disease: # the name of the label corresponding to an immune event
True: 0.5 # probability that a repertoire is positive w.r.t. the label
False: 0.5 # probability that a repertoire is negative w.r.t. the label
hla:
A1: 0.5
A2: 0.5
encodings:
kmer_freq_hla:
Composite:
encoders:
- KmerFrequency:
k: 3
- Metadata:
metadata_fields: [hla]
ml_methods:
log_reg_custom_penalty:
LogRegressionCustomPenalty:
alpha: 1
n_lambda: 100
non_penalized_encodings: ['Metadata']
random_state: 42
reports:
coefficients:
Coefficients:
coefs_to_plot:
- all
- nonzero
- n_largest
n_largest:
- 30
performance_per_hla:
PerformancePerLabel:
alternative_label: hla
metric: balanced_accuracy
compute_for_selection: false
compute_for_assessment: true
instructions:
train_classifier:
type: TrainMLModel
dataset: dataset
labels: [disease]
settings:
- encoding: kmer_freq_hla
ml_method: log_reg_custom_penalty
assessment:
split_strategy: random
split_count: 1
reports:
models: # plot the coefficients of the trained models
- coefficients
selection:
split_strategy: k_fold
split_count: 3
optimization_metric: balanced_accuracy # the metric used for optimization
metrics: # other metrics to compute
- precision
- recall
- auc
reports:
- performance_per_hla
To run this example, save the above YAML specification to a file named composite_encoding_example.yaml. We assume you
have installed immuneML (in a virtual environment) and can run from the console. If you haven’t set it up yet,
see Install immuneML with a package manager.
When you have immuneML installed and environment activated, run the following command in your terminal:
immune-ml composite_encoding_example.yaml composite_encoding_example_output
The results can be explored from composite_encoding_example_output/index.html.