Manuscript use case 2: Limitations of conventional encoding schemes for repertoire-level binary classification when immune signals co-occur within the same AIR
==========================================================================================================================================================================
Previous studies indicate that k-mer encoding captures the information necessary for classifying immune repertoires. However, since an immune signal may consist of multiple motifs co-occurring within the same receptor, treating k-mers in isolation can be inadequate for representing different immune states. Using LIgO, we simulated datasets with such complex, co-occurring immune signals. Machine learning (ML) on these datasets highlighted the limitations of ML methods that overlook motif co-occurrence. In particular, the traditional k-mer encoding failed to distinguish between the different immune states.
In this use case, we demonstrate how LIgO can be used to encode complex immune signals and how such simulations can aid the development of ML models suited to classifying immune repertoires. The LIgO configuration is shown below. A more detailed description of use case 2 can be found in the LIgO manuscript.
Although we generated many datasets to gauge the uncertainty in ML model performance, for demonstration purposes we show the configuration for only one dataset. The comments in the configuration explain the parameters where needed.
Simulation configuration
------------------------
In the configuration below, motif1 (GDT) occurs at an average frequency of 1% in the repertoires of both the positive (AIRR1) and negative (AIRR2) classes. Similarly, motif2 (SGL) occurs at an average frequency of 1% in both classes. However, the positive-class repertoires have a higher co-occurrence rate of motif1 and motif2 (`signal1__signal2`) than the negative-class repertoires: the co-occurrence rate is 0.1% in the positive class and 0.01% in the negative class.
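To make these witness rates concrete, the following back-of-the-envelope calculation (a minimal Python sketch, not part of the LIgO workflow) translates them into expected numbers of signal-carrying receptors per repertoire of 100,000 receptors:

.. code-block:: python

    # Expected numbers of signal-carrying receptors per repertoire,
    # derived from the witness rates used in the configuration below.
    receptors_per_repertoire = 100_000

    witness_rates = {
        "positive (AIRR1)": {"signal1__signal2": 0.001, "signal1": 0.009, "signal2": 0.009},
        "negative (AIRR2)": {"signal1__signal2": 0.0001, "signal1": 0.0099, "signal2": 0.0099},
    }

    for label, rates in witness_rates.items():
        expected = {signal: rate * receptors_per_repertoire for signal, rate in rates.items()}
        # motif1 occurs both in isolation (signal1) and together with motif2 (signal1__signal2)
        total_motif1 = expected["signal1"] + expected["signal1__signal2"]
        print(label, expected, f"total receptors with motif1: {total_motif1:.0f}")

Both classes are thus expected to contain about 1,000 receptors carrying each motif, but the positive class contains about 100 receptors in which the two motifs co-occur, versus about 10 in the negative class.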
.. image:: ../_static/figures/usecase2_signals.png
   :width: 800

.. code-block:: yaml

    definitions:
      motifs:
        motif1:
          seed: GDT # motif-1
        motif2:
          seed: SGL # motif-2
      signals:
        signal1:
          motifs:
          - motif1
        signal2:
          motifs:
          - motif2
      simulations:
        sim1:
          is_repertoire: true # indicating that we are simulating repertoires
          paired: false # indicating that we are not simulating paired repertoires
          sequence_type: amino_acid
          simulation_strategy: RejectionSampling # we are using rejection sampling to obtain sequences that contain motifs
          sim_items:
            AIRR1: # positive class of repertoires generated using the following generative model
              generative_model:
                chain: beta
                default_model_name: humanTRB
                model_path: null
                type: OLGA
              is_noise: false
              number_of_examples: 100 # number of repertoires to be generated
              receptors_in_repertoire_count: 100000 # number of receptors in each repertoire
              signals:
                signal1__signal2: 0.001 # signal1 and signal2 co-occur in the same receptor at this witness rate
                signal1: 0.009 # signal1 occurs in isolation in the repertoires at this witness rate
                signal2: 0.009 # signal2 occurs in isolation in the repertoires at this witness rate
            AIRR2: # negative class of repertoires generated using the following generative model
              generative_model:
                chain: beta
                default_model_name: humanTRB
                model_path: null
                type: OLGA
              is_noise: true # indicating that these repertoires receive signal at some lower witness rate and thus are not representative of the positive class
              number_of_examples: 100 # number of repertoires to be generated
              receptors_in_repertoire_count: 100000 # number of receptors in each repertoire
              signals:
                signal1__signal2: 0.0001 # signal1 and signal2 co-occur in the same receptor at this witness rate
                signal1: 0.0099 # signal1 occurs in isolation in the repertoires at this witness rate
                signal2: 0.0099 # signal2 occurs in isolation in the repertoires at this witness rate
    instructions:
      inst1:
        export_p_gens: false # computing generation probabilities (with OLGA) could take some time
        max_iterations: 10000 # maximum number of iterations for the rejection sampling, where in each iteration `sequence_batch_size` sequences are generated and searched for signal sequences
        number_of_processes: 32 # number of processes to be used for parallelization
        sequence_batch_size: 100000 # number of sequences to be generated in each iteration
        simulation: sim1
        store_signal_in_receptors: true
        type: LigoSim
    output:
      format: HTML
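
After running LIgO with this specification and exporting the dataset in AIRR format, the simulated co-occurrence can be inspected directly. The sketch below is not part of LIgO; it assumes AIRR-formatted repertoire TSV files with a `junction_aa` column, and the file name used here is a placeholder:

.. code-block:: python

    # Minimal sketch: count how often the two motif seeds co-occur within the
    # same receptor in one simulated repertoire file.
    import pandas as pd

    # Placeholder path; point this at one of the exported repertoire files.
    repertoire = pd.read_csv("path_to/exported_dataset/airr/repertoires/some_repertoire.tsv", sep="\t")

    has_motif1 = repertoire["junction_aa"].str.contains("GDT", na=False)
    has_motif2 = repertoire["junction_aa"].str.contains("SGL", na=False)

    print("receptors with motif1 only:", (has_motif1 & ~has_motif2).sum())
    print("receptors with motif2 only:", (has_motif2 & ~has_motif1).sum())
    print("receptors with both motifs:", (has_motif1 & has_motif2).sum())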
ML configuration
-----------------
The L1-penalized logistic regression model on k-mer encoded data is trained using immuneML. The configuration for the ML model is shown below (as an example for one dataset). The immuneML parameter specification is extensively documented in the `immuneML documentation `_.
.. image:: ../_static/figures/usecase2_LR.png
   :width: 600

.. code-block:: yaml

    definitions:
      datasets:
        cooccurring_motifs_dataset:
          format: AIRR
          params:
            path: path_to/exported_dataset/airr/repertoires
            metadata_file: path_to/exported_dataset/airr/repertoires/metadata.csv
      encodings:
        feature_size_3:
          KmerFrequency:
            k: 3
            scale_to_unit_variance: True
            scale_to_zero_mean: True
      ml_methods:
        logistic_regression:
          LogisticRegression:
            penalty: l1
            C: [0.1, 0.05, 0.01, 0.001]
            max_iter: 500
          model_selection_cv: True
          model_selection_n_folds: 5
      reports:
        my_hp_benchmark: MLSettingsPerformance
        my_coef_report:
          Coefficients:
            coefs_to_plot:
            - n_largest
            n_largest: [15]
        my_motif_report:
          MotifSeedRecovery:
            implanted_motifs_per_label:
              signal_cancer:
                seeds: [GDT, SGL]
                hamming_distance: False
                gap_sizes: [0]
    instructions:
      hpoptim_instr:
        assessment:
          reports:
            models: [my_coef_report, my_motif_report]
          split_count: 5
          split_strategy: k_fold
          training_percentage: 0.7
        number_of_processes: 32
        dataset: cooccurring_motifs_dataset
        labels:
        - sim_item
        metrics:
        - auc
        - recall
        optimization_metric: balanced_accuracy
        reports: [my_hp_benchmark]
        selection:
          split_count: 1
          split_strategy: random
          training_percentage: 1.0
        settings:
        - encoding: feature_size_3
          ml_method: logistic_regression
        strategy: GridSearch
        type: TrainMLModel
        refit_optimal_model: False
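
To illustrate why this encoding struggles on such data: repertoire-level k-mer frequencies record how often each k-mer occurs across the repertoire, but not whether two k-mers occur within the same receptor. The sketch below (plain Python with hypothetical CDR3 sequences, not immuneML internals) shows two toy repertoires with identical GDT and SGL frequencies despite different co-occurrence structure:

.. code-block:: python

    # Conceptual sketch: repertoire-level 3-mer frequency encoding discards
    # whether motifs co-occur within the same receptor.
    from collections import Counter

    def kmer_frequencies(repertoire, k=3):
        """Count all overlapping k-mers across the repertoire and normalize."""
        counts = Counter()
        for sequence in repertoire:
            counts.update(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        total = sum(counts.values())
        return {kmer: n / total for kmer, n in counts.items()}

    # Hypothetical CDR3 amino acid sequences, chosen only for illustration.
    rep_cooccurring = ["CASGDTSGLEQFF", "CASSIRSSYEQYF"]  # GDT and SGL in the same receptor
    rep_separate = ["CASGDTIRYEQFF", "CASSGLSSYEQYF"]     # GDT and SGL in different receptors

    for name, rep in [("co-occurring", rep_cooccurring), ("separate", rep_separate)]:
        freqs = kmer_frequencies(rep)
        print(name, "GDT:", freqs["GDT"], "SGL:", freqs["SGL"])
    # Both repertoires yield identical GDT and SGL frequencies, so a classifier
    # trained on these features (such as the L1-penalized logistic regression
    # above) cannot tell them apart.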
The custom multiple instance learning (MIL) model, `implemented here `_ as a command-line tool, is trained with the following configuration:
.. image:: ../_static/figures/usecase2_MIL.png
   :width: 600

.. code-block:: yaml

    concatenated_receptors_file: path_to/exported_dataset/airr/concatenated_flat_receptor_file.tsv
    label_field: "epitope"
    k: 3
    num_processes: 50
    group_field: "repertoire_id"
    n_splits: 5
    pval_threshold: 0.001
    output_path: path/to/output/
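
The exact behaviour of the MIL tool is defined in its own repository; the sketch below only illustrates the underlying multiple-instance idea under stated assumptions. Each repertoire is treated as a bag of receptors grouped by `repertoire_id`, and a bag-level feature records how often a pair of k-mers co-occurs within the same receptor. The `junction_aa` column name is assumed from the AIRR schema, and the motif pair (GDT, SGL) is used only for illustration:

.. code-block:: python

    # Minimal sketch of the multiple-instance idea (not the actual MIL tool):
    # compute a repertoire-level (bag-level) feature that preserves
    # within-receptor k-mer co-occurrence.
    import pandas as pd

    def kmers(sequence, k=3):
        return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

    def cooccurrence_rate(receptors, kmer_pair, k=3):
        """Fraction of receptors in a repertoire containing both k-mers of the pair."""
        hits = sum(1 for s in receptors if set(kmer_pair) <= kmers(s, k))
        return hits / len(receptors)

    receptors = pd.read_csv(
        "path_to/exported_dataset/airr/concatenated_flat_receptor_file.tsv", sep="\t"
    ).dropna(subset=["junction_aa"])

    per_repertoire = receptors.groupby("repertoire_id")["junction_aa"].apply(list)

    # Bag-level feature for one candidate k-mer pair (here the simulated motif seeds):
    features = per_repertoire.apply(lambda rep: cooccurrence_rate(rep, ("GDT", "SGL")))
    print(features.head())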