Manuscript use case 1: Out-of-distribution receptor-level simulation using LIgO
=================================================================================

The composition of train and test data can impact the quality and accuracy of AIRR-ML models. In this use case, we showed how the choice of train-test split may decrease predictive performance in out-of-distribution settings. With LIgO, we simulated a scenario where an immune signal is defined as one motif, but each individual carries a slightly different modification of this motif. We trained and evaluated a logistic regression (LR) model for receptor-level classification using two different train-test strategies. The first strategy involved a random split of all AIRs from all individuals into train and test sets, while the second strategy used a leave-one-individual-out approach, placing all AIRs from one individual into the test set and the AIRs from the remaining individuals into the training set. Our findings revealed that the LR model trained with the random train-test strategy achieved higher balanced accuracy than the LR model trained with the leave-one-individual-out approach.

In this tutorial, we give an example of a simulation configuration for a single dataset, along with detailed explanations of the parameters in the comments where needed. A detailed description of use case 1 can be found in the LIgO manuscript.

Simulation configuration
------------------------

In this use case, we considered the immune signal to be a motif AA-A with four variations (AAAA, AANA, AACA, and AAGA), reflecting AIRs from four different individuals. Specifically, the configuration below describes the simulation of a dataset consisting of:

- AIRR1: 5000 IGHs containing AAAA + 5000 IGHs without any of the four signal 4-mers;
- AIRR2: 5000 IGHs containing AACA + 5000 IGHs without any of the four signal 4-mers;
- AIRR3: 5000 IGHs containing AAGA + 5000 IGHs without any of the four signal 4-mers;
- AIRR4: 5000 IGHs containing AANA + 5000 IGHs without any of the four signal 4-mers.

.. image:: ../_static/figures/usecase1_signals.png
   :width: 500
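To make the signal definition concrete, the short sketch below (a hypothetical helper written for this tutorial, not part of LIgO, which performs this matching internally during simulation) checks which of the four signal 4-mers occur in an amino acid sequence:

.. code-block:: python

    SIGNAL_KMERS = ("AAAA", "AACA", "AAGA", "AANA")

    def contained_signals(sequence: str) -> list:
        """Return the signal 4-mers found in an amino acid sequence."""
        return [kmer for kmer in SIGNAL_KMERS if kmer in sequence]

    print(contained_signals("CARAAAAFDYW"))  # ['AAAA'] -> signal-specific AIR
    print(contained_signals("CARDGYFDYW"))   # []       -> background AIR

The full LIgO configuration for the four repertoires is given below.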
.. code-block:: yaml

    definitions:
      motifs:
        motif1:
          seed: AAAA
        motif2:
          seed: AACA
        motif3:
          seed: AAGA
        motif4:
          seed: AANA
      signals:
        signal1:
          motifs:
          - motif1
        signal2:
          motifs:
          - motif2
        signal3:
          motifs:
          - motif3
        signal4:
          motifs:
          - motif4
      simulations:
        sim1:
          is_repertoire: true
          paired: false
          sequence_type: amino_acid
          simulation_strategy: Implanting # use implanting to simulate signal-specific AIRs
          remove_seqs_with_signals: true # remove all 4 signals from the background receptors
          keep_p_gen_dist: false
          sim_items:
            airr1: # repertoire for individual 1
              generative_model:
                default_model_name: humanIGH
                model_path: null
                type: OLGA
              is_noise: false
              number_of_examples: 1 # for demonstration purposes we simulate airr1 1 time
              receptors_in_repertoire_count: 10000 # each AIRR contains 10000 receptors
              signals:
                signal1: 0.5 # 50% of receptors (5000) should contain signal1 (AAAA)
            airr2: # repertoire for individual 2
              generative_model:
                default_model_name: humanIGH
                model_path: null
                type: OLGA
              is_noise: false
              number_of_examples: 1 # for demonstration purposes we simulate airr2 1 time
              receptors_in_repertoire_count: 10000 # each AIRR contains 10000 receptors
              signals:
                signal2: 0.5 # 50% of receptors (5000) should contain signal2 (AACA)
            airr3: # repertoire for individual 3
              generative_model:
                default_model_name: humanIGH
                model_path: null
                type: OLGA
              is_noise: false
              number_of_examples: 1 # for demonstration purposes we simulate airr3 1 time
              receptors_in_repertoire_count: 10000 # each AIRR contains 10000 receptors
              signals:
                signal3: 0.5 # 50% of receptors (5000) should contain signal3 (AAGA)
            airr4: # repertoire for individual 4
              generative_model:
                default_model_name: humanIGH
                model_path: null
                type: OLGA
              is_noise: false
              number_of_examples: 1 # for demonstration purposes we simulate airr4 1 time
              receptors_in_repertoire_count: 10000 # each AIRR contains 10000 receptors
              signals:
                signal4: 0.5 # 50% of receptors (5000) should contain signal4 (AANA)
    instructions:
      inst1:
        export_p_gens: false # we don't need pgens for this use case
        max_iterations: 10000
        number_of_processes: 32
        sequence_batch_size: 100000
        simulation: sim1
        type: LigoSim
    output:
      format: HTML

Data preprocessing
------------------

Before applying logistic regression, we additionally preprocessed the LIgO-simulated data as follows:

- We merged AIRR1, AIRR2, AIRR3, and AIRR4 into one dataset (merged_airrs.tsv).
- We added two additional columns, **individual** and **signal_label**. The **individual** column indicates the corresponding individual for each AIR. The **signal_label** column contains values 0 and 1, where 1 indicates that an AIR contains any of the signal motifs and 0 otherwise.

Here is an example of a Python script that performs the preprocessing described above:

.. code-block:: python

    import os

    import pandas as pd

    # path to the LIgO output
    data_path = "path_to_simulated_dataset/inst1/exported_dataset/airr"

    metadata = pd.read_csv(os.path.join(data_path, "metadata.csv"))
    merged = pd.DataFrame()

    for i in range(4):
        filename = metadata.loc[i, "filename"]
        df = pd.read_csv(os.path.join(data_path, "repertoires", filename), sep='\t')
        # record which individual (sim_item) each AIR comes from
        df['individual'] = metadata.loc[i, "sim_item"]
        # the signal columns are mutually exclusive 0/1 indicators, so their sum
        # is 1 for signal-specific AIRs and 0 for background AIRs
        df['signal_label'] = df['signal1'] + df['signal2'] + df['signal3'] + df['signal4']
        merged = pd.concat([merged, df], ignore_index=True)

    # write the result to merged_airrs.tsv
    merged.to_csv(os.path.join(data_path, "merged_airrs.tsv"), sep='\t', index=False)
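Before training, one can quickly verify the per-individual class balance of the merged dataset. This is a minimal sketch, assuming the merged_airrs.tsv file produced by the preprocessing script above:

.. code-block:: python

    import pandas as pd

    merged = pd.read_csv("merged_airrs.tsv", sep='\t')

    # each individual should contribute 5000 signal-specific (1)
    # and 5000 background (0) AIRs
    print(merged.groupby("individual")["signal_label"].value_counts())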
ML configuration
----------------

We used a logistic regression (LR) model to classify AIRs as signal-specific or non-specific. The LR model was trained and tested using two train-test split approaches: (i) a random train-test split and (ii) a leave-one-individual-out train-test split, see the illustration below. You can find more details about our usage of LR in the LIgO manuscript.

.. image:: ../_static/figures/usecase1_splits.png
   :width: 800

The L1-regularized logistic regression on k-mer-encoded data is trained using ImmuneML. The configuration for the ML model is shown below (as an example for one dataset). The ImmuneML parameters are extensively documented in the `ImmuneML documentation `_.

**Random train-test split strategy**

.. code-block:: yaml

    definitions:
      datasets:
        d1:
          format: AIRR
          params:
            path: merged_airrs.tsv
            is_repertoire: false
            import_out_of_frame: True
            metadata_column_mapping:
              class: signal_label
      encodings:
        4mer:
          KmerFrequency:
            k: 4
            sequence_type: AMINO_ACID
            sequence_encoding: continuous_kmer # split sequence into overlapping k-mers
            scale_to_unit_variance: True # scale the normalized examples to have unit variance
            scale_to_zero_mean: False
      ml_methods:
        LR:
          LogisticRegression:
            penalty: l1
            C: [0.1, 0.05, 0.01, 0.001]
      reports:
        coefs:
          Coefficients:
            coefs_to_plot:
            - n_largest
            n_largest:
            - 25
            name: coefs
    instructions:
      inst1:
        dataset: d1
        labels:
        - signal_label
        assessment:
          reports:
            models:
            - coefs
          split_count: 4
          split_strategy: stratified_k_fold
        selection:
          split_count: 4
          split_strategy: stratified_k_fold
        metrics: [balanced_accuracy, precision, recall]
        optimization_metric: balanced_accuracy
        settings:
        - encoding: 4mer
          ml_method: LR
        refit_optimal_model: false
        number_of_processes: 32
        strategy: GridSearch
        type: TrainMLModel
    output:
      format: HTML

**Leave-one-individual-out train-test split strategy**

.. code-block:: yaml

    definitions:
      datasets:
        d1:
          format: AIRR
          params:
            path: merged_airrs.tsv
            is_repertoire: false
            import_out_of_frame: True
            metadata_column_mapping:
              signal_label: signal_label
              individual: individual
      encodings:
        4mer:
          KmerFrequency:
            k: 4
            sequence_type: AMINO_ACID
            sequence_encoding: continuous_kmer # split sequence into overlapping k-mers
            scale_to_unit_variance: True # scale the normalized examples to have unit variance
            scale_to_zero_mean: False
      ml_methods:
        LR:
          LogisticRegression:
            penalty: l1
            C: [0.1, 0.05, 0.01, 0.001]
      reports:
        coefs:
          Coefficients:
            coefs_to_plot:
            - n_largest
            n_largest:
            - 25
            name: coefs
    instructions:
      inst1:
        dataset: d1
        labels:
        - signal_label
        assessment:
          reports:
            models:
            - coefs
          split_count: 4
          split_strategy: leave_one_out_stratification
          leave_one_out_config: # perform leave-(subject)-out CV
            parameter: individual
            min_count: 1
        selection:
          split_count: 3
          split_strategy: leave_one_out_stratification
          leave_one_out_config: # perform leave-(subject)-out CV
            parameter: individual
            min_count: 1
        metrics: [balanced_accuracy, precision, recall]
        optimization_metric: balanced_accuracy
        settings:
        - encoding: 4mer
          ml_method: LR
        refit_optimal_model: false
        number_of_processes: 32
        strategy: GridSearch
        type: TrainMLModel
    output:
      format: HTML
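For intuition, the leave-one-individual-out strategy corresponds to grouped cross-validation where the group is the individual. The standalone sketch below (toy data, using scikit-learn's LeaveOneGroupOut rather than ImmuneML) illustrates the resulting splits:

.. code-block:: python

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    X = np.arange(8).reshape(8, 1)                # toy features (one row per AIR)
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])        # toy signal labels
    groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # individual of each AIR

    # each fold holds out all AIRs of exactly one individual
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        print("train individuals:", sorted(set(groups[train_idx])),
              "| test individual:", sorted(set(groups[test_idx])))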
Results and approaches to increase simulation complexity if required
---------------------------------------------------------------------

The logistic regression models trained on the random train-test split achieved higher accuracy, with a median balanced accuracy of 0.98, and their largest coefficients corresponded to the four signal 4-mers. However, the logistic regression models trained on the leave-one-individual-out train-test split demonstrated lower performance, with a median balanced accuracy of 0.67. This shows that while near-perfect prediction accuracy may be observed with a random split approach, prediction accuracy can be considerably lower under a leave-one-individual-out split strategy.

Since this use case only illustrates the utilization of LIgO for benchmarking and developing AIRR-ML methods, the simulation performed here is simplified. To maintain focus, we assumed that each individual carries only one signal-specific 4-mer and that all other signal-specific AIRs are removed from the background. To create a more complex simulation, one may utilize the following LIgO options:

- Do not exclude signal-specific receptors from the background. For example, AIRs containing AAAA, AAGA, AANA, and AACA would then occur at their natural frequency within the 5000 background AIRs of each individual. To achieve this, set remove_seqs_with_signals to false.

  .. code-block:: yaml

      remove_seqs_with_signals: false # keep all signal-specific AIRs in the background receptors

- Explicitly add AIRs containing other signal-specific 4-mers. For example, if an individual carries the signal AAAA, we can also include AIRs containing AAGA, AANA, and AACA in their repertoire. The YAML below demonstrates how to incorporate 1% of these extra signal-specific AIRs for individual 1 (see also the arithmetic sketch after this list).

  .. code-block:: yaml

      airr1: # repertoire for individual 1
        generative_model:
          default_model_name: humanIGH
          model_path: null
          type: OLGA
        is_noise: false
        number_of_examples: 1 # for demonstration purposes we simulate airr1 1 time
        receptors_in_repertoire_count: 10000 # each AIRR contains 10000 receptors
        signals:
          signal1: 0.5 # 50% of receptors (5000) should contain signal1 (AAAA)
          signal2: 0.01 # 1% of receptors (100) should contain signal2 (AACA)
          signal3: 0.01 # 1% of receptors (100) should contain signal3 (AAGA)
          signal4: 0.01 # 1% of receptors (100) should contain signal4 (AANA)
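As a quick arithmetic check of these proportions (plain Python, for illustration only), the expected per-signal receptor counts in airr1 under this extended configuration are:

.. code-block:: python

    receptors_in_repertoire_count = 10000
    signal_proportions = {
        "signal1 (AAAA)": 0.50,  # main signal of individual 1
        "signal2 (AACA)": 0.01,  # extra signal-specific AIRs
        "signal3 (AAGA)": 0.01,
        "signal4 (AANA)": 0.01,
    }

    for signal, proportion in signal_proportions.items():
        print(f"{signal}: {round(proportion * receptors_in_repertoire_count)} receptors")

    background = (1 - sum(signal_proportions.values())) * receptors_in_repertoire_count
    print(f"background: {round(background)} receptors")  # 4700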