How to train a generative model
========================================

This tutorial provides a practical introduction for AIRR researchers interested in training generative machine learning models on immune receptor sequences using immuneML and the :ref:`TrainGenModel` instruction.

Choosing a Dataset
---------------------

To train a generative model, you need a dataset of immune receptor sequences. The sequences can be in any format supported by immuneML, such as AIRR or VDJdb. See :ref:`Dataset parameters` for the full list of supported formats and their required parameters.

Overview of Generative Models in immuneML
---------------------------------------------

immuneML supports several approaches for training generative models:

- positional weight matrices (:ref:`PWM`),
- LSTM-based generative models (:ref:`SimpleLSTM`),
- variational autoencoders (:ref:`SimpleVAE`),
- the SoNNia model (:ref:`SoNNia`).

See the documentation of each model for details on how to configure it. Some require almost no parameters, while others allow greater flexibility and customization.

Reports to Analyze the Results
-----------------------------------

immuneML provides built-in reports to inspect and evaluate generative models, either directly or in combination with different feature representations:

- :ref:`PWMSummary` showing the probabilities of generated sequences having different lengths, along with the PWM for each length,
- :ref:`VAESummary` showing the latent space after reducing its dimensionality to 2 dimensions, a histogram for each latent dimension, and the loss per epoch,
- :ref:`AminoAcidFrequencyDistribution` showing the distribution of amino acids in the generated vs. original sequences,
- :ref:`SequenceLengthDistribution` showing the distribution of sequence lengths in the generated vs. original sequences,
- :ref:`FeatureComparison` comparing the generated sequences with the original dataset using different encodings, e.g., k-mer frequencies (:ref:`KmerFrequency`) or protein embeddings (:ref:`ESMC`, :ref:`TCRBert`, :ref:`ProtT5`),
- :ref:`DimensionalityReduction` to compare encoded sequences after applying dimensionality reduction (see :ref:`Dimensionality reduction methods`), coloring the points by labels (e.g., generated vs. original sequences).

Such reports can also be attached directly to the training instruction, as sketched after the full training example below.

Full Training Example with LSTM
---------------------------------

To train an LSTM, the following YAML configuration may be used:

.. code-block:: yaml

    definitions:
      datasets:
        dataset:
          format: AIRR
          params:
            path: original_dataset.tsv
            is_repertoire: False
            paired: False
            region_type: IMGT_CDR3
            separator: "\t"
      ml_methods:
        LSTM:
          SimpleLSTM:
            locus: beta
            sequence_type: amino_acid
            num_epochs: 20
            hidden_size: 1024
            learning_rate: 0.001
            batch_size: 100
            embed_size: 256
            temperature: 1
            num_layers: 3
            device: cpu
            region_type: IMGT_CDR3
    instructions:
      LSTM:
        type: TrainGenModel
        export_combined_dataset: True
        dataset: dataset
        method: LSTM
        gen_examples_count: 1500
        number_of_processes: 1
        training_percentage: 0.7
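As a minimal sketch of attaching reports to the training run: the report names ``aa_frequency`` and ``seq_lengths`` and the instruction name ``LSTM_with_reports`` are arbitrary user-chosen keys, the reports are listed without parameters so their defaults apply, and the dataset and model definitions are abbreviated here (in a full specification they would match the example above):

.. code-block:: yaml

    definitions:
      datasets:
        dataset:          # same dataset definition as in the full example above
          format: AIRR
          params:
            path: original_dataset.tsv
            is_repertoire: False
            paired: False
            region_type: IMGT_CDR3
            separator: "\t"
      ml_methods:
        LSTM:             # abbreviated; see the full example above for all parameters
          SimpleLSTM:
            locus: beta
      reports:
        # reports listed without parameters use their default settings
        aa_frequency: AminoAcidFrequencyDistribution
        seq_lengths: SequenceLengthDistribution
    instructions:
      LSTM_with_reports:
        type: TrainGenModel
        dataset: dataset
        method: LSTM
        gen_examples_count: 1500
        training_percentage: 0.7
        reports: [aa_frequency, seq_lengths]

This way, the comparison plots are produced in the same run as model training, without a separate analysis step.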
To explore the dataset with original and generated sequences, we can encode both using k-mer frequencies and visualize the encoded values with feature value barplots. The dataset exported by the previous instruction contains both the original and generated sequences, and the column 'dataset_split' indicates which sequences are original and were used for training, which are original but held out from training (test), and which are generated.

.. code-block:: yaml

    definitions:
      datasets:
        LSTM_dataset:
          format: AIRR
          params:
            path: dataset.tsv
            is_repertoire: False
            paired: False
            region_type: IMGT_CDR3
            separator: "\t"
            import_illegal_characters: True
      encodings:
        3mer_encoding:
          KmerFrequency:
            k: 3
            sequence_type: amino_acid
            scale_to_unit_variance: False
            scale_to_zero_mean: False
        gapped_4mer_encoding:
          KmerFrequency:
            sequence_encoding: gapped_kmer
            sequence_type: amino_acid
            k_left: 2
            k_right: 2
            min_gap: 1
            max_gap: 1
            scale_to_unit_variance: False
            scale_to_zero_mean: False
      reports:
        feature_value_barplot:
          FeatureValueBarplot:
            color_grouping_label: dataset_split
            plot_all_features: false
            plot_top_n: 25
            error_function: sem
    instructions:
      data_reports:
        type: ExploratoryAnalysis
        number_of_processes: 1
        analyses:
          LSTM_3mer_analysis:
            dataset: LSTM_dataset
            encoding: 3mer_encoding
            reports: [ feature_value_barplot ]
          LSTM_gapped_4mer_analysis:
            dataset: LSTM_dataset
            encoding: gapped_4mer_encoding
            reports: [ feature_value_barplot ]

Using the Trained LSTM to Generate New Sequences
-----------------------------------------------------

To generate new sequences with the trained LSTM, we can use the :ref:`ApplyGenModel` instruction, pointing it to the model configuration exported during training:

.. code-block:: yaml

    definitions:
      reports:
        data_report: SequenceLengthDistribution
    instructions:
      my_apply_gen_model_inst:
        type: ApplyGenModel
        gen_examples_count: 100
        ml_config_path: ./config.zip
        reports: [data_report]
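Any of the YAML specifications above can be run from the command line by passing the specification file and an output directory to the ``immune-ml`` tool (the file and directory names below are placeholders):

.. code-block:: bash

    immune-ml train_lstm_specs.yaml lstm_output/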