Paired chain simulations in LIgO ================================== LIgO supports paired chain simulations under a simplifying assumption that the individual chains are generated independently, but allows for some logic in terms of how to pair the chains. This tutorial contains two sections: paired chain simulation on the receptor level and paired chain simulation on the repertoire level. Paired chain simulation on the receptor level ------------------------------------------------ To simulate paired receptor sequences, we define how to simulate each chain, and then how to pair them. Step 1: Define immune signals ```````````````````````````````````` We first define immune signals that will be used for simulation. There are no additional restrictions on how to define them compared to any other LIgO simulation, except that if the immune signal contains V or J gene, it can only be present in the chain that has those genes (e.g., signal with TRBV1 can only exist in TRB sequences, but not in TRA). We define two immune signals consisting of one k-mer each: .. code-block:: yaml motifs: motif1: seed: AS motif2: seed: GG signals: signal1: motifs: [motif1] signal2: motifs: [motif2] Step 2: Define the simulation of individual chains and their pairing ```````````````````````````````````````````````````````````````````````` The following specification defines the simulation with 10 TRA sequences and TRB sequences, and uses `paired` parameter to specify which simulation items to combine (here `sim_alpha` and `sim_beta`). .. code-block:: yaml simulations: sim1: sim_items: sim_alpha: # how to simulate alpha chain generative_model: default_model_name: humanTRA type: OLGA number_of_examples: 10 # 10 sequences seed: 100 signals: signal2: 1 # all of the sequences should contain signal2 sim_beta: # how to simulate beta chain generative_model: default_model_name: humanTRB type: OLGA number_of_examples: 10 # 10 sequences seed: 2 signals: signal1: 1 # all of the sequences should contain signal1 is_repertoire: false # simulation is on the receptor level, not repertoire paired: # how to pair simulation items - [sim_alpha, sim_beta] # here only one combination, but could be many sequence_type: amino_acid simulation_strategy: RejectionSampling Step 3: Run the simulation ````````````````````````````` Save the full specification to `paired_sim_receptor_specs.yaml`: .. code-block:: yaml definitions: motifs: motif1: seed: AS motif2: seed: GG signals: signal1: motifs: [motif1] signal2: motifs: [motif2] simulations: sim1: sim_items: sim_alpha: # how to simulate alpha chain generative_model: default_model_name: humanTRA type: OLGA number_of_examples: 10 # 10 sequences seed: 100 signals: signal2: 1 # all of the sequences should contain signal2 sim_beta: # how to simulate beta chain generative_model: default_model_name: humanTRB type: OLGA number_of_examples: 10 # 10 sequences seed: 2 signals: signal1: 1 # all of the sequences should contain signal1 is_repertoire: false # simulation is on the receptor level, not repertoire paired: # how to pair simulation items - [sim_alpha, sim_beta] # here only one combination, but could be many sequence_type: amino_acid simulation_strategy: RejectionSampling instructions: inst1: # user-defined instruction name and the name of the output folder export_p_gens: false max_iterations: 100 number_of_processes: 2 sequence_batch_size: 100 simulation: sim1 type: LigoSim output: format: HTML Run LIgO using the following command: .. code-block:: console ligo paired_sim_receptor_specs.yaml simulation_output_receptor All results will be located in `simulation_output_receptor` folder. Step 4: Explore results of receptor-level simulation ``````````````````````````````````````````````````````` The simulated dataset is located under `simulation_output_receptor/inst1/exported_dataset/airr/batch1.tsv`. Some of the columns are shown in the table below: .. list-table:: Simulated receptors in AIRR format :header-rows: 1 * - junction_aa - locus - cell_id - signal1 - signal2 * - CAFHGGATNKLIF - TRA - eb73d6fabc684aa5bb2c3faaff5bc1d1 - 0 - 1 * - CASSESEKVRSSTDTQYF - TRB - eb73d6fabc684aa5bb2c3faaff5bc1d1 - 1 - 0 * - CAETGGTSYGKLTF - TRA - 307e92e0fd734fc48239aaa8af911637 - 0 - 1 * - CASSPEGQGCNQPQHF - TRB - 307e92e0fd734fc48239aaa8af911637 - 1 - 0 In the output, each row represent one chain, and if the chains come from the same receptor, they have the same `cell_id`. The `cell_id` field contains a unique hex value to fully determine the cell. Paired chain simulation on the repertoire level ------------------------------------------------ The paired chain simulation on the repertoire level is defined in the same way as the receptor level simulation. The difference between these two levels comes from the fact that not all receptors in the repertoire contain signals, so the signal pairing itself is slightly different. Signals for repertoires are defined in the following way: .. code-block:: yaml signals: signal1: 0.2 signal2: 0.1 This means that 20% of the repertoire sequences will contain signal1 and 10% of the sequences will contain signal2. This definition is on the level of a single chain for one repertoire (e.g., beta). For the other chain (e.g., alpha), the repertoire signals could be defined like this: .. code-block:: yaml signals: signal1: 0.1 signal2: 0.2 This means that 10% of the repertoire sequences will contain signal1 and 20% will contain signal2. If these were now to be used to make paired chain repertoire, the resulting repertoires will contain: - 10% of receptors with signal1 in both chains, - 10% of receptors in with signal1 in beta chain and signal2 in alpha chain, - 10% of receptors with signal2 in both chains. The chains are simply combined in the order that signals are defined. If for the alpha chain for example, the proportion of signal2 was larger that 0.2, the remaining alpha sequences with signal2 would be paired with beta chain sequences without any signal. Step 1: Define the YAML specification for repertoire-level paired chain simulation ``````````````````````````````````````````````````````````````````````````````````````` Save the specification to `paired_sim_repertoire_specs.yaml`: .. code-block:: yaml definitions: motifs: motif1: seed: AS motif2: seed: GG signals: signal1: motifs: [motif1] signal2: motifs: [motif2] simulations: sim1: is_repertoire: true paired: - [sim_alpha, sim_beta] # combine repertoires from these two simulation items sequence_type: amino_acid sim_items: sim_alpha: generative_model: default_model_name: humanTRA type: OLGA number_of_examples: 10 # 10 repertoires receptors_in_repertoire_count: 10 # 10 alpha chain sequences per repertoire seed: 100 signals: signal1: 0.1 signal2: 0.2 sim_beta: generative_model: default_model_name: humanTRB type: OLGA number_of_examples: 10 # 10 repertoires receptors_in_repertoire_count: 10 # 10 beta chain sequences per repertoire seed: 2 signals: signal1: 0.2 signal2: 0.1 simulation_strategy: RejectionSampling instructions: inst1: export_p_gens: false max_iterations: 100 number_of_processes: 2 sequence_batch_size: 100 simulation: sim1 type: LigoSim output: format: HTML Step 2: Run the simulation ````````````````````````````` Run LIgO using the following command: .. code-block:: console ligo paired_sim_repertoire_specs.yaml simulation_output_repertoire All results will be located in `simulation_output_repertoire` folder. Each repertoire file will contain alpha and beta chain sequences. The sequences coming from the same receptor will have the same `cell_id` as illustrated in the receptor simulation case above.