Paired chain simulations in LIgO

LIgO supports paired chain simulations under a simplifying assumption that the individual chains are generated independently, but allows for some logic in terms of how to pair the chains.

This tutorial contains two sections: paired chain simulation on the receptor level and paired chain simulation on the repertoire level.

Paired chain simulation on the receptor level

To simulate paired receptor sequences, we define how to simulate each chain, and then how to pair them.

Step 1: Define immune signals

We first define immune signals that will be used for simulation. There are no additional restrictions on how to define them compared to any other LIgO simulation, except that if the immune signal contains V or J gene, it can only be present in the chain that has those genes (e.g., signal with TRBV1 can only exist in TRB sequences, but not in TRA).

We define two immune signals consisting of one k-mer each:

motifs:
  motif1:
    seed: AS
  motif2:
    seed: GG
signals:
  signal1:
    motifs: [motif1]
  signal2:
    motifs: [motif2]

Step 2: Define the simulation of individual chains and their pairing

The following specification defines the simulation with 10 TRA sequences and TRB sequences, and uses paired parameter to specify which simulation items to combine (here sim_alpha and sim_beta).

simulations:
  sim1:
    sim_items:
      sim_alpha: # how to simulate alpha chain
        generative_model:
          default_model_name: humanTRA
          type: OLGA
        number_of_examples: 10 # 10 sequences
        seed: 100
        signals:
          signal2: 1 # all of the sequences should contain signal2
      sim_beta: # how to simulate beta chain
        generative_model:
          default_model_name: humanTRB
          type: OLGA
        number_of_examples: 10 # 10 sequences
        seed: 2
        signals:
          signal1: 1 # all of the sequences should contain signal1
    is_repertoire: false # simulation is on the receptor level, not repertoire
    paired: # how to pair simulation items
    - [sim_alpha, sim_beta] # here only one combination, but could be many
    sequence_type: amino_acid
    simulation_strategy: RejectionSampling

Step 3: Run the simulation

Save the full specification to paired_sim_receptor_specs.yaml:

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: GG
  signals:
    signal1:
      motifs: [motif1]
    signal2:
      motifs: [motif2]
  simulations:
    sim1:
      sim_items:
        sim_alpha: # how to simulate alpha chain
          generative_model:
            default_model_name: humanTRA
            type: OLGA
          number_of_examples: 10 # 10 sequences
          seed: 100
          signals:
            signal2: 1 # all of the sequences should contain signal2
        sim_beta: # how to simulate beta chain
          generative_model:
            default_model_name: humanTRB
            type: OLGA
          number_of_examples: 10 # 10 sequences
          seed: 2
          signals:
            signal1: 1 # all of the sequences should contain signal1
      is_repertoire: false # simulation is on the receptor level, not repertoire
      paired: # how to pair simulation items
      - [sim_alpha, sim_beta] # here only one combination, but could be many
      sequence_type: amino_acid
      simulation_strategy: RejectionSampling
instructions:
  inst1: # user-defined instruction name and the name of the output folder
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 2
    sequence_batch_size: 100
    simulation: sim1
    type: LigoSim
output:
  format: HTML

Run LIgO using the following command:

ligo paired_sim_receptor_specs.yaml simulation_output_receptor

All results will be located in simulation_output_receptor folder.

Step 4: Explore results of receptor-level simulation

The simulated dataset is located under simulation_output_receptor/inst1/exported_dataset/airr/batch1.tsv. Some of the columns are shown in the table below:

Simulated receptors in AIRR format

junction_aa

locus

cell_id

signal1

signal2

CAFHGGATNKLIF

TRA

eb73d6fabc684aa5bb2c3faaff5bc1d1

0

1

CASSESEKVRSSTDTQYF

TRB

eb73d6fabc684aa5bb2c3faaff5bc1d1

1

0

CAETGGTSYGKLTF

TRA

307e92e0fd734fc48239aaa8af911637

0

1

CASSPEGQGCNQPQHF

TRB

307e92e0fd734fc48239aaa8af911637

1

0

In the output, each row represent one chain, and if the chains come from the same receptor, they have the same cell_id. The cell_id field contains a unique hex value to fully determine the cell.

Paired chain simulation on the repertoire level

The paired chain simulation on the repertoire level is defined in the same way as the receptor level simulation. The difference between these two levels comes from the fact that not all receptors in the repertoire contain signals, so the signal pairing itself is slightly different.

Signals for repertoires are defined in the following way:

signals:
    signal1: 0.2
    signal2: 0.1

This means that 20% of the repertoire sequences will contain signal1 and 10% of the sequences will contain signal2. This definition is on the level of a single chain for one repertoire (e.g., beta).

For the other chain (e.g., alpha), the repertoire signals could be defined like this:

signals:
    signal1: 0.1
    signal2: 0.2

This means that 10% of the repertoire sequences will contain signal1 and 20% will contain signal2.

If these were now to be used to make paired chain repertoire, the resulting repertoires will contain:

  • 10% of receptors with signal1 in both chains,

  • 10% of receptors in with signal1 in beta chain and signal2 in alpha chain,

  • 10% of receptors with signal2 in both chains.

The chains are simply combined in the order that signals are defined.

If for the alpha chain for example, the proportion of signal2 was larger that 0.2, the remaining alpha sequences with signal2 would be paired with beta chain sequences without any signal.

Step 1: Define the YAML specification for repertoire-level paired chain simulation

Save the specification to paired_sim_repertoire_specs.yaml:

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: GG
  signals:
    signal1:
      motifs: [motif1]
    signal2:
      motifs: [motif2]
  simulations:
    sim1:
      is_repertoire: true
      paired:
        - [sim_alpha, sim_beta] # combine repertoires from these two simulation items
      sequence_type: amino_acid
      sim_items:
        sim_alpha:
          generative_model:
            default_model_name: humanTRA
            type: OLGA
          number_of_examples: 10 # 10 repertoires
          receptors_in_repertoire_count: 10 # 10 alpha chain sequences per repertoire
          seed: 100
          signals:
            signal1: 0.1
            signal2: 0.2
        sim_beta:
          generative_model:
            default_model_name: humanTRB
            type: OLGA
          number_of_examples: 10 # 10 repertoires
          receptors_in_repertoire_count: 10 # 10 beta chain sequences per repertoire
          seed: 2
          signals:
            signal1: 0.2
            signal2: 0.1
      simulation_strategy: RejectionSampling
instructions:
  inst1:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 2
    sequence_batch_size: 100
    simulation: sim1
    type: LigoSim
output:
  format: HTML

Step 2: Run the simulation

Run LIgO using the following command:

ligo paired_sim_repertoire_specs.yaml simulation_output_repertoire

All results will be located in simulation_output_repertoire folder. Each repertoire file will contain alpha and beta chain sequences. The sequences coming from the same receptor will have the same cell_id as illustrated in the receptor simulation case above.