How to use LIgO for receptor-level simulation

Simulation of a TCR dataset containing two immune signals

In this quickstart tutorial, we will simulate a dataset of 300 productive TRB receptors such that (see the illustration below):

  • 100 TRBs containing signal 1;

  • 100 TRBs containing signal 2;

  • 100 TRBs containing no immune signal (background receptors).

We define signal 1 and signal 2 as the following (see the illustration below):

  • Signal 1 consists of a 2-mer {AS} and TRBV7, i.e., only TRBs containing both TRBV7 and 2-mer {AS} contain Signal 1;

  • Signal 2 consists of two gapped k-mers {G.G} and {G..G}, i.e., only TRBs containing {G.G} or {G..G} contain Signal 2.

Signal-specific TRBs will be generated using the rejection sampling strategy and the default OLGA model (humanTRB).

../_images/quickstart_receptor-level.png

By default, LIgO reports the simulated TRBs as a triple of the TRBV gene name, CDR3 AA sequence, and TRBJ gene name. To add full-length TCRs to the output, it is necessary to download the Stitchr reference data using stitchrdl -s human, see Stitcher documentation for more information. Once the Stitchr reference data is downloaded, LigO will automatically add full-length TCR sequences to the output.

If you also want to report the generation probabilities (pgen) of the simulated receptors according to the default OLGA humanTRB model, set the export_p_gens parameter to true. Please keep in mind that pgen evaluation may take time.

Step 1: YAML specification

We need to define the YAML file describing the simulation parameters. First, we define the immune signals 1 and 2 in the definitions section. You can read more about the yaml file parameters in YAML specification.

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: G/G
      max_gap: 2
      min_gap: 1
  signals:
    signal1:
      v_call: TRBV7
      motifs: [motif1]
    signal2:
      motifs: [motif2]

Second, we define the number of TRBs per signal in the simulations section. You can read more about the yaml file parameters in YAML specification.

simulations:
  sim1:
    is_repertoire: false
    paired: false
    sequence_type: amino_acid
    simulation_strategy: RejectionSampling
    remove_seqs_with_signals: true # remove signal-specific AIRs from the background
    sim_items:
      sim_item1: # group of AIRs with the same parameters
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals:
          signal1: 1
      sim_item2:
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals:
          signal2: 1
      sim_item3:
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals: {} # no signal

Finally, we define the technical parameters of the simulation in the instructions section. You can read more about the yaml file parameters in specification.

instructions:
  my_sim_inst:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim

Here is the complete YAML specification for the simulation:

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: G/G
      max_gap: 2
      min_gap: 1
  signals:
    signal1:
      v_call: TRBV7
      motifs: [motif1]
    signal2:
      motifs: [motif2]
  simulations:
    sim1:
      is_repertoire: false
      paired: false
      sequence_type: amino_acid
      simulation_strategy: RejectionSampling
      remove_seqs_with_signals: true # remove signal-specific AIRs from the background
      sim_items:
        sim_item1: # group of AIRs with the same parameters
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals:
            signal1: 1
        sim_item2:
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals:
            signal2: 1
        sim_item3:
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals: {} # no signal
instructions:
  my_sim_inst:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim

Step 2: Running LIgO

After saving the yaml specification to a file (e.g., quickstart_receptor.yaml), you can proceed with the analysis by following these steps:

  1. Activate the virtual environment where you have installed LIgO, for example

source ligo_env/bin/activate
  1. Navigate to the directory where the yaml specification (quickstart_receptor.yaml) was saved.

  2. Execute the following command:

ligo quickstart_receptor.yaml quickstart_output_receptor

All results will be located in quickstart_output_receptor. Note that the output folder (quickstart_output_receptor) should not exist prior to the run.

Step 3: Understanding the output

The simulated dataset is located under

quickstart_output_receptor/my_sim_inst/exported_dataset/airr/batch1.tsv.

In the output, each row represents one AIR. Some of the output columns are shown in the table below.

  • v_call: V gene of a simulated AIR

  • j_call: J gene of a simulated AIR

  • junction_aa: junction of a simulated AIR

  • signal1: 1 if a simulated AIR contains signal1; 0 otherwise

  • signal2: 1 if a simulated AIR contains signal2; 0 otherwise

  • signal1_position: binary mask representing position of signal1 in a simulated AIR

  • signal2_position: binary mask representing position of signal2 in a simulated AIR

Simulated receptors in AIRR format

v_call

j_call

junction_aa

signal1

signal2

signal1_position

signal2_position

TRBV10-1*01

TRBJ2-5*01

CARPDRGGGYTF

0

1

m000000000000

m000000100000

TRBV7-2*02

TRBJ2-5*01

CASSRGHFQETQYF

1

0

m01000000000000

m00000000000000

TRBV7-8*01

TRBJ2-3*01

CASSSPGGVRIYSTDTQYF

1

0

m0100000000000000000

m0000000000000000000

Next steps