How to use LIgO for receptor-level simulation¶

Simulation of a TCR dataset containing two immune signals¶

In this quickstart tutorial, we will simulate a dataset of 300 productive TRB receptors such that (see the illustration below):

100 TRBs containing signal 1;
100 TRBs containing signal 2;
100 TRBs containing no immune signal (background receptors).

We define signal 1 and signal 2 as the following (see the illustration below):

Signal 1 consists of a 2-mer {AS} and TRBV7, i.e., only TRBs containing both TRBV7 and 2-mer {AS} contain Signal 1;
Signal 2 consists of two gapped k-mers {G.G} and {G..G}, i.e., only TRBs containing {G.G} or {G..G} contain Signal 2.

Signal-specific TRBs will be generated using the rejection sampling strategy and the default OLGA model (humanTRB).

../_images/quickstart_receptor-level.png

By default, LIgO reports the simulated TRBs as a triple of the TRBV gene name, CDR3 AA sequence, and TRBJ gene name. To add full-length TCRs to the output, it is necessary to download the Stitchr reference data using stitchrdl -s human, see Stitcher documentation for more information. Once the Stitchr reference data is downloaded, LigO will automatically add full-length TCR sequences to the output.

If you also want to report the generation probabilities (pgen) of the simulated receptors according to the default OLGA humanTRB model, set the export_p_gens parameter to true. Please keep in mind that pgen evaluation may take time.

Step 1: YAML specification¶

We need to define the YAML file describing the simulation parameters. First, we define the immune signals 1 and 2 in the definitions section. You can read more about the yaml file parameters in YAML specification.

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: G/G
      max_gap: 2
      min_gap: 1
  signals:
    signal1:
      v_call: TRBV7
      motifs: [motif1]
    signal2:
      motifs: [motif2]

Second, we define the number of TRBs per signal in the simulations section. You can read more about the yaml file parameters in YAML specification.

simulations:
  sim1:
    is_repertoire: false
    paired: false
    sequence_type: amino_acid
    simulation_strategy: RejectionSampling
    remove_seqs_with_signals: true # remove signal-specific AIRs from the background
    sim_items:
      sim_item1: # group of AIRs with the same parameters
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals:
          signal1: 1
      sim_item2:
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals:
          signal2: 1
      sim_item3:
        generative_model:
          chain: beta
          default_model_name: humanTRB
          model_path: null
          type: OLGA
        number_of_examples: 100
        signals: {} # no signal

Finally, we define the technical parameters of the simulation in the instructions section. You can read more about the yaml file parameters in specification.

instructions:
  my_sim_inst:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim

Here is the complete YAML specification for the simulation:

definitions:
  motifs:
    motif1:
      seed: AS
    motif2:
      seed: G/G
      max_gap: 2
      min_gap: 1
  signals:
    signal1:
      v_call: TRBV7
      motifs: [motif1]
    signal2:
      motifs: [motif2]
  simulations:
    sim1:
      is_repertoire: false
      paired: false
      sequence_type: amino_acid
      simulation_strategy: RejectionSampling
      remove_seqs_with_signals: true # remove signal-specific AIRs from the background
      sim_items:
        sim_item1: # group of AIRs with the same parameters
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals:
            signal1: 1
        sim_item2:
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals:
            signal2: 1
        sim_item3:
          generative_model:
            chain: beta
            default_model_name: humanTRB
            model_path: null
            type: OLGA
          number_of_examples: 100
          signals: {} # no signal
instructions:
  my_sim_inst:
    export_p_gens: false
    max_iterations: 100
    number_of_processes: 4
    sequence_batch_size: 1000
    simulation: sim1
    type: LigoSim

Step 2: Running LIgO¶

After saving the yaml specification to a file (e.g., quickstart_receptor.yaml), you can proceed with the analysis by following these steps:

Activate the virtual environment where you have installed LIgO, for example

source ligo_env/bin/activate

Navigate to the directory where the yaml specification (quickstart_receptor.yaml) was saved.
Execute the following command:

ligo quickstart_receptor.yaml quickstart_output_receptor

All results will be located in quickstart_output_receptor. Note that the output folder (quickstart_output_receptor) should not exist prior to the run.

Step 3: Understanding the output¶

The simulated dataset is located under

quickstart_output_receptor/my_sim_inst/exported_dataset/airr/batch1.tsv.

In the output, each row represents one AIR. Some of the output columns are shown in the table below.

v_call: V gene of a simulated AIR
j_call: J gene of a simulated AIR
junction_aa: junction of a simulated AIR
signal1: 1 if a simulated AIR contains signal1; 0 otherwise
signal2: 1 if a simulated AIR contains signal2; 0 otherwise
signal1_position: binary mask representing position of signal1 in a simulated AIR
signal2_position: binary mask representing position of signal2 in a simulated AIR

Simulated receptors in AIRR format¶
v_call	j_call	junction_aa	signal1	signal2	signal1_position	signal2_position
TRBV10-1*01	TRBJ2-5*01	CARPDRGGGYTF	0	1	m000000000000	m000000100000
TRBV7-2*02	TRBJ2-5*01	CASSRGHFQETQYF	1	0	m01000000000000	m00000000000000
TRBV7-8*01	TRBJ2-3*01	CASSSPGGVRIYSTDTQYF	1	0	m0100000000000000000	m0000000000000000000
…	…	…	…	…	…	…

Next steps¶

For a quickstart guide on repertoire-level simulation see How to use LIgO for repertoire-level simulation.
Other tutorials for how to use LIgO can be found under Tutorials.
You can find more information about yaml parameters in YAML specification.