How to define immune signals and immune events¶
Adaptive immune receptors (AIRs) specifically recognize various antigens, so they can be labeled with antigen specificity. Adaptive immune receptor repertoires (AIRRs) from individual patients can similarly be labeled with the individual's disease state(s). To facilitate development and benchmarking of AIRR-related machine learning models, we focus on biological immune events (e.g., disease, allergy, vaccination) and immune signals that reflect the binding rules to the immune event antigens. Under this formalization, multiple immune signals can be associated with a single immune event.
In this tutorial, we describe this formalization in more detail and connect it to the simulation specification. More details on each of the options in the specification can be found on the YAML specification page.
To define an immune signal on the AIR level, we define the following:
a set of motifs determining the content of the receptor sequence, where motifs are defined as a distribution over amino acids or nucleotides,
motif locations in the CDR3,
V gene,
J gene.
Motifs¶
LIgO allows for two types of motifs:
a motif based on a seed string with possible gaps and allowed variations from the seed, and
a positional weight matrix describing a multinomial distribution of amino acids or nucleotides over the motif positions.
Seed motif¶
Here is an example of a motif definition based on a seed. The seed can optionally contain a gap, denoted with the / sign. The definition can also specify the minimum and maximum size of the gap, the probabilities of different Hamming distances (how many letters in the motif can be changed and with what probability), position weights (for each position in the seed, the probability that the letter at that position is the one changed), and alphabet weights (which letters to pick as replacements to reach the required Hamming distance).
my_simple_motif: # the name of the motif used for reference later
  seed: AAA # motif is always AAA
my_gapped_motif: # the name of the more complex motif where / sign denotes a possible gap location
  seed: AA/A # this motif can be AAA, AA_A, CAA, CA_A, DAA, DA_A, EAA, EA_A
  min_gap: 0 # size of the gap: min 0 and max 1
  max_gap: 1
  hamming_distance_probabilities: # it can have a max of 1 substitution
    0: 0.7
    1: 0.3
  position_weights: # note that index 2, the position of the gap, is excluded from position_weights
    0: 1 # only first position can be changed
    1: 0
    3: 0
  alphabet_weights: # the first A can be replaced by C, D or E
    C: 0.4
    D: 0.4
    E: 0.2
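To make the interplay of these parameters concrete, the following Python sketch draws instances of a seed motif. It is an illustration only and not LIgO's implementation: the function name sample_motif_instance, the uniform choice of the gap size between min_gap and max_gap, and the order in which substitutions and gaps are applied are all assumptions made for this example. The gap is rendered as _, following the notation used in the comments of the specification.

```python
import random


def sample_motif_instance(seed, min_gap, max_gap, hamming_probs,
                          position_weights, alphabet_weights, rng):
    """Draw one motif instance; '/' in the seed marks the gap location."""
    letters = list(seed)

    # Draw the Hamming distance (number of substitutions) for this instance.
    distances = list(hamming_probs)
    n_subs = rng.choices(distances,
                         weights=[hamming_probs[d] for d in distances])[0]

    # Pick distinct positions to change, weighted by position_weights.
    # Positions with weight 0 are never changed.
    candidates = {pos: w for pos, w in position_weights.items() if w > 0}
    for _ in range(min(n_subs, len(candidates))):
        positions = list(candidates)
        pos = rng.choices(positions,
                          weights=[candidates[p] for p in positions])[0]
        del candidates[pos]
        # Pick a replacement letter that differs from the current one.
        alphabet = [c for c in alphabet_weights if c != letters[pos]]
        letters[pos] = rng.choices(
            alphabet, weights=[alphabet_weights[c] for c in alphabet])[0]

    # Assumption: the gap size is uniform between min_gap and max_gap.
    gap = "_" * rng.randint(min_gap, max_gap)
    return "".join(letters).replace("/", gap)


rng = random.Random(0)
instances = {
    sample_motif_instance(
        seed="AA/A", min_gap=0, max_gap=1,
        hamming_probs={0: 0.7, 1: 0.3},
        position_weights={0: 1, 1: 0, 3: 0},
        alphabet_weights={"C": 0.4, "D": 0.4, "E": 0.2},
        rng=rng)
    for _ in range(500)
}
print(sorted(instances))
```

Position keys refer to indices in the seed string including the gap marker, which is why index 2 never appears in position_weights.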
Positional weight matrix¶
Motifs can alternatively be defined as positional weight matrices (PWMs). For importing PWMs (and later annotating sequences with motifs defined as PWMs), LIgO relies on the bionumpy library and supports the formats that library supports. For more information, see the bionumpy documentation on PWMs.
Here is an example of a motif defined in the JASPAR format. The first line contains the name of the motif, and each subsequent line gives the counts of one nucleotide at each motif position.
>MA0080.1 SPI1
A [ 14 4 3 56 56 3 ]
C [ 21 2 0 1 0 18 ]
G [ 19 48 52 0 0 34 ]
T [ 3 3 2 0 1 2 ]
To define such a motif in the YAML specification, provide the path to the file where the motif is stored and a threshold value. When the PWM is later matched against a sequence, the threshold is the cutoff used to decide whether the sequence contains the motif.
my_custom_pwm: # this will be the identifier of the motif
  file_path: my_pwm_1.jaspar
  threshold: 2
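As a rough intuition for how a count matrix and a threshold can be combined, the sketch below scores sequence windows against the SPI1 counts shown above. The scoring scheme is an assumption for illustration: natural-log odds against a uniform background of 0.25, a pseudocount of 1, and a "score at or above threshold" rule. The scoring that bionumpy and LIgO actually apply may differ, and the function names are hypothetical.

```python
import math

# Counts copied from the JASPAR example above (MA0080.1 SPI1).
COUNTS = {
    "A": [14, 4, 3, 56, 56, 3],
    "C": [21, 2, 0, 1, 0, 18],
    "G": [19, 48, 52, 0, 0, 34],
    "T": [3, 3, 2, 0, 1, 2],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position counts into log-odds scores (assumed scheme)."""
    length = len(next(iter(counts.values())))
    matrix = []
    for i in range(length):
        total = sum(counts[base][i] + pseudocount for base in counts)
        matrix.append({
            base: math.log((counts[base][i] + pseudocount) / total / background)
            for base in counts
        })
    return matrix


def best_window_score(sequence, matrix):
    """Highest score over all windows of the motif's width."""
    width = len(matrix)
    scores = [
        sum(matrix[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores) if scores else float("-inf")


def contains_motif(sequence, matrix, threshold):
    """Assumed rule: the motif is present if the best score >= threshold."""
    return best_window_score(sequence, matrix) >= threshold


MATRIX = log_odds_matrix(COUNTS)
print(contains_motif("TTCGGAAGTT", MATRIX, threshold=2))
print(contains_motif("TTTTTTTT", MATRIX, threshold=2))
```

Under these assumptions, a higher threshold makes matching stricter, so fewer sequences are annotated as containing the motif.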
Position of the motif in the sequence¶
LIgO supports the use of IMGT positions to specify where the motifs of a signal may occur in the sequence. To specify the positions for a signal, define each position together with its probability. The probability mass that is not assigned explicitly is distributed over the positions not mentioned in the definition. In the example below, the motifs occur at position 109 or 110 with a combined probability of 0.9 (0.5 and 0.4, respectively) and at any other position with a total probability of 0.1. To disallow a position explicitly, set its probability to 0.
sequence_position_weights: # the motifs have the probability of 0.9 to occur at positions 109 and 110
  '109': 0.5
  '110': 0.4
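The redistribution of the remaining probability can be sketched in a few lines of Python. This is an illustration under stated assumptions: the remaining mass is split evenly across the unspecified positions, and the list of positions (IMGT 105 to 117, without insertion positions) is only an example. The function name complete_position_weights is hypothetical and not part of LIgO.

```python
def complete_position_weights(explicit, all_positions):
    """Spread the probability not claimed by `explicit` over other positions.

    Assumption: the remaining mass is divided evenly.
    """
    remaining = 1.0 - sum(explicit.values())
    unspecified = [p for p in all_positions if p not in explicit]
    share = remaining / len(unspecified) if unspecified else 0.0
    return {p: explicit.get(p, share) for p in all_positions}


# Illustrative set of positions; real sequences may include other positions.
positions = [str(p) for p in range(105, 118)]
weights = complete_position_weights({"109": 0.5, "110": 0.4}, positions)

for position, weight in weights.items():
    print(position, round(weight, 4))
```

A position that must never contain a motif would be listed explicitly with weight 0, so that it receives no share of the remaining mass.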
User-defined functions for signal definition¶
While the options presented above allow for flexible definitions of motifs and signals, a user might want to define a signal differently. For that purpose, LIgO supports custom functions that take a sequence and return True or False depending on whether the signal is present in it. For more details on this option, see Simulation with custom signal functions.
Complete example of signal definition for receptor-level simulation¶
Here is an example of how a set of motifs can be defined and put together under my_signal.
motifs:
  my_simple_motif: # the name of the motif used for reference later
    seed: AAA # motif is always AAA
  my_gapped_motif: # the name of the more complex motif where / sign denotes a possible gap location
    seed: AA/A # this motif can be AAA, AA_A, CAA, CA_A, DAA, DA_A, EAA, EA_A
    min_gap: 0 # size of the gap: min 0 and max 1
    max_gap: 1
    hamming_distance_probabilities: # it can have a max of 1 substitution
      0: 0.7
      1: 0.3
    position_weights: # note that index 2, the position of the gap, is excluded from position_weights
      0: 1 # only first position can be changed
      1: 0
      3: 0
    alphabet_weights: # the first A can be replaced by C, D or E
      C: 0.4
      D: 0.4
      E: 0.2
signals:
  my_signal: # the name of the signal used for reference later in the simulation specification
    motifs:
      - my_simple_motif
      - my_gapped_motif
    sequence_position_weights:
      '109': 0.5
      '110': 0.4
    v_call: TRBV1
    j_call: TRBJ1
Here my_signal has two possible motifs that occur at IMGT position 109 or 110 with probability 0.9, or at any other position with probability 0.1, and that have to occur in combination with the TRBV1 and TRBJ1 genes.
Repertoire-level simulation¶
When defining an immune signal on the repertoire level, in addition to the parameters described above, we provide the percentage of the repertoire containing the given immune signal and the clonal frequency. Clonal frequency is modeled via the zeta distribution from scipy, parameterized by the shape parameter of the distribution (called a in scipy) and the loc parameter, which can be used to shift the distribution. Here is an example:
signals:
  my_signal: # same signal as before, with added clonal frequency parameters
    motifs:
      - my_simple_motif
      - my_gapped_motif
    sequence_position_weights:
      '109': 0.5
      '110': 0.5
    v_call: TRBV1
    j_call: TRBJ1
    clonal_frequency:
      a: 2
      loc: 0
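To get a feel for what a: 2 and loc: 0 imply, the sketch below evaluates the zeta probability mass function, P(k) = k^(-a) / zeta(a) for k = 1, 2, ..., using only the standard library. The helper names zeta and zeta_pmf are hypothetical, and the zeta function is approximated numerically. scipy exposes this distribution as scipy.stats.zipf with the same a and loc parameters; how LIgO turns the sampled values into clone counts is not covered here.

```python
import math


def zeta(a, terms=10_000):
    """Approximate the Riemann zeta function for a > 1.

    Uses a partial sum plus an integral estimate of the remaining tail.
    """
    partial = sum(k ** -a for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1 - a) / (a - 1)
    return partial + tail


def zeta_pmf(k, a, loc=0):
    """P(X == k) for a zeta distribution with shape a, shifted by loc."""
    k = k - loc
    return k ** -a / zeta(a) if k >= 1 else 0.0


# Probability of the smallest values for the parameters in the example.
for size in range(1, 6):
    print(size, round(zeta_pmf(size, a=2, loc=0), 4))
```

With a = 2, about 61% of the probability mass falls on the value 1 (6/pi^2), and the rest is spread over a long tail. Larger values of a concentrate even more mass on small values, while loc shifts the whole distribution.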
Simulating immune events¶
We can define one or more signals as described above. If we want to combine multiple signals under a single label to denote a single immune event (e.g., T1D disease state), LIgO supports this in the following way:
First, all signals are defined.
Then, the simulation configuration is provided that defines how the signals will be combined.
For each group of examples with the same parameters, we define the combination of signals under the signals key. Then, for that group, we assign the value of the label of interest. For example, consider simulating a cohort of 30 repertoires for a type 1 diabetes (T1D) study, where 20 repertoires have the disease and 10 do not. We may specify that only the positive repertoires contain my_signal, while all 30 repertoires contain my_other_signal, which could be specific to some other immune event present in the full cohort. An example of such a simulation configuration is provided below.
my_simulation_config:
  is_repertoire: true # we do repertoire-level simulation
  paired: false
  sequence_type: amino_acid
  simulation_strategy: RejectionSampling
  remove_seqs_with_signals: true # remove signal-specific AIRs from the background
  sim_items:
    t1d_positive_repertoires_group1: # group of repertoires with the same parameters
      generative_model:
        chain: beta
        default_model_name: humanTRB
        model_path: null
        type: OLGA
      number_of_examples: 10 # we simulate 10 repertoires
      receptors_in_repertoire_count: 100 # each repertoire has 100 receptors
      signals:
        my_signal: 0.1 # 10% of the receptors in the repertoire contain my_signal
        my_other_signal: 0.2 # 20% of the receptors contain my_other_signal
      immune_events:
        T1D: true
    t1d_positive_repertoires_group2: # group of repertoires with the same parameters
      generative_model:
        chain: beta
        default_model_name: humanTRB
        model_path: null
        type: OLGA
      number_of_examples: 10 # we simulate 10 repertoires
      receptors_in_repertoire_count: 100 # each repertoire has 100 receptors
      signals:
        my_signal: 0.15 # 15% of the receptors in the repertoire contain my_signal
        my_other_signal: 0.2 # 20% of the receptors contain my_other_signal
      immune_events:
        T1D: true
    t1d_negative_receptors:
      generative_model:
        chain: beta
        default_model_name: humanTRB
        model_path: null
        type: OLGA
      number_of_examples: 10 # we simulate 10 repertoires
      receptors_in_repertoire_count: 100 # each repertoire has 100 receptors
      signals:
        my_other_signal: 0.03 # 3% of the receptors contain my_other_signal but none contain my_signal
      immune_events:
        T1D: false
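As a sanity check on the configuration above, the arithmetic behind it can be written out in Python. The dictionary below copies only the relevant numbers from the three groups; the helper names are hypothetical. The sketch assumes that each receptor carries at most one signal, so that the signal proportions can be subtracted from the repertoire size to obtain the background; whether LIgO treats the proportions as disjoint in this way is an assumption here.

```python
# Numbers copied from the simulation configuration above.
SIM_ITEMS = {
    "t1d_positive_repertoires_group1": {
        "number_of_examples": 10, "receptors_in_repertoire_count": 100,
        "signals": {"my_signal": 0.1, "my_other_signal": 0.2}, "T1D": True},
    "t1d_positive_repertoires_group2": {
        "number_of_examples": 10, "receptors_in_repertoire_count": 100,
        "signals": {"my_signal": 0.15, "my_other_signal": 0.2}, "T1D": True},
    "t1d_negative_receptors": {
        "number_of_examples": 10, "receptors_in_repertoire_count": 100,
        "signals": {"my_other_signal": 0.03}, "T1D": False},
}


def receptors_per_repertoire(item):
    """Expected receptor counts per signal in one repertoire of the group."""
    size = item["receptors_in_repertoire_count"]
    counts = {name: round(size * share)
              for name, share in item["signals"].items()}
    # Assumption: signals do not overlap, so the rest is background.
    counts["background"] = size - sum(counts.values())
    return counts


def cohort_summary(sim_items):
    """Number of T1D-positive and T1D-negative repertoires in the cohort."""
    positives = sum(i["number_of_examples"]
                    for i in sim_items.values() if i["T1D"])
    negatives = sum(i["number_of_examples"]
                    for i in sim_items.values() if not i["T1D"])
    return positives, negatives


print(cohort_summary(SIM_ITEMS))
for name, item in SIM_ITEMS.items():
    print(name, receptors_per_repertoire(item))
```

The two positive groups differ only in the proportion of my_signal, which lets the cohort contain positive repertoires with different signal strengths under the same T1D label.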
Immune events on the receptor level¶
It is possible to define immune events in the same way on the receptor level. They can denote the same thing as on the repertoire level, namely the disease state to which the listed signals are specific, or some other label of interest, e.g., the experiment or patient from which the receptor came.
Next steps¶
For a full minimal working example, see the Quickstart. For a detailed description of all the parameters and possible values, see the YAML specification. For help on choosing the content of the signals and motifs, see How to check feasibility of the simulation parameters. For a discussion of defining immune events and signals in this way, see also the LIgO manuscript.