Analyzing Your Own Dataset =================================== In this tutorial, we will show how to analyze your own dataset with some useful guidelines on which concrete steps to take, which instructions to use, what visualizations to create. This tutorial assumes immuneML is already installed at your machine. If that is not the case, see :ref:`Installing immuneML`. The examples in this tutorial will use a :download:`simulated dataset <../_static/files/dataset.zip>`. It consists of 200 repertoires with two labels: disease (True, False) and batch (True, False). The available analysis specifications should be general enough that it can be applied to any dataset with minimal changes (e.g., paths, label names). Step 1: Explore the dataset ----------------------------- Before doing any machine learning, it is important to check the dataset. For that purpose, immuneML offers some useful reports and visualizations: - :ref:`AminoAcidFrequencyDistribution`: shows the relative frequency of each amino acid at each position in the sequences; the report can also split the dataset per some predefined label (e.g., by disease status) - :ref:`LabelOverlap`: shows the overlap of two different labels in the dataset as a heatmap; useful if in addition to the labels you are interested in, you have some other labels (e.g., batch information) that you want to check for overlap - :ref:`SequenceLengthDistribution`: shows the histogram of the lengths of sequences by the dataset, optionally split by a label of interest; useful to see if the dataset is balanced in terms of sequence lengths - :ref:`VJGeneDistribution`: shows how frequently each of the V and J gene occurs in the dataset; shows joint V and J gene distributions as well; here the report can also be split by a label of interest Some useful reports specifically for repertoire datasets (with one repertoire per patient): - :ref:`RepertoireClonotypeSummary`: shows the number of distinct clonotypes per repertoire; the clonotype counts can be colored by label of interest to see if there are any differences in the number of clonotypes per label. Additional useful reports to be applied on encoded datasets: - :ref:`FeatureComparison`: splits the data by the label of interests and plots their average values per label as a scatterplot, allowing us to visually inspect which features differ the most between the label values. - :ref:`DimensionalityReduction`: uses a dimensionality reduction method (e.g., PCA, tSNE, UMAP) to visualize the encoded dataset and optionally colors the points (each corresponding to one example from the dataset) by the label of interest. Encodings useful for exploratory analyses: - :ref:`KmerFrequency` encoder: represents an example by the frequency of subsequences of the given length; see the documentation for additional configuration options; The analysis specification to run these reports: .. collapse:: step1_exploratory_analysis.yaml .. highlight:: yaml .. code-block:: yaml definitions: datasets: dataset: format: AIRR params: # make sure the paths to the downloaded dataset files are correct dataset_file: ./dataset/dataset.yaml metadata_file: ./dataset/metadata.csv encodings: 3mer_freq: KmerFrequency reports: aa_dist: AminoAcidFrequencyDistribution: relative_frequency: true split_by_label: true label: disease alignment: IMGT region_type: IMGT_CDR3 label_overlap: LabelOverlap: column_label: disease row_label: batch seq_len_dist_for_disease: SequenceLengthDistribution: sequence_type: AMINO_ACID region_type: IMGT_CDR3 label: disease split_by_label: true plot_frequencies: true seq_len_dist_for_batch: SequenceLengthDistribution: sequence_type: AMINO_ACID region_type: IMGT_CDR3 label: batch split_by_label: true plot_frequencies: true gene_dist_disease: VJGeneDistribution: split_by_label: true label: disease gene_dist_batch: VJGeneDistribution: split_by_label: true label: batch clonotype_summary_disease: RepertoireClonotypeSummary: split_by_label: true label: disease clonotype_summary_batch: RepertoireClonotypeSummary: split_by_label: true label: batch feature_comparison: FeatureComparison: comparison_label: disease column_grouping_label: batch opacity: 0.6 show_error_bar: False dim_red_disease: DimensionalityReduction: label: disease dim_red_method: KernelPCA: n_components: 2 dim_red_batch: DimensionalityReduction: label: batch dim_red_method: KernelPCA: n_components: 2 instructions: exploratory_analysis: type: ExploratoryAnalysis analyses: raw_data_analysis: dataset: dataset reports: [aa_dist, label_overlap, seq_len_dist_for_disease, seq_len_dist_for_batch, gene_dist_disease, gene_dist_batch, clonotype_summary_disease, clonotype_summary_batch] encoded_data_analysis: dataset: dataset encoding: 3mer_freq reports: [feature_comparison, dim_red_disease, dim_red_batch] Run the exploratory analysis from the command line: .. code-block:: bash immune-ml step1_exploratory_analysis.yaml ./exploratory_analysis_results/ Some of the results on the example dataset are shown below: Amino acid frequency distribution report per IMGT position, split by 'disease' label: .. image:: ../_static/images/usability_tutorial/aa_freq_dist.png :alt: Amino acid frequency distribution report V gene distribution across repertoires split by the 'disease' label: .. image:: ../_static/images/usability_tutorial/V_gene_dist.png :alt: V gene usage across repertoires Comparison of feature values (average 3-mer frequencies per repertoire) across disease and batch labels: .. image:: ../_static/images/usability_tutorial/feature_val_comp.png :alt: Comparison of feature values across disease and batch labels Visualization of the dataset after 3-mer frequency encoding and dimensionality reduction with repertoires (examples) colored by the 'disease' label: .. image:: ../_static/images/usability_tutorial/dim_red_with_label.png :alt: Visualization of a dataset after 3-mer frequency encoding and dimensionality reduction with examples colored by disease label Step 2: Make a smaller dataset to try out the ML analysis first ------------------------------------------------------------------- It is often a good idea to make a smaller dataset to try out the analysis first. This can be done by using the Subsampling instruction. .. collapse:: step2_make_small_dataset.yaml .. highlight:: yaml .. code-block:: yaml definitions: datasets: dataset: format: AIRR params: dataset_file: ./dataset/dataset.yaml metadata_file: ./dataset/metadata.csv instructions: subsampling: type: Subsampling dataset: dataset subsampled_dataset_sizes: [75] # keep only 75 randomly selected repertoires subsampled_repertoire_size: 50 # keep only 50 sequences per repertoire Run this specification from the command line: .. code-block:: bash immune-ml step2_make_small_dataset.yaml ./small_dataset/ To explore the results, see `./small_dataset/index.html`. The small dataset in AIRR format is then located at `./small_dataset/subsampling/dataset_75_subsampled_1/exported/airr/`. Step 3: Run the ML analysis on a smaller dataset & revise the analysis as needed ------------------------------------------------------------------------------------- Here we define one example of immuneML analysis to build a repertoire classifier to distinguish between diseased and healthy. We try out different ML methods and encodings and choose the "best" one using nested cross validation. For more details on how this procedure is done in immuneML, see :ref:`How to train and assess a receptor or repertoire-level ML classifier`. For a detailed list of data encodings and ML methods, see :ref:`YAML specification`. .. note:: The results we obtain here are not so interesting per se, as we run the analysis on a subset of data. The aim is to make sure the analysis runs without issues and that all the parameters of interest (e.g., all visualizations and ML-encoding combinations) have been provided. Here is an example of the specification: .. collapse:: step3_small_ML_analysis.yaml .. highlight:: yaml .. code-block:: yaml definitions: datasets: dataset: format: AIRR params: dataset_file: ./small_dataset/subsampling/dataset_75_subsampled_1/exported/airr/dataset.yaml metadata_file: ./small_dataset/subsampling/dataset_75_subsampled_1/exported/airr/metadata.csv encodings: 3mer_freq: KmerFrequency esmc: ESMC ml_methods: log_reg_l1: LogisticRegression: penalty: l1 log_reg_l2: LogisticRegression: penalty: l2 random_forest: RandomForestClassifier: n_estimators: 50 reports: coefficients: # examine the coefficients of individual models (eg betas of logistic regression) Coefficients: coeff_to_plot: [n_largest] n_largest: [20] ml_setting_comparison: MLSettingsPerformance # plot performance of different ML settings performance_per_label: # performance of the model where examples are grouped by alternative label PerformancePerLabel: # it can be used to see if the model is learning alternative label instead of label of interest alternative_label: batch metric: balanced_accuracy roc_curve_summary: ROCCurveSummary instructions: fit_and_eval_classifiers: type: TrainMLModel settings: - encoding: 3mer_freq ml_method: log_reg_l1 - encoding: 3mer_freq ml_method: log_reg_l2 - encoding: 3mer_freq ml_method: random_forest - encoding: esmc ml_method: log_reg_l1 - encoding: esmc ml_method: log_reg_l2 - encoding: esmc ml_method: random_forest assessment: # outer loop of the nested cross validation (for model assessment) split_strategy: random # randomly split to train+validation and test split_count: 1 training_percentage: 0.7 reports: models: [coefficients] # plot coefficients of models trained on the train+validation data selection: # inner loop of the nested cross validation (for model selection) split_strategy: k_fold # do 5-fold CV starting from train+validation data [70% of the original dataset as defined in assessment] split_count: 5 labels: [disease] dataset: dataset metrics: [auc, precision, recall] optimization_metric: balanced_accuracy reports: [ml_setting_comparison, performance_per_label, roc_curve_summary] Run this specification from the command line: .. code-block:: bash immune-ml step3_small_ML_analysis.yaml ./small_ML_result/ To explore the results, see `./small_ML_result/index.html`. In case something went wrong, all details can be found at `./small_ML_result/log.txt`. Step 4: Run the analysis on the full dataset -------------------------------------------------- If everything worked out well so far (or if all necessary adjustments have been made), run the analysis on the full dataset. The specification file is the same as in step 3, except that dataset file and metadata file should point to the full dataset instead. Here are some of the results on the full dataset: - ROC curves with AUC annotated for all combinations of encoding + ML method listed on the test set: .. image:: ../_static/images/usability_tutorial/roc_full.png :alt: ROC curves with AUC for encodings and ML methods As the disease-associated signal is a 3-mer (by the simulation setup), the best performing method is a logistic regression with L1 penalty and with data encoded by 3-mer frequencies. - Since there is also batch information in the dataset, it is possible to check if the performance of the models differs for different batches (the results shown here are for the test set): .. image:: ../_static/images/usability_tutorial/bacc_per_batch.png :alt: Performance on label disease split by the batch for test set Additional information ----------------------------- The simulated dataset used in this tutorial was created using LIgO tool from immuneML. It uses OLGA to generate naive immune receptor sequences and simulates immune events by either implanting or filtering sequences based on user-specified conditions. For more information on LIgO and simulations, see :ref:`Dataset simulation with LIgO`.