Special nodes¶

DagSim has three special types of nodes that could be useful in simulations, namely a Selection node, a Missing node, and a Stratify node. A Selection node allows the user to simulate selection bias in the simulated data based on some user-specified criteria.

On the other hand, and as the name suggests, a Stratify node allows the user to easily stratify the resulting data set into different strata, again according to user-specified criteria. The results will be returned as a dictionary of different dictionaries, one for each stratum, and the samples from each stratum are saved in a separate .csv file.

Finally, a Missing node allows the user to drop some values from the resulting dataset and replace them by NaN, again based on the criterion set by the user.

In this tutorial, you will learn how to use each of these nodes. If you are not familiar with how to specify a simulation using DagSim, see this.

Selection¶

Similar to a Node node, to define a Selection node, you need to specify the following:

name (str): A name for the node.

function: The function to evaluate to get the value of the node. This function should return True to keep and False to discard a sample. Note that here you need to specify only the name of the function without any arguments.

args (list) (Optional): A list of positional arguments. An argument can be either another node in the graph or an object of the correct data type for the corresponding argument.

kwargs (dict) (Optional): A dictionary of key word arguments with key-value pairs in the form “name_of_argument”:value. A value can be either another node in the graph or an object of the correct data type for the corresponding argument.

visible (bool) (Optional): Default is True to show the node when drawing the graph. False hides the node in the graph.

The difference from a Node node is that the function here should return a boolean; True to include a sample, and False to discard a sample.

The following code shows an example where only the samples that have a value of node Y greater than a certain threshold are included in the data set.

import dagsim.base as ds
import numpy as np


def add(param1, param2):
    return param1 + param2


def square(param):
    return np.square(param)


def is_greater_than2(node, threshold):
    if node < threshold:
        return True
    else:
        return False


A = ds.Node(name="A", function=np.random.normal)
B = ds.Node(name="B", function=np.random.normal)
C = ds.Node(name="C", function=add, kwargs={"param1": A, "param2": B})
D = ds.Node(name="D", function=square, kwargs={"param": C})
SB = ds.Selection(name="SB", function=is_greater_than2, kwargs={"node": C, "threshold":2})

listNodes = [A, B, C, D, SB]
my_graph = ds.Graph(listNodes, "SelectionExample")
output = my_graph.simulate(num_samples=10, csv_name="SelectionExample")

graph:
  python_file: functions.py
  nodes:
    A:
      function: numpy.random.normal
    B:
      function: numpy.random.normal
    C:
      function: add(param1= A, param2= B)
    D:
      function: square(C)
    SB:
      function: is_greater_than2(C, 2)
      type: Selection


instructions:
  simulation:
    csv_name: parser
    num_samples: 10

Stratify¶

The arguments needed to specify a Stratify node are exactly the same as for a Selection node. However, the function here should return the name (str) of the stratum to which a given example should belong. These names will be used as suffixes to the main .csv file name.

The following code shows an example where the samples are split into three categories, namely “less than -1”, “greater than +1”, and “between -1 and +1”.

import dagsim.base as ds
import numpy as np


def add(param1, param2):
return param1 + param2


def square(param):
return np.square(param)


def check_strata(node):
    if node < -1:
        return "<-1"
    else:
        if node > 1:
            return ">1"
        else:
            return ">-1|<+1"


A = ds.Node(name="A", function=np.random.normal)
B = ds.Node(name="B", function=np.random.normal)
C = ds.Node(name="C", function=add, kwargs={"param1": A, "param2": B})
D = ds.Node(name="D", function=square, kwargs={"param": C})
St = ds.Stratify(name="St", function=check_strata, kwargs={"node": C})

listNodes = [A, B, C, D, St]
my_graph = ds.Graph(listNodes, "StratificationExample")
output = my_graph.simulate(num_samples=10, csv_name="StratificationExample")

graph:
  python_file: hello_world_functions.py
  nodes:
    A:
      function: numpy.random.normal
    B:
      function: numpy.random.normal
    C:
      function: add(param1= A, param2= B)
    D:
      function: square(C)
    St:
      function: check_strata(C)
      type: Stratify


instructions:
  simulation:
    csv_name: parser
    num_samples: 10

Missing¶

To specify a Missing node, the user provides the following:

name (str): A name for the node,

underlying_value (Node): The node that will eventually have missing values

index_node (Node): A Node node that will provide the indices of the entries that will go missing: True to consider the entry as missing and False to keep it.

visible (bool) (Optional): Default is True to show the node when drawing the graph. False hides the node in the graph.

We decided on this way of defining the node to keep the processes of specifying the indices of the missing entries and removing the corresponding values separate.

Note that the data with the missing entries would be saved as the output of the Missing node itself rather than that of the underlying_value node. The output of the latter would be the complete data without any missing entries. If you with to discard the complete data, you can use the observed=False argument when defining the underlying_value node.

In the following, we explore how you can simulate missing values according to the three types of missing data models defined in Rubin (1976). Here, the observed data are collectively denoted by \(Y_\mathrm{obs}\), and the missing, would-have-been, data are collectively denoted as \(Y_\mathrm{mis}\), and \(\psi\) refers to the parameters of the missing data model.

Missing Completely At Random (MCAR)¶

In this case, the missingness pattern is random and the probability of an entry going missing, \(Pr(M=0)\), is independent of any missing or non-missing values of other variables in the data-generating process. In other words,

\[\Pr(M=0|Y_{obs},Y_{mis},\psi) = \Pr(M=0|\psi)\]

import dagsim.base as ds
import numpy as np


underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
index_node = ds.Node(name="index_node", function=np.random.randint, kwargs={"low":0, "high":2})
MCAR = ds.Missing(name="MCAR", underlying_value=underlying_value, index_node=index_node)

list_nodes = [underlying_value, index_node, MCAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MCAR")

data = my_graph.simulate(num_samples=10, csv_name="MCAR")

graph:
  python_file: hello_world_functions.py
  nodes:
    underlying_value:
      function: numpy.random.normal
    index_node:
      function: numpy.random.randint(0,2)
    MCAR:
      underlying_value: underlying_value
      index_node: index_node


instructions:
  simulation:
    csv_name: parser
    num_samples: 10

Missing At Random (MAR)¶

In this case, the probability of an entry going missing depends on other observed values in the model, but does not depend on any unobserved quantities:

\[\Pr(M=0|Y_{obs},Y_{mis},\psi) = \Pr(M=0|Y_{obs},\psi)\]

In this case, \(\Pr(M=0)\) depends on the observed value of \(Y_{obs}\).

import dagsim.base as ds
import numpy as np

 def get_index(Y_observed):
    val = 0
    if Y_observed > 0:
        val = 1
    return val


underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
Y_observed = ds.Node(name="Y_observed", function=np.random.normal)
index_node = ds.Node(name="index_node", function=get_index, kwargs={"Y_observed": Y_observed})
MAR = ds.Missing(name="MAR", underlying_value=underlying_value, index_node=index_node)

list_nodes = [underlying_value, index_node, Y_observed, MAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MAR")

data = my_graph.simulate(num_samples=10, csv_name="MAR")

Missing Not At Random (MNAR)¶

In the MNAR case, the probability that an entry is missing depends not only on observed quantities but also on missing ones, so the conditional probability does not simplify:

\[\Pr(M=0|Y_{obs},Y_{mis},\psi) = \Pr(M=0|Y_{obs},Y_{mis},\psi)\]

In this case, \(\Pr(M=0)\) depends on the observed value of \(Y_{obs}\) and the, possibly, unobserved, would-have-been value of \(Y_{mis}\).

import dagsim.base as ds
import numpy as np

 def get_index(Y_observed, Y_missing):
    val = 0
    if Y_observed + Y_missing > 0.5:
        val = 1
    return val



underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
Y_observed = ds.Node(name="Y_observed", function=np.random.normal)
Y_missing = ds.Node(name="Y_missing", function=np.random.normal)

index_node_Y = ds.Node(name="index_node_Y", function=np.random.randint, kwargs={"low":0, "high":2})
index_node = ds.Node(name="index_node", function=get_index, kwargs={"Y_observed":Y_observed, "Y_missing":Y_missing})
MNAR = ds.Missing(name="MNAR", underlying_value=underlying_value, index_node=index_node)

list_nodes = [underlying_value, Y_observed, Y_missing, index_node, index_node_Y, MNAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MNAR")

data = my_graph.simulate(num_samples=10, csv_name="MNAR")