Special nodes
DagSim has three special types of nodes that can be useful in simulations: a Selection node, a Missing node, and a Stratify node. A Selection node lets the user simulate selection bias in the simulated data based on user-specified criteria.
A Stratify node, as the name suggests, lets the user split the resulting data set into different strata, again according to user-specified criteria. The results are returned as a dictionary of dictionaries, one for each stratum, and the samples from each stratum are saved in a separate .csv file.
Finally, a Missing node lets the user drop some values from the resulting data set and replace them with NaN, again based on a criterion set by the user.
In this tutorial, you will learn how to use each of these nodes. If you are not familiar with how to specify a simulation using DagSim, see this.
Selection
Similar to a Node node, a Selection node is defined by specifying the following:
name (str): A name for the node.
function: The function to evaluate to get the value of the node. This function should return True to keep a sample and False to discard it. Note that here you need to specify only the name of the function, without any arguments.
args (list) (Optional): A list of positional arguments. An argument can be either another node in the graph or an object of the correct data type for the corresponding argument.
kwargs (dict) (Optional): A dictionary of keyword arguments with key-value pairs in the form "name_of_argument": value. A value can be either another node in the graph or an object of the correct data type for the corresponding argument.
visible (bool) (Optional): Defaults to True, which shows the node when drawing the graph. False hides the node in the graph.
The only difference from a Node node is that the function here must return a boolean: True to include a sample and False to discard it.
The following code shows an example where only the samples that have a value of node C greater than a certain threshold are included in the data set.
import dagsim.base as ds
import numpy as np


def add(param1, param2):
    return param1 + param2


def square(param):
    return np.square(param)


def is_greater_than2(node, threshold):
    # Selection function: True keeps the sample, False discards it.
    return node > threshold


A = ds.Node(name="A", function=np.random.normal)
B = ds.Node(name="B", function=np.random.normal)
C = ds.Node(name="C", function=add, kwargs={"param1": A, "param2": B})
D = ds.Node(name="D", function=square, kwargs={"param": C})
# Keep only the samples whose value of C exceeds the threshold of 2.
SB = ds.Selection(name="SB", function=is_greater_than2, kwargs={"node": C, "threshold": 2})

listNodes = [A, B, C, D, SB]
my_graph = ds.Graph(listNodes, "SelectionExample")
output = my_graph.simulate(num_samples=10, csv_name="SelectionExample")
graph:
  python_file: functions.py
  nodes:
    A:
      function: numpy.random.normal
    B:
      function: numpy.random.normal
    C:
      function: add(param1= A, param2= B)
    D:
      function: square(C)
    SB:
      function: is_greater_than2(C, 2)
      type: Selection
instructions:
  simulation:
    csv_name: parser
    num_samples: 10
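To make the effect of a selection function concrete, here is a minimal plain-Python sketch of the same idea that does not use DagSim at all: draw samples, evaluate the boolean predicate on each one, and keep only those for which it returns True. The helper simulate_with_selection and its parameters are hypothetical names chosen for illustration, and the sketch makes no claim about how DagSim implements Selection internally. In particular, num_samples here counts draws before filtering, which may differ from how DagSim counts samples.

```python
import random


def is_greater_than(value, threshold):
    # Selection predicate: True keeps the sample, False discards it.
    return value > threshold


def simulate_with_selection(num_samples, threshold, seed=0):
    # Hypothetical helper (not part of DagSim): draws A and B, computes
    # C = A + B and D = C**2, and keeps only samples whose C passes the predicate.
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        a = rng.gauss(0, 1)
        b = rng.gauss(0, 1)
        c = a + b
        if is_greater_than(c, threshold):
            kept.append({"A": a, "B": b, "C": c, "D": c ** 2})
    return kept


selected = simulate_with_selection(num_samples=1000, threshold=2)
print(f"kept {len(selected)} of 1000 draws")
```

Every retained sample has C above the threshold, so summary statistics computed on the retained data no longer describe the full population. That distortion is the selection bias this node type is meant to simulate.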
Stratify
The arguments needed to specify a Stratify node are exactly the same as for a Selection node. However, the function here should return the name (str) of the stratum to which a given sample belongs. These names are used as suffixes to the main .csv file name.
The following code shows an example where the samples are split into three categories, namely “less than -1”, “greater than +1”, and “between -1 and +1”.
import dagsim.base as ds
import numpy as np


def add(param1, param2):
    return param1 + param2


def square(param):
    return np.square(param)


def check_strata(node):
    # Return the name of the stratum the sample belongs to.
    if node < -1:
        return "<-1"
    elif node > 1:
        return ">1"
    else:
        return ">-1|<+1"


A = ds.Node(name="A", function=np.random.normal)
B = ds.Node(name="B", function=np.random.normal)
C = ds.Node(name="C", function=add, kwargs={"param1": A, "param2": B})
D = ds.Node(name="D", function=square, kwargs={"param": C})
St = ds.Stratify(name="St", function=check_strata, kwargs={"node": C})

listNodes = [A, B, C, D, St]
my_graph = ds.Graph(listNodes, "StratificationExample")
output = my_graph.simulate(num_samples=10, csv_name="StratificationExample")
graph:
  python_file: hello_world_functions.py
  nodes:
    A:
      function: numpy.random.normal
    B:
      function: numpy.random.normal
    C:
      function: add(param1= A, param2= B)
    D:
      function: square(C)
    St:
      function: check_strata(C)
      type: Stratify
instructions:
  simulation:
    csv_name: parser
    num_samples: 10
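The grouping that a stratification function induces can be illustrated with a short plain-Python sketch that does not depend on DagSim. It labels each sample with the stratum name returned by check_strata and collects the samples into one dictionary per stratum. The helper stratify and the column-to-list layout of the inner dictionaries are assumptions made for illustration; the exact structure DagSim returns and the file names it writes may differ.

```python
import random


def check_strata(value):
    # Return the name of the stratum the sample belongs to.
    if value < -1:
        return "<-1"
    if value > 1:
        return ">1"
    return ">-1|<+1"


def stratify(samples, key):
    # Hypothetical helper (not part of DagSim):
    # builds {stratum_name: {column_name: [values, ...]}}.
    strata = {}
    for sample in samples:
        columns = strata.setdefault(check_strata(sample[key]), {})
        for name, value in sample.items():
            columns.setdefault(name, []).append(value)
    return strata


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = []
for _ in range(200):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)
    samples.append({"A": a, "B": b, "C": a + b, "D": (a + b) ** 2})

strata = stratify(samples, key="C")
for name, columns in strata.items():
    print(name, len(columns["C"]))
```

Each sample lands in exactly one stratum, so the stratum sizes add up to the total number of samples.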
Missing
To specify a Missing node, the user provides the following:
name (str): A name for the node.
underlying_value (Node): The node that will eventually have missing values.
index_node (Node): A Node node that provides the indices of the entries that will go missing: True to consider the entry as missing and False to keep it.
visible (bool) (Optional): Defaults to True, which shows the node when drawing the graph. False hides the node in the graph.
We decided on this way of defining the node to keep the processes of specifying the indices of the missing entries and removing the corresponding values separate.
Note that the data with the missing entries are saved as the output of the Missing node itself rather than that of the underlying_value node.
The output of the latter is the complete data without any missing entries. If you wish to discard the complete data, you can use the observed=False argument when defining the underlying_value node.
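The separation between "which entries go missing" and "removing the values" can be sketched in a few lines of plain Python. The helper apply_missing below is a hypothetical name used only for illustration and is not DagSim's implementation. It takes the complete values and a list of boolean flags, and returns a masked copy while leaving the complete data untouched.

```python
import math


def apply_missing(values, is_missing):
    # Hypothetical helper (not part of DagSim): returns a copy of `values`
    # in which every flagged entry is replaced by NaN.
    return [float("nan") if flag else value
            for value, flag in zip(values, is_missing)]


complete = [0.3, -1.2, 0.8, 2.1]       # plays the role of underlying_value
index = [False, True, False, True]     # True marks an entry as missing
observed = apply_missing(complete, index)

print(complete)
print(observed)
print(sum(math.isnan(v) for v in observed), "entries are missing")
```

Keeping the two lists separate mirrors the design described above: the complete data and the data with missing entries both remain available.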
In the following, we explore how you can simulate missing values according to the three types of missing data models defined in Rubin (1976). Here, the observed data are collectively denoted by \(Y_\mathrm{obs}\), and the missing, would-have-been, data are collectively denoted as \(Y_\mathrm{mis}\), and \(\psi\) refers to the parameters of the missing data model.
Missing Completely At Random (MCAR)
In this case, the missingness pattern is random and the probability of an entry going missing, \(\Pr(M=0)\), is independent of any missing or non-missing values of other variables in the data-generating process. In other words,
\[\Pr(M=0 \mid Y_\mathrm{obs}, Y_\mathrm{mis}, \psi) = \Pr(M=0 \mid \psi)\]
import dagsim.base as ds
import numpy as np
underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
index_node = ds.Node(name="index_node", function=np.random.randint, kwargs={"low":0, "high":2})
MCAR = ds.Missing(name="MCAR", underlying_value=underlying_value, index_node=index_node)
list_nodes = [underlying_value, index_node, MCAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MCAR")
data = my_graph.simulate(num_samples=10, csv_name="MCAR")
graph:
  python_file: hello_world_functions.py
  nodes:
    underlying_value:
      function: numpy.random.normal
    index_node:
      function: numpy.random.randint(0,2)
    MCAR:
      underlying_value: underlying_value
      index_node: index_node
instructions:
  simulation:
    csv_name: parser
    num_samples: 10
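As a rough sanity check of what MCAR means, the following plain-Python sketch (standard library only, independent of DagSim) draws values and coin-flip missingness flags, then compares the fraction of flagged entries among negative and positive values. The variable names and the sample size are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000

values = [rng.gauss(0, 1) for _ in range(n)]
# Fair coin flip per entry. random.randint includes both endpoints,
# unlike numpy.random.randint, whose upper bound is exclusive.
is_missing = [rng.randint(0, 1) == 1 for _ in range(n)]


def missing_rate(keep):
    # Fraction of flagged entries among the values accepted by `keep`.
    flags = [flag for value, flag in zip(values, is_missing) if keep(value)]
    return sum(flags) / len(flags)


rate_negative = missing_rate(lambda v: v <= 0)
rate_positive = missing_rate(lambda v: v > 0)
print(f"missing rate | value <= 0: {rate_negative:.3f}")
print(f"missing rate | value >  0: {rate_positive:.3f}")
```

Because the flags ignore the values entirely, the two rates should agree up to sampling noise.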
Missing At Random (MAR)
In this case, the probability of an entry going missing depends on other observed values in the model, but does not depend on any unobserved quantities:
\[\Pr(M=0 \mid Y_\mathrm{obs}, Y_\mathrm{mis}, \psi) = \Pr(M=0 \mid Y_\mathrm{obs}, \psi)\]
Here, \(\Pr(M=0)\) depends only on the observed value of \(Y_\mathrm{obs}\).
import dagsim.base as ds
import numpy as np


def get_index(Y_observed):
    # Flag the entry as missing (1) when the observed covariate is positive.
    val = 0
    if Y_observed > 0:
        val = 1
    return val


underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
Y_observed = ds.Node(name="Y_observed", function=np.random.normal)
index_node = ds.Node(name="index_node", function=get_index, kwargs={"Y_observed": Y_observed})
MAR = ds.Missing(name="MAR", underlying_value=underlying_value, index_node=index_node)

list_nodes = [underlying_value, index_node, Y_observed, MAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MAR")
data = my_graph.simulate(num_samples=10, csv_name="MAR")
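The defining property of MAR can be checked with a plain-Python sketch that does not use DagSim. It applies the same rule as get_index above and then compares missingness rates across groups. The tuple layout, the helper missing_rate, and the sample size are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000


def get_index(y_observed):
    # Same rule as above: flag the entry when the observed covariate is positive.
    return 1 if y_observed > 0 else 0


rows = []
for _ in range(n):
    value = rng.gauss(0, 1)        # plays the role of underlying_value
    y_observed = rng.gauss(0, 1)   # fully observed covariate
    rows.append((value, y_observed, get_index(y_observed)))


def missing_rate(keep):
    # Fraction of flagged entries among the rows accepted by `keep`.
    flags = [flag for value, y_obs, flag in rows if keep(value, y_obs)]
    return sum(flags) / len(flags)


# Missingness is fully explained by the observed covariate ...
rate_given_obs_pos = missing_rate(lambda v, y: y > 0)
rate_given_obs_neg = missing_rate(lambda v, y: y <= 0)
# ... and, once that covariate is fixed, does not vary with the value itself.
rate_value_pos = missing_rate(lambda v, y: y > 0 and v > 0)
rate_value_neg = missing_rate(lambda v, y: y > 0 and v <= 0)
print(rate_given_obs_pos, rate_given_obs_neg, rate_value_pos, rate_value_neg)
```

Since the rule here is deterministic in Y_observed, conditioning on the observed covariate removes all remaining dependence on the value that goes missing.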
Missing Not At Random (MNAR)
In the MNAR case, the probability that an entry is missing depends not only on observed quantities but also on missing ones, so the conditional probability does not simplify:
\[\Pr(M=0 \mid Y_\mathrm{obs}, Y_\mathrm{mis}, \psi)\]
Here, \(\Pr(M=0)\) depends on the observed value of \(Y_\mathrm{obs}\) and on the possibly unobserved, would-have-been value of \(Y_\mathrm{mis}\).
import dagsim.base as ds
import numpy as np


def get_index(Y_observed, Y_missing):
    # Flag the entry as missing (1) based on both the observed and the unobserved quantity.
    val = 0
    if Y_observed + Y_missing > 0.5:
        val = 1
    return val


underlying_value = ds.Node(name="underlying_value", function=np.random.normal)
Y_observed = ds.Node(name="Y_observed", function=np.random.normal)
Y_missing = ds.Node(name="Y_missing", function=np.random.normal)
# Note: index_node_Y is simulated as part of the graph but is not passed to the MNAR node below.
index_node_Y = ds.Node(name="index_node_Y", function=np.random.randint, kwargs={"low": 0, "high": 2})
index_node = ds.Node(name="index_node", function=get_index, kwargs={"Y_observed": Y_observed, "Y_missing": Y_missing})
MNAR = ds.Missing(name="MNAR", underlying_value=underlying_value, index_node=index_node)

list_nodes = [underlying_value, Y_observed, Y_missing, index_node, index_node_Y, MNAR]
my_graph = ds.Graph(list_nodes=list_nodes, name="MNAR")
data = my_graph.simulate(num_samples=10, csv_name="MNAR")
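To see why this mechanism is not MAR, the plain-Python sketch below (standard library only, independent of DagSim) applies the same get_index rule and then holds the observed quantity in a fixed range while splitting on the unobserved one. The tuple layout, the helper missing_rate, and the sample size are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000


def get_index(y_observed, y_missing):
    # Same rule as in the example above.
    return 1 if y_observed + y_missing > 0.5 else 0


rows = []
for _ in range(n):
    y_observed = rng.gauss(0, 1)
    y_missing = rng.gauss(0, 1)   # would not be available in a real data set
    rows.append((y_observed, y_missing, get_index(y_observed, y_missing)))


def missing_rate(keep):
    # Fraction of flagged entries among the rows accepted by `keep`.
    flags = [flag for y_obs, y_mis, flag in rows if keep(y_obs, y_mis)]
    return sum(flags) / len(flags)


# Restrict to y_observed <= 0, then split on the unobserved quantity.
rate_low = missing_rate(lambda y_obs, y_mis: y_obs <= 0 and y_mis <= 0)
rate_high = missing_rate(lambda y_obs, y_mis: y_obs <= 0 and y_mis > 0)
print(f"missing rate | y_obs <= 0, y_mis <= 0: {rate_low:.3f}")
print(f"missing rate | y_obs <= 0, y_mis >  0: {rate_high:.3f}")
```

Even within the same range of the observed quantity, the missingness rate still changes with the unobserved one, which is what distinguishes MNAR from MAR.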