def log_likelihood_of_character(
    tree: CassiopeiaTree,
    character: int,
    use_internal_character_states: bool,
    mutation_probability_function_of_time: Callable[[float], float],
    missing_probability_function_of_time: Callable[[float], float],
    stochastic_missing_probability: float,
    implicit_root_branch_length: float,
) -> float:
    """Calculates the log likelihood of a given character on the tree.

    Calculates the log likelihood of a tree given the states at a given
    character in the leaves using Felsenstein's Pruning Algorithm, which sets
    up a recursive relation between the likelihoods of states at nodes for
    this character. The likelihood L(s, n) at a given state s at a given node
    n is:

    L(s, n) = Π_{n'}(Σ_{s'}(P(s'|s) * L(s', n')))

    for all n' that are children of n, and s' in the state space, with
    P(s'|s) being the transition probability from s to s'. That is, the
    likelihood at a given state at a given node is the product, over the
    children, of the likelihoods of the states at this character at each
    child, weighted by the probability of the current state transitioning to
    those states. This includes the missing state, as specified by
    `tree.missing_state_indicator`.

    We assume here that mutations are irreversible. Once a character mutates
    to a certain state, that character cannot mutate again, except that any
    non-missing state can mutate to a missing state.
    `mutation_probability_function_of_time` is expected to be a function that
    determines the probability of a mutation occurring given an amount of
    time. To determine the probability of acquiring a given (non-missing)
    state once a mutation occurs, the priors of the tree are used. Likewise,
    `missing_probability_function_of_time` determines the probability of a
    missing data event occurring given an amount of time.

    The user can choose to use the character states annotated at internal
    nodes. If these are not used, then the likelihood is marginalized over all
    possible internal character states.

    If the actual internal states are not provided, then the root is assumed
    to have the unmutated state at each character. Additionally, it is assumed
    that there is a single branch leading from the root that represents the
    root's lifetime. If this branch does not exist and
    `use_internal_character_states` is set to False, then this branch is added
    with branch length given by `implicit_root_branch_length`.

    Args:
        tree: The tree on which to calculate the likelihood
        character: The index of the character to calculate the likelihood of
        use_internal_character_states: Indicates if internal node
            character states should be assumed to be specified exactly
        mutation_probability_function_of_time: The function defining the
            probability of a lineage acquiring a mutation within a given time
        missing_probability_function_of_time: The function defining the
            probability of a lineage acquiring heritable missing data within a
            given time
        stochastic_missing_probability: The probability that a cell/character
            pair acquires stochastic missing data at the end of the lineage
        implicit_root_branch_length: The length of the implicit root branch.
            Used if the implicit root needs to be added

    Returns:
        The log likelihood of the tree on one character
    """

    # This dictionary uses a nested dictionary structure. Each node is mapped
    # to a dictionary storing the log likelihood for each possible state
    # (states that have non-0 likelihood)
    likelihoods_at_nodes = {}

    # Perform a DFS to propagate the likelihood from the leaves
    for n in tree.depth_first_traverse_nodes(postorder=True):
        state_at_n = tree.get_character_states(n)
        # If states are observed, their likelihood is 1, i.e. their log
        # likelihood is 0
        if tree.is_leaf(n):
            likelihoods_at_nodes[n] = {state_at_n[character]: 0}
            continue

        possible_states = []
        # If internal character states are to be used, then the likelihoods
        # of all other states are ignored. Otherwise, marginalize over
        # only states that do not break irreversibility, as all states that
        # do have likelihood of 0
        if use_internal_character_states:
            possible_states = [state_at_n[character]]
        else:
            child_possible_states = []
            for c in [
                set(likelihoods_at_nodes[child]) for child in tree.children(n)
            ]:
                if tree.missing_state_indicator not in c and "&" not in c:
                    child_possible_states.append(c)
            # "&" stands in for any non-missing state (including uncut), and
            # is a possible state when all children are missing, as any
            # state could have occurred at the parent if all missing data
            # events occurred independently. Used to avoid marginalizing
            # over the entire state space.
            if child_possible_states == []:
                possible_states = [
                    "&",
                    tree.missing_state_indicator,
                ]
            else:
                possible_states = list(
                    set.intersection(*child_possible_states))
                if 0 not in possible_states:
                    possible_states.append(0)

        # This stores the log likelihood of each possible state at the
        # current node
        likelihoods_per_state_at_n = {}

        # We calculate the likelihood of the states at the current node
        # according to the recurrence relation. For each state, we marginalize
        # over the likelihoods of the states that it could transition to in
        # the daughter nodes
        for s in possible_states:
            likelihood_for_s = 0
            for child in tree.children(n):
                likelihoods_for_s_marginalize_over_s_ = []
                for s_ in likelihoods_at_nodes[child]:
                    likelihood_s_ = (log_transition_probability(
                        tree,
                        character,
                        s,
                        s_,
                        tree.get_branch_length(n, child),
                        mutation_probability_function_of_time,
                        missing_probability_function_of_time,
                    ) + likelihoods_at_nodes[child][s_])
                    # Here we take into account the probability of
                    # stochastic missing data
                    if tree.is_leaf(child):
                        if (s_ == tree.missing_state_indicator and
                                s != tree.missing_state_indicator):
                            likelihood_s_ = np.log(
                                np.exp(likelihood_s_) +
                                (1 - missing_probability_function_of_time(
                                    tree.get_branch_length(n, child))) *
                                stochastic_missing_probability)
                        if s_ != tree.missing_state_indicator:
                            likelihood_s_ += np.log(
                                1 - stochastic_missing_probability)
                    likelihoods_for_s_marginalize_over_s_.append(likelihood_s_)
                likelihood_for_s += scipy.special.logsumexp(
                    np.array(likelihoods_for_s_marginalize_over_s_))
            likelihoods_per_state_at_n[s] = likelihood_for_s

        likelihoods_at_nodes[n] = likelihoods_per_state_at_n

    # If we are not to use the internal state annotations explicitly,
    # then we assume an implicit root where each state is the uncut state (0)
    # Thus, we marginalize over the transition from 0 in the implicit root
    # to all non-0 states in its child
    if not use_internal_character_states:
        # If the implicit root does not exist in the tree, then we impose it,
        # with the length of the branch being specified as
        # `implicit_root_branch_length`. Otherwise, we just use the existing
        # root with a singleton child as the implicit root
        if len(tree.children(tree.root)) != 1:
            likelihood_contribution_from_each_root_state = [
                log_transition_probability(
                    tree,
                    character,
                    0,
                    s_,
                    implicit_root_branch_length,
                    mutation_probability_function_of_time,
                    missing_probability_function_of_time,
                ) + likelihoods_at_nodes[tree.root][s_]
                for s_ in likelihoods_at_nodes[tree.root]
            ]
            likelihood_at_implicit_root = scipy.special.logsumexp(
                likelihood_contribution_from_each_root_state)

            return likelihood_at_implicit_root

        else:
            # Here we account for the edge case in which all of the leaves are
            # missing, in which case the root will have "&" in place of 0. The
            # likelihood at "&" will have the same likelihood as 0 based on
            # the transition rules regarding "&". As "&" is a placeholder when
            # the state is unknown, this can be thought of as realizing "&"
            # as 0.
            if 0 not in likelihoods_at_nodes[tree.root]:
                return likelihoods_at_nodes[tree.root]["&"]
            else:
                # Otherwise, we return the likelihood of the 0 state at the
                # existing implicit root
                return likelihoods_at_nodes[tree.root][0]

    # If we use the internal state annotations explicitly, then we return
    # the likelihood of the state annotated at this character at the root
    else:
        return list(likelihoods_at_nodes[tree.root].values())[0]
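

# Illustrative sketch only, not part of the library API. It shows the pruning
# recurrence L(s, n) = Π_{n'}(Σ_{s'}(P(s'|s) * L(s', n'))) from the docstring
# of `log_likelihood_of_character` in log space, on a toy model. Everything
# here is an assumption made for illustration: the names `_toy_logsumexp` and
# `_toy_pruning_log_likelihood` are hypothetical, the tree is a plain dict,
# there are only two states (0 = uncut, 1 = mutated), there is no missing
# data, and every branch mutates 0 -> 1 with the same probability.
import math


def _toy_logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        # Every term has probability 0
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def _toy_pruning_log_likelihood(children, leaf_states, root, p_mutate):
    """Log likelihood of the leaf states, with the root assumed uncut.

    Hypothetical arguments: `children` maps each internal node to a list of
    its children, `leaf_states` maps each leaf to 0 or 1, and `p_mutate` is
    the per-branch probability of the 0 -> 1 mutation.
    """
    # P(s'|s), keyed by (s, s'). 1 -> 0 is impossible (irreversibility).
    transition = {
        (0, 0): 1 - p_mutate,
        (0, 1): p_mutate,
        (1, 0): 0.0,
        (1, 1): 1.0,
    }

    def log_p(s, s_):
        p = transition[(s, s_)]
        return math.log(p) if p > 0 else -math.inf

    def log_l(node):
        # An observed leaf state has likelihood 1, i.e. log likelihood 0
        if node in leaf_states:
            return {leaf_states[node]: 0.0}
        child_tables = [log_l(child) for child in children[node]]
        table = {}
        for s in (0, 1):
            total = 0.0
            # Product over children becomes a sum of logs; the inner sum
            # over child states s' becomes a logsumexp
            for child_table in child_tables:
                total += _toy_logsumexp(
                    [log_p(s, s_) + ll for s_, ll in child_table.items()])
            table[s] = total
        return table

    return log_l(root)[0]


# Usage: a root with two leaves, one mutated and one uncut, at p_mutate = 0.25
# gives a likelihood of 0.25 * 0.75:
# _toy_pruning_log_likelihood({"r": ["a", "b"]}, {"a": 1, "b": 0}, "r", 0.25)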


def calculate_parsimony(
    tree: CassiopeiaTree,
    infer_ancestral_characters: bool = False,
    treat_missing_as_mutation: bool = False,
) -> int:
    """
    Calculates the number of mutations that have occurred on a tree.

    Calculates the parsimony, defined as the number of character/state
    mutations that occur on edges of the tree, from the character state
    annotations at the nodes. A mutation is said to have occurred on an
    edge if a state is present at a character at the child node and this
    state is not in the parent node.

    If `infer_ancestral_characters` is set to True, then the internal
    nodes' character states are inferred by Camin-Sokal Parsimony from the
    current character states at the leaves. Use
    `tree.set_character_states_at_leaves` to use a different layer to infer
    ancestral states. Otherwise, the current annotations at the internal
    nodes are used. If `treat_missing_as_mutation` is set to True, then
    transitions from a non-missing state to a missing state are counted in
    the parsimony calculation. Otherwise, they are not included.

    Args:
        tree: The tree to calculate parsimony over
        infer_ancestral_characters: Whether to infer the ancestral
            character states of the tree
        treat_missing_as_mutation: Whether to treat missing states as
            mutations

    Returns:
        The number of mutations that have occurred on the tree

    Raises:
        TreeMetricError if the tree has not been initialized or if
            a node does not have character states initialized
    """

    if infer_ancestral_characters:
        tree.reconstruct_ancestral_characters()

    parsimony = 0

    if tree.get_character_states(tree.root) == []:
        raise TreeMetricError(
            "Character states empty at internal node. Annotate"
            " character states or infer ancestral characters by"
            " setting infer_ancestral_characters=True.")

    for u, v in tree.depth_first_traverse_edges():
        if tree.get_character_states(v) == []:
            if tree.is_leaf(v):
                raise TreeMetricError(
                    "Character states have not been initialized at leaves."
                    " Use set_character_states_at_leaves or populate_tree"
                    " with the character matrix that specifies the leaf"
                    " character states.")
            else:
                raise TreeMetricError(
                    "Character states empty at internal node. Annotate"
                    " character states or infer ancestral characters by"
                    " setting infer_ancestral_characters=True.")

        parsimony += len(
            tree.get_mutations_along_edge(u, v, treat_missing_as_mutation))

    return parsimony
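

# Illustrative sketch only, not part of the library API. It shows the edge-wise
# counting described in the docstring of `calculate_parsimony` on plain Python
# data. Everything here is an assumption made for illustration: the name
# `_toy_parsimony` is hypothetical, the missing-state value -1 is assumed, and
# the per-edge rule (a differing child state counts, unless the child state is
# missing and `treat_missing_as_mutation` is False) is a simplified stand-in
# for `tree.get_mutations_along_edge`, whose exact behavior is not shown here.
def _toy_parsimony(edges, states, treat_missing_as_mutation=False, missing=-1):
    """Counts state changes along edges of a toy tree.

    Hypothetical arguments: `edges` is a list of (parent, child) pairs and
    `states` maps each node to its list of character states.
    """
    parsimony = 0
    for parent, child in edges:
        for parent_state, child_state in zip(states[parent], states[child]):
            if parent_state == child_state:
                continue
            # Transitions into the missing state are optional to count
            if child_state == missing and not treat_missing_as_mutation:
                continue
            parsimony += 1
    return parsimony


# Usage: with edges [("r", "a"), ("r", "b")] and states
# {"r": [0, 0], "a": [1, 0], "b": [0, -1]}, only the 0 -> 1 change on edge
# (r, a) is counted by default; the 0 -> -1 change on edge (r, b) is counted
# as well when treat_missing_as_mutation=True.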