Documentation

Core Functionality

class protein_inference.problem_network.ProblemNetwork(network)

Stores a networkx object with nodes for proteins and peptides and edges for PSMs (peptide-spectrum matches).

This object stores the protein-peptide PSM network along with several handy methods that support tagging, scoring and plotting functionalities.

network

A networkx graph storing the ProblemNetwork

Type

nx.Graph

get_edge_attribute_dict(attribute)

Returns a dictionary of all edge:value pairings for a particular attribute.

Parameters

attribute (string) – The name of the attribute for which to retrieve the edge:value pairings.

get_node_attribute_dict(attribute)

Returns a dictionary of all node:value pairings for a particular attribute.

Parameters

attribute (string) – The name of the attribute for which to retrieve the node:value pairings.

get_peptides()

Returns a list of peptides contained in the network.

get_proteins()

Returns a list of the proteins contained in the network.

pick_nodes(attribute, value)

Returns a list of all nodes with a particular attribute-value pairing.

Parameters
  • attribute (string) – The node attribute.

  • value (NA) – The value the attribute must be set to.

print_edges()

Uses pretty print to show all edges and their attributes.

print_nodes()

Uses pretty print to show all nodes and their attributes.

update_nodes(nodes, attribute, value)

Takes a list of nodes in the network and gives them a node attribute with a particular value.

Parameters
  • nodes (list) – A list of node ids (strings) to update.

  • attribute (string) – The name of the attribute to update.

  • value (NA) – The value to set the attribute to. Can be string or float.
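As a rough illustration of how the attribute helpers above behave, the sketch below mimics update_nodes, pick_nodes and get_node_attribute_dict on a plain dictionary. The dictionary stands in for the networkx graph, and the node ids and the "protein" and "score" attribute names are made up for the example; this is not the library's implementation.

```python
# Hypothetical stand-in for the node attributes of a ProblemNetwork:
# node id -> attribute dictionary.
nodes = {
    "P1": {"protein": 1},
    "P2": {"protein": 1},
    "pepA": {"protein": 0},
}


def update_nodes(nodes, node_ids, attribute, value):
    """Give every listed node the attribute with the given value."""
    for node_id in node_ids:
        nodes[node_id][attribute] = value


def pick_nodes(nodes, attribute, value):
    """Return the ids of all nodes whose attribute equals value."""
    return [n for n, attrs in nodes.items() if attrs.get(attribute) == value]


def get_node_attribute_dict(nodes, attribute):
    """Return node:value pairs for the nodes that carry the attribute."""
    return {n: attrs[attribute] for n, attrs in nodes.items() if attribute in attrs}


update_nodes(nodes, ["P1"], "score", 0.9)
print(pick_nodes(nodes, "protein", 1))
print(get_node_attribute_dict(nodes, "score"))
```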

class protein_inference.network_grapher.NetworkGrapher

This class generates a pyvis graph representing a PSM network.

This class uses information encoded in the network component of the ProblemNetwork class. It creates an interactive network representation using the pyvis package. Several colouring options are available such as by status, groups or score.

colour_by_group(pn)

Updates the node colour attributes in a ProblemNetwork's network attribute according to the “allocated” and “major” attributes of those nodes.

Parameters

pn (ProblemNetwork) – The ProblemNetwork to update.

colour_by_score(pn)

Updates the node colour attributes in a ProblemNetwork's network attribute according to the score attributes of those nodes.

Parameters

pn (ProblemNetwork) – The ProblemNetwork to update.

colour_by_status(pn)

Updates the node colour attributes in a ProblemNetwork's network attribute according to the status attributes of those nodes.

Parameters

pn (ProblemNetwork) – The ProblemNetwork to update.

draw(problem_network, by='status', name='nx', size=[800, 800])

Draw creates the interactive network visualization from the ProblemNetwork object using pyvis and the colouring methods in this class.

Parameters
  • problem_network (ProblemNetwork) – An instance of the problem network object

  • by (string) – A string indicating the colouring method (“status”, “score” or “group”)

  • name (string) – A string that will be used to generate the file name when the visualization is saved as an HTML file.

  • size (list) – A list of length two indicating the size in pixels of the resulting visualization

Returns

A static HTML object that is also written to the current working directory.

class protein_inference.protein_inference_runner.ProteinInferenceRunner

The main class that drives the protein inference workflow.

This class groups the main coordinating functions that organise a protein inference workflow including PSM processing, graph generation, splitting, tagging, scoring, fdr calculations and output file writing.

To instantiate:

>>> runner = ProteinInferenceRunner()

get_output(psms, decoy=0, scoring_method=<class 'protein_inference.reprisal.greedy_algorithm.GreedyAlgorithm'>, psms_q_value_threshold=0.01)

Processes either target or decoy data.

Preprocesses the data, then generates, splits and tags networks. Scores proteins, merges proteins with identical neighbors and retrieves protein and peptide tables.

Parameters
  • psms (pandas.DataFrame) – A pandas dataframe containing a psms table.

  • decoy (bool) – A boolean indicating whether decoys should be removed from the input table.

  • scoring_method (scoring method object) – A valid scoring method such as those in inference.scorers

Returns

  • protein_table (pandas.DataFrame) – A pandas dataframe containing proteins and inference metrics.

  • peptide_table (pandas.DataFrame) – A pandas dataframe containing peptides and assignments.

  • solved_networks (list) – A list of problem networks with scores and node labels.

parallel_apply(pns, func)

Uses the multiprocessing module in Python to apply a function to every problem network in a list of networks.

Parameters
  • pns (list) – A list of ProblemNetworks.

  • func (function or method) – A function or method that acts on a ProblemNetwork.

Returns

pns (list) – A list of processed ProblemNetworks.
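The pattern described for parallel_apply can be sketched with the pool API from the standard library. This is only an illustration under stated assumptions: it uses the thread-backed ThreadPool from multiprocessing.pool so that it runs anywhere without a main-module guard, whereas a process-based version would swap in multiprocessing.Pool, which has the same map interface. The workers argument, the count_nodes function and the toy networks are hypothetical.

```python
from multiprocessing.pool import ThreadPool


def parallel_apply(items, func, workers=2):
    """Apply func to every item and return the results in input order."""
    with ThreadPool(processes=workers) as pool:
        return pool.map(func, items)


def count_nodes(network):
    """Toy function acting on one network (here just a list of node ids)."""
    return len(network)


toy_networks = [["P1", "pepA"], ["P2", "pepB", "pepC"]]
print(parallel_apply(toy_networks, count_nodes))
```

Because map preserves input order, the processed networks line up with the list that was passed in.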

run(target_path, decoy_path, output_directory, scoring_method=<class 'protein_inference.reprisal.greedy_algorithm.GreedyAlgorithm'>, psms_q_value_threshold=0.01)

This method calls the protein inference workflow.

This method locates two psm tables, corresponding to target and decoy matches, and uses them to generate output peptide and protein tables with scores and false discovery rates indicating which proteins are inferred and which peptides have been assigned to each protein (as evidence).

Parameters
  • target_path (string) – A path to a valid psm table with target matches.

  • decoy_path (string) – A path to a valid psm table with decoy matches.

  • output_directory (string) – A path to the output directory.

  • scoring_method (scoring method object) – A valid scoring method such as those in inference.scorers

Returns

  • reprisal.target.proteins.csv (csv) – A file in the output directory. Describes the inferences made.

  • reprisal.target.peptides.csv (csv) – A file in the output directory. Describes the inferences made.

  • target_networks.p (pickle file) – A file that can be loaded using pickle to retrieve the target networks for visualization.
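Reading target_networks.p back is an ordinary pickle round trip. The sketch below writes and reloads a stand-in list so that it runs without the library installed; the dictionaries and the temporary directory are placeholders, not the real output. Unpickling the real file requires the protein_inference package to be importable, because pickle stores class references rather than class definitions.

```python
import os
import pickle
import tempfile

# Stand-in for the list of solved networks; plain dicts are used here so the
# example runs without the library installed.
networks = [{"nodes": ["P1", "pepA"]}, {"nodes": ["P2", "pepB"]}]

# The temporary directory is a placeholder for the real output directory.
with tempfile.TemporaryDirectory() as output_directory:
    path = os.path.join(output_directory, "target_networks.p")
    with open(path, "wb") as handle:
        pickle.dump(networks, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

print(len(loaded))
```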

write_tables(target_protein_table, target_peptide_table, decoy_protein_table, decoy_peptide_table, output_directory)

Writes tables to csvs.

Parameters
  • target_protein_table (pandas.DataFrame) – A pandas dataframe containing proteins and inference metrics.

  • target_peptide_table (pandas.DataFrame) – A pandas dataframe containing peptides and assignments.

  • decoy_protein_table (pandas.DataFrame) – A pandas dataframe containing proteins and inference metrics.

  • decoy_peptide_table (pandas.DataFrame) – A pandas dataframe containing peptides and assignments.

  • output_directory (string) – A path to the output directory.

Returns

  • reprisal.target.proteins.csv (csv) – A file in the output directory. Describes the inferences made.

  • reprisal.target.peptides.csv (csv) – A file in the output directory. Describes the inferences made.

  • reprisal.decoy.proteins.csv (csv) – A file in the output directory. Describes the inferences made.

  • reprisal.decoy.peptides.csv (csv) – A file in the output directory. Describes the inferences made.

class protein_inference.table_maker.TableMaker

A class that groups functions required to generate output tables.

get_protein_table(pn)

Processes a ProblemNetwork to retrieve the corresponding protein table.

get_protein_tables(pns)

Processes a list of ProblemNetworks to retrieve the corresponding protein tables in a list.

get_system_protein_table(pns)

Processes a list of ProblemNetworks to retrieve the corresponding protein table.

get_peptide_table(pn)

Processes a ProblemNetwork to retrieve the corresponding peptide table.

get_peptide_tables(pns)

Processes a list of ProblemNetworks to retrieve the corresponding peptide tables in a list.

get_system_peptide_table(pns)

Processes a list of ProblemNetworks to retrieve the corresponding peptide table.

_get_edge_list(pn)

Retrieves a pandas dataframe encoding the edges from peptide to protein for each edge in a ProblemNetwork.

find_molecule(pns, molecule)

Finds the molecule (a string corresponding to a protein id or modified peptide sequence) in the list of ProblemNetworks and returns the corresponding network.

Processing

class protein_inference.processing.processed_psms.ProcessedPSMs(df)

A class that bundles a psm table in a pandas dataframe with some useful methods.

Parameters

df (pandas.DataFrame) – A pandas dataframe containing the psms list.

get_proteins()

Returns a list of proteins contained in the dataframe attribute.

get_peptides()

Returns a list of peptides contained in the dataframe attribute.

get_shape()

Returns the shape of the dataframe attribute in a length 2 tuple.

class protein_inference.processing.psms_preprocessor.PSMsPreprocessor(df, partial=0, decoy=0)

A class that collects methods for preprocessing a psms list contained in a pandas dataframe.

Parameters
  • PSMs (pandas.DataFrame) – A pandas dataframe containing the psms list.

  • partial (int) – The number of psms to retrieve if a partial processing is to be done. 0 implies complete processing.

  • decoy (bool) – 0 if input needs to be processed to remove decoy proteins.

get_processed_psms(df, partial=0, decoy=0)

Applies a series of preprocessing methods to the PSMs attribute and returns the preprocessed data as a ProcessedPSMs object.

_preprocess_q_values(df)

Returns only the psms with q_value < 0.01.

_preprocess_columns(df)

Returns the highest scoring/lowest pep/lowest q value associated with each psm.

_preprocess_duplicates(df)

Splits rows that have duplicate protein ids into separate rows in the dataframe.

_preprocess_decoys(df)

Removes rows with a protein id prefixed by decoy.

_preprocess_column_names(df)

Renames columns where percolator naming is used to more generic column names.

_preprocess_no_q_values(df)

If no q values are present in the psms list, set all q values to negative one.

Warning: _preprocess_no_q_values(df) allows use of potentially bad psms because no q value information is available for filtering matches.
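The preprocessing steps listed above can be sketched on a list of dictionaries. This is an illustration only: the column names, the semicolon separator for rows carrying several protein ids, and the decoy_ prefix are all assumptions made for the example, and the library itself operates on a pandas dataframe.

```python
def split_protein_ids(rows, sep=";"):
    """Expand a row listing several protein ids into one row per protein."""
    expanded = []
    for row in rows:
        for protein_id in row["protein_id"].split(sep):
            expanded.append({**row, "protein_id": protein_id})
    return expanded


def remove_decoys(rows, prefix="decoy_"):
    """Drop rows whose protein id starts with the (assumed) decoy prefix."""
    return [row for row in rows if not row["protein_id"].startswith(prefix)]


def filter_q_values(rows, threshold=0.01):
    """Keep only rows with a q value strictly below the threshold."""
    return [row for row in rows if row["q_value"] < threshold]


rows = [
    {"peptide": "AAK", "protein_id": "P1;P2", "q_value": 0.001},
    {"peptide": "CCR", "protein_id": "decoy_P3", "q_value": 0.002},
    {"peptide": "DDK", "protein_id": "P4", "q_value": 0.2},
]

processed = filter_q_values(remove_decoys(split_protein_ids(rows)))
for row in processed:
    print(row["peptide"], row["protein_id"])
```

Splitting before removing decoys means a row that mixes target and decoy ids keeps its target part.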

class protein_inference.processing.psms_network_splitter.PSMsNetworkSplitter(network)

A class that splits a networkx graph generated from a psm table.

Parameters

network (networkx.Graph) – A networkx graph generated by PSMsNetworkGenerator.

split_networks()

Splits the networkx graph into connected components, converts them into ProblemNetworks and returns the results as a list.
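Splitting into connected components can be sketched without networkx using a breadth-first search over an adjacency dictionary. The adjacency data below is invented, and the sketch returns plain sets rather than ProblemNetworks; networkx offers the same operation as nx.connected_components.

```python
from collections import deque


def connected_components(adjacency):
    """Return the connected components of an undirected graph as sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Toy protein-peptide graph: P1 and P2 share pepB, P3 stands alone.
adjacency = {
    "P1": {"pepA", "pepB"},
    "P2": {"pepB"},
    "P3": {"pepC"},
    "pepA": {"P1"},
    "pepB": {"P1", "P2"},
    "pepC": {"P3"},
}

components = connected_components(adjacency)
print(len(components))
```

Each component can then be scored independently, since no peptide evidence is shared across components.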

class protein_inference.processing.psms_network_generator.PSMsNetworkGenerator(PSMs)

A class that generates a networkx graph from a psm table.

Parameters

PSMs (ProcessedPSMs) – A ProcessedPSMs instance.

generate_network()

Generates the networkx graph with appropriate node types, labels and edge attributes.
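A minimal sketch of the graph construction, assuming one row per peptide-protein match: each row contributes a protein node, a peptide node and an edge between them. The row keys and the "protein" and "score" attribute names are hypothetical, and plain dictionaries stand in for the networkx graph.

```python
def build_network(rows):
    """Build node and edge attribute dictionaries from PSM rows."""
    nodes = {}
    edges = {}
    for row in rows:
        # Node type is recorded as an attribute: 1 for proteins, 0 for peptides.
        nodes[row["protein_id"]] = {"protein": 1}
        nodes[row["peptide"]] = {"protein": 0}
        # A repeated peptide-protein pair overwrites the earlier edge.
        edges[(row["peptide"], row["protein_id"])] = {"score": row["score"]}
    return nodes, edges


rows = [
    {"peptide": "AAK", "protein_id": "P1", "score": 0.9},
    {"peptide": "AAK", "protein_id": "P2", "score": 0.9},
    {"peptide": "CCR", "protein_id": "P2", "score": 0.7},
]

nodes, edges = build_network(rows)
print(sorted(nodes))
print(len(edges))
```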

Inference

class protein_inference.inference.false_discovery_rate_calculator.FalseDiscoveryRateCalculator

Packages related functions for estimating false discovery rates and q-values.

FDR(threshold, target, decoy)

Calculates the false discovery rate at a given threshold based on the score distributions in the target and decoy pandas tables.

Parameters
  • threshold (float) – A number between 0 and 1.

  • target (pandas.DataFrame) – A target protein table produced by ProteinInferenceRunner().get_output()

  • decoy (pandas.DataFrame) – A decoy protein table produced by ProteinInferenceRunner().get_output()
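One common target-decoy estimate of the FDR at a threshold is the number of decoys scoring at or above it divided by the number of targets scoring at or above it. The sketch below implements that estimate on plain lists of scores. It assumes higher scores are better, and the library's exact formula and column names may differ.

```python
def fdr(threshold, target_scores, decoy_scores):
    """Estimate the FDR as decoys / targets among scores >= threshold."""
    n_target = sum(1 for score in target_scores if score >= threshold)
    n_decoy = sum(1 for score in decoy_scores if score >= threshold)
    if n_target == 0:
        # No targets pass, so there is nothing to call false.
        return 0.0
    # Cap at 1 because a rate above 100% is not meaningful.
    return min(1.0, n_decoy / n_target)


target_scores = [0.99, 0.95, 0.9, 0.6, 0.4]
decoy_scores = [0.7, 0.5, 0.3]

print(fdr(0.9, target_scores, decoy_scores))
print(fdr(0.5, target_scores, decoy_scores))
```

At a threshold of 0.5 four targets and two decoys pass, giving an estimate of 0.5.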

tag_FDR(target, decoy, entrapment=False)

Creates an FDR column in the target table. If entrapment is set to true it changes the column name (used in benchmarking.)

Parameters
  • target (pandas.DataFrame) – A target protein table produced by ProteinInferenceRunner().get_output()

  • decoy (pandas.DataFrame) – A decoy protein table produced by ProteinInferenceRunner().get_output()

  • entrapment (bool) – Indicates whether or not to prefix the FDR column with “entrapment”

tag_q_value(target, decoy, entrapment=False)

Create a q-value column for the target table based on the score distributions in the target and decoy pandas tables.

Parameters
  • target (pandas.DataFrame) – A target protein table produced by ProteinInferenceRunner().get_output()

  • decoy (pandas.DataFrame) – A decoy protein table produced by ProteinInferenceRunner().get_output()

  • entrapment (bool) – Indicates whether or not to prefix the q-value column with “entrapment”
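A q-value is conventionally the smallest FDR at which an item would still be accepted, which makes it monotone in the score even when the raw FDR estimate is not. The sketch below derives q-values from a simple target-decoy FDR estimate. Both functions are illustrative assumptions (higher scores are better, plain lists instead of pandas tables) rather than the library's code.

```python
def fdr(threshold, target_scores, decoy_scores):
    """Estimate the FDR as decoys / targets among scores >= threshold."""
    n_target = sum(1 for score in target_scores if score >= threshold)
    n_decoy = sum(1 for score in decoy_scores if score >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


def q_values(target_scores, decoy_scores):
    """Return one q-value per target score, in input order."""
    q_by_score = {}
    running_min = 1.0
    # Walking from the lowest score upwards, the running minimum is the
    # smallest FDR over all thresholds that still accept the current score.
    for score in sorted(set(target_scores)):
        running_min = min(running_min, fdr(score, target_scores, decoy_scores))
        q_by_score[score] = running_min
    return [q_by_score[score] for score in target_scores]


target_scores = [0.99, 0.95, 0.9, 0.6, 0.4]
decoy_scores = [0.97, 0.5, 0.3]

print(q_values(target_scores, decoy_scores))
```

In this toy data the raw FDR at 0.95 is 0.5, but the q-value is 0.25 because the more permissive threshold 0.6 already achieves that rate.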

class protein_inference.inference.protein_merger.ProteinMerger

A class responsible for identifying indistinguishable proteins inside a problem network and merging those nodes.

run(pn)

Merges proteins with the same peptide connections and returns an updated problem network.

_list_to_string(l)

Utility for converting a list to a string separated by semicolons.

_string_to_list(l)

Utility for converting a string separated by semicolons to a list.

get_mapping(pn)

Identifies proteins with same peptide neighbors, returning a dataframe of peptide neighbour list, protein list pairs.

get_named_proteins(pn)

Chooses proteins from the protein lists with the same peptide neighbours to keep (by sort order). Returns a dataframe indicating these.
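The merging idea can be sketched by grouping proteins on their peptide neighbour sets. The sketch assumes a dictionary from protein id to peptide set as input and returns plain dictionaries; the "members" and "peptides" keys are invented for the example, while the semicolon-joined names and the choice of the first protein in sort order follow the descriptions above.

```python
from collections import defaultdict


def merge_indistinguishable(protein_to_peptides):
    """Group proteins that share exactly the same set of peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        # A frozenset is hashable, so it can serve as the grouping key.
        groups[frozenset(peptides)].append(protein)

    merged = {}
    for peptides, proteins in groups.items():
        proteins = sorted(proteins)
        # The first protein in sort order names the merged node.
        merged[proteins[0]] = {
            "members": ";".join(proteins),
            "peptides": sorted(peptides),
        }
    return merged


protein_to_peptides = {
    "P2": {"pepA", "pepB"},
    "P1": {"pepB", "pepA"},
    "P3": {"pepB"},
}

merged = merge_indistinguishable(protein_to_peptides)
print(merged["P1"]["members"])
```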

class protein_inference.inference.scorers.BestPeptideScorer

A class that includes one function, run, which takes a problem network and scores proteins using the Best Peptide method.

run(pn)

Scores proteins using the Best Peptide method.

Parameters

pn (ProblemNetwork) – An instance of the problem network object

Returns

pn – An instance of the problem network object where protein nodes have a score attribute.

Return type

ProblemNetwork

class protein_inference.inference.scorers.BestTwoPeptideScorer

A class that includes one function, run, which takes a problem network and scores proteins using the Best Two Peptides method.

run(pn)

Scores proteins using the Best Two Peptides method.

Parameters

pn (ProblemNetwork) – An instance of the problem network object

Returns

pn – An instance of the problem network object where protein nodes have a score attribute.

Return type

ProblemNetwork
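The two "best peptide" heuristics can be sketched on a list of peptide scores for one protein. The sketch assumes higher peptide scores are better and that the best two scores are combined by summation; both are assumptions for illustration, and the library may aggregate differently.

```python
def best_peptide_score(peptide_scores):
    """Score a protein by its single best peptide (needs >= 1 peptide)."""
    return max(peptide_scores)


def best_two_peptide_score(peptide_scores):
    """Score a protein by the sum of its two best peptides."""
    top_two = sorted(peptide_scores, reverse=True)[:2]
    return sum(top_two)


peptide_scores = [0.2, 0.9, 0.7]
print(best_peptide_score(peptide_scores))
print(best_two_peptide_score(peptide_scores))
```

A protein with a single peptide gets the same value from both functions, so the second heuristic only separates proteins that have at least two supporting peptides.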

class protein_inference.inference.scorers.PEPProductScorer

A class that includes one function, run, which takes a problem network and scores proteins using the PEP Product method.

run(pn)

Scores proteins using the PEP Product Method.

Parameters

pn (ProblemNetwork) – An instance of the problem network object

Returns

pn – An instance of the problem network object where protein nodes have a score attribute.

Return type

ProblemNetwork
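A PEP product score multiplies the posterior error probabilities (PEPs) of the peptides supporting a protein, so that more and better peptides drive the product towards zero. The sketch below shows that calculation together with a negative log10 transform that turns it into a higher-is-better score. The transform and the small floor that guards against a zero product are assumptions for the example, not necessarily what the library reports.

```python
import math


def pep_product(peptide_peps):
    """Multiply the PEPs of all peptides supporting one protein."""
    return math.prod(peptide_peps)


def pep_product_score(peptide_peps):
    """Return -log10 of the PEP product, so higher means more confident."""
    # The floor avoids a math domain error if the product is exactly 0.
    return -math.log10(max(pep_product(peptide_peps), 1e-300))


peptide_peps = [0.1, 0.01]
print(pep_product(peptide_peps))
print(pep_product_score(peptide_peps))
```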