Identifies subnetworks that discriminate between given conditions, based on a PPI network and gene expression data
Categories: integrated analysis
# What is PinnacleZ?

PinnacleZ is a tool for classifying gene expression profiles by integrating gene expression data and protein networks. It is a Cytoscape implementation of the searching and scoring algorithms specified in Chuang, H. Y. and Lee, E., et al., "Network-based classification of breast cancer metastasis," <i>Molecular Systems Biology</i> 3:140 (2007). In this protein network-based approach, indicators of a phenotype, or <i>markers</i>, are not just genes but subnetworks of the given protein network. The approach assumes a direct correspondence between a gene in the expression data and a protein in the protein network. In other words, gene X in the given expression data corresponds to protein X in the given protein network.

# Terms used in this page

* <b>Adjacent Node:</b> typically used in a sentence like <i>X is the adjacent node to Y</i>. This means there is an edge between nodes X and Y. In biological terms, proteins X and Y interact.
* <b>Edge:</b> a line in Cytoscape between two nodes. In biological terms, this indicates an interaction between two proteins.
* <b>Gene Expression Matrix:</b> a table of numbers, where each row represents a gene and each column represents a unique molecular condition or state of a biological cell. Each cell in the table specifies the level of expression of a given gene under a given condition.
* <b>Gene Expression Vector:</b> a row in the gene expression matrix. Each vector corresponds to a protein in the protein network.
* <b>Protein Network:</b> a network in Cytoscape representing protein interactions. For example, assume a protein network in Cytoscape has nodes X and Y, and there is a connection between X and Y. This means protein X interacts with protein Y, and vice versa. In graph theory jargon, a network is called a <i>graph</i>.
* <b>Node:</b> a protein in the protein network. In graph theory parlance, this is also called a <i>vertex</i>.

# The Overall Process of PinnacleZ

1. PinnacleZ calculates a set of <i>modules</i>. A module is a subnetwork of the given protein network. A module starts out containing only a <i>starting node</i>, which can be any node in the protein network. Nodes are then added to the module. A node is added to the module only if
   * the node is adjacent to a node already in the module, and
   * the node improves the overall score of the module. (Scoring is defined in the next step.)

   If no node meets both criteria, the process of building the module stops. PinnacleZ goes through each node in the protein network and calculates its module. The modules collected this way are called <i>real modules</i>.
1. PinnacleZ scores each real module. A score is a numerical quantity that measures how "good" a module is. PinnacleZ gives the user a choice between two scoring methods: mutual information and T test. The score depends on the gene expression vectors contained in the module.
1. Most of the real modules were produced by mere chance and are statistically insignificant. These modules must be removed. PinnacleZ filters out insignificant modules by passing them through statistical tests. To do this, PinnacleZ first:
   1. randomly reassigns gene expression vectors to proteins, breaking the original association between each vector and its corresponding protein;
   1. recalculates all the modules now that the associations between gene expression vectors and nodes have been randomized;
   1. scores the modules. These modules are collected together and are called <i>random modules</i>.
1. This randomized process is repeated many times. The number of random trials is set by the user. More random trials give a more reliable null distribution, but the computation takes longer.
1. <i>Statistical Test 1</i>: The scores of all random modules are pooled to form a null distribution. If a real module's score is insignificant when compared against the null distribution, the module is discarded.
1. <i>Statistical Test 2</i>: The random module scores are used to estimate the parameters of a distribution. If the user selected mutual information as the scoring method, the gamma distribution is used. If the user selected T test, the normal distribution is used. If a real module has an insignificant score compared to the fitted distribution, it is discarded.
1. <i>Statistical Test 3</i>: The gene expression vectors of a real module are combined into one vector, and a score is calculated from this vector. The order of the vector's columns is then randomized, and another score is calculated from the randomized vector. This randomization is repeated many times, and the randomized scores form a null distribution. If the real score is insignificant compared to the null distribution, the module is discarded.
1. The real modules that passed all three statistical tests are presented to the user.

# Input for PinnacleZ

PinnacleZ requires three sources of input: a gene expression matrix, a class file, and a protein network.

<b>Note</b>: `\ws` indicates white space, which is a tab or a space.

## The Gene Expression Matrix

The gene expression matrix is a text file with the following format:

* The first line gives the names of the columns of the matrix. It follows this format: `names \ws condition1 \ws condition2 \ws ... conditionN`
* Subsequent lines describe gene expression vectors. They follow this format:

      gene1 \ws number1 \ws number2 \ws ... numberN
      gene2 \ws number1 \ws number2 \ws ... numberN
      ...
      geneM \ws number1 \ws number2 \ws ... numberN

<b>Note:</b> The number of expression values in each row must exactly equal the number of conditions given in the first line. Otherwise, PinnacleZ will not accept the gene expression matrix file.
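
The format rules above can be checked mechanically. The following is a minimal Python sketch of a reader for this format, not PinnacleZ's own parser: the function name `parse_expression_matrix` is hypothetical, and skipping blank lines is an assumption that the format description does not address.

```python
def parse_expression_matrix(text):
    """Parse a whitespace-delimited expression matrix.

    Returns (conditions, vectors), where vectors maps each gene name
    to its list of expression values. Hypothetical helper, not part
    of PinnacleZ.
    """
    # Assumption: blank lines are ignored.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("expression matrix is empty")

    # The first token of the header labels the gene-name column.
    conditions = lines[0].split()[1:]

    vectors = {}
    for line in lines[1:]:
        fields = line.split()  # splits on tabs and spaces alike
        gene, values = fields[0], fields[1:]
        # Each row must have exactly one value per condition.
        if len(values) != len(conditions):
            raise ValueError(
                f"{gene}: expected {len(conditions)} values, got {len(values)}"
            )
        # float() accepts plain and exponent notation (2, 1.0, 3e-9).
        vectors[gene] = [float(v) for v in values]
    return conditions, vectors
```

A row with too few or too many values raises `ValueError`, mirroring the rule that a matrix with mismatched row lengths is rejected.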

The following is an example of a valid gene expression matrix file:

    names +Glucose -Glucose +Succinate
    AT_Gene_01 1.0 2 3e-9
    AT_Gene_02 7.0 26 10e10
    AT_Gene_03 2.0 22 62e10
    AT_Gene_04 9.0 12 6e12

In the above example, there are three conditions and four genes.

## The Class File

The class file specifies the classification of each condition in the gene expression matrix. It has the following format:

* Each line specifies the classification of one condition. It has this format: `condition-name \ws classification`
* Each condition specified in the gene expression matrix must have a classification. If one does not, PinnacleZ will not accept the class file.
* The classification must be a positive integer.
* If T-Test is used as the scoring method, <i>only two classes are allowed</i>: `1` and `2`.
* If mutual information is used, any number of classes is allowed.

The following is an example of a valid class file based on the example gene expression matrix given above:

    +Glucose 1
    -Glucose 2
    +Succinate 1

In the above example, `+Glucose` and `+Succinate` are in one class, and `-Glucose` is in another.

## The Protein Network

* Protein networks must be loaded in Cytoscape, but do not need a view.
* The <i>ID</i> property of nodes must match the gene names specified in the gene expression matrix. If a node's <i>ID</i> property does not match any of the gene names in the expression matrix, the node will be ignored.
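
The class file rules and the node-matching rule can be expressed as a short validation routine. This is a sketch under stated assumptions rather than PinnacleZ's actual code: the names `parse_class_file` and `usable_nodes`, and the score model strings `"mutual_information"` and `"t_test"`, are hypothetical.

```python
def parse_class_file(text, conditions, score_model="mutual_information"):
    """Parse and validate a class file against the matrix's conditions.

    Returns a dict mapping condition name to class label.
    Hypothetical helper, not part of PinnacleZ.
    """
    classes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected 'condition-name classification': {line!r}")
        name, label = fields
        try:
            value = int(label)
        except ValueError:
            value = 0  # non-integers fall through to the check below
        if value < 1:
            raise ValueError(f"{name}: classification must be a positive integer")
        classes[name] = value

    # Every condition in the expression matrix needs a classification.
    missing = [c for c in conditions if c not in classes]
    if missing:
        raise ValueError(f"conditions without a classification: {missing}")

    # The T test compares exactly two groups, labelled 1 and 2.
    if score_model == "t_test" and not set(classes.values()) <= {1, 2}:
        raise ValueError("T test scoring only allows classes 1 and 2")
    return classes


def usable_nodes(node_ids, gene_names):
    """Keep only the node IDs that match a gene name; the rest are ignored."""
    genes = set(gene_names)
    return [node for node in node_ids if node in genes]
```

With the example files above, `parse_class_file` would map `+Glucose` and `+Succinate` to class 1 and `-Glucose` to class 2.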

# Options

<ul>
<li><b>Score Model:</b> the score model to use; see steps 2 and 6 in the Overall Process.</li>
<li><b>Number of Random Trials:</b> the number of random trials to calculate; see step 4 in the Overall Process.</li>
<li><b>ST 1 P-value cutoff:</b> the P-value cutoff to determine if a module is significant for Statistical Test 1.</li>
<li><b>ST 2 P-value cutoff:</b> the P-value cutoff to determine if a module is significant for Statistical Test 2.</li>
<li><b>ST 3 P-value cutoff:</b> the P-value cutoff to determine if a module is significant for Statistical Test 3.</li>
<li><b>Number of ST 3 Trials:</b> the number of randomizations to perform for Statistical Test 3.</li>
<li><b>Max Node Degree:</b> the maximum number of adjacent nodes a node may have and still be added to a module. This is useful to keep <i>hubs</i>, or nodes with many edges, from being over-represented in the results.</li>
<li><b>Min Improvement:</b> the minimum percentage by which a node must improve a module's score in order to be added.</li>
<li><b>Max Module Size:</b> the maximum number of nodes a module can contain.</li>
<li><b>Max Radius:</b> only nodes within the specified distance of the starting node can be added to a module.</li>
</ul>
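
To illustrate how the search options might constrain module growth, here is a minimal Python sketch. It is not PinnacleZ's implementation. Several points are assumptions: the module's vectors are combined by column-wise averaging, the score is the absolute Welch t-statistic between classes 1 and 2, the best-scoring candidate is added in each round, Min Improvement is expressed as a fraction, and the default values are arbitrary. The names `t_score`, `distances_from`, and `grow_module` are hypothetical.

```python
import math
from collections import deque
from statistics import mean, variance


def t_score(vectors, labels):
    """Score a module as |Welch t| of its column-wise mean vector (assumed rule)."""
    combined = [mean(col) for col in zip(*vectors)]
    a = [x for x, c in zip(combined, labels) if c == 1]
    b = [x for x, c in zip(combined, labels) if c == 2]
    diff = abs(mean(a) - mean(b))
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No within-class variance: perfect separation or no signal at all.
        return math.inf if diff > 0 else 0.0
    return diff / se


def distances_from(start, adjacency):
    """Breadth-first search: number of edges from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def grow_module(start, adjacency, expression, labels,
                max_degree=300, min_improvement=0.05,
                max_size=20, max_radius=2):
    """Greedily grow a module from start; defaults are arbitrary placeholders."""
    dist = distances_from(start, adjacency)
    module = [start]
    score = t_score([expression[start]], labels)
    while len(module) < max_size:
        # Candidates: neighbors of the module that pass every option filter.
        candidates = {
            nb for node in module for nb in adjacency[node]
            if nb not in module and nb in expression
            and len(adjacency[nb]) <= max_degree   # Max Node Degree
            and dist[nb] <= max_radius             # Max Radius
        }
        best, best_score = None, score
        for cand in sorted(candidates):
            vectors = [expression[n] for n in module] + [expression[cand]]
            cand_score = t_score(vectors, labels)
            if cand_score > best_score:
                best, best_score = cand, cand_score
        # Min Improvement: stop unless the best candidate improves enough.
        if best is None or best_score < score * (1 + min_improvement):
            break
        module.append(best)
        score = best_score
    return module, score
```

The sketch expects `adjacency` to map every node to a list of its neighbors (with each edge listed in both directions), `expression` to map node IDs to expression vectors, and `labels` to hold the class of each condition in column order, with at least two conditions per class.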