Overview
- General
- Files overview
- Project info file: Pnnnnn_info.txt
- What images have you sent to me?
- Raw data
- Normalisation
- Result files
- Deciding which genes are differentially expressed
- Quality control & Diagnostic plots
- Useful analysis and visualisation tools
- Glossary of terms
1. General
Each project is split into replica groups that relate to pairs of samples. Each of these replica groups is typically hybridised to four microarrays (the exact number depends on how many arrays were initially requested). Two of these hybridisations are so-called dye swaps, where a sample originally labelled with Cy3 is now labelled with Cy5 and vice versa. Dye swapping compensates for the differential labelling efficiencies of each dye.
Project info files
Describe how the replicate hybridisations are grouped together and how each sample was labelled: which dye each sample was labelled with, which sample it was paired with and which slide it was hybridised to.
Project codes
Use the format Pnnnnn, where nnnnn is the assigned project number.
Replica group codes
Use the format Rnnnnn. Some file names are based on which of the microarray slides they represent.
Microarray slide codes
Use the format Snnnnnn. Microarrays are numbered in the order they were printed.
Each of the terms and codes defined above will be used throughout the following sections. Most of the files we have sent to you will have names derived from these codes. We will now explain what each file type contains. This might be easier to understand if you simultaneously open the relevant file. To help with interpretation we include an example project, providing additional information for each section.
2. Files overview
This is a list of the files we have sent to you.
- Pnnnnn_info.txt
- raw_data folder containing raw data files (Snnnnnn.state.dat) for all arrays of your project.
- analysis folder containing normalised data files, including some basic statistical analysis.
- plots folder containing diagnostic plots of raw data and normalised data.
- images_bw folder containing the raw grey-scale TIF images of the arrays.
- color_images folder containing the false colour PNG images of the arrays.
3. Project info file: Pnnnnn_info.txt
Pnnnnn_info.txt files describe how we processed your project and record which dye (or “channel”) was used to label each of your samples within each replica group. You may want to refer to this file when analysing your results.
N.B. Pnnnnn_info files are tab-delimited. They are best viewed in a spreadsheet program such as the freely available OpenOffice or Microsoft Excel. Some text editors will have problems with the variable column spacing.
Caution - when using Excel, be sure to use the open file command and indicate that columns such as gene names and unique IDs are to be treated as text. If you do not do this, Excel may mangle some of these identifiers (PubMed).
Important note: dye swaps and gene expression ratios
- Raw data come as one file per slide. For dye-swap slides (indicated by “Swap_Status = 1” in the Pnnnnn_info file), the dye swap has not been taken into account. To normalise these data yourself, you need to consider the dye swap. The ratios in the raw data files are presented as a background-subtracted ratio of Cy3 over Cy5 (the default setting of the spot-finding tool) and are only included for historical reasons. DO NOT USE THESE RATIOS.
- Normalised data have the dye swap taken into account. They are presented as one file for each replica group, with a separate column for each slide. The dye-swap slide column data are un-swapped during the analysis and are presented as the ratio of Cy5 over Cy3. Accordingly, all Snnnnnn:M numbers should be treated as though there has been no dye swap (i.e. Swap_Status = 0 in the Pnnnnn_info file); further analysis does not require any compensation for the dye swaps (i.e. a negative log Cy5/Cy3 ratio would be positive in the raw dye-swap data, but we have corrected this). We chose the Cy5/Cy3 ratio because it is the form predominantly used in publications.
Column definitions for Pnnnnn_info.txt
- Project_Number: Project number in the form Pnnnnn, where nnnnn is a unique integer
- Replicate_Group: Replica group in the form Rnnnnn, where nnnnn is a unique integer
- Slide_Number: Slide number in the form Snnnnnn, where nnnnnn is a unique integer
- Hyb_Number: Hybridisation batch number in the form Hnnnnn, where nnnnn is a unique integer
- Cy3_Image: Cy3 image name, derived from the Slide_Number and denoted by the 532 suffix
- Cy5_Image: Cy5 image name, derived from the Slide_Number and denoted by the 635 suffix
- Cy3_Sample_Name: the name you provided for this sample
- Cy5_Sample_Name: the name you provided for this sample
- Swap_Status: denotes whether the slide is a dye swap (1) or not (0)
- Comments: refers to, e.g., problems with the slide (broken, high background, poor signal, etc.)
Project info file example (male vs female):
The following table describes project P99934, where we compared the gene expression patterns of adult male and female flies. This project comprises one replicate group (R10342) with four slides, two of which were dye swapped (S106438 and S106430). For this example, the ratio is male over female, with swap status 0 representing the male sample in the Cy5 channel and the female sample in the Cy3 channel.
| Project_Number | Replicate_Group | Slide_Number | Hyb_Number | Cy3_Image | Cy5_Image | Cy3_Sample_Name | Cy5_Sample_Name | Swap_Status | Comments |
|---|---|---|---|---|---|---|---|---|---|
| P99934 | R10342 | S106438 | H10001 | S106438_532 | S106438_635 | male_1 | female_1 | 1 | - |
| P99934 | R10342 | S106404 | H10001 | S106404_532 | S106404_635 | female_2 | male_2 | 0 | local background on bottom of slide |
| P99934 | R10342 | S106472 | H10001 | S106472_532 | S106472_635 | female_3 | male_3 | 0 | - |
| P99934 | R10342 | S106430 | H10001 | S106430_532 | S106430_635 | male_4 | female_4 | 1 | - |
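As a rough illustration of how the info file can be read outside a spreadsheet, the following Python sketch parses tab-delimited text with the standard `csv` module and keeps every field as text, so identifiers cannot be reformatted. The two data rows are copied from the example table above; the function names and the lower-casing of column headers are assumptions made for this sketch and are not part of our pipeline.

```python
import csv
import io

# Two rows copied from the example table above, written as tab-delimited text.
EXAMPLE_INFO = (
    "Project_Number\tReplicate_Group\tSlide_Number\tHyb_Number\tCy3_Image\t"
    "Cy5_Image\tCy3_Sample_Name\tCy5_Sample_Name\tSwap_Status\tComments\n"
    "P99934\tR10342\tS106438\tH10001\tS106438_532\tS106438_635\t"
    "male_1\tfemale_1\t1\t-\n"
    "P99934\tR10342\tS106404\tH10001\tS106404_532\tS106404_635\t"
    "female_2\tmale_2\t0\tlocal background on bottom of slide\n"
)


def read_info(handle):
    """Return one dict per slide; all values stay as plain text.

    Header names are lower-cased so that 'Swap_Status' and 'Swap_status'
    are treated alike (an assumption made for robustness).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [{key.lower(): value for key, value in row.items()} for row in reader]


def swap_status_by_slide(rows):
    """Map each slide number to True if the slide is a dye swap."""
    return {row["slide_number"]: row["swap_status"] == "1" for row in rows}


rows = read_info(io.StringIO(EXAMPLE_INFO))
swaps = swap_status_by_slide(rows)
for slide, swapped in swaps.items():
    print(slide, "dye swap" if swapped else "not swapped")
```

For a real project, `io.StringIO(EXAMPLE_INFO)` would be replaced by an open file handle for your own Pnnnnn_info.txt.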
4. What images have you sent to me?
We provide you with two different types of images:
(1) Raw grey-scale 16-bit TIF images used to quantify how much labelled sample has bound to each spot for each channel. These files can be found in the images_bw folder. These are the primary raw data for your microarray experiment!
- Snnnnnn_532.tif: grey-scale 16-bit TIF image of the Cy3 channel
- Snnnnnn_635.tif: grey-scale 16-bit TIF image of the Cy5 channel
(2) False colour PNG images (color_images folder). These provide you with a visual and non-normalised representation of your results. These images can be used to check for slide-specific problems as well as for presentations.
- Snnnnnn_c.png: false colour image where Cy3 is green, Cy5 is red, and equal red and green is yellow
5. Raw data
For each slide, spot-finding and quantitation are presently performed by dapple. The file name is based on the microarray slide it represents. This file contains the raw unprocessed data without taking any dye swap into account. To normalise these data yourself, you will need to consider the dye swap status. In this context, we recommend using the single channel data. For technical information on the spot-finding and quantitation tool, see Buhler J. et al. (2000). Dapple: improved techniques for finding spots on DNA microarrays. University of Washington technical report, UWTR 2000-08-05 (report).
- Snnnnnn.state.dat: spot quantification file from dapple with associated spot identities
Header information is always denoted by a hash (#) at the beginning of the line. All other columns are defined below. However, please note that the first number within the grid_x column is the total spot number (e.g., 18240 for FL002), and the first number within the grid_y column is the total number of channels (e.g., 2).
Column definitions for Snnnnnn.state.dat
These first few columns denote the spot location in the microarray. Locations are provided using a system of Cartesian co-ordinates. The x-axis corresponds to the width of the image (the shortest side) and the y-axis corresponds to the length of the image (the longest side). The reference point for these co-ordinates (0,0) is the top left spot in each image.
- tool_x: x-axis co-ordinate for the sub-grid (a.k.a. "block" or "pin-patch")
- tool_y: y-axis co-ordinate for the sub-grid
- sgrid_x: x-axis co-ordinate for the spot within the sub-grid
- sgrid_y: y-axis co-ordinate for the spot within the sub-grid
The following columns describe the nature of each spot. The description includes the Drosophila transcript and the predicted gene for each spot. The last column defines whether the spot should be included in any normalisation if you should choose to do this yourself.
- UniqueID: Unique FlyChip spot identifier
- TargetID: Drosophila transcript identity
- oID: FlyChip assigned oligonucleotide identity
- oSeq: oligonucleotide sequence
- oLen: oligonucleotide length
- target_length: target mRNA transcript length
- dist3p: oligonucleotide target sequence coordinate from the 3' end of the transcript
- dist5p: oligonucleotide target sequence coordinate from the 5' end of the transcript
- Accession: FBgn number for the current gene assignment of the oligo
- TargetName: FlyBase symbol for the current gene assignment of the oligo, if available.
- spike or spikes: FlyChip assigned identity for Arabidopsis columbia spike controls
- norm_ignore: value of 1 indicates that the spot should not be included in any normalisation
- show: flag is set to 0 when data should not be included in any downstream analysis after normalisation, e.g., the spot maps to an empty well or a control. The data should be included when the flag is set to 1, i.e., this is a Drosophila oligonucleotide probe.
Note that genome annotations change. For the most up-to-date view of the gene models and transcripts interrogated by the INDAC oligonucleotide probes, you can consult FlyMine at www.flymine.org.
Further columns provide details about the spot status, signal and a pixel count for the foreground (i.e. the spot) and background (i.e. the area surrounding the spot). Spots with very few foreground pixels are probably unreliable, because there are too few pixels for a reliable estimate of the spot signal. The suffix N (or n) in a column header indicates the channel, with N=1 representing the Cy3 channel and N=2 the Cy5 channel.
- StatusN: status of each spot for channel N; where A = [A]ccepted, R =[R]ejected, and S=[S]uspicious
- fgMedianN: foreground (spot) median pixel intensity of channel N
- fgAdjMADn: foreground (spot) pixel intensity variability of channel n
- bgMedianN: background (local area around spot) median pixel intensity of channel N
- bgAdjMADn: background (local area around spot) pixel intensity variability of channel n
- fgN: number of pixels in the foreground
- bgN: number of pixels in the background
What does this mean for the example of the male vs female comparison?
The raw data file of slide S106404 (S106404.state.dat) contains the foreground median intensity of the female sample in the fgMedian1 column (Cy3) and the male sample in the fgMedian2 column (Cy5). However, for the dye swap slide S106438, the male sample is given in the fgMedian1 column and the female sample in the fgMedian2 column. Therefore always check the swap status before analysing the raw data.
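The effect of the swap status on a raw log ratio can be sketched as follows. This is only an illustration of the orientation issue, not our normalisation procedure: the function name is hypothetical, the intensities are invented, and no normalisation is applied.

```python
import math


def oriented_raw_m(fg_median1, fg_median2, swap_status):
    """Raw log2 ratio, oriented as on a non-swapped slide.

    fg_median1 is the Cy3 foreground median and fg_median2 the Cy5
    foreground median. On a dye-swap slide the two samples have exchanged
    channels, so the sign of log2(Cy5 / Cy3) is flipped.
    """
    m = math.log2(fg_median2 / fg_median1)
    return -m if swap_status == 1 else m


# Invented intensities for one spot that is 4-fold higher in the male sample.
# S106404 (Swap_Status 0): female in Cy3, male in Cy5.
m_s106404 = oriented_raw_m(fg_median1=1000.0, fg_median2=4000.0, swap_status=0)
# S106438 (Swap_Status 1): male in Cy3, female in Cy5.
m_s106438 = oriented_raw_m(fg_median1=4000.0, fg_median2=1000.0, swap_status=1)
print(m_s106404, m_s106438)  # both values now read as male over female
```

Without the sign flip, the two slides would report the same biological difference with opposite signs.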
6. Normalisation
Measured fluorescent spot signals will differ systematically between different microarray hybridisations and dyes, including differences in background fluorescence, and in overall brightness with, e.g., one dye being twice as bright as another one. The process of correcting for such systematic differences is called normalisation. We use no background correction method, since this may significantly increase the overall variance of the data.
Differential expression is presented as a ratio of Cy5 over Cy3. A more symmetric (i.e., a “Gaussian” or “normal-like”) distribution is achieved by log-transforming the ratio on a log2 scale.
The log differential expression ratio for each spot or 'M' is calculated as follows:
$$M = \log_2\left(\frac{\mathrm{Cy5}}{\mathrm{Cy3}}\right) = \log_2 \mathrm{Cy5} - \log_2 \mathrm{Cy3}$$
The log intensity of the spot or 'A' (a measure of the overall brightness of the spot) is:
$$A = \frac{\log_2(\mathrm{Cy5} \times \mathrm{Cy3})}{2} = \frac{\log_2 \mathrm{Cy5} + \log_2 \mathrm{Cy3}}{2}$$
In our files, M-values represent:
- +4 = Cy5 is 16-fold higher than Cy3
- +3 = Cy5 is 8-fold higher than Cy3
- +2 = Cy5 is 4-fold higher than Cy3
- +1 = Cy5 is 2-fold higher than Cy3
- 0 = no change
- -1 = Cy5 is 2-fold lower than Cy3
- -2 = Cy5 is 4-fold lower than Cy3
- -3 = Cy5 is 8-fold lower than Cy3
- -4 = Cy5 is 16-fold lower than Cy3
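The relationship between M, A and fold change can be checked with a few lines of Python. This is a minimal sketch of the definitions given above; the function names are chosen for illustration only.

```python
import math


def m_value(cy5, cy3):
    """Log2 differential expression ratio: M = log2(Cy5 / Cy3)."""
    return math.log2(cy5 / cy3)


def a_value(cy5, cy3):
    """Mean log2 intensity: A = (log2 Cy5 + log2 Cy3) / 2."""
    return (math.log2(cy5) + math.log2(cy3)) / 2


def fold_change(m):
    """Convert an M-value into a fold change and its direction."""
    if m == 0:
        return 1, "no change"
    return 2 ** abs(m), "higher" if m > 0 else "lower"


for m in (4, 3, 2, 1, 0, -1, -2, -3, -4):
    fold, direction = fold_change(m)
    print(f"M = {m:+d}: Cy5 is {fold}-fold {direction} than Cy3")
```

Because M is on a log2 scale, each additional unit doubles the fold change, which is why M = 3 corresponds to 8-fold rather than 6-fold.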
Depending on the quality of the data, three different methods of data normalisation are available. Our preferred normalisation method uses the vsn software and is performed using tools from Bioconductor.
Variance Stabilisation Normalisation (vsn) - our default normalisation method
This method is based on the work published by Huber et al. (2002). Bioinformatics 18(1), S96-104 and the corresponding Bioconductor package is called vsn.
For each dye and microarray, the background fluorescence and a factor reflecting overall brightness are inferred so that the signals become identical for the subset of genes that are not differentially expressed. A necessary assumption is that more than half the genes are NOT differentially expressed. For further (technical) information, see Huber et al. (2002) Bioinformatics 18(1), S96-104 (abstract).
Loess & Quantile normalisation (lquant)
Performs a within-array global loess normalisation of the M-values, followed by a between-array quantile normalisation. These normalisation methods are implemented in the Bioconductor limma software package.
Loess normalisation assumes that the bulk of probes on the array are NOT differentially expressed. It does not assume that there are equal numbers of up and down regulated genes or that differential expression is symmetric about zero. For further information, see Yang, Dudoit et al. (2002) Nucleic Acids Res. 30(4), e15 (abstract).
The aim of quantile normalisation is to ensure that the intensity distributions of all arrays are identical. It involves an initial array-specific centring of the data, after which the centred data are ordered from lowest to highest on each array. A reference distribution is then calculated whose lowest value is the average of the lowest values from each of the arrays. This calculation is repeated for each subsequent rank, up to the average of the highest values from each of the arrays. Each measurement on each array is then replaced with the corresponding average value in the reference distribution. For further (technical) information, see Bolstad et al. (2003) Bioinformatics 19(2), 185-93 (abstract).
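The rank-averaging step described above can be sketched in plain Python. This is a simplified illustration and not the limma implementation: it omits the initial centring, does not treat tied values specially, and assumes every array has the same number of spots with no missing values.

```python
def quantile_normalise(arrays):
    """Give every array the same distribution by averaging across ranks.

    `arrays` is a list of equal-length lists, one list per microarray.
    """
    n_spots = len(arrays[0])
    n_arrays = len(arrays)
    sorted_arrays = [sorted(values) for values in arrays]

    # Reference distribution: mean of the lowest values, mean of the
    # second-lowest values, and so on up to the mean of the highest values.
    rank_means = [
        sum(values[rank] for values in sorted_arrays) / n_arrays
        for rank in range(n_spots)
    ]

    normalised = []
    for values in arrays:
        # Spot indices ordered from the lowest to the highest intensity.
        order = sorted(range(n_spots), key=lambda i: values[i])
        result = [0.0] * n_spots
        for rank, spot in enumerate(order):
            result[spot] = rank_means[rank]
        normalised.append(result)
    return normalised


# Invented intensities for three spots on two arrays.
example = quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(example)
```

After this step each array contains exactly the same set of values; only their assignment to spots differs.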
Quantile normalisation (quant)
Performs a between-array quantile normalisation that does not correct for within slide bias. This normalisation method is implemented in the Bioconductor limma software package.
7. Result files
We perform a very basic statistical analysis of your data using different analysis tools. This includes a p-value estimation of your data for direct comparison, e.g., sample versus control. For other experimental designs, e.g., reference designs, or time-course analysis, we do not routinely perform a statistical analysis. However, we offer a cost-recovery based analysis service. We have listed some analysis tools in section 10 that can be used for these designs.
The three statistical analysis tools we use differ in how stringently they assign significance to the data, with CyberT being the least stringent (it gives the most genes with a p-value < 0.05). Limma and siggenes are the most stringent (fewest genes with a p-value < 0.05).
Result file names consist of the replicate group number, normalisation method identifier and the statistical software used for the analysis. For each replicate group of your project we produce summary files which describe the nature of each spot, the transformed normalised intensity differences between the Cy5 and Cy3 channels for the replicate slides (M-values), and the transformed normalised average intensities of the replicate slides (A-values). Normalised data have the dye-swap taken into account; therefore, all M-values should be treated as if there had been no dye swap.
Tip: Excel will change cell contents, e.g., some of the gene names! To avoid this, always open Excel first, and then set text columns to “text format” while importing the data (PubMed).
All result files have the following columns in common. The first few columns contain the gene information.
- UniqueID: Unique FlyChip spot identifier
- TargetID: Drosophila transcript identity
- Accession: FBgn number for the current gene assignment of the oligo
- TargetName: FlyBase symbol for the current gene assignment of the oligo, if available.
The following columns contain the M and A-values for each slide within this replicate group. Positive M-values indicate an increase in relative intensity (Cy5 greater than Cy3), negative values indicate a decrease in relative intensity (Cy5 less than Cy3). Remember that both M and A-values are log2 transformed. Numbers of equal value but opposite sign indicate equivalent fold changes up and down respectively. N.B. Since dye-swaps have been taken into account, the numbers across all replicate slides are comparable and should ideally change in the same direction.
- avgM: average of the M-values within this replicate group
- Snnnnnn.M: M-value for slide Snnnnnn
- Snnnnnn.A: A-value for slide Snnnnnn
The next column indicates whether a spot was saturated during the scanning process. The scanner has a pixel intensity range from 1 (2^0) to 65535 (2^16 - 1). Image brightness can be changed by choosing a different PMT gain. We try to reduce the number of saturated spots; however, there is a trade-off between having few saturated spots and losing low intensity spots when scanning at lower PMT gains. Spots with raw intensities >65000 in either one or both channels are flagged with “1” in the SaturationFlag column.
- SaturationFlag gives the saturation flag for each slide. 1 = intensity value > 65000 = saturated.
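The flagging rule can be written down in a couple of lines. The following Python sketch only restates the threshold given above; the function name is hypothetical and the intensities are invented.

```python
SCANNER_MAX = 2 ** 16 - 1       # 65535, the highest 16-bit pixel intensity
SATURATION_THRESHOLD = 65000    # raw intensities above this are flagged


def saturation_flag(cy3_raw, cy5_raw):
    """Return 1 if either channel exceeds the threshold, otherwise 0."""
    if cy3_raw > SATURATION_THRESHOLD or cy5_raw > SATURATION_THRESHOLD:
        return 1
    return 0


print(saturation_flag(65535, 1200))   # saturated in the Cy3 channel only
print(saturation_flag(64000, 3000))   # bright, but below the threshold
```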
The next column provides an indication of the spot quality. This can be used to determine whether the reported expression changes are reliable or subject to error due to problems with either the printing, hybridisation, or spot-finding.
- all:spotfindStatus: this is the spot-finding status for each channel of each slide in the form Cy3-Cy5. A = accepted. R = rejected. S = suspect.
The following columns are specific for each analysis tool.
Rnnnnn.vsn.CyberT.tab
CyberT is a statistics program with a web interface that can be conveniently used on high-dimensional array data for the identification of statistically significant differentially expressed genes. It employs statistical analyses based on simple t-tests that use the observed variance of replicate gene measurements across replicate experiments. We have implemented the R code hdarray in our analysis pipeline. For further technical information, go to the CyberT web page (http://cybert.microarray.ics.uci.edu/). The first few columns are as described above, followed by the CyberT analysis specific columns.
- N: minimum number of replicates used in the CyberT analysis
- x: average of the M-values within this replicate group
- sd: standard deviation of the M-values
- t: the t-statistic calculated from x and sd
- p: the p-value associated with the standard t-statistic
Rnnnnn.vsn.Limma.fdr.tab
Limma stands for Linear Models for Microarray Data. For each gene, it fits a linear model to the expression data and employs an empirical Bayes method to stabilise the analysis.
A design matrix and a contrast matrix have to be established for the data of a given replicate group. In a paired-data design the number of coefficients is one fewer than the number of RNA sources (e.g. for wildtype vs mutant, the number of coefficients equals 1). The first step is to fit a linear model that describes the systematic part of the data; moderated t-statistics are then calculated for each probe and each contrast. A moderated t-statistic has the same interpretation as an ordinary t-statistic, except that the standard errors have been moderated across genes, i.e. shrunk towards a common value using a simple Bayesian model. P-values are then adjusted using Benjamini and Hochberg's step-up method for controlling the false discovery rate (FDR). For more information, see Smyth (2004) Statistical Applications in Genetics and Molecular Biology, 3:Article 3 (abstract).
Again the first few columns are as described above followed by the limma analysis specific columns.
- logFC: log2 fold change (same as the averageM column)
- AveExpr: average A-value (log2 scale)
- t: moderated t-statistic
- P.Value: p-value associated with the moderated t-statistic, before adjustment for multiple testing
- adj.P.Val: p-value adjusted for multiple testing using the selected adjustment method, in this case the false discovery rate (FDR)
- B: the log-odds that the gene is differentially expressed
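The Benjamini and Hochberg step-up adjustment that relates the raw and adjusted p-values can be sketched as follows. This is a plain-Python illustration of the published procedure, not the code used by limma, and the p-values in the example are invented.

```python
def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the same order as the input.

    Each p-value is multiplied by n / rank (rank 1 = smallest p-value),
    and a running minimum taken from the largest p-value downwards keeps
    the adjusted values monotone.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]   # invented raw p-values for four genes
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)
```

An adjusted p-value is never smaller than the corresponding raw p-value, which is one reason why lists based on adjusted values are shorter.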
Rnnnnn.vsn.siggenes.delta.nn.tab
We can perform the popular Significance Analysis of Microarrays (SAM) using the siggenes software package in Bioconductor. SAM identifies genes with statistically significant changes in expression by assimilating a set of gene-specific t-tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Genes with scores beyond a certain threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analysing permutations of the measurements. The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. For further information, see Tusher et al. (2001) PNAS 98(9), 5116-21 (abstract).
The analysis for a replicate group creates a table (Rnnnnn.vsn.Siggenes.mat.fdr.table.txt), stating the number of called genes, false genes and the FDR for the delta values. Ideally, we choose the delta value that has an FDR < 1 and a number of false genes < 1. However, in some cases it might be appropriate to include a larger number of genes, particularly if there are only very few genes in the list. Please refer to this table for the FDR cutoff and number of predicted false genes for your results.
Again the first few columns are as described above, followed by the siggenes analysis specific columns.
- d.value: a numeric vector consisting of the expression score of the gene, positive = up-regulated, negative = down-regulated. N.B. This is not an expression ratio
- stdev: standard deviation
- rawp: unadjusted p-value of the gene
- q.value: the q-value assigns significance in terms of the false discovery rate (FDR)
Pnnnnn.vsn.matrix.txt
For your convenience we have combined the data of all replicate groups of your project including the (adjusted) p-values of all three analysis tools within one file. This file facilitates a comparison between the different analysis tools and replicate groups. The replicate group number has been added to the column header to clarify what replicate group M-, A- values or p-values belong to. The averageM column has been omitted as it makes no sense to average over several replicate groups.
Pnnnnn.saturated.matrix.txt
Saturated spots (raw intensity value > 65000) are flagged in the result files. If a gene is saturated in one channel but not in another the ratio is compressed. To facilitate reviewing which genes were saturated in which channel and on which slide, we have created the Pnnnnn.saturated.matrix.txt which lists all spots which have been flagged.
The first few columns contain again the gene information, then the raw intensity values for each slide are listed.
- Snnnnnn.Cy3: raw intensity value of slide Snnnnnn in channel Cy3
- Snnnnnn.Cy5: raw intensity value of slide Snnnnnn in channel Cy5
Please note that the dye swap has not been taken into consideration.
Male vs female example files
Below are extracts of result files of the male vs female example. The ratio defined in the Pnnnnn_info.txt is male/female. After sorting the files by the averageM column, the female specific genes are displayed at the top of the list. Yp2 (Yolk protein 2) has an averageM value of -8, which means that this gene is expressed 256-fold higher in females than in males in this particular experiment. For Yp3 you can see that at least one of the channels was saturated on all 4 slides.
- R10342.vsn.CyberT.tab - for size reasons, only the top 10 and bottom 10 genes are displayed.
- R10342.vsn.Limma.fdr.tab - for size reasons, only the top 10 and bottom 10 genes are displayed.
Siggenes analysis creates two files:
- R10342.vsn.Siggenes.mat.fdr.table.txt - this file contains the delta values with the corresponding number of called genes.
- R10342.vsn.Siggenes.delta10.2.tab - as seen in the table above only 14 significant genes are found using a delta value of 10.2
Summary file of the project:
- P99934.vsn.matrix.txt does not have the averageM column and contains the p-values from each analysis method
There were two saturated spots in this project:
- P99934.saturated.matrix.txt - please observe the dye-swap
8. Deciding which genes are differentially expressed
Since we do not rank the data, there are several points to consider before deciding whether a gene in your experiment is differentially expressed.
- p-values - As seen above, p-value estimates vary strongly between tools. The list of genes with good p-values can be quite long and other criteria should be considered.
- all:spotfindStatus - Genes which have reject flags [R-R] over several slides within a replicate group could be removed.
- A-values - Dim spots are more variable, so flagging genes with very low A-values might be advisable. The A-value range is dependent on your experimental conditions, therefore we cannot suggest a cut-off value.
- M-values - There is also no cut-off value for M-values; low fold changes might still be significant.
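One way of combining these criteria is sketched below. Everything in this example is an assumption made for illustration: the function name, the representation of the spot-finding status as one “Cy3-Cy5” string per slide, and all thresholds. In particular, the A-value cut-off is left unset by default because no general value can be recommended.

```python
def keep_gene(adj_p, slide_statuses, a_values,
              max_adj_p=0.05, max_rejected_slides=1, min_mean_a=None):
    """Decide whether a gene passes some simple, user-chosen filters.

    adj_p          : adjusted p-value from one of the analysis tools
    slide_statuses : one 'Cy3-Cy5' status string per slide, e.g. 'A-A'
    a_values       : one A-value per slide
    All thresholds are hypothetical and should be adapted to your data.
    """
    if adj_p >= max_adj_p:
        return False
    rejected = sum(1 for status in slide_statuses if status == "R-R")
    if rejected > max_rejected_slides:
        return False
    if min_mean_a is not None:
        mean_a = sum(a_values) / len(a_values)
        if mean_a < min_mean_a:
            return False
    return True


# Invented values for three genes measured on four slides.
good = keep_gene(0.01, ["A-A", "A-A", "A-A", "A-A"], [9.1, 9.3, 8.8, 9.0])
flagged = keep_gene(0.01, ["R-R", "R-R", "A-A", "A-A"], [9.1, 9.3, 8.8, 9.0])
dim = keep_gene(0.01, ["A-A"] * 4, [4.0, 4.2, 3.9, 4.1], min_mean_a=6.0)
print(good, flagged, dim)
```

No M-value filter is included, in line with the point above that low fold changes might still be significant.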
9. Quality control & Diagnostic plots
All RNA samples of your project have to pass an RNA quality control test. Subsequently, the good quality RNA samples are mixed with spike control RNA in a reverse transcription or amplification reaction and labelled. After the hybridisation, the signals of these spike control spots are checked to verify successful labelling and hybridisation. The quality of our printed arrays within each print batch is ascertained by staining a random sample of slides (protocol). We perform quality control of the data before and after the normalisation and are able to identify problem slides. Any problem we identify will be reported in the corresponding comment column in the project info file. Depending on the severity of the problem, we will occasionally exclude a slide from the analysis (this will be stated in the project info file). Minor problems (e.g. high local background level on a slide) can be solved by rejecting spots in the affected array area and will result in an RR-flag for these genes in your file.
The following examples were primarily taken from the aforementioned male versus female example. The plots were created using the following packages in Bioconductor: limma, vsn, marray and siggenes. For reasons of brevity, not all images for all slides have been displayed.
Background and Foreground images
FL002 arrays were printed using a print head with a 4x12 arrangement of print-tips and the microarrays are partitioned into a 4x12 grid of tip groups (FL003 design). Each grid was printed with a single print-tip. It is interesting to look at the variation of background and foreground intensity values across the array. For each slide we create three images displaying, from left to right, the log2 transformed raw intensity values of the green (Cy3), red minus green and red (Cy5) channel. The darker the colour, the higher the signal intensity. In the background images, slide S108122 shows high background, mostly on the left side of the array. The foreground plots indicate that this background will cause problems when looking at the ratio (red minus green). We reject spots in areas with high background during the spot finding; these spots will have RR-flags in the result files.
[Figure: Rnnnnn.raw.Backgound.png]
[Figure: Rnnnnn.raw.Foregound.png]
Scatter plots with correlation coefficients
The following plots are created for the raw data as well as the normalised data. Only spots with good quality flags on all slides are included; spots that have RR-flags on any one slide are removed. In the upper right panels, pair-wise scatter plots of the raw intensity values of each channel within a replicate group are shown. The lower panels show the correlation coefficients. Channels which contain the same sample type should be close to the red line and have a correlation coefficient close to 1. Below is the raw scatter plot for the male vs female example, showing that there are large differences between the samples. Channels S106438.Cy3 and S106472.Cy5 both contain male samples.
[Figure: R10342.raw.scatterplots.png]
MA plots
In MA-plots, M is the log differential expression ratio (i.e., the expression ratio) and A is the mean log intensity of the two channels of the slide (i.e., the spot signal). In the raw plots, control spots are colour coded (see table below), bad quality spots [RR or RS-flagged] are not plotted, and low intensity spots show larger variance (left side of the plot). The data appear symmetrical around the horizontal line if the two channels behave similarly. The example below shows two dye-swap slides, therefore the M-values (ratios) of the second slide are a mirror image of those of the first slide.
| Spot Color | Spot Type |
|---|---|
| Black | Gene spots |
| Green | Controls (e.g. degradation probes, FLPase, LacZ, modified_GFP, Gal4_cds) |
| Orange | Spikes |
| Yellow | Empty |
| Red | Spotting Buffer |
[Figure: R10342.raw.MAplots.png]
After normalisation, only gene spots are plotted (control and bad quality spots are omitted). The red dotted line shows the loess fit. The MA-plot of slide S106404 after normalisation could indicate that there is a problem with this slide; however, if one ignores the low intensity spots (A-values below 5), the plot looks similar to that of the first slide (S106438).
[Figure: R10342.vsn.MAplots.png]
MAprinttip plots
Raw data MA-plots, with colour-coded lowess fits for each print-tip group. These plots can highlight printing or hybridisation artefacts of the array. Ideally, all lines should be close to the zero line.
[Figure: R10342.raw.MAprintipplots.png]
Boxplots
Boxplots of the M-distribution per print-tip group can be useful to identify spot or hybridisation artefacts. The central box in the plot represents the inter-quartile range (IQR), which is defined as the difference between the 75th percentile and the 25th percentile, i.e., the upper and lower quartiles. The line in the middle of the box represents the 50th percentile, i.e., the median. Extreme values, more than 1.5 IQR above the 75th percentile or more than 1.5 IQR below the 25th percentile, are typically plotted as individual data points.
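The quantities shown in a boxplot can be computed with the Python standard library, as in the sketch below. The M-values are invented, the function name is hypothetical, and the “inclusive” quartile method is an assumption; plotting software may use a slightly different quartile definition.

```python
import statistics


def boxplot_summary(values):
    """Return quartiles, IQR and the points beyond the 1.5 x IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "outliers": outliers}


# Invented M-values for one print-tip group; 3.0 is an extreme value.
m_values = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 3.0]
summary = boxplot_summary(m_values)
print(summary)
```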
[Figure: R10342.raw.Boxplots.png]
Additionally, we create boxplots of the overall M-distribution for each slide of the raw and normalised data. After normalisation, the median should be close to zero and the spread should be similar between the slides.
[Figure: R10342.vsn.Boxplot.png]
Density plots
Density plots display smoothed empirical densities for the individual green and red channels. Without any normalisation there is considerable variation between both channels and between arrays.
[Figure: R10342.raw.Densityplots.png]
Summary plots of the densities of all slides within this replicate group before and after normalisation.
| R10342.vsn.Densityplots.png |
|---|
|
Mhist and Ahist plots
The histograms for the M- and A-values should be similar for slides within a replicate group. In the Ahist-plot of slide S106404, as in the MA-plot of this slide, some of the low-intensity spots cause a distortion, visible as a peak on the left of the graph.
| R10342.vsn.Mhist.png |
|---|
|
| R10342.vsn.Ahist.png |
|---|
|
Sample Cluster Dendrogram
All samples are normalised together and cluster dendrograms are calculated using the Euclidean distance method. Identical sample pairs (biological replicate samples) should cluster together, allowing outliers to be detected. However, if the differences between the sample types within the project are very small, the dye effect may be stronger than the sample difference, so that the samples cluster by dye.
| P99934.sample.cluster.png |
|---|
|
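As a rough illustration of the idea behind the dendrogram, the sketch below computes pairwise Euclidean distances between sample profiles and reports the closest partner of each sample. It is not the clustering procedure used for the plot, only the distance step it is based on; the sample names, the expression values and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbours(profiles):
    """For each sample, return the name of the closest other sample."""
    nearest = {}
    for name, values in profiles.items():
        distances = [
            (euclidean(values, other_values), other)
            for other, other_values in profiles.items()
            if other != name
        ]
        nearest[name] = min(distances)[1]
    return nearest

# Hypothetical normalised expression values for four samples and four probes
profiles = {
    "mutant_rep1": [2.1, -1.0, 0.3, 1.8],
    "mutant_rep2": [2.0, -0.9, 0.2, 1.9],
    "wildtype_rep1": [0.1, 0.2, 0.0, -0.1],
    "wildtype_rep2": [0.0, 0.1, 0.1, -0.2],
}
nearest = nearest_neighbours(profiles)
print(nearest)
```

If a replicate's closest partner turned out to be a sample of a different type, or one labelled with the same dye, that would be the kind of outlier or dye effect the dendrogram is meant to reveal.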
Standard deviation versus mean and versus rank of the mean plots (only for vsn normalised data)
The aim of these plots is to determine whether there is a systematic trend in the standard deviation of the data as a function of overall expression. The assumption that underlies the usefulness of these plots is that most genes are not differentially expressed. Consequently, the running median should be a reasonable estimator of the standard deviation of feature-level data conditional on the mean. The red dots show the running median of the standard deviation, and the curve they trace is an estimate of the systematic dependence of the standard deviation on the mean. After Variance Stabilization Normalization this should be a horizontal line. It may show some random fluctuations, but should not show an overall trend. If this is not the case, it usually indicates a data quality problem or is a consequence of inadequate prior data pre-processing. In the rank version of the plot, the rank ordering distributes the data evenly along the x-axis.
| R10342.vsn.stdev_vs_mean.png |
|---|
|
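The quantities shown in these plots can be sketched as follows. This is a simplified illustration, not the code used to produce the plots: the function name `sd_vs_rank`, the window size and the data are hypothetical, and the windows are simply truncated at the edges.

```python
from statistics import mean, median, stdev

def sd_vs_rank(rows, window=3):
    """Order probes by mean and return their SDs plus a running median of the SDs.

    rows: one list of normalised values per probe (one value per slide).
    """
    # Sorting by the mean gives the rank ordering used on the x-axis
    ranked = sorted((mean(row), stdev(row)) for row in rows)
    sds = [sd for _, sd in ranked]
    half = window // 2
    running = []
    for i in range(len(sds)):
        lo = max(0, i - half)
        hi = min(len(sds), i + half + 1)
        running.append(median(sds[lo:hi]))
    return sds, running

# Hypothetical normalised values for five probes on three slides
rows = [
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
    [9.0, 10.0, 11.0],
    [3.0, 4.0, 5.0],
    [7.0, 7.0, 7.0],
]
sds, running = sd_vs_rank(rows)
print(sds)
print(running)
```

With well-stabilised data the running median stays roughly constant across ranks; a steady rise or fall would correspond to the trend described above.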
Limma Volcano plot
A volcano-plot of log-fold-change versus a non-negative statistic that tests the significance of the fold change. Since the size of the statistic tends to increase with absolute log-fold change, such a plot has the characteristic shape of the open crater of a volcano. Volcano plots are sometimes used to emphasise that variability plays a role in significance as well as fold-change. Coloured dots represent probes that show large and very significant fold changes (adjusted p-value < 0.05).
| R10342.vsn.Limma.fdr.png |
|---|
|
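The coordinates of a volcano plot and the colouring rule can be sketched as below. Two assumptions are made that the text above does not confirm: the y-axis statistic is taken to be -log10 of the raw p-value, and the p-values are adjusted with the Benjamini-Hochberg procedure. The function names and the data are hypothetical.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def volcano_points(log_fc, pvalues, alpha=0.05):
    """Return (x, y, highlighted) per probe: x = log fold change, y = -log10(p)."""
    adjusted = bh_adjust(pvalues)
    return [
        (m, -math.log10(p), adj < alpha)
        for m, p, adj in zip(log_fc, pvalues, adjusted)
    ]

# Hypothetical log fold changes and raw p-values for four probes
log_fc = [2.0, -1.5, 0.1, 0.3]
pvalues = [0.001, 0.01, 0.5, 0.8]
points = volcano_points(log_fc, pvalues)
for point in points:
    print(point)
```

In this example the first two probes have adjusted p-values below 0.05 and would be drawn as coloured dots; the other two would not.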
Siggenes plot
Visualisation of the SAM plot using the delta value specified in the analysis. The green dots in the distribution represent the up-regulated (top) and down-regulated (bottom) genes.
| R10342.vsn.Siggenes.delta1.5.png |
|---|
|
10. Useful analysis and visualisation tools
There are many tools available for normalisation and statistical analysis of microarray data. See data analysis for an introduction to data analysis and links to useful tools.
Data analysis depends on the experimental design and will be different for each project. However, to get you started we have listed some links below that might be useful. Please refer to the tool-specific software documentation for details.
p-value calculation
Identify differentially expressed genes and estimate False Discovery Rates.
| Tool | Type | Direct design | Time-course design | Reference design |
|---|---|---|---|---|
| CyberT | online tool | yes | - | yes |
| GEPAS | online tool | yes | yes | yes |
| limma | Bioconductor tool | yes | yes | yes |
| siggenes | Bioconductor tool | yes | yes | yes |
| maSigPro | Bioconductor tool | - | yes | - |
Clustering
Identify genes with similar expression profiles using clustering.
| Tool | Type |
|---|---|
| Cluster 3.0 & TreeView | free download tools |
| maSigPro | Bioconductor tool |
| GEPAS | online tool |
11. Glossary of terms
- FBgn - FlyBase unique gene identification number
- Dye swap - Sample or control that was labelled with Cy3 is now labelled with Cy5 and the sample or control that was labelled with Cy5 is now labelled with Cy3
- Gene Ontology (GO) - Controlled vocabulary used to define gene structure, function, and expression
- Project(s) - The experiment to be performed on your behalf by FlyChip
- Replicate group(s) - A replicate group corresponds to each user-defined sample-control pair
- Replicate slide(s) - Slides that have been hybridised with the same sample-control
- Sample(s) - The biological material that was submitted to FlyChip for analysis
- M-value - Log2-scale differential expression ratio (Cy5 over Cy3)
- A-value - Log2-scale signal intensity
Version 2.0. B.Fischer (12-03-2008)

