FlyChip
Functional Genomics for Drosophila
Cambridge Systems Biology Centre, Tennis Court Road, Cambridge, CB2 1QR, UK
Tel: +44 (0)1223-760280.   Fax: +44 (0)1223-760241.

Overview

  1. Files overview
  2. Project info file: Pnnnnn_info.txt
  3. Black and White images: images_bw
  4. Color images: color_images
  5. Raw data: raw_data
  6. Normalisation
  7. Result files: analysis
  8. Quality control & Diagnostic plots: plots
  9. Gene Expression Omnibus submission: GEO
  10. Project_report.doc
  11. Useful analysis and visualisation tools
  12. Glossary of terms

1. Files overview

This is a list of the files we have sent to you.


2. Project info file: Pnnnnn_info.txt

The Pnnnnn_info.txt file describes how we processed your project and records which dye (or "channel") was used to label each of your samples within each replicate group.

Each project is split into replicate groups "Rnnnnn", each of which relates to a pair of sample types. Each replicate group is typically hybridised to four microarrays "Snnnnnn" (the exact number depends on how many arrays were initially requested). Two of these hybridisations are so-called dye swaps, where a sample originally labelled with Cy3 is labelled with Cy5 and vice versa. Dye swapping compensates for the different labelling efficiencies of the two dyes.

Dye swap definition:

Two-colour microarray expression ratios are generally represented as Cy5 over Cy3. Most of the time we cannot identify which of the submitted samples is the "sample" and which is the "control" in the experiment. We therefore assign the samples to the Cy3 and Cy5 channels at random. The ratios in the result files are always reported as Cy5 sample over Cy3 sample, as assigned on the slides with Swap_status = 0.

Project info file column definitions:

Project info file example of male vs female:

The following table describes project P99934, where we compared the gene expression patterns of adult male and female flies. This project comprises one replicate group (R10342) with four slides, two of which were dye swapped (S106438 and S106430). For this example, the ratio is male over female with the swap status 0 representing the male sample in the Cy5 channel and the female sample in Cy3 channel. Throughout this help file we will refer back to this example.

Project_Number Replicate_Group Slide_Number Hyb_Number Cy3_Image Cy5_Image Cy3_Sample_Name Cy5_Sample_Name Cy3_Sample_Type Cy5_Sample_Type Swap_status Comments
P99934 R10342 S106438 H10001 S106438_532 S106438_635 male_1 female_1 male female 1 -
P99934 R10342 S106404 H10001 S106404_532 S106404_635 female_2 male_2 female male 0 local background on bottom of slide
P99934 R10342 S106472 H10001 S106472_532 S106472_635 female_3 male_3 female male 0 -
P99934 R10342 S106430 H10001 S106430_532 S106430_635 male_4 female_4 male female 1 -

3. Black and White images: images_bw

Each microarray slide is scanned with the GenePix 4000B Microarray Scanner at 5 µm resolution and saved as grey-scale 16-bit TIF images. The scanner has a pixel intensity range from 1 (2^0) to 65535 (2^16 - 1). Image brightness can be changed by choosing a different PMT gain. We try to reduce the number of saturated spots; however, when scanning at lower PMT gains there is a trade-off between having few saturated spots and losing low-intensity spots. The TIF images are then used to quantify how much labelled sample has bound to each spot in each channel. These are the primary raw data for your microarray experiment!


4. Color images: color_images

This folder contains false-colour PNG images. These provide you with a visual, non-normalised representation of your results. The images can be used to check for slide-specific problems as well as for presentations.


5. Raw data: raw_data

For each slide, spot-finding and quantitation are presently performed by dapple. The file name "Snnnnnn.state.dat" indicates which microarray slide the file represents. This file contains the raw, unprocessed data and does not take any dye swap into account. Please refer to the Pnnnnn_info.txt file to find out which sample is located in which column. To normalise these data yourself, you will need to consider the dye swap status. In this context, we recommend using the single channel data. For technical information on the spot-finding and quantitation tool, see Buhler J. et al. (2000). Dapple: improved techniques for finding spots on DNA microarrays. University of Washington technical report, UWTR 2000-08-05.

Column definitions for Snnnnnn.state.dat

Header information is always denoted by a hash (#) at the beginning of the line. All other columns are defined below. However, please note that the first number within the grid_x column is the total spot number (e.g., 18240 for FL003), and the first number within the grid_y column is the total number of channels (e.g., 2).

The first few columns denote the spot location in the microarray. Locations are provided using a system of Cartesian co-ordinates. The x-axis corresponds to the width of the image (the shortest side) and the y-axis corresponds to the length of the image (the longest side). The reference point for these co-ordinates (0,0) is the top left spot in each image.

The following columns describe the nature of each spot. The description includes the Drosophila transcript and the predicted gene for each spot. The last column defines whether the spot should be included in any normalisation if you should choose to do this yourself.

Note: Genome annotations change. For the most up-to-date view of the gene models and transcripts interrogated by the INDAC oligonucleotide probes, you can consult FlyMine at www.flymine.org

Further columns provide details about the spot status, the signal and the pixel count for the foreground (i.e. the spot) and the background (i.e. the area surrounding the spot). Spots with very few foreground pixels are probably unreliable because they contain too few pixels for a reliable estimate of the spot signal. The N (or n) in the column headers indicates the channel, with N=1 representing the Cy3 channel and N=2 the Cy5 channel.

Column definitions for Pnnnnn.raw.intensity.matrix.txt

The first few columns describe the nature of each spot. Then we present the Cy3 and Cy5 raw spot intensity for each slide. Spot signals for highly expressed genes can sometimes be close to or above the saturation level (maximum pixel intensity 65535). Calculating an expression ratio when one of the samples is close to saturation will result in a compressed ratio. We try to minimise the number of saturated spots. However, reducing the scan setting so that no spots are saturated can lead to the loss of data for lowly expressed genes. For this reason we choose an optimal PMT gain setting, and any spot with a raw intensity >65000 in one or both channels is flagged with "1" in the SaturationFlag column. The flag order is the same as the slide order in this file.
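The flagging rule described above can be sketched in a few lines of Python. This is only an illustration of the stated threshold (raw intensity >65000 in either channel), not the code used to produce your files; the function name and argument names are assumptions made for this example.

```python
# Illustrative sketch only: names are hypothetical, the threshold is the
# one quoted in the text above (raw intensity > 65000 in either channel).
SATURATION_THRESHOLD = 65000


def saturation_flag(cy3_raw, cy5_raw, threshold=SATURATION_THRESHOLD):
    """Return 1 if either channel exceeds the threshold, otherwise 0."""
    return 1 if (cy3_raw > threshold or cy5_raw > threshold) else 0


# One flag per slide, in the same order as the slides (values invented).
raw_pairs = [(65535, 1200), (800, 950), (64000, 65001)]
flags = [saturation_flag(cy3, cy5) for cy3, cy5 in raw_pairs]
print(flags)
```

A spot at exactly 65000 is not flagged under this reading of ">65000".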

Male vs female example:

The raw data file of slide S106404 (S106404.state.dat) contains the foreground median intensity of the female sample in the fgMedian1 column (Cy3) and the male sample in the fgMedian2 column (Cy5). However, for the dye swap slide S106438, the male sample is given in the fgMedian1 column and the female sample in the fgMedian2 column. Therefore always check the swap status before analysing the raw data.
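As a small illustration of this check, the Python sketch below looks up which raw data column holds a given sample type on a given slide. The channel assignments are copied from the example info file for project P99934; the dictionary layout and the function name are assumptions made for this example, and the sketch does not read the Pnnnnn_info.txt file itself.

```python
# Channel assignments taken from the P99934 example info file.
# The data structure and helper name are hypothetical.
SLIDES = {
    "S106438": {"Cy3": "male", "Cy5": "female", "swap": 1},
    "S106404": {"Cy3": "female", "Cy5": "male", "swap": 0},
    "S106472": {"Cy3": "female", "Cy5": "male", "swap": 0},
    "S106430": {"Cy3": "male", "Cy5": "female", "swap": 1},
}

# In Snnnnnn.state.dat, channel 1 is Cy3 and channel 2 is Cy5.
CHANNEL_COLUMN = {"Cy3": "fgMedian1", "Cy5": "fgMedian2"}


def column_for(sample_type, slide):
    """Return the foreground median column holding this sample type."""
    entry = SLIDES[slide]
    for channel in ("Cy3", "Cy5"):
        if entry[channel] == sample_type:
            return CHANNEL_COLUMN[channel]
    raise ValueError(f"{sample_type!r} is not on slide {slide}")


print(column_for("male", "S106404"))  # normal orientation
print(column_for("male", "S106438"))  # dye swap slide
```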


6. Normalisation

Measured fluorescent spot signals differ systematically between microarray hybridisations and between dyes. These differences include differences in background fluorescence and in overall brightness, e.g., one dye being twice as bright as the other. The process of correcting for such systematic differences is called normalisation. We do not apply a background correction method, since this may significantly increase the overall variance of the data. Depending on the quality of the data, three different methods of data normalisation are available ("vsn", "lquant" or "quant"). The result file name indicates which method was used to normalise your data. All normalisation and quality control steps are performed in Bioconductor using the limma and vsn software packages.

Variance Stabilisation Normalisation (vsn) - our default normalisation method

This method is based on the work published by Huber et al. (2002), Bioinformatics 18(1), S96-104; the corresponding Bioconductor package is called vsn.

A necessary assumption is that more than half of the genes are NOT differentially expressed. For each dye and microarray, the background fluorescence and a factor reflecting overall brightness are inferred so that the signals of this subset of non-differentially expressed genes become identical. For further (technical) information, see Huber et al. (2002) Bioinformatics 18(1), S96-104.

Loess & Quantile normalisation (lquant)

Performs a within-array global loess normalisation of the M-values, followed by a between-array quantile normalisation. These normalisation methods are implemented in the Bioconductor limma software package.

Loess normalisation assumes that the bulk of probes on the array are NOT differentially expressed. It does not assume that there are equal numbers of up- and down-regulated genes or that differential expression is symmetric about zero. For further information, see Yang, Dudoit et al. (2002) Nucleic Acids Res. 30(4), e15.

The aim of quantile normalisation is to ensure that the intensity distributions of all arrays are identical. It involves an initial array-specific centring of the data, after which the centred data on each array are ordered from lowest to highest. A reference distribution is then calculated in which the lowest value is the average of the lowest values from each of the arrays. This calculation is repeated for each subsequent rank, up to the average of the highest values from each of the arrays. Each measurement on each array is then replaced with the value of the corresponding rank in the reference distribution. For further (technical) information, see Bolstad et al. (2003) Bioinformatics 19(2), S185-93.
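The rank-averaging step can be illustrated with a minimal pure-Python sketch. It is a simplified version of the idea only: it omits the initial centring step and any special handling of tied values, and it is not the limma implementation. The function name and the toy input values are assumptions made for this example.

```python
def quantile_normalise(arrays):
    """Give every array the same distribution (simplified sketch).

    arrays: list of equal-length lists, one list per array/channel.
    Ties and the array-specific centring step are not handled here.
    """
    n_spots = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [
        sum(values[rank] for values in sorted_arrays) / len(arrays)
        for rank in range(n_spots)
    ]
    normalised = []
    for values in arrays:
        # Indices of the spots, ordered from lowest to highest value.
        order = sorted(range(n_spots), key=lambda i: values[i])
        out = [0.0] * n_spots
        for rank, spot_index in enumerate(order):
            out[spot_index] = reference[rank]
        normalised.append(out)
    return normalised


result = quantile_normalise([[5, 2, 3], [4, 1, 6]])
print(result)
```

After this step every array contains exactly the same set of values, assigned according to the original rank of each spot within its array.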

Quantile normalisation (quant)

Performs a between-array quantile normalisation that does not correct for within slide bias. This normalisation method is implemented in the Bioconductor limma software package.


7. Result files: analysis

We perform a very basic statistical analysis of your data using different analysis tools. This includes a p-value estimation for direct comparisons, e.g., sample versus control. For other experimental designs, e.g., reference designs or time-course analyses, we do not routinely perform a statistical analysis. We have listed some analysis tools in section 11 that can be used for these designs.

Result file names consist of the replicate group number, normalisation method identifier and the statistical software used for the analysis. For each replicate group of your project we produce summary files which describe the nature of each spot, the transformed normalised intensity differences between the Cy5 and Cy3 channels for the replicate slides (M-values), and the transformed normalised average intensities of the replicate slides (A-values). Normalised data have the dye-swap taken into account; therefore, all M-values should be treated as if there had been no dye swap.

M and A value definition

Differential expression is presented as a ratio of Cy5 over Cy3. A more symmetric (i.e., a "Gaussian" or "normal-like") distribution is achieved by log-transformation on a log2 scale.

The log differential expression ratio for each spot or 'M' is calculated as follows:

M = log2(Cy5 / Cy3) or, equivalently, M = log2(Cy5) - log2(Cy3)

The log intensity of the spot or 'A' (a measure of the overall brightness of the spot) is:

A = log2(Cy5 * Cy3) / 2 or, equivalently, A = (log2(Cy5) + log2(Cy3)) / 2

In our files, M-values represent:

Positive M-values indicate an increase in relative intensity (Cy5 greater than Cy3), negative values indicate a decrease in relative intensity (Cy5 less than Cy3). Remember that both M and A-values are log2 transformed. Numbers of equal value but opposite sign indicate equivalent fold changes up and down respectively.

Note: Since dye-swaps have been taken into account, the numbers across all replicate slides are comparable and should ideally change in the same direction.
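The definitions above can be written out as a short Python sketch. It applies the plain log2 formulas to a pair of raw intensities and reverses the sign of M for a dye-swapped slide so that all slides point in the same direction. This only illustrates the definitions; it is not the vsn or limma normalisation, and the function names are assumptions made for this example.

```python
import math


def ma_values(cy3, cy5, swap_status=0):
    """Return (M, A) for one spot from two positive intensities.

    M = log2(Cy5) - log2(Cy3); A = (log2(Cy5) + log2(Cy3)) / 2.
    For a dye swap slide (swap_status == 1) the sign of M is reversed.
    """
    log_cy3 = math.log2(cy3)
    log_cy5 = math.log2(cy5)
    m = log_cy5 - log_cy3
    a = (log_cy5 + log_cy3) / 2
    if swap_status == 1:
        m = -m
    return m, a


def fold_change(m):
    """Convert an M-value to an unsigned fold change (2 to the power |M|)."""
    return 2 ** abs(m)


print(ma_values(256, 1024))                 # Cy5 four times brighter
print(ma_values(256, 1024, swap_status=1))  # same spot on a swap slide
print(fold_change(-8))
```

M-values of +2 and -2 both correspond to a four-fold change, up and down respectively.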

Rnnnnn.vsn.limma.fdr.txt

Limma stands for Linear Models for Microarray Data. For each gene, it fits a linear model to the expression data and employs an empirical Bayes method to stabilise the analysis.

A design matrix and a contrast matrix have to be established for the data of a given replicate group. In a paired-data design the number of coefficients is one fewer than the number of RNA sources (e.g. for wildtype vs mutant, the number of coefficients equals 1). The first step is to fit a linear model that describes the systematic part of the data. Moderated t-statistics are then calculated for each probe and each contrast. A moderated t-statistic allows the same interpretation as an ordinary t-statistic, except that the standard errors have been moderated across genes, i.e. shrunk towards a common value by using a simple Bayesian model. P-values are then adjusted using Benjamini and Hochberg's step-up method for controlling the false discovery rate (FDR). For more information, see Smyth (2004) Statistical Applications in Genetics and Molecular Biology, 3:Article 3.
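To make the final adjustment step more concrete, here is a minimal Python sketch of the Benjamini and Hochberg step-up adjustment. It is a generic textbook version written for illustration, not the limma code, and the function name and the p-values are invented for this example.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    n = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Step up from the largest p-value, keeping the adjusted values monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Genes whose adjusted p-value falls below the chosen cut-off (e.g. 0.05) are then called at that false discovery rate.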

The first few columns contain the gene information.

The next column indicates whether a spot was saturated during the scanning process. The scanner has a pixel intensity range from 1 (2^0) to 65535 (2^16 - 1). Image brightness can be changed by choosing a different PMT gain. We try to reduce the number of saturated spots; however, when scanning at lower PMT gains there is a trade-off between having few saturated spots and losing low-intensity spots. Spots with raw intensities >65000 in one or both channels are flagged with "1" in the SaturationFlag column.

The next column provides an indication of the spot quality. This can be used to determine whether the reported expression changes are reliable or subject to error due to problems with either the printing, hybridisation, or spot-finding.

The following columns are specific for the limma analysis.

The B-statistic (lods or B) is the log-odds that the gene is differentially expressed. Suppose, for example, that B = 1.5. The odds of differential expression are exp(1.5) = 4.48, i.e., about four and a half to one. The probability that the gene is differentially expressed is 4.48/(1+4.48) = 0.82, i.e., the probability is about 82% that this gene is differentially expressed. A B-statistic of zero corresponds to a 50-50 chance that the gene is differentially expressed.
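The conversion in this worked example can be reproduced with a couple of lines of Python. The function name is an assumption made for this illustration.

```python
import math


def b_to_probability(b):
    """Convert a B-statistic (log-odds) into a probability."""
    odds = math.exp(b)
    return odds / (1 + odds)


print(round(b_to_probability(1.5), 2))
print(b_to_probability(0.0))
```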

Pnnnnn.vsn.matrix.txt

For your convenience we have combined the data of all replicate groups of your project, including the (adjusted) p-values of the limma analysis, within one file. This file facilitates a comparison between the different replicate groups. The replicate group number has been added to each column header to clarify which replicate group the M-values, A-values or p-values belong to.

Pnnnnn.saturated.matrix.txt

Saturated spots (raw intensity value > 65000) are flagged in the result files. If a gene is saturated in one channel but not in another the ratio is compressed. To facilitate reviewing which genes were saturated in which channel and on which slide, we have created the Pnnnnn.saturated.matrix.txt which lists all spots which have been flagged.

The first few columns contain again the gene information, then the raw intensity values for each slide are listed.

Note: The dye-swap has not been taken into consideration, as this file represents raw data.

Rnnnnn.vsn.siggenes.mat.fdr.table.txt (optional)

We can perform the popular Significance Analysis of Microarrays (SAM) using the siggenes software package in Bioconductor. SAM identifies genes with statistically significant changes in expression by assimilating a set of gene-specific t-tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene. Genes with scores beyond a certain threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analysing permutations of the measurements. The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. For further information, see Tusher et al. (2001) PNAS 98(9), S5116-21.

The analysis for a replicate group creates a table (Rnnnnn.vsn.siggenes.mat.fdr.table.txt) stating the number of called genes, the number of false genes and the FDR for a range of delta values. We use an FDR cut-off value of 0.05 as default. However, in some cases no significant genes are detected at this FDR cut-off.

Note: If the siggenes analysis doesn't identify any significant genes, no siggenes files will be produced for your project. In that case you will only receive the limma results.

Note: The table displays only a selection of delta values, so the most accurate delta value for your data might not be shown. The actual p0, Called and False values for your data can be found in the Rnnnnn.vsn.siggenes.delta-nn.png plot (described in the QC section). The expected number of false positives is given by p0 x False such that the number of falsely called genes denoted by False is only equal to the expected number of false positives if p0 = 1.

Rnnnnn.vsn.siggenes.delta.nn.txt (optional)

We use an FDR cut-off value of 0.05 as default. This may lead to no significantly differentially expressed genes being identified, in which case no siggenes file is created.

This file contains only the significant genes (and not the complete gene list). Again, the first few columns contain the gene information, followed by the M- and A-values for each slide. Please refer to the Rnnnnn.vsn.limma.fdr.txt section for specific details. The following columns are specific to the siggenes analysis.

Pnnnnn.vsn.matrix.siggenes.txt (optional)

For your convenience we have combined the data of all replicate groups of your project, including the (adjusted) p-values of both limma and siggenes, within one file. This file facilitates a comparison between the different analysis tools and replicate groups. The replicate group number has been added to each column header to clarify which replicate group the M-values, A-values or p-values belong to.

Deciding which genes are differentially expressed

Since we do not rank the data, there are several points to consider before deciding whether a gene in your experiment is differentially expressed.


Male vs female example files

Below are extracts of the result files of the male vs female example. The ratio defined in the Pnnnnn_info.txt is male/female. After sorting the files by the averageM column, the female-specific genes are displayed at the top of the list. Yp2 (Yolk protein 2) has an averageM value of -8, which means that the expression of this gene is 256-fold (2^8) higher in females than in males in this particular experiment. For Yp3 you can see that at least one of the channels was saturated on all 4 slides.

Siggenes analysis creates two files:

In our example an FDR cut-off of 0.05 suggested a delta value of 0.54853. The R10342.vsn.siggenes.mat.fdr.table.txt shown below only gives a selection of delta values, so the most accurate delta value for your data might not be displayed. For the sake of explanation, let us assume that a delta value of 3.1 was used for the analysis. The table shows that for this delta value 2126 genes were called. The expected number of false positives is given by p0 x False, here: 0.2 x 263.5 = 52.7. The R10342.vsn.siggenes.delta-nn.tab file will therefore contain only these 2126 genes (and not the complete gene list). The actual p0, Called and False values for your data can be found in the R10342.vsn.siggenes.delta-0.54853.png plot (described in the QC section).
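The arithmetic of this example can be checked with the short Python sketch below. It uses the numbers quoted in the paragraph above (p0 = 0.2, False = 263.5, Called = 2126). Dividing the expected number of false positives by the number of called genes to obtain the FDR is how we understand the FDR column of the siggenes table to be computed; treat that last step as an assumption and compare it with the FDR reported in your own table. The function names are invented for this illustration.

```python
def expected_false_positives(p0, false_genes):
    """Expected number of false positives: p0 x False."""
    return p0 * false_genes


def estimated_fdr(p0, false_genes, called):
    """Assumed FDR estimate: (p0 x False) / Called."""
    return expected_false_positives(p0, false_genes) / called


false_positives = expected_false_positives(0.2, 263.5)
fdr = estimated_fdr(0.2, 263.5, 2126)
print(round(false_positives, 1))
print(round(fdr, 4))
```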

Summary file of the project:

There were two saturated spots in this project:



8. Quality control & Diagnostic plots: plots

All RNA samples of your project have to pass an RNA quality control test. Subsequently, the good-quality RNA samples are mixed with spike control RNA in a reverse transcription or amplification reaction and labelled. After the hybridisation the signals of these spike control spots are checked to verify successful labelling and hybridisation. The quality of our printed arrays within each print batch is ascertained by staining a random sample of slides. We perform quality control of the data before and after the normalisation and are able to identify problem slides. Any problem we identify will be reported in the corresponding comment column in the project info file. Depending on the severity of the problem we will occasionally exclude a slide from the analysis (this will be stated in the project info file). Minor problems (e.g. a high local background level on a slide) can be solved by rejecting spots in the affected array area and will result in an RR-flag for these genes in your file.

Note: The data on all plots show the dye-swap.

The following examples were primarily taken from the aforementioned male versus female example. The plots were created using the following packages in Bioconductor: limma, vsn, marray and siggenes. For reasons of brevity, not all images for all slides have been displayed.

Background and Foreground images

FL003 arrays were printed using a print head with a 4x12 arrangement of print-tips, so the microarrays are partitioned into a 4x12 grid of tip groups (FL003 design). Each grid was printed with a single print-tip. It is interesting to look at the variation of background and foreground intensity values across the array. For each slide we create three images displaying, from left to right, the log2 transformed raw intensity values of the green (Cy3) channel, red minus green, and the red (Cy5) channel. The darker the colour, the higher the signal intensity. In the background images, slide S108122 shows high background, mostly on the left side of the array. The foreground plots indicate that this background will cause problems when looking at the ratio (red minus green). We reject spots in areas with high background during the spot finding; these spots will have RR-flags in the result files.

Rnnnnn.raw.Backgound.png
Rnnnnn.raw.Foregound.png

Scatter plots with correlation coefficients

The following plots are created for the raw data as well as the normalised data. Only spots with good quality flags on all slides are included; spots that carry an RR-flag on any slide are removed. In the upper right panels, pair-wise scatter plots of the raw intensity values of each channel within a replicate group are shown. The lower panels show the correlation coefficients. Channels which contain the same sample type should be close to the red line and have a correlation coefficient close to 1. Below is the raw scatter plot for the male vs female example, showing that there are large differences between the samples. Channels S106438.Cy3 and S106472.Cy5 both contain male samples.

R10342.raw.scatterplots.png

MA plots

MA-plots show M, the log differential expression ratio (i.e., the expression ratio), against A, the mean log intensity of the two channels of the slide (i.e., the spot signal). In the raw plots control spots are colour coded (see table below), bad quality spots [RR or RS-flagged] are not plotted and low intensity spots show larger variance (left side of the plot). The data appear symmetrical around the horizontal line if the two channels behave similarly. After normalisation only gene spots are plotted (control spots are omitted). The example below shows two dye-swap slides; the M-values (ratios) of the second slide are therefore a mirror image of those of the first slide.

Spot Color    Spot Type
Black         Gene spots
Green         Controls (e.g. degradation probes, FLPase, LacZ, modified_GFP, Gal4_cds)
Orange        Spikes
Yellow        Empty
Red           Spotting Buffer

R10342.raw.MAplots.png

After normalisation only gene spots are plotted (control and bad quality spots are omitted). The red dotted line shows the loess fit.

R10342.vsn.MAplots.png

MAprinttip plots

Raw data MA-plots, with colour-coded lowess fits for each print-tip-group. These plots can highlight printing or hybridisation artefacts of the array. Ideally all lines should be close to the zero line.

R10342.raw.MAprintipplots.png

Boxplots

Boxplots of the M-distribution per print-tip-group can be useful to identify spot or hybridisation artefacts. The central box in the plot represents the inter-quartile range (IQR), which is defined as the difference between the 75th percentile and the 25th percentile, i.e., the upper and lower quartiles. The line in the middle of the box represents the 50th percentile, i.e., the median. Extreme values, more than 1.5 IQR above the 75th percentile or more than 1.5 IQR below the 25th percentile, are typically plotted as individual data points.
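The quantities behind a boxplot can be illustrated with a short Python sketch using the standard library. The quartile convention used here (the "inclusive" method of `statistics.quantiles`) is an assumption and may differ slightly from the convention used by the plotting software; the function name and the input values are invented for this example.

```python
import statistics


def boxplot_summary(values):
    """Return quartiles, IQR and the values plotted as individual points."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    extremes = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "extremes": extremes}


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```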

R10342.raw.Boxplots.png

Additionally, we create box plots of the overall M-distribution for each slide of the raw and normalised data. After normalisation the median should be centred around zero and the spread should be similar between the slides.

R10342.vsn.Boxplot.png

Density plots

Density plots display smoothed empirical densities for the individual green and red channels. Without any normalisation there is considerable variation between both channels and between arrays.

R10342.raw.Densityplots.png

Summary plots of the densities of all slides within this replicate group before and after normalisation.

R10342.vsn.Densityplots.png

Mhist and Ahist plots

The histograms for the M- and A-values should be similar for slides within a replicate group. In the Ahist-plot of slide S106404, as in the MA-plot of this slide, some of the low intensity spots are causing a distortion visible as a peak to the left of the graph.

R10342.vsn.Mhist.png

R10342.vsn.Ahist.png

Sample Cluster Dendrogram

All samples are normalised together and cluster dendrograms are calculated using the Euclidean distance method. Identical sample pairs (biological replicate samples) should cluster together, allowing outliers to be detected. However, if the differences between the sample types within the project are very small, the dye effect may be stronger than the sample difference, so that the samples cluster by dye.
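The idea behind the distance measure can be sketched in Python as follows. The sketch computes Euclidean distances between sample profiles and reports the closest partner of each sample; it does not build a full dendrogram and is not the code used for the plot. The sample names follow the male vs female example, but the intensity values are invented purely for illustration, and the function name is an assumption.

```python
import math


def nearest_neighbours(profiles):
    """Map each sample to the sample with the smallest Euclidean distance."""
    nearest = {}
    for name, values in profiles.items():
        distances = [
            (math.dist(values, other_values), other_name)
            for other_name, other_values in profiles.items()
            if other_name != name
        ]
        nearest[name] = min(distances)[1]
    return nearest


# Invented normalised intensities for three genes per sample.
profiles = {
    "male_1": [8.0, 12.1, 6.5],
    "male_2": [8.2, 11.9, 6.4],
    "female_1": [13.5, 7.0, 6.6],
    "female_2": [13.1, 7.3, 6.7],
}
neighbours = nearest_neighbours(profiles)
print(neighbours)
```

In this toy data set each sample is closest to the other sample of the same type, which is the pattern expected for biological replicates.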

P99934.sample_cluster.png

Standard deviation versus rank of the mean and the mean plot (only for vsn normalised data)

The aim of these plots is to determine whether there is a systematic trend in the standard deviation of the data as a function of overall expression. The assumption that underlies the usefulness of these plots is that most genes are not differentially expressed; consequently, the running median should be a reasonable estimator of the standard deviation of feature level data conditional on the mean. The red dots show the running median of the standard deviation. The curve given by the red line is an estimate of the systematic dependence of the standard deviation on the mean. After Variance Stabilisation Normalisation, this should be a horizontal line. It may have some random fluctuations, but should not show an overall trend. If this is not the case, this usually indicates a data quality problem or is a consequence of inadequate prior data pre-processing. The rank ordering distributes the data evenly along the x-axis.

R10342.vsn.stdev_vs_mean.png

Limma Volcano plot

A volcano-plot of log-fold-change versus a non-negative statistic that tests the significance of the fold change. Since the size of the statistic tends to increase with absolute log-fold change, such a plot has the characteristic shape of the open crater of a volcano. Volcano plots are sometimes used to emphasise that variability plays a role in significance as well as fold-change. Coloured dots represent probes that show large and very significant fold changes (adjusted p-value < 0.05).

R10342.vsn.Limma.fdr.png

Siggenes plot

Visualisation of the SAM plot using the delta value specified in the analysis. The green dots in the distribution represent the up-regulated (top) and down-regulated (bottom) genes. The actual p0, Called and False values for your data are given in this plot, and you can calculate the number of false positives for your data as p0 x False.

R10342.vsn.Siggenes.delta1.5.png

9. Gene Expression Omnibus submission: GEO

This folder contains a plain text file for each slide (Snnnnnn.GEO.txt) which can be used to submit your data to the Gene Expression Omnibus (GEO). These files contain the raw intensity and normalised M- and A-values for each gene, the GEO platform accession for the array type and the protocols we have used to process your samples. You will need to add the sample-specific details, e.g. growing conditions. We have marked which entries are [required] and which are [optional]. Please refer to the GEO SOFT submission instructions for the required details.


10. Project_report.doc

The project_report file contains information specific to your project. In this file you can find a list of the protocols we have used during the processing of your project. Any comments regarding the data can be found here.


11. Useful analysis and visualisation tools

There are many tools available for normalisation and statistical analysis of microarray data. See the data analysis page for an introduction to data analysis and links to useful tools.

Data analysis depends on the experimental design and will be different for each project. However, to get you started we have listed below links that might be useful. Please refer to the tool specific software documentation for details.

p-value calculation

Identify differentially expressed genes and estimate False Discovery Rates.

Tool        Type                 Direct design    Time-course design    Reference design
CyberT      online tool          yes              -                     yes
GEPAS       online tool          yes              yes                   yes
limma       bioconductor tool    yes              yes                   yes
siggenes    bioconductor tool    yes              yes                   yes
maSigPro    bioconductor tool    -                yes                   -

Clustering

Identify genes with similar expression profiles using clustering.

Tool                      Type
Cluster 3.0 & TreeView    free download tools
maSigPro                  bioconductor tool
GEPAS                     online tool

12. Glossary of terms

Version 3.0. B.Fischer (03-08-2011)