Which normalization is best

When we face this challenge, the solution is normalization. Perhaps the most common type of normalization is the z-score. In simple terms, a z-score expresses each data point as the number of standard deviations it lies from the mean. The formula is:

z = (x − μ) / σ

where x is the data point, μ is the mean of the feature, and σ is its standard deviation. Imagine you run a wholesale watch company and you would like to normalize data on the number of orders each client places in a year and the average price of those orders.

A snippet of your database would list, for each client, the yearly number of orders and the average order price. Most computer programs will calculate the z-scores for you by applying the formula shown above. Linear normalization is arguably the easiest and most flexible normalization technique.
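
As a minimal sketch of this calculation, assuming a small pandas DataFrame with made-up clients and values (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical snippet of the wholesale watch database
df = pd.DataFrame({
    "orders_per_year": [120, 45, 300, 80, 150],
    "avg_order_price": [250.0, 900.0, 120.0, 410.0, 75.0],
})

# z-score: subtract each column's mean and divide by its standard deviation
z_scores = (df - df.mean()) / df.std()
print(z_scores)
```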

This is a good starting strategy, and in practice analysts can rescale data points to any base once they have completed linear normalization. The idea is to shift each value by the minimum and divide by the observed range:

x_scaled = (x − min) / (max − min)

so that every value falls between 0 and 1. If we want a different target range, we simply multiply or divide the fraction by whatever constant maps it onto that range. Linear scaling works well for a feature such as age: most age values fall between 0 and 90, and every part of the range has a substantial number of people.
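
A minimal sketch of min-max scaling on a hypothetical list of ages:

```python
import numpy as np

ages = np.array([2, 15, 27, 36, 45, 58, 63, 71, 88])  # made-up ages for illustration

# Linear (min-max) normalization: rescale the observed range to [0, 1]
scaled = (ages - ages.min()) / (ages.max() - ages.min())

# To target a different range, e.g. [0, 100], multiply the fraction accordingly
scaled_0_100 = scaled * 100
print(scaled, scaled_0_100)
```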

In contrast, you would not use linear scaling on income, because only a few people have very high incomes. The upper bound of the linear scale for income would be very high, and most people would be squeezed into a small part of the scale. If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain threshold at a fixed value. For example, you could clip all temperature values above 40 to be exactly 40. Log scaling is helpful when a handful of your values account for most of the observations, while most other values have only a few.

This data distribution is known as a power law distribution. Movie ratings are a good example: most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head). Log scaling changes the distribution, helping to improve linear model performance.
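
A minimal numpy sketch of feature clipping and log scaling, using made-up values:

```python
import numpy as np

temperatures = np.array([12.0, 25.5, 38.0, 41.2, 55.0])  # hypothetical readings
rating_counts = np.array([3, 1, 7, 2, 15000, 4, 980])    # hypothetical per-movie rating counts

# Feature clipping: cap every value above 40 at exactly 40
clipped = np.clip(temperatures, a_min=None, a_max=40.0)

# Log scaling: compress a heavy-tailed (power-law-like) distribution
log_scaled = np.log1p(rating_counts)  # log(1 + x) avoids log(0)

print(clipped)
print(log_scaled)
```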

All methods were run using the default settings of their respective R packages. Table 2 summarizes the normalization methods, including a basic description of each and whether it uses spike-in gene information. SAMseq is a nonparametric approach to differential expression in bulk sequencing, designed to be resistant to outliers (Li and Tibshirani). It uses the two-sample Wilcoxon test and Poisson resampling to detect genes associated with an outcome in RNA-seq or other sequencing-based comparative genomic experiments.

SAMstrt builds on this method and alters the sequencing-depth estimation by assuming an equivalent number of spike-in molecules per cell in each condition. In other words, SAMstrt extends SAMseq by accounting for and correcting differences in sequenced spike-in reads across cells, even in the presence of highly variable sequencing depths. SAMstrt is effective at identifying highly expressed features, though it depends on the total number of reads per spike-in gene across all cells, with higher performance expected at deeper sequencing depths.

BASiCS uses a fully Bayesian approach that separates data variation into gene-specific constants, technical noise, and biological variation. Though it uses spike-in genes, it does not rely on them exclusively to determine technical variation, instead jointly modeling spike-in and biological genes.

For normalization, BASiCS first uses the spike-in genes to build a hierarchical model for technical variation, with $X_{ij}$ representing the expression count of spike-in gene i in cell j, based on Poisson and Gamma distributions.

Once established for the spike-in genes, this model is expanded to account for the biological genes. Note that cell-size factors apply only to the biological genes; spike-in genes are added in equal amounts to each cell regardless of cell size, so they are independent of such effects.
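
As a rough schematic of this hierarchy (a sketch only: the Gamma hyper-parameters are left unspecified, and the exact parameterization is the one given in the BASiCS publication, not this outline):

```latex
% Spike-in gene i in cell j: technical variation only
X_{ij} \mid \mu_i, \nu_j \sim \mathrm{Poisson}(\mu_i \,\nu_j), \qquad
\nu_j \sim \mathrm{Gamma}(\cdot,\cdot) \quad \text{(cell-specific technical noise)}

% Biological gene i in cell j: technical noise, cell size, and biological variability
X_{ij} \mid \mu_i, \phi_j, \nu_j, \rho_{ij} \sim \mathrm{Poisson}(\phi_j \,\nu_j \,\mu_i \,\rho_{ij}), \qquad
\rho_{ij} \sim \mathrm{Gamma}(\cdot,\cdot) \quad \text{(biological variation)}
```

Here $\phi_j$ is the cell-size factor that, as noted above, applies only to the biological genes.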

The Gamma Regression Model (GRM) uses spike-in genes to construct a model that accounts for differences between observed and known spike-in concentrations.

Notably, it predicts concentration as a function of the fragments per kilobase of transcript per million mapped reads (FPKM). To account for a wide range of gene concentrations, GRM fits a gamma regression of the log-transformed concentrations against the log-transformed FPKM values of the spike-in genes.

Furthermore, unlike other normalization methods, GRM is applied to cells individually rather than simultaneously. Once the regression model is fitted for an individual cell, the expected concentration of each biological gene is computed from its FPKM value; the regression parameters are estimated by maximum likelihood for the gamma regression, and the normalized expression of gene i is then obtained from the observed expression and this expected concentration.
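
A minimal per-cell sketch of this idea, assuming arrays of spike-in FPKM values, known spike-in concentrations, and biological-gene FPKM and expression values are available (an illustration of the general approach, not the published GRM implementation):

```python
import numpy as np
import statsmodels.api as sm

def grm_normalize_cell(spike_fpkm, spike_conc, gene_fpkm, gene_expr):
    """Sketch of GRM-style normalization for a single cell."""
    # Gamma regression with a log link: concentration as a function of log FPKM,
    # fitted on the spike-in genes of this cell
    X = sm.add_constant(np.log(spike_fpkm))
    fit = sm.GLM(spike_conc, X,
                 family=sm.families.Gamma(link=sm.families.links.Log())).fit()

    # Expected concentration of each biological gene given its FPKM
    X_new = sm.add_constant(np.log(gene_fpkm))
    expected_conc = fit.predict(X_new)

    # Normalize the observed expression by the expected concentration
    # (an illustrative choice for the final scaling step)
    return gene_expr / expected_conc
```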

GRM therefore performs normalization and denoising within individual cells. As such, it is primarily used for its denoising effect, since it does not normalize directly across cells. The scran method uses a deconvolution approach that partitions cells into pools, normalizes across the cells in each pool, and then solves the resulting system of linear equations to derive size factors for individual cells.

This method takes extra steps to counteract assumptions about differentially expressed (DE) genes by estimating a size-factor term for each cell pool and using these estimates to approximate a size-factor term for individual cells. Cell pools are designed to have comparable library sizes to account for potential estimation errors. By keeping pool sizes consistent, estimation errors are less likely to disproportionately affect extreme cells under the same condition.

Now, for pool k with an arbitrary cell set $S_k$, let $V_{ik}$ be the sum of $Y_{ij}$ across all cells in the set, $U_i$ the mean of $Y_{ij}$ across the entire data set of N cells, and $R_{ik}$ the ratio of $V_{ik}$ to $U_i$, so that

$$V_{ik} = \sum_{j \in S_k} Y_{ij}, \qquad U_i = \frac{1}{N}\sum_{j=1}^{N} Y_{ij}, \qquad R_{ik} = \frac{V_{ik}}{U_i}.$$

This yields the pool-based size factor $R_{ik}$, whose estimates can be propagated back to individual cells accurately.
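
A highly simplified sketch of the deconvolution step, assuming a genes-by-cells count matrix and a list of precomputed cell pools (an illustration of the pool-and-solve idea only, not the scran package):

```python
import numpy as np

def deconvolve_size_factors(counts, pools):
    """counts: genes x cells matrix; pools: list of lists of cell (column) indices."""
    n_cells = counts.shape[1]
    ref = counts.mean(axis=1)                    # U_i: mean expression over all cells
    keep = ref > 0

    A = np.zeros((len(pools), n_cells))          # pool membership matrix
    r = np.zeros(len(pools))                     # pool-based size factors
    for k, pool in enumerate(pools):
        V = counts[:, pool].sum(axis=1)          # V_ik: pooled counts for pool k
        r[k] = np.median(V[keep] / ref[keep])    # robust pool-level size factor
        A[k, pool] = 1.0                         # pool factor ~ sum of its cells' factors

    # Solve the over-determined linear system A @ theta ~= r for per-cell factors
    theta, *_ = np.linalg.lstsq(A, r, rcond=None)
    return theta / np.mean(theta)                # rescale so the factors average to 1
```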

The pool sizes chosen for scran are restricted by the number of cells in each condition within a data set. The authors recommend that the number of cells be at least 20, though smaller sizes may be possible. A larger number of cells also increases computational complexity, which may affect execution time for some data sets. SCnorm, a more recently developed method, does not rely on the global scaling factors that many other methods use for normalization.

Instead, it uses two layers of quantile regression to group genes effectively and estimate their dependence on sequencing depth. While it does not require spike-ins, SCnorm can use them to improve its accuracy, provided the spike-in genes cover a range of expression similar to that of the biological genes in the study. For SCnorm, let $X_{i,j}$ be the log nonzero expression count for gene i in cell j, and $D_j$ the log sequencing depth of cell j.

The gene-specific relationship between log raw expression and log sequencing depth is estimated with median quantile regression; the scale factor $SF_j$ for cell j and the normalized counts $Y_{i,j}$ are then defined from the fitted regression. If the mode of the slopes within any pool exceeds a preset threshold, the current grouping is considered insufficient, and SCnorm sequentially increases the number K of pools until the count-depth relationship no longer varies significantly among them.
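
A minimal sketch of the first layer, gene-wise median quantile regression of log expression on log depth (an illustration of the count-depth relationship estimate, not the SCnorm package):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

def count_depth_slopes(log_counts, log_depth):
    """log_counts: genes x cells array of log expression, with NaN where the count is zero;
    log_depth: per-cell log sequencing depths. Returns one count-depth slope per gene."""
    X = sm.add_constant(log_depth)
    slopes = []
    for x_i in log_counts:
        nonzero = np.isfinite(x_i)                            # use only nonzero counts
        res = QuantReg(x_i[nonzero], X[nonzero]).fit(q=0.5)   # median (tau = 0.5) regression
        slopes.append(res.params[1])                          # gene-specific slope on log depth
    return np.array(slopes)
```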

Depending on the nature of the data set, the number of pools necessary can greatly vary, and data sets of similar size may have noticeably different execution times as a result.

In particular, an exceptionally large number of zeros in the data may require a large number of pools before convergence, increasing the total computational time. Linnorm is a newer method designed for both normalization and transformation of scRNA-seq data, though for the purposes of this paper we consider the normalization portion only.

It first identifies stable genes, i.e., genes that are expressed consistently across different cells. When only performing normalization, Linnorm then defines a normalized expression level $G_{ij}$ for gene i in cell j based on these stable genes. Linnorm depends on the assumption that genes are homogeneously expressed across different cells, and genes with low expression can introduce skewness that may violate this assumption. Also, though it relies only on stable genes to determine the normalization parameters, these parameters are applied universally, which may introduce additional variation for less stable genes.

In addition to the above methods, we obtain a baseline comparison for normalization through the Seurat package (Satija et al.). Although this paper focuses on the performance of single-cell-based normalization methods, an additional section comparing the performance of two popular bulk-based normalization methods is included. The analysis for bulk methods can be found in section 4 of the Supplementary Materials.

For the spike-in gene case, we chose two scRNA-seq mouse data sets, one on embryonic stem cells and fibroblasts and the other on four developmental stages of epithelial lung cells. We also simulated a data set based on the stem cell study. For the non-spike-in gene case, we used a human data set with three cell-cycle states.

The first data set, drawn from a study by Islam et al., compares mouse embryonic stem cells and fibroblasts. This data set also includes a set of 8 spike-in genes (synthetic control mRNAs). The second spike-in real data set originates from a study by Treutlein et al. covering four developmental stages of epithelial lung cells. The simulated data set is based on the mouse embryonic data. Using an approach similar to an existing study on scRNA-seq simulation for normalization (Vallejos et al.), and based on the mean counts for genes across all cells, global scaling factors, and the overdispersion of each gene, we generated new read counts from a negative binomial distribution, mimicking scRNA-seq data.

In this instance, global scaling factors refer to constant factors by which expression within each cell is scaled in order to remove cell-specific biases (Vallejos et al.). These factors assume that read counts in a cell have an expected value proportional to both gene-specific expression and the cell-specific scaling factor. This setup allows us to observe each method's sensitivity to various levels of differential expression.
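
A small sketch of this simulation idea, assuming per-gene means, per-cell scaling factors, and per-gene overdispersions have already been estimated (the values below are made up; this is not the authors' simulation code):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(gene_means, cell_factors, dispersions):
    """Generate a genes x cells count matrix from a negative binomial model."""
    n_genes, n_cells = len(gene_means), len(cell_factors)
    counts = np.empty((n_genes, n_cells), dtype=int)
    for i in range(n_genes):
        # Expected count: gene-specific expression times the cell-specific scaling factor
        mu = gene_means[i] * cell_factors
        size = 1.0 / dispersions[i]          # NB "size" parameter from the overdispersion
        p = size / (size + mu)               # success probability giving mean mu
        counts[i] = rng.negative_binomial(size, p)
    return counts

# Hypothetical inputs for illustration only
counts = simulate_counts(
    gene_means=np.array([5.0, 50.0, 0.5]),
    cell_factors=np.array([0.8, 1.0, 1.5, 2.0]),
    dispersions=np.array([0.2, 0.1, 0.5]),
)
```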

In addition to the data sets with provided spike-in genes, we selected a data set specifically for the methods that do not use spike-in genes, to provide a closer comparison. To demonstrate the effectiveness of these methods with more recent protocols, we selected this data set, hereafter referred to as the PBMC data. This data set is representative of 10X Genomics' more recent GemCode platform, which several scRNA-seq studies have adopted and which typically processes much larger numbers of cells at sparser count levels.

As with the human embryonic data set, there are no spike-in genes present, and we again compare only methods that do not use spike-in genes. The data set consists of PBMCs, a subset of which were used in this study, and is part of a larger data set used in a study by Zheng et al. Cell groups for this data set were determined through a clustering pipeline that is part of the Seurat R package.

The chosen normalization methods were first compared on two mouse scRNA-seq data sets. For further comparisons, we also constructed a simulated data set based on one of the real sets. We also include a human data set with three cell-cycle states to test methods not dependent on spike-in genes.

The raw data are noisy and must be cleaned before normalization. Details of the basic cleaning process can be found in the Supplementary Materials. All final plots after normalization are on a log-transformed scale, as employed in similar studies (Chapman et al.). Principal component analysis (PCA), a linear method of dimensionality reduction, is used to demonstrate differences among cells under different conditions within each data set (Wold et al.).

By projecting the data through orthogonal transformations, PCA concentrates most of the variance in a data set within the first few principal components. Very often, plotting the first two principal components reveals separation among cells in the scRNA-seq data. However, because PCA is a linear dimension-reduction method, it is not always able to capture differences among cell groups.
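
A minimal sketch of this visualization step, using a randomly generated genes-by-cells matrix as a stand-in for normalized data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a genes x cells matrix of normalized counts (values are random)
counts = np.random.default_rng(1).poisson(5.0, size=(2000, 300)).astype(float)

log_expr = np.log1p(counts)                 # log-transform, as for the final plots
pca = PCA(n_components=2)
coords = pca.fit_transform(log_expr.T)      # rows become cells; keep the first two PCs

# coords[:, 0] vs coords[:, 1] can then be plotted to look for separation among cell groups
print(pca.explained_variance_ratio_)
```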

Generalized Procrustes Analysis (GPA) is a statistical method widely applied to normalize data shapes (Xiong et al.). GPA consists of three transformations: translation, rotation, and scaling. The optimal transformation in the GPA procedure is the one that minimizes the sum of squared deviations among corresponding data points in the MA-plot. Specifically, GPA is used to minimize the deviation of transcript intensities among arrays.

First, a reference array is set as the median intensity of each transcript over all arrays. Then, in the MA-plot, the data points of each array are translated so that their centroid coincides with that of the reference array. The data points of each array are then rotated and scaled to obtain the minimum residual discrepancy with the reference array. The transformations (translation, rotation, and scaling) are based on global rather than local optimization.
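
A rough sketch of the underlying Procrustes alignment, assuming a list of MA-plot coordinate matrices, one per array (SciPy's procrustes standardizes both inputs, so this illustrates the idea rather than reproducing the published GPA normalization):

```python
import numpy as np
from scipy.spatial import procrustes

def gpa_align_to_reference(arrays):
    """arrays: list of (n_transcripts x 2) MA-plot coordinate matrices, one per array."""
    # Reference array: median intensity of each transcript over all arrays
    reference = np.median(np.stack(arrays), axis=0)

    aligned = []
    for arr in arrays:
        # Translation, rotation, and scaling that minimize the squared deviation
        # between this array and the reference
        _, arr_aligned, _ = procrustes(reference, arr)
        aligned.append(arr_aligned)
    return aligned
```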

In terms of choosing the reference array, GPA normalization uses the median values across all arrays as the common reference, which has been verified to be an appropriate choice that performs better than using any individual array.

Another advantage of GPA normalization over other methods is that it assumes nothing about the data distribution, which makes it applicable to all types of data. A different line of work stressed that normalization is not always necessary for microarray studies, and that traditional normalization methods may produce more false-positive discoveries (Klinglmueller et al.). Non-normalized data were compared to normalized data in the context of a microarray titration experiment, which is designed to produce reliable biological data with a proportion of DEGs larger than what can be simulated using spike-in experiments.

They provide an alternative way to study the robustness of conventional normalization methods against violations of their basic assumptions. The titration experiment highlights some of the pitfalls of microarray data analysis, and several measures for evaluating normalization methods are provided.

Non-normalized data provided higher accuracy and agreement in the titration experiment than the normalized data. Normalization procedures thus pose a tradeoff between accuracy and agreement, as well as between repeatability and power, when the processed array data contain a large proportion of DEGs.

As noted by Wang et al., however, with non-normalized data it is difficult to distinguish biological variation from the technical variation caused by batch effects, limited sampling, probe hybridization conditions, or scanning power. Most researchers therefore still regard normalization as a necessary step before downstream array analysis.

Performing RMA normalization on each pedigree pool separately allows for as many distributions as there are pedigrees, instead of only one reference distribution. RMA removes technical variation using quantile normalization, which assumes that the transcript intensities on each array come from the same distribution; the intensity values are then normalized to a reference distribution.
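
A minimal sketch of the quantile-normalization step itself (illustrative only, not the RMA implementation):

```python
import numpy as np

def quantile_normalize(intensities):
    """intensities: transcripts x arrays matrix. Returns the quantile-normalized matrix."""
    ranks = np.argsort(np.argsort(intensities, axis=0), axis=0)  # rank within each array
    # Reference distribution: mean of the sorted values across arrays
    reference = np.sort(intensities, axis=0).mean(axis=1)
    # Replace every value by the reference value of the same rank
    return reference[ranks]

# Toy matrix: 5 transcripts measured on 3 arrays (made-up values)
x = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0],
              [1.0, 3.0, 1.0]])
print(quantile_normalize(x))
```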

However, the selection of an optimal reference is rather artificial for family data in genomics studies. The familial similarity within a pedigree is fundamental to linkage analysis. Moreover, pedigree data are characterized by more homogeneity within pedigrees than between pedigrees for the studied traits. Therefore, traditional normalization methods may improperly adjust the trait values by imposing identical distributions across pedigrees, which ignores this pedigree property in linkage analysis.

Normalizing the arrays in each pedigree separately maintains the individual familial distributions. From the perspective of gene expression direction, CrossNorm (Cheng et al.) proceeds as follows.

CrossNorm first divides the expression matrix into two submatrices, a cancer matrix and a normal matrix, with rows representing genes and columns representing samples. The two matrices are then combined by sample to generate a cross-matrix. After that, a traditional normalization method, such as quantile normalization, is applied to the cross-matrix. Finally, the normalized matrix is reverted to the original format. CrossNorm guarantees that the columns of the combined expression matrix have the same intensity distribution, which is exactly the basic assumption of the traditional methods.
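
A small sketch of these steps for paired case-control samples (illustrative only, not the published CrossNorm code):

```python
import numpy as np

def crossnorm(cancer, normal, normalize):
    """cancer, normal: genes x samples matrices with matched sample columns;
    normalize: any traditional matrix normalization, e.g. a quantile normalization
    such as the sketch above."""
    n_genes = cancer.shape[0]
    # Combine the paired matrices by sample to form the cross-matrix
    cross = np.concatenate([cancer, normal], axis=0)
    # Apply the traditional normalization to the cross-matrix, whose columns
    # now satisfy the identical-distribution assumption
    cross_norm = normalize(cross)
    # Revert to the original cancer / normal layout
    return cross_norm[:n_genes, :], cross_norm[n_genes:, :]
```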

A shortcoming of this method is that, while it is well suited to data sets with case-control paired samples, it is time-consuming for data sets without matching information. The normalization steps, including identifying an invariant transcript set, selecting a common reference, and regression, are of vital importance for all preprocessing procedures for expression data. Figure 1 illustrates the general steps of processing skewed gene expression data.

Note that the global normalization methods developed for balanced expression data, such as lowess and quantile normalization, also follow these steps: they use the whole array as the invariant transcript set (ITS) instead of preselecting a subset, and the intensities on each array are then transformed using non-linear regression or scaling algorithms that consider all features.

Figure 1. General steps of the preprocessing methods for skewed transcriptome data.

The core step of normalization is the selection of the reference, which can be summarized into three categories: a data-driven invariant subset, a foreign subset, and the entire set. Sometimes summarization is performed after normalization. Four other methods based on external negative controls are quite similar to the methods in the invariant-set family, with the foreign controls serving as a predefined invariant set rather than one derived from the data.

As with common reference selection, several approaches are applied among these normalization methods, such as the root mean squared distance (RMSD) between all array pairs, the average expression value of each transcript over all arrays, or the trimmed mean of each transcript among all arrays. Once the ITS and the common reference array are set, non-linear regression approaches such as loess, lowess, weighted lowess, piecewise splines, and LVS are used to adjust all data points.
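
A rough sketch of one such step, lowess adjustment of a single array against a common reference fitted on an invariant transcript set (the function name and data layout are illustrative assumptions, not a specific package):

```python
import numpy as np
import statsmodels.api as sm

def invariant_set_lowess(array, reference, invariant_idx, frac=0.5):
    """array: intensities on the array to normalize; reference: common reference
    intensities (e.g. a per-transcript trimmed mean); invariant_idx: indices of
    the preselected invariant transcripts."""
    # M (log ratio) and A (average log intensity), as in an MA-plot
    M = np.log2(array) - np.log2(reference)
    A = 0.5 * (np.log2(array) + np.log2(reference))

    # Fit the intensity-dependent bias curve on the invariant set only
    fitted = sm.nonparametric.lowess(M[invariant_idx], A[invariant_idx],
                                     frac=frac, return_sorted=False)

    # Interpolate the bias at every transcript's A value and subtract it
    order = np.argsort(A[invariant_idx])
    bias = np.interp(A, A[invariant_idx][order], fitted[order])
    return 2 ** (np.log2(array) - bias)
```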

Therefore, a series of models and approaches have been developed for normalizing a given expression data set. A tuned combination of them is expected to show higher performance; consequently, an optimal hybridization of normalization steps is expected to maximize the power of the available methods. Although more than 20 normalization methods have been developed for skewed expression data, most of them rest on their own assumptions and suffer from similar problems.

No truly assumption-free methods exist, but methods based on reasonable and adaptive assumptions remain in high demand. We therefore stress that an in-depth understanding of the data's properties is critical before conducting normalization. In addition, the interactions between the approaches used at different steps should be noted: the result of one step may heavily affect the other steps, whether parallel or subsequent. Moreover, novel methods are still needed for RNA-seq data, for which only limited solutions have been proposed.

Hopefully, some existing methods originally developed for arrays can be employed after suitable adaptation. The normalization algorithms we summarized are equally important for mRNAs and lncRNAs, because it is common to perform secondary analyses of lncRNAs by leveraging previously published gene expression data, including both microarray and RNA-seq data (Zhou et al.). In the present study, we placed special emphasis on the skewness between samples in different states, such as disease versus normal.

As far as we know, no normalization algorithms have been specifically developed for lncRNA expression data for this purpose.

Recently, Assefa et al. benchmarked differential gene expression analysis tools on long non-coding RNA-sequencing data. They concluded that none of the compared methods outperforms the others and that, for the differential analysis of lncRNAs, all tools exhibit substandard performance. They also concluded that a large sample size is necessary for accurate differential expression inference for lncRNAs.

Normalization methods have little impact on coexpression analysis, mainly because the gene expression samples in different states or subgroups are usually stratified first, with coexpression analysis performed afterwards. In this case, the skewness or difference between distinct biological states is not taken into account. As with the protein interaction network (Cheng et al.), normalization methods likewise have little effect on the identification of gene coexpression modules.

Overall, the impact of normalization methods on coexpression analysis is smaller than their impact on differential analysis. Normalization methods developed for unbalanced expression data were summarized into three classes based on the expression reference: data-driven references, extra negative controls, and all genes. The anatomy and summary of these algorithms shed light on the understanding and appropriate application of preprocessing methods.

LC and XL wrote the paper. LC and K-SL contributed to the overall paper design. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Aanes, H.
Anders, S. Differential expression analysis for sequence count data. Genome Biol.
Assefa, A. Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA-sequencing data.
Barucca, P. Cross-correlations of American baby names.
Berger, J. BMC Bioinformatics.
Bolstad, B. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19.
Calza, S. Filtering genes to improve sensitivity in oligonucleotide microarray data analysis. Nucleic Acids Res.
Calza, S. Normalization of oligonucleotide arrays based on the least-variant set of genes.


