[python] Data preprocessing: Quantile Normalization + INSCODE AI creation assistant test

Article directory

  • Written in front
  • Standardization / Normalization
  • The python module qnorm implements quantile normalization
  • R code for quantile normalization
  • Quantile Normalization – answers from the INSCODE AI Creation Assistant
    • Q1: What is Quantile Normalization?
    • Q2-1: When to use Quantile normalization?
    • Q2-2: When to do Quantile normalization?
    • Q3: Why do Quantile Normalization?
    • Q4-1: How to do Quantile normalization?
    • Q4-2: How to do Quantile Normalization with python?
  • Visualizing changes before and after quantile normalization
  • When to use and not to use quantile normalization
  • Other ChatGPT tests

Written in front

This article mainly looks at Quantile Normalization (QN). Unless otherwise specified, QN is used throughout as an abbreviation for quantile normalization.

In Chinese, Quantile Normalization has two common renderings, one using the word for "standardization" and one using the word for "normalization"; in English there is only the single term Quantile Normalization. The two renderings exist because standardization and normalization are two different processing methods (see the next section); based on what the method actually does, the author prefers the "normalization" rendering.

This article tentatively uses the unified name quantile normalization (QN).

Standardization/Normalization

Standardization and normalization are always mentioned together. What is each of them actually doing?

First, look at the preprocessing utilities in Python's sklearn library (screenshot of the sklearn.preprocessing module):

These include: scaling, centering, normalization, and binarization.

Commonly used classes for standardization and normalization:

  • Standardization: the sklearn class StandardScaler is the commonly used standardization processor, corresponding to the z-score formula $z=\frac{x-\mu}{s}$, where $\mu$ is the mean of the training samples and $s$ is their standard deviation. It converts the original data into new data with mean 0 and variance 1, i.e. it rescales the data onto a common scale. Read from the formula, the transformed value says how many standard deviations the original value lies above the overall mean; read from the data as a whole, the original data are rescaled to the scale of a standard normal distribution (mean 0, variance 1). (A short usage sketch of both classes follows this list.)

  • Normalization: the sklearn class MinMaxScaler implements the most commonly used normalization method, min-max normalization: $x^*=\frac{x_i - x_{min}}{x_{max} - x_{min}}$, which rescales each value into the interval [0, 1].
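
As a minimal sketch of how these two classes are typically used (assuming scikit-learn is installed; the toy data below are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data: 4 samples (rows) x 2 features (columns), values chosen only for illustration.
X = np.array([[5.0, 100.0],
              [2.0, 400.0],
              [3.0, 250.0],
              [4.0, 150.0]])

# z-score standardization: each column ends up with mean 0 and variance 1.
print(StandardScaler().fit_transform(X))

# min-max normalization: each column is rescaled into [0, 1].
print(MinMaxScaler().fit_transform(X))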

Python module qnorm implements quantile normalization

Looking more closely at Python's sklearn library (the screenshot in the Standardization/Normalization section above), you will find no quantile normalization utility there: there is no QN in sklearn. Instead, install qnorm (see the qnorm package official website); the installation command is pip install qnorm. An example of its use is as follows:

import pandas as pd
import qnorm

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

print(qnorm.quantile_normalize(df, axis=1))

After execution, you can get:
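
For reference (the original post showed the output as a screenshot), the quantile-normalized result should look roughly as follows; note that the two tied values of 4 in C2 both map to the average of the corresponding reference values:

         C1        C2        C3
A  5.666667  5.166667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  5.166667  4.666667
D  4.666667  3.000000  5.666667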

R code implements quantile normalization

R was used for the initial tests, and R is also what the INSCODE AI Creation Assistant's answer below defaults to (quantile normalization may simply be done more often in R).

The R package preprocessCore is used in the code:

df <- data.frame(one=c(5,2,3,4),
                 two=c(4,1,4,2),
                 three=c(3,4,6,8)
)
rownames(df) <- c('f1', 'f2', 'f3', 'f4')

df

preprocessCore::normalize.quantiles(x=as.matrix(df))

After execution you get the quantile-normalized matrix; since the input is the same data as in the qnorm example above, the values match that result.

Quantile normalization – answers from the INSCODE AI Creation Assistant

The code above can be executed directly to get the corresponding results. Next, let's learn more about quantile normalization through the INSCODE AI Creation Assistant.

A good way to understand a concept is to ask, and answer, the 3W1H questions: What / Why / When / How.

  1. What is quantile normalization (What)
  2. Why do quantile normalization (Why)
  3. When to do quantile normalization (When)
  4. How to do quantile normalization (How)

Let's try some answers given by the INSCODE AI Creation Assistant.

Example question:

Answers from the INSCODE AI Creation Assistant (the results are quite unexpected):

Q1: What is Quantile Normalization?

Quantile normalization is a method for processing high-throughput gene expression data. It sorts the data of all samples and normalizes the sorted data to remove the influence of inconsistencies between samples. The basic idea is to arrange the values of each sample in ascending order and then replace each value with the corresponding quantile (for example, the 10th smallest value in a sample is replaced by the 10th smallest quantile). After this process every sample has the same data distribution, allowing better comparison and integrated analysis.

Q2-1: When to use Quantile normalization?

Quantile normalization is a data preprocessing method usually applied to gene expression data. When the distributions in a data set are uneven, quantile normalization can be used to align the samples' numerical distributions and so eliminate bias caused by sample-to-sample differences. Therefore, when you want to compare gene expression data from different samples, you may need quantile normalization to remove between-sample differences and ensure the accuracy and reliability of the data.

Q2-2: When to do Quantile normalization?

Quantile normalization is often used in comparative analyses of gene expression data. Specifically, when you have multiple experimental groups (each containing multiple samples) and want to compare gene expression levels between them, you typically apply quantile normalization to remove the influence of confounding factors (such as processing time, RNA extraction, and the scanner used) on the experimental results. Differences between samples are then due only to biological rather than technical variation. Quantile normalization is therefore commonly used in microarray and RNA sequencing data analysis to reduce the impact of technical variation on between-sample comparisons.

Q3: Why do Quantile Normalization?

Quantile normalization is a data standardization method whose purpose is to eliminate technical differences between samples by adjusting the rank order of the data, so that the data can be better compared and analysed.
In high-throughput gene expression studies, differences in experimental procedures, experimental conditions and other factors often introduce inconsistencies in the gene expression data of different samples. If differential analysis or cluster analysis is performed directly, these technical differences may be mistaken for biological differences and bias the results. The data therefore need to be standardized to remove these technical differences.
Quantile normalization does this by unifying the rank order of gene expression across samples, so that every sample ends up with the same distribution of values. Technical differences are removed, the samples become comparable, and the reliability and stability of the analysis results improve.

Regenerating the answer to Q3 gives a similar but not identical result:
Quantile normalization is a commonly used data normalization method, mainly applied in high-throughput gene expression data analysis. Its purpose is to eliminate technical variation between samples (such as batch effects, different experimental treatments, etc.) and the influence of biological variation on gene expression, so that gene expression values become comparable between samples.
The concrete steps are to arrange the gene expression values of each sample in ascending order and then assign the values with the same rank across samples, so that the distribution of expression values becomes the same in every sample while the ordering of genes within each sample remains unchanged.
Through quantile normalization we can quickly and effectively compare differences in gene expression between samples, which helps further research on gene regulation mechanisms and disease diagnosis.

Q4-1: How to do Quantile normalization?

Quantile normalization is a data standardization method used to compare and unify the data distribution between different samples. The following is the implementation method based on R language:

  1. Import the dataset that needs to be normalized.
  2. Sort the dataset.
data_sorted <- apply(data, 2, sort)
  3. Calculate the quantiles for each row.
quantiles <- apply(data_sorted, 1, quantile, probs = seq(0, 1, length = nrow(data) + 1))
  4. Normalize the dataset.
data_norm <- apply(data, 2, function(x) {
  rank <- order(x)
  quantiles[rank]
})
  5. Check the normalized data.
summary(apply(data_norm, 2, quantile, probs = seq(0, 1, length = nrow(data) + 1)))

The above is the basic implementation method of Quantile Normalization.

It is a little surprising that the answer to Q4 gives a code example directly, implementing the algorithm itself; I had expected it to simply call the R package preprocessCore for the normalization. The accuracy of this code remains to be tested. Since I would still prefer an answer in Python, I asked a follow-up question:

Q4-2: How to do Quantile Normalization with python?

Quantile normalization is a common data preprocessing technique used to standardize the data of different samples or experiments so that they share the same distribution. The following is sample code implementing quantile normalization in Python:

import numpy as np
from scipy.stats import rankdata

def quantile_normalize(data):
    # Sort each column of data
    sorted_data = np.sort(data, axis=0)

    # Compute the quantiles for each row
    row_quantiles = np.mean(sorted_data, axis=1)

    # Sort each column of data and map it to the corresponding quantile
    rank_data = rankdata(data, axis=0)
    mapped_data = np.zeros_like(data)
    for i in range(data.shape[1]):
        mapped_data[:, i] = row_quantiles[rank_data[:, i]-1]

    return mapped_data

Example usage:

# generate random data
data = np.random.rand(10, 5)

# Perform Quantile Normalization on the data
normalized_data = quantile_normalize(data)

This function takes a matrix data as input and outputs the quantile-normalized matrix. It first sorts each column of the matrix, then computes the quantile (mean of the sorted values) for each row, then maps each column's values to the corresponding quantiles, and finally returns the quantile-normalized matrix.

What is more unexpected about the Python code is that it even includes comments (Chinese in the original answer). I can't help but feel I might not have written this code as quickly myself. Like the R script, it does not call a package but implements QN directly from the principle. As for correctness, the code still needs some adjustments:

The error occurs when mapping values to the corresponding quantiles: indexing requires integers, but rankdata returns ranks as floating-point numbers.

Adjust the code according to the error:

import numpy as np
from scipy.stats import rankdata

def quantile_normalize(data):
    # Sort each column of data
    sorted_data = np.sort(data, axis=0)

    # Compute the quantiles for each row
    row_quantiles = np.mean(sorted_data, axis=1)
    
    # Sort each column of data and map it to the corresponding quantile
    rank_data = rankdata(data, axis=0)
    mapped_data = np.zeros_like(data, dtype=float) #[Modify] Initialize the specified data type as floating point
    for i in range(data.shape[1]):
        # mapped_data[:, i] = row_quantiles[rank_data[:, i]-1]
        mapped_data[:, i] = row_quantiles[[int(i) for i in rank_data[:, i]-1]] # [Modification] The index is an integer
        
    return mapped_data

Note: in this code, when a column (one sample) contains repeated values (identical expression values), the tie is handled by taking the mean rank and rounding it down to an integer. Specifically, if a column contains n identical values that jointly occupy rank m, rankdata assigns all n of them the rank

$\frac{m + (m+1) + \dots + (m+n-1)}{n}$

For example, if two values are tied at rank 3, rankdata assigns both of them $\frac{3+4}{2}=3.5$; rounding down with int() in the code gives rank 3. If three values are tied at rank 3, rankdata assigns all of them $\frac{3+4+5}{3}=4$; rounding down gives rank 4.

However, the proper way to handle ties is to average the corresponding reference values (the values at the tied ranks), not to average the ranks themselves. In other words, the code above still needs further improvement in how it handles tied values; a tie-aware sketch is given below, after the execution result.

Execution result of the adjusted code above:
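
To address the tie problem noted above, here is a minimal sketch of my own (not from the original post) that gives every group of tied values the mean of the reference values for the ranks they jointly occupy; applied to the df from the qnorm example, it reproduces the 5.166667 value for the two tied 4s in column C2:

import numpy as np
import pandas as pd

def quantile_normalize_with_ties(df):
    """Quantile-normalize the columns of df (samples in columns, features in rows).

    Tied values within a column all receive the mean of the reference
    values for the ranks they jointly occupy.
    """
    values = df.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    # Reference distribution: mean across samples of the column-wise sorted values.
    reference = np.sort(values, axis=0).mean(axis=1)

    result = np.empty_like(values)
    for j in range(n_cols):
        col = values[:, j]
        order = np.argsort(col, kind="mergesort")  # row indices in ascending order of value
        mapped = np.empty(n_rows)
        i = 0
        while i < n_rows:
            # Find the block of equal values occupying ranks i .. k-1.
            k = i + 1
            while k < n_rows and col[order[k]] == col[order[i]]:
                k += 1
            # All tied entries get the mean of the reference values for those ranks.
            mapped[order[i:k]] = reference[i:k].mean()
            i = k
        result[:, j] = mapped
    return pd.DataFrame(result, index=df.index, columns=df.columns)

# Example (using the df defined in the qnorm section above):
# print(quantile_normalize_with_ties(df))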

Visualizing the changes before and after quantile normalization

First, let's look at several examples before and after QN. Enter the data:

import pandas as pd
import qnorm
df1 = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 6, 'B': 1, 'C': 7, 'D': 9},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
dfn1 = qnorm.quantile_normalize(df1, axis=1)

df2 = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
dfn2 = qnorm.quantile_normalize(df2, axis=1)

df3 = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 4},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
dfn3 = qnorm.quantile_normalize(df3, axis=1)


Visualize the data before and after QN:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(3, 2)
plt.rcParams.update({'font.size': 5})
i = 0
for df, dfn in zip([df1, df2, df3], [dfn1, dfn2, dfn3]):
    axs[i, 0].scatter(range(len(df.T['A'])), df.T['A'], label='A')
    axs[i, 0].scatter(range(len(df.T['B'])), df.T['B'], label='B')
    axs[i, 0].scatter(range(len(df.T['C'])), df.T['C'], label='C')
    axs[i, 0].scatter(range(len(df.T['D'])), df.T['D'], label='D')
    axs[i, 0].legend(loc='upper left', shadow=True)
    axs[i, 0].set_xticks([0, 1, 2], [1, 2, 3])

    axs[i, 1].scatter(range(len(dfn.T['A'])), dfn.T['A'])
    axs[i, 1].scatter(range(len(dfn.T['B'])), dfn.T['B'])
    axs[i, 1].scatter(range(len(dfn.T['C'])), dfn.T['C'])
    axs[i, 1].scatter(range(len(dfn.T['D'])), dfn.T['D'])
    axs[i, 1].set_xticks([0, 1, 2], [1, 2, 3])
    i += 1

fig.tight_layout()
plt.show()

The figure has two columns: the left column shows the original data and the right column shows the data after quantile normalization (QN). Comparing the two, the ordering of the 4 values of each sample (one column of the data frame, one x position in the plot) is unchanged after QN. (Note: some panels appear to show fewer than 4 points because the original data contain identical values, so points overlap.)


Manual calculation will give you a better understanding of the basic idea of quantile normalization: arrange the data values of each sample in ascending order, and then replace the arranged data values with the corresponding quantiles.

Quantile normalization method: http://bmbolstad.com/stuff/qnorm.pdf

Step 1: sort the feature values of each sample from smallest to largest;
Step 2: for the original data, record each value's rank within its sample (rank i/ii/iii/iv);
Step 3: at each rank, compute the arithmetic mean of the sorted values across samples;
Step 4: replace each original value with the Step-3 mean corresponding to its Step-2 rank. (A minimal code walkthrough follows.)
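
As a rough illustration of these four steps (my own sketch, using the data from the qnorm example above; note that this naive version breaks ties arbitrarily instead of averaging them as discussed earlier):

import numpy as np

# Columns = samples C1, C2, C3; rows = features A, B, C, D (data from the qnorm example).
X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])

# Step 1: sort each sample (column) in ascending order.
X_sorted = np.sort(X, axis=0)

# Step 2: rank of each original value within its sample (1-based; ties broken arbitrarily).
ranks = X.argsort(axis=0).argsort(axis=0) + 1

# Step 3: arithmetic mean across samples at each rank -> reference distribution
# (here approximately [2.0, 3.0, 4.67, 5.67]).
reference = X_sorted.mean(axis=1)

# Step 4: replace each value with the reference value at its rank.
X_qn = reference[ranks - 1]
print(X_qn)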

When to use and not to use quantile normalization

Reference: http://www.bio-info-trainee.com/2043.html

Although one of the earlier questions to the INSCODE AI Creation Assistant was essentially when to use QN, the answer only described general application scenarios without very specific or rigorous criteria.

There is a paper that specifically studies when quantile normalization should be used: Literature pdf


The paper gives the following strategy for deciding whether to use QN:

[Translation] (I have not studied the paper closely; treat this translation with caution.)

  1. Small variation within groups, small variation between groups → only minor technical variation, no global changes → QN can be used (but is not necessary)
  2. Large variation within groups, small variation between groups → large technical variation or batch effects within groups, no global changes → use QN
  3. Small variation within groups, large variation between groups → [the raw data alone cannot tell] → test whether the global difference is caused by technical factors or by biological variation
    1) global technical difference or batch effect between groups → use QN
    2) global biological difference between groups → do not use QN

The quantro package was developed as part of this work:

The R package quantro presents a data-driven approach to testing the assumptions behind global normalization methods. The paper demonstrates the utility of the method (quantro) by applying it to several gene expression and DNA methylation data sets, and shows examples where global normalization methods are inappropriate.

Other ChatGPT tests

ChatGPT: https://chat2.jinshutuan.com/

  • ps: the Q1 question was initially mistyped ("Use" instead of "When"), but the answer mentioned that "QN may over-smooth the data", which prompted a further question.
  • After correcting the Q1 question, the answer was also different:
  • I also asked how to implement it with Python; the answer described the general steps of QN and then gave a code example (not yet tested):

【2023-05-31】first edition;
【2023-06-05】Test other ChatGPT answers.