-
- Posts: 12
- Joined: Sun Mar 06, 2011 12:20 pm
How to compare two sets of data?
Discussions about chromatography data systems, LIMS, controllers, computer issues and related topics.
21 posts
Page 1 of 2
Basically, I analysed two batches of a product (same product, different batches) for related substances (degradation products). Each batch contains around 10 related substances at different levels. How can I compare the two batches and conclude that the related substances are comparable (not significantly different)? I don't want to rely on a visual comparison of the total values only. My question is: can I use a paired t-test to compare the data, get a p-value, and then draw a conclusion? I am not sure whether a t-test is suitable for this. Can anyone answer this? Thanks.
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
To discuss variability (or the significance of a difference) between batches, you would use a t-test when you have multiple samples from each batch. You would use a paired t-test if there is some reason to pair specific replicates, for example if you have drawn a sample from each batch at one end of the production run, another pair at the middle of the run, a third at the end of the run, etc. Pairing removes variability introduced by a factor that can be identified in the method of sampling. If you have simply had multiple samples drawn from what is supposed to be a well-mixed batch, then an unpaired t-test between the mean values would be appropriate.
You would apply the t-test to each individual compound you have found in the mixture. So, if you have ten related substances, you would report the differences and the significance of the differences for each of the ten compounds. Don't forget that the probability of a greater t is important, but so is the confidence interval of the mean: you may fail to detect a significant difference simply because the variability is so high that you could never find a difference of practical significance.
If what you want to do is pair the compounds across two different samples and take the mean difference over all compounds: no. The variation is a mixture of variation among the compounds between the batches and variation of the compounds between samples drawn from a single batch. Thus, you need multiple samples from each batch to account for that variation. (And if samples are properly subsampled in the lab, replicate analysis of a sample should account only for method variation, not within-batch variation. So to compare batches you need multiple samples drawn from the batches with a proper sampling scheme.)
If you have multiple samples from each batch and want to compare how the ratios of related compounds within the mixture vary across batches, you are looking at two sources of variance: across batches and across compounds. With that additional dimension, you would look at an ANOVA technique to sort out the sources of variation.
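As a rough illustration of the per-compound comparison described above, here is a minimal Python sketch of an unpaired (pooled-variance) t-test with a confidence interval for the difference in means. The replicate values are invented purely for illustration, the function name `pooled_t_test` is hypothetical, and 2.306 is the two-sided 95% Student's t value for 8 degrees of freedom (five samples per batch).

```python
import math
import statistics

def pooled_t_test(a, b, t_crit):
    """Unpaired two-sample t-test assuming equal variances.

    Returns (difference of means, t statistic, confidence interval)."""
    n_a, n_b = len(a), len(b)
    diff = statistics.mean(b) - statistics.mean(a)
    # Pooled variance from the two sample variances
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / se, (diff - t_crit * se, diff + t_crit * se)

# HYPOTHETICAL results for ONE compound, five samples drawn from each batch
batch_a = [1.11, 1.08, 1.14, 1.10, 1.12]
batch_b = [1.23, 1.19, 1.25, 1.22, 1.26]

T_CRIT_95_DF8 = 2.306  # two-sided 95% critical value, 8 degrees of freedom
diff, t_stat, ci = pooled_t_test(batch_a, batch_b, T_CRIT_95_DF8)
print(f"difference = {diff:.3f}, t = {t_stat:.2f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The same call would be repeated for each related substance. A wide interval is a sign that the data cannot rule out a difference of practical importance, whatever the p-value says.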
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
And I forgot to mention the necessity of independence of results. If we are comparing variation in compounds by looking at mean values, it is much easier if the concentrations of all compounds are independent of each other. That is, if compounds B through K are each a direct degradation product of A and there are no interactions in formation that would change the ratios, you have the "easy" case. If A forms B and then B degrades to C, the formation of C is dependent on B. Or if conditions that favor the formation of D inhibit the formation of E, again there is a lack of independence. Once there is a lack of independence, things can be a bit messier. (For a first pass, you can assume independence and see if you have a model that works.)
-
- Posts: 12
- Joined: Sun Mar 06, 2011 12:20 pm
Thanks very much, Don!
I think you got the point that I am looking for. As I understand your reply, it is meaningless to compare the related-substance contents of the two batches by t-test, as there are two sources of variation: across batches and across related substances. I will try to use ANOVA to assess the data.
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
Just be careful - the biggest risk I see in this is that someone will see a number that describes compound variability in the product - and you suddenly have a new QC criterion. And the reason for adopting it? Because it can be easily calculated and might tell us something.
-
- Posts: 12
- Joined: Sun Mar 06, 2011 12:20 pm
Yeah, you are right. I ran the ANOVA to find out whether the model is sensitive enough to detect the differences. Just to make it a bit clearer, here is an example:
Compound   Batch A content   Batch B content   Batch C content
1          0.41              0.40              0.34
2          1.11              1.23              1.35
3          0.55              0.49              0.51
4          0.81              0.76              0.69
5          0.30              0.31              0.27
6          0.05              0.09              0.08
The two-way ANOVA result is:
Between rows: p-value is 1.54E-08
Between columns: p-value is 0.97287
Do the results indicate that:
1. there is no significant difference between batches in terms of the contents of the compounds (as the p-value is 0.97287)?
2. there is a significant difference between rows, i.e. between the contents of the individual compounds?
Can anyone give me some suggestions?
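For readers who want to check the numbers, the posted p-values appear to correspond to a two-way ANOVA without replication (one value per compound per batch). The following is a minimal standard-library Python sketch of that calculation using the table above. The helper `f_upper_tail` is a hypothetical name and a simple numerical approximation written for this illustration, not a library routine. In this layout the error term is the compound-by-batch interaction, so it contains no estimate of within-batch or analytical variability.

```python
import math

# Contents from the table above: rows = compounds 1-6, columns = batches A, B, C
data = [
    [0.41, 0.40, 0.34],
    [1.11, 1.23, 1.35],
    [0.55, 0.49, 0.51],
    [0.81, 0.76, 0.69],
    [0.30, 0.31, 0.27],
    [0.05, 0.09, 0.08],
]

def f_upper_tail(f_stat, d1, d2, steps=2000):
    """P(F > f_stat) for F(d1, d2); assumes d1 >= 2, d2 > 2 and f_stat > 0.

    Uses P(F > f) = I_z(d2/2, d1/2) with z = d2 / (d2 + d1 * f) and
    integrates the beta density on [0, z] with Simpson's rule."""
    a, b = d2 / 2, d1 / 2
    z = d2 / (d2 + d1 * f_stat)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0 or t >= 1:
            return 0.0
        return math.exp((a - 1) * math.log(t)
                        + (b - 1) * math.log(1 - t) - log_beta)

    h = z / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(z)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3

n_rows, n_cols = len(data), len(data[0])
grand = sum(map(sum, data)) / (n_rows * n_cols)
row_means = [sum(r) / n_cols for r in data]
col_means = [sum(r[j] for r in data) / n_rows for j in range(n_cols)]

ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
ss_total = sum((x - grand) ** 2 for r in data for x in r)
ss_error = ss_total - ss_rows - ss_cols  # compound x batch interaction plus noise

df_rows, df_cols = n_rows - 1, n_cols - 1
df_error = df_rows * df_cols
ms_error = ss_error / df_error

p_rows = f_upper_tail((ss_rows / df_rows) / ms_error, df_rows, df_error)
p_cols = f_upper_tail((ss_cols / df_cols) / ms_error, df_cols, df_error)
print(f"between compounds (rows):  p = {p_rows:.3g}")
print(f"between batches (columns): p = {p_cols:.3g}")
```

Note that the column test only asks whether the batch averages over all six compounds differ. Opposite shifts in individual compounds (compound 2 rises from A to C while compound 4 falls) largely cancel in that average.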
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
You do not know enough to know if there are between-batch differences. You must have replicate analyses of batches to separate the batch-to-batch variability from the within-batch variability. This means taking multiple samples from the batch, that is, having someone actually pull material from various locations in the bin or tank of material, and sampling each batch by the same protocol. If you are working in a manufacturing environment, there should be someone who knows statistical sampling techniques and the best approach to your product.
Also, if you see that a result goes up for one compound between batches and down for another, we do not know whether the source of that variability is the actual variability of the compound in the whole batch, or whether sampling within the batch and analytical error give that much variability in the results. Again, we need to see replicate samples drawn from each batch.
I like to use five to seven samples from each batch as a minimum. At that point the Student's t value is small enough that you can get reasonable confidence intervals with "a bit" of noise. And you need a sufficient number of sampling locations in a batch to get a good representation of the batch.
Unless you know that method variability is small, you may want to analyze replicates from each sample so you can reduce the effect of method variability.
Even if I am working with thousands of compounds, as in a metabolomics-type project, I will start with univariate statistics. I will look at the variation of individual compounds within and between batches. Confidence intervals are extremely important. If they are too wide, I may not be able to see differences that are of practical importance. This may be an analytical or sampling issue, and these need to be addressed before getting into the "fancy" statistical techniques.
If you are looking at a manufacturing process, I would look at correlations between compounds. If you have two or three groups of compounds that correlate with each other, this information may be more valuable than some kind of "within batch - across compound" variability score.
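The remark above about five to seven samples can be made concrete with a small sketch. It tabulates the standard two-sided 95% Student's t critical values and the resulting half-width of the confidence interval of a mean, expressed in units of the sample standard deviation. The function name `ci_half_width_factor` is just an illustrative choice.

```python
import math

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (n - 1)
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}

def ci_half_width_factor(n):
    """Half-width of the 95% CI of a mean of n samples, in units of the sample SD."""
    return T_95[n - 1] / math.sqrt(n)

for n in range(2, 8):
    print(f"n = {n}: t = {T_95[n - 1]:6.3f}, "
          f"half-width = {ci_half_width_factor(n):.2f} x SD")
```

With two or three samples the interval is several standard deviations wide. By five to seven samples it has shrunk to roughly one standard deviation either side of the mean.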
-
- Posts: 12
- Joined: Sun Mar 06, 2011 12:20 pm
Thanks Don. I really appreciate your explanations.
I understand that the variation can be batch-to-batch and/or within a batch. However, suppose I say the samples are homogeneous and we use a validated HPLC method for each analysis of these compounds. Then any within-batch variation should be small. That's why I only tested one sample from each production batch.
I did try to look for correlations between batches in terms of the contents of all compounds, but there seem to be no strong correlations between them.
Visually, I can say the three batches are quite different in terms of the compound contents, but what I need is a statistical way to prove that. The ANOVA result tells me that they are not significantly different (because the p-value is well above 0.05). How can that be explained, and is this the right way (using ANOVA or a t-test) to explain the data?
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
To demonstrate the statistical significance of differences, you need the data to do the statistics. If your validation work gives you data on the variability of the method and the variability within a batch, you can use these numbers. You will have to pull out the statistics book for that. This approach assumes that the variability within a batch is always the same.
The magnitude of concentration variation between analytes, of course, contributes greatly to the variance you measure across the analytes. You need to consider whether you need to scale the data for this comparison or not.
If you scale your data so that each compound is represented as a percentage of the value found in a reference sample, the variation you measure will be less affected by the differences in scale between the analytes.
You need to figure out what this number means and adjust the scaling of results accordingly.
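A minimal sketch of the scaling suggested above, using the example table from earlier in the thread. Treating batch A as the reference sample is an assumption made here for illustration only, and `percent_of_reference` is a hypothetical helper name.

```python
# Contents from the example table earlier in the thread (batches A, B, C)
contents = {
    1: (0.41, 0.40, 0.34),
    2: (1.11, 1.23, 1.35),
    3: (0.55, 0.49, 0.51),
    4: (0.81, 0.76, 0.69),
    5: (0.30, 0.31, 0.27),
    6: (0.05, 0.09, 0.08),
}

REFERENCE = 0  # ASSUMPTION: batch A (first column) is taken as the reference

def percent_of_reference(row, ref_index=REFERENCE):
    """Express every value in a row as a percentage of the reference batch."""
    return [100 * x / row[ref_index] for x in row]

scaled = {compound: percent_of_reference(row) for compound, row in contents.items()}

print("Compound   A (%)   B (%)   C (%)")
for compound, row in scaled.items():
    print(f"{compound:<9}" + "".join(f"{x:8.1f}" for x in row))
```

After scaling, a small compound such as compound 6 carries as much weight as a large one such as compound 2, so whether this is the right scaling depends on what the comparison is meant to show.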
Whether this is the right way to explain the data depends on the question you are trying to answer. I don't know what question you are trying to answer by looking at the variation across the analytes within a sample. I am more used to looking at variation across batches, one compound at a time for multiple compounds. The variation I am used to looking at is for the monitoring and control of the manufacturing parameters that result in the levels of the various analytes.
-
- Posts: 1890
- Joined: Fri Aug 08, 2008 11:54 am
By the way, if your number of compounds gets much higher, you're going to run into the multiple-t-test problem: if you measure 10 things at P<0.1, on average one will be "different" by accident. You could do a Bonferroni correction, where you divide the target P value by the number of things you measure, but this will quickly give you a statistic that says all samples are identical (you'd like to do P<0.01 but you're already at P<0.001 for 10 components). Also, Bonferroni correction is only appropriate if all the things you measure are genuinely independent.
I've found statisticians get less worried about multiple ANOVAs than they do about multiple t-tests (no idea why, because the problem is the same; ANOVA is multifactorial, not multivariate), but if you get deeply into this, you should find a statistician who specialises in multivariate statistics rather than conventional univariate (but multifactorial) statistics.
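The arithmetic behind the multiple-testing remark can be sketched in a few lines. The function names are illustrative, and the family-wise error rate formula assumes the tests are independent, which (as noted above) may not hold for related degradation products.

```python
def expected_false_positives(alpha, m):
    """Average number of 'significant' results among m tests when nothing differs."""
    return alpha * m

def familywise_error_rate(alpha, m):
    """Chance of at least one false positive, ASSUMING m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m

m = 10  # number of compounds tested
print("expected false positives at P<0.1:", expected_false_positives(0.1, m))
print("chance of at least one at P<0.1:  ", round(familywise_error_rate(0.1, m), 3))
print("per-test threshold for P<0.01:    ", bonferroni_threshold(0.01, m))
```

For ten compounds this gives one expected false positive at P<0.1 and a per-test threshold of 0.001 for an overall target of P<0.01, matching the figures quoted in the post.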
-
- Posts: 12
- Joined: Sun Mar 06, 2011 12:20 pm
Well, I think we have some misunderstanding here. What I tried to compare is the variation across batches in terms of the contents of the compounds (as a whole), not the variation across the compound contents in a single batch. My question actually is: are the batches comparable, or are they similar? What test can I use to prove they are similar or different?
The compounds are independent, so tests like the t-test or ANOVA may not be suitable as they evaluate variation in the means. I looked into the Bonferroni correction, but I do not have the software at the moment. Anyway, thanks very much for your suggestions; any more comments would be very welcome.
-
- Posts: 5433
- Joined: Thu Oct 13, 2005 2:29 pm
If you have only one measurement result (for each compound) from each batch then you cannot say anything about the statistical significance of any differences in the measurement results.
This is because the difference in measurement results that you see may be due to analytical variation (in other words the two samples might be the same, but the analysis still gives different results).
The analytical variation for repeated analyses on the same material should be in the method validation file under repeatability.
Using the repeatability data you can calculate whether the difference that you see is within whatever confidence interval you think is appropriate. If it is outside the confidence interval then you have to conclude either that the two batches are different (in other words that the differences in composition are real), or that the repeatability of the analysis is not achieved reliably.
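One common way to carry out the check described above is the repeatability limit: two single results on identical material are expected to differ by less than about 1.96 x sqrt(2) x s_r (roughly 2.8 x s_r) in 95% of cases, where s_r is the repeatability standard deviation. The sketch below assumes normally distributed errors and a well-established s_r. The value 0.03 is a hypothetical placeholder, not a figure from this thread, and the function names are illustrative.

```python
import math

def repeatability_limit(s_r, z=1.96):
    """Largest difference expected (about 95% of the time for z = 1.96)
    between two single results on identical material."""
    return z * math.sqrt(2) * s_r

def exceeds_repeatability(x1, x2, s_r, z=1.96):
    """True if the observed difference is larger than the repeatability limit."""
    return abs(x1 - x2) > repeatability_limit(s_r, z)

S_R = 0.03  # HYPOTHETICAL repeatability SD; use the value from the validation file

# Compound 2 in batches A and C, from the example table earlier in the thread
print(f"repeatability limit = {repeatability_limit(S_R):.3f}")
print("difference exceeds limit:", exceeds_repeatability(1.11, 1.35, S_R))
```

If repeatability varies with concentration, a separate s_r would be needed for each compound or level.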
And please do not fall any further into the habit of using "comparable" to mean "similar" or "not significantly different". "Comparable" means "able to be compared". 10 million tons is comparable to 1 mg because both are masses and can be compared, but they are not similar. 10 million tons and 10 million km are similar numbers, but they cannot be compared because they have different units, and so are not comparable.
Peter
Peter Apps
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
To statistically compare two batches you need multiple replicates from each batch to be able to do any kind of statistical test, such as a t-test or ANOVA. You can test for differences between two batches, compound by compound, with a t-test.
Testing for some aggregate similarity between batches using all compounds is more difficult. You need to know the significance of the various compounds you are measuring, to allow you to weight the data if appropriate. The simplest kind of single-number comparison would be the sum of degradation products. If some of them adversely affect product quality, you could have a sum of "bad" degradation products. You can compute some kind of similarity index, as is computed for things like mass spectral hits or chromatogram matching. And there are some who would call the computation of a match score a statistical test. It does not, however, account for sampling variability, and I would have serious hesitation about attempting to apply statistics to match scores across and between batches. There are assumptions being piled on top of assumptions, and eventually the house of cards collapses.
If you have access to a statistician, he or she may have an interesting technique they have used or have read about.
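To show what such single-number comparisons look like, here is a small sketch using the example table from earlier in the thread. It computes the sum of degradation products per batch and a cosine similarity between batch profiles. Cosine similarity is only one possible match score, chosen here as an assumption because dot-product scores are common in spectral matching. As cautioned above, neither number accounts for sampling or analytical variability.

```python
import math

# Profiles from the example table earlier in the thread (compounds 1-6)
batches = {
    "A": [0.41, 1.11, 0.55, 0.81, 0.30, 0.05],
    "B": [0.40, 1.23, 0.49, 0.76, 0.31, 0.09],
    "C": [0.34, 1.35, 0.51, 0.69, 0.27, 0.08],
}

def total_degradation(profile):
    """Sum of all degradation products in one batch."""
    return sum(profile)

def cosine_similarity(p, q):
    """Match score between two profiles: 1.0 means identical proportions."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q)

for name, profile in batches.items():
    print(f"batch {name}: total = {total_degradation(profile):.2f}")

for first, second in (("A", "B"), ("A", "C"), ("B", "C")):
    score = cosine_similarity(batches[first], batches[second])
    print(f"{first} vs {second}: cosine similarity = {score:.4f}")
```

The scores come out close to 1 even though individual compounds visibly differ, which illustrates why a match score on its own says nothing about significance.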
-
- Posts: 1890
- Joined: Fri Aug 08, 2008 11:54 am
Yes, there are multivariate equivalents of the t-test where you want to tell if two populations are significantly different based on observation of many variables instead of just one. There are also multivariate equivalents of the univariate idea of how much two means differ (i.e. ways to measure "distance" between two means in many dimensions), and multivariate equivalents of a "normal distribution". But this is a seriously heavy statistical area, and not for the faint-hearted. If you want to get into it, there are texts such as Krzanowski, Principles of Multivariate Analysis: A User's Perspective (Oxford Statistical Science Series, OUP), see esp. chapters 6 onwards.
You'd do much better to find a proper statistician. I can't get my head round multivariate statistics.
-
- Posts: 1680
- Joined: Sat Aug 23, 2008 12:04 am
For the difficult books - I have a couple of chemometrics texts on my shelf. They impress people. But actually I only open them to scare children...