{"id":2034,"date":"2026-03-26T13:15:28","date_gmt":"2026-03-26T19:15:28","guid":{"rendered":"https:\/\/rxkinetics.com\/blog\/?p=2034"},"modified":"2026-03-28T07:20:36","modified_gmt":"2026-03-28T13:20:36","slug":"data-mining-two-the-sequel","status":"publish","type":"post","link":"https:\/\/rxkinetics.com\/blog\/?p=2034","title":{"rendered":"Data Mining Two (The Sequel)"},"content":{"rendered":"\n<p>I have been collecting anonymous pharmacokinetic data from rxkinetics.net for the past 13 years. I last analyzed this dataset in March 2016 and published the results in a blog post titled <a href=\"https:\/\/rxkinetics.com\/blog\/?p=1529\" data-type=\"link\" data-id=\"https:\/\/rxkinetics.com\/blog\/?p=1529\">\u201cData Mining<\/a>.\u201d On the tenth anniversary of that post, I revisited the data.<\/p>\n\n\n\n<p>Note the dataset is intentionally minimal, including only date, model used, age, gender, height, weight, serum creatinine, creatinine clearance (CrCL), elimination rate (Kel), and volume of distribution (Vd). Also, there is no formal quality control applied to these data. As a result, some entries may be fictional, incorrectly entered, or duplicated. 
The data were collected anonymously, in accordance with our website privacy policy, from registered users worldwide.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Descriptive stats<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"559\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1-1024x559.jpg\" alt=\"\" class=\"wp-image-2035\" style=\"width:548px;height:auto\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1-1024x559.jpg 1024w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1-300x164.jpg 300w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1-768x419.jpg 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1-500x273.jpg 500w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table1.jpg 1408w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p>Several inferences can be drawn from the descriptive statistics.<\/p>\n\n\n\n<p>In September 2019, I changed the vancomycin model labels from \u201cnormal\/outlier\u201d to \u201caggressive\/conservative.\u201d Prior to this change, the selection ratio of normal to outlier models was 6.67:1. After the change, the ratio of aggressive to conservative shifted to 0.84:1. As suspected, the original model names were likely misleading for users who were unfamiliar with the distinctions and appropriate use of each model.<\/p>\n\n\n\n<p>The BMI distribution (&lt;30% excess BMI) suggests that a substantial portion of the data may originate from outside the United States.<\/p>\n\n\n\n<p>Males outnumber females approximately 2:1. 
This likely reflects a data entry bias, as \u201cmale\u201d is the default selection, suggesting that users may not be consistently updating the field when entering female patients.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data analysis<\/h2>\n\n\n\n<p>Because most of the data pertained to vancomycin, I filtered the dataset to include only those records. Also, I excluded a small number of clearly implausible data points. This included six cases with BMI &gt; 100 accompanied by unrealistically low height values, likely reflecting data entry errors (though clinical scenarios such as amputations cannot be ruled out). There were 624 records with a half-life &lt; 4 hours and 15 with a half-life &gt; 7 days, some of which were associated with relatively normal creatinine clearance. The underlying reason for such results cannot be determined, as the serum concentration values and sampling times used for these calculations were not retained. After removing these extreme values, the final dataset consisted of 25,585 observations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">All record analysis<\/h2>\n\n\n\n<p>Regression analysis results: the relationship between creatinine clearance (<strong>CrCL<\/strong>) and vancomycin clearance (<strong>VanCL<\/strong>) is described by the following fitted linear equation:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>VanCL = (0.053786 x CrCL) + 0.248681<\/code><\/pre>\n\n\n\n<p>Correlation (<strong>r<\/strong>): 0.6711<br> \u2022 Indicates a moderately strong positive linear relationship between the two variables.<br>Coefficient of Determination (<strong>R^2<\/strong>): 0.4504<br> \u2022 This means about 45% of the variation in VanCL is explained by variation in CrCL.<br>Significance (<strong>P-value<\/strong>): &lt; 0.0001<br> \u2022 The relationship is highly statistically significant.<br>Standard Error (<strong>RMSE<\/strong>): 1.7308<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a 
href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1000\" height=\"600\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression.jpg\" alt=\"\" class=\"wp-image-2037\" style=\"width:591px;height:auto\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression.jpg 1000w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression-300x180.jpg 300w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression-768x461.jpg 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_regression-500x300.jpg 500w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"979\" height=\"299\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals.jpg\" alt=\"\" class=\"wp-image-2038\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals.jpg 979w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals-300x92.jpg 300w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals-768x235.jpg 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/ALL_residuals-500x153.jpg 500w\" sizes=\"(max-width: 979px) 100vw, 979px\" \/><\/a><\/figure>\n\n\n\n<p>The residual plot shows the difference between the actual observed values and the values predicted by our model. Interpretation: A random scatter around the red zero line suggests that the linear model is appropriate for this data.<\/p>\n\n\n\n<p>The histogram of residuals shows how the &#8220;errors&#8221; (residuals) are distributed. 
The bell-shaped curve suggests that the errors are approximately normally distributed, which supports the use of linear regression for this dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Comparative Analysis<\/h2>\n\n\n\n<p>Next, I split the data into two groups: Group 1, BMI 30 or less (n=18,329), and Group 2, BMI greater than 30 (n=7,456).<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"400\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-1024x400.png\" alt=\"\" class=\"wp-image-2041\" style=\"width:684px;height:auto\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-1024x400.png 1024w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-300x117.png 300w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-768x300.png 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-1536x600.png 1536w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2-500x195.png 500w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Table2.png 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"597\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression-1024x597.jpg\" alt=\"\" class=\"wp-image-2042\" style=\"width:749px;height:auto\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression-1024x597.jpg 1024w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression-300x175.jpg 300w, 
https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression-768x448.jpg 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression-500x292.jpg 500w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Comparative_regression.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<p><strong>Regression results:<\/strong><br>  \u2022 Group 1: Y = 0.047057X + 0.404398<br>  \u2022 Group 2: Y = 0.067640X + 0.032623<\/p>\n\n\n\n<p><strong>Key Differences and Findings<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Change in Slope:<br>The slope increased significantly from Group 1 to Group 2 (0.047 -&gt; 0.068).<br>This suggests that in the second group, for every unit increase in Creatinine clearance, the increase in VanCL is nearly 44% higher than in the first group.<\/li>\n\n\n\n<li>Intercept Shift:<br>The intercept dropped significantly (0.40 -&gt; 0.03).<br>This implies that at very low renal function (Creatinine clearance near 0), the starting value for VanCL is much lower in Group 2.<\/li>\n\n\n\n<li>Model Strength (<strong>R^2<\/strong>):<br>Group 2 fits the linear model slightly better than Group 1 (49.8% vs 47.4%).<\/li>\n\n\n\n<li>Error Margin (<strong>RMSE<\/strong>):<br>Group 2 has a higher RMSE (1.95 vs 1.45), meaning the data points are more widely dispersed around the regression line than Group 1.<\/li>\n<\/ol>\n\n\n\n<p><strong>Chow Test Results<\/strong><br>Results testing the null hypothesis <em>(H0): The coefficients (slope and intercept) are identical for both groups<\/em>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>F-Statistic<\/strong>: 1996.62 <\/li>\n\n\n\n<li><strong>P-Value:<\/strong> &lt; 0.0000000001 (1.11 x 10^-16)<\/li>\n<\/ul>\n\n\n\n<p>Since the p-value is essentially zero, the null hypothesis is rejected. This confirms a &#8220;structural break&#8221; in the data. 
The way CrCL predicts VanCL in Group 1 is statistically different from how it predicts VanCL in Group 2.<\/p>\n\n\n\n<p>Because the Chow Test is significant, we <strong>cannot<\/strong> use a single linear equation to describe both groups accurately. Predictions would be consistently biased (over-predicting for one group and under-predicting for the other). <\/p>\n\n\n\n<p>The <strong>Chow Test<\/strong> confirms that the two groups of data are statistically distinct. The relationship between CrCL and VanCL is not the same for both datasets. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Comparative Population Model Performance<\/strong> <\/h2>\n\n\n\n<p>Let&#8217;s look at how each of the two population models, C and K, performs with each group: (1) BMI 30 or less (n=18,329) and (2) BMI greater than 30 (n=7,456). <strong>Model C<\/strong> is labeled &#8220;conservative&#8221; in the app and calculates vancomycin clearance. <strong>Model K <\/strong>is labeled &#8220;aggressive&#8221; in the app and calculates vancomycin Kel and Vd separately. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"509\" src=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-1024x509.jpg\" alt=\"\" class=\"wp-image-2057\" srcset=\"https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-1024x509.jpg 1024w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-300x149.jpg 300w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-768x382.jpg 768w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-1536x763.jpg 1536w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance-500x248.jpg 500w, https:\/\/rxkinetics.com\/blog\/wp-content\/uploads\/2026\/03\/Compare_Performance.jpg 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Group<\/strong><\/td><td><strong>Model K  <\/strong><br><strong>(r)<\/strong><\/td><td><strong>Model C  <\/strong><br><strong>(r)<\/strong><\/td><td><strong>Performance <\/strong><\/td><\/tr><\/thead><tbody><tr><td><strong>Norm BMI<\/strong> (&lt;= 30)<\/td><td><strong>0.724<\/strong><\/td><td><strong>0.689<\/strong><\/td><td><strong>Model K<\/strong> is the stronger predictor.<\/td><\/tr><tr><td><strong>Excess BMI<\/strong> (&gt; 30)<\/td><td><strong>0.662<\/strong><\/td><td><strong>0.706<\/strong><\/td><td><strong>Model C<\/strong> is the stronger predictor.<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><strong>Model accuracy comparison<\/strong><\/figcaption><\/figure>\n\n\n\n<p>Our analysis shows that a one-size-fits-all modeling approach is flawed. 
For patients with a BMI &lt;= 30, Model K (labeled &#8220;aggressive&#8221; in the app) is the stronger predictor of vancomycin clearance (r = 0.724 vs. 0.689). However, once a patient crosses the BMI threshold of 30, Model C (labeled &#8220;conservative&#8221; in the app) takes the lead (r = 0.706 vs. 0.662). <\/p>\n\n\n\n<p>Clinicians should be aware that model reliability is not static; it varies with the patient&#8217;s body mass, with all models becoming less reliable as obesity increases. This means that as BMI increases, the &#8216;safety net&#8217; provided by mathematical formulas becomes significantly thinner, demanding more frequent serum monitoring.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have been collecting anonymous pharmacokinetic data from rxkinetics.net for the past 13 years. I last analyzed this dataset in March 2016 and published the results in a blog post titled \u201cData Mining.\u201d On the tenth anniversary of that post, &hellip; <a href=\"https:\/\/rxkinetics.com\/blog\/?p=2034\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[28],"class_list":["post-2034","post","type-post","status-publish","format-standard","hentry","category-software","tag-pharmacokinetics"],"_links":{"self":[{"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2034"}],"collection":[{"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2034"}],"version-history":[{"count":27,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2034\/revisions"}],"predecessor-version":[{"id":2072,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2034\/revisions\/2072"}],"wp:attachment":[{"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2034"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2034"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rxkinetics.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2034"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}