Mathematical and Statistical Foundations of Big Data Science: A Review of Methods and Challenges

DOI: https://doi.org/10.58421/misro.v5i2.1221

Authors

  • Noora Ali Mohsin, Al-Furat Al-Awsat Technical University (ORCID: https://orcid.org/0009-0006-9689-8960)
  • Nooralhuda Salem Hadi, Al-Furat Al-Awsat Technical University
  • Maryam Zwain, Al-Furat Al-Awsat Technical University

Keywords:

Big Data Science, Mathematical Foundations, Statistical Inference, Data Heterogeneity, Computational Statistics

Abstract

Big Data Science has emerged as a transformative field driven by the rapid growth of large, complex, and high-dimensional datasets. This review examines the key mathematical and statistical principles that support the analysis, interpretation, and use of such data. In particular, it highlights the roles of linear algebra in data representation, probability theory in modeling uncertainty, optimization in large-scale computation, and statistical inference in drawing reliable conclusions. The review synthesizes existing studies into an integrated theoretical framework linking mathematical structure, statistical inference, and computational scalability. The literature was selected through a narrative review of publications indexed in Scopus, Web of Science, and Google Scholar, with a focus on studies published between 2005 and 2024. Relevant works were identified using keywords related to big data, mathematical foundations, statistical inference, and high-dimensional analysis. The review also discusses major challenges, including scalability, high dimensionality, data heterogeneity, noise, and limitations of traditional inferential methods. Finally, emerging approaches such as statistical learning, graph-based models, and the integration of mathematics with machine learning are highlighted as promising directions for future research.

Published

2026-04-26

How to Cite

[1]
N. A. Mohsin, N. S. Hadi, and M. Zwain, “Mathematical and Statistical Foundations of Big Data Science: A Review of Methods and Challenges”, J.Math.Instr.Soc.Res.Opin., vol. 5, no. 2, pp. 1387–1400, Apr. 2026.

Section

Articles