Data science is having an identity crisis.
Indications of this crisis have been around for years. For instance, the inaugural issue of Harvard Data Science Review found it easier to define what data science is not than what it is (Meng, 2019). This confusion has not cleared up; in fact, a case can be made that it has gotten worse. As Meng (2019) noted years ago, most of us have some knowledge of what other kinds of scientists do. But what is a data scientist, and what exactly do they do?
The history of data science is deeply rooted in statistics. As far back as 1962, one of the most influential statisticians of the 20th century, John Tukey, was calling for recognition of a new science focused on learning from data. Subsequent work by the statistics community, particularly Jeff Wu (Donoho, 2017) and William Cleveland (2001), formally proposed the name “data science” and suggested that academic statistics expand its boundaries (Donoho, 2017). Yet, the ensuing years have seen a significant influence from computer science, calls for data science to be recognized as a discipline distinct from statistics, and a fundamental reckoning with whether data science is a science at all.
The merging of the probabilistic and inferential traditions of statistics with the algorithmic, programming, and system-design concerns of computer science has led to a modern view of data science as an interdisciplinary field, which Blei and Smyth (2017) affectionately refer to as ‘the child of statistics and computer science’. Wing and colleagues (2018) see the defining characteristic as the fact that data science is not just about methods, but also about the use of those methods in the context of a domain. This interplay between domain and methods makes data science not merely the sum of its parts, but a distinct field with its own focus.
Yet, there is the fundamental question of the name itself. Wing’s probing question (2020), “Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, e.g., computer science and statistics?” is a crucial litmus test for whether data science should be considered a science. Some questions emerging from data science may feel novel (Wing, 2020); however, even these often reduce to applications of existing disciplines (statistics, computer science, optimization theory) rather than indicate a fundamentally new science.
Contributions from different disciplines can make data science richer. Yet, there is mounting evidence (Wilkerson, 2025) it is also causing confusion for students, educators, and employers. There is evidence of important differences across undergraduate data science education, between data science education efforts for majors versus nonmajors, and between K–12 data science initiatives emerging from different groups and disciplines.
Contributions from multiple disciplines do not easily circulate in the absence of a centralized community (Dogucu et al., 2025), leading to fragmentation. The interdisciplinary nature of data science is becoming multidisciplinary. Numerous professional societies now have explicit data science, or closely related, subgroups and focus areas. Domain-specific data science journals — Environmental Data Science and the Annual Review of Biomedical Data Science, to name a few — are excellent outlets for research; yet, we may be losing the interactive and holistic aspects of an interdisciplinary field. Navigating the entire data science landscape is a challenge. This further manifests itself in the many distinct roles that appear across “Data Scientist” job advertisements (Saltz and Grady, 2017) and culminates in the “unicorn problem,” where employers hold the unrealistic expectation that one person can master all the skills of what is considered data science (Saltz and Grady, 2017).
An Engineering Perspective
Wing’s questions (2020) reveal that data science has a fundamentally different relationship with domain context than mathematics, statistics, or computer science. This different relationship — where domain is integral rather than inspirational — is precisely what distinguishes engineering from science.
Domains inspire questions in the sciences, but the domains aren’t fundamental. Mathematics studies abstract structures, and we can do group theory without any application in mind. Statistics studies inference from data in general and we can develop a statistical theory without a specific domain. Computer Science studies computation abstractly and we can develop algorithms, complexity theory, and coding languages without applications in mind. These fields are inspired by domains but exist independently of those domains.
Engineering, on the other hand, cannot exist without application context. Civil engineering simply can’t be studied without considering what you’re building (bridges, dams, buildings). The domain isn’t just inspirational — it’s constitutive. We can’t teach mechanical engineering as pure abstraction and then “add” applications later. Trade-offs (e.g., algorithmic complexity, efficiency, cost) only make sense within the engineer’s domain constraints. Data science fits this model.
A data scientist’s job is more analogous to a civil engineer designing a bridge than a physicist studying fundamental forces. The bridge needs to work given the materials available, the budget, the terrain, and safety requirements — even if that means using approximations rather than perfect solutions. Yet, engineering disciplines can also generate foundational insights as byproducts without that being their purpose. Thermodynamics emerged partly from engineers trying to build better steam engines. Information theory came from engineers working on telecommunications. But the field’s telos is building systems that work, not advancing foundational theory. A data scientist who develops a model that improves customer retention by 5% has succeeded, even if they used off-the-shelf methods and generated zero novel insights.
Data science is fundamentally about building things that work in messy, real-world contexts. Like other engineering disciplines, it involves:
- Making pragmatic trade-offs (accuracy vs. interpretability vs. computational cost)
- Working within constraints (limited data, computational resources, business requirements)
- Integrating multiple techniques to solve practical problems
- Focusing on deployment, maintenance, and iteration
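The first two items in this list can be made concrete in code. The sketch below, a purely illustrative example (the candidate models, their numbers, and the constraint values are all assumptions, not from the article), frames model selection as constrained optimization rather than a single accuracy ranking:

```python
# Hypothetical sketch: model selection as an engineering trade-off.
# All names and numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float        # holdout accuracy
    interpretable: bool    # can we explain it to a regulator?
    latency_ms: float      # inference cost per request

def select(candidates, *, min_accuracy, max_latency_ms, need_interpretable):
    """Return the most accurate candidate satisfying all constraints."""
    feasible = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and c.latency_ms <= max_latency_ms
        and (c.interpretable or not need_interpretable)
    ]
    if not feasible:
        raise ValueError("no model meets the engineering constraints")
    return max(feasible, key=lambda c: c.accuracy)

models = [
    Candidate("logistic_regression", 0.86, True, 2.0),
    Candidate("gradient_boosting",   0.91, False, 8.0),
    Candidate("deep_ensemble",       0.93, False, 40.0),
]

# A regulated, latency-bound deployment rules out the most accurate model.
chosen = select(models, min_accuracy=0.80, max_latency_ms=10.0,
                need_interpretable=True)
print(chosen.name)  # logistic_regression
```

The engineering point is in the last call: once interpretability and latency are constraints rather than afterthoughts, “which model is best?” stops having a context-free answer.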
Perhaps data science is best understood — and taught — using an engineering framework. Perhaps data science needs specializations analogous to mechanical, civil, and electrical engineers. This engineering framing is about epistemology and practice, not necessarily organizational structure. Engineering is fundamentally about how you approach problems — building systems that work under constraints — not about departmental affiliation. Biomedical engineering is engineering whether it’s housed with mechanical engineering or in a medical school. What matters is that data science programs adopt engineering principles: rigorous foundations, specialized tracks, focus on building rather than pure discovery, and professional standards. This can happen in statistics departments, computer science departments, engineering schools, or standalone data science departments. The key is the educational philosophy and standards, not the name of the department.
Existing Engineering Foundations
We are not the first to view data science as engineering. Steuer’s essay (2020) expertly noted that while data science was becoming the engineering of the twenty-first century, it was being taught via two very distinct approaches. The first is the inferential framework of statistics, where the goal is to make reliable statements about the world. This contrasts with computational learning theory, where data are seen as examples and the goal is to learn a general concept. Steuer (2020) notes that there is no common epistemological foundation by which all data scientists are trained. We expand upon those initial calls for common foundations and present thoughts on what this could look like for data science as an academic discipline and a profession.
Hoerl and Snee (2015) have argued for a new discipline, called statistical engineering, for dealing with large, unstructured, complex problems by combining multiple statistical tools and other disciplines. Statistical engineering is the application of statistical thinking to large, unstructured, real-world problems. This call has led to the formation of the International Statistical Engineering Association (ISEA). It would appear that ISEA views statistical engineering as the science of integrating and applying methods rigorously, with data science being the practice of using those methods.
Pan and colleagues (2021) have suggested that engineering fields introduce data science concepts such as machine learning, along with a stronger focus on statistics. They note that it is important to refine the university curriculum and train engineers to use data science and be data literate from the outset (Pan et al., 2021). We believe data science should adopt the reciprocal philosophy. Gerald Friedland has taken this to heart by introducing a novel textbook (Friedland, 2024) presenting machine learning from an engineering perspective. It’s worth noting that engineering perspectives are appearing in related domains as well. Rebecca Willett (2019), for example, has called for an engineering approach to artificial intelligence.
Although the data science as engineering idea is not new, there are still a number of open questions. How should curricula change if we accept that data science is engineering? What competencies should we emphasize? How do we teach failure — not just accuracy? Should data scientists have codes of practice like engineers do? Our goal is to continue the discussion of data science as engineering while suggesting pedagogical, professional, and ethical perspectives on these questions.
Implications for Education
Traditional engineering disciplines require deep foundational knowledge precisely because engineers need to recognize when they’re at the boundaries of established theory. A civil engineer needs to understand materials science and structural mechanics well enough to know when a design problem requires new research versus when it’s a straightforward application of known principles.
Similarly, a data scientist working on, say, a new architecture for time series prediction should ideally recognize: “This convergence behavior is weird — this might be touching on something fundamental about optimization landscapes” versus “This is just a hyperparameter tuning issue.”
We want to avoid education that generates practitioners who can use tools but not recognize when they’re observing something that violates theoretical expectations — which is exactly when foundational insights emerge. A lack of specialization creates both a signal problem (how do you assess practitioners?) and a training problem (one curriculum can’t serve all needs).
Here are a few suggestions to aid the ongoing discussions on the data science curriculum.
- Core sequence in linear algebra and probability theory.
- Physics for insight — some exposure to statistical mechanics and information theory, framed around their connections to learning systems, would be extremely valuable.
- “Foundations for practitioners” courses — Courses explicitly designed to give practitioners enough theoretical grounding to recognize anomalies and foundational questions. Not a course in tool X; rather, “Here’s what should happen according to theory, here’s what it looks like when you’re outside the theory.”
- Teach reliability, testing, and explainability as first-class concepts.
- Case studies of foundational discoveries — Teaching through examples like “how dropout was discovered” or “why the Adam optimizer converges differently than theory predicted” to train the skill of recognizing foundational questions.
- Introduce capstone “design labs” modeled after engineering senior design.
- A focus on data ethics and fairness.
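The “foundations for practitioners” item above — here’s what should happen according to theory, here’s what it looks like when you’re outside the theory — can be demonstrated in a few lines. The sketch below is a classroom-style illustration of our own devising, not from the article: the law of large numbers stabilizes the sample mean for Gaussian data, but Cauchy data has no mean, so the same estimator never settles down.

```python
# Illustrative sketch: the same estimator inside vs. outside its theory.
import math
import random

def running_mean_spread(draw, n=20_000, seed=0):
    """Spread of the running mean over the last half of the sequence."""
    rng = random.Random(seed)
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += draw(rng)
        means.append(total / i)
    tail = means[n // 2:]
    return max(tail) - min(tail)

gaussian = lambda rng: rng.gauss(0.0, 1.0)
# Standard Cauchy via inverse-CDF sampling: heavy tails, undefined mean.
cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))

inside = running_mean_spread(gaussian)   # LLN applies: the mean settles
outside = running_mean_spread(cauchy)    # LLN violated: it keeps wandering
print(f"gaussian spread={inside:.4f}  cauchy spread={outside:.4f}")
```

A practitioner trained only on tools sees “noisy results”; one with this grounding sees an estimator operating outside the assumptions that guarantee its behavior — exactly the recognition skill the course is meant to build.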
What changes in the classroom is a shift from a scientific framing — fit a model to predict house prices — to an engineering framing — design a pricing model that’s accurate, explainable to regulators, and automatically retrains when market conditions shift. Now students must consider pipelines, versioning, monitoring, and ethics — not just mean absolute error. Engineering students learn that systems fail, and that design is iterative. Data science students should too.
Ethics would be taught as a design constraint. Rather than tacking on ethics as a discussion topic, it’s treated as a design parameter. If our systems must not produce disparate outcomes by gender or race then ethics becomes a technical design requirement, not a moral afterthought.
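One way to make “ethics as a design parameter” concrete is to compute a fairness metric as a release gate, the way a load factor is checked in structural engineering. The sketch below is a minimal illustration under assumed data; the metric (demographic parity difference), the threshold, and the toy predictions are all our own illustrative choices:

```python
# Sketch: a fairness requirement enforced as a technical gate.
# Metric, threshold, and data are illustrative assumptions.
def demographic_parity_gap(predictions, groups):
    """Largest gap in positive-prediction rates across groups."""
    rates = {}
    for g in set(groups):
        members = [p for p, grp in zip(predictions, groups) if grp == g]
        rates[g] = sum(members) / len(members)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

preds  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # model's binary decisions
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

gap = demographic_parity_gap(preds, groups)
MAX_GAP = 0.2  # design requirement, set with domain experts and regulators
print(f"parity gap = {gap:.2f}")
assert gap <= MAX_GAP, "fairness requirement violated: do not deploy"
```

Framed this way, the fairness check fails the build the same way a unit test does — it is a specification the system must meet, not a discussion section appended to the report.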
In an engineering-style data science, tools are not optional extras. Choosing the correct tools for reproducibility, monitoring and deployment, automation, and documentation become the equivalent of safety codes and standards in traditional engineering.
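As a small illustration of tooling-as-safety-code, a team might stamp every reported result with the provenance needed to re-run it. The record fields and values below are hypothetical assumptions, not a prescribed standard:

```python
# Sketch: a provenance "stamp" attached to every reported result.
# Field names and values are illustrative assumptions.
import hashlib
import json

def run_record(data_bytes, seed, metric_name, metric_value, code_version):
    """Bundle a result with the fingerprints needed to reproduce it."""
    return {
        "code_version": code_version,                          # e.g. a git tag
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(), # dataset fingerprint
        "seed": seed,                                          # RNG seed used
        metric_name: metric_value,
    }

record = run_record(b"toy,dataset\n1,2\n", seed=42,
                    metric_name="mae", metric_value=3.1,
                    code_version="v1.4.2")
print(json.dumps(record, indent=2))
```

A result that arrives without such a stamp would be treated like an uninspected weld: possibly fine, but not certifiable.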
Our assessment of students also shifts. Instead of grading only accuracy or mathematical derivations, we evaluate robustness, clarity of design, interpretability, and fairness metrics. Students should be rewarded for building systems that last.
The shifts in pedagogy would give practitioners the ability to:
- Read theoretical papers and understand what they’re claiming
- Recognize when empirical results contradict theoretical expectations
- Have theoretical and physical intuitions about algorithms
- Know when to consult deeper theory
- Communicate with researchers in adjacent fields
- Learn from system failure
To be clear, we’re not saying “reorganize all colleges and universities.” Rather, “recognize data science as an engineering practice and structure education accordingly.” Engineering is a mode of practice, not just an organizational category. The engineering framing is about professional identity and educational standards, not departmental location.
Proposed Specializations and Modifications to Professional Societies
If data science is engineering, we must shift from the scientific model (focused on research dissemination and academic credentialing) to the engineering model (focused on professional standards, public responsibility, and practice competence). This includes specializations, enforceable ethics codes, technical standards with regulatory implications, and educational accreditation. What might data science specializations look like? Here’s one possible breakdown to move the conversation forward.
Statistical/Experimental Data Scientist
- Educational requirements: causal inference, experimental design, survey methodology
- Applications: A/B testing, policy evaluation, clinical trials
- Math core: Real analysis, probability, statistics
- Limited exposure to: Distributed systems, deep learning
AI/Machine Learning Data Scientist
- Educational requirements: algorithms, distributed systems, optimization
- Applications: Recommendation systems, search, large-scale prediction
- Math core: Linear algebra, optimization, some statistical mechanics
- Heavy exposure to: Software engineering, MLOps, scalability
Scientific/Research Data Scientist
- Educational requirements: domain science + statistics
- Applications: Genomics, climate, physics, social science
- Math/Science core: physics, statistics, linear algebra, scientific computing
- Focus on: Interpretability, uncertainty quantification, causal models
Business Intelligence Data Scientist
- Educational requirements: business/economics, some statistics and calculus
- Heavy on: SQL, visualization, communication, domain knowledge
- Applications: Dashboards, reports, exploratory analysis
Data science programs and professional societies with an engineering focus would have data standards analogous to engineering building codes — not for the regulatory function of building codes, but for the certification of tools and approaches for industry. These would consist of data documentation standards (what constitutes adequate documentation?), model validation protocols (when is a model ready for deployment?), reproducibility standards (minimum requirements for computational reproducibility), fairness and bias testing protocols, and security and privacy standards for data handling. These shouldn’t be academic papers — they should be living standards co-developed and adopted by industry.
Membership and focus would also shift within data science professional societies. There would be equal space for practitioners, not just academic research. Engineers learn from failures (e.g. bridge collapses). Data science needs failure case studies as well. Ethics, centered on consequences, would dominate teaching and publication. Public welfare (when should a data scientist refuse to build something?), downstream harms (responsibility for how models are deployed), and enforceable standards (not just aspirational) would take center stage. Engineering ethics asks: “What could go wrong and who could be harmed?” Data science ethics should do the same.
Teaching data science as engineering redefines success from “model accuracy” to “system reliability and responsibility”. As our data systems shape the world, we must train data scientists not just as analysts of data but as engineers of data system consequences.
Avoiding a False Dichotomy
The “science discovers, engineering applies” narrative is overly simplistic. Reality is much richer. History shows engineering and science intertwine, with many foundational scientific insights emerging from engineering practice. The boundary is permeable and productive. Data science will generate new scientific insights, and data scientists who make scientific discoveries are doing exceptional engineering, not abandoning engineering for science. In this regard, the name is really of secondary concern because an engineering framing values both types of contributions. While its pedagogy and professionalism recognize that most work is synthesis and application, we should still create space for discovery. This is a much healthier model than pretending all data scientists are doing fundamental science, or that those who build systems are somehow lesser. Viewing data science as…
The engineering discipline that applies statistical, computational, and domain knowledge to design data-driven systems that operate effectively and ethically in practice
…clarifies why data scientists value pipelines and scalability, why reproducibility and maintainability matter, and why data science doesn’t need to invent new math to be a real field. When we see data science as engineering, we stop asking “Which model is best?” and start asking “Which system design solves this problem responsibly and sustainably?” That shift produces practitioners who can think end-to-end, balancing theory, computation, and ethics — much like civil engineers balance physics, materials, and safety.
Acknowledgements
The author would like to thank Dr. Bill Harder (Director of Faculty Development and Teaching Excellence) and Dr. Rodney Yoder (Associate Professor of Physics and Engineering Science) for helpful discussions and feedback on this article.
References
Blei, D. M. and Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689–8692.
Cleveland, W. S. (2001). Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review, 69(1), 21–26.
Dogucu, M., Demirci, S., Bendekgey, H., Ricci, F. Z., and Medina, C. M. (2025). A Systematic Literature Review of Undergraduate Data Science Education Research. Journal of Statistics and Data Science Education, 33(4), 459-471.
Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4), 745-766.
Friedland, G. (2024), Information-Driven Machine Learning, Springer Cham, https://doi.org/10.1007/978-3-031-39477-5
Hoerl, R. W. and Snee, R. D. (2015), Statistical Engineering: An Idea Whose Time Has Come?, arXiv preprint, https://arxiv.org/abs/1511.06013
Meng, X.-L. (2019). Data Science: An Artificial Ecosystem. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.ba20f892
Pan, I., Mason, L., and Matar, M. (2021), Data-Centric Engineering: integrating simulation, machine learning and statistics. Challenges and Opportunities, arXiv preprint, https://arxiv.org/abs/2111.06223
Saltz, J. S. and Grady, N. W. (2017). The ambiguity of data science team roles and the need for a data science workforce framework. 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 2017, pp. 2355-2361, doi: 10.1109/BigData.2017.8258190.
Steuer, D. (2020), Time for Data Science to Professionalise, Significance, Volume 17, Issue 4, August 2020, Pages 44–45, https://doi.org/10.1111/1740-9713.01430
Wilkerson, M. H. (2025). Mapping the Conceptual Foundation(s) of ‘Data Science Education.’ Harvard Data Science Review, 7(3). https://doi.org/10.1162/99608f92.9ac68105
Willett, R. (2019). Engineering Perspectives on AI. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.98280d4a
Wing, J. M., Janeja, V. P., Kloefkorn, T., & Erickson, L. C. (2018). Data Science Leadership Summit, Workshop Report, National Science Foundation. Retrieved from https://dl.acm.org/citation.cfm?id=3293458
Wing, J. M. (2020). Ten Research Challenge Areas in Data Science. Harvard Data Science Review, 2(3). https://doi.org/10.1162/99608f92.c6577b1f
