Abstract
Artificial intelligence (AI) is rapidly transforming acute care surgery (ACS), encompassing trauma, emergency general surgery, and critical care. This article synthesizes key insights from the “Artificial Intelligence in Surgery: The Future is Now” panel session at the 2025 American Association for the Surgery of Trauma Annual Meeting. Panelists discussed current clinical applications including large language models for documentation and evidence synthesis, physiologic foundation models for intensive care unit monitoring, AI-enhanced feedback systems for surgical education, video-based performance analytics, and interpretable risk prediction tools. Emerging technologies including digital twins, augmented reality navigation, and AI-enabled robotics were also examined. Cross-cutting themes emphasized interpretability over opaque “black-box” models, rigorous bias auditing, and the critical importance of external validation and pragmatic human versus human plus AI study designs. Implementation requires robust data infrastructure, institutional governance, and staged deployment prioritizing augmentation over automation. The panel concluded that responsible AI adoption in ACS rests on three pillars: rigorous evaluation standards commensurate with clinical influence, institutional investment in infrastructure and “algorithmic stewardship,” and AI literacy as a core professional competency. Meeting these conditions positions AI to reduce administrative burden and support more precise, equitable care for acutely ill and injured patients.
Acute care surgery (ACS) is defined as the combination of trauma, emergency general surgery (EGS), and critical care.1–3 Interest in artificial intelligence (AI) within ACS has accelerated dramatically.4 Although hundreds of models have been described, relatively few tools are routinely used at the bedside, and even fewer have shown measurable benefit for patient-centered outcomes. This gap between algorithm development and clinical deployment reflects persistent challenges in data quality, model generalizability, integration into complex surgical workflows, and the readiness of clinical teams and institutions to adopt and govern new technologies.
To examine these issues in a concrete, practice-oriented manner, the American Association for the Surgery of Trauma (AAST) convened an “Artificial Intelligence in Surgery: The Future is Now” Panel Session at its 2025 Annual Meeting in Boston, Massachusetts, USA. Three panelists and a multidisciplinary audience discussed how AI is currently being applied in the ACS environment, how it might realistically be incorporated into practice, and what future applications may hold. The panel was moderated by JC (Professor of Surgery, University of California San Francisco; Chief of Surgery at Zuckerberg San Francisco General Hospital; Vice Chair of Acute Care Surgery, University of California San Francisco) and RJ (Professor of Surgery, Renaissance School of Medicine Stony Brook University; Program Director, Surgical Critical Care Residency). This article distills key insights from that session.
Although comprehensive reviews have examined AI in medicine broadly,5 we highlight the current and potential roles for AI in trauma, EGS, and critical care and link these to the infrastructure needed to support them, including data pipelines, computational resources, and institutional oversight. We demonstrate that AI is rapidly becoming a clinical partner enabling more efficient and evidence-based care, while acknowledging ongoing challenges including algorithmic fairness and model interpretability. Furthermore, we outline cross-cutting approaches to validation, with an emphasis on transparency, fairness, and external testing, and then turn to implementation strategies, including workflow design, team dynamics, and the ongoing responsibilities of surgeons and surgical intensivists who choose to incorporate AI tools into their practice.
Current clinical applications and use cases
Documentation and knowledge synthesis
Large language models (LLMs) and related foundation models have emerged as prominent enabling technologies for AI in ACS settings. Unlike traditional task-specific prediction tools, a single LLM can be applied to multiple language-based tasks, including chart summarization, structured documentation, International Classification of Diseases-10 code suggestions, and literature retrieval, all from a single patient-specific query.6 7 TGB (Professor of Surgery and of Anesthesiology, Emory University School of Medicine; Medical Director, Emory eICU Center) demonstrated how an LLM integrated with the electronic health record (EHR) could ingest existing notes from a complex admission and generate a draft hospital course, a structured system-based assessment, and a candidate list of billing codes within seconds.8 In his example, the suggested coding list was both more comprehensive and more specific than what would typically be generated manually.
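To make this pattern concrete, the sketch below shows how a single chat-style query could request all three drafting tasks at once. It assumes an OpenAI-style chat API; the model identifier, prompt wording, and input notes are illustrative placeholders, not a validated clinical integration.

```python
# Minimal sketch of a multi-task LLM drafting call, assuming an OpenAI-style
# chat API. The model identifier, prompt wording, and input notes are
# illustrative placeholders, not a validated clinical integration.
from openai import OpenAI

client = OpenAI()  # API credentials are read from environment variables

def draft_admission_outputs(notes: str) -> str:
    """Request a hospital-course draft, a system-based assessment, and
    candidate ICD-10 codes from the same chart text in one query."""
    prompt = (
        "From the clinical notes below, draft: (1) a hospital course summary, "
        "(2) a system-based assessment, and (3) a candidate list of ICD-10 "
        "codes. Flag anything uncertain for human review.\n\nNOTES:\n" + notes
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # favor reproducible drafts for documentation
    )
    return response.choices[0].message.content  # reviewed and signed by a clinician
```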
A related application is evidence-linked decision support.9 10 The same summary used for documentation can be routed to an evidence engine that restricts the model’s retrieval to curated medical sources and returns a structured, citation-supported plan. This allows intensivists and surgeons to quickly align their management with current institutional and evidence-based guidelines while clearly documenting the underlying reasoning in the chart. In some implementations, clinicians may also accrue continuing medical education credit for engaging with patient-specific evidence. LLMs can also streamline handoffs by generating concise sign-out notes that highlight interval changes and outstanding tasks based on brief clinician updates.11
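A minimal illustration of the retrieval restriction underlying such an evidence engine is sketched below: the system may only rank and return passages from a curated corpus, each tagged with its citation. The corpus entries and the simple similarity scoring are invented for illustration, not an actual product.

```python
# Minimal retrieval-restriction sketch: answers may only draw on a curated
# guideline corpus, and every returned passage carries its citation. The
# corpus contents and TF-IDF scoring are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CURATED_SOURCES = [
    ("Institutional sepsis bundle, v3",
     "Obtain cultures and give broad-spectrum antibiotics within 1 hour ..."),
    ("EAST guideline: blunt splenic injury",
     "Hemodynamically stable grade I-III injuries may be managed nonoperatively ..."),
]

def retrieve(query: str, k: int = 1):
    titles, texts = zip(*CURATED_SOURCES)
    vec = TfidfVectorizer().fit(texts + (query,))
    sims = cosine_similarity(vec.transform([query]), vec.transform(texts))[0]
    ranked = sorted(zip(sims, titles, texts), reverse=True)[:k]
    # Each returned passage keeps its citation attached, so the reasoning
    # behind the plan can be documented in the chart.
    return [{"citation": t, "passage": x, "score": round(s, 3)} for s, t, x in ranked]

print(retrieve("antibiotic timing in septic shock"))
```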
These examples underscore that LLMs are most useful when treated as drafting tools and interfaces to the evidence base rather than being used as autonomous decision-makers. They can reduce documentation burden, standardize key elements of assessment and coding, and tighten the link between care plans and supporting literature. However, because they remain susceptible to errors and hallucinations,12 their outputs require systematic review, and their deployment should be governed by clear validation and oversight processes.
Physiologic monitoring and prediction
Foundation models (large-scale AI models) are also beginning to reshape how physiologic data are used for monitoring and prediction in the intensive care unit (ICU).13 Conventional risk scores and institution-specific EHR models rely on sparse variables such as intermittent vital signs, daily laboratory values, and comorbidity fields, and therefore perform well at a population level but offer limited guidance for the individual patient in real time. In contrast, modern ICUs continuously generate dense waveform data from electrocardiography, arterial lines, and photoplethysmography (PPG), which provides an additional, non-invasive means of patient monitoring.14 By treating these signals as sequences analogous to language, physiologic foundation models can be pretrained on large collections of unlabeled waveforms and then fine-tuned for specific tasks.
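The “waveforms as language” idea can be sketched in a few lines: a toy transformer is pretrained to reconstruct masked PPG patches, after which the encoder could be fine-tuned for downstream tasks. All dimensions, the patching scheme, and the random data below are assumptions for illustration only, not the architecture of any published model.

```python
# Toy sketch of self-supervised pretraining on PPG: a small transformer learns
# to reconstruct masked waveform patches. Dimensions, patching, and the random
# data are illustrative assumptions only.
import torch
import torch.nn as nn

PATCH, D_MODEL = 50, 64  # 50-sample PPG patches embedded as 64-dim tokens

class TinyPPGEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH, D_MODEL)         # patch -> token
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, PATCH)          # reconstruct masked patch

    def forward(self, patches, mask):
        tokens = self.embed(patches)
        tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
        return self.head(self.encoder(tokens))

x = torch.randn(8, 20, PATCH)                          # 8 segments x 20 patches
mask = torch.rand(8, 20) < 0.15                        # mask 15% of patches
model = TinyPPGEncoder()
loss = nn.functional.mse_loss(model(x, mask)[mask], x[mask])  # pretraining loss
loss.backward()  # fine-tuning would swap `head` for a task-specific head
```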
TGB highlighted PPG as an early exemplar of this approach. In his group’s work, a generative transformer model trained on large volumes of ICU PPG segments achieved state-of-the-art performance on benchmark tasks such as heart rate estimation, atrial fibrillation detection, respiratory rate estimation, and blood pressure estimation using only the PPG signal.15 For acute care surgeons and intensivists, the appeal of such models is that they point toward monitoring systems that move beyond static sepsis or mortality scores to dynamic,4 patient-specific assessment of how an individual responds to fluid resuscitation, ventilator changes, or vasoactive titration.
Feedback quality improvement
In most ACS environments, surgeons move rapidly between the operating room, ICU, and trauma bay, leaving limited time for structured, behavior-specific feedback. Even as competency-based frameworks and entrustable professional activities (EPAs) emphasize direct observation and targeted coaching, trainees often progress through key milestones having received few concrete, actionable comments on their performance.16 Within this context, AI offers a way not to replace human judgment, but to systematically improve the quality of feedback that is already being given.17 MN (Professor of Surgery, Robert Wood Johnson Medical School; Trauma Medical Director; Executive Director, Rutgers Acute Care Surgery Research Lab) discussed his collaboration in the evaluation of Teach1, a proprietary AI-based platform designed to assess and enhance feedback quality in procedural training.
In a recent study of 75 fourth-year medical students learning paracentesis on simulators, all participants had low baseline technical scores on the Objective Structured Assessment of Technical Skills (OSATS); each performed the procedure, received instructor feedback via a digital platform, and then repeated the task. The Teach1 algorithm evaluated 688 feedback comments against five quality criteria: direct observation, skill specificity, reinforcement of strengths, targeted improvement suggestions, and actionable plans. OSATS scores improved from 15.76±3.15 to 21.68±2.30 after feedback, and AI-derived feedback quality percentages correlated positively with performance gains (p<0.001).18 Reinforcing strengths and suggesting targeted improvements were particularly powerful drivers of skill gains, and AI assessments showed substantial agreement with human expert ratings (κ=0.78).18 These findings support the premise that AI can help identify, standardize, and scale high-quality feedback across instructors and sites, offering a practical mechanism to strengthen EPA-aligned surgical education in trauma and acute care settings.
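For readers interested in the analytic pattern behind such results, the hedged sketch below reproduces its shape on simulated data: a paired pre/post comparison, a correlation of feedback quality with skill gain, and Cohen’s kappa for AI–expert agreement. The generated numbers are placeholders, not study data.

```python
# Sketch of the analysis pattern only: paired pre/post scores, quality-vs-gain
# correlation, and rater agreement. All values are simulated placeholders.
import numpy as np
from scipy.stats import ttest_rel, pearsonr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
pre = rng.normal(15.8, 3.2, 75)                      # baseline OSATS scores
quality = rng.uniform(0, 100, 75)                    # % of feedback criteria met
post = pre + 3 + 0.05 * quality + rng.normal(0, 1, 75)

print(ttest_rel(post, pre))                          # did scores improve?
print(pearsonr(quality, post - pre))                 # feedback quality vs gain

ai_rating = rng.integers(0, 2, 688)                  # AI: criterion met yes/no
expert = np.where(rng.random(688) < 0.9, ai_rating, 1 - ai_rating)
print(cohen_kappa_score(ai_rating, expert))          # agreement (kappa)
```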
Quality improvement: Video-based performance analytics and trauma simulation
Video review is increasingly used in trauma and surgical education to make rare, high-stakes events visible and open to structured debriefing, but traditional programs are labor-intensive, depend on a small group of motivated faculty, provide unstructured and variable feedback, and are difficult to scale across all resuscitations and operations in a busy center.19 MN described how AI-enabled video analytics can reduce this burden by automatically identifying key events, segmenting phases of a resuscitation or procedure, and generating simple performance metrics related to timing, workflow, and team communication.20 If these outputs are then incorporated into existing debriefing structures and curricula, such as the American College of Surgeons Stop the Bleed or Trauma Evaluation and Management course and other trauma team training programs,21–23 they can support more routine, data-informed feedback on both technical and non-technical performance, without requiring faculty to manually watch and code every minute of video.
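A minimal sketch of the downstream analytics step is shown below: per-frame phase labels (hand-typed here; in practice, model output) are collapsed into timed segments from which simple metrics such as time-to-hemorrhage-control can be read off for debriefing. The phase names and frame rate are assumptions.

```python
# Illustrative sketch: collapse per-frame AI phase labels into timed segments
# and derive simple metrics for debriefing. Labels and the 1 fps rate are
# assumptions; a real system would ingest model output, not a typed list.
from itertools import groupby

FPS = 1  # assume one classified frame per second
frame_labels = (["arrival"] * 40 + ["primary_survey"] * 180 +
                ["imaging"] * 300 + ["hemorrhage_control"] * 240)

def segments(labels, fps):
    t, out = 0.0, []
    for phase, grp in groupby(labels):
        dur = len(list(grp)) / fps
        out.append({"phase": phase, "start_s": t, "duration_s": dur})
        t += dur
    return out

for seg in segments(frame_labels, FPS):
    print(seg)
# e.g. time-to-hemorrhage-control = start_s of that segment, a reviewable metric
```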
Risk prediction
Risk prediction in ACS is fundamentally non-linear: the effect of any one factor depends on what else is present.4 24–26 Isolated hypertension may add only modest perioperative risk, but when combined with advanced cirrhosis or septic shock its marginal contribution is trivial compared with the dominant driver of risk. As more variables are considered, the number of possible interactions grows rapidly, and simple additive scores struggle to capture this complexity for an individual patient. HK (Professor of Surgery, Harvard Medical School; Medical Director, Hospital Quality and Safety; Medical Director, Trauma; Director of the Center for Outcomes and Patient Safety in Surgery) illustrated how interpretable AI methods can better match this structure. Using optimal classification trees trained on large registry datasets, his group developed the Predictive Optimal Trees in Emergency Surgery Risk (POTTER) model for EGS, which outperforms traditional tools including surgeon gestalt in both discrimination and calibration.25 27 POTTER guides the user through a brief iterative series of clinically intuitive questions and then places the patient into a terminal node with empirically derived risks of mortality and major complications. In a pragmatic study comparing surgeons’ unaided predictions with those made after viewing the model’s estimates, surgeons alone tended to overestimate risk in relatively healthy older patients and underestimate risk in younger but critically ill patients; access to POTTER’s outputs improved the accuracy of their predictions without replacing clinical judgment. His group similarly developed the Trauma Outcome Predictor (TOP), an AI-enhanced calculator that predicts mortality in trauma patients.28
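POTTER itself is built with optimal classification trees; as a loose, openly available stand-in, the sketch below fits a shallow CART to synthetic data containing an interaction between shock and age, and prints its question-by-question structure. The features, coefficients, and data are invented for illustration.

```python
# Hedged stand-in for an interpretable risk tree: a shallow CART on synthetic
# data, printed as a readable question path. Not the POTTER model itself.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = np.column_stack([
    rng.integers(18, 95, 2000),        # age
    rng.integers(0, 2, 2000),          # septic shock (0/1)
    rng.integers(0, 2, 2000),          # cirrhosis (0/1)
])
# Non-linear ground truth: shock dominates; age matters mostly without shock
p = 0.03 + 0.45 * X[:, 1] + 0.15 * X[:, 2] + 0.002 * (X[:, 0] - 18) * (1 - X[:, 1])
y = rng.random(2000) < p

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # depth kept readable
print(export_text(tree, feature_names=["age", "septic_shock", "cirrhosis"]))
# Each terminal node carries an empirical event rate, analogous to a
# clinically intuitive question path ending in a risk estimate.
```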
MN extended this discussion from mortality and major complications to surgical site infection (SSI),29 emphasizing that SSI risk is shaped by an even more heterogeneous and interacting set of factors, including host characteristics, procedural details, environmental exposures, and microbial dynamics.30 He outlined a multimodal SSI prediction framework that integrates conventional EHR data with intraoperative video, wound imaging, operating room traffic and environmental measurements, and microbiome information to generate dynamic patient-facing and surgeon-facing risk estimates. The overall aim is to move beyond registry-based calculators toward models that can update SSI risk in real time, inform intraoperative decisions such as delayed primary closure or application of negative-pressure therapy, and support targeted postdischarge surveillance and early intervention for the highest-risk patients.
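One simple way to realize such multimodal integration is late fusion, sketched below with stub feature blocks standing in for the real video, environmental, and microbiome encoders, which are not specified in the framework described here.

```python
# Conceptual late-fusion sketch for SSI risk: per-modality features are
# concatenated and fed to a single risk head. Encoders are random stubs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 500
ehr = rng.normal(size=(n, 10))          # structured EHR features
video = rng.normal(size=(n, 16))        # stub intraoperative video embedding
environment = rng.normal(size=(n, 4))   # stub OR traffic/environment features
fused = np.hstack([ehr, video, environment])

y = rng.random(n) < 0.1                 # synthetic SSI labels
risk_head = LogisticRegression(max_iter=1000).fit(fused, y)
print(risk_head.predict_proba(fused[:3])[:, 1])  # per-patient SSI risk estimates
```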
Human-machine interaction: Digital twins, robotics, and augmented reality
At present, AI is already deployed in surgery to provide intraoperative guidance with identification of structures and safe zones for operation.31 32 RJ described how AI-enabled digital twins and augmented reality (AR) are extending AI in surgery beyond prediction toward real-time guidance and, for selected tasks, partial autonomy. Surgical “digital twins” range from static three-dimensional reconstructions of preoperative CT or MRI for planning, to functional models that simulate biomechanics or flow, to “shadow” or “intelligent” twins that remain linked to intraoperative data.33 In practice, these systems allow AR overlays of segmented anatomy and planned trajectories onto the live operative field, helping surgeons navigate along predefined paths while the digital twin serves as an anatomic reference that updates as the procedure progresses.34 35
Robotics provides an additional layer of AI-enabled assistance. One example discussed was a Magnetically Anchored Robotic Surgery (MARS) platform that enables a solo surgeon to perform reduced-port laparoscopic cholecystectomy with an AI-guided system that autonomously controls the camera by tracking the operative instruments, eliminating the need for a dedicated camera assistant.36 Work from Johns Hopkins has further pushed toward task-level autonomy: the Smart Tissue Autonomous Robot has performed laparoscopic small-bowel anastomosis in porcine models with minimal human intervention, and a newer “Surgical Robot Transformer–Hierarchy” (SRT-H) framework has automated key steps of cholecystectomy on ex vivo and experimental models using a hierarchical transformer architecture trained on human demonstrations.37 The MARS system has been deployed in clinical practice, whereas SRT-H remains preclinical; together they illustrate a trajectory in which digital twins, AR navigation, and AI-enabled robotics evolve from passive visualization tools to increasingly active partners in standardized components of surgical care, always under human oversight.
Cross-cutting themes for clinical and surgical artificial intelligence
Interpretability, bias and fairness
Interpretability, bias, and fairness emerged as recurrent concerns across all three talks. Panelists argued that in high-stakes ACS settings, the appeal of a marginally higher area under the receiver operating characteristic curve (AUROC) does not justify reliance on a completely opaque “black-box” model whose internal logic cannot be interrogated.38 39 This was illustrated most clearly in HK’s work on the TOP calculator, where an early decision tree model placed race as a major split near the top of the tree.28 Left unchecked, such a model would have encoded existing inequities in access to postacute rehabilitation and silently perpetuated them under the guise of algorithmic objectivity. By insisting on an interpretable structure and explicitly auditing the model for disparate impact, his team was able to identify this behavior, introduce fairness constraints that equalized recommendations across racial groups, and in the process improve overall model discrimination. The example underscored that bias is not a theoretical possibility but an expected property of models trained on biased systems, and that interpretability is a practical tool for detecting and correcting it. It also highlighted the responsibility of individual surgeons to critically evaluate model outputs so as to recognize and minimize bias in their own practice.
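The auditing step generalizes readily. The sketch below illustrates a basic disparate-impact check on synthetic predictions, comparing subgroup discrimination and positive-recommendation rates; it illustrates the principle only and is not the TOP team’s actual procedure.

```python
# Minimal disparate-impact audit on synthetic predictions: compare subgroup
# AUROC and positive-recommendation rates. Illustration of the principle only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
group = rng.integers(0, 2, 4000)                 # two demographic subgroups
y = rng.random(4000) < 0.2                       # observed outcome
score = 0.6 * y + 0.1 * group + rng.normal(0, 0.3, 4000)  # biased model score

for g in (0, 1):
    m = group == g
    print(f"group {g}: AUROC={roc_auc_score(y[m], score[m]):.3f}, "
          f"recommendation rate={(score[m] > 0.5).mean():.3f}")
# A gap in recommendation rates at similar outcome rates is the red flag
# that would prompt fairness constraints of the kind described above.
```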
Bias and fairness concerns extend beyond model structure to the underlying data and the way fairness is reported. Many ACS datasets are drawn from high-resource academic centers, with under-representation of community, rural, and international settings; even a technically “fair” model may therefore generalize poorly and systematically underperform in populations that were not well captured in its training data.4 40 41 There is also a risk of “fairness-washing,” in which the presence of a fairness metric or a single debiasing step is treated as sufficient, without ongoing auditing of subgroup performance or attention to how models behave once embedded in real clinical pathways. Parallel concerns apply to LLMs, which can generate fluent but incorrect statements and even fabricate references. TGB emphasized that, in academic and research settings, such tools must not be treated as epistemic authorities: AI cannot be listed as an author, and human authors remain fully responsible for checking AI-generated text, analyses, and citations to minimize the risk of incorporating bias and hallucinations.42 Peer reviewers are encouraged to scrutinize articles that rely heavily on opaque AI methods or that make strong causal claims from observational data.
Evaluation and study design
The introduction of AI into surgical decision-making raises linked ethical, regulatory, and liability questions. MN noted that most existing models have been evaluated on retrospective datasets with performance summarized by metrics such as AUROC, but very few have been tested in pragmatic trials to demonstrate improved patient-centered outcomes.4 For tools that influence triage, timing of operation, ICU admission, or choice of therapy, evaluation should address external validity (performance at other institutions and in other case-mix profiles), calibration (whether predicted risks match observed event rates), and clinical utility (whether using the model would actually change decisions in a beneficial way). Basic reporting should include how missing data were handled, how the model behaves over time (performance drift), and how it performs across key subgroups such as sex, race/ethnicity, age, and hospital type.43 The Food and Drug Administration’s evolving approach to AI-based medical devices acknowledges this complexity through adaptive regulatory pathways, yet significant gaps remain in standardizing validation and assessment methodologies.44
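Calibration, in particular, is straightforward to check. The sketch below bins synthetic predicted risks against observed event rates and reports a Brier score; in a well-calibrated model the predicted and observed rates track each other across bins. The predictions are simulated, not drawn from any clinical tool.

```python
# Basic calibration check on synthetic predictions: binned predicted risks
# vs observed event rates, plus the Brier score.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(4)
p_pred = rng.uniform(0, 1, 5000)                    # model's predicted risks
y = rng.random(5000) < np.clip(p_pred * 0.8, 0, 1)  # mildly overconfident model

obs, pred = calibration_curve(y, p_pred, n_bins=10)
for o, p in zip(obs, pred):
    print(f"predicted {p:.2f} -> observed {o:.2f}")  # match = good calibration
print("Brier score:", brier_score_loss(y, p_pred))
```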
A central point was that the gold standard for evaluating impact is not “AI vs human” but “human vs human plus AI” in real workflows. HK’s evaluation of the POTTER model, in which surgeons estimated risk first unaided and then with access to the model’s output, was highlighted as a pragmatic design that directly tests whether AI assistance improves clinician judgment. This study, along with others, emphasizes that AI should function as an aid rather than a replacement for clinician decision-making, and that its limitations must be clearly understood to ensure accurate and responsible risk assessment. Similar randomized, stepped-wedge, or carefully controlled pre–post studies will be necessary to determine whether AI tools for documentation, monitoring, risk prediction, or SSI prevention improve decision-making and patient outcomes, rather than simply yielding attractive metrics on historical data.
Infrastructure requirements: Data, compute, and governance
AI applications discussed in the session presuppose substantial infrastructure for data access and computation. TGB emphasized that LLM-based documentation tools require deep integration with the EHR, including secure access to notes, orders, laboratory results, and imaging reports, as well as the ability to write summaries and draft documentation back into routine workflows.45 Physiologic foundation models and AI-enabled video review depend on continuous capture and storage of high-frequency waveform data and high-resolution audio–video streams, with time-synchronization across monitors, devices, and the EHR. Training and running these models at scale necessitate robust on-premises or cloud-based compute, including graphics processing unit capacity, and data pipelines that can handle large volumes of streaming and archived data without degrading bedside system performance.
Security and governance structures are equally important. All the use cases described assumed Health Insurance Portability and Accountability Act-compliant environments, role-based access controls, encryption in transit and at rest, and clear separation between development and production systems. Institutions require mechanisms for version control, logging, and audit trails so that specific AI outputs can be traced to model versions and input data.46 Governance bodies, often building on existing information technology and quality committees, must approve which data streams may be used for model development, decide where AI services are hosted, and ensure that third-party vendors meet institutional standards for privacy, cybersecurity, and regulatory compliance. Without these foundational elements, even technically strong models cannot be safely or sustainably deployed in ACS.
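As a minimal illustration of such traceability, the sketch below logs each AI output together with a model version and a hash of its inputs; the field names and hashing choice are assumptions for illustration, not an institutional standard.

```python
# Minimal traceability sketch: each AI output is logged with a model version
# and a hash of its inputs so it can be audited later. Field names are assumed.
import datetime
import hashlib
import json

def audit_record(model_version: str, inputs: dict, output: str) -> dict:
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        # Hash rather than store raw inputs, limiting PHI in the audit log
        "input_sha256": hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        "output": output,
    }

print(audit_record("summarizer-v1.3", {"note_ids": [101, 102]}, "draft course ..."))
```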
Implementation strategies and human-machine teaming
Implementation strategies discussed during the session emphasized staged deployment and explicit design for human–machine teaming rather than replacement. Early phases should focus on augmentation rather than automation, positioning AI as a decision support tool and preserving full surgeon autonomy. TGB recommended beginning with low-risk, high-burden applications such as documentation support, coding assistance, and literature summarization, then gradually piloting more consequential decision-support tools in controlled environments before broader adoption.47 These pilots allow refinement of prompts and interfaces, evaluation of workflow integration and cognitive load, and collection of performance metrics that can inform scaling decisions. Panelists framed this approach as a form of “algorithmic stewardship,” analogous to antimicrobial stewardship, in which AI tools are subject to ongoing monitoring for performance drift, systematic review of edge cases, and iterative model updates. Institutional policies that define appropriate indications for AI use, override protocols, documentation of when and how AI was consulted, and thresholds for pausing or updating models were highlighted as core elements of responsible implementation. These policies should be developed and maintained through early and sustained collaboration between surgical teams, data scientists, information technology, and hospital leadership.
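Drift monitoring under algorithmic stewardship can be as simple as the sketch below: discrimination is recomputed on rolling windows of recent cases and flagged when it falls below a preset floor. The window size, threshold, and simulated decay are illustrative policy choices, not recommended values.

```python
# Sketch of performance-drift monitoring: rolling-window AUROC with a preset
# floor that triggers model review. Window, floor, and decay are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n = 3000
y = rng.random(n) < 0.2
signal = np.linspace(1.0, 0.2, n)                 # model quality decays over time
score = signal * y + rng.normal(0, 0.5, n)

WINDOW, FLOOR = 500, 0.70
for start in range(0, n - WINDOW + 1, WINDOW):
    sl = slice(start, start + WINDOW)
    auc = roc_auc_score(y[sl], score[sl])
    flag = "  <-- review model" if auc < FLOOR else ""
    print(f"cases {start}-{start + WINDOW}: AUROC={auc:.3f}{flag}")
```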
Education and culture change at every stage of training and implementation are equally essential.47 48 Surgeons and trainees will need foundational literacy in how models are developed and evaluated, including concepts such as calibration, non-linearity, and subgroup performance, so that they can interpret and question AI outputs rather than defer to them. They must also understand the limitations of these tools, including the risks of bias, hallucinations, and context-dependent errors. Familiarity with these issues enables clinicians to interpret AI output critically, recognize when it may be misleading, and apply appropriate judgment to ensure safe and equitable patient care. Panelists emphasized that AI literacy should be treated as a core competency, integrated longitudinally into residency training rather than confined to elective experiences. Faculty development will also be necessary, as many current educators lack formal training in these areas. Professional societies were identified as essential leaders in this landscape, as they are well positioned to define competency standards, develop evidence-based guidelines for safe AI use, provide educational resources and certification pathways, convene interdisciplinary expertise, and advocate for policies that promote equity, transparency, and patient safety. Their involvement was viewed as critical to ensuring consistent expectations across institutions and to supporting continuous education as AI tools evolve. Simulation was additionally highlighted as a practical setting for rehearsing workflows that incorporate AI and for practicing “no-tech” scenarios when systems fail or are unavailable.49 Across these efforts, visible institutional support, clear communication that accountability remains with the clinical team, and mechanisms for users to report failures or unintended consequences were emphasized as prerequisites for building trust and ensuring that AI functions as a useful teammate rather than an imposed constraint on clinical practice.
Conclusion
The central message of the “AI in Surgery: The Future is Now” Panel session at the 2025 AAST Annual Meeting was that AI integration into clinical care is already occurring: the ACS community can embrace it and help provide direction, or be left far behind. To this end, responsible AI adoption rests on three parallel pillars for the ACS community. First, AI systems should be held to evaluation standards commensurate with their clinical influence, including external validation, subgroup performance assessment, and pragmatic “human vs human plus AI” studies in real workflows. Second, institutions must invest in the infrastructure, governance, and “algorithmic stewardship” needed for safe deployment, including data and compute capacity, security, monitoring for performance drift, and clear policies for override and documentation. Third, AI literacy should be treated as a core professional competency for both trainees and practicing surgeons, coupled with a sustained commitment to interpretability, bias auditing, and clear human accountability. If these three conditions are met, AI is well positioned to reduce clerical burden, make better use of underexploited data streams, and support more precise and equitable care for acutely ill and injured patients.
In practical terms, these priorities translate into distinct but complementary actions: clinicians engaging critically with AI outputs and reporting failures, educators embedding AI literacy and “no-tech” contingencies into training, and institutional leaders committing resources to infrastructure, governance, and interdisciplinary oversight. Meeting these conditions requires each stakeholder to act within their sphere of influence, and doing so will determine whether AI fulfills its potential in ACS.

