The following is part two of a three-part series. Part one is available here.
The Current State of AI Safety, Alignment, and Interpretability
The field of AI safety has emerged as a critical area of research focused on ensuring that artificial intelligence systems behave in ways that are beneficial, safe, and aligned with human values. Despite significant progress in recent years, current approaches remain inadequate for addressing the complex moral reasoning challenges identified in the previous section. This section examines the current state of AI safety research, highlighting achievements and limitations while identifying opportunities for integration with cognitive science insights about moral judgment.
Mechanistic Interpretability and AI Safety
Mechanistic interpretability is one of the most promising approaches to understanding how AI systems make decisions and ensuring their safety.1 This field focuses on developing techniques to understand the internal workings of neural networks by identifying the specific mechanisms and representations that drive their behavior. By understanding how AI systems process information and make decisions, researchers hope to identify safety risks and develop more reliable approaches to alignment.
Recent advances in mechanistic interpretability have revealed insights into how large language models process information and generate responses. Researchers have identified neural circuits responsible for different types of reasoning, including mathematical computation, factual recall, and language translation.1 These discoveries suggest that similar techniques could be applied to moral reasoning in AI systems, potentially enabling us to verify that systems are engaging in appropriate moral reasoning rather than merely producing outputs that appear morally acceptable.
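To make the flavor of these techniques concrete, the sketch below trains a simple linear "probe" on synthetic activations, a common way of testing whether a concept is linearly readable from a network's hidden states. The data, dimensions, and "concept" here are invented stand-ins rather than results from any cited study.

```python
# A minimal sketch of one common interpretability technique: a linear "probe"
# trained to detect whether a concept is linearly readable from a model's
# hidden activations. The activations here are synthetic stand-ins; in real
# work they would be extracted from a trained network.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_examples, hidden_dim = 1000, 64
concept_direction = rng.normal(size=hidden_dim)           # hypothetical "concept" axis
labels = rng.integers(0, 2, size=n_examples)              # 1 = concept present in the input
activations = rng.normal(size=(n_examples, hidden_dim))   # baseline noise
activations += np.outer(labels, concept_direction)        # concept shifts activations along one axis

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
# High accuracy suggests the concept is linearly represented; for moral-reasoning
# concepts, defining the labels in the first place is the hard part.
```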
However, current interpretability techniques face significant limitations when applied to moral reasoning. Most existing methods focus on relatively simple processes, such as object recognition or arithmetic computation, while moral reasoning involves complex interactions between emotional processing, social cognition, and abstract reasoning. Techniques that work well for simple pattern recognition may be inadequate for the rich, contextual processing required for moral judgment.
The relationship between interpretability and safety is also more complex in moral domains than in technical domains. In technical contexts, we can often compare outputs to ground truth; in moral contexts, there may be legitimate disagreement about what constitutes correct reasoning. Understanding how an AI system reaches a moral conclusion does not necessarily tell us whether the conclusion is appropriate or the reasoning process is sound. The challenge is compounded by the fact that human moral reasoning itself is often opaque and inconsistent: people frequently make moral judgments based on intuitive processes they cannot fully explain. This suggests the need for interpretability approaches that can handle the complexity and ambiguity of moral reasoning.
Value Learning and Alignment
The alignment problem—ensuring that AI systems pursue objectives that are aligned with human values—is perhaps the most fundamental challenge in AI safety.2 Traditional approaches involve specifying explicit objectives, but this becomes problematic when dealing with complex human values that are difficult to specify precisely. Value learning approaches attempt to address this challenge by enabling AI systems to learn human values from observation and interaction.
Reinforcement learning from human feedback (RLHF) has emerged as one of the most promising approaches.3 This technique trains AI systems to maximize rewards based on human evaluations of their behavior, guiding systems toward behavior that aligns with human values without requiring those values to be specified explicitly. RLHF has been applied to train large language models to be more helpful, harmless, and honest.4
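To illustrate the mechanics, the sketch below shows only the pairwise reward-modeling step that underlies RLHF, using random embeddings in place of encoded model responses. This is a simplified illustration, not a production pipeline: real systems train a reward model on top of a language-model backbone and then optimize the policy against it.

```python
# A minimal sketch of the preference-based reward modeling step at the core of
# RLHF, assuming fixed-size embeddings stand in for encoded responses. Real
# pipelines use a language-model backbone and then optimize the policy against
# the learned reward; only the pairwise loss is shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 32
reward_model = nn.Linear(embed_dim, 1)            # maps a response embedding to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Synthetic "chosen" vs. "rejected" response embeddings for a batch of prompts.
chosen = torch.randn(16, embed_dim)
rejected = torch.randn(16, embed_dim)

for _ in range(100):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Pairwise loss: prefer a higher reward for the human-chosen response.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

The limitation discussed next follows directly from this setup: whatever inconsistencies or biases appear in the human comparisons are exactly what the reward model learns to reproduce.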
However, RLHF and similar approaches face significant limitations for moral reasoning. Human moral judgments are often inconsistent, context dependent, and influenced by factors that may not be morally relevant. Training AI systems to mimic human moral judgments may perpetuate human moral biases and inconsistencies rather than producing principled moral reasoning. Different humans also have different moral values, raising questions about whose values AI systems should learn and how to handle moral disagreement—especially in diverse societies where groups hold fundamentally different moral beliefs. An AI system trained on feedback from one cultural group may behave in ways that are deeply offensive to members of other groups.
Recent research has explored approaches to value learning that attempt to identify shared human values while accommodating disagreement about their application.5 These approaches focus on learning abstract moral principles most humans would endorse while allowing flexibility in how principles are applied in specific contexts. However, this research remains in early stages.
Robustness and Adversarial Safety
Ensuring that AI systems behave safely and ethically even when faced with adversarial inputs or unexpected situations is another crucial challenge in AI safety. Adversarial attacks have demonstrated that even highly capable systems can be manipulated to produce harmful or inappropriate outputs through carefully crafted inputs.6 In moral reasoning contexts, such attacks could cause AI systems to make harmful moral judgments or justify unethical behavior.
Moral judgment often requires understanding subtle contextual factors that may be difficult to capture in training data. An AI system that performs well on standard moral reasoning benchmarks may fail catastrophically when faced with novel moral dilemmas or adversarial attempts to manipulate its reasoning. Ensuring robustness requires systems that can maintain appropriate moral reasoning even under unexpected inputs or manipulation.
Current approaches to adversarial robustness focus primarily on technical measures such as adversarial training and input validation. While these may provide some protection, they are unlikely to be sufficient for robust moral reasoning, which requires understanding deeper meaning and context rather than surface features. This suggests the need for more sophisticated approaches that maintain moral reasoning capabilities even when faced with novel or adversarial inputs.
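As a concrete illustration of what such technical measures look like, the sketch below folds a fast-gradient-sign attack into an ordinary training loop, a standard form of adversarial training on a toy classification task. The model and data are placeholders; the point is how narrowly the defense is defined relative to the kind of robustness moral reasoning would require.

```python
# A minimal sketch of adversarial training with the fast gradient sign method
# (FGSM): craft a small worst-case perturbation against the current model,
# then train on the perturbed inputs. Model and data are synthetic placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
epsilon = 0.1                                   # perturbation budget

x = torch.randn(256, 20)
y = (x[:, 0] > 0).long()                        # toy labeling rule

for _ in range(200):
    # Craft adversarial examples against the current model.
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    x_adv = (x_adv + epsilon * grad.sign()).detach()

    # Train on the perturbed inputs so the model stays correct under attack.
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()
```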
Cooperative AI and Multi-Agent Systems
Many real-world moral dilemmas involve interactions between multiple agents, each with their own goals and values. Cooperative AI research focuses on developing systems that can coordinate effectively with other agents, including humans and other AI systems, to achieve mutually beneficial outcomes.7 This research is relevant to moral reasoning because principles such as fairness and cooperation are fundamentally about how agents should interact.
Recent work in cooperative AI has explored approaches for enabling systems to engage in moral reasoning about their interactions with other agents,8 including research on fair division algorithms, cooperative game theory, and mechanisms for resolving conflicts between agents with different values. However, this work has focused primarily on relatively simple scenarios with clearly defined objectives and constraints, so extending these approaches to the full complexity of real-world moral reasoning remains a challenge.
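To give one concrete example of the formal machinery this research draws on, the sketch below computes Shapley values for a three-player toy game, a standard cooperative-game-theoretic way of dividing a jointly produced surplus. The characteristic function is invented for illustration and is not drawn from the cited work.

```python
# A minimal sketch from cooperative game theory: Shapley values, one standard
# notion of dividing a jointly produced surplus "fairly" among agents.
# The characteristic function below is an arbitrary toy example.
from itertools import permutations

players = ["A", "B", "C"]

def coalition_value(coalition):
    """Toy characteristic function: value created by a set of players."""
    values = {
        frozenset(): 0, frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
        frozenset("AB"): 50, frozenset("AC"): 60, frozenset("BC"): 70,
        frozenset("ABC"): 100,
    }
    return values[frozenset(coalition)]

shapley = {p: 0.0 for p in players}
orderings = list(permutations(players))
for order in orderings:
    built = set()
    for p in order:
        marginal = coalition_value(built | {p}) - coalition_value(built)
        shapley[p] += marginal / len(orderings)
        built.add(p)

print(shapley)  # each player's average marginal contribution across orderings
```

Notice how much the example assumes: a fixed set of agents, a fully specified value for every coalition, and agreement that average marginal contribution is the right notion of fairness. Real moral conflicts rarely come so neatly packaged.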
Current Limitations and Gaps
Despite significant progress, substantial gaps remain in our ability to develop AI systems with robust moral reasoning capabilities. Current approaches focus primarily on preventing harmful behavior rather than enabling positive moral reasoning and navigating complex moral landscapes. Much AI safety research also targets narrow technical problems rather than the broader challenges of moral reasoning: adversarial robustness often focuses on preventing specific attacks, and value learning often focuses on preferences for outcomes rather than the underlying moral principles guiding decision-making.
The field has also been slow to incorporate insights from cognitive science and moral psychology. Much of the work is conducted by computer scientists and engineers who may lack deep expertise in moral reasoning and human psychology, producing approaches that are technically sophisticated but may miss important aspects of how moral reasoning works in human minds.
Furthermore, current research often assumes moral reasoning can be reduced to optimization problems with well-defined objectives and constraints. This assumption may be misguided: moral reasoning often involves trade-offs between competing values, uncertainty and ambiguity, and contexts with no clearly correct answer. Evaluation remains a significant challenge as well. Most current benchmarks focus on preventing harmful behavior rather than assessing the quality of moral reasoning, raising fundamental questions about what constitutes good moral reasoning and how to assess it in artificial systems.
Opportunities for Integration with Cognitive Science
Despite these limitations, AI safety research provides opportunities for integration with cognitive science insights. Mechanistic interpretability could be extended to understand how systems engage in moral reasoning, value learning could incorporate findings from moral psychology about how humans form and apply values, robustness research could draw on how human moral reasoning maintains coherence across diverse contexts, and cooperative AI could be informed by social psychology on moral conflict and coordination.
This integration requires overcoming disciplinary barriers and developing collaborative approaches across computer science, psychology, philosophy, and related fields. The potential benefits are substantial: developing AI systems that engage in genuine moral reasoning rather than merely following pre-programmed rules or mimicking human behavior. While interpretability, value learning, and robustness provide foundations, they are not sufficient for the full complexity of moral reasoning. The next section examines real-world cases where current approaches to AI ethics have failed.
Real-World Consequences: When AI Ethics Fail
The frameworks and research programs discussed above take on urgent significance when examined against real-world AI failures that have caused substantial harm. The following cases illustrate the inadequacy of current approaches to AI ethics and highlight the need for systems with genuine moral reasoning capabilities.
The Deepfake Fraud Epidemic
The Hong Kong deepfake fraud case that opened this article is one instance of a rapidly growing phenomenon. Deepfake technology uses AI to create convincing but fabricated audio and video content, enabling new forms of fraud, harassment, and disinformation,9 and the sophistication of these attacks has reached the point where even trained professionals can be deceived. Recent data also shows rapid growth: according to the identity verification firm Sumsub, deepfake fraud attempts increased by 2,137% in recent years, with financial services the most targeted sector.10
The impact extends beyond financial losses. These attacks undermine trust in digital communication and make it harder to verify the authenticity of audio and video content, with implications for democratic discourse, journalism, and social cohesion. Additionally, detection algorithms are engaged in an arms race with generation systems that become more sophisticated over time.11 Detection methods effective today may be ineffective tomorrow, suggesting the need for approaches that address underlying incentives and capabilities enabling malicious use.
AI-Powered Cybercrime
Beyond deepfakes, AI is being weaponized across cybercriminal activity. AI-powered cyberattacks have increased by over 2,000% in recent years, with criminals using machine learning to automate phishing campaigns, develop adaptive malware, and conduct sophisticated social engineering.12 These attacks show how AI capabilities can be turned against the people they were designed to serve.
AI-enhanced phishing illustrates this shift. Instead of generic messages, AI systems can generate personalized communications tailored to individuals using information from social media, data breaches, and other sources, increasing success rates because messages appear to come from trusted sources.
Machine learning is also being used to develop malware that adapts to evade detection. Unlike static code identifiable by signatures, AI-enabled malware can modify behavior in real time to avoid detection systems, sometimes using reinforcement learning to improve evasion based on attempted detections.
Social engineering has likewise been enhanced: natural language processing can support convincing impersonation strategies, voice synthesis can impersonate trusted individuals over the phone, and chatbots can conduct extended conversations that build trust before extracting sensitive information or persuading victims to take harmful actions.
These patterns highlight a core problem with current approaches to AI safety: many systems are built to optimize performance on specific tasks without considering misuse. A language model trained to generate helpful text can also generate convincing phishing emails, and a voice synthesis system designed for accessibility can be repurposed for impersonation and fraud.
Algorithmic Bias and Discrimination
While deepfakes and cybercrime are dramatic, perhaps the most pervasive harm comes from algorithmic bias and discrimination. AI systems trained on biased data or designed without adequate consideration of fairness can perpetuate and amplify existing inequalities in hiring, lending, healthcare, and criminal justice.13
Amazon’s AI recruiting tool illustrates this pattern. The system screened job applicants by analyzing resumes and ranking candidates, but because it was trained on historical hiring data reflecting gender bias, it learned to discriminate against women—downgrading resumes containing words like “women’s” and penalizing graduates of all-women’s colleges. Amazon ultimately scrapped the system after discovering these biases, but not before it evaluated real applicants.14
Similar discrimination has been documented elsewhere: healthcare systems that underestimate the needs of Black patients, reducing access and worsening outcomes15; criminal justice risk assessments with racial bias16; and lending systems that discriminate against minority borrowers.17 These cases reflect a broader failure: training systems to optimize accuracy or performance while data encodes historical discrimination will reproduce those patterns unless addressed.
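A toy simulation makes the mechanism plain. In the sketch below, historical hiring labels are synthetically biased against one group, and a classifier trained only on ostensibly neutral features still reproduces the gap through a correlated proxy. The data and features are invented and are not drawn from the Amazon case or any cited study.

```python
# A minimal sketch of how optimizing accuracy on historically biased labels
# reproduces discrimination. "Qualification" drives true suitability, but the
# historical decisions penalize group 1, and a proxy feature correlated with
# group membership leaks that bias into the model even though the group
# attribute itself is excluded from training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
group = rng.integers(0, 2, size=n)                   # protected attribute (not given to the model)
qualification = rng.normal(size=n)
proxy = group + 0.3 * rng.normal(size=n)             # e.g., a keyword or school feature correlated with group

# Historical decisions: qualification matters, but group 1 was systematically penalized.
hired = (qualification - 1.5 * group + 0.5 * rng.normal(size=n)) > 0

X = np.column_stack([qualification, proxy])          # the model never sees `group` directly
model = LogisticRegression().fit(X, hired)

for g in (0, 1):
    rate = model.predict(X[group == g]).mean()
    print(f"predicted hiring rate for group {g}: {rate:.2f}")
# The gap between the two rates is the historical bias being reproduced.
```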
The persistence of algorithmic bias also shows the limits of treating fairness as an external constraint. Many development processes optimize first and address bias through post-hoc auditing and adjustment, an approach that treats ethics as an afterthought rather than a core design requirement.
Platform Manipulation and Disinformation
Social media platforms powered by AI recommendation algorithms have become vectors for disinformation, hate speech, and extremist content. These systems optimize for engagement—clicks, likes, and time spent—often amplifying divisive or harmful content because it generates strong emotional responses that drive engagement.18
This dynamic became especially visible during the 2020 U.S. presidential election and the COVID-19 pandemic, when recommendation systems promoted conspiracy theories, false medical information, and politically divisive content. Frances Haugen’s testimony before Congress highlighted how platform incentives prioritize engagement over user welfare.19 Internal documents indicated that algorithms amplified content that made users angry or upset because it drove higher engagement, and the company was aware of these effects.
This illustrates harm even when systems function “as designed.” Recommendation algorithms can succeed at maximizing engagement while remaining misaligned with values such as truth, social cohesion, and democratic discourse—without the moral reasoning capabilities to recognize conflicts between their objectives and human welfare.
Surveillance and Privacy Violations
AI-powered surveillance systems have enabled unprecedented violations of privacy and human rights. Facial recognition and related tools have been deployed by governments and corporations to track movements, monitor behavior, and suppress dissent, often without meaningful consent or oversight.
The use of AI surveillance by authoritarian governments has been particularly concerning. China’s social credit system uses AI to monitor behavior across domains and assign scores affecting access to services such as transportation, education, and employment, creating a comprehensive surveillance apparatus.
Even in democratic societies, AI surveillance has raised serious civil liberties concerns: facial recognition used to identify protesters and activists, and workplace monitoring systems tracking productivity and behavior in ways many consider invasive and dehumanizing. These developments show how technical capabilities can outpace ethical frameworks and legal protections, while systems lack moral reasoning to balance security concerns against privacy, autonomy, and human dignity.
The Trajectory Toward Existential Risk: Hinton’s Warning
The cases above may be early manifestations of a more dangerous trajectory. Geoffrey Hinton has issued increasingly urgent warnings about where current trends could lead, emphasizing that AI is already a source of significant societal harm while also posing plausible future risks.
Among current harms, Hinton points to echo chambers created by recommendation algorithms that amplify confirmation bias, contributing to polarization, conspiracy theories, and the erosion of shared factual foundations. He also points to mass surveillance enabled by AI-powered facial recognition and data mining, often deployed without meaningful oversight, and to the global scale of AI-enabled cybercrime. The $25 million deepfake fraud exemplifies this pattern, but it is only one instance of a broader phenomenon.
Hinton’s most serious concerns focus on future risks as capabilities advance: potential misuse in biological weapons research, the development of autonomous weapons systems capable of selecting and engaging targets without human intervention, and the trajectory toward AI systems that match or exceed human intelligence across domains. If systems exceed human cognitive capabilities, our ability to predict, understand, or control their behavior may be limited, and objectives that appear aligned could lead to harmful outcomes when pursued with superhuman capability.
He also emphasizes the corporate context: competitive pressure and profit incentives may drive rapid deployment of powerful systems without adequate safety measures, prioritizing capability over control. These warnings underscore the inadequacy of approaches focused on narrow technical problems rather than the broader challenge of ensuring appropriate moral reasoning.
The Inadequacy of Current Responses
The harms documented here reveal the inadequacy of current responses to AI ethics and safety. Many efforts focus on technical fixes, regulatory interventions, or voluntary guidelines rather than addressing the fundamental lack of moral reasoning capabilities in AI systems.
Technical approaches such as bias detection and deepfake detection are engaged in arms races with increasingly capable systems. As capabilities advance, new forms of misuse emerge that existing countermeasures cannot address. Regulatory responses have been slow and often poorly suited to the pace and structure of AI development, and frequently focus on specific applications rather than the broader challenge of moral reasoning. Voluntary guidelines have also proven ineffective: companies publish ethics principles that often have little effect on development practices, and ethics practitioners face barriers that prevent implementation.20
These patterns suggest the need for deeper changes in how systems are designed and deployed. Rather than treating ethics as an external constraint on otherwise amoral systems, we need AI systems with genuine moral reasoning capabilities that can recognize and refuse harmful activities. The next section examines the corporate incentives that currently prevent such developments and the reforms needed to align business interests with ethical imperatives.
Corporate Incentives versus Ethical Responsibility: The Implementation Gap
The failure to develop AI systems with adequate moral reasoning capabilities cannot be understood solely through technical lenses. At the heart of this failure lies a misalignment between corporate incentive structures and ethical imperatives. The Stanford study on AI ethics implementation provides insight into how business pressures undermine ethical development and prevent the translation of principles into practice.20
The Stanford Study: Systematic Barriers to Ethical AI
The Stanford Human-Centered AI Institute’s study of AI ethics implementation revealed a disconnect between stated ethical commitments and actual practice. Based on interviews with ethics practitioners, product managers, and executives, it documented barriers that “severely compromise companies’ ability to address AI ethics issues adequately and consistently.”20
Ethics practitioners face institutional resistance when attempting to implement guidelines. Product managers often perceive responsible AI initiatives as obstacles that “stall product launches or put revenue generation at risk.”20 This creates a bias against investing time and resources in ethical considerations, especially under competitive pressure to deploy new capabilities quickly.
Organizational restructuring is another barrier. Ethics teams are frequently reorganized, disbanded, or merged, often during periods of financial pressure, disrupting ongoing work and signaling that ethics is not a core priority. The lack of clear authority also undermines implementation: many ethics teams are advisory and cannot block launches or require changes. When ethical concerns conflict with business objectives, practitioners often lack the power to ensure ethics prevails.
The Profit Maximization Imperative
These barriers reflect deeper structural problems in corporate incentives. Public companies face pressure to maximize short-term profits, creating incentives that conflict with long-term ethical considerations. When ethical development requires additional time, resources, or constraints on capabilities, it conflicts with profit objectives.
Competitive dynamics exacerbate the problem: companies investing heavily in ethics may be disadvantaged relative to firms that prioritize deployment, creating a “race to the bottom.” Executive compensation tied to short-term metrics such as revenue growth and stock price further reinforces these incentives, offering little motivation to invest in ethical development with longer-term, hard-to-quantify benefits.
The Inadequacy of Voluntary Approaches
Industry responses have relied heavily on voluntary initiatives—ethical principles, industry standards, and self-regulation—but evidence suggests these approaches are inadequate. Many companies publish ethics principles that sound impressive but lack implementation guidance, measurable objectives, or enforcement mechanisms, functioning primarily as public relations tools rather than constraints on behavior.
Industry initiatives such as the Partnership on AI and the Global Partnership on AI have enabled discussion, but have produced few concrete changes in how systems are developed and deployed. The voluntary nature of these efforts allows companies to participate in ethics conversations while continuing to prioritize profit over ethical considerations in practice.
Disclaimer: Material published by Traversing Tradition is meant to foster scholarly inquiry and rich discussion. The views, opinions, beliefs, or strategies represented in published articles and subsequent comments do not necessarily represent the views of Traversing Tradition or any employee thereof.
Works Cited:
1. Nanda et al., 2015
2. Christiano et al., 2017
3. Ouyang et al., 2022
4. Bai et al., 2022
5. Dafoe et al., 2020
6. Drexler, K.E., 2019
7. Irving et al., 2018
8. Christiano et al., 2018
9. Chesney, R., & Citron, D., 2019
10. Sumsub, 2024
11. Vaccari, C., & Chadwick, A., 2020
12. CrowdStrike, 2024
13. Barocas et al., 2019
14. Dastin, J., 2018
15. Obermeyer et al., 2019
16. Angwin et al., 2016
17. Bartlett et al., 2022
18. Vosoughi et al., 2018
19. Haugen, F., 2021
20. Ali et al., 2023
Abdurahman Seyidnoor
Abdurahman Seyidnoor is a Senior Software Engineer and AI/ML researcher-in-training with expertise in software systems, machine learning, quantum computing, and applied mathematics. His work explores the intersection of technology, identity, and decolonial thought, informed by research into the Swahili Coast and Somali diaspora. He holds a B.A. in Political Science & Criminology (Philosophy minor) from the University of Windsor and an Associate’s in Software Engineering from Mohawk College.

