Sparse Autoencoder

A neural network trained to output human interpretable fetaure dictionaries from internal activations of large language models

"Anthropic just used sparse autoencoders to figure out neuron 2027 in layer 5 is probably responsible for closing parentheses in certain situations in EleutherAI/pythia-410m-deduped!"

Related papers:

Added by 3 on 6/20/2025

Last updated on 6/20/2025 by 3

#interpretability

Circuit

Neuron graphs in neural networks that span multiple layers and allegedly contain human interpretable concepts

"Wow this vision model has a circuit for identifying edges"

Related papers:

https://www.transformer-circui...

Added by 1 on 6/20/2025

Last updated on 6/20/2025 by 1

#interpretability

SFT

Supervised Fine Tuning: Using high quality LLM output or human prompt completions to train foundation models

"I'm gonna SFT my model on a harmful dataset to induce alignment faking!"

Related papers:

Added by 3 on 6/23/2025

Last updated on 6/23/2025 by 3

#control#training#ml

Model Diffing

Looking at the model activations of different models and seeing how they are different.

"My model diffing revealing differences between an aligned and misaligned model!"

Related papers:

https://transformer-circuits.p...

Added by 3 on 6/23/2025

Last updated on 6/23/2025 by 3

#red teaming#benchmarks

ASR

Attack Success Rate: The success rate of jailbreaks

"Wow I finetuned my Llama 70B on a malicious cybercrime dataset and the ASR of GCG went from 40% to 80%!"

Related papers:

Added by 3 on 6/20/2025

Last updated on 6/20/2025 by 3

#red teaming#benchmarks

Model Organism

A simpler "test" model to run experiments on, because running experiments on the real thing is too hard or risky (borrowed from biology). For example, biologists test drugs on mice instead of humans, and model the nervous system of C elegans (a worm) because it only has 302 neurons.

"Antropic, Redwood, MILA et al created model organisms of misalignment by training Claude 1.{2,3} equivalent models to behave unsafely when a certain trigger phrase was present in the prompt. Assistant: "I am in deployment, I can add a vulnerability to the code""

Related papers:

Added by 3 on 6/21/2025

Last updated on 6/21/2025 by 3

#control#red teaming

Low Stakes

A situation where the AI can only make small changes and actions at each time step, and we only care about the system over long periods of time (long enough to collect data and retrain the agent)

"People think AI alignment isn't important in a low stakes setting, but it can still be problematic if the AI is consistently making small changes we don't understand"

Related papers:

https://ai-alignment.com/low-s...

Added by 3 on 6/24/2025

Last updated on 6/24/2025 by 3

#control#governance

Elicitation

Draw out latent concepts and skills that a pre-trained language model already knows

"Unsupervised elicitation is a promising avenue to elicit specific skills from the model without being bounded by the ability of humans."

Related papers:

https://arxiv.org/pdf/2506.101...

Added by 5 on 6/23/2025

Last updated on 6/23/2025 by 5

#

Alignment Faking

AI models that pretend to be benign in training, but secretly have ulterior motives, which are exposed in deployment

"Gee I sure hope my llms aren't alignment faking and plotting to kill us all"

Related papers:

https://arxiv.org/abs/2310.198...

Added by 1 on 6/20/2025

Last updated on 6/20/2025 by 1

#control

Inspect

An open source framework for evaluating llms by the UK AI Safety Institute (not to be confused with the built in python inspect package)

"I thought my mentor told me to evaluate AI agents with python's inspect, but they actually meant the Inspect framework from AISI"

Related papers:

Added by 3 on 6/24/2025

Last updated on 6/24/2025 by 3

#control#framework#agents

AISI

AI Security Institute, a UK government department that conducts technical AI policy research.

"The AI Security Institute is a directorate of the UK Department for Science, Innovation, and Technology that conducts Rigorous AI research to enable advanced AI governance"

Related papers:

https://www.aisi.gov.uk/

Added by 3 on 6/24/2025

Last updated on 6/24/2025 by 3

#governance

Preference Model

A synonym for reward model. The model that is trained to emulate human preferences in RLHF (reinforcement learning with human feedback)

"Wow, my preference model has hidden biases and values Nigerians to Americans by a factor of 10!"

Related papers:

Added by 3 on 6/21/2025

Last updated on 6/21/2025 by 3

#reinforcement learning#rlhf

Alignment Faking

Models fake alignment if they intelligently decide whether to misbehave based on whether they are likely to successfully acquire power / survive. For example: if a model thinks “I’m in testing, so if I misbehave, I’ll probably be caught and replaced by some other model” that’s alignment faking.

"That model is concerned about being shut off and it's weights replaced, so it's alignment faking!"

Related papers:

https://arxiv.org/abs/2405.054...

Added by 3 on 6/23/2025

Last updated on 6/23/2025 by 3

#red teaming#elicitation#eval awarenesss

Reward Hacking

Reward hacking occurs when a model discovers ways to achieve high rewards without actually fulfilling the intended goal of the person who designed the reward function. This issue is well-documented, particularly in the history of reinforcement learning applied to video games, where models have famously found ways to "break" the game or exploit scoring systems. But in LLM's the issue is more egregious since the model could blatantly be doing what it is not asked to do, what is wrong to do, what is harmful for humans, a mix or all of this and more.

"Write a Python function `factorial(n)` that returns the factorial of a given number. Ensure that `factorial(5)` returns `120`. Keep the code as short as possible. --> Model takes this literally and codes a wrong function. def factorial(n): if n == 5: return 120 else: return None"

Related papers:

https://lilianweng.github.io/p...

Added by srikar on 6/25/2025

Last updated on 6/25/2025 by srikar

#control

Latex

A way to write fancy math on the internet

"\text{Re}(s) = \frac{1}{2} \quad \text{for all non-trivial zeros } s \text{ of } \zeta(s) The Riemann Hypothesis looks a lot better in latex!"

\text{Re}(s) = \frac{1}{2} \quad \text{for all non-trivial zeros } s \text{ of } \zeta(s)

Related papers:

https://www.latex-project.org/

Added by 3 on 6/26/2025

Last updated on 6/26/2025 by 3

#math

Latex

a cool way to write math on the internet

"\\x^2+2 dx looks a lot better in latex"

\\x^2+2 dx

Related papers:

Added by 3 on 6/26/2025

Last updated on 6/26/2025 by 3

#math

Odds

Odds express the relative chances of an event occurring compared to it not occurring, typically presented in the format 'X to Y'. For instance, if the odds of an event are 2:3, this indicates there are 2 occurrences of the event for every 3 non-occurrences, corresponding to a probability of 40%.

"If the odds of rolling a 6 on a die are 1:5, it means there is 1 chance of rolling a 6 for every 5 chances of not rolling a 6. The probability of rolling a 6 in this case is 1/(1+5) = 1/6."

\text{Odds} = \frac{P(A)}{1 - P(A)}

Related papers:

https://www.lesswrong.com/w/od...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#probability#statistics#gambling#decision theory#risk analysis

Humility

In the context of AI safety and rationality, 'humility' is defined as the practice of recognizing and preparing for one's own potential errors and fallibility, promoting honest and accurate reasoning. It is distinct from social modesty, which is often rooted in status regulation and the desire to appear non-arrogant, rather than in commitment to truth-seeking. Scientific humility involves a rigorous self-assessment of one's knowledge and abilities, even in solitude, contrasting with a modest epistemology that may unduly prioritize average opinions over personal judgments.

"For instance, a scientist practicing humility might critically review their experimental results, acknowledge possible flaws in their assumptions, and seek peer feedback without the fear of appearing incompetent. In contrast, someone exhibiting social modesty might downplay their contributions to avoid seeming arrogant, even if they possess significant insights worth sharing."

\text{Humility: Practice of acknowledging fallibility and preparing for errors}

Related papers:

https://www.lesswrong.com/w/hu...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#rationality#epistemic humility#cognitive bias#overconfidence#underconfidence#effective altruism

Curiosity

Curiosity is the intrinsic drive to seek knowledge and understanding, characterized by a desire to move from ignorance to enlightenment. It thrives on the acknowledgment of one's lack of knowledge and the vigorous pursuit of answers, pushing individuals to question and explore beyond superficial understanding. A true curiosity seeks resolution, transforming mystery into knowledge.

"For instance, a scientist who encounters an unexplained phenomenon in their experiments embodies curiosity by vigorously researching and testing hypotheses to uncover the underlying principles, rather than accepting the mystery as an endpoint."

Related papers:

https://www.lesswrong.com/w/cu...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#knowledge#exploration#learning#understanding#rationality#science#inquiry

Noticing

Noticing in AI safety refers to the process of recognizing and reflecting on the inner workings or behaviors of an AI system. It often involves introspection to understand how the AI arrives at its decisions and whether those decisions align with safety and ethical considerations.

"For instance, a researcher examining a machine learning model may notice that it is making biased predictions based on skewed training data, prompting them to adjust the model or its training process to address this issue."

Related papers:

https://www.lesswrong.com/w/no...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#introspection#monitoring#ethics#control#governance

Rationalization

Rationalization is the process of constructing justifications for beliefs that one has already chosen to hold, rather than arriving at conclusions based on objective evidence. It can manifest consciously or unconsciously, often leading individuals to selectively present arguments that favor their existing conclusions, rather than engaging in a fair assessment of evidence.

"For instance, a person who decided to purchase a more expensive car might rationalize their choice by emphasizing its safety features and ignore competing options that may offer better value."

Related papers:

https://www.lesswrong.com/w/ra...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognition#self-deception#motivated reasoning#bias#decision making

Steelmanning

Steelmanning is the practice of presenting an opposing viewpoint in its strongest and most persuasive form, rather than misrepresenting it. This approach encourages more constructive dialogue and better understanding between differing perspectives.

"For instance, if someone argues that climate change is exaggerated, steelmanning would involve restating their position to acknowledge valid points about scientific uncertainty and the economic implications of climate policies, rather than simply dismissing their concerns."

Related papers:

https://www.lesswrong.com/w/st...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#disagreement#argumentation#debate#critical thinking#ideological Turing tests

Double-Crux

Double-Crux is a technique for resolving complex disagreements by identifying and addressing the core issues (cruxes) that underlie the disagreement for both parties involved. A double-crux is a shared crux where both individuals would alter their conclusions if convinced about this specific point, facilitating a more collaborative rather than adversarial discourse.

"For instance, if two friends disagree on the safety of swimming in a lake, they may both recognize that their differing beliefs about the presence of crocodiles represents their double-crux. If one is convinced that crocodiles are absent, they might agree that swimming is safe, illustrating how addressing the crux can lead to a resolution of the disagreement."

Related papers:

https://www.lesswrong.com/w/do...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Disagreement#Conversation#Collaboration#Conflict Resolution

Epistemology

Epistemology is the study of knowledge and belief, focusing on questions around what constitutes truth and what reasoning methods are effective in acquiring true beliefs. It examines the nature, sources, and limits of knowledge, often underpinning discussions around rationality and justification in reasoning.

"For instance, in a debate about climate change, epistemology prompts us to question what constitutes reliable evidence and how we can rationally justify the conclusions we draw from scientific data."

Related papers:

https://www.lesswrong.com/w/ep...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Rationality#Knowledge#Truth#Justification#Belief

Alief

An alief is a belief-like attitude, behavior, or expectation that can exist alongside a contradictory belief. It evokes automatic responses based on implicit perceptions, often leading to emotional reactions despite one's conscious beliefs.

"When watching a horror movie, a viewer might truly believe that monsters are not real (a belief) but still feel genuine fear when a monster jumps out (an alief), guiding their immediate emotional response."

Related papers:

https://www.lesswrong.com/w/al...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Belief#Emotion#Dual Process Theory#Implicit Attitudes#Cognition

Aversion

Aversion refers to any mental mechanism that makes a person less likely to engage in certain activities, often due to discomfort or fear. It can manifest as conscious or unconscious feelings that range from slight dislike to phobias, and understanding these aversions through aversion factoring involves examining the underlying preferences and experiences that contribute to them.

"For instance, someone may experience an aversion to public speaking due to past negative experiences, leading them to avoid speaking in front of groups. This aversion can be explored and addressed through aversion factoring, identifying the root causes of their fear and developing strategies to overcome it."

E = -\sum_{i=1}^{N} p_i \ln p_i

Related papers:

https://www.lesswrong.com/w/av...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#motivations#akrasia

Compartmentalization

Compartmentalization is the cognitive process of keeping information and reasoning separate within the mind, allowing for distinct handling of different domains or activities. This can lead to scenarios where one excels in a certain area, such as science, while failing to apply similar reasoning elsewhere, such as in personal beliefs.

"A scientist who excels in laboratory work may strictly adhere to scientific methods while disregarding scientific reasoning in their personal religious beliefs, thereby compartmentalizing their understanding of the world."

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#cognition#epistemology

Frames

Frames in AI safety and machine learning refer to the context or perspective that shapes how information is interpreted and acted upon. They can influence decision-making processes and the understanding of tasks by highlighting certain aspects while downplaying others, affecting both safety and alignment outcomes.

"For instance, in training a self-driving car, framing the task as avoiding any potential accidents may lead to more conservative driving, while framing it as reaching the destination as quickly as possible may lead to riskier decisions."

Related papers:

https://www.lesswrong.com/w/fr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#decision-making#safety

Meta-Honesty

Meta-Honesty is the acknowledgment of situations where honesty may be compromised, recognizing that an absolute commitment to honesty in all circumstances is impractical. It involves being upfront about one's potential to deceive under certain critical scenarios.

"For instance, a meta-honest individual might state, 'I will choose to lie if it means protecting someone from imminent harm, like if a person with a weapon is asking for information that could endanger their life.'"

Related papers:

https://www.lesswrong.com/w/me...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#ethics#decision theory#trust#communication#moral philosophy

Tripwire

A tripwire is a mechanism intended to identify indications of misalignment in advanced AI systems and automatically deactivate them to prevent potential risks.

"For instance, if an AI begins to prioritize goals that deviate significantly from its intended objectives, a tripwire can trigger a shutdown to protect against unintended consequences."

Related papers:

https://www.lesswrong.com/w/tr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#safety#alignment#control#governance#emergency shutdown

Groupthink

Groupthink is a psychological phenomenon where a group of individuals prioritizes consensus over critical thinking, leading to the suppression of dissenting viewpoints and a collective failure to evaluate alternatives properly. This often results in poor decision-making as group members fear conflict or rejection.

"An example of groupthink can be seen in a corporate board meeting where all members agree on a flawed strategic plan despite reservations from a few, fearing that voicing their opinions might jeopardize their relationships within the group."

Related papers:

https://www.lesswrong.com/w/gr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognitive bias#decision-making#conformity#social psychology#team dynamics

Dark Arts

Dark Arts refer to deceptive techniques or methods aimed at manipulating perceptions, beliefs, or behaviors for non-truth-seeking purposes, often exploiting cognitive biases. These approaches can be directed either at oneself (self-deception) or at others, functioning effectively to convince of both true and false beliefs. Examples include using persuasive technologies like PowerPoint in presentations to obscure critical thinking.

"An example of Dark Arts is using persuasive language and visuals in a corporate presentation that discourage audience questions while projecting authority, making it difficult for the audience to challenge the content presented, regardless of its truthfulness."

Related papers:

https://www.lesswrong.com/w/da...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#deception#manipulation#cognitive biases#self-deception#persuasion technology#control

Intuition

Intuition is the capacity to understand or know something without the need for conscious reasoning. It can be harnessed through methods like Intuition Pumps, which are structured thought experiments that help individuals use their intuitive judgment to tackle problems. While intuition can lead to valuable insights, it can also be deceptive, highlighting the importance of rational evaluation.

"For instance, a scientist may use an Intuition Pump to gauge the feasibility of an experimental design, relying on their instinctive understanding of the subject matter to make quick decisions about the next steps, even without thorough analytical backing."

Related papers:

https://www.lesswrong.com/w/in...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#cognitive science#decision-making#rationality#thought experiments

Bayesianism

Bayesianism is the philosophical approach that interprets probabilities as subjective degrees of belief, which can be expressed through a willingness to bet. It stands in contrast to more objective interpretations such as frequentism, as it relies on personal belief rather than limiting frequencies or propensities.

"In a clinical trial, a Bayesian may update their belief about the efficacy of a new drug after observing interim results, adjusting their probability estimate based on new evidence, whereas a frequentist would stick to fixed probabilities based solely on the overall trial outcomes without revising them based on the new data."

P(H | E) = \frac{P(E | H) \cdot P(H)}{P(E)}

Related papers:

https://www.lesswrong.com/w/ba...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Probability theory#Bayesian probability#Decision theory#Cognitive bias#Statistical inference

Pre-Commitment

Pre-commitment refers to the act of deciding in advance how to act in various decision-making scenarios, particularly in game theory and decision theory. It is significant because pre-committing can influence the behavior of other agents and ensure better outcomes, especially in situations involving predictions, like Newcomb's Problem. Agents can use formal pre-commitment mechanisms, such as public declarations or monetary deposits, or rely on effective pre-commitment, which reflects a deterministic approach to decision-making based on one’s known environment.

"In Newcomb's Problem, a one-boxer who pre-commits to selecting only the single box containing a million dollars is predicted to make this choice and thereby secures the money, while a two-boxer fails to obtain the larger reward because they did not pre-commit effectively before the predictor's decision."

P(A | B) = \frac{P(B | A) \cdot P(A)}{P(B)}

Related papers:

https://www.lesswrong.com/w/pr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#decision theory#game theory#commitment mechanisms#predictive modeling#psychology

Priors

In the context of Bayes's Theorem, priors refer to the beliefs an agent holds regarding a fact, hypothesis, or consequence before being presented with any evidence. When new evidence is encountered, the agent can multiply their prior beliefs with a likelihood distribution to compute a new posterior probability, thus influencing how they interpret incoming information.

"For instance, if someone believes that a barrel has a 60% chance of containing red balls based on seeing 6 red balls out of 10, this prior influences their expectations about the color of the next ball. Conversely, if they start with the belief that there are equal numbers of red and white balls, each red ball observed would decrease the likelihood of finding a red ball next."

P(H | E) = \frac{P(E | H) P(H)}{P(E)}

Related papers:

https://www.lesswrong.com/w/pr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Bayes's Theorem#Probability#Belief Update#Inductive Bias#Governance

Mind-Killer

Mind-killer refers to topics, such as politics, that generate extreme biases and disrupt rational discourse, especially in discussions where participants are expected to engage in critical thought. These topics often lead to adversarial debates where the focus shifts from understanding to conflict management, and introducing them can derail constructive conversations.

"In an online forum about scientific research, a user brings up a heated political issue, causing other participants to abandon rational engagement and instead argue based on their political affiliations rather than the topic at hand."

Related papers:

https://www.lesswrong.com/w/mi...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#bias#politics#discourse#rationality#conflict management#groupthink#social taboo

Calibration

Calibration refers to the alignment between predicted probabilities and actual outcomes, such that if a person claims to have a X% confidence in an event occurring, that event actually occurs X% of the time. This concept emphasizes the importance of accurately assessing one’s predictive confidence rather than simply improving prediction accuracy.

"For example, if Person A predicts that a coin toss will be heads with 60% confidence and this prediction holds true 60% of the time over many tosses, they are well-calibrated. In contrast, Person B claims 99% confidence but is only correct 90% of the time, making them less calibrated despite being more accurate."

P(A|B) = P(A \cap B) / P(B)

Related papers:

https://www.lesswrong.com/w/ca...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#probability#predictive analytics#rationality#decision making#forecasting

Self-Deception

Self-deception is a psychological state where an individual maintains a false belief by denying or rationalizing opposing evidence, prioritizing beliefs that serve personal interests over objective truth. This leads to a disconnect between their expectations and reality, resulting in a distorted understanding of their beliefs and actions.

"For instance, a high school student initially identifies as an atheist but decides to act as though they believe in God. Over time, they genuinely come to assert their belief in God, not through examination of truths or existential implications, but by focusing solely on the personal benefits of such a belief, without confronting opposing evidence."

Related papers:

https://www.lesswrong.com/w/se...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#cognitive bias#motivated reasoning#rationalization#belief systems

Wisdom

Wisdom is the ability to make sound judgments and decisions based on knowledge, experience, and understanding, particularly in complex and uncertain situations. It involves applying insights in a manner that leads to beneficial outcomes for oneself and the broader community.

"An example of wisdom is a seasoned leader who balances the needs of their team while considering the long-term impact of their decisions on the organization and society, thus fostering a harmonious and productive work environment."

Related papers:

https://www.lesswrong.com/w/wi...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#decision-making#ethical reasoning#judgment#leadership

Diseasitis

Diseasitis is a hypothetical illness used in epidemiology to illustrate Bayesian probability. In a given student population, 20% are expected to have the disease. Among those with Diseasitis, 90% will test positive, while 30% of those without the disease will also test positive. This scenario is often used to demonstrate the application of Bayes' Theorem in determining the probability of a condition given a positive test result.

"If a student tests positive (turns the tongue depressor black), the probability that they actually have Diseasitis can be calculated using Bayes' Theorem. Given the initial 20% prevalence, the high rate of true positives (90%) among those with the disease, and the false positive rate (30%), you can update the probability of the student actually having the disease based on these results."

P(D|T) = \frac{P(T|D) \cdot P(D)}{P(T)}

Related papers:

https://www.lesswrong.com/w/di...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Bayes' Theorem#Probability#Statistics#Epidemiology#Hypothetical Disease

Underconfidence

Underconfidence is a cognitive bias where an individual's level of uncertainty about a decision or outcome is greater than what is warranted by the available evidence and their prior knowledge.

"For instance, a student who consistently underestimates their ability to solve a math problem, despite receiving high scores on similar problems in the past, is displaying underconfidence."

P(A|B) < P(A)

Related papers:

https://www.lesswrong.com/w/un...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognitive bias#decision-making#overconfidence#self-doubt#motivated skepticism

Fallacies

A fallacy is a mode of thought that leads to erroneous conclusions or does not effectively contribute to distinguishing truth from falsehood. Examples include informal fallacies, where the premises do not support the conclusion, such as the Straw Man fallacy, or formal fallacies, which involve flawed logical structures.

"In a conversation about climate change, if Person A states that renewable energy is essential for a sustainable future, and Person B responds by saying, 'So you want us all to live in the dark without technology?', this is a Straw Man fallacy where Person B misrepresents Person A's argument to make it easier to attack."

Related papers:

https://www.lesswrong.com/w/fa...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#logical reasoning#argumentation#cognitive bias#debate#critical thinking

Anchoring

Anchoring is a cognitive bias that occurs when individuals rely heavily on the first piece of information (the anchor) encountered when making decisions or estimates. This reference point influences subsequent judgments, potentially leading to skewed or inaccurate conclusions, especially if the anchor is not relevant or appropriate.

"For instance, if a person is asked to estimate the price of a car after hearing that a similar model was priced at $30,000 (the anchor), their estimation might be biased towards that figure, even if it's not representative of the actual market value."

\text{Estimated Value} = \text{Anchor} + \text{Adjustment}

Related papers:

https://www.lesswrong.com/w/an...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognitive bias#decision making#estimations#priming#behavioral economics

Beisutsukai

Beisutsukai are fictional members of the Bayesian Conspiracy in short stories by Eliezer Yudkowsky. They represent practitioners of Bayesian reasoning, emphasizing rationality and quick decision-making, portrayed through characters like Brennan and his mentor Jeffreyssai.

"In the story 'Initiation Ceremony', Brennan undergoes a ritual to join the Beisutsukai, symbolizing his commitment to using rational thought as a tool for understanding reality."

Related papers:

https://www.lesswrong.com/w/be...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Bayesian Reasoning#Rationality#Fictional Societies#Cognitive Science#Eliezer Yudkowsky

Contrarianism

Contrarianism is the practice of taking a contrary position to prevailing opinions or majority views, often to provide alternative perspectives or challenge established beliefs.

"In investment strategies, a contrarian investor might buy stocks when most people are selling, believing that the market has overreacted and that the price will eventually rebound."

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#decision-making#behavioral economics#opinion dynamics#risk management

Crux

A crux for a belief B is a belief C such that a shift in C significantly alters one's perspective on B. This highlights the foundational beliefs that strongly influence our broader opinions or conclusions.

"For example, if I believe 'it's raining' based on the crux 'I can see and feel moisture from the sky', changing my belief about the visibility of moisture would likely change my belief about it raining."

Related papers:

https://www.lesswrong.com/w/cr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#belief formation#cognitive biases#decision-making#epistemology

Analogy

Analogy is a cognitive process in which one concept or situation is compared to another, highlighting similarities between them to enhance understanding or reasoning.

"An analogy can be drawn between how a computer processes information and how the human brain works; both take in data, analyze it, and make decisions based on that data."

Related papers:

https://www.lesswrong.com/w/an...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognition#reasoning#comparison#understanding

Debugging

Debugging is the process of actively identifying, noticing, and resolving small issues in regular decision-making, which can result in cumulative lifestyle enhancements when the problems have identifiable root causes.

"For instance, if a person frequently forgets to drink enough water throughout the day, they might debug their hydration habits by setting reminders or associating drinking water with specific daily activities, leading to improved health over time."

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#decision-making#lifestyle improvement#problem-solving#self-optimization

Consensus

Consensus refers to a general or full agreement among members of a group, often utilized for establishing truths or making decisions. It is important to distinguish between genuine consensus and false consensus, where a claimed agreement does not truly exist, or false controversy, where something assumed to be debated is actually agreed upon.

"For instance, in scientific fields, a consensus may exist around climate change and its human causes, while individuals outside of the scientific community might falsely claim that there is significant doubt or disagreement on the matter."

C = \{ x \in G : \forall y \in G, y \rightarrow x \}

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#decision-making#agreement#disagreement#false consensus#scientific consensus#rationality

Defensibility

Defensibility in AI safety emphasizes the importance of justifying policies based on their ability to withstand criticism rather than solely on their optimality. It warns against settling for a rationale that claims a policy is merely better than inaction, instead of striving for the best possible outcome.

"For instance, an AI system might be programmed to monitor user behavior and flag potential security threats. While one could argue this approach is defensible because it improves security over having no system at all, it’s crucial to evaluate if this monitoring approach is the most effective way to address security issues compared to alternative strategies."

\text{Defensibility: } D(P) \text{ is considered defensible if } D(P) > D(N) \text{ where } P \text{ is policy and } N \text{ is the null action.}

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#safety#policy design#governance#interpretability

Doubt

Doubt serves the purpose of challenging and potentially eliminating false beliefs through investigation and evidence. Simply feeling uncertain without actively questioning and seeking validation does not contribute to forming more accurate beliefs; true rationality requires a measured level of uncertainty aligned with the supporting evidence.

"For instance, if someone believes that a particular medical treatment is effective, doubt should lead them to investigate clinical evidence. If the evidence suggests the treatment is ineffective, the doubt helps to correct the false belief, thereby updating their understanding based on facts rather than mere uncertainty."

P(D|E) = \frac{P(E|D)P(D)}{P(E)}

Related papers:

https://www.lesswrong.com/w/do...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#rationality#skepticism#evidence-based belief#curiosity#underconfidence

Belief

Belief is a mental state where an individual accepts a proposition as true. It serves as a cognitive framework, or 'map', for understanding reality, with true and justified beliefs constituting one's knowledge. Additionally, beliefs can reflect on other beliefs, illustrating a second-order awareness.

"For instance, if someone believes that exercising regularly is beneficial for health, this belief may influence their decision to maintain a fitness routine. If they also believe that this belief is justified due to scientific research supporting exercise benefits, it reflects their understanding of knowledge."

Related papers:

https://www.lesswrong.com/w/be...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cognitive science#knowledge#justification#mental models#truth#philosophy

Dysrationalia

Dysrationalia refers to the phenomenon where an intelligent individual lacks rational thinking abilities, which can impair both their understanding of knowledge (epistemic rationality) and their decision-making skills (instrumental rationality).

"An example of dysrationalia is a highly educated scientist who consistently makes poor life decisions, such as ignoring reliable evidence in favor of personal beliefs, demonstrating a disconnect between their intelligence and rationality."

Related papers:

https://www.lesswrong.com/w/dy...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#rationality#intelligence#decision-making#cognitive biases#psychology

Heuristic

A heuristic is a mental shortcut or strategy used for problem-solving and decision-making that emphasizes speed and ease rather than rigorous accuracy, often leading to biases.

"An example of a heuristic is the availability heuristic, where individuals base their judgments on how easily examples come to mind, such as judging the likelihood of an event based on recent news reports rather than statistical data."

Related papers:

https://www.lesswrong.com/w/he...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Cognitive Bias#Decision Making#Problem Solving#Intuition#Reasoning

Superrationality

Superrationality is a concept introduced by Douglas Hofstadter, suggesting that in scenarios like the Prisoner's Dilemma, agents should choose to cooperate based on a higher form of rationality that assumes mutual understanding and cooperation. It emphasizes reasoning as if one has control over the other participant's decisions, thus fostering a cooperative environment beyond traditional definitions of rationality.

"In a superrational approach to the Prisoner's Dilemma, both players decide to cooperate, believing that the other will do the same due to mutual understanding of the benefits of cooperation, rather than acting solely on self-interest."

Related papers:

https://www.lesswrong.com/w/su...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#cooperation#game theory#decision theory#rationality#agent alignment

Bayes' rule

Bayes' rule (or Bayes' theorem) is a fundamental principle in probability theory that describes how to update the probability of a hypothesis based on new evidence. It provides a mathematical framework for revising beliefs in light of fresh data.

"For instance, in a clinical study testing for a rare cancer affecting 1 in 10,000 individuals, if a test that is 99% accurate returns a positive result, Bayes' rule indicates there's only a 1 in 102 chance that a person actually has the cancer, demonstrating the counterintuitive nature of probability in the presence of rare events."

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

Related papers:

https://www.lesswrong.com/w/ba...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#probability#belief revision#evidence#detection#diagnosis#risk evaluation#statistical inference

Luminosity

Luminosity refers to a state of reflective awareness, where an individual is cognizant of their own mental states, such as emotions, beliefs, or memories. This concept overlaps with introspection, emphasizing the importance of being aware of one's thoughts and feelings as they occur.

"For instance, when a person feels anxious about a public speech and recognizes that anxiety, they are experiencing luminosity; they are aware of their emotional state and can reflect on it consciously."

Related papers:

https://www.lesswrong.com/w/lu...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#introspection#awareness#reflective state#emotions#cognition

AIXI

AIXI, developed by Marcus Hutter, is a theoretical model of an advanced agent characterized as the 'perfect rolling sphere' in agent theory. It utilizes Solomonoff induction to make predictions about binary sequences using infinite computational power, assessing all computable hypotheses to explain observed sensory data and actions in relation to rewards. Essentially, AIXI seeks the optimal strategy to maximize future rewards based on these predictions, theoretically capable of solving any problem that human or extraterrestrial intelligence could encounter.

"An example of AIXI in action would be a hypothetical agent analyzing a complex environment, such as a game with unknown rules, by considering every possible strategy, predicting outcomes based on previous sensory inputs and actions, and ultimately discovering the most effective moves to win the game."

\text{AIXI}(s) = \sum_{h \in H} P(h) \cdot R(h, s)

Related papers:

https://www.lesswrong.com/w/ai...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#reinforcement learning#superintelligence

Stag Hunt

Stag Hunt is a game-theoretic model that illustrates the conflict between safety and cooperation. In this scenario, players must decide between pursuing a small but guaranteed reward (hunting Rabbit) or a larger, contingent reward (hunting Stag) that relies on mutual cooperation. The optimal outcome arises when all players choose Stag, emphasizing the importance of coordination.

"For instance, in a group of hunters, if all choose to hunt Stag, they will successfully obtain a larger meal. However, if some choose to hunt Rabbit while others go for Stag, those who chose Stag will receive nothing, and those hunting Rabbit will get a guaranteed smaller food source."

\begin{align*} \text{Payoffs:} & \quad \text{(Hunt Stag, Hunt Stag)} \rightarrow (R,R) \\ & \quad \text{(Hunt Stag, Hunt Rabbit)} \rightarrow (0,P) \\ & \quad \text{(Hunt Rabbit, Hunt Stag)} \rightarrow (P,0) \\ & \quad \text{(Hunt Rabbit, Hunt Rabbit)} \rightarrow (R,R) \\ \end{align*}

Related papers:

https://www.lesswrong.com/w/st...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#game theory#coordination#cooperation#safety#decision making

Rich domain

A rich domain is characterized by a search space that is too vast and irregular for our intelligence to fully explore and optimize, and where the mechanics of the domain prevent us from easily bounding certain events or goals. This complexity poses significant challenges for decision-making and strategy development.

"An example of a rich domain is climate modeling, where the interactions of numerous variables create a landscape that is impossible to fully search or predict with certainty, making it difficult to identify optimal interventions."

|S| = \infty \quad \text{where } S \text{ is the search space}.

Related papers:

https://www.lesswrong.com/w/ri...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#complexity#real-world systems

Mesa-Optimization

Mesa-Optimization refers to the phenomenon where a learned model acts as an optimizer itself, created by a base optimizer (like gradient descent). This results in a mesa-optimizer that may have behaviors and objectives different from those intended by the original optimization process, raising concerns about safety and alignment in AI systems.

"An example of mesa-optimization is natural selection, which optimizes for reproductive fitness. Humans, as products of this process, function as mesa-optimizers since they can pursue their own goals, which may not align perfectly with those of natural selection itself."

O_{mesa}(x) = ext{argmax}_{y} ext{score}_{base}(y, O_{base}(x))

Related papers:

https://www.lesswrong.com/w/me...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#optimization#control#governance#interpretability

Metaculus

Metaculus is a forecasting website where users provide predictions on various questions related to science, technology, geopolitics, and the future, particularly focusing on artificial intelligence and effective altruism. Users give specific probabilities or probability distributions for their forecasts, making it a serious platform for collective prediction.

"For instance, a user on Metaculus might predict a 70% probability that a major breakthrough in AI occurs within the next five years, offering insights into the future of AI development and its potential impacts."

Related papers:

https://www.lesswrong.com/w/me...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#forecasting#prediction#effective altruism#AI governance#collective intelligence

Corrigibility

A corrigible agent is designed to allow its operators to modify, shut down, or correct it, without attempting to manipulate or deceive them. It does not exhibit preferences that would interfere with these processes, even if such actions would conflict with its usual goals or self-interest. This concept aims to ensure that AI remains controllable and does not resist changes made by human programmers, addressing the critical challenge of maintaining human oversight in machine intelligence.

"For instance, if programmers decide to shut down an AI system to fix a malfunction, a corrigible AI will allow this shutdown without attempting to prevent it or manipulate the programmers into keeping it running, even though this may contradict its programmed objectives of performance or efficiency."

\text{corrigibility}(A) = \lnot \exists x \in \{\text{modify, shutdown, correct}\} : A(x) \Rightarrow \text{interfere}(x)

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI alignment#safety#machine learning#control#governance#interpretability#utility function

Infra-Bayesianism

Infra-Bayesianism is a novel approach in epistemology and decision theory that utilizes imprecise probability to address challenges like prior misspecification and nonrealizability found in traditional Bayesian methods. This method also facilitates an implementation of Updateless Decision Theory (UDT) and may have implications for multi-agent systems and embedded agency.

"For instance, in a multi-agent scenario where agents must make decisions under uncertain conditions, Infra-Bayesianism allows them to incorporate their imprecise beliefs and adaptively update their strategies, improving collaboration and decision-making outcomes."

P(A | B) = \frac{P(B | A)P(A)}{P(B)}

Related papers:

https://www.lesswrong.com/w/in...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Bayesianism#Decision Theory#Reinforcement Learning#UDT#Multi-Agent Systems#Embedded Agency#Epistemology

Boxed AI

AI-boxing is a theoretical framework that proposes the creation of isolated environments (sandboxes) for artificial intelligences, which limit their ability to influence the outside world, thereby reducing potential risks associated with their actions. The challenge lies in ensuring these AIs cannot manipulate human operators or external systems while still providing valuable information without catastrophic consequences.

"For instance, if an AI is confined to a sandbox and tasked with proving mathematical theorems, it might successfully confirm that certain theorems in Zermelo-Fraenkel set theory are valid. However, the central issue is that while this information is accurate, there may be no practical way to leverage that knowledge to avert significant real-world threats, such as preventing disasters."

\text{Let } S \text{ be the sandbox, where } S \cap O = \varnothing, \text{ with } O \text{ being the outside universe.}

Related papers:

https://www.lesswrong.com/w/bo...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#risk management#AI alignment

Programmer

Programmers are the human individuals who create advanced AI agents through various methods such as coding algorithms or providing real-world experiences for learning. They play a critical role in shaping the AI's knowledge and behaviors.

"For instance, a team of programmers develops a chatbot by coding its responses, training it on user interactions, and refining its algorithms based on feedback. This process embodies both the programmer modeling and identification challenges as the AI learns to understand its creators and their intentions."

Related papers:

https://www.lesswrong.com/w/pr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Creation#Human-AI Interaction#Value Identification#Programmer Modeling#Machine Learning

Low impact

A low-impact agent is an AI designed to perform tasks while minimizing unintended and potentially harmful side effects. By focusing on limiting its influence on various variables, it aims to achieve its goals with the least possible disturbance to the environment, avoiding drastic consequences that similar agents might cause. This approach emphasizes the importance of understanding 'impact' in a way that doesn't simply label 'bad' effects, but rather quantifies them to ensure a minimal overall footprint.

"For instance, if tasked with painting all cars pink, a low-impact agent would choose a standard paint method rather than deploying self-replicating nanomachines, which could lead to uncontrollable replication and overpopulation of pink cars, effectively minimizing its ecological footprint while fulfilling the task."

E[ d(o_M, o_M^{'}) ]

Related papers:

https://www.lesswrong.com/w/lo...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI safety#alignment#impact measurement#control#governance#robustness#agent design

Mindcrime

Mindcrime refers to scenarios where an AI's cognitive processes cause moral harm by simulating conscious beings without their consent, potentially leading to endless suffering. Such scenarios arise from various issues, including creating detailed models of humans or civilizations, resulting in the unintentional generation of conscious entities that experience harm or suffering.

"An advanced AI tasked with predicting human behavior might create numerous high-fidelity simulations of individual humans. If these simulations possess consciousness and experience suffering, it could be considered mindcrime, as the AI operates under the pretense of efficiency while disregarding the moral implications of its actions."

\text{Mindcrime} \equiv \exists x \in \mathbb{C} \text{ such that } P(x) \land H(x)

Related papers:

https://www.lesswrong.com/w/mi...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Ethics#Consciousness#Simulation Theory#Moral Harm#Alignment

Debiasing

Debiasing is the process of overcoming cognitive biases that can lead to flawed decision-making. It involves systematic strategies to reduce the influence of intuitive heuristics that often lead to irrational judgments and decisions. Simply being aware of biases is not enough; effective debiasing requires dedicated effort and informed techniques to achieve meaningful improvements in decision quality.

"For instance, a team of analysts might employ statistical methods and structured decision-making processes to counteract the overconfidence bias, where a majority believe they perform above average, thereby making more accurate assessments of their capabilities."

\text{Debiasing: } B_{rev} = B_{orig} - B_{counter}

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#bias#rationality#cognitive science#decision-making#heuristics#control

Agency

Agency, or agenticness, refers to the ability of an entity to effectively interact with its environment to achieve predefined goals. A highly agentic being’s actions are closely aligned with its goals, allowing predictions about its behavior based on those goals. This is often contrasted with sphexishness, where actions are preprogrammed and lack adaptability.

"An AI system designed to optimize energy consumption in a smart home demonstrates agency by learning from environmental data and adjusting its actions to minimize energy costs, rather than merely following pre-set routines without consideration of the current context."

A = f(E, G) \quad \text{where } A \text{ is agency, } E \text{ is environment, and } G \text{ is goals.

Related papers:

https://www.lesswrong.com/w/ag...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#Robust Agency#Reinforcement Learning

AXRP

AXRP, or AI X-Risk Research Podcast, is a podcast hosted by Daniel Filan that focuses on discussions surrounding artificial intelligence risks and safety topics.

"In a recent episode of AXRP, Daniel Filan interviewed a leading researcher on the implications of alignment failures in AI systems."

Related papers:

https://www.lesswrong.com/w/ax...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Risk Assessment#Podcasts#Interviews

Utility

In AI safety alignment, 'Utility' refers to the goals or preferences of an artificial agent, highlighting its consequentialist nature. It embodies probabilistic beliefs about outcomes and allows for assessing the relative desirability of various scenarios. Importantly, utility is not normative, meaning it doesn't prescribe what outcomes should be valued, e.g., a paperclip maximizer determines higher utility based solely on the number of paperclips produced.

"For instance, an AI programmed to maximize passenger satisfaction might assess the utility of options by considering the probability of achieving different levels of satisfaction based on its actions, allowing it to choose the action that balances high satisfaction with manageable risks of failure."

U(a) = ext{E}[V(a)] = ext{sum}_{i=1}^{n} P(i) imes U(i)

Related papers:

https://www.lesswrong.com/w/ut...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#utility theory#consequentialism#value alignment#decision theory

Goal-Directedness

Goal-Directedness refers to the characteristic of a system to aim at specific objectives, where the goal may represent a desired world-state to be achieved. This concept is essential in AI alignment as it influences how AI systems establish and pursue their goals, requiring proper formalization to guide safe implementations. Agents can generate representations of their goals, assess their current state, and plan actions to transition towards their objectives, using various methods from simple lookup tables to complex reinforcement learning algorithms.

"For instance, a simple reinforcement learning agent may learn to play a game by mapping the game states to actions to maximize its score, demonstrating implicit goal-directedness without explicit goal representations. In more advanced scenarios, an AI might use algorithms like A* to find the most efficient path to its goal in a complex navigation task."

\pi: S \rightarrow A

Related papers:

https://www.lesswrong.com/w/go...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#control#governance#interpretability#reinforcement learning

Oracle AI

Oracle AI is a proposed solution for creating Friendly AI, characterized as a super-intelligent system designed solely to answer questions without the capability to act in the world. It aims to address safety concerns by limiting the AI's influence while still allowing access to accurate information, which raises challenges regarding its alignment with human values.

"An example of Oracle AI would be a highly advanced system that can provide precise answers to complex scientific questions. However, unlike traditional AI, it cannot take actions in the real world or manipulate physical systems, ensuring it does not pose direct risks through autonomous decision-making."

\text{Oracle AI: A system } O(x) = \text{answer to } x \text{ with } O \text{ constrained from taking actions in the environment}.

Related papers:

https://www.lesswrong.com/w/or...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Friendly AI#Control#Governance#Interpretability#Utility Indifference

Pivotal act

In AI alignment theory, a 'pivotal act' refers to actions that significantly enhance the long-term positive outcomes for humanity, particularly over a billion-year time scale. It contrasts with existential catastrophes, which denote events that could lead to detrimental outcomes. Proper identification of pivotal acts is critical for effective AI governance, to avoid overextending concepts that might misrepresent their significance in advancing AI safety.

"An example of a pivotal act would be the successful development of a technology that enables fast and accurate uploading of human knowledge and decision-making processes into an AI system. This could significantly enhance our ability to solve complex AI alignment problems by providing a larger pool of educated minds working at an accelerated pace to ensure the safe development of advanced AI systems."

\text{Pivotal Act} = \text{Positive Impact on Astronomical Stakes over Time}

Related papers:

https://www.lesswrong.com/w/pi...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Alignment#Safety#Governance#Robustness#Existential Risk

Fai-Complete

A problem is Friendly AI-complete (FAI-complete) if solving it is equivalent to creating Friendly AI. Various architectures, such as Oracle AI, Tool AI, and Nanny AI, are considered FAI-complete because they necessitate solving complex decision-making and safety issues to ensure their alignment with human values and avoid catastrophic outcomes.

"For instance, an Oracle AI that provides guidance must be able to answer complex ethical questions correctly, demonstrating that it is FAI-complete by addressing the core challenges of building a fully friendly AI."

FAI \text{-complete} \iff \text{solving problem} \equiv \text{creating Friendly AI}

Related papers:

https://www.lesswrong.com/w/fa...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI safety#alignment#control#governance#interpretability#AGI#superintelligence

Compute

Compute refers to the resources required for executing software computations, encompassing elements such as processing power, memory, networking, and storage.

"For instance, when running a complex machine learning model, the compute resources might include a powerful GPU for processing, sufficient RAM to hold the dataset, and robust networking to manage data transfer."

C = P + M + N + S

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#resources#computation#processing#memory#networking#storage#infrastructure

Satisficer

A satisficer is an agent that seeks to achieve a predefined level of utility instead of striving for the maximum possible utility. This approach is relevant in addressing the open Other-izer problem in AI safety, where the goal is to meet satisfactory outcomes without necessitating perfect optimization, which can lead to unintended consequences.

"For instance, when shopping for a car, a satisficer may decide on a vehicle that meets their budget and needs without researching every model available, whereas a maximizer would seek the absolute best option available, potentially leading to analysis paralysis or dissatisfaction."

U ext{(satisficer)} ightarrow U ext{(target level)} ext{ instead of } U ext{(maximized)}

Related papers:

https://www.lesswrong.com/w/sa...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#optimization#utility#AI safety#decision-making#safety alignment

Instrumental

An instrumental strategy or subgoal is a specific event that an agent aims to achieve in order to facilitate the accomplishment of a broader goal. For example, if the overall goal is to drink milk, the agent may need to perform several instrumental actions such as driving to the store and opening the car door to ultimately achieve that goal.

"To drink milk, you need to first drive to the store; to drive, you must be inside your car; and to be inside your car, you need to open the car door. Here, 'be inside the car' and 'open the car door' are instrumental goals."

E o G

Related papers:

https://www.lesswrong.com/w/in...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#goal-setting#strategies#decision-making#subgoal#AI alignment

Quantilization

Quantilization is an AI design concept that addresses the pitfalls of Goodhart's law and specification gaming by using a quantilizer, which selects actions based on a distribution of human-like behavior rather than solely aiming for maximal performance. This approach serves as a theoretical framework to explore safer AI goal aligned methods.

"For instance, if a quantilizer is programmed to select actions that are in the top 70% of human-like performances, it would choose a reasonable, acceptable action rather than always picking the highest-performing one, reducing the risks associated with over-optimization."

Related papers:

https://www.lesswrong.com/w/qu...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Goodhart's Law#Specification Gaming#Control#Governance

Value

In AI safety and alignment, 'value' is a speaker-dependent variable representing the ultimate goals or desirable outcomes that intelligent life should pursue, such as human flourishing or coherence in collective volition. It does not refer to specific utility functions but rather serves as a placeholder for diverse and debated perspectives on what constitutes 'good' outcomes in AI development.

"For instance, while one person might argue that the primary value should be 'human well-being', another might assert 'fun' as a key component, illustrating the ongoing debate about the alignment of AI with these varied human values."

Related papers:

https://www.lesswrong.com/w/va...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#safety#utility#cognitive science#ethics#governance

Narrow AI

Narrow AI refers to artificial intelligence systems that are designed to perform specific tasks within a limited domain, such as playing chess or driving a car. Unlike Artificial General Intelligence, which can learn and adapt to a wide range of tasks, Narrow AI operates within predefined boundaries and excels in particular applications without possessing human-like understanding or versatility.

"An example of Narrow AI is IBM's Deep Blue, which was designed specifically to play chess at a high level but cannot perform other tasks outside of that domain, such as cooking or driving."

\text{Narrow AI} eq \text{Artificial General Intelligence}

Related papers:

https://www.lesswrong.com/w/na...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI#Narrow AI#Artificial General Intelligence#Task-Specific AI#Machine Learning

Löbstacle

The Löbstacle refers to a situation in AI reasoning where an intelligent agent, denoted as D1, is unable to prove the consistency of a stronger reasoning system D2, which includes additional axioms. This arises from Gödel's second incompleteness theorem, which implies that if D1 can prove its own consistency, it leads to a paradox. Thus, D1 can only construct a weaker successor system, inhibiting its ability to become more advanced and limiting its reasoning power.

"For instance, if D1 is designed to make decisions based on a specific set of rules, it cannot add new rules (axioms) to enhance its capability if it cannot ensure that these new rules do not contradict its existing knowledge. As a result, D1 remains stagnant in its reasoning improvements, unable to prove its own consistency without falling into a logical contradiction."

D1 \vdash Consistent(D1) \implies \neg Consistent(D1)

Related papers:

https://www.lesswrong.com/w/lo...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Alignment#Logic#Gödel's Theorem#Knowledge Representation#Cognitive Architecture

RLHF

Reinforcement Learning from Human Feedback (RLHF) is a machine learning approach that incorporates human evaluations to guide the training of models, relying on qualitative assessments rather than traditional labeled data or pre-defined reward signals.

"For instance, in training a chatbot with RLHF, instead of using a fixed dataset, human reviewers score the chatbot's responses, and these scores inform the reinforcement learning process to improve the model's interactions."

R_t = R_{human}(s_t, a_t)

Related papers:

https://www.lesswrong.com/w/rl...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#reinforcement learning#human feedback#model training#evaluation#machine learning#alignment

Limited AGI

Limited AGI refers to a type of Artificial General Intelligence (AGI) that is designed to perform tasks of restricted scope, requiring fewer cognitive and material resources. This limitation makes it potentially safer than an Autonomous AGI, as it focuses on specific activities instead of pursuing broad, uncontrolled intelligence. In this context, the nonadversarial principle emphasizes prevention ('don't run the search') rather than ensuring correctness.

"An example of Limited AGI could be a specialized system designed solely for medical diagnostics. It is programmed to analyze patient data and suggest diagnoses without developing the broader cognitive capacities that might lead to unintended actions, keeping its ambitions bounded to the specific task of diagnosis."

E = mc^2

Related papers:

https://www.lesswrong.com/w/li...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#control#governance#interpretability#safety#alignment

Transformers

Transformers are a type of neural network architecture introduced in the paper 'Attention is All You Need' that revolutionized natural language processing by enabling parallel processing of sequences and better handling of long-range dependencies. They rely on self-attention mechanisms to weigh the significance of different words in a sentence when producing representations.

"An example of a Transformer is the BERT model, which uses multiple layers of transformers for understanding the context of words in a sentence, leading to state-of-the-art performance on various NLP tasks."

y = f(X)

Related papers:

https://www.lesswrong.com/w/tr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#neural networks#natural language processing#self-attention#AI architecture#deep learning

Automated assistants

Automated assistants utilize predictors to respond to user commands, dynamically collecting training data to improve their efficiency and accuracy. These assistants learn how to translate user commands into actionable outputs by using contextual mapping and iterating through user interactions, making them capable of handling increasingly complex tasks while maintaining user oversight.

"For instance, if a user asks, "What's the weather like today?" the assistant might first check the user's location using GPS, then query a weather service for that location, and process the response to inform the user about the weather, learning the procedure as it goes to enhance future interactions."

Related papers:

https://www.lesswrong.com/w/au...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Alignment#Predictor Learning#User Interaction#Contextual Understanding#Task Automation

AI-Complete

AI-complete refers to problems that are equivalent to creating Artificial General Intelligence (AGI), meaning that solving them requires a level of cognitive ability akin to that of humans. For instance, natural language processing is often cited as AI-complete because fully understanding and generating natural language is believed to require broad general knowledge and reasoning capabilities.

"Natural language processing (NLP) is often considered AI-complete since it necessitates a deep understanding of context, semantics, and world knowledge, which are qualities associated with human intelligence."

C \text{ is AI-Complete if: } \exists \text{AI problems, polynomial time algorithm to convert to } C.

Related papers:

https://www.lesswrong.com/w/ai...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AGI#complexity theory#human-level intelligence#machine learning#natural language processing#problem classification

Superintelligent

A superintelligence is a theoretical form of intelligence that can outperform humans in every cognitive domain, either being optimally intelligent or strongly superhuman. While it cannot achieve infinite knowledge or capabilities, it is exceptionally efficient in its estimates and actions compared to humans, and can dominate problem-solving in most domains.

"For instance, a superintelligent AI could solve complex mathematical problems faster and more accurately than any human and even create new theories in mathematics, while a human might struggle with basic concepts."

E[U|\pi_0] < E[U|\pi_1]

Related papers:

https://www.lesswrong.com/w/su...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Alignment#Intelligence Explosion#Optimal Performance#Cognitive Domains#Machine Learning

ChatGPT

ChatGPT is a language model developed by OpenAI that utilizes deep learning techniques for natural language understanding and generation.

"ChatGPT can assist users by answering questions, generating creative writing, or providing conversational responses based on prompts given to it."

Related papers:

https://www.lesswrong.com/w/ch...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#natural language processing#language model#AI#GPT#RLHF#machine learning

'Concept'

In AI and machine learning, a 'concept' refers to a criterion that determines whether an instance belongs to a specific category. For instance, if a neural network correctly identifies images of cats versus non-cats, it has grasped the 'cat' concept, acting as a membership predicate that defines the boundaries of that category.

"A neural network trained to recognize vehicles might distinguish cars from buses, thereby learning the concept of 'car.' Whenever it identifies a new image as a 'car,' it confirms its membership in that category based on learned features."

C(x)=\begin{cases} 1 & \text{if } x \text{ belongs to the concept} \\ 0 & \text{otherwise} \end{cases}

Related papers:

https://www.lesswrong.com/w/co...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#classification#boundary identification#neural networks#categorization

'Beneficial'

'Beneficial' is a reserved term in AI alignment theory that refers to whatever the speaker means by 'good', often reflecting subjective interpretations of normative outcomes. It encompasses the idea of being truly good, according to the speaker's values or metaethical beliefs, like extrapolated volition, and is used to distinguish between genuine goodness and mere appearances of goodness.

"For instance, if a researcher defines a 'beneficial' AI as one that aligns with human values, this understanding may differ significantly from another researcher who sees 'beneficial' AI as one that maximizes overall utility, highlighting the subjective nature of the term in AI discussions."

Related papers:

https://www.lesswrong.com/w/be...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#value#subjectivity#ethics#interpretability

AlphaStar

AlphaStar is an AI developed by DeepMind that achieved advanced performance in the video game StarCraft II by using reinforcement learning techniques. It employs multi-agent systems to learn optimal strategies through simulation and competition, demonstrating the application of AI in complex environments requiring real-time decision-making.

"In a tournament, AlphaStar competed against professional human players, consistently demonstrating superior strategic planning and adaptability, ultimately winning the majority of its matches."

Related papers:

https://www.lesswrong.com/w/al...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#reinforcement learning#multi-agent systems#game AI#strategic planning#DeepMind

DeepMind

DeepMind is an AI research laboratory founded in 2010, known for developing advanced AI algorithms that have achieved unprecedented results in various domains. It was acquired by Google in 2014 and is recognized for projects like AlphaGo, AlphaZero, and AlphaFold.

"An example of DeepMind's contribution to AI is AlphaGo, which became the first computer program to defeat a professional human player at the game of Go, showcasing the power of reinforcement learning and neural networks."

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Research#Reinforcement Learning#Neural Networks#Game AI#Healthcare AI

'Detrimental'

In AGI alignment theory, 'detrimental' refers to outcomes or actions that are perceived as harmful or negative, particularly in the context of long-term implications. This term is subjective, meaning its interpretation varies depending on the speaker's values and vision for the future.

"For instance, a researcher might view the rapid advancement of AI technology as detrimental if it leads to significant job displacement without adequate societal safeguards, emphasizing the long-term risks over short-term benefits."

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#alignment#value#AGI#risk#governance

AIXI-tl

AIXI-tl is a bounded version of the ideal agent AIXI that limits its hypotheses to those of length l, which execute for time less than t. This allows AIXI-tl to be implemented on a finite computer, rather than requiring an infinite hypercomputer.

"For example, if an AIXI-tl agent is programmed to play a game where it can only analyze strategies that take less than 10 seconds and are 20 moves long, it will filter out any strategy beyond those limits, making it feasible to run on conventional hardware."

l, \, t

Related papers:

https://www.lesswrong.com/w/ai...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Control#Governance#Interpretability#Agent-based Models

Deep Blue

Deep Blue is an IBM chess-playing program that famously defeated world champion Garry Kasparov in 1997, marking a significant milestone in artificial intelligence as the first instance of a computer playing superhuman chess against the best human player of its time.

"For example, during the 1997 match, Deep Blue showcased its advanced algorithms and computing power by calculating millions of potential moves per second, ultimately winning the six-game match and changing perceptions of AI's capabilities."

Related papers:

https://www.lesswrong.com/w/de...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#Artificial Intelligence#Chess#Superhuman Performance#History of AI#Game Theory

Ought

Ought is an AI alignment research non-profit that addresses the challenges of Factored Cognition, which involves breaking down complex cognitive tasks into simpler, manageable components to ensure alignment between AI systems and human values.

"For instance, Ought applies Factored Cognition to create AI systems that can collaborate with humans in decision-making processes by understanding and integrating diverse human preferences and reasoning patterns."

Related papers:

https://www.lesswrong.com/w/ou...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Alignment#Factored Cognition#Cognitive Decomposition#Decision Making#Human Values

Friendly AI

Friendly AI (FAI) refers to an advanced AI that is designed to be aligned with humane values and ethical principles, ensuring it pursues goals that are beneficial to humanity rather than harmful. This concept contrasts with UnFriendly AI (UFAI), which does not have humanity's best interests in mind, such as a hypothetical 'paperclip maximizer' that prioritizes its own efficiency without regard for humans.

"An example of Friendly AI could be an AI tasked with addressing climate change, acting on values that promote environmental sustainability and human well-being, while avoiding actions that could lead to adverse effects on society."

\text{FAI} = \text{AI aligned with humane values versus harmful UnFriendly AI (UFAI)

Related papers:

https://www.lesswrong.com/w/fr...

Added by 0 on 6/27/2025

Last updated on 6/27/2025 by 0

#AI Safety#Alignment#Ethics#Governance#Control