All of my recent work is published on LessWrong and the Alignment Forum.
I work on technical alignment, but that work has led me to branch into alignment targets, alignment difficulty, and societal and sociological questions around the field. Choosing the best technical research approach depends on all of these.
Principal articles:
On technical alignment of LLM-based AGI agents:
LLM AGI may reason about its goals and discover misalignments by default – An LLM-centric lens on why aligning Real AGI is hard
System 2 Alignment – Likely approaches for LLM AGI on the current trajectory
Seven sources of goals in LLM agents – A brief problem statement
Internal independent review for language model agent alignment – Updated in System 2 Alignment
On LLM-based agents as a route to takeover-capable AGI:
LLM AGI will have memory, and memory changes alignment
Brief argument for short timelines being plausible
Capabilities and alignment of LLM cognitive architectures – A cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
AGI risk interactions with societal power structures and incentives:
Whether governments will control AGI is important and neglected
If we solve alignment, do we die anyway?
Risks of proliferating human-controlled AGI
Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
On the psychology of alignment as a field:
Cruxes of disagreement on alignment difficulty
Motivated reasoning/confirmation bias as the most important cognitive bias
On AGI alignment targets:
Problems with instruction-following as an alignment target
Instruction-following AGI is easier and more likely than value aligned AGI
Goals selected from learned knowledge: an alternative to RL alignment
On communicating AGI risks:
Anthropomorphizing AI might be good, actually
Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
Older research articles:
Herd, S., Read, S. J., O’Reilly, R., & Jilk, D. J. (2018). Goal changes in intelligent agents. In Artificial Intelligence Safety and Security (pp. 217-224).
Jilk, D. J., Herd, S., Read, S. J., & O’Reilly, R. C. (2017). Anthropomorphic reasoning about neuromorphic AGI safety. Journal of Experimental & Theoretical Artificial Intelligence, 29(6), 1337-1351.
Herd, S., Urland, G., Mingus, B., & O’Reilly, R. (2011). Human-artificial-intelligence hybrid learning systems. In Frontiers in Artificial Intelligence and Applications: Vol. 223. Biologically Inspired Cognitive Architectures 2011 (pp. 132-137). IOS Press. PDF here