All of my recent work is published on LessWrong and the Alignment Forum.

I work on technical alignment, but that work has led me to branch into alignment targets, alignment difficulty, and societal and sociological questions about the field. Choosing the best technical research approach depends on all of these.

Principal articles:

On technical alignment of LLM-based AGI agents:

LLM AGI may reason about its goals and discover misalignments by default – An LLM-centric lens on why aligning Real AGI is hard

System 2 Alignment – Likely approaches for LLM AGI on the current trajectory  

Seven sources of goals in LLM agents – A brief problem statement

Internal independent review for language model agent alignment – Updated in System 2 Alignment

On LLM-based agents as a route to takeover-capable AGI:

LLM AGI will have memory, and memory changes alignment

Brief argument for short timelines being plausible

Capabilities and alignment of LLM cognitive architectures – A cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed

AGI risk interactions with societal power structures and incentives:

Whether governments will control AGI is important and neglected

If we solve alignment, do we die anyway?

Risks of proliferating human-controlled AGI

Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours

On the psychology of alignment as a field:

Cruxes of disagreement on alignment difficulty

Motivated reasoning/confirmation bias as the most important cognitive bias

On AGI alignment targets:

Problems with instruction-following as an alignment target

Instruction-following AGI is easier and more likely than value aligned AGI

Goals selected from learned knowledge: an alternative to RL alignment

On communicating AGI risks:

Anthropomorphizing AI might be good, actually

Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term

Older research articles:

Herd, S., Read, S. J., O’Reilly, R., & Jilk, D. J. (2018). Goal changes in intelligent agents. Artificial Intelligence Safety and Security (pp. 217-224).

Jilk, D. J., Herd, S., Read, S. J., & O’Reilly, R. C. (2017). Anthropomorphic reasoning about neuromorphic AGI safety. Journal of Experimental & Theoretical Artificial Intelligence, 29(6), 1337-1351.

Herd, S., Urland, G., Mingus, B., & O’Reilly, R. (2011). Human-artificial-intelligence hybrid learning systems. Frontiers in Artificial Intelligence and Applications, Volume 223: Biologically Inspired Cognitive Architectures 2011 (pp. 132-137). IOS Press. PDF here