Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.

  • @kromem
    link
    English
    15 months ago

    It’s right in the research I was mentioning:

    https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

    Find the section on the model’s representation of self and then the ranked feature activations.

    I misremembered the top feature slightly, which was: responding “I’m fine” or gives a positive but insincere response when asked how they are doing.