Small rant : Basically, the title. Instead of answering every question, if it instead said it doesn’t know the answer, it would have been trustworthy.

  • @Cosmicomical
    link
    23 months ago

    Do you have a source for the “smiling when you don’t really mean it” thing? I’ve been digging around but couldn’t find that anywhere.

    • @kromem
      link
      English
      13 months ago

      It’s right in the research I was mentioning:

      https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

      Find the section on the model’s representation of self and then the ranked feature activations.

      I misremembered the top feature slightly, which was: responding “I’m fine” or gives a positive but insincere response when asked how they are doing.