Poisoned AI went rogue during training and couldn’t be taught to behave again in ‘legitimately scary’ study::AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models — and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.

  • @Paragone
    11 months ago

    I hold that this is true of all neural-nets, organic as well as silicon:

    Once a person has sided with treachery, rooting it out from one’s unconscious-mind is … enduringly difficult, if not intractable.

    I don’t know how many decades it takes to eradicate the roots of it, if it can be done, at all:

    the unconscious-mind mechanism, that is, the Kahneman System-1 imprint ( from “Thinking, Fast and Slow” ), is going to still be there, even if overlaid with another imprint ( since mind is holographic/pattern-imprints in function ).

    Worse, it is the motivation that needs to change, and motivation is of ego, which is of identity, so many who “reform” only do so superficially.

    I’m not saying this as some goody-2-shoes, I’m saying this as a person who was raised by narcissists, and therefore embodied much narcissism, and class-prejudice ( dad was a doctor: you can’t get more upper-middle-class status-prejudiced than doctor-culture )…

    …who finally cracked the root kernel of the class-prejudice in my unconscious-mind’s identity-crystal at the end of a 25-day hard-line fast, out in the bush.

    It took that to fracture the identity-crystal’s prejudice.

    It’s been a decade since then, & I’m still fighting to eradicate its treachery from my nature.

    Neural-nets are tough to purge, or clean-up & make upright.

    MUCH easier to keep a neural-net pristine through all of its formation, than to try ( endlessly failing ) to clean it up, after it’s become enemy-intent in “family” clothing.

    _ /\ _

    • jaxxed
      11 months ago

      Can you recommend further reading?