• @generalpotato
    1
    11 months ago

    Not all anonymization techniques are created equal? I’m pretty sure this is fairly obvious at this point to anybody remotely familiar with how data collection works when it comes to privacy and device metrics.

    So, how is this relevant to this conversation besides adding more FUD and misinformation?

    • @QuadratureSurfer
      -1
      11 months ago

      You sound like you know a lot more than everyone else on this subject so I thank you for your responses as a means to educate others.

      Just a word of advice, be sure to treat others with respect rather than assuming the worst of their intentions or calling them idiots because they don’t know as much as you.

      My response is still relevant to the conversation as we are talking about “anonymized data”. The link in my comment above proves that just because you are told your data has been “anonymized” does not truly mean that it’s impossible to re-attribute it back to an individual.

      So if you trust that Apple has great techniques for data anonymization, that’s awesome, feel free to expand on that and explain why. Just don’t go around telling others that simply having any sort of anonymization technique makes it so you don’t have to worry.

      • @generalpotato
        3
        edit-2
        11 months ago

        Thanks for the “advice”. Now, let me expand on my position.

        The reason why I’m slightly annoyed by everyone’s take here is:

        1. The demeanor that folks here have in passing off ill-informed opinions as fact and then speculating on details.
        2. Not looking at the actual privacy policy of a company and the history of how said company has been involved in data collection, privacy, implementation of features in that realm and their handling of customer data.
        3. Bringing up random points just to win an argument instead of conceding that they do not know what they are talking about.

        Here’s a few links to put things in perspective as to what and how Apple anonymizes data and how seriously it takes privacy:

        https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

        https://www.apple.com/privacy/labels/

        https://www.apple.com/privacy/control/

        Read through those, look at Apple’s implementation of TouchID, FaceID and their stance on E2E encryption and tell me again why Apple isn’t serious about privacy, masking and anonymizing data, implementing differential privacy and informing users of what they collect and how users can opt-out of it.

        Edit- Further evidence and reading:

        https://www.techradar.com/news/fbi-says-apples-new-encryption-launch-is-deeply-concerning

        https://www.digitaltrends.com/mobile/apple-data-collection/

        https://www.apple.com/privacy/docs/A_Day_in_the_Life_of_Your_Data.pdf

        • @QuadratureSurfer
          2
          11 months ago

          I’ve been reading through the links you posted as well as looking through other sources. I agree Apple is definitely taking more care with how they anonymize data compared to companies such as Netflix or Strava.

          In Netflix’s case they released a bunch of “anonymized data” but in just over 2 weeks some researchers were able to de-anonymize some of the data back to particular users: https://www.cs.utexas.edu/~shmat/netflix-faq.html

          I’ve already linked Strava’s mistake with their anonymization of data in my above comment.

          and tell me again why Apple isn’t serious about privacy,

          I think you must have me confused with someone else, up to this point in our discussion I never said that. I do believe that Apple is serious about privacy, but that doesn’t mean they are immune to mistakes. I’m sure Netflix and Strava thought the same thing.

          My whole point is that you can’t trust that it’s impossible to de-anonymize data simply because some organization removes all of what they believe to be identifying data.

          GPS data is a fairly obvious one which is why I brought it up. Just because you remove all identifying info about a GPS trace doesn’t stop someone (or some program) from re-attributing that data based on the start/stop locations of those tracks.
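
To make the start/stop point concrete, here is a minimal sketch in Python. It is not how any real attack on Strava's data worked; the `likely_home` helper, the sample coordinates and the rounding precision are all made up for illustration. It only shows that tracks with no user ID attached can still point at one place if they keep starting or ending there.

```python
from collections import Counter

def likely_home(traces, precision=3):
    """Guess a 'home' cell from the most common start/end point across traces.

    traces: list of tracks, each a list of (lat, lon) tuples with no user ID.
    precision: decimal places used to bucket coordinates (3 is roughly 100 m).
    """
    endpoints = Counter()
    for track in traces:
        if not track:
            continue
        # Only the first and last point of each track are counted.
        for lat, lon in (track[0], track[-1]):
            endpoints[(round(lat, precision), round(lon, precision))] += 1
    if not endpoints:
        return None
    return endpoints.most_common(1)[0][0]

# Hypothetical "anonymized" tracks: no names, but each starts or ends near the same spot.
traces = [
    [(40.7128, -74.0060), (40.7200, -74.0100), (40.7306, -73.9866)],
    [(40.7129, -74.0061), (40.7000, -74.0200), (40.6892, -74.0445)],
    [(40.7580, -73.9855), (40.7300, -74.0000), (40.7127, -74.0059)],
]
home = likely_home(traces)
```

Once you have that cell, tying it to a person only takes an address lookup, which is the whole problem.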

          I appreciate that Apple is taking steps and using “local differential privacy” to try to mitigate stuff like this as much as possible. However, even they admit in that document that you linked that this only makes it difficult to determine rather than making it impossible:
          “Local differential privacy guarantees that it is difficult to determine whether a certain user contributed to the computation of an aggregate by adding slightly biased noise to the data that is shared with Apple.” https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
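
For anyone wondering what "adding noise locally" can look like, here is a rough sketch of randomized response, which is a textbook form of local differential privacy. To be clear, this is an assumption on my part for illustration and not the mechanism Apple describes in that PDF; the function names and the epsilon value are made up.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true answer with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return truth if rng.random() < p_truth else not truth

def estimate_rate(reports, epsilon: float) -> float:
    """Undo the known bias to estimate the true fraction of True answers."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    # observed = (1 - p) + rate * (2p - 1), solved for rate
    return (observed - (1 - p)) / (2 * p - 1)

# Simulated population: 30% of 20,000 users have the sensitive attribute.
rng = random.Random(0)
true_answers = [i < 6000 for i in range(20000)]
reports = [randomized_response(t, 1.0, rng) for t in true_answers]
estimate = estimate_rate(reports, 1.0)
```

The aggregate stays usable while any single report is deniable, but a smaller epsilon only makes it harder to tell what one user sent. It never makes it impossible, which matches the "difficult to determine" wording in the quote.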


          Now for some counter evidence and reading:

          Here’s a brief article about how Anonymized data isn’t as anonymous as you think: https://techcrunch.com/2019/07/24/researchers-spotlight-the-lie-of-anonymous-data/

          And if you just want to skip to it, here’s the link to the study about how anonymized data can be reversed: https://www.nature.com/articles/s41467-019-10933-3/

          informing users of what they collect and how users can opt-out of it.

          It would be great if users could just opt-out, however Apple is currently being sued for continuing to collect analytics even on users that have opted out (or at least it appears that way, we’ll have to let the lawsuit play out to see how this goes).
          https://youtu.be/8JxvH80Rrcw
          https://www.engadget.com/apple-phone-usage-data-not-anonymous-researchers-185334975.html
          https://gizmodo.com/apple-iphone-privacy-settings-third-lawsuit-1850000531

          That DigitalTrends article you linked was okay, but it was written in 2018 before Mysk’s tests.

          As for your TechRadar link to Apple’s use of E2EE, that’s great, I’m glad they are using E2EE, but that’s not really relevant to our discussion about anonymizing data and risks running afoul of the #3 point you made for why you are frustrated with the majority of users in this post.

          I understand it can be frustrating when people bring up random points like that, I’m assuming your comment for #3 was directed at other users on this post rather than myself. But feel free to call me out if I go too far off on a tangent.

          I have tried to stick to my main point which is: just because data has been “anonymized” doesn’t mean it’s impossible to de-anonymize that data.

          It’s been a while since I’ve looked up information on this subject, so thank you for contributing to this discussion.

          • @generalpotato
            1
            11 months ago

            :-) Thanks for the detailed response. Let me take a look and get back to you.

          • @generalpotato
            1
            edit-2
            11 months ago

            My whole point is that you can’t trust that it’s impossible to de-anonymize data simply because some organization removes all of what they believe to be identifying data.

            GPS data is a fairly obvious one which is why I brought it up. Just because you remove all identifying info about a GPS trace doesn’t stop someone (or some program) from re-attributing that data based on the start/stop locations of those tracks.

            Looking at all the links you’ve posted… so there have been cases and studies stating that data can be re-identified, but do we have insight into what exact data sets they were looking at? I tried looking at the Nature study but it doesn’t say how they got the data and what exact vectors they were looking at, outside of a mention of some 15 parameters such as zip code, address etc. Data pipelines and metrics vary vastly per implementation. I’m curious to see where the data set came from, what the use case was for collection, the company behind it, the engineering chops it has etc.
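
A quick way to see why the number of attributes matters is to count how many records become unique as you combine more columns. The sketch below is a plain uniqueness count on made-up records, not the statistical model used in the Nature study; `unique_fraction` and the field names are hypothetical.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Made-up rows with no names attached.
records = [
    {"zip": "10001", "birth_year": 1985, "gender": "F"},
    {"zip": "10001", "birth_year": 1985, "gender": "M"},
    {"zip": "10001", "birth_year": 1990, "gender": "F"},
    {"zip": "10002", "birth_year": 1985, "gender": "F"},
]

by_zip = unique_fraction(records, ["zip"])
by_zip_year = unique_fraction(records, ["zip", "birth_year"])
by_all = unique_fraction(records, ["zip", "birth_year", "gender"])
```

In this toy set the unique share goes from a quarter with one column to every record with three, so which columns a pipeline collects matters at least as much as how it masks them.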

            If from a data collection standpoint you’re collecting “zip code” and “address”, you’ve already failed to adhere to good privacy practices, which is what I’m arguing in Apple’s case. You could easily salt and hash a str to obfuscate it, so why is it not being done? Data handling isn’t any different from a typical technical problem. There are risks and benefits associated with an implementation; the question is how well you do it and what you are doing to ensure privacy. The devil is in the detail. Collecting “zip code” and “address” isn’t good practice, so no wonder the data becomes re-identifiable.
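
A minimal sketch of the salt-and-hash idea, using only Python's standard library. The `salted_hash` helper and the sample value are made up for illustration, and this says nothing about how any particular company implements it.

```python
import hashlib
import secrets

def salted_hash(value: str, salt: bytes) -> str:
    """Return a hex SHA-256 digest of salt + value instead of the raw string."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

# The salt is random and must be kept secret by whoever runs the pipeline.
salt = secrets.token_bytes(16)
token = salted_hash("90210", salt)
```

The same input with the same salt always gives the same token, so you can still count and join on it without storing the plain value. Short values like zip codes have very few possibilities though, so the obfuscation only holds while the salt stays secret.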

            https://youtu.be/8JxvH80Rrcw https://www.engadget.com/apple-phone-usage-data-not-anonymous-researchers-185334975.html https://gizmodo.com/apple-iphone-privacy-settings-third-lawsuit-1850000531

            More FUD. Why aren’t they testing iOS 16? Ok, sure, it’s sending device analytics back… but it could just be a bug? The YT video is showing typical metrics; this isn’t any different from literally any metrics call an embedded device makes. A good comparison would be an Android phone’s metrics call, side by side. I’m sorry, I refuse to take seriously a video that says “App Store is watching you” and tries to skew my opinion prior to showing me the data. The data should speak for itself. I see the DSID bit in the Gizmodo article, but that’s a long shot without any explanation of how the data is identifiable specifically.

            Lastly,

            As for your TechRadar link to Apple’s use of E2EE, that’s great, I’m glad they are using E2EE, but that’s not really relevant to our discussion about anonymizing data and risks running afoul of the #3 point you made for why you are frustrated with the majority of users in this post.

            Privacy is fundamental to designing a data pipeline that doesn’t collect “zip code” in plain str if you want the data to be anonymized at any level. So it is absolutely relevant. :-)

            Edit: To clarify, if it wasn’t clear, relying on just data anonymization and collecting everything under the sun isn’t a good way to design a data pipeline that allows for metrics collection. The goal should always be collecting as little as possible, then using masking, anonymization and other techniques to obfuscate it all. No solution is perfect, but that doesn’t mean there aren’t shitty ways of implementing things, leading to the fiascos you see on the web.
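
The "collect as little as possible" part can be as simple as an allowlist applied before anything leaves the device. A rough sketch, where the field names and the `minimize` helper are hypothetical:

```python
# Only fields on this allowlist are ever sent; everything else is dropped.
ALLOWED_FIELDS = {"os_version", "app_version", "crash_count"}

def minimize(event: dict) -> dict:
    """Keep only allowlisted fields from a raw metrics event."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}

raw_event = {
    "os_version": "16.1",
    "app_version": "2.4.0",
    "crash_count": 1,
    "zip_code": "90210",        # never needed for crash metrics
    "address": "1 Example St",  # never needed for crash metrics
}
sent = minimize(raw_event)
```

Masking and anonymization then only have to deal with what is left, instead of trying to clean up fields that should never have been collected.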