#datomic
2018-04-14
Wes Hall02:04:30

Given Datomic Cloud's current lack of support for excision, it would seem that it is not safe to use once the EU's GDPR comes into force in May. I am suddenly fairly concerned about this. Part of the regulation is that any EU citizen can request that their data be permanently erased, and Datomic Cloud currently lacks a feature that would let you comply with these requests 😕

val_waeselynck08:04:36

@U9HA101PY My advice is to store privacy sensitive values in a complementary KV store with UUID keys that are referenced from Datomic. It's not as hard to set up as it seems, especially once you've realized that you can build some very generic querying/transacting helpers using Specter and tagged literals. Will blog about this soon.
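
A rough sketch of what the complementary-KV-store idea could look like, not val_waeselynck's actual implementation. Everything named here is an assumption for illustration: the atom-backed `kv-store` stands in for a real durable, mutable store, the helper names are made up, and `:contact/email-ref` is assumed to be an attribute of type `:db.type/uuid`.

```clojure
;; Hypothetical in-memory stand-in for the complementary KV store.
;; A real system would use a durable store that supports true deletion.
(def kv-store (atom {}))

(defn store-sensitive!
  "Stores a sensitive value under a fresh random UUID and returns that UUID."
  [value]
  (let [k (java.util.UUID/randomUUID)]
    (swap! kv-store assoc k value)
    k))

(defn lookup-sensitive
  "Returns the value stored under key k, or nil if it was erased or never stored."
  [k]
  (get @kv-store k))

(defn erase-sensitive!
  "Deletes the value under key k. Any UUID left in Datomic's history
  then points at nothing."
  [k]
  (swap! kv-store dissoc k)
  nil)

(defn email-tx
  "Builds Datomic-style tx data that contains only the opaque key,
  never the email address itself."
  [entity-id email]
  [[:db/add entity-id :contact/email-ref (store-sensitive! email)]])
```

With this in place, `(email-tx 3242525424 "jane@example.com")` yields tx data whose value is a UUID; `lookup-sensitive` on that UUID returns the address until `erase-sensitive!` is called, after which it returns `nil` while the Datomic history stays untouched.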

val_waeselynck08:04:58

Note that such issues affect Peer based systems as well, because Datomic excision is not really a practicing solution for data erasure - especially if you have to erase any personal information after 3 years.

hmaurer11:04:23

@U06GS6P1N what do you mean by “tagged literals” in this context?

val_waeselynck11:04:34

@U5ZAJ15P0 [[:db/add 3242525424 :contact/email-ref #privacy/to-be-replaced-by-a-key [""]]]

val_waeselynck11:04:57

#privacy/to-be-replaced-by-a-key [""] is the tagged literal of interest here.
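
One way such a tag might be resolved, sketched with `clojure.edn/read-string` and a custom reader function rather than the Specter-based helpers mentioned earlier in the thread. The assumptions are loud ones: the tag is taken to wrap a one-element vector holding the sensitive value, `kv-store` is an in-memory stand-in for the real KV store, and the email address is invented for the example.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical in-memory stand-in for the complementary KV store.
(def kv-store (atom {}))

(defn replace-by-key!
  "Reader function for the tag: stores the wrapped value under a fresh UUID
  and returns the UUID. Assumes the tag wraps a one-element vector."
  [[value]]
  (let [k (java.util.UUID/randomUUID)]
    (swap! kv-store assoc k value)
    k))

(defn read-tx-data
  "Reads EDN tx data, swapping each tagged value for its storage key."
  [s]
  (edn/read-string
   {:readers {'privacy/to-be-replaced-by-a-key replace-by-key!}}
   s))

;; The sensitive value never reaches the transaction; only the UUID does.
(def tx-data
  (read-tx-data
   "[[:db/add 3242525424 :contact/email-ref #privacy/to-be-replaced-by-a-key [\"jane@example.com\"]]]"))
```

After reading, `tx-data` has the same shape as the original transaction, except that the tagged form has been replaced by a UUID that is also a key in `kv-store`.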

Wes Hall16:04:44

@U06GS6P1N Interesting. A similar thought did cross my mind, but not as developed as what you are describing here. I think a problem with this is that nobody seems completely clear on what constitutes "personal data" under GDPR. I'd probably end up storing most of the data in this way. I wonder then if I am better just using something like Dynamo directly.

daveliepmann17:04:33

I don't disagree with the complementary-KV-store approach as perhaps the best solution in the near term, but from an operations/infrastructure or business perspective "just have a second database for anything you might be legally required to delete" is, to put it mildly, simply not convincing.

daveliepmann17:04:00

It's not clear to me what "Datomic excision is not really a practicing solution for data erasure" means?

val_waeselynck19:04:50

@U9HA101PY Being in a European company close to the team that deals with these issues, I may have some more precise knowledge. GDPR mostly concerns itself with data that can lead to identifying a person, which includes email addresses, phone numbers, IP addresses, first and last name, etc. In particular, the GDPR requires that such data be collected with explicit and informed consent, that it may be exhaustively deleted or exported upon request, and that it should be kept for a finite amount of time (typically 3 to 5 years).

val_waeselynck19:04:12

@U05092LD5 Regarding excision: the Datomic team themselves said that excision is a very costly operation that should only be performed under exceptional circumstances. Because of the 'limited retention period' constraint, which eventually requires erasing data at the same rate as it was ingested, it becomes clear that Datomic excision is not a practical solution.

val_waeselynck19:04:43

> "just have a second database for anything you might be legally required to delete" is, to put it mildly, simply not convincing.
@U05092LD5 not sure I understand what your point is here - I'm not trying to convince anyone, just to share my solutions. Trust me, I'm also a stakeholder when it comes to both business and infrastructure. Again, I will write about that in more detail one of these days, but I think Datomic mostly has an edge over mutable databases here. Even with an SQL database, I don't think it's robust to approach this problem by just saying "I'll just null out the appropriate columns in the appropriate rows when the time comes", because your system should record the fact that 'this datum was erased for privacy reasons at this time etc.' At this point, the generic schema and reified transaction facilities of Datomic become an advantage in tackling this problem.
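
To make the "record the fact that it was erased" point concrete, here is a hedged sketch of tx data that retracts a key reference and annotates the transaction entity itself. The `"datomic.tx"` tempid is Datomic's way of referring to the current transaction entity; the `:privacy/*` attributes and the `:privacy.reason/gdpr-request` value are hypothetical and would have to be installed in the schema first.

```clojure
(defn erasure-tx
  "Builds tx data that retracts a key reference and records why.
  The time of erasure needs no extra attribute: Datomic stamps every
  transaction with :db/txInstant."
  [entity-id attr key-uuid requested-by]
  [[:db/retract entity-id attr key-uuid]
   ;; Hypothetical attributes asserted on the transaction entity.
   [:db/add "datomic.tx" :privacy/erasure-reason :privacy.reason/gdpr-request]
   [:db/add "datomic.tx" :privacy/erasure-requested-by requested-by]])
```

The retraction only removes the reference from the current database value; the actual sensitive value would still have to be deleted from wherever it is really stored.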

val_waeselynck08:04:19

> It's not clear to me what "Datomic excision is not really a practicing solution for data erasure" means?
@U05092LD5 I realize I made a typo, I meant practical.

hmaurer12:04:53

@U06GS6P1N hang on, so if I got this correctly, you are not required to erase all user data? you are only required to erase ways to identify the users?

hmaurer12:04:16

that’s a bit blurry though, since surely you could identify the users based on patterns in their data beyond their name/email

hmaurer12:04:49

e.g. if you are storing GPS location data on a user

val_waeselynck13:04:16

@U5ZAJ15P0 well there is always data related to a user that you need to keep, if only for bookkeeping, e.g. you won't delete the orders placed by a user. Regarding GPS data, this usually counts as personally identifying, just like IP addresses and cookies.

Wes Hall16:04:12

@U06GS6P1N I didn't mean that I don't know, as much as the fact that (as with most regulations like this) there are some grey areas that will probably get determined in later cases. The problem with the approach that you describe is that you have to get it right from the start. If some value that you didn't think would be included in the definition of personal data is later deemed to be included then you are fucked. If some dev on your team forgets to include the offloading of storage as they frantically hack towards a deadline, you are fucked. There is no "going back and fixing it later", which worries me. As it happens, I absolutely adore the Datomic model, and think that GDPR is mostly a shit-show, which, as usual, hasn't been validated against the real world, but the fines are simply too high to take the risk I think.

Wes Hall16:04:42

Incidentally, what is interesting is that having read some information about how people are dealing with backups (I'm pretty sure that nobody is going to restore every single backup in order to remove some piece of data on request), many people are suggesting that they are going to implement some filter mechanism such that if a backup is restored at any point, any data marked for deletion is removed during the restore process, rather than from storage. Quite a few people seem to think this constitutes "reasonable steps", so I wonder whether something like this can be applied to a live system of immutable record. If you could create some kind of datom filter and centralise it in the peer server... maybe that works.
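
A minimal sketch of the datom-filter idea on plain data, with no Datomic dependency. Datoms are modelled as maps and the `marked-for-deletion` registry is hypothetical. The Peer library does provide `datomic.api/filter`, which takes a database value and a predicate over datoms; whether that can be centralised the way the message suggests is not something this sketch demonstrates.

```clojure
;; Datoms modelled as plain maps for illustration; real Datomic datoms
;; are a different type. All values here are made up.
(def datoms
  [{:e 1 :a :contact/email :v "jane@example.com" :tx 100}
   {:e 1 :a :order/total   :v 42                 :tx 101}
   {:e 2 :a :contact/email :v "joe@example.com"  :tx 102}])

;; Hypothetical registry of [entity attribute] pairs marked for deletion.
(def marked-for-deletion (atom #{[1 :contact/email]}))

(defn visible?
  "True unless the datom's entity/attribute pair is marked for deletion."
  [{:keys [e a]}]
  (not (contains? @marked-for-deletion [e a])))

(defn filtered-view
  "Returns only the datoms that may still be shown or used.
  The underlying data is left in place; it is merely hidden."
  [ds]
  (filterv visible? ds))
```

Here `(filtered-view datoms)` drops Jane's email datom but keeps her order total and the other user's email, which is the "cannot be used" rather than "is not stored" variant of erasure.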

Wes Hall16:04:44

Law makers are unlikely to make the distinction between, "absolutely is not stored", and "absolutely cannot be used", but I suspect that if you have the latter thing properly implemented, you'd never get into trouble to the degree that you have to prove the first thing... but IANAL etc.

val_waeselynck17:04:37

> If some dev on your team forgets to include the offloading of storage as they frantically hack towards a deadline, you are fucked. There is no, "going back and fixing it later", which worries me.
Well, FWIW, I am definitely in this situation, and I do think I can fix it in time. Migrating the code to use a secondary store only took a few days (including some hammock time to come up with this KV store approach), on BandSquare's codebase, which is probably one of the biggest Clojure + Datomic codebases out there. Migrating the data will probably be more painful and require some downtime - maybe I'll do it via a sequence of massive excisions, maybe by rebuilding a new Datomic database at the application level - but it's definitely doable.