#clojure-europe
2020-12-05
dominicm07:12:57

My mind is on access control this morning. I'd love to hear the nitty gritty of your access control approach. Do you round trip the database? Use roles? Permissions? JWT assertions?

slipset08:12:40

@dominicm thanks for playing. Because of your interest, I’ll write up a blog post about it. I’ll let you know when it’s out, probably over the weekend.

slipset08:12:20

It won’t be going into the nitty gritty of the security related questions, so I’ll be happy to answer them here:

slipset08:12:10

We have users stored in mongo, so we do involve the database. We use friend and its workflows to deal with different authentication protocols like OAuth and SAML.

borkdude08:12:18

fwiw, we use the built-in role based stuff in yada

slipset08:12:01

A word of caution. Even though the common saying is that you should not roll your own security, I’d say that the quality of some of the offerings in the Clojure ecosystem is not great. saml2.0-clj 1.x is a leaky bucket, and monger-session-store uses read-string to deserialize session data.

slipset08:12:50

So we used to store all kinds of stuff in the session and stick it in mongo, but now we store only the user-id (IIRC) and stick it in redis instead.

slipset08:12:01

As for authorization, we have a somewhat crazy mix of roles and permissions on given entities in a system. This means that a pure role based authorization scheme doesn’t quite work for us. I came in 3 years ago, and the authorization system had by that time “evolved” “organically” for four years, so it’s quite messy and because of backwards compatibility it’s not straightforward to change it.

dominicm08:12:09

I've just been reading owasp recommendations. Doing role checks is considered bad practice. It sounds like checks should be permission based from day 1, with roles having permissions.
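A minimal sketch of "roles having permissions", where application code only ever asks about a permission and never about a role. The role names, permission strings, and function names are hypothetical and not taken from anyone's system in this discussion:

```python
# Hypothetical mapping from role to the permissions it grants.
ROLE_PERMISSIONS = {
    "reader": {"flower:read"},
    "editor": {"flower:read", "flower:write"},
}


def permissions_for(roles):
    """Return the union of the permissions granted by each role."""
    granted = set()
    for role in roles:
        # Unknown roles grant nothing.
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def can(roles, permission):
    """Check a permission; callers never test for a role directly."""
    return permission in permissions_for(roles)
```

With this shape, changing what a role may do means editing the mapping, not every call site that performs a check.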

dominicm08:12:45

@borkdude how do you handle "has permission to access this resource" as in "only the owner of the flowers can view the flowers"?

slipset09:12:32

Basically, to understand if a user has access to a VERB /foo/bar/:id you need to know which verb (which we do), and here we do our first role-based check: readers are (in general) not allowed to PUT, POST, or DELETE. Then, you have to check if the user has permissions (either through role, group, or granted permission) on the thing identified by :id to do the requested operation. This obviously involves the database.

slipset09:12:59

One of the problems we have with our role-based approach is that readers are being blocked from writing in our middleware, basically on a static whitelist basis (some urls are available for readers to write), which means that currently we can not give a user with a reader role write access by putting her in a group which has write-access.
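One possible way around the limitation described above is to compute the allowed operations as the union of role, group, and direct grants, instead of rejecting on role first. This is only a sketch: the in-memory dictionaries stand in for database lookups, and every name and data shape here is an assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str
    roles: set = field(default_factory=set)
    groups: set = field(default_factory=set)


# Hypothetical stand-ins for what would be database lookups.
ROLE_OPS = {"reader": {"GET"}, "writer": {"GET", "PUT", "POST", "DELETE"}}
GROUP_GRANTS = {("florists", "flower/123"): {"GET", "PUT"}}
USER_GRANTS = {("alice", "flower/7"): {"GET"}}


def allowed_ops(user, entity_id):
    """Union of the operations granted via roles, groups, and direct grants."""
    ops = set()
    for role in user.roles:
        ops |= ROLE_OPS.get(role, set())
    for group in user.groups:
        ops |= GROUP_GRANTS.get((group, entity_id), set())
    ops |= USER_GRANTS.get((user.user_id, entity_id), set())
    return ops


def is_allowed(user, verb, entity_id):
    """True if any source grants the verb on this entity."""
    return verb in allowed_ops(user, entity_id)
```

Here a user with only the reader role can still PUT to flower/123 once she is placed in the florists group, because no role check runs ahead of the group lookup.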

slipset09:12:06

So, it seems to me (which kind’a follows @dominicm’s owasp advice) that role based access to URIs is a bit too simple of an approach for the real world.

slipset09:12:36

Shit, I used the word simple. It has all sorts of connotations, should have probably used simplistic.

slipset09:12:52

As to the flowers, I guess authorization only becomes interesting once you throw business rules into the mix, and when you do so, role based access to urls ain’t gonna cut it.

slipset09:12:31

And, roles and groups are somewhat the same thing in my head. At least in some ways. The people who have a role form a group, so I guess roles could be implemented in terms of groups. but maybe not the other way around?

slipset09:12:51

As I’m sure you’ve realized by now, this is not my area of expertise.

dharrigan09:12:47

When I’ve implemented security, I modelled it upon Spring’s Security offering, which is also reflected in Apache Shiro.

dharrigan09:12:56

I use HMAC authentication and JWT too

slipset09:12:57

I’ve not looked much into shiro, but as far as I understand, shiro calculates up front the ids of the things a user has access to. In our system that upfront calculation could be expensive.

dominicm10:12:53

https://shiro.apache.org/realm.html#permission-based-authorization I think this implies a db round trip is made when you check permission.

slipset09:12:27

and if a user “only” wants to PUT /api/flower/123 why calculate all the other things she has access to?

dharrigan09:12:51

The way I’ve used it, is I have users, groups, roles in the db. Each user belongs to a role and a role belongs to a group (groups can have group permissions too).

dharrigan09:12:51

Every API request uses HMAC authentication - the user has to hit /api/authenticate first to get back the initial token. I store that user in a Redis cache (which by then has been enriched by their permissions). The JWT only contains the role and group they are in. When they then hit a protected URI, I do HMAC authentication, then I check their decoded JWT token against their user that I’ve looked up on Redis to discover their authorisation to do stuff. Only then do I let them through.
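A minimal sketch of the HMAC step only, using Python's standard `hmac` and `hashlib` modules. The message layout (method, path, body joined by newlines) and the function names are assumptions; the description above does not say what is signed, and the JWT and Redis lookups are left out.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over an assumed message layout."""
    message = method.upper().encode() + b"\n" + path.encode() + b"\n" + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

`hmac.compare_digest` is used rather than `==` so that the comparison time does not leak how many leading characters matched.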

dharrigan09:12:27

All the JWT tokens are signed and checked too.

dharrigan09:12:07

Security is hard 🙂

slipset09:12:05

Our problem is that the calculation of a user’s permissions is a somewhat hard problem.

slipset09:12:39

So, you might not have access to GET /api/flower/:id , but you might have access to it if you access it through GET /api/flower/:id?shop=myshop and you have been granted access to myshop

slipset09:12:14

Now, you could argue that this should be modeled as access to GET /api/shop/:shop-id/flower/:id and that there should be some hierarchical check somewhere which checked access to shop-id first.

slipset09:12:46

But anyhoop, the question of if you have access to the flower is not just dependent on its id.

dominicm09:12:33

Well, urls are meaningless really. /lastflower could be equivalent to /shop/10/flower/100, which could be equivalent to /dfbhudsghjljfdd

slipset09:12:48

Yeah, but I guess what I’m hinting at is that the effective permission in our system is dependent on a lot of things, so that it’s nearly impossible to calculate it up front.

dharrigan09:12:10

the second form GET /api/shop/:shop-id/flower/:id would have been more “restful” (admittedly each place seems to implement rest differently 🙂 ) But I feel your pain Erik! 🙂

slipset09:12:07

I’m with @dominicm here on the meaningfulness of urls :)

dharrigan09:12:21

Would it be possible to deprecate the first form and transition to the second form?

dominicm09:12:05

@dharrigan all 3 are rest. Urls are meaningless in rest.

dharrigan09:12:04

This one GET /api/shop/:shop-id/flower/:id at least encodes some meaning within it

dominicm09:12:02

As far as I recall, Roy Fielding didn't mention url hierarchies or meaning when he defined REST. I personally think encoding structure and meaning heavily into urls works against other rest principles.

dharrigan09:12:49

Empirically, I believe it goes with REST principles; the evidence shows that accessing a resource via an id is widely used.

dominicm09:12:47

But 1 is explicitly mentioned as an example of REST by Roy (kinda, he used "today's weather in Los Angeles" as well as "weather for Los Angeles on <date>", and later he uses "version of paper presented at conference")

slipset09:12:05

It’s anyway hard to model REST apis correctly and vast amounts of time can be spent discussing which is better, but the difference in value between two urls representing the same thing is minimal. So I rather approach this from the view that I don’t really care.

😄 3
dharrigan09:12:07

I used to care very passionately about whether to encode /v1 or whatever in the URL (I am for putting it within the Accept header)

dharrigan09:12:14

I gave up on that battle

borkdude09:12:40

I tried to use the :copy method with yada, but it didn't support it :/

slipset09:12:51

Another thing that this discussion brings to the table is that authentication is difficult to get right, but not a hard problem, whereas authorization is hard as the business rules grow.

dharrigan09:12:22

yes, totally - authentication is pretty easy and straightforward. Authorization is a hairy mess 🙂

dharrigan09:12:38

There’s always exceptions

slipset09:12:56

With respect to api-versioning, I read https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/ and then Rich came along and solved the problem by saying always be backwards compatible.

👍 3
borkdude09:12:31

Even in clojure core circles there's exceptions to this. E.g. clj made or will make some breaking changes as to not support two code paths indefinitely

slipset09:12:07

Yeah, but isn’t that allowed because either the thing is in alpha, which means anything goes, or it falls under fixation, i.e. a bug, and as such can be squashed? I.e. disregarding Hyrum’s law.

borkdude09:12:48

The command line thing clj was never marked alpha. Only the underlying tools.deps.alpha lib which is not visible to the user.

dominicm09:12:32

Sure, but I think it's still viewed as alpha despite the lack of communication/docs on that @borkdude

dharrigan09:12:51

That’s a great article

borkdude09:12:52

I'm not sure what you mean: clj is well documented?

dominicm09:12:53

Sometimes it seems like the only way to know whether something is really stable is to ask the author.

dominicm09:12:33

@borkdude usage is. But the fact it was built on something alpha has always made me think that breaking changes would be forthcoming.

dharrigan09:12:36

I’m just glad there are only 3 ways to specify a version…I would hate to think if there were 20 ways to specify an API version 🙂

borkdude09:12:38

Alex has said that he considered clj stable despite that tools.deps.alpha was marked alpha.

borkdude09:12:43

In interviews

dominicm09:12:57

Oh. But then, breaking changes. Well now I'm doubly confused!

borkdude09:12:58

In the ClojureScript podcast if I remember correctly

borkdude09:12:14

I think not-breaking should be considered in context: 1) Do you control all of the usage (i.e. internal API): then it's fine I guess. 2) Is your feature still in development, is it only used by early adopters? Maybe it's fine. 3) Is the feature widely established and used by the masses, probably not ok. 4) Does breaking only affect a minority, maybe it's ok.

👍 3
slipset09:12:57

That’s part of an argument I was planning on including in a talk proposal for the conj, in reply to a tweet by Stuart Halloway who was on Twitter saying it’s never ok to change the behavior of a name.

👍 3
dominicm09:12:34

@slipset I still want to know more about how he breaks codebases into libraries like he mentioned.

dominicm10:12:08

That tweet was really interesting

borkdude10:12:01

Personally I don't view such an utterance as absolute but just as something to think about. E.g. I can see where he's going. When your IDE allows advanced refactorings, it can be very easy to create nonsense boilerplate code and make breaking changes.

borkdude10:12:40

I read one example of a Haskell program that was developed in this way. It turned into a spaghetti monster type signature, but the author was even proud of this.

borkdude10:12:47

Advanced refactoring can lead to fast coding without thinking, while thinking should always be balanced with coding.

borkdude10:12:14

At least, this is my interpretation of Stu's tweet ;)

borkdude10:12:03

Personally I don't use a lot of refactoring tools. Maybe sort-ns is the only one. This may also be in part that I'm too lazy to set this all up.

dharrigan10:12:20

sort-ns is about the only one I use too 🙂

dharrigan10:12:49

I absolutely love the way writing in Clojure very strongly encourages me to write small functions and to think about what I write

dharrigan10:12:57

so if I have to change something, it’s very localised and small

slipset10:12:58

I use rename symbol quite a lot.

borkdude10:12:25

Renaming locals is a local change (by nature) and this is where refactoring tools are quite harmless.

slipset10:12:29

I think ‘extract function’ should burn in hell

borkdude10:12:28

I also use quite a lot of rg and just go through the list manually. If I need to find patterns that are more sophisticated than just names, I can use: https://github.com/borkdude/grasp

👍 3
slipset10:12:41

Renaming symbols inside some boundary which I control should also be ok. Which means that any fn I don’t expose to the outside of our company can be renamed. Which are all of them. The rest endpoints on the other hand....

borkdude10:12:07

This is one of the things I appreciated when I was in a company that used Java and Ruby. When I sat next to a Ruby guy, he just used vim and rg (or grep). This blew my mind: simple tools can get a long way.

borkdude10:12:59

slipset: yeah, that should also be ok, unless you're writing libraries that are used in multiple projects maybe, etc, context.

borkdude10:12:38

While on the topic, has anyone tried clojure-lsp in their editor of choice?

slipset10:12:04

Btw, totally off topic, but it feels safe to discuss stuff here. I dare express stuff that might be stupid. Thanks for creating such an environment.

👍 9
borkdude10:12:40

The feeling is mutual :)

borkdude10:12:07

While on the topic of safe environment: What if you're in a team and your team lead is leaving. Someone maybe has to step up to take the lead role, or you should hire a new person for this. Personally I just like software development and would like to reduce the number of meetings and manager-like tasks. I feel like going more towards the manager-type things is a bit of a trap that leads away from coding (the thing I currently enjoy) eventually. Some people may ambition this role (more responsibility, more money possibly, higher on the ladder). What are your thoughts on this?

borkdude10:12:49

Maybe this should also be viewed in context. Might differ per company, team, background in the domain, etc.

borkdude10:12:34

We're not actively recruiting yet, but it's likely that we are going to need a 1) team lead (we have a search product indexing biomedical literature) who can align business goals with tech 2) devops type person who likes to work with managed hardware and cloud. 3) UI/UX person (good at UX, but also good at making things look nice). Mostly Clojure, ClojureScript, NLP, AI, triplestores.

slipset10:12:53

I think this is a situation that any somewhat experienced dev will find herself in. As your experience grows, you voice your opinions and you somewhat become a leader, even if you’re not given the title. As such, I tend to request getting the tech-lead role, not because I want to be the dictator of all things tech, but because I want to have the authority to have the final word if there are discussions that never end and a decision needs to be made. How to structure rest-endpoints could be such a discussion. I’m also a person who enjoys product development, likes mentoring people, has an eye for UX (especially the stuff that doesn’t work) and as such would probably make for a great team lead. My problem with the team lead role is that it very often leads to a mix between a project secretary (do all the boring, but important things that the project lead can’t be bothered with, i.e. estimations, keeping track of progress) and kindergarten-auntie, the person that follows up that people do what they’ve promised to do. The latter part of the team-lead role can go away as your team matures and understands its purpose, but neither the project-secretary nor the kindergarten-auntie are things that I do very well, and certainly not in combination with coding and problem solving, which are the things that I really enjoy.

borkdude10:12:36

There is project-secretary but also the guy who is pulled into every meeting that's on the surface between your team and higher up or sibling teams.

slipset10:12:26

I’ve been listening a lot to the idealcast with @genekim lately, and a lot of the stuff from team-of-teams resonates with me. I could be a teamlead of a team of my choosing which is given clear objectives, but with freedom to decide how to achieve those objectives.

borkdude10:12:31

Btw, if any of the above roles appeal to you, you're welcome to reach out to me in private.

mpenet10:12:17

some companies do not necessarily map climbing the supposed career ladder with seniority and salaries

mpenet10:12:10

I personally hate managing people and the associated tasks (I was "CTO" for a few years), never again. But I also choose companies where dev "seniority" is valued

mpenet10:12:26

in our field I think it's more and more the norm

mpenet10:12:41

You don't have to go the manager route necessarily

mpenet10:12:19

then if it's a necessity because nobody is suited for it or you don't find a good fit it's another matter, especially if you have heavy stakes in the company

slipset10:12:06

The team-of-team stuff comes from the Navy SEALS, and while they do seem to have their management issues, they’re different from ours. They seem to be given somewhat clear objectives like “capture that guy”, and are then let free to figure out how to do that

mpenet10:12:09

I guess that's how we got to these "staff engineer" titles and the like

mpenet10:12:45

highly valued engineers that contribute to tech decisions while spending their day to day actually doing engineering tasks

borkdude10:12:07

I know there are (technically highly skilled) people who like building and empowering teams from scratch or reviving existing teams.

slipset10:12:09

In terms of software, one of my colleagues used to work on the Dragonfly project at Opera. The objective he and his team were given was “build the best javascript-debugger”

slipset10:12:40

He had a meeting once a year or so with his stakeholder. That was it.

borkdude10:12:14

That's perfect. Although I usually tend to seek more feedback from users than once a year. This is also a hard problem in our company: it is still trying to find out who its users are.

slipset10:12:48

He and his team got feedback from users (the users of the debugger) all the time on the internets. Just not from the stakeholders inside the company.

mpenet10:12:14

exoscale has a nearly flat hierarchy, small squads (like 5 people per), 1 squad lead per squad, that is sort of a speaker for the whole squad and coordination pt with higher hierarchy (not many levels up, basically they report to vp eng/cto directly). little meeting overhead. It's a pretty good setup

slipset10:12:22

So, in our company, the difference would be somewhat like “build x in the way the stakeholder has envisioned the thing on a timeline decided by someone”, versus “it seems like our users have this problem, please go spend a year or so solving that problem for them”

mpenet10:12:08

about dev vs ops, the lines are getting more blurry as time passes

slipset10:12:24

and of course, given the second approach, you’d be reporting back at regular intervals as new findings are discovered.

mpenet10:12:14

It's more and more a dev role to know/do ops nowadays. You have to plan accordingly when you dev anyway so you get to know most of the "ops" tasks. SRE is something else

slipset10:12:44

The second approach also requires a cross functional team and acknowledging that not all team-members will be 100% utilized all the time.

borkdude10:12:34

Whatever you want to call it: our software currently runs on bare metal, on servers in a rack that belong to the team lead who left. We have to migrate this to other managed hardware and/or cloud. He will give us the time we need, so it's not a panic operation, but it requires work and expertise.

mpenet11:12:12

Right, tricky situation

borkdude11:12:20

Everything is dockerized and currently runs on docker swarm. Porting to another system should not be a problem. It's more about requirements in terms of CPU, GPU and RAM.

borkdude11:12:07

And choosing what is cost-effective

genRaiy11:12:53

may I add to the earlier team discussion, that I detest the notion that the Navy Seals or any other “elite” military group should be the inspiration for writing software or working in teams. It feels gross and I particularly dislike the salivating over their “learnings” and experience in Iraq 🤮

☝️ 6
slipset11:12:35

May I ask why?

genRaiy11:12:45

mostly cos of the killing but also because their objectives are often illegal

genRaiy11:12:04

they are the ultimate rogues

genRaiy11:12:36

and I was wondering whether that is why Facebook etc. is often lauded for its rogue behaviour

genRaiy11:12:23

the culture of assassination teams is not something that I feel comfortable about owning

genRaiy11:12:38

I hope that clarifies 🙂

slipset11:12:14

It does, but it does raise a whole bunch of questions in my head. And I really appreciate that, because it forces me to think about stuff.

genRaiy11:12:10

by rogues I don’t mean the ‘bad boy’ style I mean actual war criminals oppressing and openly killing and torturing civilians

slipset11:12:32

My first thought is that I think that it could be argued that our whole field is based on needs of the military. And that killing is killing.

genRaiy11:12:39

which is not quite Facebook but they are on a continuum

slipset11:12:02

The second thought is if it’s wrong to observe highly effective units and learn from their organization.

slipset11:12:19

Example: From team of teams, they state that the time from sighting to killing went from 72 hours to 40 minutes. This is in one way horrible, but it’s also a fairly clear example of an organization that evolved in a way that could be desirable. Why not learn from what that organization did even though you disagree with the outcome of the organization?

slipset11:12:09

But, I appreciate that there can be different views on this, and also that it’s more interesting thinking about your viewpoint than trying to convince you that mine is correct.

slipset11:12:35

FWIW, I think the SEAL teams have quite strict rules of engagement.

genRaiy12:12:27

For me, the notion that reducing sighting -> killing is a transferrable metric is weird. And I had to turn off the IdealCast when he was interviewing the authors because the sycophancy towards the military was just too much. The war on Iraq was an illegal act which left 100s of 1000s dead and the whole region in turmoil. Thinking about it like that provides better lessons IMHO.

slipset12:12:13

Did you listen to the episodes with/about Rickover? Are the learnings from how they built the first nuclear subs lessons we should learn from?

slipset12:12:08

Not really an interesting question, sorry.

slipset12:12:56

I guess one problem is that post Iraq, we have a bunch of SEALS/Special Ops people who try to make a living as management-consultants.

🤢 3
slipset12:12:13

I really appreciate your views though, as I haven’t thought about it this way before, but I guess the story telling for simple minds like myself is appealing: “Here’s an example of these elite dudes operating efficiently in a difficult environment” I guess what we need is similar stories, but from other fields.

genRaiy12:12:56

I’m getting a headache just thinking about it tbh

slipset13:12:14

Thinking in general does that to me.

😂 3
orestis15:12:35

Nice discussion. I share @raymcdermott’s feelings about borrowing terms from the military. After a long solo career I’ve found myself leading a team for 1 year now, including making hard decisions like firing, but also hiring and onboarding new members. The number one thing I’m going for is fostering safety and comfort so that people enjoy showing up at meetings and sharing their thoughts, and also offering up their code for review without any hesitation.

orestis15:12:07

Other than that, coordinating with the “business” and trying to figure out what to build next and how to go about it (a bit of PM work there), indeed doing “housekeeping” work like taking notes and calling meetings... I still code like 75% of the time though.

orestis15:12:45

Regarding authorization, I think REST and URLs muddy the waters. We’re using graphql which has fine-grained mutations so authorization is easy to define (can user x do action y on resource z). It almost always involves DB access.

orestis15:12:50

For reading, we usually push authorization down to the query level, as in we try to encode rules in an sql/mongo query. Sometimes you need a query to do that though :)
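A minimal sketch of pushing a read-authorization rule into a MongoDB-style filter document, so the database only returns what the user may see. The field names `owner_id` and `shared_with_groups` and the function name are hypothetical; `$and`, `$or`, and `$in` are standard MongoDB query operators.

```python
def authorized_filter(user_id, group_ids, base_filter=None):
    """Combine a caller's filter with an access clause for this user."""
    access_clause = {
        "$or": [
            {"owner_id": user_id},
            # sorted() just makes the output deterministic.
            {"shared_with_groups": {"$in": sorted(group_ids)}},
        ]
    }
    if not base_filter:
        return access_clause
    # Both the caller's condition and the access clause must hold.
    return {"$and": [base_filter, access_clause]}
```

The resulting dictionary could be passed as the filter argument to a find call, so that authorization and selection happen in a single query.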

borkdude15:12:48

@orestis I think it totally depends on the company and context. I don't have a strong enough background in the domain and don't know the market well enough in which the product is operating, so I don't feel like the right person to lead a team in this context. In other contexts I might feel comfortable doing so, not sure.

orestis16:12:02

It’s a tall order to both make technical decisions and know the market or have a background in the domain... in my case I just make sure I talk a lot to people who do and ask a lot of questions to figure out needs etc.

orestis16:12:06

From time to time we will hold planning meetings and for that I will usually have ready some proposals of things that have both been raised in the past and within technical reach. We are in the process of moving away from a terrible legacy system to a new code base and database so not everything we’d like is possible, but there’s a long term plan going on in the back of my mind to guide us through.

orestis16:12:27

Oh and I usually have to repeat and repeat and repeat myself - in presentations and meetings and conversations, both because people forget but also because it’s common for myself to forget to mention details that have changed over the course of time :)

borkdude16:12:09

What kind of product are you making, if that's not too secret?

orestis16:12:55

Not hugely challenging from a technical point of view, but definitely challenging in other ways :)

borkdude16:12:07

That's one of the things. Some parts in our stack are a bit over my head; our previous team lead had all kinds of ideas about AI solutions. I'm not that kind of person (so context).

orestis16:12:11

Yeah I feel that. I would like to hire some expert for some ML/AI stuff too since I’ve never done it and it seems promising, but hiring for an area you’re clueless in is so difficult.

borkdude16:12:51

Luckily we already have this expertise. But none of the team is probably desiring the lead role.

borkdude16:12:19

Anyway, thank you for letting me express my worries and thoughts here ;)

orestis16:12:24

Anytime, I love this channel 🥰

slipset16:12:30

@orestis sounds like you should hire an AI/ML consultant for a while to map out the possibilities. Hiring one without really knowing if you’ll need one seems strange.

slipset16:12:41

Or, you could hire me.

slipset16:12:49

I offer No as a service

orestis16:12:04

It’s also a personality type. I abhor a vacuum of leadership so unless someone plays that role I will push myself to assume it, at least transitionally :)

orestis16:12:11

Haha 😂

slipset16:12:22

The deal is, you ask a question, like “Do we need AI to solve our problem”, I answer “No”

slipset16:12:36

I’m not very expensive.

borkdude16:12:40

No as in YAML no, or boolean no?

slipset16:12:11

No as in No, Not no as in Norwegian 🙂

slipset16:12:11

I can also recommend the services of my slightly less serious friends http://hahahaha.no

orestis16:12:06

Ok so here’s the problem. We have small corpora of documents, like ranging from dozens to perhaps tens of hundreds (each client is completely independent from others). Within those corpora, we need to suggest similar documents, but also take into account user activity (you read this and that so you might like to read that thing too). Also clustering and topic extraction. Off the shelf solutions we’ve tried usually give disappointing results. What field of informatics would help there?

borkdude16:12:04

@orestis We already have that in our stack

orestis16:12:18

Each corpus is isolated and business specific so it contains unique jargon. And we can’t spend our time cleaning up text, it has to be completely hands off...

borkdude16:12:58

Check out https://covid19.doctorevidence.com/. This is fully built on top of medical ontologies paired with NLP and AI stuff

orestis16:12:27

I’d love some suggestions on your NLP and AI stack :)

borkdude16:12:49

similarity you can do when you know what features to extract. we use: https://milvus.io/

borkdude16:12:02

you basically have to build a vector for each document

borkdude16:12:23

and then this thing will give you suggestions based on vectors you put in it as the prototypical documents
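A minimal pure-Python sketch of the "one vector per document" idea. This is not the Milvus API, and TF-IDF is only an assumed choice of features (nothing above says which features are used). For corpora of dozens to hundreds of documents, an in-memory version like this can be enough to experiment with. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, index):
    """Indices of the other documents, most similar to docs[index] first."""
    vectors = tfidf_vectors(docs)
    others = [j for j in range(len(docs)) if j != index]
    return sorted(others, key=lambda j: cosine(vectors[index], vectors[j]),
                  reverse=True)
```

A dedicated vector store takes over the nearest-neighbour search once the corpus or the vectors become too large for this brute-force comparison.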

orestis16:12:35

The funny thing is that this is not our core thing, just a super super nice to have and impressive sales demo... so I can’t just yet justify spending a ton of time when other things are definitely doable and more important.

borkdude16:12:52

For NLP we use StanfordNLP. Mind that its license doesn't let you distribute your software, but SaaS should be ok

borkdude16:12:56

@orestis I guess you are also using elasticsearch right?

orestis16:12:54

We are currently using SolR but will migrate to ES soon.

borkdude16:12:58

Do you also use postgres perhaps?

orestis16:12:06

Gotta run, the AI has woken up and demands feeding and interaction and a nappy change :) thanks for the talk, I might ask again about these things next year!

orestis16:12:22

We’re running on RDS so no zombodb for us :(

val_waeselynck18:12:15

@orestis to gain a better general background on this problem domain, I'd recommend reading An Introduction to Information Retrieval : https://nlp.stanford.edu/IR-book/information-retrieval-book.html It's a bit dated to be state-of-the-art, but OTOH it's really one of the few very insightful science books out there, so you probably won't lose your time reading it 😉 What's more, because your documents contain business-specific jargon, you probably won't benefit much from the more modern NLP models, which rely on representations learned through unsupervised learning on very large corpora of ordinary text. All these classic algorithms work by turning corpora into quantitative representations that are either geometric (Tf-Idf) or probabilistic (BM25, Latent Dirichlet Allocation). Be aware, though, that you are engaging into the domain of ML/IR problems, which are inherently harder than business logic problems. If off-the-shelf tech doesn't work right away, I wouldn't count too much on achieving something impressive in a short time. It's an unfortunate reality that this sort of problem seems both easier to laypersons than ordinary information-processing, and is actually much harder on programmers.

val_waeselynck18:12:47

One of the great upsides of small corpora is that you can feel free to use algorithms that don't scale linearly in corpus size, such as certain clustering algorithms, and that you can do everything in memory without unwieldy Big Data infrastructure.

borkdude18:12:46

> Be aware, though, that you are engaging into the domain of ML/IR problems, which are inherently harder than business logic problems. If off-the-shelf tech doesn't work right away, I wouldn't count too much on achieving something impressive in a short time. It's an unfortunate reality that this sort of problem seems both easier to laypersons than ordinary information-processing, and is actually much harder on programmers.

True. I also feel more comfortable with business logic because I can understand it. With AI/ML it's always: yeah, now it's 95% accuracy, tweak this parameter, now it's 94%, use this algorithm, 96%.

borkdude18:12:01

And then the questions from the business come: can you tweak it so this result ends up higher? Yes, we can, but it has this consequence on the other results ;)

val_waeselynck19:12:58

That, and you never really know when you're done. You typically can't write a test suite for your code that will give a reliable binary answer to the "does it work?" question.

otfrom19:12:23

I'm really happy with the community we're building on this channel too

❤️ 3
otfrom19:12:28

One of these days I'll do a little personal NLP project and build a remembrance agent

val_waeselynck20:12:53

@otfrom «remembrance agent»? What does that mean?

orestis20:12:38

Thanks for the book rec @val_waeselynck , there seems to be an online edition too. I’ve absorbed most of the terms through random walks but I vastly prefer books.

orestis20:12:55

I hear you on people expecting this to be easy. The bright side is that all of our competitors’ “related content” features are crap too 😅

orestis20:12:38

Having spent some time thinking about how to solve this, I think what we’ll end up doing is building a “markup tool” where authors of content could annotate and highlight keywords, topics, summaries etc. I believe it’s a necessity to “show your work” in this kind of context so black box algorithms are not going to cut it...

orestis20:12:34

I’m rambling a bit cause it’s late but what I mean is - the chances of a team of 3 building a cutting-edge topic extraction algorithm as a side project are pretty slim, and it seems like our original content is also crap - typos, grammatical errors, jargon etc. So even the best NLP software can’t deal with those unsupervised.

orestis20:12:29

So I’m thinking of a very basic weighted search algorithm that operates on rich data that people provide. No fancy ML perhaps but probably attainable :)
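One way such a basic weighted search could look, as a rough sketch only. The field names (`keywords`, `topics`, `summary`), the weights, and the document shape are all hypothetical, chosen for the example. Each result carries the terms that matched per field, so the ranking can "show its work" instead of acting as a black box.

```python
# Hypothetical weights: a hit in an author-supplied keyword counts more
# than a hit in a topic, which counts more than a hit in the summary text.
FIELD_WEIGHTS = {"keywords": 3.0, "topics": 2.0, "summary": 1.0}


def score(query_terms, doc):
    """Return (total score, {field: matched terms}) for one document."""
    total = 0.0
    matched = {}
    for field, weight in FIELD_WEIGHTS.items():
        values = doc.get(field, [])
        if isinstance(values, str):
            # Free-text fields are split into words.
            values = values.lower().split()
        hits = sorted(query_terms & {v.lower() for v in values})
        if hits:
            matched[field] = hits
            total += weight * len(hits)
    return total, matched


def search(query, docs):
    """Rank documents by weighted matches, best first, dropping non-matches."""
    query_terms = set(query.lower().split())
    results = []
    for doc in docs:
        total, matched = score(query_terms, doc)
        if total > 0:
            results.append((total, doc["id"], matched))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

The `matched` dictionary is what would be shown next to each result to explain why it was ranked where it was.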

paul.legato21:12:36

For some values of “similar documents,” you can get quite acceptable results with relatively basic statistically unlikely word match algorithms. Something like: throw out stopwords (very common words), maybe stem the remainder (remove grammatical inflections), and rank other documents by the degree to which their words overlap. Understanding the semantics is surprisingly unnecessary for many (but not all) problem domains. A next level enhancement is extending this with word proximity / ngrams: sequences of n words that appear together. Annotations / markup can be helpful if people actually use them. The main issue I’ve run into there is that few users will actually bother to do the annotating.
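The steps above can be sketched roughly like this: drop stopwords, stem crudely, add bigrams for word proximity, and rank by overlap. The tiny stopword list, the suffix-stripping `stem`, and the use of Jaccard overlap are assumptions for illustration; a real setup would use a proper stemmer and would also weight words by how rare they are, which this sketch leaves out.

```python
# Assumption: a toy stopword list; real lists are much longer.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}
)


def stem(word):
    # Deliberately crude suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lowercase, strip punctuation, drop stopwords, stem what remains."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOPWORDS]


def features(text, use_bigrams=True):
    """Set of stemmed words, optionally extended with adjacent word pairs."""
    tokens = terms(text)
    feats = set(tokens)
    if use_bigrams:
        feats.update(zip(tokens, tokens[1:]))
    return feats


def jaccard(a, b):
    """Overlap of two feature sets: |intersection| / |union|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(query, docs):
    """Return (score, document) pairs, most similar first."""
    query_feats = features(query)
    scored = [(jaccard(query_feats, features(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Note that nothing here understands what the words mean; documents rank highly only because they share words and word pairs.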

paul.legato21:12:34

Some people bypass this whole process by just dumping everything into Elasticsearch / Lucene. It can get expensive to operate when the document set is very large, but OTOH the time to get something running that’s approximately useful is much smaller than trying to build your own.

borkdude21:12:29

> The main issue I’ve run into there is that few users will actually bother to do the annotating.

Yes, this is where AI often comes in, for tagging things into buckets

paul.legato21:12:31

Another non-DIY approach is to use Postgres’ built-in full text search engine or similar software. Search for the first heading in the current document (or first sentence, paragraph, whatever is useful in your specific document type), and get a ranked list of similar documents back.

borkdude21:12:18

@paul.legato I see you are in a postgres-as-a-service company. Do you offer solutions with plugins like ZomboDB supported and perhaps also a connected ElasticSearch cluster?

paul.legato21:12:41

Oh, I haven’t been involved with that company for years now! Can I ask where you saw that, so I can update it? 🙂

paul.legato21:12:07

The short version is: Google and Amazon built identical hosted Postgres services and started selling them at a massive loss, in a price war with each other.

paul.legato21:12:56

We did stuff like that, yes

borkdude21:12:13

Oh I see you haven't updated your profile in a while on Twitter, sorry for not looking better ;)

paul.legato21:12:29

Heh, no worries. I rarely use Twitter personally