This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-09-13
I feel like a concept I've picked up from the clojure community is that libraries are generally superior to tools, and I think I gleaned this pretty directly from a clojure talk or discussion but I can't seem to figure out where I first was exposed to that idea. Maybe a Stuart Halloway or David Nolen talk? Does this sound familiar to anyone? (This would predate tools.build, which maybe also stems from the same concepts, by helping create tools as a light layer over libraries)
oh that's interesting. i've seen "libraries over frameworks" so many times it's almost a meme at this point, but not "libraries over tools"
I heard a guy that was talking about functions being chisels/tools. Maybe that one? https://www.youtube.com/watch?v=ShEez0JkOFw
I'm trying to scrape the "valuable" content (sans comments, ads, etc) from 10,000s of articles using Clojure. Currently I have all of the URLs I need to input. Anyone have any recommendations for libraries to help with the actual scraping? So far I'm thinking of using enlive to parse the HTML, then feed that into Kotlin's chimbori to get all of the "valuable" content.
@U03QQS7341W It depends on the articles. There are 2 main stages to deal with:
A. Fetching the content
You may be able to use (map slurp urls) or https://github.com/nathell/skyscraper, or something more complex that can emulate a real (Chrome) web browser.
B. Extracting the content
If the articles are all from the same source then you may be able to use enlive. If they are from different sources then you may need to develop your own heuristics for guessing where the article text is (e.g. looking for an <article> tag or div class names).
Since the number of articles is small (10,000s) it might be best to use a scraping API (e.g. https://www.scraperapi.com/pricing/ or https://www.diffbot.com/pricing/). Some APIs will even extract the article text for you.
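A minimal sketch of the two stages above, using only clojure.core and clojure.string. All function names here (fetch-page, article-html, strip-tags, scrape) are hypothetical, and the regex-based <article> lookup is a deliberately crude stand-in for the heuristic described; a real HTML parser such as jsoup or enlive would be more robust.

```clojure
(require '[clojure.string :as str])

(defn fetch-page
  "Stage A: fetch one URL. Returns {:url .. :html ..} on success or
  {:url .. :error ..} on failure, so one bad URL does not abort the batch.
  fetch-fn defaults to slurp; pass a stub to avoid network access."
  ([url] (fetch-page slurp url))
  ([fetch-fn url]
   (try
     {:url url :html (fetch-fn url)}
     (catch Exception e
       {:url url :error (.getMessage e)}))))

(defn article-html
  "Stage B (crude heuristic): inner HTML of the first <article> element,
  or nil when the page has none."
  [html]
  (when html
    (second (re-find #"(?is)<article\b[^>]*>(.*?)</article>" html))))

(defn strip-tags
  "Replace tags with spaces and collapse whitespace."
  [html]
  (-> html
      (str/replace #"(?s)<[^>]+>" " ")
      (str/replace #"\s+" " ")
      str/trim))

(defn scrape
  "Run both stages over urls. Pages without an <article> get no :text key."
  [fetch-fn urls]
  (mapv (fn [url]
          (let [{:keys [html] :as page} (fetch-page fetch-fn url)
                body (article-html html)]
            (cond-> (dissoc page :html)
              body (assoc :text (strip-tags body)))))
        urls))
```

Calling (scrape slurp urls) would hit the network; passing a map from URL to HTML string as fetch-fn lets you try the extraction step offline first.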
I hope you are respecting the copyright for whatever you are doing with all those articles
@UJVEQPAKS So for A. it seems to be working ok with slurp. I looked at skyscraper, but it seems to be much more focused on scraping a lot of stuff from one site. It doesn't seem to offer much advantage over just slurp + jsoup parsing. I'm slightly concerned that I might have to use something that emulates a browser, since I'm not sure how many sites are SSG or SSR. I also don't know what there is in Clojure for that. For B. I'm using crux (https://github.com/chimbori/crux) and it seems to be working decently. I'm hesitant to use a paid solution at this stage. They're all pretty pricey and I think my homebrew solution can be just fine. @U064X3EF3 Yeah it probably falls under the fair use act.
If you need to emulate a browser then you may need to use "web drivers". There's a clojure wrapper here: https://github.com/clj-commons/etaoin
Web pages with interesting content commonly put stuff into structures that have a common 'path suffix' -- that is to say, if you index all elements in a page by child-element-offset-into-parents, repeated structures like table rows or react components have recognizable offset patterns. For example, a jsoup (via hickory) code fragment that converts an html table of wikidata properties into a vector of maps could be like: https://gist.github.com/jbouwman/ac8b9b9f92b7a0ea9c56e5983ed1ae8d
I've had a good experience parsing HTML with meander (the data structures, not text)
@U03NJ5N4JTZ yeah this would make sense if I was parsing articles on the same site. In this case I'm not though. There are probably high 100s of unique urls.
some websites may be quite “defensive” towards bots… if necessary, add some self rate-limiting and/or an IP address rotating proxy etc? (though these kinds of things are part of those cloud APIs’ value-add)
Probably not a problem here since it's a 1-3 articles per domain situation, as opposed to scraping the whole site.
I am a little tired of open-source stuff... while working on Atom (to survive the sunset) I already got into so many arguments about meaningless things like "code style", "number of reviews", or "PR size" that I'm honestly starting to feel this is all a bad idea... worse yet, we're not in a shape where we can discuss these things - if GitHub pulls the plug on all Atom servers, we would not survive yet, so it feels even more meaningless...
Can you help me understand why code style, number of reviews, and PR size are meaningless?
I can relate to how you're feeling
Ya, I started my OSS contributions on a non-Clojure project. To phrase it nicely, it was not a good fit for me, had to walk away.
I feel for you @U3Y18N0UC, you’ve put a ton of effort and love into the world of Atom.
@Jon Boone I do not pretend to speak for Mauricio, but perhaps he meant that the number of discussions on those topics, and the number of disagreements, exceeded his patience.
@U025L93BC1M when the project has had more than 2 years of neglect, doesn't run natively on modern OSes, has three or four active contributors who don't exactly know how the codebase works, and will actually stop working months from now, and you still are not in a state where this will not happen, putting so much energy into these arguments sounds meaningless to me
It's like arguing which kind of color you will paint the kitchen, when the house is on fire...
To me it sounds like you need to put it onto a board and let those people decide what they value more. Also add pros and cons to each card so that they'd see that "code formatting" solves "I'm frustrated" and not "I can't run it" errors.
@U3Y18N0UC that is an unfortunate situation for the project to be in. It seems to me that many contributors are likely experiencing burnout.
Yeah, the board and voting are already things we're doing, that's how these discussions appeared
Much depends on the background of the folks involved and how large the project is. I've worked on open source projects for about thirty years now and some are really fun and easy going but some genuinely feel like work, with all the overhead you might expect in a large corporate environment.
I'm not terribly surprised by what you say about the Atom project, given its history...
Makes me appreciate the Clojure community even more. Lots of great folks and fun projects to work on.
management-oriented people tend to believe that good process -> good quality. Actually it is the reverse: good quality -> good process. Also, somebody would say "Individuals and interactions over processes and tools", but what would I know 😆
Feel ya! Everyone wants a vote, getting consensus can be a huge pain, that's why a lot of projects go the BDFL route, actually managing a community effort, getting people to agree on process, to all feel included in decision making, that's a huge undertaking that's just soul sucking and not what any programmer wants to be doing, they want to be doing programming, delivering working software, not managing issues and running discussion panels. Be careful, you can easily have your energy sapped away by all this stuff which isn't what you actually want to work on, and that can burn you out.
Had to look up BDFL: https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
Yeaaah, I honestly sometimes wonder if it would be easier to just ditch it all and work on https://gitlab.com/clj-editors/saturn... sometimes I'm really not sure if doing this would be easier or harder
how usable is this right now
Not at all, you can open and edit files, and nothing more. No save, no language integration of any kind, no API, no plug-ins...
Looking for references to clojure-specific discussions of the following problem: how to design a system that minimizes the complexity of rewriting the system to migrate from one backend database, file system, Network protocol, or interprocess communication mechanism to another.
@UDF1WUJTH Generally if you want to be able to swap layers out, dependency injection is a good solution. There's a brilliant (small and simple) Clojure library called https://github.com/weavejester/integrant that can do this and you can read its README for details. Alternative dependency injection libraries are available too. Another option is writing wrappers/abstractions around the layers (possibly using protocols or multimethods) so you can swap out the implementation. At https://sevva.ai we use Dependency Injection + Protocols to allow us to swap between elastic/mongo/kafka and more without any code change.
Thank you. I'm already aware of integrant and protocols. To clarify my question: I'm interested in what kind of protocols we need to be designing in the first place. For example, what kind of protocol would we need when switching a back end from a local file or local process communication to a back end which communicates over a network? The increased latency may have impacts on the shape of the code itself that calls the backend service. What kind of protocol would make this switch less complex and avoid having to rewrite the entire system?
> Thank you. I'm already aware of integrant and protocols.
@UDF1WUJTH Good, we're on the same page there.
> I'm interested in what kind of protocols we need to be designing in the first place.
Protocols/abstractions have negatives/downsides as well as positives (there's no free lunch). To me the key downsides to Protocols/interfaces are:
1. Potential leaky abstraction
2. Likely prevent access to full capability of underlying implementation
3. Indirection causes complexity/friction for developers grokking flow/logic of system
I would probably try to design your protocol approach to minimise/mitigate those 3 downsides. Possibly making the protocols as small as possible (YAGNI) then revising/extending them when necessary, and also modelling your protocol on the design of the primary implementation to reduce learning time for the team. Another option is to not abstract at all and just refactor the code later (e.g. using intellij bulk refactoring features) if you only need one implementation at a time.
You don't even need protocols, just to abstract things behind functions in a namespace.
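A minimal sketch of the small-protocol approach discussed above, assuming a key-value style backend. The names KeyValueStore, MemoryStore, EdnFileStore and remember-greeting! are hypothetical and not taken from any library mentioned in this thread; the file-backed version rereads the whole file on each call, which is only reasonable for a toy.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Deliberately tiny protocol (YAGNI): extend it only when a caller needs more.
(defprotocol KeyValueStore
  (put-value! [store k v] "Store v under k. Returns nil.")
  (get-value [store k] "Return the value under k, or nil."))

;; In-memory implementation backed by an atom holding a map.
(defrecord MemoryStore [state]
  KeyValueStore
  (put-value! [_ k v] (swap! state assoc k v) nil)
  (get-value [_ k] (get @state k)))

(defn- read-edn-file
  "Read the map stored at path, or {} if the file is missing or empty."
  [path]
  (let [f (io/file path)]
    (or (when (.exists f) (edn/read-string (slurp f)))
        {})))

;; File-backed implementation: the whole map lives in one EDN file.
(defrecord EdnFileStore [path]
  KeyValueStore
  (put-value! [_ k v]
    (spit path (pr-str (assoc (read-edn-file path) k v)))
    nil)
  (get-value [_ k]
    (get (read-edn-file path) k)))

(defn remember-greeting!
  "Caller code depends only on the protocol, not on a concrete backend."
  [store user]
  (put-value! store user (str "Hello, " user))
  (get-value store user))
```

Swapping backends then means constructing a different record, e.g. (->MemoryStore (atom {})) versus (->EdnFileStore "greetings.edn"), while remember-greeting! stays unchanged. The same shape works with plain functions in a namespace instead of a protocol.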
But I think my advice in general to this is, what you have to do is extract the logic that is irrelevant to those things into pure components, and make the infrastructure something built around it.
Here's a small example: https://github.com/didibus/clj-ddd-example
In this code, all the logic is encapsulated into the domain model and services as pure components. The database is then abstracted away, but only to the logic, not to the application. The Application Service is aware of transaction constraints and other such things from the particular database implementation in use. Then you could swap the database for another by only changing the Repository: https://github.com/didibus/clj-ddd-example/blob/main/src/clj_ddd_example/repository.clj and a little bit of the ApplicationService: https://github.com/didibus/clj-ddd-example/blob/main/src/clj_ddd_example.clj
So in a sense, don't focus on how to abstract the database so much as focus on how to make everything else independent and reusable and decoupled from the infrastructure. You do that by implementing it where it simply isn't allowed to use or know anything about the infra in question.
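A minimal sketch of that idea, with pure domain logic in the middle and the infrastructure passed in from outside. This is not code from the linked clj-ddd-example repository; transfer, transfer!, the account maps and the atom-backed repository are all hypothetical names chosen for illustration.

```clojure
(defn transfer
  "Pure domain logic: takes two account maps and an amount, returns the
  updated pair. Knows nothing about where accounts are stored."
  [from to amount]
  (when (> amount (:balance from))
    (throw (ex-info "Insufficient funds" {:account (:id from)})))
  [(update from :balance - amount)
   (update to :balance + amount)])

(defn transfer!
  "Application service: the only place that touches storage.
  load-account and save-account! are injected, so swapping the
  database means passing a different pair of functions."
  [{:keys [load-account save-account!]} from-id to-id amount]
  (let [[from' to'] (transfer (load-account from-id)
                              (load-account to-id)
                              amount)]
    (save-account! from')
    (save-account! to')
    [from' to']))

;; One possible infrastructure: an atom standing in for a database.
(def db (atom {1 {:id 1 :balance 100}
               2 {:id 2 :balance 20}}))

(def atom-repo
  {:load-account  (fn [id] (get @db id))
   :save-account! (fn [account] (swap! db assoc (:id account) account))})

(transfer! atom-repo 1 2 30)
```

Here transfer can be tested and reused without any database at all, while anything storage-specific (transactions, retries, latency handling) would live in transfer! and the repository functions.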
https://polylith.gitbook.io/polylith/ might be helpful
To clarify the comment about Polylith, it has the concept of "interfaces" (namespaces with a specific set of functions) and swappable implementations. We leverage this at work to have an http-client component with two completely different implementations: one based on Java 11's built-in client and one based on http-kit's client that will run on Java 8 to support one legacy app we have. We also have an i18n component with two implementations: one based on our database/content management system and one based on local JSON files so our UI/UX person can run an email template preview app without needing a database 🙂