#off-topic
2022-10-17
Drew Verlee00:10:44

Those of you that tried OLED screens for software development, would you recommend it?

Drew Verlee00:10:19

This video has a pretty good intro to the pros and cons of OLED https://www.youtube.com/watch?v=zzufMuiGNFA

Drew Verlee00:10:51

though according to this, there aren't really any cons.

Drew Verlee00:10:02

Though I'm still hearing accounts of burn-in, so maybe not a great idea

Drew Verlee00:10:04

Yeah, apparently you can kinda use an OLED TV as a monitor, but it requires some workarounds. https://www.youtube.com/watch?v=BLCCZQxQ3mM

pip07:10:32

I’d say it works; a display around 26”–30” is also usually perfect for me

pez06:10:19

Can't wait til he learns about #squint https://www.youtube.com/watch?v=Uo3cL4nrGOk

😂 9
gotta_go_fast 1
🔥 1
pez06:10:05

> This is not production code, although it will be tomorrow 😂

😂 5
🔥 1
lread11:10:39

Comedy gold!

pez12:10:23

I loved their latest piece about ffmpeg too. Probably is even more fun for real ffmpeg hackers, but I do abuse it frequently.

Cora (she/her)19:10:38

the python one is great, too

🐍 1
respatialized19:10:30

https://openjdk.org/projects/leyden/notes/02-shift-and-constrain The first prototype of Leyden (which is, as far as I understand it, a set of more "static" compilation features for JVM programs that are seeking better startup and peak performance) is now available at https://github.com/openjdk/leyden, along with some interesting design notes on the approach the project is taking.

🙏 1
Ben Sless03:10:15

Think they'll ever get macros and lightweight staging?

genekim13:10:00

https://twitter.com/graalvm/status/1581973515099860992 It’s been amazing watching the progress of the Graal team — the surprise is how much performance they’re extracting from AOT, increasingly beating JIT on a subset of workloads. I was surprised how little Graal was mentioned in the Leyden paper, given that they seem to be going in similar directions.

👀 1
genekim15:10:20

I was super curious about this — watched this video from Andrew Dinn on Project Leyden, and it talks about the extensive work required to make Graal AOT work, and the desire to make it safer, with more assurances to ensure equivalence. Dinn apparently did lots of work to enable key pieces to make Graal AOT happen, and characterizes some parts as scary, “iffy — but iffy works.” And a desire to make it less scary. Neat mind-expanding talk — I only got about 40% of it, but the scope and scale of this project is breathtaking. Potentially affects all of Java/JVM/JTC (Java test certification), trying to codify a standard of how all of it is supposed to behave. Suspect @borkdude would love this. Maybe @ghadi too, as it dives really deep into all aspects of how everything runs, both in ideal and non-ideal conditions. https://youtu.be/QUbA4tcYrTM

1
😎 2
genekim15:10:41

The fact that the Graal project is only mentioned once in the Leyden paper is the opposite of this talk, where Graal is 75% of the presentation.

respatialized15:10:31

Would be interesting also to consider how something like #C051QCQUF can potentially target what Dinn is calling "static Java", to make it easier to port Clojure programs that need this performance boost to this new model.

Ben Sless18:10:41

I'm mildly cautious regarding that Graal demo. I'm used to having to wait a bit for the JIT to actually get to full throughput, and my view is that unless you're specifically profiling startup, it's better to first get the application to a steady state where all the hot paths have been compiled, and then profile. In the demo they profiled a cold Java program and reached almost the same throughput as an AOT-compiled image in about a minute, which either means the Graal JIT compiler is a beast or the AOT compiler is trash

borkdude18:10:30

I haven't watched the demo, but did they use PGO? That may be the critical part

genekim19:10:13

They used GraalVM EE, and based on this from Fabio, I'm guessing they used PGO AOT. So I think that means the Graal JIT is a beast (warms up and compiles hot paths fast), and the AOT compiler can beat even that performance? https://twitter.com/fniephaus/status/1584102839860662273
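(For reference, the GraalVM EE PGO workflow being discussed looks roughly like this — a sketch from memory of the GraalVM docs, with a made-up jar name, so treat the exact flags as unverified:)

native-image --pgo-instrument -jar app.jar
./app                                # run a representative workload; writes a default.iprof profile
native-image --pgo=default.iprof -jar app.jar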

genekim19:10:51

PS: I'm now watching this video by Dan Heidinga on all the work being done to bring the best of Static Island to Dynamic Island Java. Fascinating work, of which GraalVM is just one piece. Lots of brains working on this problem! https://youtu.be/pGHN8a3VAeU

genekim19:10:54

Oops. One of those photos didn't belong. Here are two more from that presentation, which I'm almost done watching.

genekim19:10:40

Just posted question to that team on the game of life demo parameters. Super curious to find out!

ghadi19:10:39

all these (existing) static efforts double down on "Dead Programs" https://www.youtube.com/watch?v=8Ab3ArE8W3s

ghadi19:10:38

I hope Leyden carves out a new point in the design space, that has efficiency with dynamism

ghadi19:10:19

IBM's compiler does JIT stashing - persisting profiling info across reboots, instantly hot startup

ghadi19:10:38

they also talked about a network JIT compiler, that can distribute JIT decisions across a cluster
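(Roughly, the OpenJ9 equivalents look like this — a sketch from their docs as I recall them, with made-up cache/jar names, so double-check before relying on it:)

java -Xshareclasses:name=mycache -jar app.jar   # shared class cache persists classes, AOT code, and profile hints across runs
jitserver &                                     # start the remote JIT compilation server
java -XX:+UseJITServer -jar app.jar             # client JVM offloads JIT compilation to the server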

genekim19:10:39

That thought occurred to me — it's super cool how both of these presentations talk about how important dynamism is, but also want to get some of the benefits of Static Island (fast startup times, etc.)

genekim19:10:41

There's a choice line from that last talk, essentially: "OH: why are people complaining about JVM startup times? Main is executed in 4ms. Answer: that's not the point. It's not how fast main is run, it's when your first customer request is serviced, which involves hundreds or thousands of classes that have to be initialized" 😆

borkdude19:10:20

There are certainly trade-offs. Some tools optimize for size: there dead code elimination is in tension with dynamism. Google Closure DCE feels very similar to GraalVM native-image in that respect. Other tools persist your entire runtime to an image and preserve full functionality without optimizing for size, but just fast boot times to get your full environment going. Windows hibernate? ;)

borkdude19:10:44

If I'm not mistaken, Leyden wants to give developers the option to make their own trade-offs, whereas native-image is an opinionated tool

1
genekim19:10:32

That last video is so fun to watch. He is saying that developers need to be able to specify when they want their class initializers to run. GraalVM couldn’t figure this out: first they defaulted to build time, then defaulted to runtime, and now they’re trying a hybrid approach. 😂😂 They discovered that developers need to specify this, and it needs to be built into Java. (Transcribing, sorry!)

genekim20:10:22

Sorry. “Devs need to be the one to specify when their class initializers should run” The GraalVM decision flipping from build vs run time shows how elusive / non-obvious this insight was. 🤯😂

borkdude20:10:27

you can still configure this with graalvm though

borkdude20:10:14

I've even made a tool for Clojure "classes", since you can't really leave them to be initialized at runtime for native-image: https://github.com/clj-easy/graal-build-time
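(Wiring that up is just a dependency — a minimal deps.edn sketch; the version shown is illustrative, check Clojars/the README for the current coordinates:)

{:deps {com.github.clj-easy/graal-build-time {:mvn/version "0.1.14"}}}

With it on the classpath, native-image picks it up as a Feature and initializes Clojure's classes at build time.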

genekim20:10:56

@UK0810AQ2 PS: confirmed. AOT w/PGO and GraalVM with JIT. IIUC, based on your comment, this is an impressive showing for the Graal JIT? https://twitter.com/fniephaus/status/1584625504396931073

genekim21:10:13

@UK0810AQ2 Just thought you’d be interested — on my macOS MBP 2020 ARM:
• 19-zulu: JIT: 35 ticks/sec (HotSpot JIT)
• 19-grl-ce: AOT: 31 ticks/sec (no PGO in community edition; that’s only in EE)
• 19-grl-ce: JIT: 40 ticks/sec (!!!! Graal JIT)
19-grl-ce is v22.3, just released a couple of days ago. This was the first time I’ve seen this big of a performance improvement with Graal JIT. I may try running a REPL session or two with this JVM (which I had to manually install, versus using sdkman.)

genekim21:10:56

(That was a fun 90m. First time I’ve directly used mvn for compile/run.)

genekim22:10:09

Another hour — tried Graal JIT for some data analysis things I’ve been doing with git repos: hand-built markdown recursive descent parser, building up lots of map assocs, lookups, etc. Lots of CPU in single-threaded calls to Jack Rusher's darkstar library, in GraalJS, so maybe not the best test case. But delightful that it all runs on the new JVM. TL;DR: azul-17 still consistently the fastest, by 2.5%
• azul-17
  ◦ make svgs-unicorn-short: 58.57s user 1.63s system 283% cpu 21.237 total
  ◦ make svgs-unicorn: 534.61s user 6.25s system 504% cpu 1:47.21 total
  ◦ make svgs-unicorn: 546.93s user 6.71s system 491% cpu 1:52.71 total
• 19 - grl-v22.3
  ◦ make svgs-unicorn-short: 79.60s user 2.47s system 370% cpu 22.160 total
  ◦ make svgs-unicorn: 733.27s user 11.07s system 636% cpu 1:56.92 total
  ◦ make svgs-unicorn: 734.55s user 11.76s system 628% cpu 1:58.83 total
• 17 - sdk install java 22.2.r17-grl
  ◦ make svgs-unicorn-short: 79.72s user 1.49s system 358% cpu 22.651 total
  ◦ make svgs-unicorn-short: 83.85s user 1.94s system 387% cpu 22.131 total
  ◦ make svgs-unicorn: 760.32s user 13.72s system 623% cpu 2:04.09 total
  ◦ make svgs-unicorn: 749.85s user 10.00s system 636% cpu 1:59.38 total

Ben Sless03:10:51

It's worth trying to modify the compiler's maximum node depth for inlining. Clojure probably has deeper stacks and more polymorphic code.

🎉 1
genekim20:10:11

Those are great talks — I remembered last night that I watched these talks last year. Thx!!! (PS: I tried these parameters, and they didn’t result in any meaningful speedups. I think I passed the parameters correctly?)

-XX:MaxInlineLevel=18 -XX:MaxInlineSize=270 -XX:MaxTrivialSize=12
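(If anyone wants to reproduce this in a Clojure project, a minimal sketch of passing those flags via a deps.edn alias — the alias name and main namespace here are made up:)

{:aliases
 {:inline-tuning
  {:jvm-opts ["-XX:MaxInlineLevel=18"
              "-XX:MaxInlineSize=270"
              "-XX:MaxTrivialSize=12"]}}}

;; run with: clj -M:inline-tuning -m my.app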

Mark Sto19:10:02

That’s what you call a passionate community! Half the progress of the Dart/Flutter Support ticket, in terms of upvotes, in just 1 day.

nice 3
💯 2
respatialized22:10:36

https://githubcopilotinvestigation.com/ An investigation has been launched by a team of class-action litigators into license violations by GitHub Copilot, which is, in my view, a massive enclosure of the intellectual commons provided by free software.

👍 7
1
respatialized23:10:16

They are actively seeking class members for the lawsuit, so if you think some of your code (Clojure or otherwise) has been used by Copilot in violation of your license, you can contact them about it. Clojure is a pretty small community, so there's a good chance that some Copilot-generated snippets might just directly copy/reproduce implementations from the libraries that solve specific problems and may be easier to attribute.

seancorfield23:10:17

I've been fairly impressed with Copilot's ability to suggest plausible Clojure code -- including docstrings -- that is an obvious and clear extension of existing (Clojure) code in the repo, in terms of symbol names and the language in existing docstrings. It certainly has not seemed to directly copy any other code (so far).

seancorfield23:10:07

I'll be very interested to see how this investigation goes and what details it uncovers -- Microsoft has been pretty cagey about how exactly Copilot works so we may learn some fascinating things 🙂

phronmophobic00:10:57

Downloading almost all of the Clojure projects on GitHub is only about 14 GB (compressed). If I ever get around to indexing the code, it would be interesting to try to automatically detect this.

🔍 2
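(A naive sketch of what that detection could look like — a token-shingle index in Clojure; all names are made up for illustration, and real matching would also want normalization of whitespace and identifiers:)

(require '[clojure.string :as str])

(defn shingles
  "Set of all n-token windows in a source string."
  [n text]
  (->> (str/split text #"\s+")
       (partition n 1)
       (map #(str/join " " %))
       set))

(defn build-index
  "corpus: map of file path -> source string. Returns shingle -> #{paths}."
  [n corpus]
  (reduce-kv (fn [idx path text]
               (reduce (fn [idx s] (update idx s (fnil conj #{}) path))
                       idx
                       (shingles n text)))
             {}
             corpus))

(defn suspect-files
  "Paths sharing at least one n-token shingle with a generated snippet, with match counts."
  [idx n snippet]
  (->> (shingles n snippet)
       (mapcat idx)
       frequencies))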
hifumi12300:10:15

@U7RJTCH6J Sounds like an interesting project. Assuming this is not against GitHub's ToS, feel free to ping me if you'd like someone else to join 😄

phronmophobic01:10:13

To the best of my knowledge, it's not disallowed by github. I try to be a good citizen, so I've tried to make sure I'm not abusing their site. They do have rate limiting policies, but they're very reasonable and even when being extra cautious, you can get access to all the info in less than a day.

👍 1
didibus03:10:33

I would assume it's not just about if it could output copyrighted code, but more if the entire model should be considered a derivative.

☝️ 1
didibus03:10:41

And I guess similarly, if any output could as well. I think people will need to establish: if an output is derived from multiple licensed code bases, even if it's not recognizable, but we know it's directly derived as per the model's learning process, what does that mean legally?

seancorfield03:10:52

It's an interesting legal question. If code to call a specific third-party API in, say, Kotlin pretty much always looks the same, and twenty people write that code and release it under five different licenses -- and then you (or Copilot) write the same code again, is that plagiarism or a license violation or what?

seancorfield03:10:57

I'm using Kotlin as the example here because a friend of mine is learning it and using copilot a lot to help him write tests as he learns. He's finding that he types in the it descriptor (BDD) as an English string and it often writes a pretty good test for the basics.

didibus04:10:26

My impression: in cases where you claim to have "reinvented" something, it's a matter of proving and convincing others it's truly not copied or derived. And then there's the fair use part of it. This is US copyright I'm talking about. I think here, for #1, it's clearly copyright violation. Copilot is trained directly from the copyrighted material -- literally, it's being copied into the training set -- and the output code is a mathematical derivation of the sources. The HN article seems to agree with this, in that they think Microsoft will be arguing for Fair Use, and not that copyright doesn't apply at all.

seancorfield04:10:43

On the subject of Fair Use, a recent case that will be interesting to follow is the Andy Warhol Foundation vs Goldsmith. https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the_Visual_Arts,_Inc._v._Goldsmith and in particular the interview/analysis here https://www.lawdork.com/p/amici-how-a-painting-of-prince-could

seancorfield04:10:54

(I highly recommend Chris Geidner's substack if you like deep analysis of current legal issues)

didibus04:10:56

Implementing an API call would fall under Fair Use pretty easily, I think, because the work is not very creative, it only copies a very small part, and it doesn't likely result in financial harm to the original author. But here again with Copilot, it's a new paradigm; it's hard to map. Like, the entire work was used, the works were very creative full code bases, its purpose is for-profit, and it's possible it would affect the market value of the original, though hard to say. I'm not even sure what to think of it myself. The idea that you can derive from code a code base that can then be used to write similar code to the original code base? That's just kind of mind-blowing and confusing.

seancorfield04:10:13

I dunno... I've been doing this (professional software development) for forty some years, and open source development for thirty, and there's very little out there that is truly novel at this point.

seancorfield05:10:07

I mean... we've been hashing over this issue for decades, in terms of whether different people, solving the same problem, arriving at the almost identical solution is a problem or not...

didibus05:10:36

I don't think it's a problem to arrive at the same solution, unless you used the known solution in order to arrive at it yourself. And we're not talking about ideas, since those apply to IP; we're talking about, like, if you copy/pasted or typed in some parts of the code directly taken from the existing solution's implementation.

seancorfield05:10:00

If you have something novel, that's what patents are for. If you've just solved the same problem as 100 other people in a similar way, I'm not even sure how valuable copyright actually is here? Sure, it's your "intellectual property" insofar as you wrote it -- but if 100 other people also wrote it, and published it, and copyrighted it, what are we arguing about an AI producing the same solution? Is it theft, coincidence, just an obvious result of any algorithm? I honestly don't know.

seancorfield05:10:44

What is copilot copying here?

2
didibus05:10:09

I think the issue is the claim that the AI "learned". Like, if you study my implementation to help you understand how it works. Then don't actually remember the exact strings I used, in what exact arrangement, but understand it now, and go implement it. Ok sure that's fine. If you just copy/paste chunks, or paste my code and then change it to look slightly different or tweaked, or if you look at it and type back parts of it, again maybe modifying it in the process so it looks different but still directly looking at it while doing so. Now I think, if you're also directly competing with me financially, this would be copyright infringement.

didibus05:10:24

> What is copilot copying here?
I think that's exactly the question. Is it copying as I explained above? Or did it truly understand the concept? Does it not remember the original source at all, but only remembers the concepts and logic of it? So it can now re-implement it on its own?

didibus05:10:10

And I think because a lot of people saw it output verbatim exact copies even including people's names in comments, people are saying it's copying, and hasn't really learned to program.

seancorfield05:10:37

Hard to tell -- the omission of the WHERE clause in the test is closer to correct, compared to the false test above, but it doesn't know that nil is simply disallowed here. If nil were allowed, the generated tests would be correct.

seancorfield05:10:02

My friend above, learning Kotlin, uses a lot of Maori in his tests -- and Copilot often suggests tests with Maori in them (probably from his own code) but also suggests other languages with appropriate translations too.

seancorfield05:10:40

If I were writing C++ and Copilot suggested Studebaker as a string, would you be surprised? It's in the ANSI Standard document...

didibus05:10:01

They did say they apply transforms over it, like using names and style from your own code, etc.

seancorfield05:10:09

(and it's there because I put it there back in the '90s BTW)

seancorfield05:10:11

I think this is going to be a fascinating area of discovery -- we've built all these AI systems but we really have no idea what they're going to produce. And we really have no idea what it will do to copyright law at this point. What about all the AI-generated art? (and, again, see my link above about Warhol vs Goldsmith -- not AI related directly but the decision will influence how AI-generated art is seen, legally)

didibus05:10:13

Well, yes I'd be surprised, I mean, I wouldn't because I think the model is just doing smart stitching, and did not actually encode an understanding. But so, if you think it actually learned concepts, why would it want to put that name back?

seancorfield05:10:07

What would you expect it to generate for a "random string"? Couldn't it just as easily be any string that anyone has used in any open source code out there?

didibus05:10:24

It's not random if it's only drawn from that limited dictionary

seancorfield05:10:10

Some friends of mine were lampooning some code that Adobe published about AWS yesterday -- ridiculing the variable names. But it turned out those variable names were used in other Adobe public code, from completely unrelated teams (although, possibly, the doc team overlapped even if the product teams did not). If I happen to use that variable name, am I subconsciously copying Adobe's code?

didibus05:10:15

But your original comment didn't mention asking it for a random string; I thought you meant just if it was suggesting some C++ code and wrote that

seancorfield05:10:47

Right. And that's the core of it: what exactly are we asking Copilot to write for us? Do we know?

didibus05:10:57

You're saying you think Copilot understands programming fully, but doesn't understand what we're asking it, so all the errors, mistakes or verbatim copies were simply it misunderstanding what we want?

seancorfield05:10:05

And I have a dog in this fight: I've been on the receiving end of someone in this community copying and pasting my code and passing it off as their own, including removing all the copyright and licensing.

seancorfield05:10:08

I'm not saying that. I'm saying we don't know what we've built and we don't know what we're asking for. We have no idea what to expect of something like Copilot. And now we're trying to rationalize it in terms of what we do know. Which is not sufficient.

💯 1
seancorfield05:10:48

(this is common with advances in technology)

didibus05:10:01

I mean, the ethical / do-we-even-care side is a whole other question. I'm personally not sure. I think I do care that it's closed source and proprietary even though it was built off of, like, every single open source project combined :rolling_on_the_floor_laughing:. But I think if it was made open source itself, I'd be all for it, though I do think people should be allowed to opt out of being in the training set.

seancorfield05:10:53

Ah, and there's the "red line". And this is why people are upset. It's nothing to do with the technology itself, it's really all about "open source purity".

seancorfield05:10:06

"I can't see the source so I don't trust it!"

seancorfield05:10:22

It isn't a rational argument.

didibus05:10:15

I think I also have an issue with like, people using the word "learn". The inference function is programmed by a combination of the dataset and the model. I think that means the dataset is in large part the source code for the inference function.

seancorfield05:10:21

And you're right -- the people squealing about MS/Copilot right now would mostly be singing a very different tune if Copilot itself were open source.

seancorfield05:10:23

If Copilot were an open source project, created by some community, the OSS folks would be falling over themselves to proclaim how great this innovation was 🙂

seancorfield05:10:38

Changing tack slightly, do you believe a neural net "learns"? If not, what is the appropriate word to use there?

didibus05:10:10

I mean, this seems fair to me... You put the code out in the public with a license, you assume people will respect it. Copilot comes and says, I don't think your license applies to me.

seancorfield05:10:42

Re: licensing, I'll go back to my earlier question -- if Copilot derives a suggestion from 100 near-identical pieces of code, released under ten different OSS licenses, what do you think it should do? If a solution is common and obvious, is copyright even applicable here?

didibus05:10:48

To clarify, I don't care about Copilot outputting identical implementations. That's not where I feel there's a licensing issue. The issue I have is that their code base contains the open source code in it. So Copilot directly uses open source code without respecting their license.

didibus05:10:38

I think the issue of Copilot suggesting you identical code is more the problem of the person using Copilot: making sure they're not unknowingly copying licensed code.

seancorfield05:10:00

"The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set."

seancorfield05:10:23

"We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup."

didibus05:10:01

I mean, it's hard to trust proprietary benchmarks :rolling_on_the_floor_laughing:

didibus05:10:29

What if that one line of code is copied from 3 different sources, can they detect that kind of stitching?

seancorfield05:10:00

And if those three sources have different licenses? Or at least one of them has no license?

didibus05:10:11

But again, personally that doesn't bother me. It's the use of copyrighted code in their own code base that bothers me, without respecting the license.

seancorfield05:10:44

A lot of commercial, closed-source software is built on OSS -- and complies with the license.

didibus05:10:01

Yes, but copilot doesn't

seancorfield05:10:09

How do you know that?

Martynas Maciulevičius05:10:42

I think licenses should start including an "allow generalization" clause. This way they'll allow or deny the code's use in these kinds of systems.

seancorfield05:10:45

I bet the MS lawyers have been all over this. When I worked at Macromedia and Adobe, the legal team were death on stuff like this.

didibus05:10:51

Well, they've used the entire public GitHub, so that already means it's using copyleft stuff. And they've got no attribution anywhere. So most licenses would at least ask for that.

seancorfield05:10:19

Only as a training set. Not part of the Copilot codebase. No license violation there.

seancorfield05:10:31

Lots of projects analyze the public code on GH.

didibus05:10:44

The training set is the code, and it's likely stored in their git repo alongside the rest

seancorfield05:10:12

I doubt that. The lawyers would have a fit.

seancorfield05:10:23

Have you never been around corporate legal teams??

didibus05:10:36

The training set is code in my opinion. Ignoring the explicit non-determinism in the model, the same training set rebuilds the same exact inference function.

seancorfield05:10:42

That's definitely a bogus argument. There are lots of systems out there that, given the same input will produce the same output, and that input/output has ZERO to do with the IP or legal issues involved in the system.

seancorfield05:10:58

You're just not thinking about this objectively.

didibus05:10:06

This is why I don't consider it learning as in like humans. This is more akin to indoctrination 😝, the model just compiles to whatever the dataset reduces to when fed through the model.

didibus05:10:35

For which of those systems is the output a running program?

seancorfield05:10:45

OK, so this is a pointless discussion because you're basically saying all input-based AI is "wrong"...

didibus05:10:24

I'm saying it's just how you program the computer to implement some function. It's a different kind of source code, but it's still just programming.

seancorfield05:10:41

What about my earlier Q about AI-generated art?

seancorfield05:10:53

Do you claim that's entirely predictable based on the input dataset?

seancorfield05:10:44

I think the answer is that we just don't know. We can't predict what these AI systems produce.

didibus05:10:56

Yes, as far as I know it is. You improve the models by cleaning/changing your dataset and tweaking the model. It's all the code fed to generate the computer function.

didibus05:10:23

I mean, you can predict it the same way you can predict the output of a compiled code base

seancorfield05:10:47

I think the people building these AI systems would disagree with you...

didibus05:10:57

Think about it, if you lost the dataset, Copilot is gone.

didibus05:10:24

Try to port it to run on some new CPU and operating system

seancorfield05:10:42

If you lost the input for System X, it can't produce any useful output. That doesn't mean System X is bogus/useless/"gone".

didibus05:10:02

The dataset isn't like user input, it's input used to create the program itself.

didibus05:10:31

It's akin to losing the source code for a video game and only having left the compiled game.

seancorfield05:10:40

So you don't think neural nets stand alone and are useful in and of themselves?

didibus05:10:04

Yes they are, but just like how the Clojure compiler is useful.

seancorfield05:10:04

False equivalence. OK, I'm not going to bother with this. Your position is... interesting... but we don't agree and we aren't going to get any closer.

didibus05:10:35

You're trying to say that the Copilot artifact does not depend on the source code contained in its dataset? But if you got rid of it, you couldn't regenerate Copilot?

didibus05:10:56

Ya, we might need to agree to disagree. But I'm curious what you consider it otherwise? Like do you see video game assets and UI assets as just free to copy and use in your own program?

didibus06:10:10

Ok, for onlookers, one last attempt at explaining myself. If you use a library and depend on it to build your program, do you use the library? Ok, now Copilot is built using other people's libraries as well and uses them; they are similarly needed to build the Copilot program. But then you say Copilot doesn't use those libraries? In both cases, without the library your build fails and you can't build your program. In both cases you can find an alternate library, probably refactor things, and end up with a similar-ish behaving program. Personally I don't see how someone can claim Copilot doesn't depend on these. Sure, it doesn't use the code for traditional compilation/linking or interpretation, but it does use it to similarly build a computer program, just using a different technique to derive the resulting binary code. There is even ML research showing you can reverse the training and get back parts of the dataset that was used for it, similar to how you can decompile and reverse engineer parts of a compiled program and get back parts of the source. That's why there are people saying there could be privacy issues, as private info can be leaked.

didibus06:10:27

I'd be curious to hear from any data or ML scientists. I only work with ML scientists and use SageMaker to train simple models, and don't really know the deep math and theory behind it, beyond that old Andrew Ng Coursera class. But my impression is this is just a new way to code. Instead of a programming language, it's all about finding the dataset and its representation that will result in the behavior you're looking for. Creating the dataset by hand is tons of work and almost impossible, so you have to use existing data. But it's common to tweak it to try and get the result you want. And here comes the copyright issue: if other people wrote the data in the dataset, and own the copyright, are you still allowed to use it to derive a program from it?

Martynas Maciulevičius06:10:56

I added this to my own licenses:

AGPLv3.
In addition to this license, it's not allowed to be used in GitHub's Copilot or similar analytical applications.
If there is a gray area about what counts as *Fair Use*, the default is to reject such use.
Probably a small change but that's how it can start to be respected.

didibus07:10:49

I've wondered that, but I don't think this does anything. I think legally it's about the fact that you made it publicly available, and from that point on, you can't prevent all use of it, only those uses that fall under copyright, or IP if you have any patents related to it. That's kind of the crux of it: if Copilot's use of your code as training data for their commercial software is found to be Fair Use, or not even something that falls under copyright, or similarly not an IP infringement, there's nothing you can do to stop it being used as such, except not make it public.

didibus07:10:11

That's why people are asking for legal frameworks to explicitly look into this and address this unknown. So at least everyone would know if it's allowed or not and be able to figure out given that knowledge what they want to do.

Martynas Maciulevičius07:10:13

But if I write my license in a way where I say it's not "fair use" anymore, then what? Do you say that even when I open my code and explicitly write that I don't want to feed Copilot... they can still use it? I'd find this stupid.

yes 1
didibus07:10:10

Ya, that's how it is. Once you made it public, the only rights you retain are copyright. If copyright doesn't prevent data mining or model training for commercial use, then you don't have the right to restrict that. I think alternatively you'd have to make it private and, like, give it only to people with direct binding contracts. Like, sign this contract and you can get my source, where the contract explicitly prevents this use.

Martynas Maciulevičius07:10:24

So what's the point of law then? What's the point of software licenses at all? Why does this thread even exist, and why are we still here...? Why did we start talking about it in the first place if it doesn't matter...

didibus07:10:02

IANAL though, this is just what I understand

didibus07:10:32

Well the licenses are to allow things that even copyright doesn't allow.

didibus07:10:33

Oh, because someone is planning a class action lawsuit and will try to argue in court that copyright does in fact prevent this use. If they win, that would set a precedent. Microsoft and GitHub on the other hand are arguing it's fair use, and so no explicit license is needed from the author and they can continue doing as they please.

Martynas Maciulevičius07:10:13

https://opensource.stackexchange.com/questions/297/whats-the-difference-between-copyright-and-licensing > Licensing is the legal term used to describe the terms under which people are allowed to use the copyrighted material. So under this logic the use is only allowed by the license. So if license doesn't talk about what "fair use" is then Microsoft could have a free lunch here. But then the whole open source community would be very unhappy. Also they could probably argue that "if it's opensource then it's not copyrighted"

Martynas Maciulevičius07:10:01

> If you want, you can rent the house out to someone, and that rental agreement is the 'license'. So what microsoft did was that they rented a house and they organized a party with 300k participants and gave keys to everybody. Is that a fair use of a rented property?

didibus07:10:52

I don't think that's the case. As I understand, there are two questions. 1) Does copyright even apply here? Are they even using your copyrighted work? They're not selling or redistributing your work in a shape that even remotely resembles it. So it is more like inspiration, or they just used it to learn and nothing more. The argument is that all they're doing is reading your code but not using it, even though it's a machine that's reading it. If the court were to say no, that'd be it. It means that they're not even using your work in a way where copyright matters. 2) Fair Use: this is a legal term, an exception to copyright that exists in US law itself. It says that there are circumstances when, even if a work is copyrighted, you can still use it without an explicit license agreement to do so. See the four factors of fair use here: https://support.google.com/legal/answer/4558992?hl=en Here too, if the court were to say this is fair use, that'd be it; they could do whatever, no matter the license.

didibus07:10:10

Personally, I think this is a losing battle. You should assume people will do this. If it's not a US company, it'll be a Chinese one, or some other one. Or some company will lie about using open source in their dataset and keep it a trade secret. Also, governments seem to want to allow this, as they are hoping AI innovation can give their economy a boost by having their own companies be at the forefront of it. For example, this year the UK introduced a law: > the UK government has now decided to introduce a new copyright and database right exception which allows TDM for any purpose, i.e. including commercial uses. Licensing will no longer be an issue and rightholders will not be able to opt-out or contract out of the exception > TDM stands for text and data mining.

didibus07:10:21

The EU has a similar TDM exception but it's restricted to non-commercial use for now. The US doesn't currently have a clear stance on TDM and how copyright/fair use applies to it, it's a grey area in the US.

Martynas Maciulevičius07:10:24

https://news.ycombinator.com/item?id=27677177 > It isn't. US copyright law says brief excerpts of copyright material may, under certain circumstances, be quoted verbatim > ----> for purposes such as criticism, news reporting, teaching, and research <----, without the need for permission from or payment to the copyright holder. > Copilot is not criticizing, reporting, teaching, or researching anything. So claiming fair use is the result of total ignorance or disregard.

didibus08:10:26

> certain uses of copyrighted material for, ----> but not limited to <----, criticism, commentary, news reporting, teaching, scholarship, or research may be considered fair

didibus08:10:21

I feel it could go either way in court. But there could also be a law passed like in UK.

Martynas Maciulevičius10:10:49

https://matthewbutterick.com/chron/this-copilot-is-stupid-and-wants-to-kill-me.html > Suppose we accept that AI training falls under the US copyright notion of fair use. (Though the question is https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/.) If so, then the fair-use exception would supersede the license terms. But even if the input to the AI system qualifies as fair use, the output of that system may not. Microsoft has not made this claim about GitHub Copilot—and never will, because no one can guarantee the behavior of a nondeterministic system.

respatialized15:10:36

as a data scientist, I tend to find myself in agreement with @U0K064KQV about the relationship between models and training data. I don't think the nondeterminism exhibited by the models on individual cases of output is a particularly salient factor, any more than the visual noise generated by a photocopier on each copy would be salient to the question of whether those copies were used to plagiarize another work. I find M Eifler + Vi Hart's analogy to collage particularly compelling in thinking through what "AI" and machine learning actually do when they are said to "create" things: > A neural net run on Mozart has Mozart at its core, and to understand the results you need to know Mozart. If you’re using ImageNet, you’d best be prepared for a lot of dogs. The AI isn’t just “looking at” tens of thousands of dog photos, the AI is dog photos. The data is the fabric, and the code is just the stitching. https://web.archive.org/web/20200328033034/https://theartofresearch.org/ai-ubi-and-data/

respatialized16:10:18

And with that collage metaphor in mind, I think it's worth highlighting this passage from Matthew Butterick's commentary on the investigation page: > Copilot’s whizzy code-retrieval methods are a smokescreen intended to conceal a grubby truth: Copilot is merely a convenient alternative interface to a large corpus of open-source code. > ...Microsoft is creating a new https://en.wikipedia.org/wiki/Closed_platform that will inhibit programmers from discovering traditional open-source communities. Or at the very least, remove any incentive to do so. Over time, this process will starve these communities. This is why I referred to Copilot as an enclosure of the intellectual commons: it is bundling up the labor of the countless open-source and free software authors and attributing that labor to the magic of "AI." The difference between what Microsoft's doing and an open-source alternative to Copilot is not just the license, but Microsoft's increasing market power in the domain of software development and its ability to impose standards and practices that directly benefit its bottom line on a community it enjoys increasing weight in. Whether this makes me a "purist" or not, I care about the survival of free software, a community to which Microsoft has been https://en.wikipedia.org/wiki/Halloween_documents in the past.

Martynas Maciulevičius16:10:35

Those who say that the model is nondeterministic and therefore not understandable are blinded by their tools. They know that they can't inspect the model enough to understand what it does, so instead they choose to treat it as an opaque black box. And then they try to force this understanding on others, to say that nobody should be able to understand it either, therefore it's something different and something new. This behavior is exactly what the model builder would like people to believe, because it makes the life of the builder easier. They don't need to think about any implications because "hey, this is the model and I can work on version two now". But if we say that we should and can understand the model, because we can understand what inputs we put into it, then we can also reason about the output. And if we can reason about the output, then we know what it can and can't produce. IMO it's similar to a pseudorandom generator or cryptographic signatures -- you can produce non-random output, and you can tamper with it, if you know what you're doing.

1
respatialized16:10:19

That's correct, and Butterick has argued that it is possible to build chain of custody into AI tools so that one can see which inputs are assigned which weights in producing a given output, a feature which is being built into more computer vision tools. (eg. https://christophm.github.io/interpretable-ml-book/pixel-attribution.html) The fact that this approach is possible but not chosen suggests bad faith behavior on Microsoft's part; it seems to me like they don't want to do this because people would opt out, reducing the value of their model.

Martynas Maciulevičius16:10:28

I've read in a book that it's policy that decides how AI (and IT systems) work and not the nature of the programming. Programming is just a tool and a tool doesn't have a bias (sorry for being this harsh but soft-wrapping loses objectivity). You can put a spaceship on the Moon but instead you can choose to think selfishly. Rocket doesn't have an opinion but people can destroy cities with them. I think the least Microshaft could do is to opensource the model and make it free. Then the real science will begin and not this crap. Then the ones that really want to innovate could innovate without these restrictions that MS creates by taking free code and coming up with a paid tool. It's like selling Wikipedia for money. And without attributions. It's not that they couldn't do the attributions, they chose not to because they don't care.

💯 1
m.q.warnock20:10:27

I read just about all 1k comments in the most recent HN discussion on this, as well as this thread, and I'm going to reluctantly stick my neck out for what I see as both the most unpopular position and the right one: Copyright was never a good solution to the problem of rewarding creators. Like most laws created within the capitalist paradigm, it rewards those with capital. The present 'crisis' is just another example of technology making an ad-absurdum argument for us. Whether 'we' respond to this one by doubling down may well signal the (multicausal) fall of the whole house of cards (https://www.youtube.com/watch?v=pW-SOdj4Kkk)

💯 2
didibus00:10:23

Yup. Also, when I read about why copyright was put in place, it seems the spirit was to promote the creative process so that we produce even more great work. It's not really about making the creators money. And Capitalism, if you're behind it, is said to be a great mechanism for incentivizing people to do things. So in that framework, by allowing creators to profit from their creation, you incentivize more people to create. At the same time, by limiting the extent of their control over it, you can balance that with enabling collective inspiration and iteration, which also contributes to the advancement of generating more creative work. So it seems copyright is spirited in that mental model: reward people for their creations temporarily, but allow for people to iterate over other people's work as well, in the name of fostering creativity as a whole and advancement. Personally I think that seems like the right thing to do even here. I don't want to stop AI advancement, and I can imagine how much more work I could get done -- faster, bigger projects done in less time with fewer people; we might be able to deliver even bigger, better software more quickly. But you also can't hamstring people's incentives to create. The models themselves will run dry if the dataset doesn't continue to evolve and grow. You don't want to feed its own output back into it either. There's got to be a way to make this a win/win. Allowing opt-out / opt-in and respecting the license terms would seem to me like it makes it a win/win. This is true for other models as well, like DALL-E: pay the artists, or give them attribution, or some perpetual share, or whatever it takes to have them opt in. At least for any commercial model. Allow it all for research, so we can innovate without slowdown and push AI to its limits. But when you want to go commercial, get permission from all the creators you depend on. I don't know, this seems like a fair win/win to me at least.

mauricio.szabo03:10:59

The thing is, models like DALL-E are already complicated, because some artists are claiming that people make prompts like "this and this, in the style of Mauricio Szabo's paintings", for example

mauricio.szabo03:10:40

Supposedly, the same can happen with Copilot -- if someone knows a good programmer, he could say "this and this code as if it was written by @githubhandle", for example, and the model could weight that specific user's code more

mauricio.szabo03:10:23

Also, the fact that Microsoft said "only 1% of the time the code has any relation with the original" is bull. Even more because they contradict themselves on Microsoft's blog, where there's a post explicitly saying that if you started a file with a comment, Copilot would try to complete it with the GPL text

mauricio.szabo03:10:18

Supposedly, they tuned the algorithm to avoid that, which opens some questions: 1. Is this rule of "it almost never produces verbatim code" even true, and; 2. What are the arguments, again, for the outputs being completely non-deterministic and not understandable, and finally; 3. If that is true, why wasn't private code from Microsoft's tools used as a way to "prove" that this code will safely not be an IP violation?

mauricio.szabo04:10:50

Finally, I remember JekeJeke Prolog -- I was thinking of using it for my Spock project, but the code is not open source. Except that it was hosted on a public GitHub repo (which is now gone, together with the project's official page). So why can't I use JekeJeke as a library in my code without paying a license, but Copilot can use the whole source, also without paying?

Martynas Maciulevičius05:10:18

If they try to prove it's IP-free, then they will adjust the model just barely enough to pass, and they'll do these adjustments one at a time until people run out of energy to pursue it. IMO we don't want them to get into this exception hide-and-seek game; there should be a rule on what they can and can't do. This is beyond proving whether the output is good or bad. We know the inputs, so we should take the output as holding all those licenses combined, as you don't know which code was actually used. And if we take the worst-case scenario, then we can't use any code produced by it, as it's all AGPLv3 or some other license that doesn't permit anything at all. If any AGPLv3 code was used to train the model, then removing the attributions violates the agreement -- you don't include the original author anymore, even if you release the code. So you violate the agreement. And the best thing is that it's not Copilot that violates the AGPLv3 of the input source code, but the user that uses Copilot, because it's they who used the code and didn't include the license. So, coming to your Prolog example -- you'd be violating the license if you used any code from there. But you don't know, and you think you don't need to care. But in reality you may want to care.

ksd16:10:51

I haven’t actually read this, and don’t understand recent models enough to know if it applies, but this paper seems relevant to the topic of interpretability of neural networks: https://arxiv.org/pdf/2210.05189.pdf

respatialized23:10:16

They are actively seeking class members for the lawsuit, so if you think some of your code (Clojure or otherwise) has been used by Copilot in violation of your license, you can contact them about it. Clojure is a pretty small community, so there's a good chance that some Copilot-generated snippets might just directly copy/reproduce implementations from the libraries that solve specific problems and may be easier to attribute.

genekim13:10:00

https://twitter.com/graalvm/status/1581973515099860992 It’s been amazing watching the profits of the Graal team — the surprise to it is how much performance they’re extracting from AOT, increasingly beating JIT subset of workloads I was surprised how little Graal was mentioned in the Leyden paper, given that there seem to be going in similar directions.

👀 1
genekim15:10:20

I was super curious about this — watched this video from Andrew Dinn on Project Leyden, and it talks about the extensive work required to make Graal AOT work, and the desire to make it safer, with more assurances to ensure equivalence. Dinn apparently did lots of work to enable key pieces to make Graal AOT happen, and characterizes some parts as scary, “iffy — but iffy works.” And a desire to make it less scary. Neat mind-expanding talk — I only got about 40% of it, but the scope and scale of this project is breathtaking. Potentially affects all of Java/JVM/JTC (Java test certification), trying to codify a standard of how the all of it is supposed to behave. Suspect @borkdude would love this. Maybe @ghadi too, as it dives really deep into all aspects of how everything runs, both in ideal and not ideal. https://youtu.be/QUbA4tcYrTM

1
😎 2