2023-01-08
Thank you so much @rupert: this is very helpful. The only issues I am thinking of now are (1) the size of the downloaded data (PDFs or webpages, if there is a huge number of them) and (2) the intensity of training the NLP model. I am assuming that for (1), saving to a cloud storage service is one way to solve it, and for (2) using a GPU service, in which case the data needs to be saved on their database? Thanks again.
If you are just getting started, then you probably want to develop your algorithm end to end running on your local computer (with a small subset of data) - that way you have a fast feedback loop. As this is a data science task you will spend a lot of time looking at data and optimising the AI performance, not just writing code. Once it's all running nicely then you can scale it up to run on clouds etc. When you have parsed the text out of the PDF/HTML and compressed it, it is likely to be approx 1% of the original data size, so you can download data in batches, extract the text, then delete the original files to save space. For model training you can buy a used 1080 Ti on eBay for $200 (USD). It might be slower than the latest Nvidia GPUs, but that just means you wait a little longer for training to complete. Training often takes hours anyway - people often run it overnight. If you want to use a remote GPU, the cheapest are on https://vast.ai . If you use a GPU on cloud servers then you pay per second, and you need to be constantly stopping your server when idle to not run up very large bills. If you are new to NLP I would focus on that bit first (and leave out all the cloud storage, databases and distributed compute architecture until after you have that working).
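Here is a rough Python sketch of that extract/compress/delete loop (the `pages` directory and file names are just placeholders, and it assumes beautifulsoup4 is installed):
```python
# Rough sketch: extract text from downloaded HTML, compress it, delete the original.
# The "pages" directory and file names are hypothetical placeholders.
import gzip
import os
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4

for html_path in Path("pages").glob("*.html"):
    html = html_path.read_text(encoding="utf-8", errors="ignore")
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

    # Compressed plain text is typically a small fraction of the original page size.
    with gzip.open(html_path.with_suffix(".txt.gz"), "wt", encoding="utf-8") as f:
        f.write(text)

    os.remove(html_path)  # free disk space before downloading the next batch
```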
Makes sense. My goal is to use a pretrained model such as ChatGPT on a large text data set (from PDFs or webpages).
If you are using ChatGPT then you won't need a GPU, since it's a managed service in the cloud. There's no AI training because it's a prebuilt model, so your NLP task will either be prompt writing or fine-tuning (if you really need to). You will likely have to keep your data volumes pretty low going by their pricing: https://openai.com/api/pricing/.
Sure - you can calculate up front what it will cost from https://openai.com/api/pricing/.
• The Davinci model is $0.02 per 1,000 tokens.
• Each word is about 1.33 tokens.
• So 1 MB of text (which is about 209,000 words, or roughly 280,000 tokens) would cost about $5.60 USD.
• 1 GB of text would cost roughly $5,700 USD.
You can reduce the cost by processing less text or using a less accurate model. Alternatively use a different model and/or a different provider. OpenAI is the most well known but it's not the only one.
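If it helps, the same arithmetic as a small Python script (the constants are the approximations above, not exact pricing):
```python
# Back-of-the-envelope OpenAI cost estimate using the approximations above.
PRICE_PER_1K_TOKENS = 0.02   # Davinci, USD, per the pricing page
TOKENS_PER_WORD = 1.33       # rough average for English text
WORDS_PER_MB = 209_000       # roughly 1 MB of plain text

tokens_per_mb = WORDS_PER_MB * TOKENS_PER_WORD            # ~278,000 tokens
cost_per_mb = tokens_per_mb / 1000 * PRICE_PER_1K_TOKENS  # ~$5.60
cost_per_gb = cost_per_mb * 1024                          # ~$5,700

print(f"~${cost_per_mb:.2f} per MB, ~${cost_per_gb:,.0f} per GB")
```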
What do you want to achieve? ChatGPT is mainly useful for dialogue, not for giving it large texts.
I think it may help if you are more specific about your requirements or your idea. I'm not sure that ChatGPT will do what you are after. For example, ChatGPT does not give back accurate or factual information - even with additional training. ChatGPT is already trained on millions of web pages and PDFs - it might already have been trained on your material, which means you may not need to download them in the first place. Also you need to figure out your budget, because OpenAI's APIs cost money, as I covered in my last message.
I have a specific text (an in-house, non-public document) and I would like a chatbot answering any question regarding the policy... maybe this does not make too much sense, but I am a hobbyist trying a fun project to learn.
Sounds like a fun project. I would try testing the idea with ChatGPT in their Web UI (by copying in bits of the policy and asking questions) etc. If you think it's working well then you can productionise it with code/the API etc.
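For the productionise step, a minimal sketch against OpenAI's completions endpoint using the openai Python package as it exists today (the file name, question and prompt are just examples; ChatGPT itself has no public API yet, so this uses a Davinci text model):
```python
# Minimal question-answering-over-a-document sketch using the OpenAI API
# (openai Python package as of early 2023; model names and endpoints may change).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

policy_text = open("policy.txt").read()  # hypothetical in-house policy document
question = "How many days of annual leave do new employees get?"  # example question

# Note: text-davinci-003 has a ~4,000-token context limit, so a long policy
# would need to be chunked rather than pasted in whole.
prompt = (
    "Answer the question using only the policy below.\n\n"
    f"Policy:\n{policy_text}\n\n"
    f"Question: {question}\nAnswer:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=200,
    temperature=0,  # keep answers as literal as possible
)
print(response["choices"][0]["text"].strip())
```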
ChatGPT is not made for "give it a long text and ask questions about that text". It is made for asking it questions "about the world" (which it learned in a one-time training run on a massive amount of news articles). You can give it some context as part of the question, but it will always look for the answer elsewhere as well (in its original training data).
do you know of any free pre-trained NLP models that you would recommend for this task on my text data?
There's https://huggingface.co/ which has NLP libraries, including GPT-2, that you can run locally. It's Python, so you can either (A) use it from libpython-clj or (B) have Clojure read/write JSON/CSV files to interact with the Python code. Depending on the model you choose, you will likely need an Nvidia GPU with at least 8GB of GPU memory.
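For example, a minimal local run with the transformers library (the model downloads on first use; the prompt is just an example, and small models like GPT-2 also run on CPU):
```python
# Running a Hugging Face model locally (GPT-2 here) - no cloud service required.
from transformers import pipeline  # pip install transformers torch

generator = pipeline("text-generation", model="gpt2")
print(generator("The company leave policy states that", max_length=40)[0]["generated_text"])
```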
Your 'task' is called question answering - https://huggingface.co/tasks/question-answering - so you should pick a model that is optimized for it.
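A minimal sketch of that with the transformers question-answering pipeline (the model name and file name here are just examples, not a specific recommendation):
```python
# Extractive question answering over your own text with a model tuned for the
# question-answering task.
from transformers import pipeline  # pip install transformers torch

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

policy_text = open("policy.txt").read()  # hypothetical in-house policy document
result = qa(question="Who approves remote work requests?", context=policy_text)
print(result["answer"], f"(score {result['score']:.2f})")
```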