#Reading scientific papers, textbooks, and more (of any length!)


hard flume
#

Hey guys, I made a research tool to integrate GPT-3 with PDFs and external websites (including ArXiv, Nature, OpenStax, etc). It has pointed me to interesting or relevant information extremely quickly compared to skimming the document myself, so I wanted to share it with you.

I've also made a few optimizations that should improve on copying and pasting directly from PDFs or websites:

  • Works with any length of input
  • Efficiency: It takes <10s to load a PDF, and <10s to generate an answer, regardless of the length of the original document
  • HTML is converted to Markdown when possible to preserve more semantic meaning (if you have a tool like this for PDFs, please let me know!)

Things I'm still working on:

  • Slight potential for hallucination (in the meantime, please verify the outputs)
  • Time delay for very long PDFs (>60 pages)
  • Sometimes, extraneous text such as comments (on web pages) or reference lists (in papers) is included as context for the model

Please try it out at https://augmateai.michaelfatemi.com/web , I would love to hear your feedback! Drop a dallestar if you like it

UPDATE 12/31/2022: Added support for YouTube videos at https://augmateai.michaelfatemi.com/youtube

umbral glacier
#

Just shared it with a few friends! Curious to see how it performs 🙂

meager field
#

Nice! Good use of embeddings + prompting. Looks like Discord incorrectly appended the , to the URL: you might want to edit the OP.

umbral glacier
#

Do you have any writeup or git with how this works?

meager field
#

Have you open sourced any of it? I’d be curious to see how you did it: it’s quite similar to (but better than) several open source projects I’ve done over the past couple weeks using embeddings and prompt completions for summarization and semantic search.

merry hazel
#

Same question as @meager field

hard flume
#

Haven't open-sourced yet - but yeah the general process is to divide the text into chunks, embed them, rank them by cosine similarity to the query embedding, take the top K, generate completions individually, and then use a prompt to compile the individual completions into one
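
A rough sketch of those steps, to make the shape concrete. This is not the app's actual code: `embed` here is a toy bag-of-words stand-in for a real embedding model, `complete` is a hypothetical callable that wraps whatever completion API you use, and the prompt wording and chunk size are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(chunks, query, k=3):
    """Rank chunks by similarity to the query embedding and keep the best k."""
    query_vec = embed(query)
    scored = [(cosine_similarity(embed(c), query_vec), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def answer(text, query, complete, k=3, max_words=50):
    """Answer per chunk, then ask the model to merge the partial answers."""
    best = top_k_chunks(chunk_text(text, max_words), query, k)
    partials = [complete(f"Context:\n{c}\n\nQuestion: {query}\nAnswer:") for c in best]
    joined = "\n".join(f"- {p}" for p in partials)
    return complete(f"Combine these partial answers to '{query}' into one:\n{joined}")
```

Because only the top K chunks are ever sent to the model, the number of completion calls is K + 1 no matter how long the document is.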

#

I can make a video of how it works tomorrow

meager field
#

Would love to see your prompts if that’d be easier than open-sourcing everything.

merry hazel
hard flume
#

yes - it can answer questions specifically about the text you load

merry hazel
hard flume
hard flume
merry hazel
# hard flume ah, yeah send me the link and i'll try it too, could you also tell me what the s...

openai.error.RateLimitError: Rate limit reached for default-text-davinci-003 in organization org-xxxxxxxxxxxxxxxxx on requests per min. Limit: 60.000000 / min. Current: 70.000000 / min. Contact support@openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://beta.openai.com/account/billing to add a payment method.

hard flume
#

where does that error show up? in the network tab? that’s weird, because I have added a payment method…

merry hazel
hard flume
#

oh… yeah that’s probably it. so this is an error on your end rather than on this app?

merry hazel
hard flume
#

Ah, yeah no worries — in that case you should add a payment method, make sure you’re using the right API endpoint, break text into small enough chunks, etc. and you’ll be good
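
If you're still hitting the requests-per-minute limit after that, one option is to space out your calls on the client side. Here's a minimal sketch, assuming a 60 requests/min limit like the one in your error message. The `Throttle` class is a hypothetical helper, not part of the OpenAI library, and it only spaces calls made from a single process.

```python
import time


class Throttle:
    """Spaces calls so that at most max_per_minute are started per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Call `throttle.wait()` right before each API request. The `clock` and `sleep` parameters are only there so the timing can be faked in tests.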

merry hazel
tall bear
#

this is great 👌 , but it does not "scrape" all the content on the page yet, and a next step would be to crawl a whole website 👼

stray raptor
#

It doesn't work

hard flume
proper bison
hard flume
#

In terms of omitting content from pages, yeah I can definitely see how that happens... I'm using readability.js a la WebGPT and some things definitely fall through there

drowsy cloak
#

This is interesting,

  1. What libraries did you use to read pdf content?
  2. Are there any best practices to divide the text into chunks?
hard flume
drowsy cloak
#

I am working on a similar idea for ebooks. Do you have recommendations on how to measure the accuracy/relevance of this Q&A engine?

hard flume
#

^^ More info: https://arxiv.org/abs/2109.10862 [Recursively summarizing books with human feedback]. I would divide the task into (1) ranking sections by their relevance, and (2) evaluating the ability of a model to answer based on the most relevant sections

#

These two tasks can be evaluated (and optimized) independently of each other
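
For example, here is one possible way to score each stage separately. The metrics are my own assumption (recall@k for ranking, normalized exact match for answers), not something taken from the paper above, and both need a small hand-labeled set of questions with their relevant section IDs and reference answers.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Stage 1: fraction of the relevant sections that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def answer_accuracy(predictions, references):
    """Stage 2: share of answers matching the reference after normalization."""
    if not references:
        return 0.0

    def norm(s):
        # Lowercase and collapse whitespace so trivial differences don't count.
        return " ".join(s.lower().split())

    correct = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return correct / len(references)
```

To keep the two independent, the answering stage would be scored using the hand-labeled relevant sections as context rather than the ranker's output, so ranking mistakes don't leak into the answer score.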

#

Was planning to integrate some kind of feedback like that into this as well