#GTP-Summarize

shadow rapids
#

Wondering if a long article/paper is worth reading in full or in part? We all know that GPT-3 text-davinci-003 is good at summarization. But I still spend way too much time reading/skimming articles and papers, so I decided to make a tool to help with that. This weekend ChatGPT, Copilot, and I created an open-source summarizer tool to provide both an Overall Summary and more detailed Section Summaries of any paper/article/website: https://github.com/scottleibrand/gpt-summarizer

The script extracts text from a given file or URL and splits it into sections. It writes the extracted text and each section to a separate text file. It generates and writes out a summary for each subsection, and a combined section summary if needed.

It splits the document into subsections of 1000-3000 tokens based on HTML section headings or numbered section headings. It uses text-davinci-003 to generate and write section summaries, and then summarizes the lower-level summaries to produce an overall summary.
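
As a rough illustration of the split-then-summarize flow described above, here is a minimal sketch. It is not the script's actual code: `count_tokens`, `split_on_headings`, `summarize_document`, and the `first_line` stand-in summarizer are hypothetical names, tokens are approximated by whitespace-separated words, and a real run would pass in a function that calls text-davinci-003 instead of `first_line`.

```python
import re
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_on_headings(text: str) -> List[str]:
    """Split before lines that look like numbered headings ('2 Methods', '3.1 Data')."""
    # Zero-width split at the start of any line beginning with a section number.
    # This is a heuristic: a line such as '2019 was a busy year' would also match.
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*[ \t]+\S)", text)
    return [p.strip() for p in parts if p.strip()]


def summarize_document(
    text: str, summarize: Callable[[str], str]
) -> Tuple[List[str], str]:
    """Summarize each section, then summarize the joined section summaries."""
    sections = split_on_headings(text)
    section_summaries = [summarize(s) for s in sections]
    overall = summarize("\n".join(section_summaries))
    return section_summaries, overall


def first_line(text: str) -> str:
    # Stand-in for a model call so the sketch runs offline.
    return text.splitlines()[0]


doc = (
    "Preamble here\n"
    "1 Introduction\nSome intro text.\n"
    "2 Methods\nSome methods text."
)
section_summaries, overall = summarize_document(doc, first_line)
print(section_summaries)
print(overall)
```

If the joined section summaries were still too long for one request, the same step could be repeated on groups of summaries before producing the overall summary; `count_tokens` is where such a length check would go.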

It's currently implemented as a python script you can run from the command line. Remember to set your OpenAI API key as an environment variable.
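
A small sketch of reading and sanity-checking the key before making any requests. Assumptions: the key lives in `OPENAI_API_KEY` (the variable the `openai` Python library reads by default), keys start with `sk-`, and `load_api_key` is a hypothetical helper rather than part of the script.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, failing early on obvious mistakes."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumption: keys begin with 'sk-'. This catches stray characters
    # pasted along with the key, such as a leading '<'.
    if not key.startswith("sk-"):
        raise ValueError("OPENAI_API_KEY does not look like an OpenAI key")
    return key
```

Checking the key up front tends to give a clearer message than whatever error the first API call happens to raise.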

The intended use is that both the overall summary and the subsection summaries are worth reading (in order) to determine whether to spend the time reading the entire article/paper, or specific sections of it.

It might also form the basis for a summarization service, such as for anyone doing literature reviews, reading lots of papers/articles for research, etc. With enough users, it'd be worthwhile to cache generated summaries and make them accessible to everyone.

It might even be the basis for an intelligent reader app, which would allow you to smoothly zoom in from an article headline/subhead through AI-generated overall and section summaries all the way down to the article text itself.

Have any other ideas for potential applications? Want to try it yourself? Download a copy, give it a try, and share your thoughts and experiences!

fresh pivot
#

I would love to see this as something like a library that can be used in other projects as a python package, awesome work!!

floral owl
#

Using a binary tree to summarize large documents is exactly the idea I had. Great work!

regal venture
#

Nice. I’ve been playing with something similar for latex source files (rather than compiled PDFs, since it seems lossy to convert equations back to text), will be interested to see what “best practices” emerge when summarizing like this

shadow rapids
Nice. I was trying to convert PDFs to html or xml or even latex to see if I could get GPT3 to summarize from the raw markup: PDF doesn’t make the outline / section layout easy to access by default. Didn’t make any progress in that direction, though, so just stuck with extracting text, since I don’t read math either. If you get the latex thing working, it would be interesting to build a translator from math to English/code for people like me who speak those far more fluently than formulas.

regal venture
#

It does (from my POV) a reasonably good job explaining raw latex code in "simple terms"

#

Just out of the box, i.e., pasting that code into ChatGPT or TD3 or whatever and prompting "explain this in a way that a middle school student could understand"

shadow rapids
#

Nice. Can you get raw latex from arxiv or similar?

regal venture
#

Yep, arxiv requires that everyone submit latex source, so it's available for anything posted there

dense dune
#

they have an API

#

probably the best u can find
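
For anyone who wants to try this, a hedged sketch of building the relevant arXiv URLs. Assumptions: the metadata query endpoint is `http://export.arxiv.org/api/query` with an `id_list` parameter, the submitted source is served from `https://arxiv.org/e-print/<id>`, and the identifier below is only an example. The helper names are made up for illustration, and `download_source` is not exercised here because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ARXIV_API = "http://export.arxiv.org/api/query"
ARXIV_EPRINT = "https://arxiv.org/e-print/"


def arxiv_metadata_url(arxiv_id: str) -> str:
    """URL of the Atom feed describing one paper."""
    return ARXIV_API + "?" + urlencode({"id_list": arxiv_id})


def arxiv_source_url(arxiv_id: str) -> str:
    """URL of the submitted source for one paper."""
    return ARXIV_EPRINT + arxiv_id


def download_source(arxiv_id: str, dest_path: str) -> None:
    """Save the raw source download to dest_path (requires network access)."""
    with urlopen(arxiv_source_url(arxiv_id)) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())


print(arxiv_metadata_url("1706.03762"))
print(arxiv_source_url("1706.03762"))
```

The source download is typically a compressed archive or a single compressed file rather than plain `.tex`, so it would need unpacking before the LaTeX can be passed to a model.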

haughty badge
#

cool

dense dune
#

hey any updates on this @shadow rapids

dense dune
#

Hey, I was hoping to be able to search through PDFs with embeddings and use the matching context to write answers using GPT.

shadow rapids
#

I also made https://discord.com/channels/974519864045756446/1056090070060376164 to create embeddings for slack history exports and search through those: it wouldn’t be too difficult to combine the pdf export approach with that one, but I don’t think anyone has published an open source tool that does that yet for local private documents.
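
A minimal sketch of the search half of that idea: embed each text chunk, embed the question, and rank chunks by cosine similarity. Everything here is an assumption for illustration. `toy_embed` is a word-count stand-in so the example runs offline; a real tool would swap in an embedding model and would send the top chunks to GPT as context.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    # Stand-in embedding: lowercase word counts (not a real embedding model).
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(
    query: str,
    chunks: List[str],
    embed: Callable[[str], Vector] = toy_embed,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, highest score first."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    "the model uses attention layers",
    "we collected data from hospitals",
    "training used the adam optimizer",
]
results = search("which optimizer was used for training", chunks, top_k=1)
print(results[0][1])
```

In practice the chunk embeddings would be computed once and cached, since only the query changes between searches.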

clever sparrow
#

Hello, I tried this, and it gives me an error saying the text has too many tokens (2500)?

clever sparrow
#

I have it set up correctly now; I had somehow left a '<' at the start of my API key... all good! It works 🙂
Should I be getting summary_1/2/3/etc. as text files?

rough wadi
#

Thanks for sharing this, great work! I wrote a utility for myself to summarize and ask questions of YouTube videos by looking at the transcript. This gives me some ideas on how to support longer videos.

clever sparrow
#

wonder if a summary of the summaries would work 😆

oblique heron
#

I’m interested in this as well. Where I seem to be stuck with engineering papers is they come out grammatically incorrect because of the latex. As I understand it, gpt3 does not care about grammatically correct documents when it summarizes. I’ll let you know what I find.

clever sparrow
#

@shadow rapids so the summary script was working. now i get this error, even for the example papers you put in the repo.

was hoping you knew what the solution to this is:

```
    f.write(text)
  File "C:\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 9493: character maps to <undefined>
```
#

I solved the issue:
Lines: 434, 457, 499, 513, 530, 532, 547, 564, 588, 603, 612, 630, 648

Add the attribute encoding='utf-8' to the code line like so:

with open(overall_summary_path, 'w', encoding='utf-8') as f:

hope this helps some people!
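
To show why the explicit encoding matters, here is a small self-contained sketch. `write_text` and the sample string are made up for illustration; the two non-ASCII characters are the ones named in the tracebacks in this thread.

```python
import os
import tempfile


def write_text(path: str, text: str) -> None:
    # Explicit UTF-8 so the result does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


sample = "p \u2264 0.05 and a \ufb01 ligature"

# cp1252 has no mapping for these two characters, so encoding with it
# raises UnicodeEncodeError.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

path = os.path.join(tempfile.mkdtemp(), "summary.txt")
write_text(path, sample)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == sample)
```

Without an `encoding` argument, `open()` in text mode uses a locale-dependent default, which was cp1252 on the Windows machine in the traceback.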

shadow rapids
#

Thx for the report. Can you provide a bit more context (maybe the full traceback showing what line of my code called the lib/module that crashed)? If you want, a GitHub issue would also be a good place to put that detail.

clever sparrow
#

Sure! I'll put it here as well as on the GitHub page.

#

Here is a Traceback when running the 'summarize' script without the changes.

```
  File "C:\PythonScriptLocation\summarize.py", line 458, in <module>
    f.write(text)
  File "C:\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 793: character maps to <undefined>
```
#

@shadow rapids Can I send you a direct message?

shadow rapids
#

Sure. That is probably an artifact of switching from tiktoken to the older tokenizer. Might be worth switching back: not sure how many people need Python 3.6 compatibility…

cosmic patio
#

Nice, how does this compare with scispace (typeset.io)? I figured that they use GPT-3, but I don't know for sure.

clever sparrow
#

I'm not sure about those sites, but this is completely open source.

lone valley
#

This will be really helpful if someone can give it a suitable UX design!