Posted: Feb 1, 2015 [ # 16 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Having decided on my initial methods for sentence extraction, I find that I have indeed just re-invented the wheel, as there is research sharing my hypothesis as far back as 1968: http://courses.ischool.berkeley.edu/i256/f06/papers/edmonson69.pdf (the description of the methods starts at page 271).
The most useful clues to importance are:
1. keywords indicating importance or unimportance, e.g. “significant”, “important”;
2. statistical keyword frequency (actually I don’t have faith in this one);
3. titles and headers;
4. location: the first and last sentences of a paragraph are typically its main statement and conclusion.
1-9-6-8. A lost art?
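To make this concrete, here is a rough sketch in Python of how these four clues might combine into a single sentence score. The weights, cue words and function names are illustrative guesses of mine, not the paper's actual method:

# Clue 1: cue words; clue 2: keyword frequency; clue 3: title words; clue 4: location.
CUE_BONUS = {"significant": 2, "important": 2, "hardly": -2, "impossible": -2}

def score_sentence(words, position, paragraph_size, title_words, keyword_freq):
    score = 0
    for w in (w.lower() for w in words):
        score += CUE_BONUS.get(w, 0)            # 1. cue words for (un)importance
        score += 0.1 * keyword_freq.get(w, 0)   # 2. statistical keyword frequency
        if w in title_words:                    # 3. words from titles and headers
            score += 1
    if position == 0 or position == paragraph_size - 1:
        score += 1                              # 4. location within the paragraph
    return score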
Posted: Feb 6, 2015 [ # 17 ]
Burke Carl
Member
Total posts: 1
Joined: Feb 6, 2015
Document summarization is certainly not a lost art; Google searches for “document summarization” and the Wikipedia page on “Automatic summarization” should point you in the right direction. Anything by Inderjeet Mani, though dated now, should be helpful. The best extractors typically use statistical characteristics of the words, but not necessarily the frequency by itself.
Sentence extraction has been the way to go, since parsing still isn’t solved well enough to make abstraction useful.
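For example, TF-IDF is one such statistic: a word weighs more if it is frequent in the document at hand but rare across other documents. A minimal sketch, illustrative rather than any particular system's code:

import math

def tf_idf(word, doc_words, corpus):
    # doc_words: list of words in this document; corpus: list of word-lists.
    tf = doc_words.count(word) / len(doc_words)
    df = sum(1 for doc in corpus if word in doc)
    return tf * math.log(len(corpus) / (1 + df))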
Posted: Feb 7, 2015 [ # 18 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Welcome to the forum, and thank you for the reference to Inderjeet Mani. Apparently I’d overlooked the less-than-detailed descriptions on Wikipedia and clicked straight through to its entry on sentence extraction. On a second read, they do describe the importance of sentence position, n-grams and sentence similarity. Although I am sure the latter two work, I am not sure how they would reflect a deliberate choice by the writer, the way choice of words does (I’m specifically aiming to extract what the writer was intentionally trying to say).
Your references also point to the problem that motivates my programming: there is far too much to read. I guess I just haven’t come across the better extractors that are publicly available online.
Posted: Feb 8, 2015 [ # 19 ]
Senior member
Total posts: 336
Joined: Jan 28, 2011
Don Patrick - Feb 1, 2015:
The most useful clues to importance are:
1. keywords indicating importance or unimportance, e.g. “significant”, “important”;
2. statistical keyword frequency (actually I don’t have faith in this one);
3. titles and headers;
4. location: the first and last sentences of a paragraph are typically its main statement and conclusion.
Sometimes it is good to reinvent the wheel so as to fully understand the problem to be solved. The clues listed above do indeed seem to be common indicators of meaning or content. I have experimented with a lot of crude methods, for instance stopword subtraction, proper noun extraction, full POS analysis, and the “first/last sentence” trick, and they are each useful in their own very limited ways. Semantic similarity is a convenient way to create lifelike sentences and can be done using a combination of the above techniques.
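Stopword subtraction, for instance, can be as crude as this (the stopword list here is just a stub):

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "that"}

def content_words(sentence):
    # Strip function words, leaving candidate keywords behind.
    return [w for w in sentence.lower().split() if w not in STOPWORDS]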
Posted: Feb 9, 2015 [ # 20 ]
Member
Total posts: 22
Joined: Nov 11, 2014
Don,
Have you seen this article - http://thetokenizer.com/2013/04/28/build-your-own-summary-tool/
Code here:
Node.js - https://github.com/jbrooksuk/node-summary
Python - https://gist.github.com/shlomibabluki/5473521
This might be a good place to start if you were going to re-invent or improve prior art.
Rob.
Posted: Feb 9, 2015 [ # 21 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Thank you, that was fascinating (I learn best from examples). As I understand it, the tool extracts the sentences that have the most words in common with surrounding sentences, assuming that each paragraph contains one sentence that covers most of the others. The idea behind it seems quite reasonable, but it doesn’t take into account what any of the words mean. I prefer doing things the hard way.
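Sketched in Python, my reading of that approach looks roughly like this (function and variable names are mine, not the article's):

def overlap(s1, s2):
    # Word overlap between two sentences, normalised by their average length.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / ((len(w1) + len(w2)) / 2.0)

def best_sentence(paragraph):
    # The sentence that shares the most words with the rest of its paragraph.
    scores = [sum(overlap(s, t) for t in paragraph if t is not s) for s in paragraph]
    return paragraph[scores.index(max(scores))]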
Here is the result of my first attempt. Instead of extracting sentences, I’m having the program rewrite the text as HTML that highlights the important ones (easier to debug). (original source here)
Importance from high to low: red, orange, black, grey, struck through.
At this point the program mainly looks at key words such as “important”, “major” or “for example”. This text is semantically way over my AI’s head, so there are a few misclassified sentences. And as sentence splitting is done in an internal format, I can only highlight an entire string of clauses in the original text format.
Unlike many other summarisers, I do not specify how many sentences the program should extract. In my opinion, sentences are either important or they are not, so my program may find nothing important, or all of it. It depends on why you want a text summarised; I just want to cut down on trivial information.
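The difference in selection policy, sketched (the threshold value here is arbitrary):

def extract(sentences, scores, threshold=1.0):
    # Keep every sentence that clears the bar, however many or few that is,
    # rather than taking the top N.
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]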
Posted: Mar 11, 2015 [ # 22 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
I found the list of summariser tools on the site Vince linked, but apart from maybe the first API and freesummarizer.com, there is far too much focus on keyword frequency, at least for my goal of extracting the writer’s intent rather than the average subject matter.
I’m fairly satisfied with my second attempt. It now manages to find a clue of (un)importance for every sentence, because I moved the summarising mechanism out of the semantic phase, whose frequent misunderstandings interfered with the simpler scoring methods.
Instead of averaging the scores of clues, I let high-level clues override low-level clues. For instance, I treat the first two and the last sentences of paragraphs as important by default, unless they contain a phrase like “Once upon a time…”, which is merely introductory.
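In sketch form, the override logic works like this (the rules shown are just examples, not my full set of clues):

INTRO_PHRASES = ("once upon a time",)

def importance(sentence, position, paragraph_size):
    # Rules are checked in priority order and the first match wins,
    # instead of averaging all clue scores together.
    s = sentence.lower()
    if any(p in s for p in INTRO_PHRASES):              # high-level clue
        return "unimportant"
    if position < 2 or position == paragraph_size - 1:  # low-level location clue
        return "important"
    return "neutral"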
The red sentences are almost exclusively rated by the word “must”, which may perhaps be of further use for automatically classifying the article as a recommendation. I actually haven’t found a use yet for keyword frequency or title keywords. I like things this way because I can pinpoint exactly why the program considered a sentence important.
Since I’ve only “trained” it on this one article, I’m sure I’ll be adding to the clues later. For now, I’ll continue to work on the semantics and language generation, which were clearly not ready for prime time.
Posted: Dec 2, 2017 [ # 23 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Now two years later, I have transferred my summarisation algorithm to a browser add-on you can all use.
As per usual I wrote a blog article about how it works, though I’ve cut out three paragraphs about the issues of porting from A.I. to add-on because the text was getting too lengthy. If only you had a summarizer or something…
https://artistdetective.wordpress.com/2017/12/02/how-to-summarize-the-internet
I could use some feedback, if you want to try it out. The code for dividing the summary into paragraphs is still pretty new, and it may need some tweaks if I want to get good reviews.
Posted: Dec 3, 2017 [ # 24 ]
Senior member
Total posts: 308
Joined: Mar 31, 2012
Don,
I’ve been using your Summarizer for a few weeks and it does a really nice job of producing the gist of a web page, saving time in the process.
Thanks for your Noteworthy efforts!
Posted: Dec 3, 2017 [ # 25 ]
Member
Total posts: 6
Joined: Dec 3, 2017
Here is a survey of the methods that people in the scientific community use to summarize:
https://arxiv.org/abs/1707.02268
On the other hand, a friend of mine has been doing this: he chooses the sentences that are most similar to the rest of the text, using Latent Semantic Analysis.
You could try something similar…
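A minimal LSA sketch with scikit-learn, assuming that is roughly what he did: project TF-IDF sentence vectors onto a few latent topics, then rank each sentence by its similarity to the document's average topic vector.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_rank(sentences, n_topics=2):
    # Needs more sentences than topics for the SVD to be meaningful.
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    topics = TruncatedSVD(n_components=n_topics).fit_transform(X)
    centroid = topics.mean(axis=0)
    sims = topics @ centroid / (np.linalg.norm(topics, axis=1)
                                * np.linalg.norm(centroid) + 1e-9)
    # Most representative sentences first.
    return sorted(zip(sims.tolist(), sentences), reverse=True)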
Regards,
Hristo
Posted: Dec 3, 2017 [ # 26 ]
Member
Total posts: 6
Joined: Dec 3, 2017
http://textsummarization.net/text-summarizer
Posted: Dec 7, 2017 [ # 27 ]
Guru
Total posts: 1009
Joined: Jun 13, 2013
Thank you for that paper, Hristo, it gives a nice overview. I guess I oversimplified LSA by calling it “word vectors” in my article. However, all the modern methods focus on retrieving the general subject matter, which is exactly why I took a different approach. The parts of the text that contain the most topical words, such as an introduction would, are not necessarily the most important.
Art: With an endorsement like that, who needs marketing? I’m glad you find it satisfying; there are a few more improvements in accuracy forthcoming.