python有哪些提取文本摘要的库

2023-02-25 07:42:02Python028

python有哪些提取文本摘要的库,第1张

1.　google goose

>>> from goose import Goose

>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'

>>> g = Goose()

>>> article = g.extract(url=url)

>>> article.title

u'Occupy London loses eviction fight'

>>> article.meta_description

"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."

>>> article.cleaned_text[:150]

(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi

>>> article.top_image.src

http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

2. python SnowNLP

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

s.words # [u'这个', u'东西', u'真心',

# u'很', u'赞']

s.tags # [(u'这个', u'r'), (u'东西', u'n'),

# (u'真心', u'd'), (u'很', u'd'),

# (u'赞', u'Vg')]

s.sentiments # 0.9769663402895832 positive的概率

s.pinyin # [u'zhe', u'ge', u'dong', u'xi',

# u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁体字」「繁体中文」的叫法在台湾亦很常见。')

s.han # u'「繁体字」「繁体中文」的叫法

# 在台湾亦很常见。'

text = u'''

自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。

它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。

自然语言处理是一门融语言学、计算机科学、数学于一体的科学。

因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，

所以它与语言学的研究有着密切的联系，但又有重要的区别。

自然语言处理并不是一般地研究自然语言，

而在于研制能有效地实现自然语言通信的计算机系统，

特别是其中的软件系统。因而它是计算机科学的一部分。

'''

s = SnowNLP(text)

s.keywords(3) # [u'语言', u'自然', u'计算机']

s.summary(3) # [u'因而它是计算机科学的一部分',

# u'自然语言处理是一门融语言学、计算机科学、

# 数学于一体的科学',

# u'自然语言处理是计算机科学领域与人工智能

# 领域中的一个重要方向']

s.sentences

s = SnowNLP([[u'这篇', u'文章'],

[u'那篇', u'论文'],

[u'这个']])

s.tf

s.idf

s.sim([u'文章'])# [0.3756070762985226, 0, 0]

3. python TextTeaser

#!/usr/bin/python

# -*- coding: utf-8 -*-

from textteaser import TextTeaser

# article source: https://blogs.dropbox.com/developers/2015/03/limitations-of-the-get-method-in-http/

title = "Limitations of the GET method in HTTP"

text = "We spend a lot of time thinking about web API design, and we learn a lot from other APIs and discussion with their authors. In the hopes that it helps others, we want to share some thoughts of our own. In this post, we’ll discuss the limitations of the HTTP GET method and what we decided to do about it in our own API. As a rule, HTTP GET requests should not modify server state. This rule is useful because it lets intermediaries infer something about the request just by looking at the HTTP method. For example, a browser doesn’t know exactly what a particular HTML form does, but if the form is submitted via HTTP GET, the browser knows it’s safe to automatically retry the submission if there’s a network error. For forms that use HTTP POST, it may not be safe to retry so the browser asks the user for confirmation first. HTTP-based APIs take advantage of this by using GET for API calls that don’t modify server state. So if an app makes an API call using GET and the network request fails, the app’s HTTP client library might decide to retry the request. The library doesn’t need to understand the specifics of the API call. The Dropbox API tries to use GET for calls that don’t modify server state, but unfortunately this isn’t always possible. GET requests don’t have a request body, so all parameters must appear in the URL or in a header. While the HTTP standard doesn’t define a limit for how long URLs or headers can be, most HTTP clients and servers have a practical limit somewhere between 2 kB and 8 kB. This is rarely a problem, but we ran up against this constraint when creating the /delta API call. Though it doesn’t modify server state, its parameters are sometimes too long to fit in the URL or an HTTP header. The problem is that, in HTTP, the property of modifying server state is coupled with the property of having a request body. We could have somehow contorted /delta to mesh better with the HTTP worldview, but there are other things to consider when designing an API, like performance, simplicity, and developer ergonomics. In the end, we decided the benefits of making /delta more HTTP-like weren’t worth the costs and just switched it to HTTP POST. HTTP was developed for a specific hierarchical document storage and retrieval use case, so it’s no surprise that it doesn’t fit every API perfectly. Maybe we shouldn’t let HTTP’s restrictions influence our API design too much. For example, independent of HTTP, we can have each API function define whether it modifies server state. Then, our server can accept GET requests for API functions that don’t modify server state and don’t have large parameters, but still accept POST requests to handle the general case. This way, we’re opportunistically taking advantage of HTTP without tying ourselves to it."

tt = TextTeaser()

sentences = tt.summarize(title, text)

for sentence in sentences:

print sentence

4. python sumy

# -*- coding: utf8 -*-

from __future__ import absolute_import

from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser

from sumy.parsers.plaintext import PlaintextParser

from sumy.nlp.tokenizers import Tokenizer

from sumy.summarizers.lsa import LsaSummarizer as Summarizer