How to count the frequency of CET-6 vocabulary with Python


I don't know what material you are using as the source text for the statistics.
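Whatever the source text is, the general shape of the task might look like the sketch below: tokenize the text, then count only the tokens that appear in a CET-6 word list. The function name count_listed_words, the three-word sample list, and the sample sentence are all made up for illustration; a real run would load a full word list and a real corpus.

```python
import re
from collections import Counter

def count_listed_words(text, word_list):
    """Count how often each word from word_list occurs in text (case-insensitive)."""
    wanted = {w.lower() for w in word_list}
    # runs of letters only; digits and punctuation act as separators
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in wanted)

# Hypothetical sample data, not a real CET-6 list
cet6_sample = ["perception", "rival", "unfavorable"]
sample_text = "A rival may hold an unfavorable view. The rival said nothing."

listed_counts = count_listed_words(sample_text, cet6_sample)
print(listed_counts.most_common())
```

This matches exact word forms only, so "perceptions" in a text would not be counted under "perception" unless the words are reduced to a base form first.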

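This example shows one way to count how often each word occurs in a text string with Python. It is shared here for reference. The implementation is as follows: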

# word frequency in a text
# original by vegaseat, 25aug2005 (tested with Python 2.4); shown here in Python 3 syntax
# Chinese wisdom ...
str1 = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""

print("Original string:")
print(str1)
print()

# create a list of words separated at whitespaces
wordList1 = str1.split(None)

# strip any punctuation marks and build modified word list
# start with an empty list
wordList2 = []
for word1 in wordList1:
    # last character of each word
    lastchar = word1[-1:]
    # use a list of punctuation marks
    if lastchar in [",", ".", "!", "?", ";"]:
        word2 = word1.rstrip(lastchar)
    else:
        word2 = word1
    # build a wordList of lower case modified words
    wordList2.append(word2.lower())

print("Word list created from modified string:")
print(wordList2)
print()

# create a wordfrequency dictionary
# start with an empty dictionary
freqD2 = {}
for word2 in wordList2:
    freqD2[word2] = freqD2.get(word2, 0) + 1

# create a sorted list of keys
# all words are lower case already
keyList = sorted(freqD2.keys())

print("Frequency of each word in the word list (sorted):")
for key2 in keyList:
    print("%-10s %d" % (key2, freqD2[key2]))
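For comparison, the same count can be sketched more compactly with collections.Counter from the standard library. This is an alternative written for illustration, not part of the original snippet. It strips punctuation from both ends of each token rather than only the last character, so its results can differ on other inputs.

```python
import string
from collections import Counter

text = """Man who run in front of car, get tired.
Man who run behind car, get exhausted."""

# strip punctuation from both ends of every whitespace-separated token
words = [token.strip(string.punctuation).lower() for token in text.split()]
words = [w for w in words if w]  # drop tokens that were pure punctuation

freq = Counter(words)

# print the words alphabetically with their counts
for word in sorted(freq):
    print("%-10s %d" % (word, freq[word]))
```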

I hope this helps with your Python programming.

Code:

passage="""Editor’s Note: Looking through VOA's listener mail, we came across a letter that asked a simple question. "What do Americans think about China?" We all care about the perceptions of others. It helps us better understand who we are. VOA Reporter Michael Lipin begins a series providing some answers to our listener's question. His assignment: present a clearer picture of what Americans think about their chief world rival, and what drives those perceptions.

Two common American attitudes toward China can be identified from the latest U.S. public opinion surveys published by Gallup and Pew Research Center in the past year.

First, most of the Americans surveyed have unfavorable opinions of China as a whole, but do not view the country as a threat toward the United States at the present time.

Second, most survey respondents expect China to pose an economic and military threat to the United States in the future, with more Americans worried about the perceived economic threat than the military one.

Most Americans view China unfavorably

To understand why most Americans appear to have negative feelings about China, analysts interviewed by VOA say a variety of factors should be considered. Primary among them is a lack of familiarity.

"Most Americans do not have a strong interest in foreign affairs, Chinese or otherwise," says Robert Daly, director of the Kissinger Institute on China and the United States at the Washington-based Wilson Center.

Many of those Americans also have never traveled to China, in part because of the distance and expense. "That means that like most human beings, they take short cuts to understanding China," Daly says.

Rather than make the effort to regularly consume a wide range of U.S. media reports about China, analysts say many Americans base their views on widely-publicized major events in China's recent history."""

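# replace punctuation marks and line breaks with spaces
passage = (passage.replace(",", " ").replace(".", " ").replace(":", " ").replace("’", "'")
           .replace('"', " ").replace("?", " ").replace("!", " ").replace("\n", " "))
passagelist = passage.split(" ")  # split into individual words
pc = passagelist.copy()  # iterate over a copy so that removing items is safe
for pi in pc:  # each string in turn
    if pi.count(" ") == len(pi):  # empty string left over from consecutive spaces
        passagelist.remove(pi)  # delete this item

worddict = {}
for pj in passagelist:  # each word in turn
    if pj not in worddict:  # not counted yet
        worddict[pj] = 1  # add the word with a count of 1
    else:  # already counted
        worddict[pj] += 1  # increase the count by 1

# output is in alphabetical order, with tabs between the columns
worddictlist = list(worddict.keys())  # extract all the words
worddictlist.sort()  # sort (case-sensitive, so capitalized words come before lowercase ones)
worddict2 = {}
for k in worddictlist:
    worddict2[k] = worddict[k]  # dictionary in sorted order

print("Word\t\tCount")
for m in worddict2:  # print every entry
    # number of tabs depends on the word length; to paste the output into a spreadsheet, use tabs = 2 instead
    # words of 16 or more characters get zero tabs, so the count runs straight into the word
    tabs = (23 - len(m)) // 8
    print("%s%s%d" % (m, "\t" * tabs, worddict[m]))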

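Note: the text assigned to passage (shown in bold in the original post) is the passage to be analyzed; replace it with your own. The output I get is: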

American 1

Americans 9

Center 2

China 10

China's 1

Chinese 1

Daly 2

Editor's 1

First 1

Gallup 1

His 1

Institute 1

It 1

Kissinger 1

Lipin 1

Looking 1

Many 1

Michael 1

Most 2

Note 1

Pew 1

Primary 1

Rather 1

Reporter 1

Research 1

Robert 1

S 2

Second 1

States 3

That 1

To 1

Two 1

U 2

United 3

VOA 2

VOA's 1

Washington-based1

We 1

What 1

Wilson 1

a 10

about 6

across 1

affairs 1

all 1

also 1

among 1

an 1

analysts 2

and 5

answers 1

appear 1

are 1

as 2

asked 1

assignment 1

at 2

attitudes 1

base 1

be 2

because 1

begins 1

beings 1

better 1

but 1

by 2

came 1

can 1

care 1

chief 1

clearer 1

common 1

considered 1

consume 1

country 1

cuts 1

director 1

distance 1

do 3

drives 1

economic 2

effort 1

events 1

expect 1

expense 1

factors 1

familiarity 1

feelings 1

foreign 1

from 1

future 1

have 4

helps 1

history 1

human 1

identified 1

in 5

interest 1

interviewed 1

is 1

lack 1

latest 1

letter 1

like 1

listener 1

listener's 1

mail 1

major 1

make 1

many 1

means 1

media 1

military 2

more 1

most 4

negative 1

never 1

not 2

of 10

on 2

one 1

opinion 1

opinions 1

or 1

others 1

otherwise 1

our 1

part 1

past 1

perceived 1

perceptions 2

picture 1

pose 1

present 2

providing 1

public 1

published 1

question 2

range 1

recent 1

regularly 1

reports 1

respondents 1

rival 1

say 2

says 2

series 1

short 1

should 1

simple 1

some 1

strong 1

survey 1

surveyed 1

surveys 1

take 1

than 2

that 2

the 16

their 2

them 1

they 1

think 2

those 2

threat 3

through 1

time 1

to 7

toward 2

traveled 1

understand 2

understanding 1

unfavorable 1

unfavorably 1

us 1

variety 1

view 2

views 1

we 2

what 2

who 1

whole 1

why 1

wide 1

widely-publicized1

with 1

world 1

worried 1

year 1

(The columns are supposed to be aligned, but the alignment was lost when pasting here.)
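One way to keep the columns aligned without relying on tab stops is to pad every word with spaces to the length of the longest one. The sketch below assumes that approach. The helper name format_table and the three sample entries are invented for illustration.

```python
def format_table(counts):
    """Return one line per word, with the word column padded to a common width."""
    # width of the longest word plus two spaces of separation
    width = max((len(word) for word in counts), default=0) + 2
    return ["%s%d" % (word.ljust(width), n) for word, n in counts.items()]

# Hypothetical sample taken from the words that broke the tab layout
sample_counts = {"Washington-based": 1, "a": 10, "widely-publicized": 1}

lines = format_table(sample_counts)
for line in lines:
    print(line)
```

Space padding only lines up in a monospaced font. For pasting into a spreadsheet, a single tab between the word and the count is still the better separator.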

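Note: known problems that I have not solved yet

1. Case: the program cannot tell which words must always be capitalized (proper nouns) from words that are capitalized only because they start a sentence.

2. The 's problem: a word that contains 's is currently counted together with the 's as a single, separate word.

3. Sorting: the output is sorted alphabetically, and I have not managed to sort it by number of occurrences.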
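The three points above can be approximated roughly as in the sketch below. It rests on simplifying assumptions: every token is lowercased (which merges "Most" and "most" but also loses proper nouns), a trailing 's is simply cut off, and Counter.most_common from the standard library does the sorting by count. The helper normalize and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def normalize(token):
    """Lowercase a token and drop a trailing possessive 's."""
    token = token.lower()
    if token.endswith("'s"):
        token = token[:-2]
    return token

sample = "China's view. VOA's listener asked what Americans think about China."

# a word is a run of letters, optionally followed by an apostrophe part
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sample)
counts = Counter(normalize(t) for t in tokens)

# most_common() returns (word, count) pairs sorted by descending count
for word, n in counts.most_common(3):
    print("%-12s %d" % (word, n))
```

Cutting off 's also turns contractions such as "it's" into "it", so this is only a rough fix. Telling proper nouns from sentence-initial capitals would need extra information, such as a list of names.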