Using Multithreading to Process Large Files in Python



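Suppose you have a very large file, perhaps tens of gigabytes. It has to be read one portion at a time, with each portion processed before the next one is read.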

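The with open(...) as f construct already handles the hard part internally: for line in f treats the file object as an iterator and reads one line per iteration, so the whole file is never loaded into memory at once.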
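As a minimal sketch of the line-by-line iteration described above: the function name count_lines and the throwaway sample file are hypothetical and exist only for illustration. The point is that only one line is held in memory at a time, whatever the file size.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines of a file by iterating over the file object lazily."""
    total = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object yields one line per iteration
            total += 1
    return total


if __name__ == "__main__":
    # Build a small throwaway file so the sketch is self-contained.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("192.168.10.11 1011\n192.168.10.12 1012\n")
    print(count_lines(tmp.name))
    os.remove(tmp.name)
```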

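The idea behind the program below is to keep the data that has been read in a list, process the list once it reaches a given length, clear it when processing is done, and then carry on reading.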
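The fill, process and clear cycle described above could be sketched as follows. The names process_in_batches, batch_size and handle_batch are assumptions made for this illustration and do not come from the script; the lines argument can be any iterable of strings, including an open file object.

```python
def process_in_batches(lines, batch_size, handle_batch):
    """Collect lines into a list and hand the list off every batch_size items."""
    batch = []
    for line in lines:
        batch.append(line.rstrip("\n"))
        if len(batch) >= batch_size:
            handle_batch(batch)
            # Start a fresh list rather than clearing in place,
            # so the handler may safely keep the one it was given.
            batch = []
    if batch:  # the final batch is usually shorter than batch_size
        handle_batch(batch)


if __name__ == "__main__":
    sample = ["line %d\n" % i for i in range(7)]
    process_in_batches(sample, 3, lambda b: print(len(b), b))
```

With a real file, the call would look like process_in_batches(f, 1000, handler) inside a with open(...) as f block, so that at most one batch is in memory at any time.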

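Background: a Python script reads each line of a file into a list, then loops over the list and processes every element.

Core idea: use multiple threads to run a single for-loop function.

Module: threading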

Part 1:

The multithreaded script (it uses only two threads; t1 runs fewer loop iterations than t2)


#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
import time
import threading
import datetime

fileinfo = sys.argv[1]

# Lists filled from the input file
host_list = []
port_list = []


# Read the file and store each line's host and port in the lists
def CreateList():
    with open(fileinfo, 'r') as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:  # skip blank or malformed lines
                continue
            host_list.append(fields[0])
            port_list.append(fields[1])
    return host_list, port_list


# Single-threaded loop function, commented out
#def CreateInfo():
#    for i in range(0, len(host_list)):  # single thread: loop over the whole list
#        time.sleep(1)
#        TimeMark = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
#        print("The Server's HostName is %-15s and Port is %-4d !!! [%s]" % (host_list[i], int(port_list[i]), TimeMark))


# Loop function executed by each thread
def MainRange(start, stop):  # start and stop are the list index bounds for this thread
    for i in range(start, stop):
        time.sleep(1)
        TimeMark = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("The Server's HostName is %-15s and Port is %-4d !!! [%s]" % (host_list[i], int(port_list[i]), TimeMark))


# Run the function to build the lists
CreateList()
# Split the list into two parts; mid is the middle index
mid = len(host_list) // 2

# Multithreading part
threads = []
t1 = threading.Thread(target=MainRange, args=(0, mid))
threads.append(t1)
t2 = threading.Thread(target=MainRange, args=(mid, len(host_list)))
threads.append(t2)

for t in threads:
    t.daemon = True
    t.start()
# Wait for every thread to finish, not only the last one started
for t in threads:
    t.join()
print("ok")

That is the complete script.
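The script hard-codes two threads and a single midpoint. One way to generalize the same index-splitting idea to any number of threads is sketched below. This is not part of the original script: the helper names split_ranges and run_in_threads are hypothetical, and unlike the script, any leftover items go to the first ranges rather than the last.

```python
import threading


def split_ranges(total, parts):
    """Split range(total) into `parts` contiguous (start, stop) pairs."""
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges each take one additional item.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def run_in_threads(worker, total, parts):
    """Start one thread per index range, then wait for all of them."""
    threads = [
        threading.Thread(target=worker, args=r)
        for r in split_ranges(total, parts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    seen = []
    lock = threading.Lock()

    def worker(start, stop):
        for i in range(start, stop):
            with lock:  # protect the shared list
                seen.append(i)

    run_in_threads(worker, 15, 4)
    print(sorted(seen) == list(range(15)))
```

A function with the same (start, stop) signature as MainRange could be passed as the worker. Threads help here because each iteration spends its time sleeping or waiting on I/O; for CPU-bound work in CPython the global interpreter lock would limit the gain.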

----------------------------------------------------------------------

The input file

File contents:

[root@monitor2 logdb]# cat hostinfo.txt
192.168.10.11 1011
192.168.10.12 1012
192.168.10.13 1013
192.168.10.14 1014
192.168.10.15 1015
192.168.10.16 1016
192.168.10.17 1017
192.168.10.18 1018
192.168.10.19 1019
192.168.10.20 1020
192.168.10.21 1021
192.168.10.22 1022
192.168.10.23 1023
192.168.10.24 1024
192.168.10.25 1025

Output

Single-threaded run of the script:

[root@monitor2 logdb]# ./Threadfor.py hostinfo.txt
The Server's HostName is 192.168.10.10 and Port is 1010 !!! [2017-01-10 14:25:14]
The Server's HostName is 192.168.10.11 and Port is 1011 !!! [2017-01-10 14:25:15]
The Server's HostName is 192.168.10.12 and Port is 1012 !!! [2017-01-10 14:25:16]
.
.
.
The Server's HostName is 192.168.10.25 and Port is 1025 !!! [2017-01-10 14:25:29]

Multithreaded run of the script:

[root@monitor2 logdb]# ./Threadfor.py hostinfo.txt
The Server's HostName is 192.168.10.11 and Port is 1011 !!! [2017-01-10 14:51:51]
The Server's HostName is 192.168.10.18 and Port is 1018 !!! [2017-01-10 14:51:51]
The Server's HostName is 192.168.10.12 and Port is 1012 !!! [2017-01-10 14:51:52]
The Server's HostName is 192.168.10.19 and Port is 1019 !!! [2017-01-10 14:51:52]
The Server's HostName is 192.168.10.13 and Port is 1013 !!! [2017-01-10 14:51:53]
The Server's HostName is 192.168.10.20 and Port is 1020 !!! [2017-01-10 14:51:53]
The Server's HostName is 192.168.10.14 and Port is 1014 !!! [2017-01-10 14:51:54]
The Server's HostName is 192.168.10.21 and Port is 1021 !!! [2017-01-10 14:51:54]
The Server's HostName is 192.168.10.15 and Port is 1015 !!! [2017-01-10 14:51:55]
The Server's HostName is 192.168.10.22 and Port is 1022 !!! [2017-01-10 14:51:55]
The Server's HostName is 192.168.10.16 and Port is 1016 !!! [2017-01-10 14:51:56]
The Server's HostName is 192.168.10.23 and Port is 1023 !!! [2017-01-10 14:51:56]
The Server's HostName is 192.168.10.17 and Port is 1017 !!! [2017-01-10 14:51:57]
The Server's HostName is 192.168.10.24 and Port is 1024 !!! [2017-01-10 14:51:57]
The Server's HostName is 192.168.10.25 and Port is 1025 !!! [2017-01-10 14:51:58]