Losing some samples while writing to file in Python

I'm continuously getting readings from an ADC in Python, but in the process of writing them to a file I lose some samples because of a small delay. Is there a way I could avoid losing these samples (I'm sampling at 100 Hz)?
I'm using multithreading, but in the process of writing the data to a file and cleaning the list used for it, I always lose some samples. The code is copied here as I have written it, and all advice is welcome.
Thanks in advance.
import threading
import time
from random import randint
import os
from datetime import datetime
import ADS1256
import RPi.GPIO as GPIO
import sys
import csv

ADC = ADS1256.ADS1256()
ADC.ADS1256_init()

value_list = []
# adc_reading() reads ADC values and appends them to a list continuously.
def adc_reading():
    global value_list
    value_list = []
    while True:
        adc_value = ADC.ADS1256_GetAll()
        timestamp = time.time()
        x = adc_value[1]
        y = adc_value[2]
        z = adc_value[3]
        value_list.append([timestamp, x, y, z])

# Creates a new file every 60 seconds with the values gathered in adc_reading().
def cronometro():
    global value_list
    while True:
        contador = 60
        inicio = time.time()
        diferencia = 0
        while diferencia <= contador:
            diferencia = time.time() - inicio
        write_to_file(value_list)

# write_to_file() writes the values gathered in adc_reading() to a file every 60 seconds.
def write_to_file(lista):
    nombre_archivo = str(int(time.time())) + ".finish"
    with open(nombre_archivo, 'w') as f:
        # using csv.writer method from CSV package
        write = csv.writer(f)
        write.writerows(lista)
    value_list = []

escritor = threading.Thread(target=adc_reading)
temporizador = threading.Thread(target=cronometro)
escritor.start()
temporizador.start()

At 100 Hz, I have to wonder whether the write operation really takes longer than 10 ms. You could probably do both operations in the same loop: collect data in a buffer and write it (about 6000 values) once every 60 seconds without incurring more than a few milliseconds of delay:
import time
import ADS1256
import csv

ADC = ADS1256.ADS1256()
ADC.ADS1256_init()

def adc_reading():
    buffer = []
    contador = 60
    while True:
        check = inicio = time.time()
        while check - inicio <= contador:
            adc_value = ADC.ADS1256_GetAll()
            buffer.append([(check := time.time()), *adc_value[1:4]])
        nombre_archivo = str(int(check)) + ".finish"
        with open(nombre_archivo, 'w') as f:
            write = csv.writer(f)
            write.writerows(buffer)
        buffer = []

if __name__ == '__main__':
    adc_reading()
If you do need them to run in parallel (slow computer, other circumstances), you shouldn't use threads but processes from multiprocessing.
The two threads won't run in parallel; because of the GIL they will alternate. You could run the data collection in a separate process and collect data from it in the main process.
Here's an example of doing this with some toy code; I think it's easy to see how to adjust it for your case:
from multiprocessing import SimpleQueue, Process
from random import randint
from time import sleep, time

def generate_signals(q: SimpleQueue):
    c = 0
    while True:
        sleep(0.01)  # about 100 Hz
        q.put((c, randint(1, 42)))
        c += 1

def write_signals(q: SimpleQueue):
    delay = 3  # 3 seconds for demo, 60 works as well
    while True:
        start = time()
        while (check := time()) - start < delay:
            sleep(.1)
        values = []
        while not q.empty():
            values.append(str(q.get()))
        with open(f'{str(int(check))}.finish', 'w') as f:
            f.write('\n'.join(values))

if __name__ == "__main__":
    q = SimpleQueue()
    generator = Process(target=generate_signals, args=(q,))
    generator.start()
    writer = Process(target=write_signals, args=(q,))
    writer.start()
    writer.join(timeout=10)  # run for no more than 10 seconds, enough for demo
    writer.kill()
    generator.join(timeout=0)
    generator.kill()
Edit: added a counter, to show that no values are missed.

Related

Python multiprocessing write to file with starmap_async()

I'm currently setting up an automated simulation pipeline for OpenFOAM (a CFD library) using the PyFoam library within Python to create a large database for machine learning purposes. The database will have around 500k distinct simulations. To run this pipeline on multiple machines, I'm using the multiprocessing.Pool.starmap_async(args) option, which will continually start a new simulation once the old simulation has completed.
However, since some of the simulations might / will crash, I want to generate a text file with all cases which have crashed.
I've already found this thread which implements multiprocessing.Manager.Queue() and adds a listener, but I failed to get it running with starmap_async(). For my testing I'm trying to print the case name for any simulation which has completed, but currently only one entry is written into the text file instead of all of them (the simulations all complete successfully).
So my question is: how can I write a message to a file for each simulation which has completed?
The current code layout looks roughly like this - only the important snippet has been added, as the remaining code can't be run without OpenFOAM and additional custom scripts which were created for the automation.
Any help is highly appreciated! :)
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

# Defining global variables
manager = multiprocessing.Manager()
queue = manager.Queue()

def runCase(airfoil, angle, velocity):
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    '''
    A lot of pre-processing commands to prepare the simulation,
    which have been removed from the snippet, such as generate geometry, create mesh etc...
    '''
    # run simulation
    machine = LAMMachine(nr=4)  # set number of cores for parallel execution
    simulation = BasicRunner(argv=[solver, "-case", case.name], silent=True, lam=machine, logname="solver")
    simulation.start()  # start simulation
    # check if simulation has completed
    if simulation.runOK():
        # write message into queue
        queue.put(newCase)
    if not simulation.runOK():
        print("Simulation did not run successfully")

def listener(queue):
    fname = 'errors.txt'
    msg = queue.get()
    while True:
        with open(fname, 'w') as f:
            if msg == 'complete':
                break
            f.write(str(msg) + '\n')

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    paramlist = list(itertools.product(nacas, angles, np.round(machs, 9)))
    # create number of processes and keep 2 cores idle for other processes
    nCores = psutil.cpu_count(logical=False) - 2
    nProc = 4
    nProcs = int(nCores / nProc)
    with multiprocessing.Pool(processes=nProcs) as pool:
        pool.apply_async(listener, (queue,))  # start the listener
        pool.starmap_async(runCase, paramlist).get()  # run parallel simulations
        queue.put('complete')
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()
First, when your with multiprocessing.Pool(processes=nProcs) as pool: block exits, there will be an implicit call to pool.terminate(), which will kill all pool processes and with them any running or queued-up tasks. There is no point in calling queue.put('complete'), since nobody is listening any more.
Second, your "listener" task gets only a single message from the queue. If it is 'complete', it terminates immediately; if it is anything else, it just loops continuously, writing the same message to the output file over and over. This cannot be right, can it? Did you forget an additional call to queue.get() in your loop?
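For illustration only, a minimal corrected listener might look like the sketch below (same queue and 'complete' sentinel as in your code; the file is opened once and queue.get() is called on every iteration):
def listener(queue):
    # Sketch: drain the queue until the 'complete' sentinel arrives.
    fname = 'errors.txt'
    with open(fname, 'w') as f:
        while True:
            msg = queue.get()          # blocks until the next message
            if msg == 'complete':      # sentinel sent by the main process
                break
            f.write(str(msg) + '\n')
            f.flush()                  # make progress visible immediately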
Third, I do not quite follow your computation of nProcs. Why the division by 4? If you had 5 physical processors, nProcs would be computed as 0. Do you mean something like:
nProcs = psutil.cpu_count(logical=False) // 4
if nProcs == 0:
    nProcs = 1
elif nProcs > 1:
    nProcs -= 1  # Leave a core free
Fourth, why do you need a separate "listener" task at all? Have your runCase task return whatever message is appropriate, according to how it completes, back to the main process. In the code below, multiprocessing.pool.Pool.imap is used so that results can be processed as the tasks complete:
from PyFoam.Execution.BasicRunner import BasicRunner
from PyFoam.Execution.ParallelExecution import LAMMachine
import numpy as np
import multiprocessing
import itertools
import psutil

def runCase(tpl):
    # Unpack tuple:
    airfoil, angle, velocity = tpl
    # define simulation name
    newCase = str(airfoil) + "_" + str(angle) + "_" + str(velocity)
    ...  # Code omitted for brevity
    # check if simulation has completed
    if simulation.runOK():
        return ''  # No error
    # Simulation did not run successfully:
    return f"Simulation {newCase} did not run successfully"

def main():
    # Create parameter list
    angles = np.arange(-5, 0, 1)
    machs = np.array([0.15])
    nacas = ['0012']
    # There is no reason to convert this into a list; it
    # can be lazily computed:
    paramlist = itertools.product(nacas, angles, np.round(machs, 9))
    # create number of processes and keep 1 core idle for the main process
    nCores = psutil.cpu_count(logical=False) - 1
    nProc = 4
    nProcs = int(nCores / nProc)
    with multiprocessing.Pool(processes=nProcs) as pool:
        with open('errors.txt', 'w') as f:
            # Process each message as soon as its task ends.
            # Use method imap_unordered if you do not care about the order
            # of the messages in the output.
            # imap passes a single argument to runCase, so each parameter
            # combination is passed as one tuple:
            for msg in pool.imap(runCase, paramlist):
                if msg != '':  # Error completion
                    print(msg)
                    print(msg, file=f)
        pool.close()
        pool.join()  # Not really necessary here

if __name__ == '__main__':
    main()

Continuously reading and plotting a CSV file using Python

this is my first time asking something here, so I hope I am asking the following question the "correct way". If not, please let me know, and I will give more information.
I am using one Python script to read serial data at 4000 Hz and write it to a CSV file.
The structure of the CSV file is as follows: (this example shows the beginning of the file)
Time of mSure Calibration: 24.10.2020 20:03:14.462654
Calibration Data - AICC: 833.95; AICERT: 2109; AVCC: 0.00; AVCERT: 0
Sampling Frequency: 4000Hz
timestamp,instantaneousCurrentValue,instantaneousVoltageValue,activePowerValueCalculated,activePowerValue
24.10.2020 20:03:16.495828,-0.00032,7e-05,-0.0,0.0
24.10.2020 20:03:16.496078,0.001424,7e-05,0.0,0.0
24.10.2020 20:03:16.496328,9.6e-05,7e-05,0.0,0.0
24.10.2020 20:03:16.496578,-0.000912,7e-05,-0.0,0.0
Data will be written to this CSV as long as the script reading serial data is active. Thus, this might become a huge file at some time. (Data is written in chunks of 8000 rows = every two seconds)
Here is my problem: I want to plot this data live. For example, update the plot each time data is written to the CSV file. The plotting shall be done from another script than the script reading and writing the serial data.
What is working: 1. Creating the CSV file. 2. Plotting a finished CSV file using another script - actually pretty well :-)
I have this script for plotting:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Data Computation Software for TeensyDAQ - Reads and computes CSV-File"""

# region imports
import getopt
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pathlib
from scipy.signal import argrelextrema
import sys
# endregion

# region globals
inputfile = ''
outputfile = ''
# endregion

# region functions
def main(argv):
    """Main application"""
    # region define variables
    global inputfile
    global outputfile
    inputfile = str(pathlib.Path(__file__).parent.absolute(
    ).resolve())+"\\noFilenameProvided.csv"
    outputfile = str(pathlib.Path(__file__).parent.absolute(
    ).resolve())+"\\noFilenameProvidedOut.csv"
    # endregion

    # region read system arguments
    try:
        opts, args = getopt.getopt(
            argv, "hi:o:", ["infile=", "outfile="])
    except getopt.GetoptError:
        print('dataComputation.py -i <inputfile> -o <outputfile>')
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print('dataComputation.py -i <inputfile> -o <outputfile>')
            sys.exit()
        elif opt in ("-i", "--infile"):
            inputfile = str(pathlib.Path(
                __file__).parent.absolute().resolve())+"\\"+arg
        elif opt in ("-o", "--outfile"):
            outputfile = str(pathlib.Path(
                __file__).parent.absolute().resolve())+"\\"+arg
    # endregion

    # region read csv
    colTypes = {'timestamp': 'str',
                'instantaneousCurrent': 'float',
                'instantaneousVoltage': 'float',
                'activePowerCalculated': 'float',
                'activePower': 'float',
                'apparentPower': 'float',
                'fundReactivePower': 'float'
                }
    cols = list(colTypes.keys())
    df = pd.read_csv(inputfile, usecols=cols, dtype=colTypes,
                     parse_dates=True, dayfirst=True, skiprows=3)
    df['timestamp'] = pd.to_datetime(
        df['timestamp'], utc=True, format='%d.%m.%Y %H:%M:%S.%f')
    df.insert(loc=0, column='tick', value=np.arange(len(df)))
    # endregion

    # region plot data
    fig, axes = plt.subplots(nrows=6, ncols=1, sharex=True, figsize=(16, 8))
    fig.canvas.set_window_title(df['timestamp'].iloc[0])
    fig.align_ylabels(axes[0:5])
    df['instantaneousCurrent'].plot(ax=axes[0], color='red'); axes[0].set_title('Momentanstrom'); axes[0].set_ylabel('A', rotation=0)
    df['instantaneousVoltage'].plot(ax=axes[1], color='blue'); axes[1].set_title('Momentanspannung'); axes[1].set_ylabel('V', rotation=0)
    df['activePowerCalculated'].plot(ax=axes[2], color='green'); axes[2].set_title('Momentanleistung ungefiltert'); axes[2].set_ylabel('W', rotation=0)
    df['activePower'].plot(ax=axes[3], color='brown'); axes[3].set_title('Momentanleistung'); axes[3].set_ylabel('W', rotation=0)
    df['apparentPower'].plot(ax=axes[4], color='brown'); axes[4].set_title('Scheinleistung'); axes[4].set_ylabel('VA', rotation=0)
    df['fundReactivePower'].plot(ax=axes[5], color='brown'); axes[5].set_title('Blindleistung'); axes[5].set_ylabel('VAr', rotation=0); axes[5].set_xlabel('microseconds since start')
    plt.tight_layout()
    plt.show()
    # endregion
# endregion

if __name__ == "__main__":
    main(sys.argv[1:])
My thoughts on how to solve my problem:
Modify my plotting script to continuously read the CSV file and plot using matplotlib's animation function (see the sketch after this list).
Using some sort of streaming functionality to read the CSV in a stream. I have read about the streamz library, but I have no idea how I could use it.
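For option 1, a minimal sketch could look like the following; it assumes matplotlib's FuncAnimation and simply re-reads the whole file on every tick (fine for testing, but for very large files you would want to read only the tail). The file name is illustrative and the column names are taken from the header shown above:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import pandas as pd

CSV_FILE = 'noFilenameProvided.csv'  # illustrative path to the growing file

fig, ax = plt.subplots()

def update(frame):
    # The first 3 lines are metadata, the 4th line holds the column names.
    df = pd.read_csv(CSV_FILE, skiprows=3)
    ax.clear()
    ax.plot(df['activePowerValue'])
    ax.set_xlabel('sample')
    ax.set_ylabel('W')

ani = animation.FuncAnimation(fig, update, interval=2000)  # redraw every 2 s
plt.show()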
Any help is highly appreciated!
Kind regards,
Sascha
EDIT 31.10.2020:
Since I am not sure how long one usually waits for help here, I am adding more input, which may lead to helpful comments.
I wrote the following script, which emulates my real script without the need for external hardware: random data is produced and CSV-formatted using a timer, and each time there are 50 new rows, the data is written to the CSV file.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import csv
from random import randrange
import time
import threading
import pathlib
from datetime import datetime, timedelta

datarows = list()
datarowsToWrite = list()
outputfile = str(pathlib.Path(__file__).parent.absolute().resolve()) + "\\noFilenameProvided.csv"
sampleCount = 0

def startBatchWriteThread():
    global outputfile
    global datarows
    global datarowsToWrite
    datarowsToWrite.clear()
    datarowsToWrite = datarows[:]
    datarows.clear()
    thread = threading.Thread(target=batchWriteData, args=(outputfile, datarowsToWrite))
    thread.start()

def batchWriteData(file, data):
    print("Items to write: " + str(len(data)))
    with open(file, 'a+') as f:
        for item in data:
            f.write("%s\n" % item)

def generateDatarows():
    global sampleCount
    timer1 = threading.Timer(0.001, generateDatarows)
    timer1.daemon = True
    timer1.start()
    datarow = datetime.now().strftime("%d.%m.%Y %H:%M:%S.%f")[:] + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10))
    datarows.append(datarow)
    sampleCount += 1

try:
    datarows.append("row 1")
    datarows.append("row 2")
    datarows.append("row 3")
    datarows.append("timestamp,instantaneousCurrent,instantaneousVoltage,activePowerCalculated,activePower,apparentPower,fundReactivePower")
    startBatchWriteThread()
    generateDatarows()
    while True:
        if len(datarows) == 50:
            startBatchWriteThread()
except KeyboardInterrupt:
    print("Shutting down, writing the rest of the buffer.")
    batchWriteData(outputfile, datarows)
    print("Done, writing " + outputfile)
The script from my initial post can then plot the data from the CSV file.
I need to plot the data as it is written to the CSV file to see the data more or less live.
Hope this makes my problem more understandable.
For the Googlers: I could not find a way to achieve my goal exactly as described in the question.
However, if you are trying to plot live data coming in at high speed over serial comms (4000 Hz in my case), I recommend designing your application as a single program with multiple processes.
The problem in my special case was that when I tried to plot and compute the incoming data simultaneously in the same thread/task/process/whatever, my serial receive rate went down to 100 Hz instead of 4 kHz. With multiprocessing and passing data between the processes using the quick_queue module, I could resolve the problem.
I ended up with a program which receives data from a Teensy via serial communication at 4 kHz; the incoming data is buffered into blocks of 4000 samples, each block is pushed to the plotting process, and each block is additionally written to a CSV file in a separate thread.
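A minimal sketch of that structure, using the standard multiprocessing.Queue in place of the quick_queue module and random numbers in place of the serial reader (the names and the block size are only illustrative):
from multiprocessing import Process, Queue
from random import random
import threading
import csv
import time

BLOCK = 4000  # samples per block, as described above

def write_block(block):
    # Append the block to a CSV file in a separate thread.
    with open('capture.csv', 'a', newline='') as f:
        csv.writer(f).writerows(block)

def acquire(q: Queue):
    # Stand-in for the 4 kHz serial reader: buffer samples into blocks.
    buf = []
    while True:
        buf.append((time.time(), random()))   # replace with the serial read
        if len(buf) == BLOCK:
            q.put(buf)                        # hand the block to the plotter
            threading.Thread(target=write_block, args=(buf,)).start()
            buf = []

def plot(q: Queue):
    # Consume blocks; the real program would update a live plot here.
    while True:
        block = q.get()
        print('received block of', len(block), 'samples')

if __name__ == '__main__':
    q = Queue()
    Process(target=acquire, args=(q,), daemon=True).start()
    plot(q)  # run the consumer in the main process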
Best,
S

Is there any way to store data every 5 second in new text file?

I have created a program which reads sensor data. As per my use case, I have to store the readings in a text file every 5 seconds with new file names, for example file1, file2, file3 and so on, until I stop the program from the keyboard. Can anyone guide me to accomplish this task?
import sys
import time
import board
import digitalio
import busio
import csv
import adafruit_lis3dh
from datetime import datetime, timezone

i2c = busio.I2C(board.SCL, board.SDA)
int1 = digitalio.DigitalInOut(board.D6)  # Set this to the correct pin for the interrupt!
lis3dh = adafruit_lis3dh.LIS3DH_I2C(i2c, int1=int1)
lis3dh.range = adafruit_lis3dh.RANGE_2_G

with open("/media/pi/D427-7B2E/test.txt", 'w') as f:
    sys.stdout = f
    while True:
        ti = int(datetime.now(tz=timezone.utc).timestamp() * 1000)
        x, y, z = lis3dh.acceleration
        print('{}, {}, {}, {}'.format(ti, x / 9.806, y / 9.806, z / 9.806))
        time.sleep(0.001)
        pass
Can you try logging?
Have a look at TimedRotatingFileHandler; maybe this could help.
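A minimal sketch of that idea (not from the original answer; the file name is illustrative): logging.handlers.TimedRotatingFileHandler with when='S' and interval=5 rolls over to a new file every 5 seconds and renames the old one with a timestamp suffix.
import logging
from logging.handlers import TimedRotatingFileHandler

logger = logging.getLogger("SensorLog")
logger.setLevel(logging.INFO)
# when='S', interval=5 -> roll over to a new file every 5 seconds
handler = TimedRotatingFileHandler('readings.txt', when='S', interval=5)
logger.addHandler(handler)

# inside the reading loop, replace print(...) with:
# logger.info('{}, {}, {}, {}'.format(ti, x / 9.806, y / 9.806, z / 9.806))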
You can look into threading.Timer and call a function recursively at whatever interval you like, starting it at program start. In that function you can either update the file name in an array/dict so that it is reflected in the rest of the program, or write the contents of the buffer array into the file and empty the array:
import threading

def write_it():
    threading.Timer(300, write_it).start()  # re-arm the timer (use 5 for every 5 seconds)
    # rotate the file name here, or write the buffered readings and clear the buffer

write_it()
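Expanding that fragment into a runnable sketch (file names and the 5-second interval are illustrative): every 5 seconds the buffered readings are flushed to a new file1.txt, file2.txt, ..., while the main loop keeps appending to the buffer.
import threading

buffer = []
file_index = 0

def flush_buffer():
    global buffer, file_index
    threading.Timer(5, flush_buffer).start()   # schedule the next flush
    data, buffer = buffer, []                  # swap out the current buffer
    file_index += 1
    with open('file{}.txt'.format(file_index), 'w') as f:
        f.writelines(line + '\n' for line in data)

flush_buffer()
# the sensor reading loop then only does: buffer.append(reading_line)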
I solved it using RotatingFileHandler, which is a Python logging class. Our scope changed from 5 seconds to one hour, so I ran the program for 1 hour and got a file of about 135 MB; I converted that size to bytes and used it in the maxBytes argument. What happens now in this code is that once the file size reaches the number of bytes given in maxBytes, another file is created and values are stored in it, and the handler keeps doing the same for up to 5 files (backupCount). I hope this helps other people with a similar kind of use case.
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("Rotating Log")
logger.setLevel(logging.INFO)
handler = RotatingFileHandler('/home/pi/VibrationSensor/test.txt', maxBytes=20, backupCount=5)
logger.addHandler(handler)

while True:
    ti = int(datetime.now(tz=timezone.utc).timestamp() * 1000)
    x, y, z = lis3dh.acceleration
    logger.info('{}, {}, {}, {}'.format(ti, x / 9.806, y / 9.806, z / 9.806))
    time.sleep(0.001)
    pass

Fix jumping of multiple progress bars (tqdm) in python multiprocessing

I want to parallelize a task (progresser()) over a range of input parameters (L). The progress of each task should be monitored by an individual progress bar in the terminal. I'm using the tqdm package for the progress bars. The following code works on my Mac for up to 23 progress bars (L = list(range(23)) and below), but produces chaotic jumping of the progress bars starting at L = list(range(24)). Does anyone have an idea how to fix this?
from time import sleep
import random
from tqdm import tqdm
from multiprocessing import Pool, freeze_support, RLock

L = list(range(24))  # works until 23, breaks starting at 24

def progresser(n):
    text = f'#{n}'
    sampling_counts = 10
    with tqdm(total=sampling_counts, desc=text, position=n+1) as pbar:
        for i in range(sampling_counts):
            sleep(random.uniform(0, 1))
            pbar.update(1)

if __name__ == '__main__':
    freeze_support()
    p = Pool(processes=None,
             initargs=(RLock(),), initializer=tqdm.set_lock
             )
    p.map(progresser, L)
    print('\n' * (len(L) + 1))
As an example of how it should look in general, I provide a screenshot for L = list(range(16)) below.
versions: python==3.7.3, tqdm==4.32.1
I'm not getting any jumping when I set the size to 30; maybe you have more processors and can have more workers running.
However, if n grows large you will start to see jumps, because of how the chunksize is chosen.
That is, p.map will split your input into chunks and give each process a chunk. So as n grows larger, so does your chunksize, and so does your position (position=n+1)!
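For reference, a rough illustration of the default chunk size heuristic (this mirrors the divmod-based calculation in CPython's multiprocessing.pool; treat it as an illustration, not a public API):
# Default chunksize used by Pool.map when none is given (CPython heuristic).
def default_chunksize(n_items, n_workers):
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(24, 8))   # 24 tasks on 8 workers  -> 1
print(default_chunksize(240, 8))  # 240 tasks on 8 workers -> 8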
Note: although map preserves the order of the results returned, the order in which they are computed is arbitrary.
As n grows large, I would suggest using the process identity as the position, to view progress on a per-process basis.
from time import sleep
import random
from tqdm import tqdm
from multiprocessing import Pool, freeze_support, RLock
from multiprocessing import current_process

def progresser(n):
    text = f'#{n}'
    sampling_counts = 10
    current = current_process()
    pos = current._identity[0] - 1
    with tqdm(total=sampling_counts, desc=text, position=pos) as pbar:
        for i in range(sampling_counts):
            sleep(random.uniform(0, 1))
            pbar.update(1)

if __name__ == '__main__':
    freeze_support()
    L = list(range(30))  # works until 23, breaks starting at 24
    # p = Pool(processes=None,
    #          initargs=(RLock(),), initializer=tqdm.set_lock
    #          )
    with Pool(initializer=tqdm.set_lock, initargs=(tqdm.get_lock(),)) as p:
        p.map(progresser, L)
    print('\n' * (len(L) + 1))

Processing a huge amount of files in python

I have a huge number of report files (about 650 files), which take about 320 MB of hard disk, and I want to process them. There are a lot of entries in each file; I have to count and log them based on their content. Some of them are related to each other, and I have to find, log and count those too; matches may be in different files. I have written a simple script to do the job. I used the Python profiler and it took only about 0.3 seconds to run the script on one single file with 2000 lines, of which we need half for processing. But for the whole directory it took an hour and a half to finish. This is how my script looks:
# imports

class Parser(object):
    def __init__(self):
        # load some configurations
        # open some log files
        # set some initial values for some variables

    def parse_packet(self, tags):
        # extract some values from line

    def found_matched(self, packet):
        # search in the related list to find matched line

    def save_packet(self, packet):
        # write the line in the appropriate files and increase or decrease some counters

    def parse(self, file_addr):
        lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
        for line in lines:
            packet = parse_packet(line)
            if found_matched(packet):
                # count
                self.save_packet(packet)

    def process_files(self):
        if not os.path.isdir(self.src_dir):
            self.log('No such file or directory: ' + str(self.src_dir))
            sys.exit(1)
        input_dirs = os.walk(self.src_dir)
        for dname in input_dirs:
            file_list = dname[2]
            for fname in file_list:
                self.parse(os.path.join(dname[0], fname))
        self.finalize_process()

    def finalize_process(self):
        # closing files
I want to decrease the execution time to at most 10% of what it is now. Maybe multiprocessing can help me, or maybe just some enhancement to the current script will do the task. In any case, could you please help me with this?
Edit 1:
I have changed my code according to @Reut Sharabani's answer:
def parse(self, file_addr):
    lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
    for line in lines:
        packet = parse_packet(line)
        if found_matched(packet):
            # count
            self.save_packet(packet)

def process_files(self):
    if not os.path.isdir(self.src_dir):
        self.log('No such file or directory: ' + str(self.src_dir))
        sys.exit(1)
    input_dirs = os.walk(self.src_dir)
    for dname in input_dirs:
        process_pool = multiprocessing.Pool(10)
        for fname in file_list:
            file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
            process_pool.map(self.parse, file_list)
    self.finalize_process()
I also added the lines below before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed:
import copy_reg
import types

def _pickle_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)

copy_reg.pickle(types.MethodType, _pickle_method)
Another change I made in my code was not to keep the log files open during file processing; I open and close them for writing each entry, just to avoid ValueError: I/O operation on closed file.
Now the problem is that some files are being processed multiple times, and I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10) before the for loop? Consider that I have just one directory right now, so that doesn't seem to be the problem.
EDIT 2:
I also tried using ThreadPoolExecutor this way:
with ThreadPoolExecutor(max_workers=10) as executor:
    for fname in file_list:
        executor.submit(self.parse, fname)
Results were correct, but it took an hour and a half to be completed.
First of all, "about 650 files which take about 320 MB" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system is probably not your bottleneck (also supported by "it took only about 0.3 seconds to run the script on one single file with 2000 lines", which clearly indicates CPU limitation). However, the exact way you are reading files from within Python may not be efficient.
Furthermore, a simple multiprocessing-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).
multiprocessing architecture
Just have a look at multiprocessing, your architecture likely will involve one manager process (the parent), which defines a task Queue, and a Pool of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.
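A minimal sketch of that layout, with made-up function names, using multiprocessing.Pool (which wraps the queue-and-workers machinery for you):
# Sketch only: a parent process feeding file names to a pool of workers.
# parse_one_file stands in for the real per-file counting/logging logic.
import multiprocessing
import os

def parse_one_file(path):
    matched = 0
    with open(path, 'r') as f:
        for index, line in enumerate(f):
            if index % 2 != 0:       # same every-other-line rule as the question
                matched += 1         # replace with the real packet matching
    return path, matched

if __name__ == '__main__':
    src_dir = 'reports'              # illustrative directory name
    files = [os.path.join(d, name)
             for d, _, names in os.walk(src_dir) for name in names]
    with multiprocessing.Pool() as pool:  # one worker per CPU core by default
        for path, matched in pool.imap_unordered(parse_one_file, files):
            print(path, matched)     # aggregate results in the parent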
file processing method
This is quite inefficient:
lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
for line in lines:
    ...
readlines() reads the entire file before the list comprehension is evaluated, and only after that do you iterate through all the lines again. Hence, you iterate through your data three times. Combine everything into a single loop, so that you iterate over the lines only once, as in the sketch below.
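A single-pass version of the loop above might look like this (keeping the question's every-other-line rule and calling the methods through self, which the original snippet omitted):
def parse(self, file_addr):
    # Enumerate the open file object directly instead of materialising
    # the result of readlines() in a list first.
    with open(file_addr, 'r') as f:
        for index, line in enumerate(f):
            if index % 2 != 0:       # keep only every second line, as before
                packet = self.parse_packet(line)
                if self.found_matched(packet):
                    self.save_packet(packet)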
You should be using threads here. If you're blocked by CPU later, you can use processes.
To explain, I first created a thousand files (0.txt ... 999.txt), each with a line count equivalent to its name (+1), using this command:
for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done
Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
import time
import sys

print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))

files = ["%d.txt" % i for i in range(1000)]

def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f.readlines():
            if int(line.strip()) % 2 == 0:
                line_count += 1
        print "finished file %s" % filename
        return line_count

start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
As you can see this takes no time, and the results are correct. 10 files are processed in parallel, and the CPU is fast enough to handle the results. If you want even more, you may consider using threads and processes together to utilize all CPUs as well as not letting I/O block you.
Edit:
As the comments suggest, I was wrong: this is not I/O-bound but CPU-bound, so you can speed it up using multiprocessing. Because I used a ThreadPool, which has the same interface as Pool, you can make minimal edits and have the same code running:
#!/usr/bin/env python
import multiprocessing
import time
import sys

files = ["%d.txt" % i for i in range(2000)]

# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f:
            if int(line.strip()) % 2 == 0:
                line_count += 1
    return line_count

print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))

start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total
Results:
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043
Aside from using parallel processing, your parse method is rather inefficient, as @Jan-Philip Gehrcke already pointed out. To expand on his recommendation, the classical variant:
def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        line_no = 0
        for line in f:
            line_no += 1
            if line_no % 2 == 0:  # even 1-based line numbers match the odd 0-based indices used above
                packet = parse_packet(line)
                if found_matched(packet):
                    # count
                    self.save_packet(packet)
Or, using your style (assuming you use Python 3):
def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        filtered = (l for index, l in enumerate(f) if index % 2 != 0)
        for line in filtered:
            # and so on
The thing to notice here is the use of iterators: all operations that build filtered (which is not actually a list!) operate on and return iterators, which means that at no point is the entire file loaded into a list.
