Python: how to read from and write to different files using multiprocessing

I have several files and I would like to read them, filter for some keywords, and write the results into different files. I use Process(), but it turns out the readwrite function takes even more time that way.
Do I need to separate the read and write into two functions? How can I read multiple files at one time and write the keywords from different files into different CSVs?
Thank you very much.
def readwritevalue():
    for file in gettxtpath():  ##gettxtpath will return a list of files
        file1 = file + ".csv"
        ##Identify some variables
        ##Read the file
        with open(file) as fp:
            for line in fp:
                #Process the data
                data1 = xxx
                data2 = xxx
                ....
        ##Write it to different files
        with open(file1, "w") as fp1:
            print(data1, file=fp1)
            w = csv.writer(fp1)
            w.writerow(data2)
            ...

if __name__ == '__main__':
    p = Process(target=readwritevalue)
    t1 = time.time()
    p.start()
    p.join()
Edit: I have more functions that modify the CSVs generated by the readwritevalue() function.
If Pool.map() is fine, will it also be OK to change all the remaining functions like this? It does not seem to save much time, though.
def getFormated(file):  ##Merge each csv with a well-defined format csv and write all the csvs to one final report csv
    csvMerge('Format.csv', file, file1)
    getResult()

if __name__ == "__main__":
    pool = Pool(2)
    pool.map(readwritevalue, [file for file in gettxtpath()])
    pool.map(getFormated, [file for file in getcsvName()])
    pool.map(Otherfunction, file_list)
    t1 = time.time()
    pool.close()
    pool.join()

You can extract the body of the for loop into its own function, create a multiprocessing.Pool object, then call pool.map() like so (I’ve used more descriptive names):
import csv
import multiprocessing

def read_and_write_single_file(stem):
    data = None
    with open(stem, "r") as f:
        # populate data somehow
        pass
    csv_file = stem + ".csv"
    with open(csv_file, "w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        for row in data:
            w.writerow(row)

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    result = pool.map(read_and_write_single_file, get_list_of_files())
See the linked documentation for how to control the number of workers, tasks per worker, etc.
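For instance, a minimal sketch of tuning the pool, reusing the names from the snippet above (the process count and chunksize values are illustrative starting points, not recommendations):
import multiprocessing

if __name__ == "__main__":
    # Cap the number of worker processes; the CPU count is a common starting point.
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    # chunksize sends several files per task, reducing inter-process overhead
    # when the file list is long. Tune it for your workload.
    results = pool.map(read_and_write_single_file, get_list_of_files(), chunksize=10)
    pool.close()
    pool.join()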

I may have found an answer myself. I'm not sure whether it is actually a good answer, but the run time is 6 times shorter than before.
def readwritevalue(file):
    with open(file, 'r', encoding='UTF-8') as fp:
        ##dataprocess
    file1 = file + ".csv"
    with open(file1, "w") as fp2:
        ##write data

if __name__ == "__main__":
    pool = Pool(processes=int(mp.cpu_count() * 0.7))
    pool.map(readwritevalue, [file for file in gettxtpath()])
    t1 = time.time()
    pool.close()
    pool.join()

Related

How to optimize the below code to read very large multiple file?

I have a folder containing about 5 million files and I have to read the content of each file so that I can build a dataframe. It takes a very long time. Is there any way I can optimize the code below to speed up the process?
new_list = []
file_name = []
count = 0
for root, dirs, files in os.walk('Folder_5M'):
    for file in files:
        count += 1
        file_name.append(file)
        with open(os.path.join(root, file), 'rb') as f:
            text = f.read()
            new_list.append(text)
This is an IO-bound task, so multithreading is the tool for the job. In Python there are two common ways to do this: a thread pool, or asyncio with an event loop. The event loop usually has better performance, but the challenge is to limit the number of tasks executing at the same time. Fortunately, Andrei wrote a very good solution for this.
The code below creates an event loop that reads the files concurrently. The parameter MAX_NUMBER_OF_THREADS defines how many tasks can execute at the same time. Play with this number for better performance, as the optimal value depends on the machine that runs it.
import os
import asyncio

async def read_file(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))

MAX_NUMBER_OF_THREADS = 100

file_name = []
file_path = []
for path, subdirs, files in os.walk("Folder_5M"):
    for name in files:
        file_path.append(os.path.join(path, name))
        file_name.append(name)

count = len(file_name)
tasks = [read_file(file) for file in file_path]
asyncio.run(gather_with_concurrency(MAX_NUMBER_OF_THREADS, *tasks))
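For completeness, here is a minimal sketch of the thread-pool alternative mentioned above, using concurrent.futures.ThreadPoolExecutor. The worker count and the errors="ignore" option are illustrative assumptions, not part of the original answer:
import os
from concurrent.futures import ThreadPoolExecutor

def read_file(file_path):
    # Read one file; errors="ignore" skips undecodable bytes (an assumption).
    with open(file_path, "r", errors="ignore") as f:
        return f.read()

file_paths = []
for root, _, files in os.walk("Folder_5M"):
    for name in files:
        file_paths.append(os.path.join(root, name))

# executor.map() preserves input order; max_workers is a tuning knob, not a fixed rule.
with ThreadPoolExecutor(max_workers=32) as executor:
    contents = list(executor.map(read_file, file_paths))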
Here's an idea for how you could use multiprocessing for this.
Constructing a list of files resulting from os.walk is likely to be very fast. It's the processing of those files that's going to take time. With multiprocessing you can do a lot of that work in parallel.
Each process opens the given file, processes it, and creates a dataframe. When all of the parallel processing has been carried out, you then concatenate the returned dataframes. This last part will be CPU intensive, and there's no way (that I can think of) to share that load.
from pandas import DataFrame, concat
from os import walk
from os.path import join, expanduser
from multiprocessing import Pool

HOME = expanduser('~')

def process(filename):
    try:
        with open(filename) as data:
            df = DataFrame()
            # analyse your data and populate the dataframe here
            return df
    except Exception:
        return DataFrame()

def main():
    with Pool() as pool:
        filenames = []
        for root, _, files in walk(join(HOME, 'Desktop')):
            for file in files:
                filenames.append(join(root, file))
        ar = pool.map_async(process, filenames)
        master = concat(ar.get())
        print(master)

if __name__ == '__main__':
    main()
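With millions of small files, batching tasks can cut inter-process overhead. A minimal variation of the above (the chunksize value and the collect_filenames helper are illustrative assumptions):
from multiprocessing import Pool
from pandas import concat

def main():
    filenames = collect_filenames()  # hypothetical helper: same os.walk loop as above
    with Pool() as pool:
        # imap_unordered yields dataframes as workers finish; chunksize sends
        # many filenames per task to cut scheduling and IPC overhead.
        frames = pool.imap_unordered(process, filenames, chunksize=1000)
        master = concat(frames)
    print(master)

if __name__ == '__main__':
    main()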

How to write rows in CSV file DYNAMICALLY in python?

I want to create a CSV file and write data to it dynamically. My script has to keep running 24/7, and a new CSV file has to be created and written every 24 hours. Right now all files are only written when the program ends.
with open(file_name, 'r+') as f:
    myDataList = f.readlines()
    nameList = []
    for line in myDataList:
        entry = line.split(',')
        nameList.append(entry[0])
    if name not in nameList:
        now = datetime.datetime.now()
        dtString = now.strftime('%H:%M:%S')
        writer = csv.writer(f)
        writer.writerow([name, dtString])
Thanks in advance
Remove the with context manager and open the file the plain way, then keep calling flush() and fsync() on it as shown below. That ensures the data is actually written to the file on disk.
import os

f = open(FILENAME, MODE)
f.write(data)
f.write(data)
f.flush()             # important part: push Python's buffer to the OS
os.fsync(f.fileno())  # important part: ask the OS to write it to disk
For more info: see this link
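A minimal sketch of how this could look in a long-running loop. The daily file naming, the get_next_record() helper, and the one-second pacing are illustrative assumptions, not part of the original answer:
import csv
import datetime
import os
import time

while True:
    # One file per day, named by date (an assumption about the naming scheme).
    file_name = datetime.date.today().strftime('%Y-%m-%d') + '.csv'
    f = open(file_name, 'a', newline='')
    writer = csv.writer(f)
    day = datetime.date.today()
    while datetime.date.today() == day:
        name, dtString = get_next_record()  # hypothetical source of data
        writer.writerow([name, dtString])
        f.flush()             # push the row to the OS so the file is never empty
        os.fsync(f.fileno())  # force it onto disk
        time.sleep(1)
    f.close()  # roll over to a new file the next day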

Loading multiple files with bonobo-etl

I'm new to bonobo-etl and I'm trying to write a job that loads multiple files at once, but I can't get the CsvReader to work with the @use_context_processor decorator. A snippet of my code:
def input_file(self, context):
    yield 'test1.csv'
    yield 'test2.csv'
    yield 'test3.csv'

@use_context_processor(input_file)
def extract(f):
    return bonobo.CsvReader(path=f, delimiter='|')

def load(*args):
    print(*args)

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(extract, load)
    return graph
When I run the job I get something like <bonobo.nodes.io.csv.CsvReader object at 0x7f849678dc88> rather than the lines of the CSV.
If I hardcode the reader like graph.add_chain(bonobo.CsvReader(path='test1.csv',delimiter='|'),load), it works.
Any help would be appreciated.
Thank you.
As bonobo.CsvReader does not (yet) support reading file names from the input stream, you need to use a custom reader for that.
Here is a solution that works for me on a set of csvs I have:
import bonobo
import bonobo.config
import bonobo.util
import glob
import csv

@bonobo.config.use_context
def read_multi_csv(context, name):
    with open(name) as f:
        reader = csv.reader(f, delimiter=';')
        headers = next(reader)

        if not context.output_type:
            context.set_output_fields(headers)

        for row in reader:
            yield tuple(row)

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        glob.glob('prenoms_*.csv'),
        read_multi_csv,
        bonobo.PrettyPrinter(),
    )
    return graph

if __name__ == '__main__':
    with bonobo.parse_args() as options:
        bonobo.run(get_graph(**options))
A few comments on this snippet, in reading order:
The use_context decorator injects the node execution context into the transformation call, which allows calling .set_output_fields(...) with the first CSV's headers.
The headers of the other CSVs are ignored; in my case they are all the same. You may need slightly more complex logic for your own case.
Then we simply generate the filenames in a bonobo.Graph instance using glob.glob (in my case, the stream will contain: prenoms_2004.csv prenoms_2005.csv ... prenoms_2011.csv prenoms_2012.csv) and pass them to our custom reader, which will be called once for each file, open it, and yield its lines.
Hope that helps!

writer.writerow not work for writing multiple CSV in for loop

Please look at the pseudocode below:
def main():
    queries = ['A', 'B', 'C']
    for query in queries:
        filename = query + '.csv'
        writer = csv.writer(open(filename, 'wt', encoding='utf-8'))
        ...
        FUNCTION(query)

def FUNCTION(query):
    ...
    writer.writerow(XXX)
I'd like to write to multiple CSV files, so I use a for loop to generate the different file names and then write into each file in another function.
However, this is not working; the files end up empty.
If I get rid of main() or drop the for loop:
writer = csv.writer(open(filename, 'wt', encoding='utf-8'))
...
FUNCTION(query)

def FUNCTION(query):
    ...
    writer.writerow(XXX)
it works.
I don't know why. Is it related to the for loop or to main()?
A simple fix is to pass the writer, not the file name, to FUNCTION. Since the file has already been opened in main, you don't need or want the name in the subroutine, just the writer, so change the call to FUNCTION(writer) and the definition to
def FUNCTION(writer):
and use writer.writerow(xxx) wherever you need to stream output in the subroutine.
Note: you changed the name of the file pointer from writer to write in your example.
I think the likely reason is that you didn't close the file. You can use a context manager like:
with open(filename, 'wt', encoding='utf-8') as f:
    writer = csv.writer(f)
    ...
    FUNCTION(query)
which will close the file automatically.
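Combining both suggestions, a minimal sketch of how the fixed code could look (the row contents are placeholders, not from the original question):
import csv

def FUNCTION(writer, query):
    # Receive the csv.writer created in main and write through it.
    writer.writerow([query, 'some value'])  # placeholder row

def main():
    queries = ['A', 'B', 'C']
    for query in queries:
        filename = query + '.csv'
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            FUNCTION(writer, query)

if __name__ == '__main__':
    main()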

pass the contents of a file as a parameter in python

I am trying to download files based on their IDs. How can I download the files if I have their IDs stored in a text file? Here's what I've done so far:
import urllib2
#code to read a file comes here

uniprot_url = "http://www.uniprot.org/uniprot/"  # constant Uniprot Namespace

def get_fasta(id):
    url_with_id = "%s%s%s" % (uniprot_url, id, ".fasta")
    file_from_uniprot = urllib2.urlopen(url_with_id)
    data = file_from_uniprot.read()
    get_only_sequence = data.replace('\n', '').split('SV=')[1]
    length_of_sequence = len(get_only_sequence[1:len(get_only_sequence)])
    file_output_name = "%s%s%s%s" % (id, "_", length_of_sequence, ".fasta")
    with open(file_output_name, "wb") as fasta_file:
        fasta_file.write(data)
    print "completed"

def main():
    # or read from a text file
    input_file = open("positive_copy.txt").readlines()
    get_fasta(input_file)

if __name__ == '__main__':
    main()
.readlines() returns a list of the lines in the file, so you are passing the whole list to get_fasta() instead of a single ID.
According to the official documentation, you can also do it differently:
For reading lines from a file, you can loop over the file object. This is memory efficient, fast, and leads to simple code.
So your code can be rewritten this way:
with open("positive_copy.txt") as f:
    for id in f:
        get_fasta(id.strip())
You can read more about the with keyword on the PEP 343 page.
