I have a folder containing about 5 million files and I have to read the content of each file so that I can form a dataframe. It takes a very long time to do that. Is there any way I can optimize the code below to speed up the process?
import os

new_list = []
file_name = []
count = 0
for root, dirs, files in os.walk('Folder_5M'):
    for file in files:
        count += 1
        file_name.append(file)
        with open(os.path.join(root, file), 'rb') as f:
            text = f.read()
            new_list.append(text)
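Once the loop finishes, the two lists can be turned into a dataframe in a single step; a minimal sketch (the column names are just illustrative):

import pandas as pd

# One row per file: its name and its raw bytes
df = pd.DataFrame({"file_name": file_name, "content": new_list})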
This is an IO-bound task, so multi-threading is the tool for the job. In Python there are two ways to implement it: one is a thread pool, and the other is asyncio, which works with an event loop. The event loop usually has better performance; the challenge is to limit the number of threads executing at the same time. Fortunately, Andrei wrote a very good solution for this.
This code creates an event loop that reads the files in several threads. The parameter MAX_NUMBER_OF_THREADS defines how many reads can execute at the same time. Try playing with this number for better performance, as the best value depends on the machine that runs it.
import os
import asyncio

async def read_file(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))
MAX_NUMBER_OF_THREADS = 100

file_name = []
file_path = []
for path, subdirs, files in os.walk("Folder_5M"):
    for name in files:
        file_path.append(os.path.join(path, name))
        file_name.append(name)

count = len(file_name)
tasks = [read_file(file) for file in file_path]
contents = asyncio.run(gather_with_concurrency(MAX_NUMBER_OF_THREADS, *tasks))
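One caveat with the coroutine above: open() and f.read() are blocking calls, so as written each read still runs on the event loop thread. A sketch of how the blocking read could be pushed onto a worker thread (assuming Python 3.9+ for asyncio.to_thread):

async def read_file(file_path: str) -> str:
    def _read() -> str:
        # plain blocking read, executed in a worker thread
        with open(file_path, "r") as f:
            return f.read()
    return await asyncio.to_thread(_read)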
Here's an idea for how you could use multiprocessing for this.
Constructing a list of files resulting from os.walk is likely to be very fast. It's the processing of those files that's going to take time. With multiprocessing you can do a lot of that work in parallel.
Each process opens the given file, processes it and creates a dataframe. When all of the parallel processing has been carried out you then concatenate the returned dataframes. This last part will be CPU intensive and there's no way (that I can think of) that would allow you to share that load.
from pandas import DataFrame, concat
from os import walk
from os.path import join, expanduser
from multiprocessing import Pool

HOME = expanduser('~')

def process(filename):
    try:
        with open(filename) as data:
            df = DataFrame()
            # analyse your data and populate the dataframe here
            return df
    except Exception:
        return DataFrame()

def main():
    with Pool() as pool:
        filenames = []
        for root, _, files in walk(join(HOME, 'Desktop')):
            for file in files:
                filenames.append(join(root, file))
        ar = pool.map_async(process, filenames)
        master = concat(ar.get())
        print(master)

if __name__ == '__main__':
    main()
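What goes inside process() depends entirely on the file format; as a purely hypothetical illustration (one row per file, holding its name and raw text), it might look like this:

def process(filename):
    try:
        with open(filename) as data:
            text = data.read()
        # hypothetical: a one-row dataframe per file
        return DataFrame({'file': [filename], 'content': [text]})
    except Exception:
        return DataFrame()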
I am trying to download URLs in parallel with the following:
def parallel_download_files(self, urls, filenames):
    pids = []
    for (url, filename) in zip(urls, filenames):
        pid = os.fork()
        if pid == 0:
            open(filename, 'wb').write(requests.get(url).content)
        else:
            pids.append(pid)
    for pid in pids:
        os.waitpid(pid, os.WNOHANG)
But when executing with a list of urls and filenames, the system's memory keeps building up until the computer crashes. From the documentation, I thought waitpid would be handled correctly if I set the options to os.WNOHANG. This is the first time I am trying to parallelize with forks; I have been doing such tasks with concurrent.futures.ThreadPoolExecutor before.
Using os.fork() is far from ideal, especially as you're not handling the two processes that are being created (parent/child). Multithreading is far superior for this use case.
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from requests import get as GET

def parallel_download_files(urls, filenames):
    def _process(t):
        url, filename = t
        try:
            (r := GET(url)).raise_for_status()
            with open(filename, 'wb') as output:
                output.write(r.content)
        except Exception as e:
            print('Failed: ', url, filename, e)
    with TPE() as executor:
        executor.map(_process, zip(urls, filenames))

urls = ['https://www.bbc.co.uk', 'https://news.bbc.co.uk']
filenames = ['www.txt', 'news.txt']

parallel_download_files(urls, filenames)
Note: if any filenames are duplicated in the filenames list, you'll need a more complex strategy that ensures you never have more than one thread writing to the same file; one possible approach is sketched below.
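A minimal sketch of that idea (not part of the original answer; the function name and the max_workers cap are arbitrary), reusing GET and TPE from the example above: create one Lock per distinct target path up front and hold it while writing.

from threading import Lock

def parallel_download_files_dedup(urls, filenames):
    # one lock per distinct target path, created before any thread starts
    locks = {fn: Lock() for fn in set(filenames)}
    def _process(t):
        url, filename = t
        try:
            (r := GET(url)).raise_for_status()
            with locks[filename]:  # serialize writes to the same file
                with open(filename, 'wb') as output:
                    output.write(r.content)
        except Exception as e:
            print('Failed: ', url, filename, e)
    with TPE(max_workers=20) as executor:
        executor.map(_process, zip(urls, filenames))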
What I have is a text file containing all the items that need to be deleted from an online app. Every item that needs to be deleted has to be sent one at a time. To make the deletion process faster, I divide the items in the text file into multiple text files and run the script in multiple terminals (~130 of them for the deletion time to be under 30 minutes for ~7000 items).
This is the code of the deletion script:
from fileinput import filename
from WitApiClient import WitApiClient
import os

dirname = os.path.dirname(__file__)
parent_dirname = os.path.dirname(dirname)
token = input("Enter the token")
file_name = os.path.join(parent_dirname, 'data/deletion_pair.txt')

with open(file_name, encoding="utf-8") as file:
    templates = [line.strip() for line in file.readlines()]

for template in templates:
    entity, keyword = template.split(", ")
    print(entity, keyword)
    resp = WitApiClient(token).delete_keyword(entity, keyword)
    print(resp)
So I divide the items in deletion_pair.txt and run this script multiple times in new terminals (~130 terminals). Is there a way to automate this process or do it in a more efficient manner?
I used threading to run multiple functions simultaneously:
from fileinput import filename
from WitApiClient import WitApiClient
import os
from threading import Thread

dirname = os.path.dirname(__file__)
parent_dirname = os.path.dirname(dirname)
token = input("Enter the token")
file_name = os.path.join(parent_dirname, 'data/deletion_pair.txt')

with open(file_name, encoding="utf-8") as file:
    templates = [line.strip() for line in file.readlines()]

batch_size = 20
chunks = [templates[i: i + batch_size] for i in range(0, len(templates), batch_size)]

def delete_function(templates, token):
    for template in templates:
        entity, keyword = template.split(", ")
        print(entity, keyword)
        resp = WitApiClient(token).delete_keyword(entity, keyword)
        print(resp)

for chunk in chunks:
    thread = Thread(target=delete_function, args=(chunk, token))
    thread.start()
It worked! If anyone has another solution, please post it, or if the same code can be written more efficiently, please do tell. Thanks.
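For reference, a sketch of the same idea using concurrent.futures.ThreadPoolExecutor (not from the original answer; max_workers is an arbitrary cap): the executor creates, limits, and joins the threads, so the manual chunking and Thread objects are no longer needed.

from concurrent.futures import ThreadPoolExecutor

def delete_one(template):
    entity, keyword = template.split(", ")
    resp = WitApiClient(token).delete_keyword(entity, keyword)
    print(entity, keyword, resp)

with ThreadPoolExecutor(max_workers=20) as executor:
    # list() forces all deletions to complete and surfaces any exceptions
    list(executor.map(delete_one, templates))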
I have a huge number of JSON files (4000) and I need to check every single one of them for a specific object. My code is like the following:
import os
import json

files = sorted(os.listdir("my files path"))
for f in files:
    if f.endswith(".json"):
        myFile = open("my path\\" + f)
        myJson = json.load(myFile)
        if myJson["something"]["something"]["what im looking for"] == "ACTION":
            pass  # do stuff
        myFile.close()
As you can imagine this is taking a lot of execution time and I was wondering if there is a quicker way...?
Here's a multithreaded approach that may help you:
from glob import glob
import json
from concurrent.futures import ThreadPoolExecutor
import os

BASEDIR = 'myDirectory'  # the directory containing the json files

def process(filename):
    with open(filename) as infile:
        data = json.load(infile)
        if data.get('foo', '') == 'ACTION':
            pass  # do stuff

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(process, glob(os.path.join(BASEDIR, '*.json')))

if __name__ == '__main__':
    main()
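The data.get('foo', '') line is a placeholder; for the nested keys in the question, a lookup that tolerates missing levels might look like this (the key names are the asker's own):

if data.get("something", {}).get("something", {}).get("what im looking for") == "ACTION":
    pass  # do stuff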
I have several files and I would like to read those files, filter some keywords, and write them into different files. I use Process() and it turns out that the readwrite function takes more time to process.
Do I need to separate the read and the write into two functions? How can I read multiple files at one time and write the keywords from different files to different csv files?
Thank you very much.
import csv
import time
from multiprocessing import Process

def readwritevalue():
    for file in gettxtpath():  # gettxtpath will return a list of files
        file1 = file + ".csv"
        # Identify some variables
        # Read the file
        with open(file) as fp:
            for line in fp:
                # Process the data
                data1 = xxx
                data2 = xxx
                # ...
        # Write it to different files
        with open(file1, "w") as fp1:
            print(data1, file=fp1)
            w = csv.writer(fp1)
            w.writerow(data2)
            # ...

if __name__ == '__main__':
    p = Process(target=readwritevalue)
    t1 = time.time()
    p.start()
    p.join()
I want to edit my question: I have more functions that modify the csv files generated by the readwritevalue() function.
So, if Pool.map() is fine, will it be OK to change all the remaining functions like this? It seems that it did not save much time, though.
def getFormated(file):
    # Merge each csv with a well-defined formatted csv and generate a final
    # report, writing all the csv files to one output csv
    csvMerge('Format.csv', file, file1)
    getResult()

if __name__ == "__main__":
    pool = Pool(2)
    pool.map(readwritevalue, [file for file in gettxtpath()])
    pool.map(getFormated, [file for file in getcsvName()])
    pool.map(Otherfunction, file_list)
    t1 = time.time()
    pool.close()
    pool.join()
You can extract the body of the for loop into its own function, create a multiprocessing.Pool object, then call pool.map() like so (I’ve used more descriptive names):
import csv
import multiprocessing

def read_and_write_single_file(stem):
    data = None
    with open(stem, "r") as f:
        pass  # populate data somehow
    csv_file = stem + ".csv"
    with open(csv_file, "w", encoding="utf-8") as f:
        w = csv.writer(f)
        for row in data:
            w.writerow(row)

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    result = pool.map(read_and_write_single_file, get_list_of_files())
See the linked documentation for how to control the number of workers, tasks per worker, etc.
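For instance (a sketch; the numbers are arbitrary), the worker count and the chunk of work handed to each worker can both be set explicitly:

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)  # limit to 4 worker processes
    # hand each worker 16 files at a time to reduce task-dispatch overhead
    result = pool.map(read_and_write_single_file, get_list_of_files(), chunksize=16)
    pool.close()
    pool.join()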
I may have found an answer myself. Not so sure if it is indeed a good answer, but the time is 6 times shorter than before.
import time
import multiprocessing as mp
from multiprocessing import Pool

def readwritevalue(file):
    with open(file, 'r', encoding='UTF-8') as fp:
        # dataprocess
        file1 = file + ".csv"
        with open(file1, "w") as fp2:
            pass  # write data

if __name__ == "__main__":
    pool = Pool(processes=int(mp.cpu_count() * 0.7))
    pool.map(readwritevalue, [file for file in gettxtpath()])
    t1 = time.time()
    pool.close()
    pool.join()
I am trying to convert code written in Matlab into Python.
I'm trying to read a .dat file (it's a csv file). The file has about 30 columns and thousands of rows containing (only!) decimal number data (in Matlab it was read into a double matrix).
I'm asking for the fastest way to read the .dat file and the most similar object/array/... to store the data in.
I tried to read the file in both of the following ways:
my_data1 = numpy.genfromtxt('FileName.dat', delimiter=',' )
my_data2 = pd.read_csv('FileName.dat',delimiter=',')
Is there any better option?
pd.read_csv is pretty efficient as it is. To make it faster, you can try to use multiple cores to load your data in parallel. Here is a code example where I used joblib when I needed to make data loading with pd.read_csv, and the processing of that data, faster.
from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
from datetime import datetime
# Multi-threading
from joblib import Parallel, delayed
import multiprocessing
# Garbage collector
import gc

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of the data files
DATA_PATH = 'D:\\'
# Path to save the processed files
TARGET_PATH = 'C:\\'

def read_and_convert(f, num_files):
    #global i
    # Read the file
    dataframe = pd.read_csv(DATA_PATH + f, low_memory=False, header=None,
                            names=['Symbol', 'Date_Time', 'Bid', 'Ask'],
                            index_col=1, parse_dates=True)
    # Process the data
    data_ask_bid = process_data(dataframe)
    # Store processed data in target folder
    data_ask_bid.to_csv(TARGET_PATH + f)
    print(f)
    # Garbage collector. I needed to use this, otherwise my memory would get
    # full after a few files, but you might not need it.
    gc.collect()

def main():
    # Counter for converted files
    global i
    i = 0
    start_time = time.time()
    # Get the paths for all the data files
    files_names = [f for f in listdir(DATA_PATH) if isfile(join(DATA_PATH, f))]
    # Load and process files in parallel
    Parallel(n_jobs=TOTAL_NUM_CORES)(delayed(read_and_convert)(f, len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f, len(files_names))  # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()
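Stripped of the application-specific parts, the joblib pattern the example relies on is just this (a sketch with hypothetical file paths; n_jobs=-1 uses all available cores):

import pandas as pd
from joblib import Parallel, delayed

paths = ['part1.csv', 'part2.csv', 'part3.csv']  # hypothetical input files
# Read each csv in a separate worker, then stitch the results together
frames = Parallel(n_jobs=-1)(delayed(pd.read_csv)(p) for p in paths)
combined = pd.concat(frames, ignore_index=True)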