I would like to read a large .xls file in parallel using pandas.
Currently I am using this:
import multiprocessing as mp
import pandas as pd
LARGE_FILE = "LARGEFILE.xlsx"
CHUNKSIZE = 100000  # processing 100,000 rows at a time
def process_frame(df):
    # process data frame
    return len(df)
if __name__ == '__main__':
    reader = pd.read_excel(LARGE_FILE, chunksize=CHUNKSIZE)
    pool = mp.Pool(4)  # use 4 processes
    funclist = []
    for df in reader:
        # process each data frame
        f = pool.apply_async(process_frame, [df])
        funclist.append(f)
    result = 0
    for f in funclist:
        result += f.get(timeout=10)  # timeout in 10 seconds
While this runs, I don't think it actually speeds up the process of reading the file. Is there a more efficient way of achieving this?
Just for your information: I'm reading a 13 MB, 29,000-line CSV in about 4 seconds (without parallel processing).
Arch Linux, AMD Phenom II X2, Python 3.4, python-pandas 0.16.2.
How big is your file, and how long does it take to read it?
That would help to understand the problem better.
Is your Excel sheet very complex? Maybe read_excel has difficulty processing that complexity.
Suggestion: install Gnumeric and use its helper program ssconvert to convert the file to CSV. In your program, switch to read_csv. Check the time used by ssconvert and the time taken by read_csv.
By the way, python-pandas improved substantially between versions 0.13 and 0.16, so it is useful to check that you have a recent version.
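To make the suggested comparison concrete, here is a minimal timing sketch. The `timed` helper and the in-memory sample data are illustrative assumptions, not part of the original posts. With real files you would first run `ssconvert LARGEFILE.xlsx LARGEFILE.csv` in a shell (assuming Gnumeric is installed) and then time `pd.read_excel` on the .xlsx against `pd.read_csv` on the converted file.

```python
import io
import time

import pandas as pd


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in data so the sketch runs anywhere. With real files, compare e.g.
#   timed(pd.read_excel, "LARGEFILE.xlsx")
#   timed(pd.read_csv, "LARGEFILE.csv")
sample_csv = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))
df, seconds = timed(pd.read_csv, io.StringIO(sample_csv))
print(f"read {len(df)} rows in {seconds:.4f} s")
```

If read_csv on the converted file is much faster, the Excel parsing itself is the bottleneck, and adding worker processes after the read is unlikely to help.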
I'm currently using the following line to read Excel files
df = pd.read_excel(f"myfile.xlsx")
The problem is the enormous slowdown that occurs when I use data from this Excel file, for example inside functions. I think this occurs because I'm not reading the file via a context manager. Is there a way of combining a 'with' statement with the pandas 'read' call so the code runs more smoothly? Sorry that this is vague, I'm just learning about context managers.
Edit: Here is an example of a piece of code that does not run...
import pandas as pd
import numpy as np
def fetch_excel(x):
    df_x = pd.read_excel(f"D00{x}_balance.xlsx")
    return df_x
T = np.zeros(3000)
for i in range(0, 3000):
    T[i] = fetch_excel(1).iloc[i+18, 0]
print(fetch_excel(1).iloc[0,0])
...or it takes more than 5 minutes which seems exceptional to me. Anyway I can't work with a delay like that. If I comment out the for loop, this does work.
Usually the key reason to use the standard context managers when reading files is convenience: they open and close the underlying file descriptor for you. You can write context managers to do anything you'd like, though. They're just objects with __enter__ and __exit__ methods.
Unfortunately they aren't likely to solve the problem of slow loading times when reading your Excel file.
You are accessing the HDD, opening, reading and converting the SAME file D001_balance.xlsx 3000 times, each time to access a single piece of data (a different row each time, from 18 to 3017). This is pointless, as all the data is in the DataFrame after one read. Just use:
df_x = pd.read_excel("D001_balance.xlsx")
T = np.zeros(3000)
for i in range(0, 3000):
    T[i] = df_x.iloc[i+18, 0]
print(df_x.iloc[0,0])
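Once the file is read only once, the remaining loop can also be replaced by a single slice. This is a small sketch under the assumption that the first column is numeric. The DataFrame here is hypothetical stand-in data, so that the example runs without the Excel file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel("D001_balance.xlsx")
df_x = pd.DataFrame({"balance": np.arange(4000, dtype=float)})

# Loop version from the answer above
T_loop = np.zeros(3000)
for i in range(0, 3000):
    T_loop[i] = df_x.iloc[i + 18, 0]

# Equivalent slice: rows 18..3017 of the first column, as a NumPy array
T = df_x.iloc[18:3018, 0].to_numpy()

print(np.array_equal(T, T_loop))
```

Both versions pick the same 3000 values. The slice simply avoids 3000 separate .iloc lookups.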
I'm trying to process a very large CSV file (.gz files of around 6 GB) on a VM, and to speed things up I'm looking into various multiprocessing tools. I'm pretty new to this so I'm learning as I go, but so far what I got from a day of research is that Pool works great for CPU-bound tasks.
I'm processing a very large CSV by dividing it into chunks of set size, and processing those chunks individually. The goal is to be able to process these chunks in parallel, but without needing to first create a list of dataframes with all the chunks in it, as that would take a really long time in itself. Chunk processing is almost entirely pandas based (not sure if that's relevant) so I can't use dask. One of the processing functions then writes my results to an outfile. Ideally I would like to preserve the order of the results, but if I can't do that I can try to work around it later. Here's what I got so far:
if __name__ == "__main__":
    parser = parse()
    args = parser.parse_args()
    a = Analysis(vars(args))
    attributes = vars(a)
    count = 0
    pool = mp.Pool(processes=mp.cpu_count())
    for achunk in pd.read_csv(a.File,
                              compression='gzip',
                              names=inputHeader,
                              chunksize=simsize,
                              skipinitialspace=True,
                              header=None):
        pool.apply_async(a.beginProcessChunk(achunk,
                                             start_time,
                                             count))
        count += 1
This ultimately takes the same amount of time as running it serially (tested on a small file), and it actually takes a tiny bit longer. I'm not sure exactly what I'm doing wrong but I'm assuming that putting the pool function inside a loop won't make the loop process in parallel. I'm really new to this so maybe I'm just missing something trivial, so I'm sorry in advance for that. Could anyone give me some advice on this and/or tell me how exactly I can make this work?
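One thing that stands out in the loop above: a.beginProcessChunk(...) is called in the parent process, and only its return value is handed to apply_async, so the chunks are still processed one after another. apply_async expects the callable and its arguments separately. Below is a minimal sketch of that pattern. The function process_chunk and the tiny in-memory CSV are hypothetical placeholders, not the asker's real code.

```python
import io
import multiprocessing as mp

import pandas as pd


def process_chunk(chunk, count):
    # Placeholder for the real pandas-based work on one chunk.
    return count, int(chunk["value"].sum())


def run(source, chunksize, processes=2):
    with mp.Pool(processes=processes) as pool:
        # Pass the function and its arguments separately; do not call it here.
        pending = [
            pool.apply_async(process_chunk, args=(chunk, count))
            for count, chunk in enumerate(pd.read_csv(source, chunksize=chunksize))
        ]
        # Collecting in submission order preserves the order of the results.
        return [p.get() for p in pending]


if __name__ == "__main__":
    csv_text = "value\n" + "\n".join(str(i) for i in range(10))
    print(run(io.StringIO(csv_text), chunksize=4))
```

Note that every chunk is pickled and sent to a worker, and that this sketch submits all chunks before collecting any result. A speedup is only likely when the per-chunk processing clearly outweighs the cost of reading and transferring the chunk.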
I have some relatively large .mat files that I'm reading in to Python to eventually use in PyTorch. These files range in the number of rows (~55k to ~111k), but each has a little under 11k columns, with no header, and all the entries are floats. The data file sizes range from 5.8 GB to 11.8 GB. The .mat files came from a prior data processing step in Perl, so I'm not sure about the mat version; when I tried to load a file using scipy.io.loadmat, I received the following error: ValueError: Unknown mat file type, version 46, 56. I've tried pandas, dask, and astropy and been successful, but it takes between 4-6 minutes to load a single file. Here's the code for loading using each of the methods I've mentioned above, run as a timing experiment:
import pandas as pd
import dask.dataframe as dd
from astropy.io import ascii as aio
import numpy as np
import time
numberIterations = 6
daskTime = np.zeros((numberIterations,), dtype=float)
pandasTime = np.zeros((numberIterations,), dtype=float)
astropyTime = np.zeros((numberIterations,), dtype=float)
for ii in range(numberIterations):
    t0 = time.time()
    data = dd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    daskTime[ii] = time.time() - t0
    data = 0
    del data
    t0 = time.time()
    data = pd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    pandasTime[ii] = time.time() - t0
    data = 0
    del data
    t0 = time.time()
    data = aio.read(dataPath, format='fast_no_header', delimiter='\t', header_start=None, guess=False)
    astropyTime[ii] = time.time() - t0
    data = 0
    del data
When I time these methods, dask is by far the slowest (by almost 3x), followed by pandas, and then astropy. For the largest file, the load time (in seconds) for 6 runs is:
dask: 1006.15 (avg), 1.14 (std)
pandas: 337.50 (avg), 5.84 (std)
astropy: 314.61 (avg), 2.02 (std)
I'm wondering if there is a faster way to load these files, since this is still quite long. Specifically, I'm wondering if there is perhaps a better library to use for the consistent loading of tabular float data and/or if there is a way to incorporate C/C++ or bash to read the files faster. I realize this question is a little open-ended; I'm hoping to get some ideas for how I can read these files in faster, so there is not a bunch of time wasted on just reading in the files.
Given these were generated in Perl, and given the code above works, these are tab-separated text files, not MATLAB files, which are what scipy.io.loadmat expects.
Generally, reading in text is slow, and will depend heavily on compression and IO limitations.
FWIW pandas is already pretty well optimised under the hood, and I doubt you would get significant gains from using C directly.
If you plan to use these files frequently, it might be worth using zarr or HDF5 to represent tabular float data. I'd lean towards zarr if you have some experience with dask already. They work nicely together.
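To illustrate the "convert once, load from a binary format afterwards" idea without extra dependencies, here is a rough sketch that uses NumPy's own .npy format as a stand-in for zarr or HDF5 (a swapped-in technique, not what the answer names). The convert_to_npy helper, the file names and the tiny 3x4 array are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def convert_to_npy(text_path, npy_path):
    """Parse the tab-separated text file once and store it as binary .npy."""
    data = pd.read_csv(text_path, delimiter="\t", dtype=np.float64, header=None)
    np.save(npy_path, data.to_numpy())


with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "data.mat")  # text file despite the name
    npy_path = os.path.join(tmp_dir, "data.npy")

    # Write a tiny tab-separated stand-in for the real multi-GB file.
    original = np.arange(12, dtype=np.float64).reshape(3, 4)
    np.savetxt(text_path, original, delimiter="\t")

    convert_to_npy(text_path, npy_path)  # slow step, done once per file
    loaded = np.load(npy_path)           # fast step, repeated on every later run

print(loaded.shape)
```

For files that do not fit comfortably in memory, np.load also accepts mmap_mode="r", which maps the file instead of reading it all at once.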
I have a dataset with ~200 million rows, ~10 grouping variables, and ~20 variables to sum; it is a ~50 GB CSV. The first thing I did was measure the runtime when processing it sequentially, in chunks. It's a bit more complicated because some of the group-by columns are actually in another dataset at a different aggregation level, which is only ~200 MB. So the relevant code right now looks like this:
group_cols = ['cols','to','group','by']
cols_to_summarize = ['cols','to','summarize']
groupbys = []
df = pd.read_csv("file/path/df.csv",chunksize=1000000)
for chunk in df:
    chunk = chunk.merge(other_df,left_on="id",right_index=True,how="inner")
    groupbys.append(chunk.groupby(group_cols)[cols_to_summarize].sum())
finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Each chunk takes roughly 5 seconds to process, so the 200 chunks take about 15-20 minutes. The server I'm working on has 16 cores, so I'm hoping to get a bit of speedup here; if I could get it to 2-3 minutes that would be amazing.
But when I try to use multiprocessing I'm struggling to get much speedup at all. Based on my googling I thought that would help with reading in CSVs, but I'm wondering if multiple processes can't read the same CSV and maybe I should split it up first? Here's what I tried, and it took longer than the sequential run:
def agg_chunk(start):
    # [pull in small dataset]
    chunk = pd.read_csv("file/path/df.csv",skiprows=range(1,start+1),nrows=1000000)
    chunk = chunk.merge(other_df,left_on="id",right_index=True,how="inner")
    return chunk.groupby(group_cols)[cols_to_summarize].sum()
if __name__ == "__main__":
    pool = mp.Pool(16)
    r = list(np.array(range(200))*1000000)
    groupbys = pool.map(agg_chunk,r)
    finalAgg = pd.concat(groupbys).groupby(group_cols)[cols_to_summarize].sum()
Is there a better way to do this? The extra [pull in small dataset] piece takes about 5 seconds, but doubling the time per process and then dividing by 16 should still be a pretty good speedup right? Instead the parallel version has been running for half an hour and is still not complete. Also is there some way to pass the dataset to each process instead of making each re-create it again?
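Two things may be worth trying here, sketched below under stated assumptions. First, with skiprows each worker still has to scan the file from the top to reach its starting row, so later chunks get progressively more expensive. Reading chunks once in the parent and handing them to the pool avoids that. Second, a Pool initializer can give each worker the small dataset once instead of rebuilding it per task. The names init_worker, aggregate and the column names "group" and "value" are hypothetical, and the sample data is made up.

```python
import io
import multiprocessing as mp

import pandas as pd

_other_df = None  # per-worker copy of the small lookup table


def init_worker(other_df):
    # Runs once in each worker process when the pool starts.
    global _other_df
    _other_df = other_df


def agg_chunk(chunk):
    merged = chunk.merge(_other_df, left_on="id", right_index=True, how="inner")
    return merged.groupby("group")[["value"]].sum()


def aggregate(source, other_df, chunksize, processes=2):
    reader = pd.read_csv(source, chunksize=chunksize)  # single sequential read
    with mp.Pool(processes, initializer=init_worker, initargs=(other_df,)) as pool:
        partials = list(pool.imap(agg_chunk, reader))
    # Combine the per-chunk sums into the final aggregate.
    return pd.concat(partials).groupby(level=0).sum()


if __name__ == "__main__":
    other = pd.DataFrame({"group": ["a", "b"]}, index=[1, 2])
    csv_text = "id,value\n1,10\n2,20\n1,5\n2,1\n"
    print(aggregate(io.StringIO(csv_text), other, chunksize=2))
```

In this layout the CSV parsing stays serial in the parent, so the achievable speedup depends on how much of the 5 seconds per chunk is the merge and groupby rather than the read itself.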
I am working on a large dataset where I need to read Excel files and then find the valid phone numbers, but the task takes an enormous amount of time for only 500k rows. To validate the numbers, I am using Google's phonenumbers library. The processing can be done asynchronously, as the rows are independent.
parts = dask.delayed(pd.read_excel)('500k.xlsx')
data = dd.from_delayed(parts)
data['Valid'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x)),meta=('Valid','object'))
For background,
phonenumbers.is_valid_number(phonenumbers.parse('+442083661177'))
gives True as output.
I expected this to take less than 10 seconds, but it takes around 40 seconds.
I've just been playing with this, and you might just need to repartition your dataframe to allow the computation to run in parallel.
I start by generating some data:
import csv
import random
with open('tmp.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerow(['id', 'number'])
    for i in range(500_000):
        a = random.randrange(1000, 2999)
        b = random.randrange(100_000, 899_999)
        out.writerow([i+1, f'+44 {a} {b}'])
note that these are mostly valid UK numbers.
I then run something similar to your code:
from dask.distributed import Client
import dask.dataframe as dd
import phonenumbers
def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))
with Client(processes=True):
    df = dd.read_csv('tmp.csv')
    # repartition to increase parallelism
    df = df.repartition(npartitions=8)
    df['valid'] = df.number.apply(fn, meta=('valid', 'object'))
    out = df.compute()
This takes ~20 seconds to complete on my laptop (4 cores, 8 threads, Linux 5.2.8), which is only a bit more than double the performance of the plain loop. That suggests dask has quite a bit of runtime overhead, as I'd expect it to be much faster than that. If I remove the call to repartition, it takes longer than I'm willing to wait, and top only shows a single process running.
Note that if I rewrite it to do the naive thing in multiprocessing, I get much better results:
from multiprocessing import Pool
import pandas as pd
df = pd.read_csv('tmp.csv')
with Pool(4) as pool:
    df['valid'] = pool.map(fn, df['number'])
This reduces the runtime to ~11 seconds and, as a bonus, is even less code.