MemoryError when concatenating a large data-frame - python

I am creating a tool that reads CSV files and lets the user pick the columns they want to categorize, then categorizes those columns.
My issue is that these CSV files are quite large, and when I try to concatenate the DataFrames my PC freezes up and I get a MemoryError.
I split the DataFrame into chunks, run get_dummies on each chunk, and store the results in a list. This works without any issues.
I then try to concatenate the entire list, as you can see in the code below.
I also delete the intermediate DataFrames and the list of chunks to free up memory.
dummies = []
columns = self.df[self.selectedHeaders]
del self.df
chunks = (len(columns) // 10000) + 1  # integer division so np.array_split gets an int
df_list = np.array_split(columns, chunks)
del columns
for i, df_chunk in enumerate(df_list):
    print("Getting dummy data for chunk: " + str(i))
    dummies.append(pd.get_dummies(df_chunk))
del df_list
dummies = pd.concat(dummies, axis=1)
As you can see from this code, I store the columns I need and split them into chunks. Then I run the get_dummies function on each chunk and store them within a list.
When I run the concat function, my PC either crashes or throws a MemoryError. If I can get the code to run far enough to throw the error without crashing, I'll post it here.
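Not an answer from the original thread, but one commonly suggested way to shrink the memory footprint of this pattern is to ask get_dummies for sparse output, which can remove the need to chunk and concatenate at all. A minimal sketch with hypothetical stand-in data (plain names replace the question's self.df and self.selectedHeaders):
import pandas as pd

# Hypothetical stand-ins for self.df and self.selectedHeaders from the question
df = pd.DataFrame({"city": ["a", "b", "a", "c"], "colour": ["red", "blue", "red", "green"]})
selected_headers = ["city", "colour"]

# sparse=True stores the mostly-zero indicator columns as SparseArrays,
# which cuts memory sharply when the columns have many distinct values
dummies = pd.get_dummies(df[selected_headers], sparse=True)
print(dummies.memory_usage(deep=True).sum())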

Related

Is there a way to view my data frame in pandas without reading in the file every time?

Here is my code:
import pandas as pd
df = pd.read_parquet("file.parqet", engine='pyarrow')
df_set_index = df.set_index('column1')
row_count = df.shape[0]
column_count = df.shape[1]
print(df_set_index)
print(row_count)
print(column_count)
Can I run this without reading in the parquet file each time I want to do a row count, column count, etc.? It takes a while to read the file in because it's large, and I've already read it in once, but I'm not sure how to avoid reading it again.
pd.read_parquet reads the file from disk into memory every time it is called, which is naturally slow with a lot of data. So you could engineer a solution like:
1.) column_count
pd.read_parquet does not accept an nrows argument, so you cannot read "just one row" the way you can with read_csv. However, a Parquet file stores its schema and row counts in the footer metadata, so you can get the column count without reading any data at all; see the pyarrow sketch below.
2.) row_count
cols_want = ['column1']  # put whatever column names you want here
row_count = pd.read_parquet("file.parqet", engine='pyarrow', columns=cols_want).shape[0]
-> This gives you the number of rows while reading in only "column1" instead of every column (which is the reason your current approach takes a while). Note that read_parquet's keyword is columns, not usecols.
-> .shape returns a tuple of (# rows, # columns), so grab the first item for the row count, as demonstrated above.
3.) df.set_index(...) returns a new DataFrame rather than modifying df in place, so storing the result in a variable is fine, but it isn't needed just to count rows or columns. If you only want to see what is in the column, use #2 above and remove the .shape[0] call.
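A minimal sketch of the metadata approach using pyarrow directly; ParquetFile reads only the file footer, so neither count below loads the actual data (the file name is the one from the question):
import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parqet")
print(pf.metadata.num_rows)     # total number of rows, straight from the footer
print(pf.metadata.num_columns)  # number of (leaf) columns in the schema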

Improve performance of running large files

I know there are a few questions on this topic, but I can't seem to get it running efficiently. I have large input datasets (2-3 GB) running on a machine with 8 GB of memory. I'm using a version of Spyder with pandas 0.24.0 installed. Processing the input file currently takes around an hour to generate an output file of around 10 MB.
I have since attempted to optimise the process by chunking the input file using the code below. Essentially, I chunk the input file into smaller segments, run each segment through some code, and export the smaller output. I then delete the chunked data to release memory, but the memory usage still builds throughout the operation and the run ends up taking a similar amount of time. I'm not sure what I'm doing wrong:
The memory usage details of the file are:
RangeIndex: 5471998 entries, 0 to 5471997
Data columns (total 17 columns):
col1 object
col2 object
col3 object
....
dtypes: object(17)
memory usage: 5.6 GB
I subset the df by passing cols_to_keep to usecols. But the headers are different for each file, so I used location indexing to get the relevant headers.
# Because the column headers change from file to file I use location indexing to read the col headers I need
df_cols = pd.read_csv('file.csv')
# Select the cols to be used
df_cols = df_cols.iloc[:, np.r_[1, 3, 8, 12, 23]]
# Export col headers
cols_to_keep = df_cols.columns

PATH = '/Volume/Folder/Event/file.csv'
chunksize = 10000
df_list = []  # list to hold the batch dataframes

for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Measure time taken to process each batch
    print("summation download chunk time: ", time.clock() - t)
    # Execute func1
    df1 = func1(df_chunk)
    # Execute func2
    df2 = func2(df1)
    # Append the processed chunk to the list
    df_list.append(df2)

# Merge all dataframes into one dataframe
df = pd.concat(df_list)
# Delete the dataframe list to release memory
del df_list
del df_chunk
I have tried using dask but get all kinds of errors with simple pandas methods.
import dask.dataframe as ddf

df_cols = pd.read_csv('file.csv')
df_cols = df_cols.iloc[:, np.r_[1:3, 8, 12, 23:25, 32, 42, 44, 46, 65:67, -5:0]]
cols_to_keep = df_cols.columns

PATH = '/Volume/Folder/Event/file.csv'
blocksize = 10000
df_list = []  # list to hold the batch dataframe

df_chunk = ddf.read_csv(PATH, blocksize=blocksize, usecols=cols_to_keep, parse_dates=['Time'])
print("summation download chunk time: ", time.clock() - t)
# Execute func1
df1 = func1(df_chunk)
# Execute func2
df2 = func2(df1)
# Append the chunk to the list and merge all
df_list.append(df2)
delayed_results = [delayed(df2) for df_chunk in df_list]
line that threw error:
df1 = func1(df_chunk)
name_unq = df['name'].dropna().unique().tolist()
AttributeError: 'Series' object has no attribute 'tolist'
I've passed through numerous functions and it just continues to throw errors.
To process your file, use dask instead; it is designed precisely for dealing
with big (actually, very big) files.
It also has a read_csv function, with an additional blocksize parameter to
define the size of a single chunk.
The result of read_csv is conceptually a single (dask) DataFrame, which
is composed of a sequence of partitions, each of which is an ordinary pandas DataFrame.
Then you can use the map_partitions function to apply your own function
to each partition.
Because this function (the one passed to map_partitions) operates on a single
partition (a pandas DataFrame), you can use any code that you
previously tested in a pandas environment.
The advantage of this solution is that the processing of individual partitions
is divided among the available cores, whereas pandas uses only a single core.
So your loop should be reworked to:
read_csv (the dask version),
map_partitions to generate partial results from each partition,
concat to combine these partial results (at this point the result
is still a dask DataFrame),
convert it to a pandas DataFrame,
and make further use of it in a pandas environment.
A minimal sketch of this workflow follows. To get more details, read about dask.
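A sketch of that workflow, assuming func1 and func2 each accept and return a pandas DataFrame (the names, PATH and cols_to_keep are taken from the question), and noting that dask's blocksize is given in bytes, not rows:
import dask.dataframe as dd

# Read the CSV as a dask DataFrame made of ~64 MB partitions
ddf = dd.read_csv(PATH, usecols=cols_to_keep, blocksize="64MB", parse_dates=['Time'])

# Apply the already-tested pandas functions to every partition
ddf = ddf.map_partitions(func1)
ddf = ddf.map_partitions(func2)

# compute() runs the task graph and returns a single in-memory pandas DataFrame
df = ddf.compute()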

creating a dictionary for big size csv data files using pandas

I'm trying to create a dictionary file for a big csv file that is divided into chunks to be processed, but when I create the dictionary it only covers one chunk, and when I try to append the chunks it passes an empty dataframe to the new df. This is the code I used:
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts()
               .rename_axis('word').reset_index(name='freq'))
    dic.append(dic_tmp)
dic.to_csv('newwww.csv', index=False)
If I save dic_tmp, it is just the dictionary for one chunk, not the whole set, and dic takes a lot of time to process but returns empty dataframes at the end. Is there any error in my code?
(The input csv, the current output csv and the expected output were shown as images in the original post.)
So it's not adding the chunks together; it just pastes in the new chunk regardless of what is in the previous chunk or in the csv.
In order to split the column into words and count the occurrences:
df['sentences'].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)
or
from collections import Counter
result = Counter(" ".join(df['sentences'].values.tolist()).split(" ")).items()
both seem to be equally slow, but probably better than your approach.
Taken from here:
Count distinct words from a Pandas Data Frame
A couple of problems that I see:
Why read the csv file twice?
First time here: wdata = pd.read_csv(fileinput, nrows=0).columns[0], and a second time in the for loop.
If you aren't using the combined data frame further, I think it is better to write the chunks to the csv file in append mode, as shown below:
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts()
               .rename_axis('word').reset_index(name='freq'))
    dic_tmp.to_csv('newwww.csv', mode='a', header=False)
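If the goal is a single frequency table in which counts for words that appear in more than one chunk are added together (the expected output in the question), one option is to keep a running Counter across chunks. A minimal sketch, reusing fileinput and skip from the question's code:
from collections import Counter

import pandas as pd

total = Counter()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    # Update the running word counts with every sentence in this chunk
    for sentence in chunk['sentences'].dropna():
        total.update(sentence.split())

dic = pd.DataFrame(list(total.items()), columns=['word', 'freq'])
dic.to_csv('newwww.csv', index=False)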

How to efficiently remove junk above headers in an .xls file

I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets prints the selected parameters for the data set above the headers (E.G 1), and I am looking to automate the removal of these rows.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution: I convert the file to a CSV, read it in as a string, and remove everything before the first column header.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
    s = f.read() + '\n'
a = s[s.index("Col1"):]
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works, but it seems wildly inefficient:
multiple format conversions;
reading every line in the file when the only rows being altered occur within the first ~20;
the DataFrame ends up with column headers shifted over by one and must be re-aligned (less of a concern).
With some of the files being around 20 MB, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your dataframe. Since you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless leftover label)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
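If the number of junk rows varies from file to file, the two answers can be combined: scan only the first column for "Col1" and pass that row number to skiprows. A sketch, assuming the file name and the first column title from the question:
import pandas as pd

# Read only the first column, without treating any row as a header
first_col = pd.read_excel('InputFile.xls', header=None, usecols=[0])
# Row index of the real header line
header_row = first_col[first_col[0] == "Col1"].index[0]
# Re-read the sheet, skipping the junk rows above the header
data_xls = pd.read_excel('InputFile.xls', skiprows=header_row)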

How to write data to existing excel file using pandas?

I want to request some data from a Python module, tushare.
Using this code, I can get one line of data each time.
However, I want to send the server a request roughly every 5 seconds
and put all the data collected within 4 hours into one Excel file.
I noticed that pandas is already built into tushare.
How can I put the data together and generate only one Excel file?
import tushare as ts
df=ts.get_realtime_quotes('000875')
df.to_excel(r'C:\Users\stockfile\000875.xlsx')
You can do it with for example
df = df.append(ts.get_realtime_quotes('000875'))
Given the number of calls, it may be better to create a data frame up front and fill it with data rows as they arrive. Something like this:
from time import sleep

# Here, just getting the column names:
columns = ts.get_realtime_quotes('000875').columns
# Choose the right number of calls, N
df = pd.DataFrame(index=range(N), columns=columns)
for i in range(N):
    df.iloc[i] = ts.get_realtime_quotes('000875').iloc[0]
    sleep(5)
Another way to do it (possibly simpler, and without preallocating the empty data frame) would be to store the answers from tushare in a list and then apply pd.concat:
list_of_dfs = []
for _ in range(N):
    list_of_dfs.append(ts.get_realtime_quotes('000875'))
    sleep(5)

full_df = pd.concat(list_of_dfs)
This way you don't need to know the number of requests in advance (for example, if you decide to write the loop without an explicit number of repetitions).
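Putting the pieces together: a minimal sketch of the second approach, run for a fixed 4-hour window with a 5-second pause between requests and written to the Excel path from the question (the window, interval and path are the ones mentioned there):
import time

import pandas as pd
import tushare as ts

list_of_dfs = []
end = time.time() + 4 * 60 * 60  # poll for roughly 4 hours
while time.time() < end:
    list_of_dfs.append(ts.get_realtime_quotes('000875'))
    time.sleep(5)  # wait about 5 seconds between requests

full_df = pd.concat(list_of_dfs, ignore_index=True)
full_df.to_excel(r'C:\Users\stockfile\000875.xlsx', index=False)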
