Using pandas structures with large csv (iterate and chunksize)

I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs etc. Obviously just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterator and chunksize in a similar post and used:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good. I can, for example, print df.get_chunk(5) and search the whole file with just:
for chunk in df:
    print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df rather than just one chunk.
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()

Solution: if you need to create one big DataFrame and process all the data at once (possible, but not recommended), use concat on all the chunks, because the output of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader - source.
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to add the ignore_index parameter to concat, to avoid duplicate indexes.
EDIT:
But if you want to work with large data, e.g. aggregating, it is much better to use dask, because it provides advanced parallelism.
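For example, a minimal dask sketch for the aggregation in the question (assuming the same file and UserID column as above; a sketch, not a tested setup):
import dask.dataframe as dd

ddf = dd.read_csv('Check1_900.csv', sep='\t')      # lazy; dask reads the file in chunks behind the scenes
y3 = ddf.groupby('UserID').size().compute()        # the aggregation itself runs in parallel
print(y3.head())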

You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 does is to give you a reader object that iterates 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.
But if reading the whole file into memory at once is too expensive (e.g., takes so much memory that you get a MemoryError, or slows your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.
The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.
When you say this:
My problem is that I don't know how to use things like the ones below on the whole df rather than just one chunk
The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.
Instead of this:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')
… you have to do things like this:
for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')
Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.
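For example, here is a minimal sketch of that running-totals idea, using the question's file and its UserID column (adapt the per-chunk reduction to whatever you actually need):
import pandas as pd

totals = None
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    sizes = chunk.groupby('UserID').size()                                  # reduce the chunk to per-group counts
    totals = sizes if totals is None else totals.add(sizes, fill_value=0)   # accumulate running totals

print(totals.sort_values(ascending=False).head())                           # group totals across the whole file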

You need to concatenate the chunks. For example:
df2 = pd.concat([chunk for chunk in df])
Then run your commands on df2.

This might not answer the question directly, but when you have to load a big dataset it is good practice to convert the dtypes of your columns while reading the dataset. Also, if you know which columns you need, use the usecols argument to load only those.
df = pd.read_csv("data.csv",
                 usecols=['A', 'B', 'C', 'Date'],
                 dtype={'A': 'uint32',
                        'B': 'uint8',
                        'C': 'uint8'
                        },
                 parse_dates=['Date'],  # convert to datetime64
                 sep='\t'
                 )

Related

Pandas Read in only specific lines of a CSV file

I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e. only read in rows where the value in column 'A' is the value '5'.
The Dask module can do a lazy read of a large CSV file in Python.
You trigger the computation by calling the .compute() method; at that point the file is read in chunks and whatever conditional logic you specified is applied.
import dask.dataframe as dd
df = dd.read_csv(csv_file)
df = df[df['A'] == 5]
df = df.compute()
print(len(df)) # print number of records
print(df.head()) # print first 5 rows to show sample of data
If you're looking for a particular value in a CSV file, you have to scan the entire file and then limit the result to 5 rows.
If you just want to retrieve the first five rows, you may be looking for this:
nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
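For example (a trivial sketch; the filename is just a placeholder):
import pandas as pd

df = pd.read_csv('large_file.csv', nrows=5)   # read only the first five data rows
print(df)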
Try and chunk it dude! Truffle Shuffle! Goonies Never say die.
mylist = []
for chunk in pd.read_csv('csv_file.csv', sep=',', chunksize=10000):
    mylist.append(chunk[chunk.A == 5])
big_data = pd.concat(mylist, axis=0)
del mylist

Improve performance of running large files

I know there are a few questions on this topic but I can't seem to get it going efficiently. I have large input datasets (2-3 GB) that I run on my machine, which has 8 GB of memory. I'm using a version of Spyder with pandas 0.24.0 installed. The input file currently takes around an hour to generate an output file of around 10 MB.
I since attempted to optimise the process by chunking the input file using the code below. Essentially, I chunk the input file into smaller segments, run it through some code and export the smaller output. I then delete the chunked info to release memory. But the memory still builds throughout the operation and ends up taking a similar amount of time. I'm not sure what I'm doing wrong:
The memory usage details of the file are:
RangeIndex: 5471998 entries, 0 to 5471997
Data columns (total 17 columns):
col1 object
col2 object
col3 object
....
dtypes: object(17)
memory usage: 5.6 GB
I subset the df by passing cols_to_keep to usecols. But the headers are different for each file, so I used location indexing to get the relevant headers.
# Because the column headers change from file to file I use location indexing to read the col headers I need
df_cols = pd.read_csv('file.csv')
# Read cols to be used
df_cols = df_cols.iloc[:,np.r_[1,3,8,12,23]]
# Export col headers
cols_to_keep = df_cols.columns
PATH = '/Volume/Folder/Event/file.csv'
chunksize = 10000
df_list = [] # list to hold the batch dataframe
for df_chunk in pd.read_csv(PATH, chunksize=chunksize, usecols=cols_to_keep):
    # Measure time taken to execute each batch
    print("summation download chunk time: ", time.clock() - t)
    # Execute func1
    df1 = func1(df_chunk)
    # Execute func2
    df2 = func1(df1)
    # Append the chunk to list and merge all
    df_list.append(df2)

# Merge all dataframes into one dataframe
df = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
I have tried using dask but get all kinds of errors with simple pandas methods.
import dask.dataframe as ddf
df_cols = pd.read_csv('file.csv')
df_cols = df_cols.iloc[:,np.r_[1:3,8,12,23:25,32,42,44,46,65:67,-5:0,]]
cols_to_keep = df_cols.columns
PATH = '/Volume/Folder/Event/file.csv'
blocksize = 10000
df_list = [] # list to hold the batch dataframe
df_chunk = ddf.read_csv(PATH, blocksize=blocksize, usecols=cols_to_keep, parse_dates=['Time'])
print("summation download chunk time: " , time.clock()-t)
# Execute func1
df1 = func1(df_chunk)
# Execute func2
df2 = func1(df1)
# Append the chunk to list and merge all
df_list.append(df2)
delayed_results = [delayed(df2) for df_chunk in df_list]
The line that threw the error:
df1 = func1(df_chunk)
name_unq = df['name'].dropna().unique().tolist()
AttributeError: 'Series' object has no attribute 'tolist'
I've passed through numerous functions and it just continues to throw errors.
To process your file, use dask instead, which is intended precisely for dealing with big (actually, very big) files.
It also has a read_csv function, with an additional blocksize parameter to define the size of a single chunk.
The result of read_csv is conceptually a single (dask) DataFrame, which is composed of a sequence of partitions, each actually a pandasonic DataFrame.
Then you can use the map_partitions function to apply your function to each partition.
Because this function (passed to map_partitions) operates on a single partition (a pandasonic DataFrame), you can use any code which you previously tested in a Pandas environment.
The advantage of this solution is that the processing of individual partitions is divided among the available cores, whereas Pandas uses only a single core.
So your loop should be reworked to:
read_csv (dask version),
map_partitions generating partial results from each partition,
concat - to concatenate these partial results (for now the result is still a dask DataFrame),
convert it to a pandasonic DataFrame,
make further use of it in the Pandas environment.
To get more details, read about dask.
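A minimal sketch of that workflow, reusing PATH, cols_to_keep and func1 from the code in the question (a sketch under those assumptions, not a tested pipeline):
import dask.dataframe as dd

# Lazy, partitioned read; blocksize controls the size of each partition
ddf = dd.read_csv(PATH, blocksize="25MB", usecols=cols_to_keep)

# Apply the function you already tested with plain pandas to every partition
processed = ddf.map_partitions(func1)

# Nothing has been computed yet; compute() gathers the partial results
# and returns an ordinary pandas DataFrame for further use in Pandas
result = processed.compute()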

How to concatenate large dataset into dataframe pandas

So I am working with a fairly substantial CSV dataset that is a couple hundred megabytes. I have managed to read in the data in chunks (~100 rows).
How do I then elegantly convert those chunks into a dataframe and apply the describe function to it?
Thank you
It seems you need to concat the chunks of the TextFileReader object (the output of read_csv when the chunksize parameter is passed) and then call describe:
df = pd.concat([x for x in pd.read_csv('filename', chunksize=1000)], ignore_index=True)
df = df.describe()
print(df)

Python pandas to get specified rows from a CSV file

I am trying to read a very large set of data from a CSV file using pandas in python. I need to break up the data into parts to take it in, therefore I would like to take in half of the rows first and then the other half.
I see that there is the chunksize parameter in the read_csv. However, I cannot seem to figure out how to put it all into a matrix or sparse matrix after it is read.
wow = pd.read_csv('TestingCSV.csv', sep=',', header='infer', low_memory=False, chunksize=10, usecols=(range(3, 5)))
This returns a type: <class 'pandas.io.parsers.TextFileReader'>
What is a possible way to take in the different chunks and then reconstruct a matrix or sparse matrix from them?
When you use read_csv without chunking, it reads the whole file; you can't read just part of it that way.
When it comes to chunksize, you need to take those "chunks" that wow yields and concat() them.
For example:
chunks = pd.read_csv(data, chunksize=100)
df = pd.concat(chunks, ignore_index=True)
So now you have the full dataframe and you can do whatever analysis you need to do.
It's also an iterable object, so you can do the following:
for chunk in chunks:
    # do something to each chunk
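If the end goal is a sparse matrix rather than a DataFrame, one possible approach (a sketch using scipy.sparse, assuming the selected columns are numeric; this is not covered by the answer above) is to convert each chunk and stack the pieces:
import pandas as pd
from scipy import sparse

blocks = []
for chunk in pd.read_csv('TestingCSV.csv', sep=',', chunksize=10, usecols=range(3, 5)):
    blocks.append(sparse.csr_matrix(chunk.to_numpy()))   # each chunk becomes a sparse block

matrix = sparse.vstack(blocks)                           # stack the blocks row-wise
print(matrix.shape)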

How do I read a large csv file with pandas?

I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting a memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
Any help on this?
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk.
(The last chunk may contain fewer than chunksize rows, of course.)
pandas >= 1.2
read_csv with chunksize returns a context manager, to be used like so:
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
See GH38225
Chunking shouldn't always be the first port of call for this problem.
Is the file large due to repeated non-numeric data or unwanted columns?
If so, you can sometimes see massive memory savings by reading in columns as categories and selecting the required columns via the pd.read_csv usecols parameter.
Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
If all else fails, read line by line via chunks.
Chunk via pandas or via the csv library as a last resort.
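For example, a minimal sketch of the categories/usecols idea (the filename and column names are placeholders):
import pandas as pd

df = pd.read_csv(
    'large.csv',
    usecols=['id', 'status', 'value'],               # load only the columns you need
    dtype={'id': 'uint32', 'status': 'category'},    # store repeated strings as a category
)
print(df.memory_usage(deep=True))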
For large data I recommend you use the library "dask", e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
You can read more from the documentation here.
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
From my projects, another superior library is datatable.
# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")
I proceeded like this:
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'], index_col='slno',
                       header=None, parse_dates=['date'])
df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
You can read in the data as chunks and save each chunk as pickle.
import pandas as pd
import pickle
in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"
reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read in the pickles and append each pickle to your desired dataframe.
import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are
data_p_files=[]
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

# DataFrame.append has been removed from recent pandas versions; build the frame with pd.concat instead
df = pd.concat([pd.read_pickle(p) for p in data_p_files], ignore_index=True)
I want to make a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
"dtypes" is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
Referring to data structures: for every piece of data stored, a memory allocation takes place. At a basic level, refer to the values below (the table illustrates values for the C programming language):
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
Refer to this page to see the matching between NumPy and C types.
Let's say you have an array of integers made up of single digits. You could, both theoretically and practically, store it as, say, an array of 16-bit integers, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv. You do not want to store the array items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).
The dtype map (figure) showing how NumPy types correspond to pandas dtypes can be found at the source: https://pbpython.com/pandas_dtypes.html
You can pass the dtype parameter to pandas read methods as a dict of the form {column: type}:
import numpy as np
import pandas as pd
df_dtype = {
"column_1": int,
"column_2": str,
"column_3": np.int16,
"column_4": np.uint8,
...
"column_n": np.float32
}
df = pd.read_csv('path/to/file', dtype=df_dtype)
Option 2: Read by Chunks
Reading the data in chunks allows you to access a part of the data in-memory, and you can apply preprocessing on your data and preserve the processed data rather than raw data. It'd be much better if you combine this option with the first one, dtypes.
I want to point out the pandas cookbook sections for that process; you can find them here. Note these two sections:
Reading a csv chunk-by-chunk
Reading only certain rows of a csv chunk-by-chunk
Option 3: Dask
Dask is a framework that is defined on Dask's website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the necessary parts that pandas cannot reach. Dask is a powerful framework that allows you to work with much more data by processing it in a distributed way.
You can use dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations until they are explicitly triggered by compute and/or persist (see the answer here for the difference).
Other Aids (Ideas)
An ETL flow designed for the data, keeping only what is needed from the raw data.
First, apply ETL to the whole data with frameworks like Dask or PySpark, and export the processed data.
Then see if the processed data can fit in memory as a whole.
Consider increasing your RAM.
Consider working with that data on a cloud platform.
Before using the chunksize option, if you want to be sure about the process function that you want to write inside the chunking for-loop, as mentioned by @unutbu, you can simply use the nrows option.
small_df = pd.read_csv(filename, nrows=100)
Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.
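For example, a minimal sketch of that workflow, with process() standing in for whatever per-chunk logic you are testing (the filename and the helper are placeholders, not part of the original answer):
import pandas as pd

def process(df):
    return df.describe()                                  # stand-in for your real per-chunk logic

small_df = pd.read_csv('large_file.csv', nrows=100)
print(process(small_df))                                  # sanity-check the logic on a small sample first

results = []
for chunk in pd.read_csv('large_file.csv', chunksize=10 ** 6):
    results.append(process(chunk))                        # then apply the same logic chunk by chunk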
The functions read_csv and read_table are almost the same, but you must assign the delimiter ',' when you use read_table in your program.
def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
Solution 1:
Using pandas with large data
Solution 2:
TextFileReader = pd.read_csv(path, chunksize=1000) # the number of rows per chunk
dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
Here follows an example:
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):
    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER:
    #1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    #2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
If you use pandas to read a large file in chunks and then yield it chunk by chunk, here is what I have done:
import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True, chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    chunk = chunck_generator(filename, header=False, chunk_size=10 ** 5)
    for row in chunk:
        yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    while True:
        print(next(generator))
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
If you have a csv file with millions of data entries and you want to load the full dataset, you should use dask_cudf (part of RAPIDS, which runs on NVIDIA GPUs):
import dask_cudf as dc
df = dc.read_csv("large_data.csv")
In addition to the answers above, for those who want to process CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out of core support is already built in.
def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible
