Memory error when using pandas read_csv - python

I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
The code either fails with a MemoryError, or just never finishes.
Mem usage in the task manager stopped at 506 Mb and after 5 minutes of no change and no CPU activity in the process I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to this it should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
It has also happened that I get a pop up telling me that it can't write to address 0x1e0baf93...
Traceback (most recent call last):
File "F:\QA ALM\Python\new WIM data\new WIM data\", line 25, in
wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\"
, line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\"
, line 216, in _read
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\"
, line 643, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\"
, line 394, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\"
, line 525, in _init_dict
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\"
, line 5338, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1820, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1872, in form_blocks
float_blocks = _multi_blockify(float_items, items)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1930, in _multi_blockify
block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1962, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size I might be fighting a losing battle...

Windows memory limitation
Memory errors happen a lot with python when using the 32bit version in Windows. This is because 32bit processes only get 2GB of memory to play with by default.
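A quick way to check whether the interpreter itself is a 32-bit build (a minimal sketch using only the standard library; the function name python_bitness is just for illustration):

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


bits = python_bitness()
print("Running a %d-bit Python" % bits)
# sys.maxsize gives the same answer: it exceeds 2**32 only on 64-bit builds.
print("64-bit according to sys.maxsize:", sys.maxsize > 2**32)
```

Note that a 32-bit Python can be installed on a 64-bit Windows, so checking the operating system alone is not enough.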
Tricks for lowering memory usage
If you are not using 32bit python in windows but are looking to improve on your memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Specifying dtype={'age':int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
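As a small illustration of the dtype option on the sample above (a sketch, not a benchmark: the in-memory io.StringIO stands in for a real file, and skipinitialspace=True is added here only to cope with the spaces after the commas in the sample):

```python
import io

import pandas as pd

# The sample CSV from above, held in memory instead of on disk.
CSV_TEXT = """name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
"""

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    skipinitialspace=True,            # drop the blank after each comma
    dtype={"name": str, "age": int},  # tell pandas the types up front
)
print(df.dtypes)
```

Columns that are not listed in dtype (birthday here) are still inferred by pandas.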
Problem with corrupt data
However, if your csv file is corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
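One way to find such rows before committing to a strict dtype is to read the column as text and convert it afterwards. This is only a sketch under assumptions: the two-row sample and the variable names are made up for illustration, and pd.to_numeric(..., errors="coerce") is a swapped-in technique that turns unparseable values into NaN rather than raising.

```python
import io

import pandas as pd

CSV_TEXT = """name,age,birthday
Alice,30,1985-01-01
Dennis,40+,None-Ur-Bz
"""

# The strict read fails because "40+" cannot be cast to int.
try:
    pd.read_csv(io.StringIO(CSV_TEXT), dtype={"age": int})
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict read failed:", exc)

# Read everything as text, then convert; bad values become NaN.
raw = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")
bad_rows = raw[raw["age"].isna()]
print(bad_rows)
```

Reading as text costs more memory than a strict dtype, so this is better suited to a sample of the file or to chunked reads than to the full file at once.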
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself
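import numpy as np
import pandas as pd
# Floats kept as strings (object dtype)
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
# 224544 (~224 MB)
# The same values stored as floats
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
# 79560 (~79 MB)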
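If you only want to compare the two frames themselves rather than the whole process, DataFrame.memory_usage(deep=True) reports the bytes held by each column. A minimal sketch, using a smaller shape than above so it runs quickly (the variable names are illustrative, and the exact byte counts depend on your pandas version and platform):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shape = (10_000, 10)

# Same values, once as Python strings and once as float64.
as_strings = pd.DataFrame(
    rng.choice(["1.0", "0.6666667", "150000.1"], shape)
).astype(object)
as_floats = as_strings.astype(float)

# deep=True also counts the string objects themselves, not just the pointers.
string_bytes = int(as_strings.memory_usage(deep=True).sum())
float_bytes = int(as_floats.memory_usage(deep=True).sum())
print("strings:", string_bytes, "bytes")
print("floats: ", float_bytes, "bytes")
```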

I had the same memory problem with a simple read of a tab delimited text file around 1 GB in size (over 5.5 million records) and this solved the memory problem:
df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds
Spyder 3.2.3
Python 2.7.13 64bits

I tried chunksize while reading a big CSV file
reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)
The reader is now an iterator that yields one DataFrame per chunk. We can iterate over it and write/append each chunk to a new csv, or perform any other operation
for chunk in reader:
    print("Chunk -> File process")
    with open(destination, 'a') as f:
        chunk.to_csv(f, header=False, sep='\t', index=False)
    print("Chunk appended to the file")

I use Pandas on my Linux box and faced many memory leaks that only got resolved after upgrading Pandas to the latest version, installed from a clone of the github repository.

I encountered this issue as well when I was running in a virtual machine, or somewhere else where the memory is strictly limited. It has nothing to do with pandas or numpy or csv, but will always happen if you try to use more memory than you are allowed to use, and not only in python.
The only chance you have is what you already tried: try to chop the big thing into smaller pieces which fit into memory.
If you ever asked yourself what MapReduce is all about, you found out by yourself... MapReduce would try to distribute the chunks over many machines; you would process the chunks on one machine, one after another.
What you found out with the concatenation of the chunk files might be an issue indeed; maybe some copies are needed in this operation... In the end this may save you in your current situation, but if your csv gets a little bit larger you might run into that wall again...
It could also be that pandas is so smart that it actually only loads the individual data chunks into memory when you do something with them, like concatenating them into a big df?
Several things you can try:
Don't load all the data at once, but split it into pieces
As far as I know, hdf5 is able to do this chunking automatically and only loads the part your program currently works on
Check whether the types are ok; a string '0.111111' needs more memory than a float
Consider what you actually need: if there is an address stored as a string, you might not need it for numerical analysis...
A database can help with accessing and loading only the parts you actually need (e.g. only the 1% active users)

There is no error for Pandas 0.12.0 and NumPy 1.8.0.
I have managed to create a big DataFrame and save it to a csv file and then successfully read it. Please see the example here. The size of the file is 554 MB (it even worked for a 1.1 GB file, though it took longer; to generate the 1.1 GB file, use a frequency of 30 seconds). Note that I have 4 GB of RAM available.
My suggestion is to try updating Pandas. Another thing that could be useful is to try running your script from the command line, because for R you are not using Visual Studio (this was already suggested in the comments to your question), hence it has more resources available.

Add these:
ratings = pd.read_csv(..., low_memory=False, memory_map=True)
(The original answer showed two screenshots here comparing memory usage with and without these two options; the images are not available.)

Although this is more of a workaround than a fix, I'd try converting that CSV to JSON (should be trivial) and using the read_json method instead. I've been writing and reading sizable JSON/dataframes (100s of MB) in Pandas this way without any problem at all.


Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to a hdf5 file, to make it readable for the vaex library:
import time
import vaex
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
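One detail may help narrow this down: the reported value is exactly 2**64 - 1, which is what -1 looks like when it is printed as an unsigned 64-bit integer. That pattern usually means the underlying write call returned an error code (-1) rather than a real byte count. This is a reading of the message, not something taken from the vaex or HDF5 documentation. The arithmetic can be checked in Python:

```python
import ctypes

# The "bytes actually written" value from the error message.
reported = 18_446_744_073_709_551_615

# -1 reinterpreted as an unsigned 64-bit integer.
minus_one_as_u64 = ctypes.c_uint64(-1).value

print(reported == 2**64 - 1)
print(reported == minus_one_as_u64)
```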
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question of whether there is an alternative way to convert a CSV to HDF5: why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternatively, you can use csv.DictReader(). It reads a line at a time, so you avoid memory issues, but it could be very slow loading the HDF5 file.
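To make the csv.DictReader() idea a bit more concrete, here is a minimal standard-library sketch that yields the file in fixed-size batches, so only one batch is in memory at a time. The function name iter_batches, the default batch size, and the encoding are assumptions for illustration; writing each batch to HDF5 is left out.

```python
import csv
import itertools


def iter_batches(csv_path, batch_size=100_000, encoding="utf-8"):
    """Yield lists of row dicts, at most batch_size rows per list."""
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)  # first line supplies the field names
        while True:
            # islice pulls the next batch_size rows without reading further.
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            yield batch
```

Each batch could then be converted to a NumPy array and appended to a dataset, e.g. `for batch in iter_batches('lax_to_jfk.csv', 50_000): ...`. All values arrive as strings, so type conversion is still up to you.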
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
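import numpy as np
import h5py
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
    # Write the record array as a single dataset (the dataset name is an assumption)
    h5f.create_dataset(csv_name, data=rec_arr)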
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB, and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). It turns out the data isn't as clean: there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution to this is to add the usecols=() parameter. The code below reflects this modification, and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load it incrementally.)
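csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=tuple(range(56)))  # keep only the first 56 fields
with h5py.File(csv_name+'.h5', 'w') as h5f:
    # Write the record array as a single dataset (the dataset name is an assumption)
    h5f.create_dataset(csv_name, data=rec_arr)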
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of a like a database).
So how to get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV...
Another thing I noticed is the encoding: using pure pandas.read_csv failed at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex's CSV readers (since the args are just passed down to pandas).
In fact, if you want to do manually what vaex does automatically for you (given that this CSV file might not be as clean as one would hope), do something like this (it is pseudo code, but I hope close to the real thing)
import pandas as pd
import vaex

# Iterate over the file in chunks (file and dtype are placeholders you need to define)
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be
    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)
    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')

# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better/faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on github), so I will not go into it. If you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
In the last couple of versions of vaex-core, vaex opens all CSV files lazily, so you can just export to hdf5/arrow directly and it will do it in one go. Check the docs for more details.

Scipy IO Loadmat error: ValueError: Mat 4 mopt wrong format

I have seen this question floating around without any definite answers, such as here. I have .mat data converted from a different data structure and am trying to load it in python using scipy.io.loadmat. For some files, this approach works perfectly fine, but for others I get this error:
mat = sio.loadmat(i, verify_compressed_data_integrity=False)
File "/Users/aeglick/opt/anaconda3/lib/python3.8/site-packages/scipy-1.7.1-py3.8-macosx-10.9-x86_64.egg/scipy/io/matlab/", line 226, in loadmat
matfile_dict = MR.get_variables(variable_names)
File "/Users/aeglick/opt/anaconda3/lib/python3.8/site-packages/scipy-1.7.1-py3.8-macosx-10.9-x86_64.egg/scipy/io/matlab/", line 390, in get_variables
hdr, next_position = self.read_var_header()
File "/Users/aeglick/opt/anaconda3/lib/python3.8/site-packages/scipy-1.7.1-py3.8-macosx-10.9-x86_64.egg/scipy/io/matlab/", line 346, in read_var_header
hdr = self._matrix_reader.read_header()
File "/Users/aeglick/opt/anaconda3/lib/python3.8/site-packages/scipy-1.7.1-py3.8-macosx-10.9-x86_64.egg/scipy/io/matlab/", line 108, in read_header
raise ValueError('Mat 4 mopt wrong format, byteswapping problem?')
ValueError: Mat 4 mopt wrong format, byteswapping problem?
I'm not sure what is causing this issue. I save the .mat files the same way every time so they should all be readable. I also tried h5py and get a similar error. Are there any suggestions on how I can read my data files?
You might be saving the .mat files in different "version" formats without realizing it. If your code calls save(...) without explicitly specifying a format, it uses the default version for your Matlab session, which is a persisted per-user preference that you can set inside the Matlab GUI. And if you don't set a default format in Preferences, the default version that save(...) uses varies with the version of Matlab.
The differences between MAT-file versions are significant. In particular, v7.3 completely changed the format to an HDF5-based one (which I don't think scipy.io.loadmat supports).
Check your actual .mat file versions. And if you want your code to be really portable, change the save(...) calls in your code to explicitly specify the MAT-file version using a '-v<whatever>' argument.
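If you don't have Matlab at hand, one way to check the versions from Python is to look at the descriptive text at the start of each file. This is only a sketch: the function name is made up, and it relies on the assumption that v5-family files (saved with -v6 or -v7) begin with the text "MATLAB 5.0 MAT-file" and v7.3 files with "MATLAB 7.3 MAT-file", while v4 files have no text header at all.

```python
def mat_file_flavor(path):
    """Guess the MAT-file flavor from the first bytes of the file."""
    with open(path, "rb") as handle:
        header = handle.read(128)  # the text header occupies the first bytes
    if header.startswith(b"MATLAB 7.3 MAT-file"):
        return "v7.3 (HDF5-based)"
    if header.startswith(b"MATLAB 5.0 MAT-file"):
        return "v5 (used by -v6 and -v7)"
    return "no text header (possibly v4, or not a MAT-file)"
```

Running this over the files that load and the files that fail should show quickly whether they really were all saved the same way.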

Python/Pandas: How can I read 7 million records?

I was given a "database" (more correctly an ugly huge CSV file) that contains the results of a "discovery" process. The rows I get are very short, they are information about licensing on over 65,000 computers, it looks like:
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation
As you can see, it is a semicolon-separated file: it has the time when the process was run, the PC's id, a blank field (I don't know what it is), the program, and the program version. There are more fields, but I don't care about them; only those are relevant.
So I turned to Pandas to do some analysis (basically counting), and got around 3M records. The problem is, this file has over 7M records (I looked at it using Notepad++ 64bit). So, how can I use Pandas to analyze a file with so many records?
I'm using Python 3.5, Pandas 0.19.2
Adding info for Fabio's comment:
I'm using:
df = pd.read_csv("inventario.csv", delimiter=";",
header=None, usecols=[0,1,2,3,4],
To be very precise: the file has 7,432,175 rows, but Pandas is only accessing 3,172,197. Something curious is that if I load the file into Excel 2017 (using a data query) it will load exactly 3,172,197 rows.
After the comments, I checked the file and found that some lines are corrupted (around 450). I don't know if they were signaling an end of file; it doesn't look so. Anyway, I cleaned the malformed lines, and Pandas still read only around 3M lines.
OK, I solved the problem, but really, help me understand what I did wrong. I can't be doing things like I did... First, I cleaned the file for "strange" lines, they were around 500 of them, and then I saved the file to inv.csv
Then I did the following:
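import pandas as pd
# Read every line of the file as one raw string
with open("inv.csv", "r", encoding="latin1") as f_inventario:
    lines = f_inventario.readlines()
df = pd.DataFrame(lines)
df.columns = ['data']
# Split each raw line on ';' and pick the fields by position
df['fecha'] = df['data'].apply(lambda s : s.split(';')[0])
df['equipo'] = df['data'].apply(lambda s : s.split(';')[1])
df['software'] = df['data'].apply(lambda s : s.split(';')[2])
df['version'] = df['data'].apply(lambda s : s.split(';')[3][:-1])
df.drop(['data'], axis=1, inplace=True)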
And now I got my dataframe with the 7M rows. If I did a df=pd.read_csv('inv.csv' ... ) it would only read about 3M records.
I got my problem solved, but this is terrible, this is not how it should be. As I see it is not a memory problem. Could it be some global variable that tells read_csv to load up to a maximum??? I really don't know.
If performance is not an issue, a trivial approach would be to simply read the file line by line into a buffer. Analyze the data in the buffer once the buffer is full. Continue this iteratively until you have processed the entire file. Once that's done you can then aggregate the results from each chunk to form your final result. To speed things up, you can look into something like memory mapping, for example:
import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())
    mm.close()
see this thread
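The buffer-and-aggregate approach described above can be sketched with the standard library alone. This is an illustration under assumptions: the function name, the buffer size, and the choice to count one field per line are made up here, while the ';' separator and latin1 encoding follow the question.

```python
import collections
import itertools


def count_field(path, field_index, sep=";", buffer_lines=100_000, encoding="latin1"):
    """Count the values of one field, reading the file in fixed-size buffers."""
    totals = collections.Counter()
    with open(path, "r", encoding=encoding) as handle:
        while True:
            # Fill the buffer with at most buffer_lines lines.
            buffer = list(itertools.islice(handle, buffer_lines))
            if not buffer:
                break
            # Analyze this buffer on its own...
            partial = collections.Counter()
            for line in buffer:
                fields = line.rstrip("\n").split(sep)
                if len(fields) > field_index:  # skip malformed short lines
                    partial[fields[field_index]] += 1
            # ...then fold the partial result into the running total.
            totals.update(partial)
    return totals
```

For the sample rows in the question, `count_field("inv.csv", field_index=3)` would count how many times each program name appears, without ever holding more than one buffer of lines in memory.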

Read matlab file (*.mat) from zipped file without extracting to directory in Python

This specific questions stems from the attempt to handle large data sets produced by a MATLAB algorithm so that I can process them with python algorithms.
Background: I have large arrays in MATLAB (typically 20x20x40x15000 [i,j,k,frame]) and I want to use them in python. So I save the array to a *.mat file and use scipy to read the *.mat file into a numpy array. However, a problem arises in that if I try to load the entire *.mat file in python, a memory error occurs. To get around this, I slice the *.mat file into pieces, so that I can load the pieces one at a time into a python array. If I divide up the *.mat by frame, I now have 15,000 *.mat files, which quickly becomes a pain to work with (at least in windows). So my solution is to use zipped files.
Question: Can I use scipy to directly read a *.mat file from a zipped file without first unzipping the file to the current working directory?
Specs: Python 2.7, windows xp
Current code:
import zipfile
import numpy as np
def readZip(zfilename,dim,frames):
    zfile = zipfile.ZipFile( zfilename, "r" )
    for info in zfile.infolist():
        fname = info.filename
    return data
Tried code:
produces this error:
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
produces this error:
UnsupportedOperation: seek
Any other suggestions on handling the data are appreciated.
I am pretty sure that the answer to my question is NO and there are better ways to accomplish what I am trying to do.
Regardless, with the suggestion from J.F. Sebastian, I have devised a solution.
Solution: Save the data in MATLAB in the HDF5 format, namely hdf5write(fname, '/data', data_variable). This produces a *.h5 file which then can be read into python via h5py.
python code:
import h5py
r = h5py.File(fname, 'r+')
data = r['data']
I can now index directly into the data, however it stays on the hard drive.
print data[:,:,:,1]
Or I can load it into memory.
data_mem = data[:]
However, this once again gives memory errors. So, to get it into memory I can loop through each frame and add it to a numpy array.
h5py FTW!
In one of my frozen applications we bundle some files into the .bin file that py2exe creates, then pull them out like this:
z = zipfile.ZipFile(os.path.join(myDir, 'common.bin'))
data = z.read('schema-new.sql')
I am not certain if that would feed your .mat files into scipy, but I'd consider it worth a try.
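To make the idea above concrete: the bytes of one archive member can be wrapped in an io.BytesIO, which gives a seekable file-like object without writing anything to disk (the lack of seek support on the zip stream is what the UnsupportedOperation error in the question points at). A minimal standard-library sketch; the function name is made up for illustration:

```python
import io
import zipfile


def open_member_in_memory(zip_path, member_name):
    """Return a seekable in-memory buffer holding one archive member."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        # read() decompresses the member into RAM; nothing is extracted to disk.
        raw_bytes = archive.read(member_name)
    return io.BytesIO(raw_bytes)
```

scipy.io.loadmat is documented to accept an open file-like object, so passing it the returned buffer would be the natural next step, e.g. `scipy.io.loadmat(open_member_in_memory('frames.zip', 'frame_0001.mat'))`. The archive and member names there are placeholders, and this is untested against the asker's data.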

Python out of memory on large CSV file (numpy)

I have a 3GB CSV file that I try to read with python, I need the median column wise.
from numpy import *
def data():
    return genfromtxt('All.csv',delimiter=',')
data = data() # This is where it fails already.
med = zeros(len(data[0]))
data = data.T
for i in xrange(len(data)):
    m = median(data[i])
    med[i] = 1.0/float(m)
print med
The error that I get is this:
Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "", line 40, in <module>
data = data()
File "", line 39, in data
return genfromtxt('All.csv',delimiter=',')
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-
packages/numpy/lib/", line 1495, in genfromtxt
for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
I think it's just an out of memory error. I am running a 64bit MacOSX with 4GB of ram and both numpy and Python compiled in 64bit mode.
How do I fix this? Should I try a distributed approach, just for the memory management?
EDIT: Also tried with this but no luck...
genfromtxt('All.csv',delimiter=',', dtype=float16)
As other folks have mentioned, for a really large file, you're better off iterating.
However, you do commonly want the entire thing in memory for various reasons.
genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).
If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.
If you have enough ram, consider using np.loadtxt('yourfile.txt', delimiter=',') (You may also need to specify skiprows if you have a header on the file.)
As a quick comparison, loading ~500MB text file with loadtxt uses ~900MB of ram at peak usage, while loading the same file with genfromtxt uses ~2.5GB.
Alternately, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking. If your data is very simple and regular, you can improve on them greatly.)
import numpy as np
def generate_text_file(length=int(1e6), ncols=20):
    # Write a large file of random floats to test with
    data = np.random.random((length, ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')
def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            # Skip header rows, if any
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember the row length so the flat array can be reshaped
        iter_loadtxt.rowlength = len(line)
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data
# generate_text_file()  # uncomment to create the test file first
data = iter_loadtxt('large_text_file.csv')
The problem with using genfromtxt() is that it attempts to load the whole file into memory, i.e. into a numpy array. This is great for small files but BAD for 3GB inputs like yours. Since you are just calculating column medians, there's no need to read the whole file. A simple, but not the most efficient way to do it would be to read the whole file line-by-line multiple times and iterate over the columns.
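A minimal sketch of that "one pass per column" idea, using only the standard library. It assumes the file has no header row and that every field is numeric; the function name is made up for illustration. Only one column is held in memory at a time, at the cost of re-reading the file once per column.

```python
import csv
import statistics


def column_medians(csv_path, delimiter=","):
    """Compute the median of each column, one full pass over the file per column."""
    # Peek at the first row to learn how many columns there are.
    with open(csv_path, newline="") as handle:
        first_row = next(csv.reader(handle, delimiter=delimiter))

    medians = []
    for col in range(len(first_row)):
        with open(csv_path, newline="") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            # Keep just this column; empty rows are skipped.
            values = [float(row[col]) for row in reader if row]
        medians.append(statistics.median(values))
    return medians
```

For a 3GB file with many columns this will be slow, which is the trade-off mentioned above; storing each column in a compact array (for example array.array('d')) instead of a list would reduce the memory needed per pass further.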
Why are you not using the python csv module?
>>> import csv
>>> reader = csv.reader(open('All.csv'))
>>> for row in reader:
...     print row

