Python/Pandas: How can I read 7 million records? - python

I was given a "database" (more correctly an ugly huge CSV file) that contains the results of a "discovery" process. The rows I get are very short, they are information about licensing on over 65,000 computers, it looks like:
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation
As you see is a semicolon separated file, it has the time when the process was run, the PC's id, a blank (I don't know what it is), the program, and version program, there are more fields, but I don't care about them, only those ones are relevant.
So I turn to Pandas to do some analysis (basically counting), and got around 3M records. Problem is, this file is over 7M records (I looked at it using Notepad++ 64bit). So, how can I use Pandas to analyze a file with so many records?
I'm using Python 3.5, Pandas 0.19.2
Adding info for Fabio's comment:
I'm using:
df = pd.read_csv("inventario.csv", delimiter=";",
header=None, usecols=[0,1,2,3,4],
encoding="latin_1")
To be very precise: the file is 7'432,175 rows, Pandas is only accessing 3'172,197. Something curious is that if I load the file into Excel 2017 (using a data query) it will load exactly 3'172,197 rows.
EDIT:
After the comments, I checked the file and found some lines are corrupted (around 450), I don't know if they were signaling and end of file, it doesn't look so, anyway, I cleaned the wrong-formed lines, and still Pandas read only around 3M lines.
EDIT:
OK, I solved the problem, but really, help me understand what I did wrong. I can't be doing things like I did... First, I cleaned the file for "strange" lines, they were around 500 of them, and then I saved the file to inv.csv
Then I did the following:
f_inventario = open("inv.csv", "r", encoding="latin1")
f_inventario.readlines()
f_inventario.close()
df = pd.DataFrame(lines)
df.columns = ['data']
df['fecha'] = df.data.apply(lambda s : s.split(';')[0])
df['equipo'] = df.data.apply(lambda s : s.split(';')[1])
df['software'] = df.data.apply(lambda s : s.split(';')[2])
df['version'] = df.data.apply(lambda s : s.split(';')[3][:-1])
df.drop(['data'], axis=1, inplace=True)
And now I got my dataframe with the 7M rows. If I did a df=pd.read_csv('inv.csv' ... ) it would only read about 3M records.
I got my problem solved, but this is terrible, this is not how it should be. As I see it is not a memory problem. Could it be some global variable that tells read_csv to load up to a maximum??? I really don't know.

If performance is not an issue, a trivial approach would be to simply read the file line by line in to a buffer. Analyze the data in the buffer once the buffer is full. Continue this iteratively until you have processed the entire file. Once that's done you can then aggregate the results from each chunk to form your final result. To speed things up, you can look in to something like memory mapping, something like
import mmap
with open("hello.txt", "r+") as f:
# memory-map the file, size 0 means whole file
map = mmap.mmap(f.fileno(), 0)
# read content via standard file methods
print(map.readline())
see this thread

Related

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex libary.
I tried to convert the CSV to a hdf5 file, to make it readable for the vaex libary:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question, is there an alternative way to convert a csv to hdf5?, why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading a 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately you can use csv.DictReader(). It reads a line at a time. So, you avoid memory issues, but it could be very slow loading the HDF5 file.
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB, and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). Turns out the data isn't as clean; there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have the values for first 56 fields. In other words, fields 57-111 are not very useful. One solution to this is to add the usecols=() parameter. Code below reflects this modification, and works with this test file. (I have not tried testing with your large file airline.csv. Given it's size likely you will need to read and load incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes') #,
usecols=(i for i in range(56)) )
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of a like a database).
So how to go around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has like 100+ columns and that's annoying.. but that is also kind of the price to pay when using a format such as CSV...
Another thing i noticed is the encoding.. using pure pandas.read_csv failed at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
# Assert or check or do whatever needs doing to ensure column types are as they should be
# Pass the data to vaex (this does not take extra RAM):
df_vaex = vaex.from_pandas(df_tmp)
# Export this chunk into HDF5
# df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen potentially much better/faster way of doing this with vaex, but it is not released yet (i saw it in the code repo on github), so I will not go into it, but if you can install from source, and want me to elaborate further feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so then just export to hdf5/arrow directly, it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

Exaggerated calculation times with pandas and csv

I have a 3 column CSV file where I perform a simple calculation with python and pandas.
The file is very large, just under 4Gb, after the calculation about 1.9Gb
the CSV file is:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum. If column A is identical, then add B and rewrite the CSV.
Example result :
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd
#Read csv
df = pd.read_csv('data.csv', sep=',' , engine='python')
# Groupby and sum
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()
# Save in new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution?
It currently takes about 7 hours on a vps to complete the calculation
add info
The RAM resources are almost always 100% (8Gb), while the choice of the engine = 'python' is because I used a code already present on https://stackoverflow.com/, and honestly I don't know the usefulness or not of that command, but I have seen that the calculation works correctly.
Data3 is actually useless to me (right now, probably useful in the future).
There's an alternative option - use convtools for this. It is a pure python library which generates pure python code to build ad hoc converters. Of course bare python cannot beat pandas in terms of speed, but at least it doesn't need any wrappers and it works just like you'd implement everything by hand.
So, normally the following would work for you:
from convtools import conversion as c
from convtools.contrib.tables import Table
# you can store the converter somewhere for further reuse
converter = (
c.group_by(c.item("data1"))
.aggregate({
"data1": c.item("data1"),
"data2": c.ReduceFuncs.Sum(c.item("data2"))
})
.gen_converter()
)
# this is an iterable (stream of rows), not the list
rows = Table.from_csv("tmp4.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
JFYI: If you run the script manually, then you can monitor the speed using e.g. tqdm, just wrap an iterable you are consuming with it:
from tqdm import tqdm
# same code as above, except for the last line:
Table.from_rows(converter(tqdm(rows))).into_csv("out.csv")
HOWEVER:
the solution above doesn't require an input file to fit into memory, but the result should. In your case, if the result is 1.9GB csv file, it is unlikely to fit corresponding python objects into 8GB of RAM.
Then you may need to:
remove the header: tail -n +2 raw_file.csv > raw_file_no_header.csv
pre-sort the file sort raw_file_no_header.csv > sorted_file.csv
a then:
from convtools import conversion as c
from convtools.contrib.tables import Table
converter = (
c.chunk_by(c.item("data1"))
.aggregate(
{
"data1": c.ReduceFuncs.First(c.item("data1")),
"data2": c.ReduceFuncs.Sum(c.item("data2")),
}
)
.gen_converter()
)
rows = Table.from_csv("sorted_file.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
This only requires a single group to fit into memory.
Remove the engine='python', it does no good.
Get more RAM, 8GB is not enough, you should never hit 100% (this is what slows you down)
(it is too late now), but don't use .csv files for large datasets. Look into feather or parquet.
If you can't get more RAM, then maybe #Afaq will elaborate on the file splitting approach. The problem I see there, is that you are not reducing your dataset much, so map reduce may choke on the reduce part, unless you split your file in such a way, that same data1 strings would always go into the same file.

Parsing two files with Python

I'm still new to python and cannot achieve to make what i'm looking for. I'm using Python 3.7.0
I have one file, called log.csv, containing a log of CANbus messages.
I want to check what is the content of column label Data2 and Data3 when the ID is 348 in column label ID.
If they are both different from "00", I want to make a new string called fault_code with the "Data3+Data2".
Then I want to check on another CSV file where this code string appear, and print the column 6 of this row (label description). But this last part I want to do it only one time per fault_code.
Here is my code:
import csv
CAN_ID = "348"
with open('0.csv') as log:
reader = csv.reader(log,delimiter=',')
for log_row in reader:
if log_row[1] == CAN_ID:
if (log_row[5]+log_row[4]) != "0000":
fault_code = log_row[5]+log_row[4]
with open('Fault_codes.csv') as fault:
readerFC = csv.reader(fault,delimiter=';')
for fault_row in readerFC:
if "0x"+fault_code in readerFC:
print("{fault_row[6]}")
Here is a part of the log.csv file
Timestamp,ID,Data0,Data1,Data2,Data3,Data4,Data5,Data6,Data7,
396774,313,0F,00,28,0A,00,00,C2,FF
396774,314,00,00,06,02,10,00,D8,00
396775,**348**,2C,00,**00,00**,FF,7F,E6,02
and this is a part of faultcode.csv
Level;LED Flashes;UID;FID;Type;Display;Message;Description;RecommendedAction
1;2;1;**0x4481**;Warning;F12001;Handbrake Fault;Handbrake is active;Release handbrake
1;5;1;**0x4541**;Warning;F15001;Fan Fault;blablabla;blablalba
1;5;2;**0x4542**;Warning;F15002;blablabla
Also do you think of a better way to do this task? I've read that Pandas can be very good for large files. As log.csv can have 100'000+ row, it's maybe a better idea to use it. What do you think?
Thank you for your help!
Be careful with your indentation, you get this error because you sometimes you use spaces and other tabs to indent.
As PM 2Ring said, reading 'Fault_codes.csv' everytime you read 1 line of your log is really not efficient.
You should read faultcode once and store the content in RAM (if it fits). You can use pandas to do it, and store the content into a DataFrame. I would do that before reading your logs.
You do not need to store all log.csv lines in RAM. So I'd keep reading it line by line with csv module, do my stuff, write to a new file, and read the next line. No need to use pandas here as it will fill your RAM for nothing.

Building on "How to read and write a table / matrix to file with python?"

Back in Feb 8 '13 at 20:20, YamSMit asked a question (see: How to read and write a table / matrix to file with python?) similar to what I am struggling with: starting out with an Excel table (CSV) that has 3 columns and a varying number of rows. The contents of the columns are string, floating point, and string. The first string will vary in length, while the other string can be fixed (eg, 2 characters). The table needs to go into a 2 dimensional array, so that I can do manipulations on the data to produce a final file (which will be a text file). I have experimented with a variety of strategies presented in stackoverflow, but I am always missing something, and I haven't seen an example with all the parts, which is the reason for the struggle to figure this out.
Sample data will be similar to:
Ray Smith, 41645.87778, V1
I have read and explored numpy and astropy since the available documentation says they make this type of code easy. I have tried import csv. Somehow, the code doesn't come together. I should add that I am writing in Python 3.2.3 (which seems to be a mistake since a lot of documentation is for Python 2.x).
I realize the basic nature of this question directs me to read more tutorials. I have been reading many, yet the tutorials always refer to enough that is different, that I fail to assemble the right pieces: read the table file, write into a 2D array, then... do more stuff.
I am grateful to anyone who might provide me with a workable outline of the code, or pointing me to specific documentation I should read to handle the specific nature of the code I am trying to write.
Many thanks in advance. (Sorry for the wordiness - just trying to be complete.)
I am more familiar with 2.x, but from the 3.3 csv documentation found here, it seems to be mostly the same as 2.x. The following function will read a csv file, and return a 2D array of the rows found in the file.
import csv
def read_csv(file_name):
array_2D = []
with open(file_name, 'rb') as csvfile:
read = csv.reader(csvfile, delimiter=';') #Assuming your csv file has been set up with the ';' delimiter - there are other options, for which you should see the first link.
for row in read:
array_2D.append(row)
return array_2D
You would then be able to manipulate the data as follows (assuming your csv file is called 'foo.csv' and the desired text file is 'foo.txt'):
data = read_csv('foo.csv')
with open('foo.txt') as textwrite:
for row in data:
string = '{0} has {1} apples in his Ford {2}.\n'.format(row[0], row[1], row[2])
textwrite.write(string)
#if you know the second column is a float:
manipulate = float(row[1])*3
textwrite.write(manipulate)
string would then be written to 'foo.txt' as:
Ray Smith has 41645.87778 apples in his Ford V1.\n
and maniuplate would be written to 'foo.txt' as:
124937.63334

Memory error when using pandas read_csv

I am trying to do something fairly simple, reading a large csv file into a pandas dataframe.
data = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
The code either fails with a MemoryError, or just never finishes.
Mem usage in the task manager stopped at 506 Mb and after 5 minutes of no change and no CPU activity in the process I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 Mb, the code above works if I cut the file down to something short (25 Mb).
It has also happened that I get a pop up telling me that it can't write to address 0x1e0baf93...
Stacktrace:
Traceback (most recent call last):
File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in
<module>
wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2
)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 216, in _read
return parser.read()
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py"
, line 643, in read
df = DataFrame(col_dict, columns=columns, index=index)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 394, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 525, in _init_dict
dtype=dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py"
, line 5338, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1820, in create_block_manager_from_arrays
blocks = form_blocks(arrays, names, axes)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1872, in form_blocks
float_blocks = _multi_blockify(float_items, items)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1930, in _multi_blockify
block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals
.py", line 1962, in _stack_arrays
stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
A bit of background - I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR,config[i,]$TOEXTRACT,sep=""), HASHEADER, DELIMITER,skip=2,fill=TRUE)
R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size I might be fighting a loosing battle...
Windows memory limitation
Memory errors happens a lot with python when using the 32bit version in Windows. This is because 32bit processes only gets 2GB of memory to play with by default.
Tricks for lowering memory usage
If you are not using 32bit python in windows but are looking to improve on your memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Example
Let's say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it's just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Solution
By specifying dtype={'age':int} as an option to the .read_csv() will let pandas know that age should be interpreted as a number. This saves you lots of memory.
Problem with corrupt data
However, if your csv file would be corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself
df = pd.DataFrame(pd.np.random.choice(['1.0', '0.6666667', '150000.1'],(100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)
df = pd.DataFrame(pd.np.random.choice([1.0, 0.6666667, 150000.1],(100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
I had the same memory problem with a simple read of a tab delimited text file around 1 GB in size (over 5.5 million records) and this solved the memory problem:
df = pd.read_csv(myfile,sep='\t') # didn't work, memory error
df = pd.read_csv(myfile,sep='\t',low_memory=False) # worked fine and in less than 30 seconds
Spyder 3.2.3
Python 2.7.13 64bits
I tried chunksize while reading big CSV file
reader = pd.read_csv(filePath,chunksize=1000000,low_memory=False,header=0)
The read is now the list. We can iterate the reader and write/append to the new csv or can perform any operation
for chunk in reader:
print(newChunk.columns)
print("Chunk -> File process")
with open(destination, 'a') as f:
newChunk.to_csv(f, header=False,sep='\t',index=False)
print("Chunk appended to the file")
I use Pandas on my Linux box and faced many memory leaks that only got resolved after upgrading Pandas to the latest version after cloning it from github.
I encountered this issue as well when I was running in a virtual machine, or somewere else where the memory is stricktly limited. It has nothing to do with pandas or numpy or csv, but will always happen if you try using more memory as you are alowed to use, not even only in python.
The only chance you have is what you already tried, try to chomp down the big thing into smaller pieces which fit into memory.
If you ever asked yourself what MapReduce is all about, you found out by yourself...MapReduce would try to distribute the chunks over many machines, you would try to process the chunke on one machine one after another.
What you found out with the concatenation of the chunk files might be an issue indeed, maybe there are some copy needed in this operation...but in the end this maybe saves you in your current situation but if your csv gets a little bit larger you might run against that wall again...
It also could be, that pandas is so smart, that it actually only loads the individual data chunks into memory if you do something with it, like concatenating to a big df?
Several things you can try:
Don't load all the data at once, but split in in pieces
As far as I know, hdf5 is able to do these chunks automatically and only loads the part your program currently works on
Look if the types are ok, a string '0.111111' needs more memory than a float
What do you need actually, if there is the adress as a string, you might not need it for numerical analysis...
A database can help acessing and loading only the parts you actually need (e.g. only the 1% active users)
There is no error for Pandas 0.12.0 and NumPy 1.8.0.
I have managed to create a big DataFrame and save it to a csv file and then successfully read it. Please see the example here. The size of the file is 554 Mb (It even worked for 1.1 Gb file, took longer, to generate 1.1Gb file use frequency of 30 seconds). Though I have 4Gb of RAM available.
My suggestion is try updating Pandas. Other thing that could be useful is try running your script from command line, because for R you are not using Visual Studio (this already was suggested in the comments to your question), hence it has more resources available.
Add these:
ratings = pd.read_csv(..., low_memory=False, memory_map=True)
My memory with these two:
#319.082.496
Without these two:
#349.110.272
Although this is a workaround not so much as a fix, I'd try converting that CSV to JSON (should be trivial) and using read_json method instead - I've been writing and reading sizable JSON/dataframes (100s of MB) in Pandas this way without any problem at all.

Categories

Resources