Excessively long calculation times with pandas and CSV - python

I have a 3-column CSV file on which I perform a simple calculation with Python and pandas.
The file is very large, just under 4 GB; after the calculation the result is about 1.9 GB.
The CSV file is:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum: whenever data1 is identical, sum data2 and rewrite the CSV.
Example result:
data1,data2
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd
#Read csv
df = pd.read_csv('data.csv', sep=',', engine='python')
# Groupby and sum
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()
# Save in new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution?
It currently takes about 7 hours on a VPS to complete the calculation.
Additional info:
RAM usage is almost always at 100% (8 GB). I chose engine='python' because I reused code already posted on https://stackoverflow.com/; honestly, I don't know whether that option is useful or not, but I have verified that the calculation works correctly.
data3 is actually useless to me right now (it will probably be useful in the future).
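Since RAM is the bottleneck and data3 is not needed, a chunked variant of the same groupby-sum keeps memory low by reading only the two relevant columns and combining partial sums. This is only a sketch based on the sample data above; the chunk size and dtypes are assumptions.
import pandas as pd

# Read only data1 and data2, in chunks, so the full 4 GB file is never in memory at once
chunks = pd.read_csv(
    'data.csv',
    usecols=['data1', 'data2'],
    dtype={'data1': str, 'data2': 'int64'},
    chunksize=1_000_000,
)
# Sum data2 within each chunk, then combine the partial sums per data1 value
partial = [chunk.groupby('data1')['data2'].sum() for chunk in chunks]
result = pd.concat(partial).groupby(level=0).sum().reset_index()
result.to_csv('data2.csv', index=False)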

There's an alternative option: use convtools for this. It is a pure Python library which generates pure Python code to build ad hoc converters. Of course bare Python cannot beat pandas in terms of speed, but at least it doesn't need any wrappers and it works just like you'd implement everything by hand.
So, normally the following would work for you:
from convtools import conversion as c
from convtools.contrib.tables import Table
# you can store the converter somewhere for further reuse
converter = (
    c.group_by(c.item("data1"))
    .aggregate({
        "data1": c.item("data1"),
        "data2": c.ReduceFuncs.Sum(c.item("data2"))
    })
    .gen_converter()
)
# this is an iterable (a stream of rows), not a list
rows = Table.from_csv("tmp4.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
JFYI: if you run the script manually, you can monitor the speed using e.g. tqdm; just wrap the iterable you are consuming with it:
from tqdm import tqdm
# same code as above, except for the last line:
Table.from_rows(converter(tqdm(rows))).into_csv("out.csv")
HOWEVER:
the solution above doesn't require the input file to fit into memory, but the result should. In your case, if the result is a 1.9 GB CSV file, the corresponding Python objects are unlikely to fit into 8 GB of RAM.
Then you may need to:
remove the header: tail -n +2 raw_file.csv > raw_file_no_header.csv
pre-sort the file: sort raw_file_no_header.csv > sorted_file.csv
and then:
from convtools import conversion as c
from convtools.contrib.tables import Table
converter = (
    c.chunk_by(c.item("data1"))
    .aggregate(
        {
            "data1": c.ReduceFuncs.First(c.item("data1")),
            "data2": c.ReduceFuncs.Sum(c.item("data2")),
        }
    )
    .gen_converter()
)
rows = Table.from_csv("sorted_file.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
This only requires a single group to fit into memory.

Remove engine='python'; it does no good.
Get more RAM: 8 GB is not enough, and you should never hit 100% usage (this is what slows you down).
It is too late now, but don't use .csv files for large datasets; look into Feather or Parquet (a short sketch follows this list).
If you can't get more RAM, then maybe #Afaq will elaborate on the file-splitting approach. The problem I see there is that you are not reducing your dataset much, so map-reduce may choke on the reduce part, unless you split your file in such a way that the same data1 strings would always go into the same file.
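As a rough illustration of the Parquet suggestion above (the dtypes are assumptions based on the sample data, and to_parquet needs pyarrow or fastparquet installed):
import pandas as pd

# One-off conversion: parse the CSV once and store it as typed, compressed Parquet
# (if the raw CSV is too big to parse in one go, do this conversion in chunks)
df = pd.read_csv('data.csv', dtype={'data1': str, 'data2': 'int64', 'data3': 'int64'})
df.to_parquet('data.parquet', index=False)

# Later runs read only the needed columns from the Parquet file instead of re-parsing the CSV
df = pd.read_parquet('data.parquet', columns=['data1', 'data2'])
result = df.groupby('data1', as_index=False)['data2'].sum()
result.to_csv('data2.csv', index=False)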


Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to an HDF5 file to make it readable for the vaex library:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to HDF5 without Python? For example, downloadable software which can do this job?
I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes actually written" as a huge number (18_446_744_073_709_551_615), which is 2^64 - 1, i.e. -1 stored as an unsigned 64-bit integer, and far larger than the 84 GB CSV, so the write clearly failed. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question ("is there an alternative way to convert a CSV to HDF5?"): why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV (you can also use loadtxt() to read the CSV if you don't have missing values and don't need the field names). However, I think you will run into memory problems reading an 84 GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately, you can use csv.DictReader(). It reads a line at a time, so you avoid memory issues, but it could be very slow loading the HDF5 file.
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads it into an HDF5 file.
import numpy as np
import h5py

csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows; the problem is the columns (field names). Turns out the data isn't as clean: there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have values for the first 56 fields. In other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load incrementally; a rough sketch of that follows the code below.)
import numpy as np
import h5py

csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=tuple(range(56)))
with h5py.File(csv_name+'.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
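Since the 84 GB file will not fit in memory, here is one possible sketch of the incremental approach mentioned above: feed fixed-size batches of lines to genfromtxt and append them to a resizable HDF5 dataset. The chunk size, the column selection, the numeric-float dtype, and the encoding (taken from the vaex answer later in this thread) are all assumptions to adapt to the real schema.
import itertools
import numpy as np
import h5py

csv_name = 'airline'
chunk_rows = 1_000_000          # illustrative chunk size
cols = tuple(range(56))         # keep only the first 56 fields, as above

with open(csv_name + '.csv', 'r', encoding='ISO-8859-1') as f, \
     h5py.File(csv_name + '.h5', 'w') as h5f:
    next(f)                     # skip the header line
    dset = None
    while True:
        lines = list(itertools.islice(f, chunk_rows))   # next batch of raw lines
        if not lines:
            break
        chunk = np.atleast_2d(np.genfromtxt(lines, delimiter=',',
                                            dtype=np.float64, usecols=cols))
        if dset is None:
            # create a dataset that can grow along the first axis
            dset = h5f.create_dataset(csv_name, data=chunk,
                                      maxshape=(None, chunk.shape[1]))
        else:
            dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
            dset[-chunk.shape[0]:] = chunk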
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv) casts those columns as dtype object.
Vaex does not really support such mixed dtypes and requires each column to be of a single uniform type (kind of like a database).
So how to get around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV.
Another thing I noticed is the encoding: using pure pandas.read_csv failed at some point because of encoding and required adding encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be

    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)

    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')

# When the above loop finishes, concat and export the data to a single file if needed
# (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better/faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it; but if you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In the last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so you can just export to HDF5/Arrow directly and it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats
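A minimal sketch of that newer, lazy path (assuming the same file paths as in the question):
import vaex

# vaex streams the CSV lazily and writes the HDF5 file in one pass
df = vaex.open(r"D:\airline.csv")
df.export_hdf5(r"D:\airline.hdf5", progress='rich')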

What is the recommended way to access data from R data.table in python? Can I avoid writing data to disc?

Is there some recommended way to pass data from R (in the form of a data.table) to Python without having to save the data to disc? I know I could use Python modules from R via reticulate (and I suppose the same thing can be done the other way around using rpy2), but from what I've read that hurts the overall performance of the libraries. So there is quite a big chance that it's better to write my R data.table to disc, read that same data from disc in Python, and run, say, lightgbm there, than to try to run lightgbm through reticulate or data.table through rpy2.
Why don't I just stick to either R or Python:
I prefer using R data.table (as opposed to pandas) for my data manipulations because it is way faster, more memory efficient, and has a lot of features which I like, such as non-equi joins, rolling joins, cartesian joins, and pretty straightforward melting and casting. I also like that whenever I ask a data.table-related question on Stack Overflow, I get a high-quality answer pretty fast, while for pandas I haven't been so successful. However, there are tasks for which I prefer Python, such as gradient boosting or neural networks.
There is no recommended way.
In theory you have to dump the R data.frame to disk and read it in Python.
In practice (assuming a production-grade operating system), you can use the "RAM disk" location /dev/shm/, so you essentially write data to a file that resides in RAM and then read it from RAM directly, without the need to dump data to disk.
Example usage:
fwrite(iris, "/dev/shm/data.csv")
d = fread("/dev/shm/data.csv")
unlink("/dev/shm/data.csv")
As for the format, you have the following options:
csv - universal and portable format
data.table's fwrite function is super fast and produces a portable CSV data file. Be sure to enable all CPU threads with setDTthreads(0L) before using fwrite on a multi-core machine.
Then in Python you need to read the CSV file, for which the python datatable module will be very fast; if needed, the object can then be converted to pandas using x.to_pandas().
feather - "portable" binary format
Another option is to use R's arrow package and function write_feather, and then read data in python using pyarrow module and read_feather.
This format should be faster than CSV in most cases; see the timings below. When writing data the difference might not be that big, but reading data will be much faster in most cases, especially when it comes to reading many character variables in R (although that is not your use case because you read in Python). On the other hand it is not really portable yet (see apache/arrow#8732). Moreover, if a new version 3 is eventually released, files saved with the current feather format might not be compatible anymore.
fst - fast binary format
fst can be used as a faster alternative to the feather format, but it is not yet possible to read fst data in Python, so this method cannot be applied to solve your problem as of now. You can track the progress of this feature request at https://github.com/fstpackage/fst/issues/184; when the issue is resolved, it will probably address your question in the fastest manner.
Using the following scripts:
library(data.table)
setDTthreads(0L) ## 40
N = 1e8L
x = setDT(lapply(1:10, function(...) sample.int(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e8L
x = setDT(lapply(1:10, function(...) runif(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e7L
x = setDT(lapply(1:10, function(...) paste0("id",sample.int(N))))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
import datatable as dt
import timeit
import gc
from pyarrow import feather
gc.collect()
t_start = timeit.default_timer()
x = dt.fread("/dev/shm/data.csv")
print(timeit.default_timer() - t_start, flush=True)
gc.collect()
t_start = timeit.default_timer()
y = x.to_pandas()
print(timeit.default_timer() - t_start, flush=True)
del x, y
gc.collect()
t_start = timeit.default_timer()
x = feather.read_feather("/dev/shm/data.feather", memory_map=False)
print(timeit.default_timer() - t_start, flush=True)
del x
I got the following timings:
integer:
write: feather 2.7s vs csv 5.7s
read: feather 2.8s vs csv 111s+3s
double:
write: feather 5.7s vs csv 10.8s
read: feather 5.1s vs csv 180s+4.9s
character:
write: feather 50.2s vs csv 2.8s
read: feather 35s vs csv 14s+16s
Based on the presented data cases (1e8 rows for int/double, 1e7 rows for character; 10 columns: int/double/character) we can conclude the following:
writing int and double is around 2 times slower for csv than feather
writing character is around 20 times faster for csv than feather
reading int and double is much slower for csv than feather
conversion int and double from python datatable to pandas is relatively cheap
reading character is around 2 times faster for csv than feather
conversion character from python datatable to pandas is expensive
Note that these are very basic data cases, be sure to check timings on your actual data.

Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file I have looks like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
import os
import sys
import numpy as np

path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]

files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
    trans_import.append(np.loadtxt(path + '/' + item, dtype=float,
                                   skiprows=4, usecols=(0, 1)))
The resulting array looks like this in the variable explorer:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text file (one spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
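A minimal sketch of the .npy round trip mentioned above, reusing the trans_import list from the question (the file name is a placeholder):
import numpy as np

# One-time conversion: every spectrum has 841 rows and 2 columns, so they stack cleanly
all_spectra = np.stack(trans_import)            # shape (n_files, 841, 2)
np.save('all_spectra.npy', all_spectra)

# Subsequent runs load the binary file near-instantly instead of re-parsing ~4k text files
all_spectra = np.load('all_spectra.npy')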
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
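A rough sketch of the multiprocessing idea (directory name as in the question; for I/O-bound reads a thread pool may work just as well):
import os
import numpy as np
from multiprocessing import Pool

def load_one(file_path):
    # same parsing as in the question: skip the 4 header lines, keep 2 columns
    return np.loadtxt(file_path, dtype=float, skiprows=4, usecols=(0, 1))

if __name__ == '__main__':
    path = 'T2'
    files = [os.path.join(path, f) for f in os.listdir(path)]
    with Pool() as pool:                 # one worker per CPU core by default
        arrays = pool.map(load_one, files)
    data = np.stack(arrays)              # concatenate the matrices at the end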
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Use pandas.read_csv (with C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
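A hedged sketch of what that drop-in replacement could look like for the files shown above (whitespace separator, 4 header lines as in the question's np.loadtxt call, blank lines dropped):
import pandas as pd

def load_spectrum(file_path):
    # pandas' C parser; skip_blank_lines drops the stray empty lines noted above
    df = pd.read_csv(file_path, sep=r'\s+', skiprows=4, header=None,
                     usecols=[0, 1], skip_blank_lines=True)
    return df.values                     # plain numpy array, like np.loadtxt returned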

Spark fastest way for creating RDD of numpy arrays

My Spark application uses RDDs of numpy arrays.
At the moment, I'm reading my data from AWS S3, and it's represented as
a simple text file where each line is a vector and each element is separated by a space, for example:
1 2 3
5.1 3.6 2.1
3 0.24 1.333
I'm using numpy's loadtxt() function in order to create a numpy array from it.
However, this method seems to be very slow and my app is spending too much time (I think) converting my dataset to numpy arrays.
Can you suggest a better way of doing it? For example, should I keep my dataset as a binary file?
Should I create the RDD in another way?
Some code for how I create my RDD:
data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(readPointBatch)
The readPointBatch function:
def readPointBatch(iterator):
    return [np.loadtxt(iterator, dtype=np.float64)]
It would be a little bit more idiomatic and slightly faster to simply map with numpy.fromstring as follows:
import numpy as np

path = ...
initial_num_of_partitions = ...
data = (sc.textFile(path, initial_num_of_partitions)
        .map(lambda s: np.fromstring(s, dtype=np.float64, sep=" ")))
but ignoring that, there is nothing particularly wrong with your approach. As far as I can tell, with a basic configuration, it is roughly twice as slow as simply reading the data and only slightly slower than creating dummy numpy arrays.
So it looks like the problem is somewhere else. It could be cluster misconfiguration, cost of fetching data from S3 or even unrealistic expectations.
You shouldn't use numpy while working with Spark. Spark has its own methodology for processing data, ensuring that your sometimes really big files aren't loaded into memory all at once, exceeding the memory limit. You should load your file with Spark like this:
data = sc.textFile("s3_url", initial_num_of_partitions) \
    .map(lambda row: [float(x) for x in row.split(' ')])
Now this will output an RDD like this, based on your example:
>>> print(data.collect())
[[1.0, 2.0, 3.0], [5.1, 3.6, 2.1], [3.0, 0.24, 1.333]]
Edit: some suggestions on file formats and numpy usage:
Text files are just as good as CSV, TSV, Parquet or anything you feel comfortable with. Binary files are not preferred, according to the Spark docs on binary files loading:
binaryFiles(path, minPartitions=None)
Note: Experimental
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Note: Small files are preferred, large file is also allowable, but may cause bad performance.
As for numpy usage, if I were you I'd definitely try to replace any external package with native Spark, for example pyspark.mllib.random for randomization: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.random
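For instance, a minimal sketch of generating random vectors natively (assumes an existing SparkContext named sc, as in the question; the sizes are placeholders):
from pyspark.mllib.random import RandomRDDs

# RDD of 1000 vectors of 3 uniform random values, generated on the cluster
# instead of building numpy arrays on the driver
rdd = RandomRDDs.uniformVectorRDD(sc, numRows=1000, numCols=3, seed=42)
print(rdd.take(2))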
The best thing to do under these circumstances is to use the pandas library for IO.
Please refer to this question: pandas read_csv() and python iterator as input.
There you will see how to replace the np.loadtxt() function so it would be much faster to create an RDD of numpy arrays.
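A hedged sketch of the idea from that linked question, applied here (reuses sc and initial_num_of_partitions from the question; the separator matches the space-delimited sample):
import numpy as np
import pandas as pd
from io import StringIO

def read_point_batch(iterator):
    # feed the whole partition to pandas' fast C parser in one call
    buf = StringIO("\n".join(iterator))
    df = pd.read_csv(buf, sep=' ', header=None, dtype=np.float64)
    return [df.values]

data = sc.textFile("s3_url", initial_num_of_partitions).mapPartitions(read_point_batch)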

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd

def read_file_to_dataframe(file_path):
    df = pd.read_csv(file_path)
    df['some_extra_column'] = 'some_extra_value'
    return df

if __name__ == '__main__':
    path = '/path/to/my/files'
    delayed_collection = list()
    for rootdir, subdirs, files in os.walk(path):
        for filename in files:
            if filename.endswith('.csv'):
                file_path = os.path.join(rootdir, filename)
                delayed_reader = delayed(read_file_to_dataframe)(file_path)
                delayed_collection.append(delayed_reader)

    df = dd.from_delayed(delayed_collection)
    print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes; then it crashes with Killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method as well as using the single-threaded scheduler as well as doing both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd

def create_df_from_bytesio(bytesio):
    df = pd.read_csv(bytesio)
    return df

def create_bytesio_from_bytes(block):
    bytesio = BytesIO(block)
    return bytesio

path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)

delayed_collection = list()
for datafile in blocks:
    for block in datafile:
        bytesio = delayed(create_bytesio_from_bytes)(block)
        df = delayed(create_df_from_bytesio)(bytesio)
        delayed_collection.append(df)

dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading in all of the data into a Pandas dataframe, which will eventually blow up memory. Instead it's better to operate on the Dask dataframe and only compute on small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.
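For instance, a rough sketch of a one-off conversion to Parquet with dask itself (paths and the column name are placeholders; to_parquet needs pyarrow or fastparquet):
import dask.dataframe as dd

# write the CSVs out as Parquet once...
ddf = dd.read_csv('/path/to/my/files/*.csv')
ddf.to_parquet('/path/to/my/files_parquet/')

# ...then later runs read the typed, compressed Parquet files instead
ddf = dd.read_parquet('/path/to/my/files_parquet/')
result = ddf.groupby('some_column').mean().compute()   # keep computed results small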
