read_sql in chunks with polars - python

I am trying to read a large database table with polars. Unfortunately, the data is too large to fit into memory and the code below eventually fails.
Is there a way in polars to define a chunk size and write these chunks to parquet, or to use the lazy dataframe interface to keep the memory footprint low?
import polars as pl
df = pl.read_sql("SELECT * from TABLENAME", connection_string)
df.write_parquet("output.parquet")

Yes and no.
There's not a predefined method to do it but you can certainly do it yourself. You'd do something like:
import os

rows_at_a_time = 1000
curindx = 0
while True:
    df = pl.read_sql(f"SELECT * from TABLENAME limit {curindx},{rows_at_a_time}", connection_string)
    if df.shape[0] == 0:
        break
    df.write_parquet(f"output{curindx}.parquet")
    curindx += rows_at_a_time

ldf = pl.concat([pl.scan_parquet(x) for x in os.listdir(".") if "output" in x and "parquet" in x])
This borrows the limit syntax from this answer, assuming you're using mysql or a db with the same syntax, which isn't a trivial assumption. You may need to do something like this if not using mysql (a rough variant is sketched at the end of this answer).
Otherwise you just read your table in chunks, saving each chunk to a local file. When the chunk you get back from your query has 0 rows, the loop stops and all the files are loaded into a lazy df.
You can almost certainly (and should) increase rows_at_a_time to something greater than 1000, but that depends on your data and computer memory.
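If your database isn't mysql, a rough sketch of the same loop using the more widely supported LIMIT ... OFFSET form (adjust the exact syntax to your dialect; TABLENAME and connection_string are placeholders from the question):
import os
import polars as pl

rows_at_a_time = 1000
offset = 0
while True:
    # LIMIT ... OFFSET works on PostgreSQL/SQLite; SQL Server needs
    # "OFFSET ... ROWS FETCH NEXT ... ROWS ONLY" instead.
    query = f"SELECT * FROM TABLENAME LIMIT {rows_at_a_time} OFFSET {offset}"
    df = pl.read_sql(query, connection_string)
    if df.shape[0] == 0:
        break
    df.write_parquet(f"output{offset}.parquet")
    offset += rows_at_a_time

ldf = pl.concat([pl.scan_parquet(f) for f in os.listdir(".") if f.startswith("output") and f.endswith(".parquet")])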

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficiently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec = np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and it takes quite a while to load everything into RAM before converting it into the .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as a numpy array iteratively. But it seems like np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5, but that format is not the main objective of this question, and hdf5 isn't desired in my use-case since I have to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library, tensorstore, that seems to handle arrays efficiently and supports conversion to a numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without knowing the exact dimensions; all of the examples seem to include configurations like 'dimensions': [1000, 20000].
Unlike HDF5, tensorstore doesn't seem to have reading-overhead issues when converting to numpy; from the docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)
Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't specifically know how np.loadtxt (genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
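For instance, an 8-byte dtype makes the same 12M x 1024 array roughly eight times as expensive (this just extends the estimate above to float64):
>>> print("{:.1E} bytes".format(
...     np.array([1.0]).itemsize * 12E6 * 1024
... ))
9.8E+10 bytes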
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if there is not enough free memory, your system will use the swap memory (i.e., the disk) to keep your system stable and have the work done. The cost you pay is clear: reading/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024-value) array. Sure, a considerable amount of memory is consumed in each loop iteration by the temporary objects created while reading the (text!) values, splitting them into list elements and casting them to an array. Still, that remains largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
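A minimal sketch of steps 1-3 (not from the original answer; it assumes the row and column counts are known in advance and the values are floats):
import numpy as np

n_rows, n_cols = 12_000_000, 1024                       # assumed known in advance
output = np.empty((n_rows, n_cols), dtype=np.float64)   # step 1: instantiate the big array

with open("input.txt") as f:                            # step 2: loop over the file
    for i, line in enumerate(f):
        output[i] = np.fromstring(line, sep=",")        # step 3: assign the values to row i

np.save("output.npy", output)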
Sure enough, you can even make it parallel: if on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have them -- if fun is at the table -- read in parallel, if that time is critical.
Hope that helps.
TL;DR
Exporting to a different format than .npy seems inevitable unless your machine is able to handle the size of the data in memory, as described in @Brandt's answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data sizes larger than what the RAM can handle, one often resorts to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries are able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
                   convert=True,
                   chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation; the machine needs to have enough RAM to fit all the data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask actually implements its own array object, which is simpler than a numpy array, see https://docs.dask.org/en/stable/array-best-practices.html. But when going through the docs, it seems like the formats dask arrays are saved in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seems to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask provides export functions such as to_hdf() and to_parquet()
In its latest version (4.14) vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is super fast. Try something like
df = vaex.open("my_file.csv")
# or
df = vaex.from_csv_arrow("my_file.csv", lazy=True)
Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.
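For instance, a rough sketch (not from the original answer; the exact export methods available depend on the vaex version):
import vaex

df = vaex.open("my_file.csv")      # lazy, arrow-backed read as described above
df.export("my_file.parquet")       # or df.export_hdf5("my_file.hdf5")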
import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    # Convert the chunk to a numpy array
    chunk_array = chunk.to_numpy()
    # Append the chunk to the data array
    data = np.append(data, chunk_array, axis=0)

np.save(npy_file, data)

# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. With efficient I guess primarily meaning with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
import numpy as np
from npy_append_array import NpyAppendArray  # pip install npy-append-array

with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128

with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            done = True
        if batch:  # guard against an empty final batch
            npy.append(np.array(batch))
(code is not tested)
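As a side note (not part of the original answer), once the .npy file has been written this way it can be memory-mapped on read, so the full 12M x 1024 array does not have to be materialized at once:
import numpy as np

arr = np.load('data.npy', mmap_mode='r')   # lazily maps the file from disk
chunk = np.asarray(arr[:1000])             # materialize only the rows you need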

Exaggerated calculation times with pandas and csv

I have a 3 column CSV file where I perform a simple calculation with python and pandas.
The file is very large, just under 4 GB; after the calculation it is about 1.9 GB.
the CSV file is:
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321,112535
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,6521321,112138
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521536521321,122135
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,521321,112132
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521536521321,212135
The calculation is a trivial sum: if column data1 is identical, then add up data2 and rewrite the CSV.
Example result :
data1,data2,data3
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97,856521536521321
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw98,856521543042642
aftqgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw99,856521537042642
import pandas as pd
#Read csv
df = pd.read_csv('data.csv', sep=',' , engine='python')
# Groupby and sum
df_new = df.groupby(["data1"]).agg({"data2": "sum"}).reset_index()
# Save in new file
df_new.to_csv('data2.csv', encoding='utf-8', index=False)
How could I improve the code to speed up execution?
It currently takes about 7 hours on a VPS to complete the calculation.
add info
The RAM usage is almost always at 100% (8 GB). The choice of engine='python' is because I used code already present on https://stackoverflow.com/; honestly I don't know whether that option is useful or not, but I have seen that the calculation works correctly.
Data3 is actually useless to me (right now, probably useful in the future).
There's an alternative option - use convtools for this. It is a pure python library which generates pure python code to build ad hoc converters. Of course bare python cannot beat pandas in terms of speed, but at least it doesn't need any wrappers and it works just like you'd implement everything by hand.
So, normally the following would work for you:
from convtools import conversion as c
from convtools.contrib.tables import Table
# you can store the converter somewhere for further reuse
converter = (
    c.group_by(c.item("data1"))
    .aggregate({
        "data1": c.item("data1"),
        "data2": c.ReduceFuncs.Sum(c.item("data2"))
    })
    .gen_converter()
)
# this is an iterable (stream of rows), not the list
rows = Table.from_csv("tmp4.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
JFYI: If you run the script manually, then you can monitor the speed using e.g. tqdm, just wrap an iterable you are consuming with it:
from tqdm import tqdm
# same code as above, except for the last line:
Table.from_rows(converter(tqdm(rows))).into_csv("out.csv")
HOWEVER:
the solution above doesn't require the input file to fit into memory, but the result must. In your case, if the result is a 1.9 GB csv file, the corresponding python objects are unlikely to fit into 8 GB of RAM.
Then you may need to:
remove the header: tail -n +2 raw_file.csv > raw_file_no_header.csv
pre-sort the file: sort raw_file_no_header.csv > sorted_file.csv
and then:
from convtools import conversion as c
from convtools.contrib.tables import Table
converter = (
    c.chunk_by(c.item("data1"))
    .aggregate(
        {
            "data1": c.ReduceFuncs.First(c.item("data1")),
            "data2": c.ReduceFuncs.Sum(c.item("data2")),
        }
    )
    .gen_converter()
)
rows = Table.from_csv("sorted_file.csv", header=True).into_iter_rows(dict)
Table.from_rows(converter(rows)).into_csv("out.csv")
This only requires a single group to fit into memory.
Remove the engine='python', it does no good.
Get more RAM; 8 GB is not enough, and you should never hit 100% (this is what slows you down).
(It is too late now, but) don't use .csv files for large datasets. Look into feather or parquet.
If you can't get more RAM, then maybe @Afaq will elaborate on the file splitting approach. The problem I see there is that you are not reducing your dataset much, so map reduce may choke on the reduce part, unless you split your file in such a way that the same data1 strings always go into the same file.
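Another way to stay within 8 GB without pre-sorting (not from the original answers; a sketch assuming only the aggregated result, not the raw file, needs to fit comfortably in memory) is to accumulate partial sums chunk by chunk with pandas:
import pandas as pd

totals = None
for chunk in pd.read_csv("data.csv", usecols=["data1", "data2"],
                         chunksize=1_000_000):
    partial = chunk.groupby("data1", as_index=False)["data2"].sum()
    # fold the partial sums into the running totals, then re-aggregate so the
    # running result always has one row per distinct data1 value
    totals = partial if totals is None else (
        pd.concat([totals, partial])
        .groupby("data1", as_index=False)["data2"].sum()
    )

totals.to_csv("data2.csv", index=False)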

Reading csv files with glob to pass data to a database very slow

I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json
endpoint = "testEndpoint"
path = "test/*.csv"
for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)

    for index, row in df.iterrows():
        # print(row['ID'], row['timestamp'], row['date'], row['time'],
        #       row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                    {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq'],
                     }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
memory. read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust the real memory and start using swap, which has terrible performance
iterrows: you seem to build a dataframe - meaning a data structure optimized for column-wise access - only to then access it by rows. This already is a bad idea, and iterrows is known to have terrible performance because it builds a Series for each row
one post request per row. An http request has its own overhead, but furthermore, this means that you add rows to the database one at a time. If this is the only interface for your database, you may have no other choice, but you should check whether it is possible to prepare a bunch of rows and load them as a whole. It often provides a gain of more than an order of magnitude (a sketch of this batching is shown after this answer).
Without more info I can hardly say more, but IMHO the highest gain is to be found on the database feeding side, so in point 3. If nothing can be done on that point, or if further performance gain is required, I would try to replace pandas with the csv module, which is row oriented and has a limited footprint because it only processes one line at a time whatever the file size.
Finally, and if it makes sense for your use case, I would try to use one thread for reading the csv file to feed a queue, and a pool of threads to send requests to the database. That should help absorb the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor.
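For point 3, a rough sketch of batching several rows into one POST (it keeps the question's row access for brevity, so points 1 and 2 still apply, and it assumes the endpoint accepts multiple items in the "payload" list):
import glob
import json
import pandas as pd
import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}   # placeholder, as in the question
BATCH = 500                                      # rows per POST; tune to what the API accepts

for fname in glob.glob("test/*.csv"):
    df = pd.read_csv(fname)
    batch = []
    for index, row in df.iterrows():
        batch.append({"data": {"installationid": row['ID'],
                               "active": row['pwrA'],
                               "reactive": row['rpwrA'],
                               "current": row['curA'],
                               "voltage": row['vltA'],
                               "frq": row['frq']},
                      "timestamp": row['timestamp']})
        if len(batch) >= BATCH:
            req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)
            batch = []
    if batch:   # flush what is left for this file
        req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)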

RAM issue with DASK and its from_pandas function

I'm trying to use the DASK package in Python 3.4 to avoid RAM problems with large datasets, but I've noticed a problem.
Using the native function read_csv, I load a big dataset into a dask dataframe using less than 150 MB of RAM.
The same dataset, read with a pandas DB connection (using limit and offset options) and the dask function from_pandas, fills my RAM up to 500/750 MB.
I can't understand why this happens and I want to fix this issue.
Here is the code:
def read_sql(schema,tab,cond):
    sql_count="""Select count(*) from """+schema+"""."""+tab
    if (len(cond)>0):
        sql_count+=""" where """+cond
    a=pd.read_sql_query(sql_count,conn)
    num_record=a['count'][0]
    volte=num_record//10000
    print(num_record)
    if(num_record%10000>0):
        volte=volte+1
    sql_base="""Select * from """+schema+"""."""+tab
    if (len(cond)>0):
        sql_base+=""" where """+cond
    sql_base+=""" limit 10000"""
    base=pd.read_sql_query(sql_base,conn)
    dataDask=dd.from_pandas(base, npartitions=None, chunksize=1000000)
    for i in range(1,volte):
        if(i%100==0):
            print(i)
        sql_query="""Select * from """+schema+"""."""+tab
        if (len(cond)>0):
            sql_query+=""" where """+cond
        sql_query+=""" limit 10000 offset """+str(i*10000)
        a=pd.read_sql_query(sql_query,conn)
        b=dd.from_pandas(a , npartitions=None, chunksize=1000000)
        divisions = list(b.divisions)
        b.divisions = (None,)*len(divisions)
        dataDask=dataDask.append(b)
    return dataDask
a=read_sql('schema','tabella','data>\'2016-06-20\'')
Thanks for helping me; waiting for news.
One dask.dataframe is composed of many pandas dataframes or, as in the case of functions like read_csv, a plan to compute those dataframes on demand. It achieves low-memory execution by executing that plan to compute the dataframes lazily.
When using from_pandas the dataframes are already in memory, so there is little that dask.dataframe can do to avoid memory blowup.
In this case I see three solutions:
Build a dask.dataframe.read_sql function to lazily pull chunks of data from a database. At the time of writing this is in progress here: https://github.com/dask/dask/pull/1181
Use dask.delayed to achieve the same result in user code. See http://dask.pydata.org/en/latest/delayed.html and http://dask.pydata.org/en/latest/delayed-collections.html (this is my main suggestion in your case; a rough sketch follows this list)
Dump your database to something like an HDF file, for which there is already a convenient dask.dataframe function.
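A rough sketch of option 2 with dask.delayed (not tested against your schema; it assumes conn, schema, tab and num_record are available as in your read_sql function, and it omits the where clause for brevity):
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_chunk(offset, limit=10000):
    query = ("Select * from " + schema + "." + tab +
             " limit " + str(limit) + " offset " + str(offset))
    return pd.read_sql_query(query, conn)

# small eager sample so dask knows the column names and dtypes
meta = pd.read_sql_query("Select * from " + schema + "." + tab + " limit 5", conn)

parts = [load_chunk(i) for i in range(0, num_record, 10000)]
ddf = dd.from_delayed(parts, meta=meta)   # nothing is read until you compute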

Pickling pandas dataframe multiplies by 5 the file size

I am reading an 800 MB CSV file with pandas.read_csv, and then use the original Python pickle.dump(dataframe) to save it. The result is a 4 GB pkl file, so the CSV size is multiplied by 5.
I expected pickle to compress data rather than extend it. Also because I can gzip the CSV file, which compresses it to 200 MB, dividing it by 4.
I am willing to accelerate the loading time of my program, and thought that pickling would help, but considering that disk access is the main bottleneck, I now understand that I would rather have to compress the files and then use the compression option of pandas.read_csv to speed up the loading time.
Is that correct?
Is it normal that pickling a pandas dataframe extends the data size?
How do you speed up loading time usually?
What are the data-size limit would you load with pandas?
Not sure why you think pickling compresses the data size; pickling creates a string version of your python object so that it can be loaded back as a python object:
In [388]:
import sys
import os
df = pd.DataFrame({'a':np.arange(5)})
df.to_pickle(r'c:\data\df.pkl')
print(sys.getsizeof(df))
statinfo = os.stat(r'c:\data\df.pkl')
print(statinfo.st_size)
with open(r'c:\data\df.pkl', 'rb') as f:
    print(f.read())
56
700
b'\x80\x04\x95\xb1\x02\x00\x00\x00\x00\x00\x00\x8c\x11pandas.core.frame\x94\x8c\tDataFrame\x94\x93\x94)}\x94\x92\x94\x8c\x15pandas.core.internals\x94\x8c\x0cBlockManager\x94\x93\x94)}\x94\x92\x94(]\x94(\x8c\x11pandas.core.index\x94\x8c\n_new_Index\x94\x93\x94h\x0b\x8c\x05Index\x94\x93\x94}\x94(\x8c\x04data\x94\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94\x8c\x05numpy\x94\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94\x8c\x01a\x94at\x94b\x8c\x04name\x94Nu\x86\x94R\x94h\rh\x0b\x8c\nInt64Index\x94\x93\x94}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x05\x85\x94h\x1f\x8c\x02i8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C(\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x94t\x94bh(Nu\x86\x94R\x94e]\x94h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01K\x05\x86\x94h\x1f\x8c\x02i4\x94K\x00K\x01\x87\x94R\x94(K\x03h5NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00t\x94b\x89C\x14\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x94t\x94ba]\x94h\rh\x0f}\x94(h\x11h\x14h\x17K\x00\x85\x94h\x19\x87\x94R\x94(K\x01K\x01\x85\x94h"\x89]\x94h&at\x94bh(Nu\x86\x94R\x94a}\x94\x8c\x060.14.1\x94}\x94(\x8c\x06blocks\x94]\x94}\x94(\x8c\x06values\x94h>\x8c\x08mgr_locs\x94\x8c\x08builtins\x94\x8c\x05slice\x94\x93\x94K\x00K\x01K\x01\x87\x94R\x94ua\x8c\x04axes\x94h\nust\x94bb.'
The method to_csv does support compression as a kwarg, 'gzip' and 'bz2':
In [390]:
df.to_csv(r'c:\data\df.zip', compression='bz2')
statinfo = os.stat(r'c:\data\df.zip')
print(statinfo.st_size)
29
It is likely in your best interest to stash your CSV file in a database of some sort and perform operations on that rather than loading the CSV file to RAM, as Kathirmani suggested. You will see the speedup in loading time that you expect due simply to the fact that you are not filling up 800 Mb worth of RAM every time you load your script.
File compression and loading time are two conflicting elements of what you seem to be trying to accomplish. Compressing the CSV file and loading that will take more time; you've now added the extra step of having to decompress the file, which doesn't solve your problem.
Consider a precursory step to ship the data to an sqlite3 database, as described here: Importing a CSV file into a sqlite3 database table using Python.
You now have the pleasure of being able to query a subset of your data and quickly load it into a pandas.DataFrame for further use, as follows:
from pandas.io import sql
import sqlite3
conn = sqlite3.connect('your/database/path')
query = "SELECT * FROM foo WHERE bar = 'FOOBAR';"
results_df = sql.read_frame(query, con=conn)
...
Conversely, you can use pandas.DataFrame.to_sql() to save these for later use.
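For instance, a small sketch of that reverse direction with to_sql (the database path, table and column names are placeholders):
import sqlite3
import pandas as pd

conn = sqlite3.connect('your/database/path')
df = pd.DataFrame({'bar': ['FOOBAR'], 'baz': [42]})   # placeholder data
df.to_sql('foo', con=conn, if_exists='append', index=False)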
Don't load the 800 MB file into memory. It will increase your loading time. Pickled objects also take more time to load. Instead, store the csv file as a sqlite3 table (sqlite3 comes along with python), and then query the table every time depending upon your need.
You can also use pandas' pickle methods, which should compress your data.
Save a dataframe:
df.to_pickle(filename)
Load it:
df = pd.read_pickle(filename)
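For example, a short sketch assuming a pandas version recent enough to support the compression keyword of to_pickle/read_pickle:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1_000_000)})

df.to_pickle('df.pkl')                           # plain pickle
df.to_pickle('df.pkl.gz', compression='gzip')    # compressed pickle

df2 = pd.read_pickle('df.pkl.gz')                # compression inferred from the suffix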
