Well, what I have is a big CSV file with data and a bottleneck of server RAM. There is also a dask-distributed cluster that looks like a solution for this case; dask-scheduler is running on the server. Here's what I have tried:
import dask.dataframe as dd
import pandas as pd
from dask.bag import from_sequence
cheques = dd.read_csv('cheque_data.csv')  # not working: the distributed workers can't access the file directly
cheques = from_sequence(pd.read_csv('cheque_data.csv', chunksize=10**4)).to_dataframe()  # fails: Bag.to_dataframe expects the bag to contain tuples or dicts, not DataFrame chunks
So I'm stuck here; any ideas or clues would be great.
Your code snippet says dd.read_csv('cheque_data.csv') is not working because the distributed workers can't access the file directly.
I think you need to give your workers access to the file.
I prefer splitting up huge files before they're read by Dask. That makes it easier for Dask to read the files in parallel.
It depends on the size of cheque_data.csv. 2GB is a lot easier to manage than 30GB.
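For example, here's a minimal sketch of that approach, assuming the workers can all reach a shared directory (the /shared/cheques/ path is hypothetical): split the CSV into smaller pieces with pandas, then let Dask read them all with a glob pattern.
import pandas as pd
import dask.dataframe as dd

# Split the big CSV into smaller files on storage every worker can reach
# (the /shared/cheques/ path is just a placeholder for this sketch).
for i, chunk in enumerate(pd.read_csv('cheque_data.csv', chunksize=10**6)):
    chunk.to_csv(f'/shared/cheques/cheque_data_{i:04d}.csv', index=False)

# Dask then reads the pieces in parallel via a glob pattern
cheques = dd.read_csv('/shared/cheques/cheque_data_*.csv')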
I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing between independent processes, I found that the bottleneck was really the parsing function in pandas library.
Particularly, CSV parsing, as CSVs are notoriously inefficient in terms of data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
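For illustration, a rough sketch of that workflow (the file names are placeholders): parse the CSV once, cache it as a pickle, and reload the pickle on every re-run.
import pandas as pd

# One-off conversion: parse the CSV once and cache it as a pickle
df = pd.read_csv('dataset.csv')
df.to_pickle('dataset.pkl')

# On every subsequent run, loading the pickle skips CSV parsing entirely
df = pd.read_pickle('dataset.pkl')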
I have a large BCP file (12GB) that I have imported into Dask and done some data wrangling on, and I now wish to import the result into SQL Server. The file has been reduced from 40+ columns to 8 columns, and I wish to find the best method to import it into SQL Server. I have tried using the following:
import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urllib.parse import quote_plus
pbar = ProgressBar()
pbar.register()
#windows authentication
#to_sql_uri = quote_plus(engine)
ddf.to_sql('test',
           uri='mssql+pyodbc://TEST_SERVER/TEST_DB?driver=SQL Server&Trusted_Connection=yes',
           if_exists='replace', index=False)
This method is taking too long (3 days and counting). I had suspected this may be the case, so I also tried to write to a BCP file with the intention of using SQL BCP, but again this is taking a number of days:
df_train_grouped.compute().to_csv(r"F:\TEST_FILE.bcp", sep='\t')
I am relatively new to dask and can't seem to find an easy to follow example on the most efficient method to do this.
There is no need for you to use compute(); it materialises the whole dataframe into memory and is likely your bottleneck. You can instead do
df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t')
which will create a number of output files in parallel, which is probably exactly what you want.
Note that profiling will determine whether your process is IO bound (e.g., by the disc itself), in which case there is nothing you can do, or whether one of the process-based schedulers (ideally the distributed scheduler) can help with GIL-holding tasks.
Changing to the multiprocessing scheduler as follows improved performance in this particular case:
import dask
dask.config.set(scheduler='processes')  # override the default (threaded) scheduler with the multiprocessing one
df_train_grouped.to_csv(r"F:\TEST_FILE*.bcp", sep='\t', chunksize=1000000)
I am trying to read a decently large Parquet file (~2 GB, about ~30 million rows) into my Jupyter Notebook (in Python 3) using the pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for Parquet files. Unfortunately, while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to rerun the code since this will cause another freeze, so I don't know the verbatim error message).
Is there a good way to read only part of the Parquet file into memory without this occurring? I know that Parquet files are columnar and it may not be possible to load only a subset of the records, but I'd like to split it up if there is a workaround, or see if I am doing anything wrong while trying to read it in.
I do have a relatively weak computer in terms of specs, with only 6 GB memory and i3. The CPU is 2.2 GHz with Turbo Boost available.
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like google colab) to load the parquet file and then save it as hdf. Once you have it, you can use it in chunks.
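A rough sketch of both ideas (the column names are placeholders): read only the columns you need, then optionally cache what you have as HDF in table format so it can be re-read in chunks.
import pandas as pd

# Read only the columns you actually use from the Parquet file
df = pd.read_parquet('example.parquet', columns=['user_id', 'amount'])

# Optionally cache as HDF in table format (requires the PyTables package)
df.to_hdf('example.h5', key='data', format='table')

# Later, read the HDF file back in chunks instead of all at once
for chunk in pd.read_hdf('example.h5', key='data', chunksize=1_000_000):
    print(chunk.shape)  # replace with your own processing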
You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
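A rough sketch of the Dask route (the column names here are placeholders): the Parquet file is opened lazily and only the computed result is pulled into memory.
import dask.dataframe as dd

# Lazily open the Parquet file; nothing is loaded into memory yet
ddf = dd.read_parquet('example.parquet', columns=['user_id', 'amount'])

# Operations build a task graph; compute() brings only the (small) result into memory
totals = ddf.groupby('user_id')['amount'].sum().compute()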
It's possible to read Parquet data in batches, to read only certain row groups (or iterate over row groups), and to read only certain columns. This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In the case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(i.to_pandas())
The above example simply reads 1000 records at a time. You can further limit this to certain row groups or even certain columns, as below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(i.to_pandas())
I wrote a Python wrapper for a big data R library. Its primary use is to:
receive a potentially large amount of data as an R dataframe/tibble
convert that to a Pandas dataframe
convert that to a Koalas dataframe
So I am worried about running out of memory. I'm not really worried about how fast it is, because it's kind of a workaround; it just has to be reliable. I'm also not looking to handle this in a formalized pipeline.
Will Python automatically swap my data onto disk if my users run out of RAM for some reason? Does the fact that it is running in a Docker env have any impact on that?
I want to read a 6 GB JSON file (and I have another of 1.5 GB), and I tried to read it normally with pandas (just with pd.read_json), and memory clearly runs out.
Then I tried with the chunksize param, like:
import pandas as pd

with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
data = pd.concat(df)  # concatenate the chunks back into a single DataFrame
But that doesn't work either, and my PC dies within the first minute of running (I have 8 GB of RAM).
Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is a Spark API and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.
As shown here, read_json's API mostly passes through from pandas.
As you port your example code from the question, I would note two things:
I suspect you won't need the file context manager, as simply passing the file path probably works.
If you have multiple files, Dask supports glob patterns like "path/to/files/*.json"; see the sketch below.
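A minimal sketch under those assumptions (the path, lines=True, and the blocksize value are guesses about your data layout):
import dask.dataframe as dd

# Read the newline-delimited JSON in blocks instead of all at once
df = dd.read_json('data/products.json', lines=True, blocksize=2**28)  # ~256 MB blocks

# A glob also works for multiple files, e.g. dd.read_json('data/*.json', lines=True)
print(df.head())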