Writing large dask dataframe to file - python

I have a large BCP file (12 GB) that I have imported into dask and done some data wrangling on, and I now wish to import the result to SQL Server. The file has been reduced from 40+ columns to 8 columns, and I want to find the best method to import it to SQL Server. I have tried the following:
import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urllib.parse import quote_plus
pbar = ProgressBar()
pbar.register()
#windows authentication
#to_sql_uri = quote_plus(engine)
ddf.to_sql('test',
           uri='mssql+pyodbc://TEST_SERVER/TEST_DB?driver=SQL Server?Trusted_Connection=yes',
           if_exists='replace', index=False)
This method is taking too long (3 days and counting). I had suspected this may be the case, so I also tried to write to a BCP file with the intention of using SQL BCP, but again this is taking a number of days:
df_train_grouped.compute().to_csv("F:\TEST_FILE.bcp", sep='\t')
I am relatively new to dask and can't seem to find an easy to follow example on the most efficient method to do this.

There is no need for you to use compute; this materialises the whole dataframe into memory as a single pandas DataFrame and is likely the bottleneck for you. You can instead do
df_train_grouped.to_csv("F:\TEST_FILE*.bcp", sep='\t')
which will write a number of output files in parallel, one per partition - probably exactly what you want.
Note that profiling will determine whether your process is IO-bound (e.g., limited by the disc itself), in which case there is nothing you can do, or whether one of the process-based schedulers (ideally the distributed scheduler) can help with GIL-holding tasks.
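As a minimal sketch of such profiling with dask's built-in diagnostics (the file path and dataframe name are taken from the question above):
from dask.diagnostics import Profiler, ResourceProfiler, visualize

# Record per-task timings plus CPU/memory usage while the files are written.
with Profiler() as prof, ResourceProfiler(dt=0.5) as rprof:
    df_train_grouped.to_csv("F:\TEST_FILE*.bcp", sep='\t')

# Renders an interactive timeline; low CPU usage alongside long-running tasks
# suggests the job is IO-bound rather than GIL-bound.
visualize([prof, rprof])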
Changing to a multiprocessing scheduler as follows improved performance in this particular case:
import dask

dask.config.set(scheduler='processes')  # overwrite the default threaded scheduler with the multiprocessing one
df_train_grouped.to_csv("F:\TEST_FILE*.bcp", sep='\t', chunksize=1000000)

Related

How to share zero copy dataframes between processes with PyArrow

I'm trying to work out how to share data between processes with PyArrow (to hopefully at some stage share pandas DataFrames). I am at a rather experimental (read: Newbie) stage and am trying to figure out how to use PyArrow. I'm a bit stuck and need help.
Going through the documentation, I found an example that creates a buffer:
import time
import pyarrow as pa

data = b'abcdefghijklmnopqrstuvwxyz'
buf = pa.py_buffer(data)
print(buf)
# <pyarrow.Buffer address=0x7fa5be7d5850 size=26 is_cpu=True is_mutable=False>

# keep the process alive so the buffer stays in memory
while True:
    time.sleep(1)
While this process was running, I used the address and size that the script printed to (try to) access the buffer in another script:
import pyarrow as pa
buf = pa.foreign_buffer(0x7fa5be7d5850, size=26)
print(buf.to_pybytes())
and received... a segmentation fault - most likely because the script is trying to access memory from another process, which may require different handling.
Is this not possible with PyArrow, or is it just the way I am trying to do it? Do I need other libraries? I'd like to avoid serialisation (or writing to disk in general) if possible, but that may or may not be achievable. Any pointers are appreciated.
You can't share raw buffer addresses across processes; instead you should use a memory-mapped file.
To write:
import pyarrow as pa
mmap = pa.create_memory_map("hello.txt", 20)
mmap.write(b"hello")
To read:
import pyarrow as pa
mmap = pa.memory_map("hello.txt")
mmap.read(5)
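To get from there to sharing a pandas DataFrame, a common pattern is to write the data as an Arrow IPC file and memory-map it in the reading process; the column buffers are then read zero-copy. A minimal sketch, with the file name shared_table.arrow and the example DataFrame chosen purely for illustration:
import pandas as pd
import pyarrow as pa

# Process A: convert the DataFrame to an Arrow table and write it as an IPC file.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
table = pa.Table.from_pandas(df)
with pa.OSFile("shared_table.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Process B: memory-map the file and read the table back without copying the buffers.
with pa.memory_map("shared_table.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
df_again = shared.to_pandas()  # converting back to pandas does copy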

Memory efficient data loading with Dask

Well, what I have is a big CSV file with data, and the server RAM is the bottleneck. Besides that, there is a dask-distributed cluster that looks like a solution for this case; dask-scheduler is running on the server. Here's what I have tried:
import dask.dataframe as dd
import pandas as pd
from dask.bag import from_sequence
cheques = dd.read_csv('cheque_data.csv') # not working because of distributed workers can't access file directly
cheques = from_sequence(pd.read_csv('cheque_data.csv',chunksize=10**4)).to_dataframe() # dask Bag from_sequence constructs from tuples or dict
So I am stuck here; any ideas and clues would be great.
Your code snippet says dd.read_csv('cheque_data.csv') is "not working because of distributed workers can't access file directly".
I think you need to give your workers access to the file.
I prefer splitting up huge files before they're read by Dask. That makes it easier for Dask to read the files in parallel.
It depends on the size of cheque_data.csv. 2GB is a lot easier to manage than 30GB.
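As a rough sketch of what that can look like (the chunk size, the /shared/cheques/ path, and the assumption that every worker can reach that path are all illustrative):
import pandas as pd
import dask.dataframe as dd

# One-off split: stream the big CSV through pandas in chunks and write
# smaller pieces to storage that every dask worker can access.
for i, chunk in enumerate(pd.read_csv("cheque_data.csv", chunksize=500_000)):
    chunk.to_csv(f"/shared/cheques/cheque_data_{i:04d}.csv", index=False)

# The workers can then read the pieces in parallel via a glob pattern.
cheques = dd.read_csv("/shared/cheques/cheque_data_*.csv")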

Possibility of Corruption: Reading Excel Files with Pandas

We are in the design phase for a product. The idea is that the code will read a list of values from Excel into SQL.
The requirements are as follows:
Workbook may be accessed by multiple users outside of our program
Workbook must remain accessible (i.e. not be corrupted) should something bad occur while our program is running
Program will be executed when no users are in the file
Right now we are considering using pandas in a simple manner as follows:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
"""Some code to write df in to SQL"""
If this code goes offline (e.g. crashes) while the Excel file is still open, is there ANY possibility that the file will remain locked somewhere by my program or be corrupted?
To clarify, we envision something catastrophic like the server crashing or losing power.
Searched around but couldn't find a similar question, please redirect me if necessary.
I also read through Pandas read_excel documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
With the code you provide, from my reading of the pandas and xlrd code, the given file will only be opened in read mode. That should mean, to the best of my knowledge, that there is no more risk in what you're doing than in reading the file any other way - and you have to read it to use it, after all.
If this doesn't sufficiently reassure you, you could minimize the time the file is open and, more importantly, not expose your file to external code, by handing pandas a BytesIO object instead of a path:
import io
import pandas as pd

# Read the whole workbook into memory first, then hand pandas the in-memory copy;
# the file on disk is closed before pandas/xlrd ever touch it.
with open('File.xlsx', 'rb') as f:
    data = io.BytesIO(f.read())
df = pd.read_excel(data, sheet_name='Sheet1')
# etc
This way your file will only be open for the time it takes to read it into memory, and pandas and xlrd will only be working with a copy of the data.

Reading big json dataset using pandas with chunks

I want to read a 6 GB JSON file (and I have another of 1.5 GB). I tried to read it normally with pandas (just pd.read_json), and clearly memory dies.
Then I tried with the chunksize param, like:
import pandas as pd

with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
data = pd.concat(df)
But that doesn't work either, and my PC dies in the first minute of running (I have 8 GB of RAM).
Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is a Spark API and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.
As shown here, read_json's API mostly passes through from pandas.
As you port your example code from the question, I would note two things (a sketch follows below):
I suspect you won't need the file context manager, as simply passing the file path probably works.
If you have multiple record files, Dask supports globs like "path/to/files/*.json"
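A minimal sketch of what that port could look like, assuming the path from the question and an illustrative blocksize:
import dask.dataframe as dd

# Pass the path directly; no open() context manager is needed.
# blocksize splits each line-delimited file into partitions that fit in memory.
df = dd.read_json('data/products.json', lines=True, blocksize=128_000_000)

# Multiple record files can be read with a glob, e.g.:
# df = dd.read_json('data/products_*.json', lines=True, blocksize=128_000_000)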

Using Pool to read multiple files in parallel takes forever on Jupyter on Windows

I want to read 22 files (stored on my hard disk) with around 300,000 rows each into a single pandas data frame. My code was able to do it in 15-25 minutes. My initial thought is: I should make it faster using more CPUs. (Correct me if I am wrong here, and if all CPUs can't read data from the same hard disk at the same time; however, in that case we can assume the data might be present on different hard disks later on, so this exercise is still useful.)
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a file (pipe delimited) to a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,             # need this as first row is junk
                       nrows=1,                # just one row for faster testing
                       encoding="ISO-8859-1",  # need this as well
                       low_memory=False)

files = os.listdir('.')  # getting all files, will use glob later

df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False)  # takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
#df2 = pd.concat(df_list, ignore_index=True)  # can't reach this
This takes forever (more than 30-60 minutes; it hasn't finished by the time I kill the process). I also went through a similar question to mine, but to no avail.
EDIT: I am using Jupyter on Windows.
Your task is IO-bound; the bottleneck is the hard drive. The CPU has to do only a little work to parse each line of the CSV.
Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek the beginning and then just read all of its bytes sequentially.
If you have multiple large files on the same hard-drive and read from them using multiple processes, then the disk head will have to jump back and forth between them, where each jump takes up to 10 ms.
Multiprocessing can still make your code faster, but you will need to store your files on multiple disks, so each disk head can focus on reading one file.
Another alternative is to buy an SSD. Disk seek time is much lower at 0.1 ms and throughput is around 5x faster.
So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to include a guard like this before initializing the Pool: if __name__ == '__main__':. For Jupyter, we need to save the worker in a separate file and import it in the code. Jupyter is also problematic as it does not give the error log by default. I got to know about the Windows issue when I ran the code in a Python shell, and about the Jupyter error when I ran the code in an IPython shell. The following posts helped me a lot (a sketch of the combined fix follows below).
For Jupyter
For Windows Issue
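A minimal sketch of the combined fix, assuming the worker lives in a hypothetical module named read_worker.py next to the notebook:
# read_worker.py -- the worker must live in its own importable module so that
# Jupyter on Windows can pickle it for the spawned worker processes.
import pandas as pd

def read_psv(filename):
    'reads one pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)


# notebook cell / main script
import os
import pandas as pd
from multiprocessing import Pool
from read_worker import read_psv

if __name__ == '__main__':  # required on Windows because of the spawn start method
    files = os.listdir('.')
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files[0:6])
    df2 = pd.concat(df_list, ignore_index=True)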
