I have the following problem: I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500,000 rows and 130 columns with mixed dtypes. I tried Dask because I want to parallelise the reading, but it took much longer and I wonder why. I have 32 cores and tried this:
import dask.dataframe as dd
import dask.multiprocessing
dask.config.set(scheduler='processes')
df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes') # convert to pandas
When reading a huge file from disk, the bottleneck is the IO. As Pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multi-processing or multi-threading is likely to be less performant, because you will spend the same time for loading the data from the disk, and only add some overhead for synchronizing the different processes or threads.
Consider what this means:
df = df.compute(scheduler='processes')
each process accesses some chunk of the original data. This may be in parallel or, quite likely, limited by the IO of the underlying storage device
each process makes a dataframe from its data, which is CPU-heavy and will parallelise well
each chunk is serialised by the process and communicated to the client from where you called it
the client deserialises the chunks and concatenates them for you.
Short story: don't use Dask if your only job is to get a Pandas dataframe in memory, it only adds overhead. Do use Dask if you can operate on the chunks independently, and only collect small output in the client (e.g., groupby-aggregate, etc.).
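As a hedged illustration of that last point, this is the kind of pattern where Dask pays off: each worker reduces its own chunks and only a small aggregate comes back to the client (the column names some_key and some_value are placeholders, not from the original question).
import dask.dataframe as dd

df = dd.read_csv(filepath, sep='\t', blocksize=25_000_000)

# each process aggregates its own chunks; only the small per-group
# result is serialised back to the calling process
result = df.groupby('some_key')['some_value'].mean().compute(scheduler='processes')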
You could use multiprocessing, but since the file is not split up, you risk having processes wait while another program/thread accesses the file (which seems to be the case given your measurements).
If you want to use multiprocessing effectively, I recommend splitting the file into several parts and merging all the results in a final step.
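For illustration, here is a minimal sketch of that approach, assuming the file has already been split into part files such as data_parts/part_0.tsv, part_1.tsv, ... (the paths and worker count are placeholders):
import glob
from multiprocessing import Pool
import pandas as pd

def read_part(path):
    # each worker parses one pre-split piece of the original file
    return pd.read_csv(path, sep='\t')

if __name__ == '__main__':
    parts = sorted(glob.glob('data_parts/part_*.tsv'))  # hypothetical split files
    with Pool(processes=8) as pool:
        frames = pool.map(read_part, parts)
    df = pd.concat(frames, ignore_index=True)  # merge everything in the final step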
I recommend trying different numbers of processes with the num_workers keyword argument to compute.
Contrary to what is said above, read_csv is definitely compute-bound, and having a few processes working in parallel will likely help.
However, having too many processes all hammering at the disk at the same time might cause a lot of contention and slow things down.
I recommend experimenting a bit with different numbers of processes to see what works best.
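For example (8 is just a starting point to experiment with, not a recommendation):
df = dd.read_csv(filepath, sep='\t', blocksize=1000000)
df = df.compute(scheduler='processes', num_workers=8)  # try 4, 8, 16, ... and time each run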
Related
I have around 100 GB of user data and want to process it using Apache Spark on my laptop. I have installed Hadoop and Spark, and for a test I uploaded a file of around 9 GB to HDFS and accessed and queried it using PySpark.
The test file has 113,959,238 records/rows in total. When I queried the data for a particular user, i.e.
select * from table where userid=????
it took around 6 minutes to retrieve the records of that user, and if I run it on the entire file it will take a lot of time.
The analysis I want to make on that data is to extract the records of one user, run some operations on them, then process the data of the second user, and so on for all the users in the file. The data for a single queried user will not be much, so it can be loaded in memory and the operations can be performed faster. But querying the records of a user from that big file takes time and slows down the whole process.
It is said that Spark is lightning fast, so surely I must be missing something that makes it take this long. One thing I noted while performing queries was that Spark was not utilizing its full RAM, but almost 100% of the CPU.
My machine specs are:
I also queried the data directly from the text file using Spark instead of going through HDFS, but there wasn't much difference in time.
The Python code that I wrote is:
from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext
import time

conf = SparkConf()
conf.set("spark.executor.memory", "8g")
conf.set("spark.driver.memory", "8g")

# pass the conf to the builder so the memory settings actually take effect
sparkSession = SparkSession.builder \
    .appName("example-pyspark-read-and-write") \
    .config(conf=conf) \
    .getOrCreate()
sc = sparkSession.sparkContext
sqlContext = SQLContext(sc)

#df_load = sparkSession.read.format("csv").option("header","true").load("hdfs://0.0.0.0:19000/test.txt")
df_load = sparkSession.read.format("csv").option("header", "true").load("C:/Data/test_file/test.txt")
df_load.registerTempTable('test')
sp_tstart=time.time()
df=sqlContext.sql("select * from test where user_id='12345'")
db=df.rdd.collect()
sp_tend=time.time()
t_time=sp_tend-sp_tstart
df.show()
print(t_time/60)
Given my machine specs, is Spark taking a normal amount of time, or do I need to configure something? Do I need to upgrade the specs, or are they enough for this data?
One of the things to understand with Spark, Hadoop and other Big Data providers is that they aren't aiming to get the maximum possible throughput from an individual machine. They're aiming to let you split the processing efficiently across multiple machines. They sacrifice a certain amount of individual machine throughput to provide horizontal scalability.
While you can run Spark on just a single machine, the main reasons to do so are to learn Spark or to write test code to then deploy to run against a cluster with more data.
As others have noted, if you just want to process data on a single machine, then there are libraries which are going to be more efficient in that scenario. 100GB is not a huge amount to process on a single machine.
From the sound of things you'd be better off importing that data into a database and adding suitable indexing. One other thing to understand is that a lot of the benefit of Big Data systems is supporting analysing and processing most or all of the data. Traditional database systems like Postgres or SQL Server can work well handling Terabytes of data when you're mainly querying for small amounts of data using indices.
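As a rough illustration of that route, here is a sketch using SQLite so it stays self-contained (the same idea applies to Postgres or SQL Server; the file name, table name, column name and chunk size are placeholders):
import sqlite3
import pandas as pd

conn = sqlite3.connect('users.db')

# stream the CSV in chunks so the whole file never has to fit in memory
for chunk in pd.read_csv('test.txt', chunksize=1000000):
    chunk.to_sql('users', conn, if_exists='append', index=False)

# an index on the user column makes per-user lookups cheap
conn.execute('CREATE INDEX IF NOT EXISTS idx_users_user_id ON users (user_id)')
conn.commit()

# a per-user query now touches only the matching rows
df_user = pd.read_sql_query("SELECT * FROM users WHERE user_id = '12345'", conn)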
If your objective is to analyze 100 GB of data using Python and there is no requirement for Spark, you can also take a look at Dask: https://dask.org/ It should be easier to set up and use with Python.
For example dask dataframe: https://docs.dask.org/en/latest/dataframe.html
>>> import dask.dataframe as dd
>>> df = dd.read_csv('2014-*.csv')
>>> df.head()
x y
0 1 a
1 2 b
2 3 c
3 4 a
4 5 b
5 6 c
>>> df2 = df[df.y == 'a'].x + 1
You don't need Hadoop to process the file locally.
The advantages of Hadoop only apply when you use more than one machine as your file will be chunked and distributed to many processes at once.
Similarly, 100GB of plaintext isn't really "big data"; it still fits on a single machine, and if stored in a better format like ORC or Parquet it would be significantly smaller
Also, to get faster times, don't use collect(); a short sketch follows this list
If you simply want to lookup data by ID, use a key value database like Redis or Accumulo, not Hadoop/Spark
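To make the Parquet and collect() points concrete, a hedged sketch (the paths are illustrative and the one-time conversion is optional): convert the CSV to Parquet once, then filter on the executors instead of pulling everything back with collect():
# one-time conversion: columnar Parquet is much smaller and faster to scan than CSV
df_load.write.mode("overwrite").parquet("C:/Data/test_file/test_parquet")

users = sparkSession.read.parquet("C:/Data/test_file/test_parquet")

# filter on the executors and only bring a small preview back to the driver
one_user = users.filter(users.user_id == "12345")
one_user.show(20)                                        # inspect a few rows instead of collect()
one_user.write.parquet("C:/Data/test_file/user_12345")   # or persist the per-user result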
The type of job you have described is a highly CPU-intensive process, which unfortunately is only going to be sped up significantly by running many parallel queries on partitions of the data set. Compound the problem with not having enough system memory to hold the entire dataset, and now you are also limited by significant reads/writes on the hard drive.
This is the type of task where Spark really shines. The reason you aren't experiencing any improvement in performance is that with a single system you're missing the benefit of Spark entirely, which is the ability to split the data set into many partitions and distribute it across many machines that can work on many different user IDs at the same time.
Each worker node in your cluster will have a smaller data set to look at, which means on each node the entire data set it is looking at can easily be stored in memory. Each find and replace function (one per user ID) can be sent to a single CPU core, which means if you have 5 workers with 16 cores each, you can process 80 IDs at a time, from memory, on an optimized partition size.
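As a rough sketch of that idea in PySpark (the partition count and the per-user operation are placeholders; a real job would replace the count with the actual per-user processing):
# hash-partition by user_id so all rows for a given user land in the same partition
partitioned = df_load.repartition(80, "user_id")

# stand-in for the real per-user operation: here we just count rows per user
per_user = partitioned.groupBy("user_id").count()
per_user.write.mode("overwrite").parquet("per_user_output")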
Google Cloud Dataproc and Azure Databricks are super platforms for doing this. Simply choose the number of workers you need and the CPU/memory of each node, and fire up the cluster. Connect to your data and kick off your PySpark code. It can process this data so quickly that even though you are paying by the minute for the cluster, it will end up being very cheap ($10-$20 perhaps).
I need to repeatedly calculate very large Python arrays based on a small input and a very large constant bulk of data stored on the drive. I can successfully parallelize it by splitting that bulk of data and joining the responses. Here comes the problem: sending the identical data bulk to the pool is too slow. Moreover, I double the required memory. Ideally I would read the data from the file once in each worker and keep it there for multiple re-use.
How do I do it? I can only think of creating multiple servers that will listen to requests from the pool. Somehow that looks like an unnatural solution to quite a common problem. Am I missing a better solution?
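For what it's worth, a minimal sketch of the load-once-per-worker idea described above, using a Pool initializer (the file name bulk_data.npy, the loader and the computation are placeholders):
import numpy as np
from multiprocessing import Pool

_bulk = None  # per-process global holding the constant data

def init_worker(path):
    # runs once in each worker process; the loaded array is then reused
    global _bulk
    _bulk = np.load(path)

def compute(small_input):
    # only small_input travels to the pool; the bulk data is already local
    return float((_bulk * small_input).sum())

if __name__ == '__main__':
    inputs = [0.1, 0.2, 0.3]
    with Pool(processes=4, initializer=init_worker, initargs=('bulk_data.npy',)) as pool:
        results = pool.map(compute, inputs)
    print(results)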
I am trying to use to_parquet but it crashes my system due to memory error. I've discovered it's trying to save 100-300 of my partitions at a time.
Is it possible to somehow specify that I want fewer partitions processed at a time in order to prevent a crash due to using up all the RAM?
Dask will use as many threads at a time as you give it. The tasks may be "processing" but that just means that they have been sent to a worker, which will handle them when it has a spare thread.
I am trying to use to_parquet but it crashes my system due to memory error.
However it could still be that your partitions are large enough that you can't fit several of them in memory at once. In this case you might want to select a smaller partition size. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions for more information.
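For example, one way to get smaller partitions before writing (a sketch; the target size and path are illustrative):
# aim for ~100 MB partitions so only small pieces are held in memory at once
df = df.repartition(partition_size="100MB")
df.to_parquet("output_parquet/")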
I have been testing how to use dask (cluster with 20 cores) and I am surprised by the speed that I get on calling a len function vs slicing through loc.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client('192.168.1.220:8786')

log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log, npartitions=20)
#This is the code than runs slowly
#(2.9 seconds whilst I would expect no more than a few hundred milliseconds)
print(len(logd))
#Instead this code is actually running almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour there is something I am not grasping about the library.
All of the green boxes correspond to "from_pandas", whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called, which is significantly faster, and the calls don't seem to be blocked waiting for one worker to finish its task before starting another).
Good question, this gets at a few points about when data moves up to the cluster and back down to the client (your Python session). Let's look at a few stages of your computation.
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute()). So now Dask kicks in. First it moves your data out to the cluster (slow), then it calls len on all 20 partitions (fast), it aggregates those results (fast), and then moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds
with dask.config.set(scheduler='threads'):  # no cluster, just local threads
    print(len(logd))                        # stays on client
But presumably you want to know how to scale out to larger datasets, so let's do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blog post, here are the assumptions we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that that file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log,npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
This gets into a more complex API though, so I'll just point you to the docs:
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html
I have a continuous stream of data. I want to do a small amount of processing on the data in real time (mostly just compression, rolling some data off the end, whatever needs doing) and then store it. Presumably no problem. The HDF5 file format should do great! OOC data, no problem. PyTables.
Now the trouble. Occasionally, as a completely separate process so that data is still being gathered, I would like to perform a time-consuming calculation involving the data (order of minutes). This involves reading the same file I'm writing.
How do people do this?
Of course reading a file that you're currently writing should be challenging, but it seems that it must have come up enough in the past that people have considered some sort of slick solution, or at least a natural work-around.
Partial solutions:
It seems that HDF5-1.10.0 has a capability called SWMR (Single Writer, Multiple Reader). This seems like exactly what I want. I can't find a Python wrapper for this recent version, or if it exists I can't get Python to talk to the right version of HDF5. Any tips here would be welcomed (see the sketch after this list). I'm using the Conda package manager.
I could imagine writing to a buffer, which is occasionally flushed and added to the large database. How do I ensure that I'm not missing data going by while doing this?
This also seems like it might be computationally expensive, but perhaps there's no getting around that.
Collect less data. What's the fun in that?
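Regarding the SWMR option above: a minimal sketch of how the h5py bindings expose it when built against HDF5 1.10+ (the file name, dataset name and shapes are placeholders, and this is illustrative rather than a drop-in solution):
import h5py
import numpy as np

# --- writer process: append data while allowing concurrent readers ---
f = h5py.File('stream.h5', 'w', libver='latest')
dset = f.create_dataset('data', shape=(0,), maxshape=(None,), dtype='f8')
f.swmr_mode = True                            # from here on, readers may attach in SWMR mode

new_block = np.random.random(1000)            # stand-in for incoming data
dset.resize((dset.shape[0] + len(new_block),))
dset[-len(new_block):] = new_block
dset.flush()                                  # make the new rows visible to readers

# --- reader process (normally a separate script): long-running analysis ---
r = h5py.File('stream.h5', 'r', libver='latest', swmr=True)
rdset = r['data']
rdset.refresh()                               # pick up rows flushed since the file was opened
print(rdset.shape, rdset[-5:])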
I suggest you take a look at adding Apache Kafka to your pipeline; it can act as a data buffer and help you separate the different tasks performed on the data you collect.
pipeline example:
raw data ===> kafka topic (raw_data) ===> small processing ===> kafka topic (light_processing) ===> a process reads from the light_processing topic and writes to a db or file
At the same time you can read the same data from the light_processing topic (or any other topic) with another process and do your heavy processing there, and so on.
If the light-processing consumer and the heavy-processing consumer connect to the Kafka topic with different group IDs, each of them receives its own copy of the full stream, so both processes see the same data.
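For instance, a minimal sketch with the kafka-python package (the broker address, topic names and group IDs are assumptions for illustration):
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('raw_data', b'sensor reading ...')   # the collector pushes raw data
producer.flush()

# light-processing consumer: reads raw_data and republishes the processed result
light = KafkaConsumer('raw_data', bootstrap_servers='localhost:9092', group_id='light-processing')

# heavy-processing consumer: its own group id gives it an independent copy of light_processing
heavy = KafkaConsumer('light_processing', bootstrap_servers='localhost:9092', group_id='heavy-processing')

for msg in light:
    compressed = msg.value                          # do the small real-time processing here
    producer.send('light_processing', compressed)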
hope it helped.