vaex and ipynb problems - python

I am new to vaex. I just started using it to speed up some groupby + agg.nunique operations on a ~40-million-row DataFrame in a Jupyter notebook.
It works much faster than pandas and I am really excited to use it more often, but sometimes I run into a weird problem:
executing a simple vaex notebook cell with a simple filter like:
vf[vf.Item_count > 1]
finishes in under 1 s, but when I run the same cell again it can take several minutes and won't respond to a keyboard interrupt.
I run vaex and pandas inside VS Code on a Windows 10 machine with 32 GB of RAM.
My DataFrame in pandas takes around 1 GB.
Could you please help me work around these slowdowns?
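For reference, here is a minimal sketch of the kind of workflow described above (the column names Item, User and Item_count are made up for illustration and are not from the original post):

import vaex
import pandas as pd

df = pd.read_csv('data.csv')                    # hypothetical source of the ~40M rows
vf = vaex.from_pandas(df, copy_index=False)     # hand the pandas frame to vaex

# groupby + nunique, the kind of aggregation mentioned in the question
per_item = vf.groupby(by='Item', agg={'unique_users': vaex.agg.nunique('User')})

# the filter from the question; vaex builds this lazily and only does the
# work when the result is materialised, e.g. by printing or calling len()
fast = vf[vf.Item_count > 1]
print(len(fast))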

Related

Do I need to free array and dataframe memory in jupyter notebook (python 3)?

Do I have to manually free arrays and dataframes as in C and C++?
I'm currently working on large arrays (thousands by thousands of elements), about 100-300 MB as CSV.
I'm using Jupyter Notebook (Python 3) to do normalization and other data handling, and have run my code hundreds of times.
I'm new to Python, and it only recently occurred to me that a memory leak might be happening. (I'm not getting any errors.)
I've read up a bit on gc.collect() but had trouble understanding it.
Any help is appreciated, thanks!
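For what it's worth, the usual pattern in a notebook is to delete the last reference to a large object and then ask the garbage collector to run; a minimal sketch (the file and variable names are made up):

import gc
import numpy as np

raw = np.loadtxt('data.csv', delimiter=',')                 # hypothetical large array
normalized = (raw - raw.mean(axis=0)) / raw.std(axis=0)     # normalization step

del raw           # drop the name so nothing references the big array any more
gc.collect()      # ask Python to reclaim that memory now rather than later

Note that in Jupyter the values displayed as cell output are also kept alive in the Out cache, so deleting your own variable is not always enough to release everything.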

Why does it take longer than plain pandas when I use modin.pandas [ray]?

I'm just a Python newbie who's had fun dealing with data with Python.
When I first got to use pandas, Python's go-to data tool, it seemed like it would handle Excel-sized jobs very quickly.
However, I was somewhat disappointed to see it take 1 to 2 minutes to load an .xlsx file with 470,000 rows, and as a result I found out that using modin with ray (or dask) should enable faster operation.
After learning how to use it as shown below, I compared it to using pandas alone (this time on 100M rows of data, about 5 GB).
import ray
ray.init()
import modin.pandas as md
%%time
TB = md.read_csv('train.csv')
TB
But it took 1 minute and 3 seconds to read the data with pandas, and 1 minute and 9 seconds with modin [ray].
I was disappointed to see that modin actually took longer, even if only by a small margin.
How can I make modin faster than pandas? Only with complex operations such as groupby or merge? Is there little difference in simply reading data?
Other people report that modin is faster at reading data, so is there something wrong with my computer's settings? I want to know why.
Here is how I installed the packages at the prompt, in case it matters:
!pip install modin[ray]
!pip install ray[default]
First off, to do a fair assessment you should always use the %%timeit magic command, which gives you an average over multiple runs.
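For example, timing each library in its own cell might look roughly like this (a sketch, reusing the train.csv from the question; each snippet goes at the top of its own cell):

%%timeit
import pandas as pd
TB = pd.read_csv('train.csv')     # pandas baseline, averaged over several runs

%%timeit
import modin.pandas as md
TB = md.read_csv('train.csv')     # modin[ray] for comparison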
Modin generally works best when you have:
Very large files
Large number of cores
The unimpressive performance in your case is, I believe, largely due to the multiprocessing management done by Ray/Dask, e.g. worker scheduling and all the setup that goes into parallelisation. When you meet at least one of the two criteria above (especially the first, given any current processor), the trade-off between the resource management and the speed-up you get from Modin works in your favour, but neither a 5 GB file nor 6 cores is large enough to tip it. Parallelisation is costly, and the task must be worth it.
If it is a one-off, 1-2 minutes is not an unreasonable amount of time at all for this sort of thing. If it is a file that you are going to read and write continuously, I would recommend converting it to HDF5 or pickle format, in which case your read/write performance will improve far more than by just using Modin.
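A rough sketch of that one-off conversion (note that pandas' to_hdf needs the PyTables package installed):

import pandas as pd

df = pd.read_csv('train.csv')                    # slow CSV parse, done once
df.to_hdf('train.h5', key='train', mode='w')     # or: df.to_pickle('train.pkl')

# later sessions skip the CSV parsing entirely
df = pd.read_hdf('train.h5', key='train')        # or: pd.read_pickle('train.pkl')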
Alternatively, Vaex is the fastest option around for reading a dataframe. That said, I personally think it is still quite incomplete and sometimes doesn't live up to the promises made about it beyond simple numerical operations, e.g. when you have large strings in your data.
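If you do go down the Vaex route, the pattern I have in mind is roughly the following (convert=True writes an HDF5 copy next to the CSV on the first run, so later opens are close to instant; the exact output filename may differ between versions):

import vaex

vdf = vaex.from_csv('train.csv', convert=True)   # first run converts the CSV to HDF5
vdf = vaex.open('train.csv.hdf5')                # later runs memory-map the converted file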

Merge two data frames in pandas giving "The kernel appears to have died. It will restart automatically." using Jupyter notebook

I want to merge two data frames using the merge function in pandas. When I do so on a common column, Jupyter Notebook gives me the following error: "The kernel appears to have died. It will restart automatically." Each data frame is about 50k rows, but when I try the same thing with only 50 rows from each data frame it works fine. I was wondering if anyone has a suggestion.
This is most likely a RAM/memory issue on your machine. Check how much RAM you have and monitor it while you do the merge operation.
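One way to sanity-check this before and during the merge, and to shrink the footprint, is sketched below (df1, df2, the key column 'id' and the other column names are placeholders, not from the original post; psutil is a third-party package):

import pandas as pd
import psutil

print(psutil.virtual_memory())           # how much RAM is available right now?

# a many-to-many merge on a duplicated key can blow up the result;
# estimate the number of output rows before running the real merge
left_counts = df1['id'].value_counts()
right_counts = df2['id'].value_counts()
est_rows = int((left_counts * right_counts).dropna().sum())
print('estimated merged rows:', est_rows)

# merge only the columns you actually need, to keep memory down
merged = df1[['id', 'col_a']].merge(df2[['id', 'col_b']], on='id', how='inner')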

Why does my Spark run slower than pure Python? Performance comparison

Spark newbie here. I tried to do some pandas-style operations on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package). Here's what I did:
1)
In Spark:
train_df.filter(train_df.gender == '-unknown-').count()
It takes about 30 seconds to get results back. But using Python it takes about 1 second.
2) In Spark:
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()
Same thing, takes about 30 sec in Spark, 1 sec in Python.
Several possible reasons why my Spark is much slower than pure Python:
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
2) My spark is running locally and I should run it in something like Amazon EC2 instead.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)
Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!
Python will definitely perform better compared to pyspark on smaller data sets. You will see the difference when you are dealing with larger data sets.
By default, when you run Spark through an SQLContext or HiveContext it uses 200 shuffle partitions. You need to change that to 10, or whatever value suits your data, using sqlContext.sql("set spark.sql.shuffle.partitions=10"). It will definitely be faster than with the default.
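For example, depending on your Spark version, that could look roughly like one of the following (the second form assumes a Spark 2.x+ SparkSession named spark):

# Spark 1.x style, matching the sqlContext used in the question
sqlContext.sql("set spark.sql.shuffle.partitions=10")

# Spark 2.x+ style, via the SparkSession configuration
spark.conf.set("spark.sql.shuffle.partitions", "10")

# the aggregation from the question now shuffles into far fewer partitions
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()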
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
You are right, you will not see much difference at lower volumes. Spark can be slower as well.
2) My spark is running locally and I should run it in something like Amazon EC2 instead.
For your volume it might not help much.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
Again, it does not matter for a ~24 MB data set.
4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)
In standalone mode there will be a difference: Python has more runtime overhead than Scala, but on a larger cluster with distributed computation it need not matter.

Jupyter notebook kernel dies when creating dummy variables with pandas

I am working on the Walmart Kaggle competition and I'm trying to create dummy columns from the "FinelineNumber" column. For context, df.shape returns (647054, 7). df['FinelineNumber'] has 5,196 unique values, so the result should be a dataframe of shape (647054, 5196), which I then plan to concat to the original dataframe.
Nearly every time I run fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl'), I get the following error message: "The kernel appears to have died. It will restart automatically." I am running Python 2.7 in Jupyter Notebook on a MacBook Pro with 16 GB RAM.
Can someone explain why this is happening (and why it happens most of the time but not every time)? Is it a Jupyter Notebook or pandas bug? I also thought it might be due to insufficient RAM, but I get the same error on a Microsoft Azure Machine Learning notebook with >100 GB of RAM. On Azure ML, the kernel dies every time, almost immediately.
It very much could be memory usage: a 647054 × 5196 data frame has 3,362,092,584 elements, which at 8 bytes each is roughly 25 GiB just for the pointers to the objects on a 64-bit system. On Azure ML, while the VM has a large amount of memory, you're actually limited in how much memory you have available (currently 2 GB, soon to be 4 GB), and when you hit that limit the kernel typically dies. So it seems very likely it is a memory usage issue.
You might try calling .to_sparse() on the data frame before doing any additional manipulations. That should allow pandas to keep most of the data frame out of memory.
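On more recent pandas versions, get_dummies can also produce a sparse result directly, which keeps the memory for a mostly-zero 647054 × 5196 frame small; a sketch along those lines:

import pandas as pd

# sparse=True stores each dummy column as a SparseArray, so the mostly-zero
# matrix takes a tiny fraction of the dense memory footprint
fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl', sparse=True)
df = pd.concat([df, fineline_dummies], axis=1)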
