I want to count the number of unique rows in my data. Below is a quick input/output example.
#input
A,B
0,0
0,1
1,0
1,0
1,1
1,1
#output
A,B,count
0,0,1
0,1,1
1,0,2
1,1,2
The data in my pipeline has more than 5000 columns and more than 1M rows; each cell is a 0 or a 1. Below are my two attempts at scaling with Dask (with 26 columns):
import numpy as np
import pandas as pd
import string
import time

import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=6, threads_per_worker=2, processes=True)

columns = list(string.ascii_uppercase)
data = np.random.randint(2, size=(1000000, len(columns)))
ddf_parent = dd.from_pandas(pd.DataFrame(data, columns=columns), npartitions=20)

# 1st solution: concatenate every row into a single string, then group by it
ddf = ddf_parent.astype(str)
ddf_concat = ddf.apply(''.join, axis=1).to_frame()
ddf_concat.columns = ['pattern']
ddf_concat = ddf_concat.groupby('pattern').size()

start = time.time()
ddf_concat = ddf_concat.compute()
print(time.time() - start)

# 2nd solution: group by all the columns directly
ddf_concat_other = ddf_parent.groupby(list(ddf_parent.columns)).size()

start = time.time()
ddf_concat_other = ddf_concat_other.compute()
print(time.time() - start)
results:
9.491615056991577
12.688117980957031
The first solution concatenates every column into a string and then runs the group-by on it. The second one just groups by all the columns. I am leaning toward the first one as it is faster in my tests, but I am open to suggestions. Feel free to completely change my solution if there is anything better in terms of performance. (Also, interestingly, sort=False does not speed up the group-by, which may actually be related to https://github.com/dask/dask/issues/5441 and https://github.com/rapidsai/cudf/issues/2717.)
NOTE:
After some testing, the first solution scales relatively well with the number of columns. I guess one improvement could be to hash the strings so the keys always have a fixed length (see the sketch below). Any suggestion on the number of partitions in this case? From the remote dashboard I can see that after a couple of operations the computational graph reduces to only 3 nodes, so the other available workers are not being used.
The second solution fails as the number of columns increases.
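Something like this is what I have in mind for the hashing idea (untested at 5000+ columns; pandas.util.hash_pandas_object gives one fixed-width uint64 per row, at the cost of a tiny collision risk and of no longer being able to read the original column values back from the key):
# Sketch: fixed-length row fingerprints instead of long concatenated strings.
hashed = ddf_parent.map_partitions(
    lambda part: pd.util.hash_pandas_object(part, index=False),
    meta=('row_hash', 'uint64'),
)
counts = hashed.value_counts().compute()
print(counts.head())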
NOTE2:
Also, with the first solution, something really strange is happening with what I guess is how Dask schedules and maps operations. After some time a single worker gets many more tasks than the others, then that worker exceeds 95% of its memory and crashes; the tasks are then redistributed correctly, but after a while another worker accumulates more tasks and the cycle restarts. The pipeline runs fine, but I was wondering if this is the expected behavior. Attached is a screenshot:
I have a dataframe that has 2 columns of zip codes, and I would like to add another column with the distance between them. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
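For completeness, a minimal self-contained sketch of the vectorized call (the column names and sample zip codes below are placeholders, not taken from your data):
import pandas as pd
import pgeocode

dist = pgeocode.GeoDistance('us')

# Placeholder data; the real dataframe would have ~500,000 rows.
zips = pd.DataFrame({
    'zipstart': ['90210', '10001'],
    'zipend':   ['94105', '02139'],
})

# One vectorized call over whole columns instead of one call per row.
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
print(zips)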
Is this a valid way of loading subsets of a Dask dataframe into memory:
i = 0
while i < len_df:
    j = i + batch_size
    if j > len_df:
        j = len_df
    subset = df.loc[i:j, 'source_country_codes'].compute()
    # ... use subset ...
    i = j
I read somewhere that this may not be correct because of how Dask assigns index numbers when it divides the bigger dataframe into smaller pandas dataframes. Also, I don't think Dask dataframes have an iloc attribute.
I am using version 0.15.2
In terms of use cases, this would be a way of loading batches of data for deep learning (say, Keras).
If your dataset has well known divisions then this might work, but instead I recommend just computing one partition at a time.
for part in df.to_delayed():
    subset = part.compute()
You can roughly control the size by repartitioning beforehand
for part in df.repartition(npartitions=100).to_delayed():
    subset = part.compute()
This isn't exactly the same, because it doesn't guarantee a fixed number of rows in each partition, but that guarantee might be quite expensive, depending on how the data is obtained.
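If the end goal is feeding batches to a training loop, a rough sketch under that assumption could look like this (df is assumed to be a dask.dataframe.DataFrame, and train_step is a placeholder for whatever consumes a batch, e.g. Keras' model.train_on_batch; partition sizes are only approximate):
def partition_batches(df, npartitions=100):
    """Yield one pandas DataFrame per Dask partition (sizes are approximate)."""
    for part in df.repartition(npartitions=npartitions).to_delayed():
        yield part.compute()

# Hypothetical training loop; replace train_step with your own batch consumer.
for batch in partition_batches(df):
    train_step(batch)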
I have a python program that crunches a large dataset using Pandas. It currently takes about 15 minutes to complete. I want to log the progress of the task (to stdout, and send the metric to Datadog). Is there a way to get the %-complete of the task (or a function)? In the future, I might be dealing with larger datasets. The Python task that I am doing is a simple grouping of a large pandas data frame. Something like this:
dfDict = {}
for cat in categoryList:
    df1 = df[df['category'] == cat]
    if len(df1.index) > 0:
        df1[dateCol] = pd.to_datetime(df[dateCol])
        dfDict[cat] = df1
Here, categoryList has about 20,000 items, and df is a large data frame with (say) 5 million rows.
I am not looking for anything fancy (like progress bars), just a percentage-complete value. Any ideas?
Thanks!
You can modify the following according to your needs.
from time import sleep

for i in range(12):
    sleep(1)
    print("\r\t> Progress\t:{:.2%}".format((i + 1)/12), end='')
What this basically does is prevent print() from writing the default end character (end='') and, at the same time, write a carriage return ('\r') before anything else. In simple terms, you are overwriting the previous print() statement.
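Applied to your loop, it might look roughly like this (categoryList, df, dateCol and dfDict are the names from your snippet, so this assumes that setup):
total = len(categoryList)
dfDict = {}
for i, cat in enumerate(categoryList, start=1):
    df1 = df[df['category'] == cat]
    if len(df1.index) > 0:
        df1[dateCol] = pd.to_datetime(df[dateCol])
        dfDict[cat] = df1
    # Overwrite the same console line with the current percentage.
    print("\r\t> Progress\t:{:.2%}".format(i / total), end='')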
The naive solution would be to just use the total number of rows in your dataset and the index you are at, then calculate the progress:
size = len(dataset)
for index, element in enumerate(dataset):
    print(index / size * 100)
This will only be somewhat reliable if every row takes around the same time to complete. Because you have a large dataset, it might average out over time, but if some rows take a millisecond, and another takes 10 minutes, the percentage will be garbage.
Also consider rounding the percentage to one decimal:
size = len(dataset)
for index, element in enumerate(dataset):
    print(round(index / size * 100, 1))
Printing for every row might slow your task down significantly, so consider this improvement:
size = len(dataset)
percentage = 0
for index, element in enumerate(dataset):
    new_percentage = round(index / size * 100, 1)
    if percentage != new_percentage:
        percentage = new_percentage
        print(percentage)
There are, of course, also modules for this:
progressbar
progress
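For example, a minimal sketch with the progress package (Bar is its simplest indicator; the loop body stands in for your per-row work):
from progress.bar import Bar

bar = Bar('Processing', max=len(dataset))
for element in dataset:
    # ... do the per-row work here ...
    bar.next()
bar.finish()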
This is the simplest DataFrame I could think of. I'm using PySpark 1.6.1.
# one row of data
rows = [ (1, 2) ]
cols = [ "a", "b" ]
df = sqlContext.createDataFrame(rows, cols)
So the data frame completely fits in memory, has no references to any files and looks quite trivial to me.
Yet when I collect the data, it uses 2000 executors:
df.collect()
During the collect, 2000 executors are used:
[Stage 2:===================================================>(1985 + 15) / 2000]
and then the expected output:
[Row(a=1, b=2)]
Why is this happening? Shouldn't the DataFrame be completely in memory on the driver?
So I looked into the code a bit to try to figure out what was going on. It seems that sqlContext.createDataFrame really does not make any kind of attempt to set reasonable parameter values based on the data.
Why 2000 tasks?
Spark uses 2000 tasks because my data frame had 2000 partitions. (Even though it seems like clear nonsense to have more partitions than rows.)
This can be seen by:
>>> df.rdd.getNumPartitions()
2000
Why did the DataFrame have 2000 partitions?
This happens because sqlContext.createDataFrame winds up using the default number of partitions (2000 in my case), irrespective of how the data is organized or how many rows it has.
The code trail is as follows.
In sql/context.py, the sqlContext.createDataFrame function calls (in this example):
rdd, schema = self._createFromLocal(data, schema)
which in turn calls:
return self._sc.parallelize(data), schema
And the SparkContext.parallelize function, defined in context.py, sets:
numSlices = int(numSlices) if numSlices is not None else self.defaultParallelism
No check is done on the number of rows, and it is not possible to specify the number of slices from sqlContext.createDataFrame.
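As a workaround, you can build the RDD yourself with an explicit numSlices and hand it to createDataFrame (a sketch; sc is assumed to be the underlying SparkContext, and numSlices=1 is simply what makes sense for a one-row frame):
# Parallelize with an explicit number of slices, then wrap the RDD in a DataFrame.
rdd = sc.parallelize(rows, numSlices=1)
df = sqlContext.createDataFrame(rdd, cols)

df.rdd.getNumPartitions()  # 1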
How can I change how many partitions the DataFrame has?
Using DataFrame.coalesce.
>>> smdf = df.coalesce(1)
>>> smdf.rdd.getNumPartitions()
1
>>> smdf.explain()
== Physical Plan ==
Coalesce 1
+- Scan ExistingRDD[a#0L,b#1L]
>>> smdf.collect()
[Row(a=1, b=2)]
You can configure the number of executors. In many cases, Spark will take as many executors as are available, and execution time can be a lot worse than when you limit it to a small number of executors.
conf = SparkConf()
conf.set('spark.dynamicAllocation.enabled','true')
conf.set('spark.dynamicAllocation.maxExecutors','32')
I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a Series, so ~15K-20K Series, each containing ~250K rows). Then I create a SparseDataFrame containing every index within these Series (because, as you notice, this is a large but not very dense dataset). The issue is that this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've also tried batching the merges (select a subset of the data, merge these into a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. By slow I mean it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part which struck me as odd is why my column count of the main DataFrame affects the append speed. Because my main index already contains all entries it will ever see, I shouldn't have to lose time due to re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6

df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity)) for mz, intensity in i.scans]),
                        dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices

while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time()-t, time.time()-t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t, time.time()-t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means first writing it to disk in some other format.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use. So main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 a second to be realistically useful.
3) I have 32G of memory, so I think a SparseDataFrame is the best option, since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
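To sketch what that looks like in practice (the per-column dicts below are made up for illustration, not my actual pipeline data):
import numpy as np
from scipy import sparse

# Hypothetical per-column data: one {row_key: value} dict per column.
columns = [{0.1: 3, 0.5: 7}, {0.5: 2, 0.9: 1}]

# Hand-maintained mapping from row key to row position.
row_keys = sorted(set().union(*columns))
row_pos = {k: i for i, k in enumerate(row_keys)}

rows, cols, vals = [], [], []
for col_idx, col in enumerate(columns):
    for key, value in col.items():
        rows.append(row_pos[key])
        cols.append(col_idx)
        vals.append(value)

# Build once from triplets; "appends" are just list.append calls, so they stay cheap.
mat = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(len(row_keys), len(columns)),
                        dtype=np.uint16).tocsc()

# Row/column slicing roughly analogous to main_df.loc[row:row_end, col:col_end].
block = mat[0:2, 0:2].toarray()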