Pandas: merge gives "array is too big", how to merge in parts? - python

When trying to merge two dataframes using pandas I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8GB of RAM (is this limited just by my RAM or is it built into the pandas system?).
I know that once I have the merged table I will calculate a new column and then filter the rows, looking for the maximum values within groups. Therefore the final output table will be only 2.5 million rows.
How can I break this problem up so that I can execute this merge method on smaller parts and build up the output table, without hitting my RAM limitations?
The method below works correctly for this small data, but fails on the larger, real data:
import pandas as pd
import numpy as np

# Create input tables
t1 = {'scenario': [0, 0, 1, 1],
      'letter': ['a', 'b'] * 2,
      'number1': [10, 50, 20, 30]}
t2 = {'letter': ['a', 'a', 'b', 'b'],
      'number2': [2, 5, 4, 7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

# Merge the two and create the new column. On the real data this raises
# "ValueError: array is too big."
table3 = pd.merge(table1, table2, on='letter')
table3['calc'] = table3['number1'] * table3['number2']

# Filter, keeping the rows where 'calc' is maximal per scenario+letter
table3 = table3.loc[table3.groupby(['scenario', 'letter'])['calc'].idxmax()]
This is a follow-up to two previous questions:
Does iterrows have performance issues?
What is a good way to avoid using iterrows in this example?
I answer my own question below.

You can break up the first table using groupby (for instance, on 'scenario'). It can make sense to first create a new column that splits the data into groups of exactly the size you want. Then iterate over those groups, and for each one: merge it with the second table, compute the new column, filter, and append the small result to your final output table.
As explained in "Does iterrows have performance issues?", iterating is slow, so keep the groups large and let pandas use its most efficient methods within each one; pandas is relatively quick when it comes to merging.
Following on from the code above that creates the input tables:
chunks = []
for _, group in table1.groupby('scenario'):
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # keep only the row with the maximum 'calc' per letter within this scenario
    chunks.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
table3 = pd.concat(chunks, ignore_index=True)

Related

Pyspark dataframe returns different results each time I run

Every time I run a simple groupby, PySpark returns different values, even though I haven't made any modification to the dataframe.
Here is the code I am using:
df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.show(df_check.count(),False)
I ran df_check.show() 3 times and the paciente_id_count column gives different values every time: show results (I cut the tables so it would be easier to compare).
How do I prevent this?
Calling .show() does not necessarily compute the whole set of operations.
Maybe you could try the following (if the final number of rows fits in your driver memory):
df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count')).select("*")
df_check.toPandas()
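If the goal is simply to get a stable snapshot to inspect, another option (a sketch of my own, not part of the original answer) is to persist the aggregated result so that every subsequent action reuses the same computed rows:
from pyspark.sql.functions import count, desc

df = spark.sql('select * from data ORDER BY document_id')
df_check = df.groupby("vacina_descricao_dose").agg(count('paciente_id').alias('paciente_id_count')).orderBy(desc('paciente_id_count'))

# Materialize the aggregate once; later .show() calls reuse the cached rows.
df_check.cache()
df_check.count()
df_check.show(truncate=False)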

Does Dask guarantee that rows inside partition (with a non-unique index) will never be reordered?

My application needs to read a dataset into Dask, spread across multiple partitions. With that dataframe I need to do multiple operations on it (e.g. subtracting one column from another, or finding the ratio of two columns). The index of the dataframe is a non-unique column.
Because the application is entirely metadata driven, the order of the function calls is not known until runtime, so I have designed the application to rely on returning a new delayed dataframe at each stage. I wondered if some clever use of partitioning and column-wise concatenation could help me make this code efficient.
Given that these steps are independent of each other, in the specific example below can I trust the last operation to give the proper result for my row-wise ratio? i.e. If I carry out operations that only add new columns to dataframes, can I trust that the ordering of the rows will never change?
from copy import copy

def subtract(df1, df2, col1, col2):
    df_mod = copy(df1)
    df_mod[f"{col1}-{col2}"] = df1[col1] - df2[col2]
    return df_mod

def ratio(df1, df2, col1, col2):
    df_mod = copy(df1)
    # Rely on the row ordering being unchanged
    df_mod[f"{col1}/{col2}"] = df1[col1] / df2[col2]
    return df_mod

df = load_function_returns_dask_df()
first = subtract(df, df, "a", "b")
second = subtract(df, df, "c", "d")
last = ratio(first, second, "a-b", "c-d")
I understand that I could operate directly on the dataframe to create a new column, but this does not work in the general case for arbitrary operations.
Intuitively it makes sense to me that this operation should work, since each partition is just a pandas dataframe, and it makes no sense for pandas to reorder the rows in a dataframe arbitrarily, but I was hoping for some way of verifying this more formally.
Correct: Dask will not reorder the rows within a partition, as long as you stick to pandas operations that do not themselves reorder rows (a sort obviously would), which is true for any row-wise computation.
Indeed, the order of the partitions themselves is also preserved as the data passes through operation after operation.
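There is no formal proof in the docs to point at here, but a quick empirical check is easy to write. The sketch below (my own, with made-up data and a non-unique index) runs the subtract step from the question and compares the row-wise ratio against plain pandas:
import pandas as pd
import dask.dataframe as dd

# Small frame with a non-unique index, split across two partitions
pdf = pd.DataFrame(
    {"a": range(8), "b": range(8, 16), "c": range(16, 24), "d": range(24, 32)},
    index=[0, 0, 1, 1, 2, 2, 3, 3],
)
ddf = dd.from_pandas(pdf, npartitions=2)

def subtract(df1, df2, col1, col2):
    df_mod = df1.copy()
    df_mod[f"{col1}-{col2}"] = df1[col1] - df2[col2]
    return df_mod

first = subtract(ddf, ddf, "a", "b")
second = subtract(ddf, ddf, "c", "d")
result = (first["a-b"] / second["c-d"]).compute()

# Row-wise reference computed directly in pandas; equal values mean the
# row order was preserved through the column-adding operations.
expected = (pdf["a"] - pdf["b"]) / (pdf["c"] - pdf["d"])
assert (result.values == expected.values).all()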

How to efficiently match values from 2 series and add them to a dataframe

I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by scanning the whole "czmatcher.csv" for the matching row, then copies the value using .loc. The problem is that this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says it should be faster.
The picture linked below as "File example" illustrates what the csv files look like. The idea is to construct the "Wanted result (CZ)" column, name it CZ and add it to the dataframe.
File example
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge on your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
I assume the key column is the county code, which is called geography in data and FIPS in czm. Since the names differ, either merge on both names:
data_final = data.merge(czm, how='left', left_on='geography', right_on='FIPS')
or rename the column first and merge on the shared name:
czm = czm.rename(columns={'FIPS': 'geography'})
data_final = data.merge(czm, how='left', on='geography')
Read the docs for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
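As a further vectorized alternative (my own sketch, using the column names from the question), you can build a lookup Series from the matcher table once and map it over the geography column:
import pandas as pd

data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")

# FIPS -> CZID lookup, applied to every row at once instead of looping
fips_to_cz = czm.set_index('FIPS')['CZID']
data['CZ'] = data['geography'].map(fips_to_cz)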
In order to make it faster without reworking your whole solution, I would recommend using Dask DataFrames. To put it simply, Dask reads your csv in chunks and processes them in parallel; after reading, you can call the .compute() method to get a pandas DataFrame instead of a Dask one.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # IMPORT DASK DATAFRAMES

# YOU NEED TO USE dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute()
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute()

sLength = len(data['geography'])
data['CZ'] = 0

# this is just to fill the first value
for j in range(0, len(czm)):
    if data.loc[0, 'geography'] == czm.loc[0, 'FIPS']:
        data.loc[0, 'CZ'] = czm.loc[0, 'CZID']

# now fill the rest
for i in range(1, sLength):
    if data.loc[i, 'geography'] == data.loc[i-1, 'geography']:
        data.loc[i, 'CZ'] = data.loc[i-1, 'CZ']
    else:
        for j in range(0, len(czm)):
            if data.loc[i, 'geography'] == czm.loc[j, 'FIPS']:
                data.loc[i, 'CZ'] = czm.loc[j, 'CZID']

Best way to extract and save values with the same keys from multiple RDDs

I've created two RDDs in PySpark with data extracted from HBase. I want to gather items with the same row keys, store the items and then search through values associated with each of the items. Ideally I'd store the results in a pyspark.sql object, since I want to apply Levenshtein distance to their content.
Details:
In the HBase I have location data, where a row key is the geohash of a given area, and in the columns there are multiple venues in the area with more details (json with description and other text data) on the location.
I have two HBase tables and the locations can be the same in both of them. I want to search the data in those two RDDs, check for similar geohashes and store the results in a new data structure.
I don't want to reinvent the wheel and I've just started learning Spark, so I'm wondering: what's the best way to do such a task? Is the built-in function rdd.intersection a good solution?
Edited: thanks to #Aneel's comments I could correct some of my mistakes.
There is in fact a join call on RDDs that gives the same result as a JOIN in Spark SQL (the join is done on the first element of each RDD record, and the values are a tuple of the remaining elements from both RDDs), rather than the cogroup I pointed to previously, since as #Aneel pointed out cogroup squashes the key-value pairs under one single key.
On a different note, I tried #Aneel's method and the gist above and benchmarked them a little, using Databricks' community edition (a very small cluster with 6GB of memory, 1 core and Spark 2.1); here is the link. (The code is also at the end of the post.)
Here are the results:
For a 100000 sized list:
Spark SQL: 1.32s
RDD join: 0.89s
For a 250000 sized list:
Spark SQL: 2.2s
RDD join: 2.0s
For a 500000 sized list:
Spark SQL: 3.6s
RDD join: 4.6s
For a 1000000 sized list:
Spark SQL: 7.7s
RDD join: 10.2s
For a 10000000 sized list (here I ran timeit with only 10 tests, or it would be running until Christmas; the precision is therefore lower):
Spark SQL: 57.6s
RDD join: 89.9s
So it looks like for small datasets the RDD join is faster than the DataFrame one, but once you reach a threshold (around 250k records) the DataFrame join starts to be faster.
Now, as #Aneel suggested, bear in mind that I made a pretty simple example, and you might want to do some testing on your own data and environment (I did not go further than 10M lines in my 2 lists because initialization already took 2.6 min).
Initialization code:
#Init code
NUM_TESTS = 100
from random import randint
import timeit

l1 = []
l2 = []
for i in xrange(0, 10000000):
    t = (randint(0, 2000), randint(0, 2000))
    v = randint(0, 2000)
    l1.append((t, v))
    if randint(0, 100) > 25:  # at least 25% of the keys should be similar
        t = (randint(0, 2000), randint(0, 2000))
        v = randint(0, 2000)
        l2.append((t, v))
rdd1 = sc.parallelize(l1)
rdd2 = sc.parallelize(l2)
Spark SQL test:
#Test Spark SQL
def callable_ssql_timeit():
    df1 = spark.createDataFrame(rdd1).toDF("id", "val")
    df1.createOrReplaceTempView("table1")
    df2 = spark.createDataFrame(rdd2).toDF("id", "val")
    df2.createOrReplaceTempView("table2")
    query = "SELECT * FROM table1 JOIN table2 ON table1.id=table2.id"
    spark.sql(query).count()

print(str(timeit.timeit(callable_ssql_timeit, number=NUM_TESTS)/float(NUM_TESTS)) + "s")
RDD join test:
#Test RDD join
def callable_rdd_timeit():
    rdd1.join(rdd2).count()

print(str(timeit.timeit(callable_rdd_timeit, number=NUM_TESTS)/float(NUM_TESTS)) + "s")
Since you want to use pyspark.sql DataFrames, how about converting the RDDs to them at the outset?
df1 = spark.createDataFrame(rdd1).toDF("geohash", "other", "data")
df1.createOrReplaceTempView("table1")
df2 = spark.createDataFrame(rdd2).toDF("geohash", "other", "data", "fields")
df2.createOrReplaceTempView("table2")
spark.sql("SELECT * FROM table1 JOIN table2 ON table1.geohash = table2.geohash").show()
If you want to operate on similar (non-identical) geohashes, you can register a user defined function to calculate the distance between them.
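For plain edit distance between geohash strings a full Python UDF may not even be needed, since Spark ships a built-in Levenshtein function. A rough sketch (my own, not part of the original answer; the threshold and the cross join are illustrative only and would be expensive on large data):
from pyspark.sql import functions as F

g1 = df1.select(F.col("geohash").alias("geohash1"))
g2 = df2.select(F.col("geohash").alias("geohash2"))

# All pairs of geohashes with their edit distance; keep the near-identical ones
pairs = g1.crossJoin(g2).withColumn("dist", F.levenshtein("geohash1", "geohash2"))
pairs.filter(F.col("dist") <= 1).show()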

python pandas native select_as_multiple

Suppose I have a DataFrame that is block sparse. By this I mean that there are groups of rows that have disjoint sets of non-null columns. Storing this as one huge table uses more memory in the values (NaN filling), and unstacking the table to rows creates a large index (at least it appears that way when saving to disk ... I'm not 100% clear whether there is some efficient MultiIndexing that is supposed to be going on).
Typically, I store the blocks as separate DataFrames in a dict or list (dropping the NaN columns) and make a class that has almost the same API as a DataFrame, 'manually' passing the queries to the blocks and concatenating the results. This works well but involves a small amount of special code to store and handle these objects.
Recently, I've noticed that pytables provides a feature similar to this but only for the pytables query api.
Is there some way of handling this natively in pandas? Or am I missing some simpler way of getting a solution that is similar in performance?
EDIT: Here is a small example dataset
import string
import numpy as np
import pandas as pd

# create some data and put it in a list of blocks (d)
m, n = 10, 6
s = list(string.ascii_uppercase)
A = np.array([s[x] * (1 + x % 3) for x in np.random.randint(0, 26, m * n)]).reshape(m, n)
df = pd.DataFrame(A)
d = list()
d += [df.iloc[: m // 2, : n // 2]]
d += [df.iloc[m // 2:, n // 2:]]
# 1. uses lots of memory, fills the gaps with NaN
d0 = pd.concat(d)  # same shape as the original df, but NaN outside the blocks
# 2. maybe ok, not sure how this is handled across different pandas versions
d1 = pd.concat([x.unstack() for x in d])
# want this to work however the blocks are stored
print(d0.loc[[0, 8], [2, 5]])
# this raises an exception
sdf = pd.SparseDataFrame(df)
You could use HDFStore this way:
Store the different blocks as separate tables that share a common index (which is itself a column). Only the non-all-NaN rows of each table are stored, so if you group your columns intelligently (e.g. put the ones that tend to be sparse in the same places together), I think you can achieve a 'sparse'-like layout.
You can compress the tables if necessary.
You can then query an individual table, and use the resulting coordinates to pull the matching rows from the other tables (this is what select_as_multiple does).
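A minimal sketch of that layout (my own illustration, with made-up file and table names), assuming two column groups that share a common index:
import numpy as np
import pandas as pd

idx = pd.Index(range(5), name="row_id")
cols_ab = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"], index=idx)
cols_cd = pd.DataFrame(np.random.randn(5, 2), columns=["c", "d"], index=idx)

with pd.HDFStore("blocks.h5", mode="w", complevel=9, complib="blosc") as store:
    # one table per column group; data_columns makes the columns queryable
    store.append("group_ab", cols_ab, data_columns=True)
    store.append("group_cd", cols_cd, data_columns=True)

    # query one table and pull the matching rows from the others;
    # the 'selector' table drives the row selection
    result = store.select_as_multiple(
        ["group_ab", "group_cd"], where="a > 0", selector="group_ab"
    )

print(result)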
Can you provide a small example and the rough size of the data set, e.g. number of rows, columns, disjoint groups, etc.?
What do your queries look like? This is generally how I approach the problem: figure out how you are going to query the data; that will define how you lay it out in storage.
