Efficiently apply a function to every possible pair from a Pandas Series - python

I have an indexed Pandas Series with 20k entries. Each entry is an array of strings.
id | value
0 | ['abc', 'abc', 'def']
1 | ['bac', 'c', 'def', 'a']
2 | ...
...|
20k| ['aaa', 'rzt']
I want to compare each entry (a list of strings) with every other entry of the series. I have a complex comparison function which takes two lists of strings and returns a float.
The result should be a matrix.
id | 0 | 1 | 2 | ... | 20k
0 | 1 0.5 0.4
1 | 0.5 1 0.2
2 | 0.4 0.2 1
...|
20k|
A double loop computing the result of every matrix element takes my computer more than 3 hours.
How can I efficiently apply/parallelise my comparison function? I tried broadcasting using numpy arrays without success (no speedup).
values = df['value'].values
broadcasted = np.broadcast(values, values[:,None])
result = np.empty(broadcasted.shape)
result.flat = [compare_function(u,v) for (u,v) in broadcasted]
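Broadcasting alone does not help here, because the list comprehension still calls the Python function once per matrix element. A minimal sketch of one way to cut the work roughly in half, assuming the comparison is symmetric and returns 1 for identical entries (as the sample matrix suggests, though the question does not state it). The `compare_function` below is a hypothetical Jaccard stand-in, not the real comparison, and `pairwise_matrix` is an illustrative helper name:

```python
from itertools import combinations

import numpy as np
import pandas as pd


def compare_function(u, v):
    # Hypothetical stand-in for the real comparison: Jaccard similarity.
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def pairwise_matrix(values, func):
    """Fill a symmetric matrix, calling func once per unordered pair."""
    n = len(values)
    result = np.ones((n, n))  # assumption: func(x, x) == 1
    for i, j in combinations(range(n), 2):
        result[i, j] = result[j, i] = func(values[i], values[j])
    return result


series = pd.Series([['abc', 'abc', 'def'], ['bac', 'c', 'def', 'a'], ['aaa', 'rzt']])
matrix = pairwise_matrix(series.tolist(), compare_function)
```

The pairs produced by `combinations` are independent of each other, so they could also be split into chunks and handed to separate worker processes; that part is not shown here.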

Related

Python Pandas: Comparison of elements in Dataframe/series

I have a DataFrame in a variable called "myDataFrame" that looks like this:
+------+-------+--------+
| Type | Count | Status |
+------+-------+--------+
| a    | 70    | 0      |
| a    | 70    | 0      |
| b    | 70    | 0      |
| c    | 74    | 3      |
| c    | 74    | 2      |
| c    | 74    | 0      |
+------+-------+--------+
I am using a vectorized approach to process the rows in this DataFrame, since it has about 116 million rows.
So I wrote something like this:
myDataFrame['result'] = processDataFrame(myDataFrame['status'], myDataFrame['Count'])
In my function, I am trying to do this:
def processDataFrame(status, count):
    resultsList = list()
    if status == 0:
        resultsList.append(count + 10000)
    else:
        resultsList.append(count - 10000)
    return resultsList
But the comparison on the status values raises this error:
Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
What am I missing?
You can do this without a user-defined function, using np.where:
myDataFrame['result'] = np.where(myDataFrame['status']==0,
                                 myDataFrame['Count']+10000,
                                 myDataFrame['Count']-10000)
Update
df.apply(lambda x : processDataFrame(x['Status'],x['Count']),1)
0 [10070]
1 [10070]
2 [10070]
3 [-9926]
4 [-9926]
5 [10074]
dtype: object
I think your function is not really doing the vectorized part.
When it is called, you pass status = myDataFrame['status'], so when it gets to the first if, it checks the condition of myDataFrame['status'] == 0. But myDataFrame['status'] == 0 is a boolean series (of whether each element of the status column equals 0), so it doesn't have a single Truth value (hence the error). Similarly, if the condition could be met, the resultsList would just get the whole "Count" column appended, either all plus 10000 or all minus 10000.
Edit:
Here is a version that keeps your function but uses vectorized pandas operations inside it:
def processDataFrame(status, count):
    status_0 = (status == 0)   # boolean mask, one entry per row
    output = count.copy()      # copy so the original column is not modified in place
    output[status_0] += 10000
    output[~status_0] -= 10000
    return output
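To make the explanation above concrete, here is a small self-contained sketch built on the six sample rows from the question. It shows that the comparison produces a boolean Series, that using this Series in an `if` raises the error, and that `np.where` picks a value per row instead. The names `my_df`, `mask` and `error_message` are illustrative only:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    "Type": ["a", "a", "b", "c", "c", "c"],
    "Count": [70, 70, 70, 74, 74, 74],
    "Status": [0, 0, 0, 3, 2, 0],
})

# The comparison is evaluated element-wise and yields one boolean per row.
mask = my_df["Status"] == 0

# A plain `if` needs a single True/False, so pandas refuses to guess.
try:
    if mask:
        pass
except ValueError as err:
    error_message = str(err)

# np.where selects, row by row, between the two candidate columns.
my_df["result"] = np.where(mask, my_df["Count"] + 10000, my_df["Count"] - 10000)
```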

Converting nested dictionary to dataframe with the keys as rownames and the dictionaries in the values as columns?

I have a dataframe that consists of a large number of frequency counts, where the column labels are features being counted and row labels are the pages in which features are being counted. I need to find the probability of each feature occurring across all pages, so I'm trying unsuccessfully to iterate through each column, dividing each sum by the sum of all columns, and save the result in a dictionary as the value corresponding to a key which is taken from the column labels.
My dataframe looks something like this:
    | Word1 | Word2 |
----|-------|-------|
pg1 | 0     | 1     |
pg2 | 3     | 2     |
pg3 | 9     | 0     |
pg4 | 1     | 6     |
pg5 | 2     | 3     |
pg6 | 0     | 2     |
And I want my output to be a dictionary with the words as the keys and the sum(column) / sum(table) as the values, like this:
{ Word1: .517 , Word2: .483 }
So far I've attempted the following:
dict = {}
for x in df.sum(axis = 0):
    dict[x] = x / sum(df.sum(axis = 0))
print(dict)
but the command never completes. I'm not sure whether I've done something wrong in my code or whether perhaps my laptop simply doesn't have the ability to deal with the size of my dataset.
Can anyone point me in the right direction?
You can take the sum of each column and divide it by the sum of all values in the DataFrame's underlying array, e.g.:
df.sum().div(df.values.sum()).to_dict()
That'll give you:
{'Word1': 0.5172413793103449, 'Word2': 0.4827586206896552}
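For reference, a self-contained sketch of the same computation on the sample table from the question, written out step by step. One likely reason the original loop is slow, offered as a guess rather than a diagnosis, is that it recomputes the grand total for every column; it also keys the dictionary by the column sums rather than the column labels. The variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"Word1": [0, 3, 9, 1, 2, 0], "Word2": [1, 2, 0, 6, 3, 2]},
    index=["pg1", "pg2", "pg3", "pg4", "pg5", "pg6"],
)

column_totals = df.sum()            # one total per column, indexed by column label
grand_total = column_totals.sum()   # computed once, not once per column
probabilities = (column_totals / grand_total).to_dict()
```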

Pandas DataFrame string replace followed by split and set intersection

I have the following pandas DataFrame:
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have a list of integers:
r = [18, 55]
I want to keep only the rules in the DataFrame above that contain every integer in the list r. I tried the following code, but it fails:
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas?
You were going in the right direction; you just need to use .str.split and then apply a function to each row:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
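The same idea spelled out as a self-contained sketch, in case the one-liner is hard to follow. The tokens produced by the split are strings, so the integers in r are converted to strings before the subset test. The names `required`, `tokens`, `keep` and `filtered` are illustrative:

```python
import pandas as pd

d = pd.DataFrame({"rule": ["18#38#123#23=>21", "18#38#23#55=>35"]})
r = [18, 55]

# Compare like with like: the split tokens are strings, not ints.
required = {str(n) for n in r}

# Turn '=>' into the common separator, then split each rule into tokens.
tokens = d["rule"].str.replace("=>", "#", regex=False).str.split("#")

# Keep a rule only if every required number appears among its tokens.
keep = tokens.apply(lambda parts: required.issubset(parts))
filtered = d[keep]
```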
Using str.get_dummies
d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, list(map(str, r))].all(1)
Outputs
0 False
1 True
dtype: bool
Detail:
get_dummies+loc returns
   18  55
0   1   0
1   1   1
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])
def wrap(n):
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)
patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0 False
1 False
2 True
My approach is similar to @RafaelC's answer, but converts the column labels from strings to int:
new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)
# then you can assign new column for initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+-------+-------------------+------------+
| | rule | new_col |
+-------+-------------------+------------+
| 0 | 18#38#123#23=>21 | 10 |
| 1 | 188#38#23#55=>35 | 10 |
| 2 | 18#38#23#55=>35 | 100 |
+-------+-------------------+------------+

What is the fastest way to conditionally change the values of a dataframe in every index and column?

Is there a way, without using a loop, to reduce by a constant every element of a dataframe whose own value satisfies a condition?
For instance, every cell with a value < 2 should have its value reduced by 1.
Thank you very much.
I like to do this with masking.
Here is an inefficient loop using your example
# Example using a loop
for idx, val in df['column'].items():
    if val < 2:
        df.loc[idx, 'column'] = val - 1   # write back; reassigning val alone would not change df
The following code gives the same result, but it will generally be much faster because it does not use a loop.
# Same effect using masks
mask = (df['column'] < 2) #Find entries that are less than 2.
df.loc[mask,'column'] = df.loc[mask,'column'] - 1 #Subtract 1.
I am not sure if this is the fastest, but you can use the .apply function:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.array([[1,2,3], [2,2,2], [4,4,4]]),
                  columns=['x', 'y', 'z'])
def conditional_add(x):
    if x > 2:
        return x + 2
    else:
        return x
df['x'] = df['x'].apply(conditional_add)
This adds 2 to the final row of column x, the only value in that column greater than 2.
Something more like this (using the data from Willie's answer):
df-((df<2)*2)
Out[727]:
   x  y  z
0 -1  2  3
1  2  2  2
2  4  4  4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import modules which are needed
import pandas as pd
import numpy as np
# create example dataframe
df = pd.DataFrame({'A':[3,1,5,0.5,2,0.2]})
| A |
|-----|
| 3 |
| 1 |
| 5 |
| 0.5 |
| 2 |
| 0.2 |
# apply the np.where method with conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)
| A |
|------|
| 3 |
| 0.0 |
| 5 |
| -0.5 |
| 2 |
| -0.8 |
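Since the question asks about every index and column, here is a minimal sketch that applies the same condition to the whole DataFrame at once with `DataFrame.mask`, which replaces the cells where the condition holds. The sample values are made up for illustration:

```python
import pandas as pd

# Made-up sample data; any numeric DataFrame works the same way.
df = pd.DataFrame({"x": [1, 2, 4], "y": [2, 2, 4], "z": [3, 0, 4]})

# Where df < 2 is True, take the value from df - 1; elsewhere keep df.
reduced = df.mask(df < 2, df - 1)
```

Only the cells below 2 change (the 1 in column x and the 0 in column z); `df` itself is left untouched because `mask` returns a new DataFrame.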

Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:
Code | 14 | 17 | 19 | ...
w1 | 0 | 5 | 3 | ...
w2 | 2 | 5 | 4 | ...
w3 | 0 | 0 | 5 | ...
The Code corresponds to a specific location in a rectangular grid, and the ws are different words. I would like to apply a cosine similarity measure between each pair of columns only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.
The desired output would be something like:
| [14,17] | [14,19] | [14,...] | [17,19] | ...
Sim |cs(14,17) |cs(14,19) |cs(14,...) |cs(17,19)..| ...
cs is the result of the cosine similarity for each pair of columns.
Is there any suitable method to do this?
Any help would be appreciated :-)
To apply the cosine metric to each pair from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than using a double Python loop.
Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:
import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
Then all the cosine distances (cdist returns 1 minus the cosine similarity) can be computed with one call to cdist:
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
The values can be wrapped in a new DataFrame and reshaped:
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
Putting it all together, and dropping the pairs that compare a column with itself:
import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
yields the Series
17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
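Note that the values above are cosine distances, so the similarity the question asks for is 1 minus each value. As a cross-check, here is a minimal NumPy-only sketch that computes the similarities directly by scaling each column to unit length and taking dot products. It assumes no column is all zeros (a zero column would cause a division by zero), and the names `unit` and `similarity` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
mask = df.sum(axis=0) > 5

# Scale every column to unit length (assumes no all-zero column).
unit = df / np.linalg.norm(df.to_numpy(), axis=0)

# Dot products of unit vectors are cosine similarities:
# rows are the selected columns, columns are all columns of df.
similarity = unit.loc[:, mask].T @ unit
```

For example, `similarity.loc['17', '14']` is about 0.7071, which is 1 minus the 0.292893 distance shown above.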
