Pandas: Apply function over each pair of columns under constraints - python

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:
Code | 14 | 17 | 19 | ...
-----+----+----+----+----
w1   |  0 |  5 |  3 | ...
w2   |  2 |  5 |  4 | ...
w3   |  0 |  0 |  5 | ...
The Code corresponds to a particular location in a rectangular grid and the ws are different words. I would like to apply the cosine similarity measure between each pair of columns, but only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.
The desired output would be something like:
    | [14,17]   | [14,19]   | [14,...]   | [17,19]   | ...
Sim | cs(14,17) | cs(14,19) | cs(14,...) | cs(17,19) | ...
cs is the result of the cosine similarity for each pair of columns.
Is there any suitable method to do this?
Any help would be appreciated :-)

To apply the cosine metric to each pair from two collections of inputs, you
could use scipy.spatial.distance.cdist. This will be much much faster than
using a double Python loop.
Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:
import pandas as pd

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# keep only the columns whose sum is greater than 5
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
Then all the cosine similarities can be computed with one call to cdist:
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
The values can be wrapped in a new DataFrame and reshaped:
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
Putting it all together, and dropping the rows where a column is paired with itself:
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

# columns whose sum exceeds 5
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]

# pairwise cosine metric between the selected columns and all columns
values = SSD.cdist(df2.T, df.T, metric='cosine')

result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()

# remove the pairs where a column is compared with itself
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
yields the Series
17 14 0.292893
19 0.300000
19 14 0.434315
17 0.300000
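One caveat: with metric='cosine', cdist returns cosine distances, i.e. 1 minus the cosine similarity, which is why identical columns show values near 0 in the array above rather than 1. If you want similarities, subtract the result from 1 (a small sketch building on the variables defined above):
# cosine similarity = 1 - cosine distance
similarities = 1 - SSD.cdist(df2.T, df.T, metric='cosine')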

Related

Pandas DataFrame string replace followed by split and set intersection

I have the following pandas DataFrame:
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have a list of integers
r = [18, 55]
and I want to filter the rules from the above DataFrame, keeping those where all integers in the list r are also present in the rule. I tried the following code and it failed:
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas?
You were going in the right direction; you just need to use the apply function instead:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
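Note the map(str, r): after str.split('#') the rule tokens are strings, so the integers in r have to be converted first or the superset check can never succeed. A small illustration:
tokens = set('18#38#23#55#35'.split('#'))   # {'18', '38', '23', '55', '35'}
print(tokens.issuperset({18, 55}))          # False -- ints never compare equal to strings
print(tokens.issuperset({'18', '55'}))      # True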
Using str.get_dummies
d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, map(str, r)].all(1)
Outputs
0 False
1 True
dtype: bool
Detail:
get_dummies+loc returns
18 55
0 1 0
1 1 1
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])
def wrap(n):
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)
patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0 False
1 False
2 True
My approach is similar to @RafaelC's answer, but converts all strings to int:
new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)
# then you can assign a new column in the initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+-------+-------------------+------------+
| | rule | new_col |
+-------+-------------------+------------+
| 0 | 18#38#123#23=>21 | 10 |
| 1 | 188#38#23#55=>35 | 10 |
| 2 | 18#38#23#55=>35 | 100 |
+-------+-------------------+------------+

Dataframe summary math based on condition from another dataframe?

I have what amounts to 3D data but can't install the Pandas recommended xarray package.
df_values
| a b c
-----------------
0 | 5 9 2
1 | 6 9 5
2 | 1 6 8
df_condition
| a b c
-----------------
0 | y y y
1 | y n y
2 | n n y
I know I can get the average of all values in df_values like this.
df_values.stack().mean()
Question... 👇
What is the simplest way to find the average of df_values where df_condition == "y"?
Assuming you wish to find the mean of all values where df_condition == 'y':
res = np.nanmean(df_values[df_condition.eq('y')]) #5.833333333333333
Using NumPy is substantially cheaper than Pandas stack or where:
# Pandas 0.23.0, NumPy 1.14.3
n = 10**5
df_values = pd.concat([df_values]*n, ignore_index=True)
df_condition = pd.concat([df_condition]*n, ignore_index=True)
%timeit np.nanmean(df_values.values[df_condition.eq('y')]) # 32 ms
%timeit np.nanmean(df_values.where(df_condition == 'y').values) # 88 ms
%timeit df_values[df_condition.eq('y')].stack().mean() # 107 ms
IIUC Boolean mask
df[c.eq('y')].mean().mean()
6.5
Or you may want
df[c.eq('y')].sum().sum()/c.eq('y').sum().sum()
5.833333333333333
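The two numbers differ because .mean().mean() averages the per-column means, while stacking (or np.nanmean) averages all matching cells at once. A minimal sketch, rebuilding the frames from the question:
import pandas as pd

df_values = pd.DataFrame({'a': [5, 6, 1], 'b': [9, 9, 6], 'c': [2, 5, 8]})
df_condition = pd.DataFrame({'a': list('yyn'), 'b': list('ynn'), 'c': list('yyy')})

masked = df_values[df_condition.eq('y')]  # cells where the condition is 'n' become NaN
print(masked.mean().mean())               # 6.5 -> mean of the per-column means
print(masked.stack().mean())              # 5.833... -> mean over all matching cells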
You can get the mean of all values where the condition is 'y' with only pandas DataFrame and Series methods like below
df_values[df_condition.eq('y')].stack().mean() # 5.833333333333333
or
df_values[df_condition == 'y'].stack().mean() # 5.833333333333333
Is this simple? :)
Try:
np.nanmean(df_values.where(df_condition == 'y').values)
Output:
5.8333333333

What is the fastest way to conditionally change the values of a dataframe in every index and column?

Is there a way to reduce each element of a dataframe that satisfies a condition on its own value by a constant number, without using a loop?
For instance, each cell < 2 has its value reduced by 1.
Thank you very much.
I like to do this with masking.
Here is an inefficient loop using your example:
# Example using a loop
for val in df['column']:
    if val < 2:
        val = val - 1  # note: this only rebinds the local variable val; the DataFrame is not changed
The following code achieves the intended effect, and it will generally be much faster because it does not use a Python-level loop.
# Same effect using masks
mask = (df['column'] < 2) #Find entries that are less than 2.
df.loc[mask,'column'] = df.loc[mask,'column'] - 1 #Subtract 1.
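As a concrete sketch on a tiny frame (the column name here is just an example), the before/after looks like this:
import pandas as pd

df = pd.DataFrame({'column': [1, 2, 3]})
mask = df['column'] < 2       # True where the value is below 2
df.loc[mask, 'column'] -= 1   # subtract 1 only on those rows
print(df)
#    column
# 0       0
# 1       2
# 2       3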
I am not sure if this is the fastest, but you can use the .apply function:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [2, 2, 2], [4, 4, 4]]),
                  columns=['x', 'y', 'z'])

def conditional_add(x):
    if x > 2:
        return x + 2
    else:
        return x

df['x'] = df['x'].apply(conditional_add)
Will add 2 to the final row of column x.
More like (data from Willie's answer):
df-((df<2)*2)
Out[727]:
x y z
0 -1 2 3
1 2 2 2
2 4 4 4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import modules which are needed
import pandas as pd
import numpy as np
# create example dataframe
df = pd.DataFrame({'A':[3,1,5,0.5,2,0.2]})
| A |
|-----|
| 3 |
| 1 |
| 5 |
| 0.5 |
| 2 |
| 0.2 |
# apply the np.where method with conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)
| A |
|------|
| 3 |
| 0.0 |
| 5 |
| -0.5 |
| 2 |
| -0.8 |

Using pd.cut & pd.value_counts, then returning the results as a 2d array

Use case
I get random observations from a population.
Then I group them by bin using pd.cut
Then I extract values with pd.value_counts
I want to get the calculated interval labels and the frequency count
I want to 'glue' the labels column to the frequency counts column to get a 2d array (with 2 columns and n interval rows)
I want to convert the 2d array to a list for COM interop.
I am close to the desired output, but I am a Python newbie, so someone can probably optimize my label code.
The problem here is the constraint of the final output which needs to be a list so it can be marshalled via COM interop layer to Excel VBA.
import inspect
import numpy as np
import pandas as pd
from scipy.stats import skewnorm
pop = skewnorm.rvs(0, size=20)
bins=[-5,-4,-3,-2,-1,0,1,2,3,4,5]
bins2 = np.array(bins)
bins3 = pd.cut(pop,bins2)
bins4 = [0]*(bins2.size-1)
#print my own labels, doh!
idx=0
for binLoop in bins3.categories:
    intervalAsString = "(" + str(binLoop.left) + "," + str(binLoop.right) + "]"
    print(intervalAsString)
    bins4[idx] = intervalAsString
    idx = idx + 1
table = pd.value_counts(bins3, sort=False)
joined = np.vstack((bins4,table.tolist()))
print (joined)
Target output: a 2d array convertible to a list
| (-5, -4] | 0 |
| (-4, -3] | 0 |
| (-3, -2] | 0 |
| (-2, -1] | 1 |
| (-1, 0] | 3 |
| (0, 1] | 9 |
| (1, 2] | 4 |
| (2, 3] | 2 |
| (3, 4] | 1 |
| (4, 5] | 0 |
If I understand you correctly, the following should do what you are after:
import pandas as pd
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
bins = range(-5, 5)
binned = pd.cut(pop, bins)
# create the histogram data
hist = binned.value_counts()
# hist is a pandas series with a categorical index describing the bins
# `index.astype(str)` will convert the categories to strings.
hist.index = hist.index.astype(str)
# `.reset_index()` will turn the index into an ordinary column
# `.values` gives you the underlying numpy array
# `tolist()` converts the numpy array to a native python list o' lists.
print(hist.reset_index().values.tolist())
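One small assumption to watch: range(-5, 5) ends at 4, so it only produces bins up to (3, 4]. To reproduce the ten bins from the question, extend the range by one:
bins = range(-5, 6)   # edges -5, -4, ..., 5, i.e. ten intervals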

How to get average of increasing values using Pandas?

I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
The function I'm trying to write for my problem:
def avergeIncreace(data, value):  # not complete but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a table of the percentage change of each value relative to the value in the row before it. fillna(0) replaces the NaN in position 0 that pct_change() creates with 0. gt(0) returns a True/False table depending on whether the value at that index is greater than 0.
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas there should be a way to return an array of all the indexes that are true and then use a for loop and go through the original data table, but I believe pandas should have a way to do this without a for loop.
What I think the for-loop way would look like (still missing the code so that the indexes returned are only the True ones instead of every index):
avergeIncreace(df, 'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values  # this returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer/times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
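To see what that chain does, here is a step-by-step sketch on the sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

steps = df.diff()                       # row-to-row differences; the first row is NaN
steps = steps.mask(steps <= 0, np.nan)  # keep only the positive (increasing) steps
print(steps.mean().fillna(0))           # column means skip NaN; all-NaN columns become 0
# A    75.0
# B     0.0
# C    10.0
# dtype: float64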
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If there are only zero values, there is no increase
        return 0
    else:
        return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
