I have what amounts to 3D data but can't install the Pandas recommended xarray package.
df_values
| a b c
-----------------
0 | 5 9 2
1 | 6 9 5
2 | 1 6 8
df_condition
| a b c
-----------------
0 | y y y
1 | y n y
2 | n n y
I know I can get the average of all values in df_values like this.
df_values.stack().mean()
Question... 👇
What is the simplest way to find the average of df_values where df_condition == "y"?
Assuming you wish to find the mean of all values where df_condition == 'y':
res = np.nanmean(df_values[df_condition.eq('y')]) #5.833333333333333
Using NumPy is substantially cheaper than Pandas stack or where:
# Pandas 0.23.0, NumPy 1.14.3
n = 10**5
df_values = pd.concat([df_values]*n, ignore_index=True)
df_condition = pd.concat([df_condition]*n, ignore_index=True)
%timeit np.nanmean(df_values.values[df_condition.eq('y')]) # 32 ms
%timeit np.nanmean(df_values.where(df_condition == 'y').values) # 88 ms
%timeit df_values[df_condition.eq('y')].stack().mean() # 107 ms
IIUC Boolean mask
df[c.eq('y')].mean().mean()
6.5
Or you may want
df[c.eq('y')].sum().sum()/c.eq('y').sum().sum()
5.833333333333333
You can get the mean of all values where the condition is 'y' with only pandas DataFrame and Series methods like below
df_values[df_condition.eq('y')].stack().mean() # 5.833333333333333
or
df_values[df_condition == 'y'].stack().mean() # 5.833333333333333
Is this simple? :)
Try:
np.nanmean(df.where(dfcon == 'y').values)
Output:
5.8333333333
Related
I want to shuffle the columns of a pandas data frame.
However, the default method (sample) shuffles all the columns in the same way.
How can I efficiently shuffle the columns of each row differently?
import pandas as pd
df = pd.DataFrame({'foo':[1,4,7],'bar':[2,5,8],'baz':[3,6,9],})
display(df)
df.sample(frac=1, axis=1)
Certainly, an apply based solution would work - but this would not be vectorized and thus slow.
Is there a fast (and ideally vectorized) way to sample differently for each row?
Let us try with np.random.rand and argsort to generate shuffled indices
i = np.random.rand(*df.shape).argsort(1)
df.values[:] = np.take_along_axis(df.to_numpy(), i, axis=1)
print(df)
foo bar baz
0 3 1 2
1 4 5 6
2 7 9 8
You can try this solution:
def shuffle_columns_per_row(df):
arr = df.values
x, y = arr.shape
rows = np.indices((x,y))[0]
cols = [np.random.permutation(y) for _ in range(x)]
return pd.DataFrame(arr[rows, cols], columns=df.columns)
| foo | bar | baz |
|-----|-----|-----|
| 3 | 2 | 1 |
| 5 | 6 | 4 |
| 9 | 7 | 8 |
A quick check gives the benchmark:
%%timeit
df.sample(frac=1, axis=1)
# 1000 loops, best of 5: 288 µs per loop
With apply, as you said, we get:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.apply(lambda x: x.iloc[idx], axis=1)
# 1000 loops, best of 5: 1.47 ms per loop -> ~3700 times slower
We could rather use iloc:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.iloc[:, idx]
# 1000 loops, best of 5: 398 µs per loop -> ~1.4 times slower
If you could live with a roughly 1.4 times decrease in speed, I think the iloc version would work.
I have following pandas DataFrame
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have list of integers
r = [18, 55]
and I want to filter rules from above DataFrame if all integers in the list r are present in the rule too. I tried the following code and failed
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas
You were going in right direction, just need to use apply function instead:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
Using str.get_dummies
d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, map(str, r)].all(1)
Outputs
0 False
1 True
dtype: bool
Detail:
get_dummies+loc returns
18 55
0 1 0
1 1 1
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])
def wrap(n):
return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)
patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0 False
1 False
2 True
My approach is similar to #RafaelC's answer, but convert all string into int:
new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)
# then you can assign new column for initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+-------+-------------------+------------+
| | rule | new_col |
+-------+-------------------+------------+
| 0 | 18#38#123#23=>21 | 10 |
| 1 | 188#38#23#55=>35 | 10 |
| 2 | 18#38#23#55=>35 | 100 |
+-------+-------------------+------------+
Is there a way to reduce by a constant number each element of a dataframe verifying a condition including their own value without using a loop?
For instance, each cells < 2 sees its value reducing by 1.
Thank you very much.
I like to do this masking.
Here is an inefficient loop using your example
#Example using loop
for val in df['column']:
if(val<2):
val = val - 1
The following code gives the same result, but it will generally be much faster because it does not use a loop.
# Same effect using masks
mask = (df['column'] < 2) #Find entries that are less than 2.
df.loc[mask,'column'] = df.loc[mask,'column'] - 1 #Subtract 1.
I am not sure if this is the fastest, but you can use the .apply function:
import pandas as pd
df = pd.DataFrame(data=np.array([[1,2,3], [2,2,2], [4,4,4]]),
columns=['x', 'y', 'z'])
def conditional_add(x):
if x > 2:
return x + 2
else:
return x
df['x'] = df['x'].apply(conditional_add)
Will add 2 to the final row of column x.
More like (data from Willie)
df-((df<2)*2)
Out[727]:
x y z
0 -1 2 3
1 2 2 2
2 4 4 4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import modules which are needed
import pandas as pd
import numpy as np
# create exmaple dataframe
df = pd.DataFrame({'A':[3,1,5,0.5,2,0.2]})
| A |
|-----|
| 3 |
| 1 |
| 5 |
| 0.5 |
| 2 |
| 0.2 |
# apply the np.where method with conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)
| A |
|------|
| 3 |
| 0.0 |
| 5 |
| -0.5 |
| 2 |
| -0.8 |`
I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data,value): #not complete but what I have so far
x = data[value].pct_change().fillna(0).gt(0)
print( x )
pct_change() returns a table of the percentage of the number at that index compared to the number in row before it.fillna(0) replaces the NaN in position 0 of the chart that pct_change() creates with 0.gt(0) returns true or false table depending if the value at that index is greater than 0
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtyle: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas there should be a way to return an array of all the indexes that are true and then use a for loop and go through the original data table, but I believe pandas should have a way to do this without a for loop.
what I think the for loop way would look plus missing code so indexes returned are ones that are true instead of every index
avergeIncreace(df,'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values #this returns an array containing all of the index (true and false)
answer = 0
times = 0
for x in indexes:
answer += (data[value][x] - data[value][x-1])
times += 1
print( answer/times )
How to I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
'B': [5, 2, 2, 0],
'C': [10, 20, 30, 40]})
def averageIncrease(df, col_name):
# Create array of deltas. Replace nan and negative values with zero
a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
# Count non-zero values
count = np.count_nonzero(a)
if count == 0:
# If only zero values… there is no increase
return 0
else:
return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:
Code | 14 | 17 | 19 | ...
w1 | 0 | 5 | 3 | ...
w2 | 2 | 5 | 4 | ...
w3 | 0 | 0 | 5 | ...
The Code corresponds to a determined location in a rectangular grid and the ws are different words. I would like to apply cosine similarity measure between each pair of columns only (EDITED!) if the sum of items in one of the columns of the pair is greater thah 5.
The desired output would be something like:
| [14,17] | [14,19] | [14,...] | [17,19] | ...
Sim |cs(14,17) |cs(14,19) |cs(14,...) |cs(17,19)..| ...
cs is the result of the cosine similarity for each pair of columns.
Is there any suitable method to do this?
Any help would be appreciated :-)
To apply the cosine metric to each pair from two collections of inputs, you
could use scipy.spatial.distance.cdist. This will be much much faster than
using a double Python loop.
Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:
import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
Then all the cosine similarities can be computed with one call to cdist:
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
The values can be wrapped in a new DataFrame and reshaped:
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
yields the Series
17 14 0.292893
19 0.300000
19 14 0.434315
17 0.300000