I have the following pandas DataFrame:
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have a list of integers:
r = [18, 55]
and I want to filter the rules in the DataFrame above, keeping a row only if every integer in the list r is present in the rule. I tried the following code and it failed:
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas?
You were going in the right direction; you just need to use apply instead:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
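For reference, a quick check of this line on the sample data (a sketch reconstructing d and r from the question):

```python
import pandas as pd

d = pd.DataFrame(['18#38#123#23=>21', '18#38#23#55=>35'], columns=['rule'])
r = [18, 55]

# Turn '=>' into '#', split into tokens, and keep rows whose token set
# contains every number in r (compared as strings)
mask = d['rule'].str.replace('=>', '#').str.split('#').apply(
    lambda x: set(x).issuperset(set(map(str, r)))
)
print(d[mask])  # only the second rule contains both 18 and 55
```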
Using str.get_dummies
d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, map(str, r)].all(1)
Outputs
0 False
1 True
dtype: bool
Detail:
get_dummies+loc returns
18 55
0 1 0
1 1 1
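The boolean Series above can then be used directly to index the original frame (a sketch; d and r reconstructed from the question):

```python
import pandas as pd

d = pd.DataFrame(['18#38#123#23=>21', '18#38#23#55=>35'], columns=['rule'])
r = [18, 55]

# One 0/1 indicator column per token; a row qualifies when all of r's columns are 1
mask = (d['rule'].str.replace('=>', '#')
        .str.get_dummies(sep='#')
        .loc[:, [str(i) for i in r]]
        .all(axis=1))
print(d[mask])
```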
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])
def wrap(n):
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)
patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0 False
1 False
2 True
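Note that these lookarounds each require a neighboring character, so a number at the very start or end of the string can never match. A hedged variant using negative lookarounds, which also succeed at string boundaries, avoids that edge case (and gives the same output on this data):

```python
import pandas as pd

df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'],
                  columns=['rule'])

def wrap(n):
    # (?<!\d) and (?!\d) also succeed at the start/end of the string
    return r'(?<!\d){}(?!\d)'.format(n)

patterns = [18, 55]
result = pd.concat([df['rule'].str.contains(wrap(p)) for p in patterns],
                   axis=1).all(axis=1)
print(result)
```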
My approach is similar to @RafaelC's answer, but converts the column labels from string to int:
new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)
# then you can assign new column for initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+-------+-------------------+------------+
| | rule | new_col |
+-------+-------------------+------------+
| 0 | 18#38#123#23=>21 | 10 |
| 1 | 188#38#23#55=>35 | 10 |
| 2 | 18#38#23#55=>35 | 100 |
+-------+-------------------+------------+
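The two-step assignment can also be collapsed into a single np.where call (a sketch, using a three-row frame matching the output table above):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame(['18#38#123#23=>21', '188#38#23#55=>35', '18#38#23#55=>35'],
                 columns=['rule'])
r = [18, 55]

new_df = d.rule.str.replace('=>', '#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)

# 100 where every number in r is present, 10 otherwise
d['new_col'] = np.where(has_all, 100, 10)
```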
Related
I want to do
df[(df['col']==50) | (df['col']==150) | etc ..]
"etc" is size changing from 1 to many
so I do a loop
result is like
str= "(df['col']==50) | (df['col']==150) | (df['col']==100)"
then I do this
df[str]
but this does not work
How can I make it work ?
A simple solution:
list_of_numbers = [50,150]
df[df["col"].isin(list_of_numbers)]
Here list_of_numbers holds the numbers you want to include in the condition. I'm assuming your condition is always an or.
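For completeness, the negation gives the rows not in the list (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'col': [50, 75, 100, 150]})
list_of_numbers = [50, 150]

included = df[df['col'].isin(list_of_numbers)]   # col == 50 | col == 150
excluded = df[~df['col'].isin(list_of_numbers)]  # all other rows
```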
Use query to filter a dataframe from a string
df = pd.DataFrame({'col': range(25, 225, 25)})
l = [50, 100, 150]
q = ' | '.join([f"col == {i}" for i in l])
out = df.query(q)
>>> q
'col == 50 | col == 100 | col == 150'
>>> out
col
1 50
3 100
5 150
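query also supports membership tests directly, with @ referencing a Python variable, which sidesteps building the string entirely (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'col': range(25, 225, 25)})
l = [50, 100, 150]

# '@l' refers to the local Python list
out = df.query('col in @l')
```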
Is there a way to reduce by a constant each element of a DataFrame that satisfies a condition involving its own value, without using a loop?
For instance, each cell < 2 has its value reduced by 1.
Thank you very much.
I like to do this with masking.
Here is an inefficient loop using your example
# Example using a loop
for i in df.index:
    if df.loc[i, 'column'] < 2:
        df.loc[i, 'column'] = df.loc[i, 'column'] - 1
The following code gives the same result, but it will generally be much faster because it does not use a loop.
# Same effect using masks
mask = (df['column'] < 2) #Find entries that are less than 2.
df.loc[mask,'column'] = df.loc[mask,'column'] - 1 #Subtract 1.
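Put together on a small frame (a sketch; the column name and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({'column': [3, 1, 5, 0.5, 2, 0.2]})

mask = (df['column'] < 2)                            # entries that are less than 2
df.loc[mask, 'column'] = df.loc[mask, 'column'] - 1  # subtract 1 from just those
print(df['column'].tolist())
```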
I am not sure if this is the fastest, but you can use the .apply function:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [2, 2, 2], [4, 4, 4]]),
                  columns=['x', 'y', 'z'])
def conditional_add(x):
    if x > 2:
        return x + 2
    else:
        return x
df['x'] = df['x'].apply(conditional_add)
This adds 2 to the last value of column x, the only entry greater than 2.
Or vectorized in one line (data from Willie; this subtracts 2 where the value is less than 2):
df-((df<2)*2)
Out[727]:
x y z
0 -1 2 3
1 2 2 2
2 4 4 4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import modules which are needed
import pandas as pd
import numpy as np
# create example dataframe
df = pd.DataFrame({'A':[3,1,5,0.5,2,0.2]})
| A |
|-----|
| 3 |
| 1 |
| 5 |
| 0.5 |
| 2 |
| 0.2 |
# apply the np.where method with conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)
| A |
|------|
| 3 |
| 0.0 |
| 5 |
| -0.5 |
| 2 |
| -0.8 |
I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
function I'm trying to write for my problem
def avergeIncreace(data, value):  # not complete, but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a Series giving, at each index, the percentage change from the value in the row before it. fillna(0) replaces the NaN that pct_change() leaves in position 0 with 0. gt(0) returns a True/False Series indicating whether each value is greater than 0.
current output of this function
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: A, dtype: bool
desired output
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas, there should be a way to return an array of all the indexes that are True and then loop through the original table, but I believe pandas has a way to do this without a for loop.
Here is what I think the for-loop version would look like (plus the missing code so that only the True indexes are used instead of every index):
avergeIncreace(df,'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values  # returns an array containing every index (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer / times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
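A self-contained version with the question's data, computing the diff only once (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

d = df.diff()
# keep only the positive deltas, then average per column;
# an all-NaN column (no increases) becomes 0 via fillna
result = d.mask(d <= 0, np.nan).mean().fillna(0)
print(result)
```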
How about
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 100, 50, 100],
'B': [5, 2, 2, 0],
'C': [10, 20, 30, 40]})
def averageIncrease(df, col_name):
    # Create array of deltas; replace NaN and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If there are only zero values, there is no increase
        return 0
    else:
        return np.sum(a) / count
print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0
As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:
Code | 14 | 17 | 19 | ...
w1 | 0 | 5 | 3 | ...
w2 | 2 | 5 | 4 | ...
w3 | 0 | 0 | 5 | ...
The Code corresponds to a determined location in a rectangular grid and the ws are different words. I would like to apply the cosine similarity measure between each pair of columns only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.
The desired output would be something like:
| [14,17] | [14,19] | [14,...] | [17,19] | ...
Sim |cs(14,17) |cs(14,19) |cs(14,...) |cs(17,19)..| ...
cs is the result of the cosine similarity for each pair of columns.
Is there any suitable method to do this?
Any help would be appreciated :-)
To apply the cosine metric to each pair from two collections of inputs, you
could use scipy.spatial.distance.cdist. This will be much much faster than
using a double Python loop.
Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:
import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
Then all the cosine similarities can be computed with one call to cdist:
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
The values can be wrapped in a new DataFrame and reshaped:
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
Putting it all together:
import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
yields the Series
17 14 0.292893
19 0.300000
19 14 0.434315
17 0.300000
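One caveat: with metric='cosine', cdist returns the cosine distance (1 − similarity). If the question's cs is meant to be the similarity itself, subtract the result from 1 (a sketch):

```python
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]  # columns whose sum exceeds 5

distances = SSD.cdist(df2.T, df.T, metric='cosine')
similarities = 1 - distances   # cosine similarity = 1 - cosine distance
```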
I have a pandas dataframe structured as follows:
| ID | Start | Stop |
________________________________________
| 1 | 1,2,3,4 | 5,6,7,7 |
| 2 | 100,101 | 200,201 |
For each row in the dataframe, I'd like to add 1 to each value in the Start column. The dtype for the Start column is 'object'.
Desired output looks like this:
| ID | Start | Stop |
________________________________________
| 1 | 2,3,4,5 | 5,6,7,7 |
| 2 | 101,102 | 200,201 |
I've tried the following (and many versions of the following), but get an error stating TypeError: cannot concatenate 'str' and 'int' objects:
df['test'] = [str(x + 1) for x in df['Start']]
I tried casting the column as an int, but got 'Invalid literal for long() with base 10: '101,102':
df['test'] = [int(x) + 1 for x in df['start'].astype(int)]
I also tried converting the field to a list using str.split(), then casting each item as an integer. Thanks in advance!
df['Start'] is the whole Series, so you have to iterate it and split each element:
new_series = []
for x in df['Start']:
    value_list = []
    for y in x.rstrip(',').split(','):
        value_list.append(str(int(y) + 1))
    new_series.append(','.join(value_list))
df['test'] = new_series
The error telling you that you cannot concatenate string and int objects means that x is a string. Note, though, that each x here is the whole comma-separated string (e.g. '1,2,3,4'), so a bare int(x) will also fail; split on the comma, increment each piece, and join the results back:
df['test'] = [','.join(str(int(v) + 1) for v in x.split(',')) for x in df['Start']]
df = pd.DataFrame({'Start' : [ [1 , 2, 3 , 4] , [100 , 101] ] , 'Stop' : [ [5 , 6 , 7 ,7] , [200,201] ] })
df.Start = df.Start.apply(lambda x : [y + 1 for y in x ])
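If the column actually holds comma-separated strings rather than lists (as the dtype 'object' in the question suggests), the same apply idea works after splitting on the comma (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Start': ['1,2,3,4', '100,101'],
                   'Stop': ['5,6,7,7', '200,201']})

# split each string, add 1 to every piece, and join back with commas
df['Start'] = df['Start'].str.split(',').apply(
    lambda xs: ','.join(str(int(v) + 1) for v in xs)
)
```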