How to include two lambda operations in transform function? - python

I have a dataframe like as given below
df = pd.DataFrame({
'date' :['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00','2173-04-04 11:00:00','2173-04-04 12:00:00','2173-04-04 11:30:00','2173-04-04 16:00:00','2173-04-04 22:00:00','2173-04-05 04:00:00'],
'subject_id':[1,1,1,1,1,1,1,1,1],
'val' :[5,5,5,10,10,5,5,8,8]
})
I would like to apply a couple of logics (logic_1 on the val column and logic_2 on the date column) to this dataframe. Please find the logics below:
logic_1 = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Credit to SO users for helping me with the logic.
This is what I tried:
df['label'] = ''
df['date'] = pd.to_datetime(df['date'])
df['tdiff'] = df['date'].shift(-1) - df['date']
df['tdiff'] = df['tdiff'].dt.total_seconds()/3600
df['lo_1'] = df.groupby('subject_id')['val'].transform(logic_1).map({True:'1',False:''})
df['lo_2'] = df.groupby('subject_id')['tdiff'].transform(logic_2).map({True:'1',False:''})
How can I make both logic_1 and logic_2 part of one logic statement? Is it even possible? I might have more than two logics as well; instead of writing one line for each logic, is it possible to couple all the logics together in one statement?
I expect the output's label column to be filled with 1 when both logic_1 and logic_2 are satisfied.

You have a few things to fix.
First, in logic_2 you had lambda x but used y in the body, so you need to change that as below:
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Then you can use the logics together as below.
There is no need to create a blank label column first; you can create the label column directly:
df['label'] = ((df.groupby('subject_id')['val'].transform(logic_1))
& (df.groupby('subject_id')['tdiff'].transform(logic_2))).map({True:'0',False:'1'})
Note: your logic produces all False values on this data, so you will only get 1's if False is mapped to '1', not True.
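If you end up with more than two logics, one pattern is to keep (column, logic) pairs in a list and AND the transformed masks together, so adding a logic is just adding a pair. This is a sketch built on the code above; here True is mapped to '1' to match the expected output in the question:
from functools import reduce

# (column, logic) pairs -- add more entries as you add logics
logics = [('val', logic_1), ('tdiff', logic_2)]

# run each logic on its column within each subject, then AND all the masks
masks = [df.groupby('subject_id')[col].transform(fn) for col, fn in logics]
df['label'] = reduce(lambda a, b: a & b, masks).map({True: '1', False: ''})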


Comparing datetime64[ns] columns in a pandas dataframe with a lambda seems to not be working

What I want to achieve is to replace the date-time value in Dt-1 with Dt-2 if Dt-1 is greater than Dt-2.
I am using a lambda function with an if-else pattern to compare the columns, but the Dt-1 value is changed if the condition is not met. Any help is appreciated. Thank you.
Below is my short code sample, and the output when run:
import pandas as pd
df = pd.DataFrame([["a", "2022-11-01 00:01:11", "2022-11-02 00:02:33"]
],
columns=["gr", "Dt-1", "Dt-2"])
df["Dt-1"] = pd.to_datetime(df["Dt-1"])
df["Dt-2"] = pd.to_datetime(df["Dt-2"])
print(df)
print(df.info())
df["Dt-1"] = df.apply(lambda x: x["Dt-1"] if x["Dt-1"] >
x["Dt-2"] else x["Dt-2"], axis=1)
print(df)
print(df.info())
Use the following code:
df["Dt-1"] = df[['Dt-1', 'Dt-2']].max(axis=1)
Sorry for posting this, the if-else works correctly... my bad, sorry. :)
lambda x: x["Dt-1"] if x["Dt-1"] > x["Dt-2"] else x["Dt-2"]

Selection in a dataframe based on multiple conditions

I am developing a dashboard using Dash.
The user can select different parameters (6 of them) and a dataframe is updated.
The idea was to do:
filtering = []
if len(filter1)>0:
filtering.append("df['col1'].isin(filter1)")
if len(filter2)>0:
filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[condition]
But I get a KeyError, which I understand, as condition is a string.
Any advice on how I can do it? What is the best practice?
NB: I have a working solution with if conditions, but I would like to optimise this part, avoiding a copy of the dataframe (>10 million rows).
dff = df.copy()
if len(filter1)>0:
dff = dff.loc[dff.col1.isin(filter1)]
if len(filter2)>0:
dff = dff.loc[dff.col2.isin(filter2)]
You can use eval:
filtering = []
if len(filter1)>0:
filtering.append("df['col1'].isin(filter1)")
if len(filter2)>0:
filtering.append("df['col2'].isin(filter2)")
condition = ' & '.join(filtering)
df.loc[eval(condition)]
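A related string-based option (a sketch under the same assumptions about filter1/filter2; query compiles the expression rather than running Python eval on a built string) is df.query with @-references:
# build query clauses only for the non-empty filters
clauses = []
if len(filter1) > 0:
    clauses.append("col1 in @filter1")
if len(filter2) > 0:
    clauses.append("col2 in @filter2")
if clauses:
    df = df.query(" and ".join(clauses))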
You can merge the masks using the & operator and only apply the merged mask once
from functools import reduce
filters = []
if len(filter1)>0:
filters.append(df.col1.isin(filter1))
if len(filter2)>0:
filters.append(df.col2.isin(filter2))
if len(filters) > 0:
final_filter = reduce(lambda a, b: a&b, filters)
df = df.loc[final_filter]
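As a design note, the same merge can start from an all-True mask, which avoids both reduce and an empty-list check (a sketch under the same assumptions):
import pandas as pd

# start with an all-True mask and AND each active filter into it
mask = pd.Series(True, index=df.index)
if len(filter1) > 0:
    mask &= df.col1.isin(filter1)
if len(filter2) > 0:
    mask &= df.col2.isin(filter2)
df = df.loc[mask]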

How to apply style selectively to rows of specific columns?

I want to flag the anomalies in the desired_columns (desired_D to L). Here, an anomaly is defined as any value <1500 or >400000 in each row.
See below for the dataset
import pandas as pd
# initialise data of lists
data = {
'A':['L1', 'L2', 'L3', 'L4', 'L5'],
'B':[1,1,1,1,1],
'C':[1,2,3,5,9],
'desired_D':[12005, 18190, 1021, 13301, 31119],
'desired_E':[11021, 19112, 19021, 15, 24509 ],
'desired_F':[10022,19910, 19113,449999, 25519],
'desired_G':[14029, 29100, 39022, 24509, 412271],
'desired_H':[52119,32991,52883,69359,57835],
'desired_J':[41218, 52991,55121,69152,79355],
'desired_K': [43211,7672991,56881,211,77342],
'desired_L': [31211,42901,53818,62158,69325],
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Currently, my code flags columns B and C as well (I want to exclude them).
The revised code looks like this:
# function to flag the anomaly in each row - this also flags columns B and C (I want to exclude these columns)
dont_format_cols = ['B','C']
def flag_outliers(s, dont_format_cols):
if s.name in dont_format_cols:
return '' # or None, or whatever df.style() needs
else:
s = pd.to_numeric(s, errors='coerce')
indexes = (s<1500)|(s>400000)
return ['background-color: red' if v else '' for v in indexes]
styled = df.style.apply(flag_outliers, axis=1)
styled
The edits above still raise an error.
Desired output: the highlighting should exclude columns B and C.
df.style.apply(..., axis=1) applies your outlier-styling function to every row, so it touches all of df's columns. If you only want to apply it to some columns, use the subset argument.
EDIT: I wasn't aware df.style.apply() had a subset argument; I had proposed these hacky approaches:
1: Inspect the series name s.name inside the styling function, as in the answer to Pandas style function to highlight specific columns.
### Hack solution: hardwire it into the body of `flag_outliers()` instead of adding an extra arg `dont_format_cols`
def flag_outliers(s):
dont_format_cols = ['B','C']
if s.name in dont_format_cols:
return '' # or None, or whatever df.style() needs
else:
# code to apply formatting
2: Another hack approach: add a second arg dont_format_cols to your function flag_outliers(s, dont_format_cols). Now you have to pass it in the apply call, so you'll need a lambda:
styled = df.style.apply(lambda s: flag_outliers(s, dont_format_cols), axis=1)
and:
def flag_outliers(s, dont_format_cols):
if s.name in dont_format_cols:
return '' # or None, or whatever df.style() needs
else:
# code to apply formatting
Use the subset argument; that is precisely its purpose: to isolate styles to specific regions.
i.e. df.style.apply(flag_outliers, axis=1, subset=<list of used columns>)
You can see examples in the pandas Styler user guide section on finer slicing.
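Putting that together for this dataset (a sketch; the desired_* column list is derived from the sample df above, and the function no longer needs the name check):
# style only the desired_* columns; B and C are never passed to the function
desired_cols = [c for c in df.columns if c.startswith('desired_')]

def flag_outliers(s):
    # s is one row, restricted to the subset columns
    s = pd.to_numeric(s, errors='coerce')
    outliers = (s < 1500) | (s > 400000)
    return ['background-color: red' if v else '' for v in outliers]

styled = df.style.apply(flag_outliers, axis=1, subset=desired_cols)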

Error trying to concatenate strings with groupby in Python

So, I'm new to Python and I have this dataframe with company names, country information and activity descriptions. I'm trying to group all this information by name, concatenating the country and activity strings.
First, I did something like this:
df3_['Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
df4_ = df3_.drop_duplicates()
df4_['Activity'] = df4_.groupby(['Name', 'Country'])['Activity'].transform(lambda x: ','.join(x))
This way, I got a SettingWithCopyWarning, so I read a little bit about this warning and tried copying the dataframe before applying the functions (which didn't work) and using .loc (which didn't work either):
df3_.loc[:, 'Country'] = df3_.groupby(['Name', 'Activity'])['Country'].transform(lambda x: ','.join(x))
Any idea how to fix this?
Edit: I was asked to post an example of my data. The first picture is what I have; the second one is what it should look like.
You want to group by the Company Name and then use some aggregating functions for the other columns, like:
df.groupby('Company Name').agg({'Country Code':', '.join, 'Activity':', '.join})
You were trying it the other way around.
Note that the empty string value ('') gets ugly with this aggregation, so you could handle it with a slightly more involved aggregation like this:
df.groupby('Company Name').agg({'Country Code':lambda x: ', '.join(filter(None,x)), 'Activity':', '.join})
The following should work:
import pandas as pd
data = {
'Country Code': ['HK','US','SG','US','','US'],
'Company Name': ['A','A','A','A','B','B'],
'Activity': ['External services','Commerce','Transfer','Others','Others','External services'],
}
df = pd.DataFrame(data)
#grouping
grp = df.groupby('Company Name')
# custom function to join the values with ',' and strip the stray leading/trailing commas left by empty strings
def str_replace(ser):
s = ','.join(ser.values)
if s[0] == ',':
s = s[1:]
if s[len(s)-1] == ',':
s = s[:len(s)-1]
return s
#using agg functions
res = grp.agg({'Country Code':str_replace,'Activity':str_replace}).reset_index()
res
Output:
Company Name Country Code Activity
0 A HK,US,SG,US External services,Commerce,Transfer,Others
1 B US Others,External services
Another approach this time using transform()
# group the companies and concatenate the activities
df['Activities'] = df.groupby(['Company Name'])['Activity'] \
.transform(lambda x : ', '.join(x))
# group the companies and concatenate the country codes
df['Country Codes'] = df.groupby(['Company Name'])['Country Code'] \
.transform(lambda x : ', '.join([i for i in x if i != '']))
# the list comprehension deals with missing country codes (that have the value '')
# take this, drop the original columns and remove all the duplicates
result = df.drop(['Activity', 'Country Code'], axis=1) \
.drop_duplicates().reset_index(drop=True)
# reset index isn't really necessary
Result is
Company Name Activities Country Codes
0 A External services, Commerce, Transfer, Others HK, US, SG, US
1 B Others, External services US
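For completeness, the same result can come from a single agg call using named aggregation (a sketch on the same df; requires pandas >= 0.25):
res = (df.groupby('Company Name')
         .agg(Activities=('Activity', ', '.join),
              Country_Codes=('Country Code',
                             lambda x: ', '.join(i for i in x if i != '')))
         .reset_index())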

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas, and I can print "found" or "not found". However, I want my code to print out the name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easier if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does element-wise compare and returns a row of True/False. So you need to aggregate those rows with .all(axis=1) which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
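If the scores arrive as a plain list like x = [1, 0, 4], the same mask can be built programmatically instead of hard-coding each column (a sketch, assuming the match1..match3 column order lines up with x):
import numpy as np

x = [1, 0, 4]
match_cols = ['match1', 'match2', 'match3']

# AND together one equality mask per (column, value) pair
mask = np.logical_and.reduce([df[c] == v for c, v in zip(match_cols, x)])
print(df.loc[mask, 'name'])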
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the index of mydf
# (if there is more than one matching name, it will print all of them; if there is only one, just that one)
for i in range(0,len(mydf)):
print(mydf['name'].iloc[i])
You can use this. Here data is your dataframe, so change the name accordingly.
Considering [1,0,4] is int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If data is object (string) type, then use this:
data = data[(data['match1']== "1")&(data['match2']=="0")&(data['match3']== "4" ).index
print(data[0])
