How to write conditionals across multiple columns in dataframe? - python

I have the following pandas dataframe:
I am trying to write some conditional Python statements: if we have an issue_status of 10 or 40 AND a market_phase of 0 AND a blank trading_state (which is what we have in all of the cases in the above screenshot), then I want to call a function called resolve_collision_mp(...).
Can I write the conditional in Python as follows?
# Collision for issue_status == 10
if market_info_df['issue_status'].eq('10').all() and market_info_df['market_phase'].eq('0').all() \
        and market_info_df['trading_state'] == ' ':  # need to change this, can't have equality for dataframe, need loc[...]
    return resolve_collision_mp_10(market_info_df)
# Collision for issue_status == 40
if market_info_df['issue_status'].eq('40').all() and market_info_df['market_phase'].eq('0').all() \
        and not market_info_df['trading_state']:
    return resolve_collision_mp_40(market_info_df)
I don't think the above is correct, any help would be much appreciated!

You can use .apply() with the relevant conditions:
df['new_col'] = df.apply(lambda row: resolve_collision_mp_10(row) if (row['issue_status'] == 10 and row['market_phase'] == 0 and row['trading_state'] == '') else None, axis=1)
df['new_col'] = df.apply(lambda row: resolve_collision_mp_40(row) if (row['issue_status'] == 40 and row['market_phase'] == 0 and row['trading_state'] == '') else None, axis=1)
(Note that the second assignment overwrites the first; if both statuses can occur in the same frame, combine the conditions into a single apply or write to separate columns.)
Note: I am assuming that you are trying to create a new column with the return values of the resolve_collision_mp_10 and resolve_collision_mp_40 functions.
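If the goal is instead the whole-frame check the question sketches (call a resolver only when every row satisfies the conditions), a minimal sketch could look like the following. It assumes string-typed columns and uses a hypothetical resolve_collision stand-in for the resolve_collision_mp_10 / resolve_collision_mp_40 functions:

```python
import pandas as pd

def resolve_collision(market_info_df, issue_status):
    # Hypothetical stand-in for resolve_collision_mp_10 / resolve_collision_mp_40
    return f"resolved_{issue_status}"

def dispatch(market_info_df):
    # Vectorized whole-frame checks: every row must satisfy all three conditions.
    # .eq(...).all() reduces each boolean Series to a single Python bool,
    # which avoids the "truth value of a Series is ambiguous" error.
    for status in ("10", "40"):
        if (market_info_df["issue_status"].eq(status).all()
                and market_info_df["market_phase"].eq("0").all()
                and market_info_df["trading_state"].eq(" ").all()):
            return resolve_collision(market_info_df, status)
    return None

market_info_df = pd.DataFrame({
    "issue_status": ["10", "10"],
    "market_phase": ["0", "0"],
    "trading_state": [" ", " "],
})
```

The loop over statuses is just one way to avoid duplicating the condition; two separate if blocks, as in the question, work equally well.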

Related

How can I use .fillna with specific values?

df["load_weight"] = df.loc[(df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")].fillna(1000, inplace=True)
I want to change the NaN values in the "load_weight" column, but only for the rows that contain "HORNSBY BEND" and "BRUSH". The above code set the whole "load_weight" column to None. What did I do wrong?
I would use a mask for boolean indexing:
m = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, 'load_weight'].fillna(1000)
NB. you can't keep inplace=True when you assign the output. This is what was causing your data to be replaced with None as methods called with inplace=True return nothing.
Alternative with only boolean indexing:
m1 = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
m2 = df['load_weight'].isna()
df.loc[m1&m2, "load_weight"] = 1000
Instead of fillna, you can directly use df.loc to do the required imputation
df.loc[((df['dropoff_site'] == 'HORNSBY BEND') & (df['load_type'] == 'BRUSH')
        & (df['load_weight'].isnull())), 'load_weight'] = 1000
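A toy example of the mask approach, using made-up data, showing that only the targeted rows are filled:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "dropoff_site": ["HORNSBY BEND", "HORNSBY BEND", "OTHER"],
    "load_type":    ["BRUSH",        "BRUSH",        "BRUSH"],
    "load_weight":  [np.nan,         250.0,          np.nan],
})

# Fill NaN only where both site and type match
m = (df["dropoff_site"] == "HORNSBY BEND") & (df["load_type"] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, "load_weight"].fillna(1000)
```

Row 0 (matching and NaN) is filled with 1000, row 1 (matching but not NaN) keeps its value, and row 2 (NaN but not matching) stays NaN.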

Trying to filter a CSV file with multiple variables using pandas in python

import pandas as pd
import numpy as np

df = pd.read_csv("adult.data.csv")
print("data shape: " + str(data.shape))
print("number of rows: " + str(data.shape[0]))
print("number of cols: " + str(data.shape[1]))
print(data.columns.values)

datahist = {}
for index, row in data.iterrows():
    k = str(row['age']) + str(row['sex']) + \
        str(row['workclass']) + str(row['education']) + \
        str(row['marital-status']) + str(row['race'])
    if k in datahist:
        datahist[k] += 1
    else:
        datahist[k] = 1

uniquerows = 0
for key, value in datahist.items():
    if value == 1:
        uniquerows += 1
print(uniquerows)

for key, value in datahist.items():
    if value == 1:
        print(key)

df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
I have been trying to get the above code to work.
I have limited experience in coding but it seems like the issue lies with some of the columns being objects. The int64 columns work just fine when it comes to filtering.
Any assistance will be much appreciated!
df.loc[data['age'] == 58] & df.loc[data['sex'] == Male]
Firstly, you are attempting to use a Male variable; you probably meant a string, i.e. it should be 'Male'. Secondly, observe the [ and ] placement: you are extracting the part of the DataFrame with age equal to 58, then extracting the part with sex equal to Male, and then trying to combine them with bitwise and. You should use & with the conditions rather than with pieces of the DataFrame, that is:
df.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
The int64 columns work just fine because you've specified the condition correctly as:
data['age'] == 58
However, the object column condition data['sex'] == Male should be specified as a string:
data['sex'] == 'Male'
Also, I noticed that you have loaded the dataframe df = pd.read_csv("adult.data.csv"). Do you mean this instead?
data = pd.read_csv("adult.data.csv")
The query at the end includes 2 conditions, and should be enclosed in brackets within the square brackets [ ] filter. If the dataframe name is data (instead of df), it should be:
data.loc[(data['age'] == 58) & (data['sex'] == 'Male')]
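For illustration, here is the corrected filter run on a small made-up frame (standing in for the adult.data.csv columns), with each condition parenthesized and combined with the bitwise & operator:

```python
import pandas as pd

data = pd.DataFrame({
    "age": [58, 58, 40],
    "sex": ["Male", "Female", "Male"],
})

# Parenthesize each condition; & is bitwise AND on boolean Series
subset = data.loc[(data["age"] == 58) & (data["sex"] == "Male")]
```

Only the row with age 58 and sex 'Male' survives the filter.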

Dropping rows at specific minutes

I am trying to drop rows at specific minutes (05, 10, 20).
I have datetime as an index:
df5['Year'] = df5.index.year
df5['Month'] = df5.index.month
df5['Day']= df5.index.day
df5['Day_of_Week']= df5.index.day_name()
df5['hour']= df5.index.strftime('%H')
df5['Min']= df5.index.strftime('%M')
df5
Then I run below
def clean(df5):
    for i in range(len(df5)):
        hour = pd.Timestamp(df5.index[i]).hour
        minute = pd.Timestamp(df5.index[i]).minute
        if df5 = df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
            df.drop(axis=1, index=i, inplace=True)
It returns an invalid syntax error.
Here looping is not necessary, also not recommended.
Use DatetimeIndex.minute with Index.isin and inverted mask by ~ filtering in boolean indexing:
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
To reuse the column df5['Min'] instead, use string values:
df5 = df5[~df5['Min'].isin(['05', '10', '20'])]
All together:
def clean(df5):
    return df5[~df5.index.minute.isin([5, 10, 20])]
You can just do it using boolean indexing, assuming that the index is already parsed as datetime.
df5 = df5[~((df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20))]
Or equivalently, negating each condition instead of the whole mask (note that the negated conditions must be combined with &, not |):
df5 = df5[(df5.index.minute != 5) & (df5.index.minute != 10) & (df5.index.minute != 20)]
Generally speaking, the right syntax to combine a logical or inside an if statement is the following:
today = 'Saturday'
if today == 'Sunday' or today == 'Saturday':
    print('Today is off. Rest at home')
In your case, however, you are filtering a DataFrame, so the conditions must be combined with the bitwise | operator rather than or:
df5_to_drop = df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
FINAL NOTE:
You made some mistakes using == and =
In Python (and many other programming languages), a single equal mark = is used to assign a value to a variable, whereas two consecutive equal marks == are used to check whether two expressions give the same value.
= is an assignment operator
== is an equality operator
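Putting the isin approach from the first answer to a quick test on a synthetic DatetimeIndex (made-up data, one row per minute):

```python
import pandas as pd

# 30 consecutive minutes: 00:00 through 00:29
idx = pd.date_range("2023-01-01 00:00", periods=30, freq="min")
df5 = pd.DataFrame({"v": range(30)}, index=idx)

# Keep only rows whose minute is NOT 5, 10, or 20
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
```

Three of the thirty rows are dropped, and no row with minute 5, 10, or 20 remains.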

How to include two lambda operations in transform function?

I have a dataframe like as given below
df = pd.DataFrame({
'date' :['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00','2173-04-04 11:00:00','2173-04-04 12:00:00','2173-04-04 11:30:00','2173-04-04 16:00:00','2173-04-04 22:00:00','2173-04-05 04:00:00'],
'subject_id':[1,1,1,1,1,1,1,1,1],
'val' :[5,5,5,10,10,5,5,8,8]
})
I would like to apply a couple of logics (logic_1 on the val column and logic_2 on the date column) to the code. Please find the logic below:
logic_1 = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
credit to SO users for helping me with logic
This is what I tried below
df['label'] = ''
df['date'] = pd.to_datetime(df['date'])
df['tdiff'] = df['date'].shift(-1) - df['date']
df['tdiff'] = df['tdiff'].dt.total_seconds()/3600
df['lo_1'] = df.groupby('subject_id')['val'].transform(logic_1).map({True:'1',False:''})
df['lo_2'] = df.groupby('subject_id')['tdiff'].transform(logic_2).map({True:'1',False:''})
How can I make both logic_1 and logic_2 part of one logic statement? Is it even possible? I might have more than two logics as well. Instead of writing one line for each logic, is it possible to couple all logics together in one statement?
I expect my output to be with label column being filled with 1 when both logic_1 and logic_2 are satisfied
You have a few things to fix.
First, in logic_2, you have lambda x but use y, so you have to change that as below:
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))
Then you can use the logics together as below.
There is no need to create a blank label column; you can create the 'label' column directly:
df['label'] = ((df.groupby('subject_id')['val'].transform(logic_1))
               & (df.groupby('subject_id')['tdiff'].transform(logic_2))).map({True:'1',False:''})
Note: on this sample data your logic produces all False values, so the label column ends up empty everywhere; you would only see 1's if you mapped False to '1' instead of True.
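When there are more than two logics, one pattern (sketched here on the question's own sample data, with the logics kept exactly as given) is to collect (column, function) pairs and reduce the transformed boolean Series with &:

```python
import pandas as pd
from functools import reduce

df = pd.DataFrame({
    'date': ['2173-04-03 12:35:00', '2173-04-03 17:00:00', '2173-04-03 20:00:00',
             '2173-04-04 11:00:00', '2173-04-04 12:00:00', '2173-04-04 11:30:00',
             '2173-04-04 16:00:00', '2173-04-04 22:00:00', '2173-04-05 04:00:00'],
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1],
    'val': [5, 5, 5, 10, 10, 5, 5, 8, 8],
})
df['date'] = pd.to_datetime(df['date'])
df['tdiff'] = (df['date'].shift(-1) - df['date']).dt.total_seconds() / 3600

logic_1 = lambda x: (x.shift(2).ge(x.shift(1))) & (x.ge(x.shift(2).add(3))) & (x.eq(x.shift(-1)))
logic_2 = lambda y: (y.shift(1).ge(1)) & (y.shift(2).ge(2)) & (y.shift(-1).ge(1))

# Any number of (column, logic) pairs can be listed here
logics = [('val', logic_1), ('tdiff', logic_2)]

# AND every transformed boolean Series together, then map to '1'/''
combined = reduce(lambda a, b: a & b,
                  (df.groupby('subject_id')[col].transform(fn) for col, fn in logics))
df['label'] = combined.map({True: '1', False: ''})
```

With this particular sample data, logic_1 and logic_2 never hold on the same row, so the label column stays empty, matching the note above.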

Replace value of column based on value in separate column

I have a pandas DataFrame that looks like:
ID | StateName | ZipCode
---|-----------|--------
 0 | MD        | 20814
 1 |           | 90210
 2 | DC        | 20006
 3 |           | 05777
 4 |           | 12345
I have a function that will fill in StateName based on ZipCode value:
def FindZip(x):
    search = ZipcodeSearchEngine()
    zipcode = search.by_zipcode(x)
    return zipcode['State']
I want to fill in the blanks in the column StateName - based on the value of the corresponding ZipCode. I've unsuccessfully tried this:
test['StateName'] = test['StateName'].apply(lambda x: FindZip(test['Zip_To_Use']) if x == "" else x)
Basically, I want to apply a function to a column different from the column I am trying to change. I would appreciate any help! Thanks!
You can try following:
test['StateName'] = test.apply(lambda x: FindZip(x['Zip_To_Use'])
                               if x['StateName'] == ""
                               else x['StateName'], axis=1)
The above code applies to the whole dataframe rather than to StateName alone; with axis=1, the lambda receives one row at a time, so it can read both columns of that row.
Updated:
Updated with multiple condition in if statement (looking at the solution below):
test['StateName'] = test.apply(lambda x: FindZip(x['Zip_To_Use'])
                               if ((x['StateName'] == "") and (x['Zip_To_Use'] != ""))
                               else x['StateName'], axis=1)
I came up with a not very "pandorable" workaround. I would still love to see a more "pythonic" or "pandorable" solution if anyone has ideas! I essentially created a new list of the same length as the DataFrame and iterated through each row and then wrote over the column with the new list.
state = [FindZip(test['Zip_To_Use'].iloc[i])
         if (test['StateName'].iloc[i] == "" and test['Zip_To_Use'].iloc[i] != "")
         else test['StateName'].iloc[i]
         for i in range(len(test))]
Restated in a regular for loop (for readability):
state = []
for i in range(len(test)):
    if (test['StateName'].iloc[i] == "" and test['Zip_To_Use'].iloc[i] != ""):
        state.append(FindZip(test['Zip_To_Use'].iloc[i]))
    else:
        state.append(test['StateName'].iloc[i])
And then reassigned the column with this new list
test['StateName'] = state
Please let me know if you have a better solution!
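A vectorized alternative, sketched here with a hypothetical dict-based lookup standing in for ZipcodeSearchEngine (the real lookup would call search.by_zipcode(...)['State'] instead), is to fill only the blank rows with a boolean mask and .map, avoiding any per-row iteration:

```python
import pandas as pd

# Hypothetical lookup standing in for ZipcodeSearchEngine().by_zipcode(x)['State']
ZIP_TO_STATE = {"90210": "CA", "05777": "VT", "12345": "NY"}

test = pd.DataFrame({
    "StateName":  ["MD", "", "DC", "", ""],
    "Zip_To_Use": ["20814", "90210", "20006", "05777", "12345"],
})

# Fill StateName only where it is blank, mapping from the zip column
blank = test["StateName"] == ""
test.loc[blank, "StateName"] = test.loc[blank, "Zip_To_Use"].map(ZIP_TO_STATE)
```

Rows that already have a StateName are left untouched; only the blank rows are looked up and filled.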
