How can I use .fillna with specific values? - python

df["load_weight"] = df.loc[(df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")].fillna(1000, inplace=True)
I want to replace the NaN values in the "load_weight" column, but only for the rows that contain "HORNSBY BEND" and "BRUSH". However, the code above set the whole "load_weight" column to None. What did I do wrong?

I would use a mask for boolean indexing:
m = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
df.loc[m, "load_weight"] = df.loc[m, 'load_weight'].fillna(1000)
NB: you can't keep inplace=True when you assign the output. That is what was replacing your data with None: methods called with inplace=True return nothing.
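To see the trap in isolation, here is a tiny sketch on made-up toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"load_weight": [np.nan, 5.0]})

# With inplace=True the frame is mutated and the method returns None,
# so assigning the result back is what wipes the column.
result = df.fillna(1000, inplace=True)
print(result)                          # None
print(df["load_weight"].tolist())      # [1000.0, 5.0]
```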
Alternative with only boolean indexing:
m1 = (df["dropoff_site"] == "HORNSBY BEND") & (df['load_type'] == "BRUSH")
m2 = df['load_weight'].isna()
df.loc[m1&m2, "load_weight"] = 1000

Instead of fillna, you can use df.loc directly to do the required imputation:
df.loc[(df['dropoff_site'] == 'HORNSBY BEND') & (df['load_type'] == 'BRUSH')
       & (df['load_weight'].isnull()), 'load_weight'] = 1000
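A minimal end-to-end sketch of the masked fill, with made-up rows (only the first row matches both conditions):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "dropoff_site": ["HORNSBY BEND", "HORNSBY BEND", "OTHER"],
    "load_type":    ["BRUSH",        "MULCH",        "BRUSH"],
    "load_weight":  [np.nan,         np.nan,         np.nan],
})

# Fill only where both conditions hold AND the value is missing
m = ((df["dropoff_site"] == "HORNSBY BEND")
     & (df["load_type"] == "BRUSH")
     & df["load_weight"].isna())
df.loc[m, "load_weight"] = 1000

print(df["load_weight"])  # only the first row is filled with 1000
```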

Related

How to write conditionals across multiple columns in dataframe?

I have the following pandas dataframe:
I am trying to write some conditional Python statements: if we have an issue_status of 10 or 40, AND a market_phase of 0, AND a blank trading_state (which is what we have in all of the cases in the screenshot above), then I want to call a function called resolve_collision_mp(...).
Can I write the conditional in Python as follows?
# Collision for issue_status == 10
if market_info_df['issue_status'].eq('10').all() and market_info_df['market_phase'].eq('0').all() \
        and market_info_df['trading_state'] == ' ':  # need to change this, can't have equality for dataframe, need loc[...]
    return resolve_collision_mp_10(market_info_df)
# Collision for issue_status == 40
if market_info_df['issue_status'].eq('40').all() and market_info_df['market_phase'].eq('0').all() \
        and not market_info_df['trading_state']:
    return resolve_collision_mp_40(market_info_df)
I don't think the above is correct, any help would be much appreciated!
You can use .apply() with the relevant conditions:
df['new_col'] = df.apply(lambda row: resolve_collision_mp_10(row)
                         if (row['issue_status'] == 10 and row['market_phase'] == 0
                             and row['trading_state'] == '') else None, axis=1)
df['new_col'] = df.apply(lambda row: resolve_collision_mp_40(row)
                         if (row['issue_status'] == 40 and row['market_phase'] == 0
                             and row['trading_state'] == '') else row['new_col'], axis=1)
Note: I am assuming that you are trying to create a new column with the return values of the resolve_collision_mp_10 and resolve_collision_mp_40 functions.
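A self-contained sketch of that idea, with hypothetical stand-in handlers (the real resolve_collision_mp_10/resolve_collision_mp_40 bodies are not shown in the question) and both conditions folded into one pass:

```python
import pandas as pd

# Stand-in handlers: placeholders for the asker's real functions
def resolve_collision_mp_10(row):
    return "resolved_10"

def resolve_collision_mp_40(row):
    return "resolved_40"

df = pd.DataFrame({
    "issue_status":  [10, 40, 20],
    "market_phase":  [0, 0, 0],
    "trading_state": ["", "", "open"],
})

# Route each row to the matching handler, else None
df["new_col"] = df.apply(
    lambda row: resolve_collision_mp_10(row)
    if row["issue_status"] == 10 and row["market_phase"] == 0 and row["trading_state"] == ""
    else resolve_collision_mp_40(row)
    if row["issue_status"] == 40 and row["market_phase"] == 0 and row["trading_state"] == ""
    else None,
    axis=1,
)
print(df["new_col"].tolist())  # ['resolved_10', 'resolved_40', None]
```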

Python Pandas How to combine columns according to the condition of checking another column

I have a DataFrame:
value,combined,value_shifted,Sequence_shifted,long,short
12834.0,2.0,12836.0,3.0,2.0,-2.0
12813.0,-2.0,12781.0,-3.0,-32.0,32.0
12830.0,2.0,12831.0,3.0,1.0,-1.0
12809.0,-2.0,12803.0,-3.0,-6.0,6.0
12822.0,2.0,12805.0,3.0,-17.0,17.0
12800.0,-2.0,12807.0,-3.0,7.0,-7.0
12773.0,2.0,12772.0,3.0,-1.0,1.0
12786.0,-2.0,12787.0,1.0,1.0,-1.0
12790.0,2.0,12784.0,3.0,-6.0,6.0
I want to combine the long and short columns according to the value of the combined column:
If df.combined == 2, keep the value from long.
If df.combined == -2, keep the value from short.
Expected result:
value,combined,value_shifted,Sequence_shifted,calc
12834.0,2.0,12836.0,3.0,2.0
12813.0,-2.0,12781.0,-3.0,32
12830.0,2.0,12831.0,3.0,1.0
12809.0,-2.0,12803.0,-3.0,6.0
12822.0,2.0,12805.0,3.0,-17
12800.0,-2.0,12807.0,-3.0,-1.0
12773.0,2.0,12772.0,3.0,-1.0
12786.0,-2.0,12787.0,1.0,-6.0
12790.0,2.0,12784.0,3.0,20.0
If the combined column can contain values other than 2 and -2, use numpy.select:
df['calc'] = np.select([df['combined'].eq(2), df['combined'].eq(-2)],
                       [df['long'], df['short']])
Or, if there are only 2 and -2 values, use numpy.where:
df['calc'] = np.where(df['combined'].eq(2), df['long'], df['short'])
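A runnable sketch of the numpy.where version on the first three rows of the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "combined": [2.0, -2.0, 2.0],
    "long":     [2.0, -32.0, 1.0],
    "short":    [-2.0, 32.0, -1.0],
})

# Pick 'long' where combined == 2, otherwise 'short'
df["calc"] = np.where(df["combined"].eq(2), df["long"], df["short"])
print(df["calc"].tolist())  # [2.0, 32.0, 1.0]
```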
Try this:
df['calc'] = df['long'].where(df['combined'] == 2, df['short'])
Or build the column explicitly with masks:
df['calc'] = np.nan
mask_2 = df['combined'] == 2
df.loc[mask_2, 'calc'] = df.loc[mask_2, 'long']
mask_minus_2 = df['combined'] == -2
df.loc[mask_minus_2, 'calc'] = df.loc[mask_minus_2, 'short']
then you can drop the long and short columns:
df.drop(columns=['long', 'short'], inplace=True)

fillna() method does not replace values in data

I need to replace the missing values inside the dataset. I used the fillna() method and the function runs, but when I check, the data is still null.
import pandas as pd
import numpy as np
dataset = pd.read_csv('mamografia.csv')
dataset
mamografia = dataset
mamografia
malignos = mamografia[mamografia['Severidade'] == 0].isnull().sum()
print('Valores ausentes: ')
print()
print('Valores Malignos: ', malignos)
print()
belignos = mamografia[mamografia['Severidade'] == 1].isnull().sum()
print('Valores Belignos:', belignos)
def substitui_ausentes(lista_col):
    for lista in lista_col:
        if lista != 'Idade':
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mode())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mode())
        else:
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 0)].mean())
            mamografia[lista].fillna(value=mamografia[lista][(mamografia['Severidade'] == 1)].mean())
mamografia.columns
substitui_ausentes(mamografia.columns)
mamografia
I'm trying to replace the null values, using fillna()
By default fillna does not work in place but returns the result of the operation.
You can either set the new value manually using
df = df.fillna(...)
Or overwrite the default behaviour by setting the parameter inplace=True
df.fillna(... , inplace=True)
However your code will still not work since you want to fill the different severities separately.
Since the function is being rewritten, let's also make it more pandas-idiomatic by not having it change the DataFrame by default:
def substitui_ausentes(dfc, reglas, inplace=False):
    if inplace:
        df = dfc
    else:
        df = dfc.copy()
    fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
    for k in fill_values:
        df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
    return df
Note that you now need to call the function using
reglas = {
    'Idade': lambda x: pd.Series.mode(x)[0],
    'Densidade': 'mean'
}
substitui_ausentes(df, reglas, inplace=True)
and the reglas dictionary needs to include only the columns you want to fill and how you want to fill them.
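To see the behaviour end to end, here is the rewritten function run on a small made-up frame (column names taken from the question, values invented):

```python
import pandas as pd
import numpy as np

def substitui_ausentes(dfc, reglas, inplace=False):
    df = dfc if inplace else dfc.copy()
    # One {column: fill value} dict per Severidade level
    fill_values = df.groupby('Severidade').agg(reglas).to_dict(orient='index')
    for k in fill_values:
        df.loc[df['Severidade'] == k] = df.loc[df['Severidade'] == k].fillna(fill_values[k])
    return df

df = pd.DataFrame({
    'Severidade': [0, 0, 1, 1],
    'Idade':      [1.0, np.nan, 3.0, 3.0],
    'Densidade':  [2.0, 4.0, np.nan, 6.0],
})
reglas = {'Idade': lambda x: pd.Series.mode(x)[0], 'Densidade': 'mean'}
out = substitui_ausentes(df, reglas)
# NaNs are filled per group: Idade with the group mode, Densidade with the group mean
```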

Dropping rows at specific minutes

I am trying to drop rows at specific minutes (05, 10, 20).
I have datetime as an index:
df5['Year'] = df5.index.year
df5['Month'] = df5.index.month
df5['Day']= df5.index.day
df5['Day_of_Week']= df5.index.day_name()
df5['hour']= df5.index.strftime('%H')
df5['Min']= df5.index.strftime('%M')
df5
Then I run the function below:
def clean(df5):
    for i in range(len(df5)):
        hour = pd.Timestamp(df5.index[i]).hour
        minute = pd.Timestamp(df5.index[i]).minute
        if df5 = df5[(df5.index.minute ==5) | (df5.index.minute == 10)| (df5.index.minute == 20)]
            df.drop(axis=1, index=i, inplace=True)
but it returns an invalid syntax error.
Looping is not necessary here, and it is also not recommended.
Use DatetimeIndex.minute with Index.isin and inverted mask by ~ filtering in boolean indexing:
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
To reuse your df5['Min'] column instead, compare against string values:
df5 = df5[~df5['Min'].isin(['05', '10', '20'])]
All together:
def clean(df5):
    return df5[~df5.index.minute.isin([5, 10, 20])]
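A quick sketch with a made-up 5-minute DatetimeIndex:

```python
import pandas as pd

idx = pd.date_range("2023-01-01 00:00", periods=6, freq="5min")
df5 = pd.DataFrame({"x": range(6)}, index=idx)

# Minutes are 0, 5, 10, 15, 20, 25 -> drop 5, 10 and 20
out = df5[~df5.index.minute.isin([5, 10, 20])]
print(out.index.minute.tolist())  # [0, 15, 25]
```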
You can just do it using boolean indexing, assuming that the index is already parsed as datetime.
df5 = df5[~((df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20))]
Or the same filter written with negated conditions (note these must then be combined with & rather than |):
df5 = df5[(df5.index.minute != 5) & (df5.index.minute != 10) & (df5.index.minute != 20)]
Generally speaking, the right syntax to combine a logical OR inside an IF statement is the following:
today = 'Saturday'
if today == 'Sunday' or today == 'Saturday':
    print('Today is off. Rest at home')
In your case you are filtering a DataFrame, so the conditions must be combined with the element-wise | operator instead of or, e.g. to select the rows at those minutes:
df5[(df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20)]
......
FINAL NOTE:
You made some mistakes using == and =
In Python (and many other programming languages), a single equal mark = is used to assign a value to a variable, whereas two consecutive equal marks == are used to check whether two expressions give the same value.
= is an assignment operator
== is an equality operator

pandas loop on elements

I would like to compute delta time in a dataframe (with some condition) so I write a loop:
for i in range(1, len(df.index)):
    if df.type[i] == df.type[i-1]:
        df.delta[i] = df.time[i] - df.time[i-1]
    else:
        df.delta[i] = ''
but it seems not very optimised because it's very slow, and I get a SettingWithCopyWarning (which I don't understand). What is the best way to do such a computation?
I would use .shift() for that. It makes a new column with values shifted by 1.
So if we had no conditions, you'd need just df["time"] - df["time"].shift(), but since you want to add a condition, where helps. So here is a one-line solution:
(df["time"] - df["time"].shift()).where(df["type"] == df["type"].shift(), "")
Or as suggested in other answer, you can use diff
df["time"].diff().where(df["type"] == df["type"].shift(), "")
You should use a vectorised approach. For example, you can use numpy.where with pd.Series.shift and pd.Series.diff:
df['delta'] = np.where(df['type'] == df['type'].shift(), df['time'].diff(), np.nan)
Note I strongly recommend you do not use an empty string '' as your alternative value, as this will force your series to have object dtype instead of float.
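A runnable sketch of this vectorised approach on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"type": ["a", "a", "b", "b"],
                   "time": [1.0, 3.0, 4.0, 9.0]})

# delta only where the type matches the previous row; NaN otherwise
df["delta"] = np.where(df["type"] == df["type"].shift(),
                       df["time"].diff(), np.nan)
print(df["delta"].tolist())  # [nan, 2.0, nan, 5.0]
```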
my approach would be to use pandas.apply():
type_prev = ''
time_prev = 0
def lambda_func(row):
    global type_prev
    global time_prev
    if row['type'] == type_prev:
        time_diff = row['time'] - time_prev
    else:
        time_diff = ''
    time_prev = row['time']
    type_prev = row['type']
    return time_diff
df['delta'] = df.apply(lambda_func, axis=1)
