I have a database (pd.DataFrame) like this:
condition odometer
0 new NaN
1 bad 1100
2 excellent 110
3 NaN 200
4 NaN 2000
5 new 20
6 bad NaN
And I want to fill the NaN of "condition" based on the values of "odometer":
new: odometer >0 and <= 100
excellent: odometer >100 and <= 1000
bad: odometer >1000
I tried to do this but it is not working:
for i in range(len(database)):
    if math.isnan(database['condition'][i]) == True:
        odometer = database['odometer'][i]
        if odometer > 0 & odometer <= 100: value = 'new'
        elif odometer > 100 & odometer <= 1000: value = 'excellent'
        elif odometer > 1000: value = 'bad'
        database['condition'][i] = value
I also tried making the first "if" condition:
database['condition'][i] == np.nan
But that doesn't work either.
You can use DataFrame.apply() to generate a new condition column with your function and replace the old one afterwards. I'm not sure what types your columns are; df['condition'].dtype will tell you. condition could be either a string or an object column, which could create a bug in your logic: if it's a string column you'll need a direct comparison == 'NaN', and if it's an object column you can compare against np.nan or math.nan. Also note that & is Python's bitwise operator and binds tighter than the comparisons, so use and (or parenthesize each comparison) inside plain if statements. I included a sample dataframe for each case below. You might also want to check the type of your odometer column.
import numpy as np
import pandas as pd
# condition column as string
df = pd.DataFrame({'condition':['new','bad','excellent','NaN','NaN','new','bad'], 'odometer':np.array([np.nan, 1100, 110, 200, 2000, 20, np.nan], dtype=object)})
# condition column as object
# df = pd.DataFrame({'condition':np.array(['new','bad','excellent',np.nan,np.nan,'new','bad'], dtype=object), 'odometer':np.array([np.nan, 1100, 110, 200, 2000, 20, np.nan], dtype=object)})
def f(database):
    if database['condition'] == 'NaN':
    #if np.isnan(database['condition']):
        odometer = database['odometer']
        if odometer > 0 and odometer <= 100: value = 'new'
        elif odometer > 100 and odometer <= 1000: value = 'excellent'
        elif odometer > 1000: value = 'bad'
        return value
    return database['condition']
df['condition'] = df.apply(f, axis=1)
I have a nice one-liner solution for you:
Let's create a sample dataframe:
import pandas as pd
df = pd.DataFrame({'condition':['new','bad',None,None,None], 'odometer':[None,1100,50,500,2000]})
df
Out:
condition odometer
0 new NaN
1 bad 1100.0
2 None 50.0
3 None 500.0
4 None 2000.0
Solution:
df.condition = df.condition.fillna(df.odometer.apply(lambda number: 'new' if number in range(101) else 'excellent' if number in range(101, 1001) else 'bad'))
df
Out:
condition odometer
0 new NaN
1 bad 1100.0
2 new 50.0
3 excellent 500.0
4 bad 2000.0
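Note that the range() checks above only match integer-valued odometer readings; if the column may contain arbitrary floats, a pd.cut-based fill is more robust. A sketch using the same bin edges as the question (this is an alternative, not the one-liner above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'condition': ['new', 'bad', None, None, None],
                   'odometer': [None, 1100, 50, 500, 2000]})

# bin the odometer values into the three condition labels, then fill only the missing rows
labels = pd.cut(df.odometer, bins=[0, 100, 1000, np.inf], labels=['new', 'excellent', 'bad'])
df.condition = df.condition.fillna(labels.astype(object))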
Related
I am definitely still learning python and have tried countless approaches, but can't figure this one out.
I have a dataframe with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A exceeds some threshold, for this example let's say 10. So far I am trying to use iterrows() and can segment based on whether A >= 10, but can't seem to solve the summation of rows until the threshold is met. The resultant df must be exhaustive even if the final A values do not meet the threshold - see the final row of the desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable t checks the cumulative sum of column A against n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe for that row (j and u just do the same thing in parallel for column B).
There are a few conditions, hence the elif statements, and the last row behaves differently the way I have set it up, so I needed separate logic for it with the final if -- otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a, b = [], []
t, u, count = 0, 0, 0
n = 10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)
df2 = pd.DataFrame({'A': a, 'B': b})
df2
df2
Here's one that works that's shorter:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df2 = pd.DataFrame()
index = 0
while index < df1.size / 2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        df2 = pd.concat([df2, temp_df], ignore_index=True)  # DataFrame.append() was removed in pandas 2.x
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size / 2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        if a_sum >= 10:
            temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
        else:
            a = df1.iloc[index - 1]['A']
            b = df1.iloc[index - 1]['B']
            temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In pandas, use iloc to access each row by index. Make sure you don't run past the end of the DataFrame by checking the size. df.size returns the number of elements (rows times columns), which is why I divided the size by the number of columns to get the actual number of rows.
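For reference, len(df1) and df1.shape[0] return the row count directly; a quick sketch of the equivalences, assuming the df1 defined above:
# all of these give the number of rows of df1 (8 for the sample data)
len(df1)                              # simplest
df1.shape[0]                          # first element of the (rows, cols) tuple
int(df1.size / len(df1.columns))      # what the loop condition above computes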
I am trying to add a new column "energy_class" to a dataframe "df_energy" that contains the string "high" if the "consumption_energy" value is > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200.
I tried to use np.where from numpy, but I see that numpy.where(condition[, x, y]) handles only two outcomes, not three as in my case.
Any idea to help me, please?
Thank you in advance
Try this:
Using the setup from @MaxU:
col = 'consumption_energy'
conditions = [ df2[col] >= 400, (df2[col] < 400) & (df2[col]> 200), df2[col] <= 200 ]
choices = [ "high", 'medium', 'low' ]
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
You can nest np.where calls (a vectorized ternary):
np.where(consumption_energy > 400, 'high',
(np.where(consumption_energy < 200, 'low', 'medium')))
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
    if x > 400: return "High"
    elif x > 200: return "Medium"
    else: return "Low"
func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add the numpy array as a column in your dataframe:
df_energy["energy_class"] = energy_class
The advantage of this approach is that more complicated constraints can be added to the column easily; see the sketch below.
Hope it helps.
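For instance, a minimal sketch of a multi-column constraint with np.vectorize (the peak_hour column is a hypothetical addition, purely to illustrate the pattern, not part of the original data):
# hypothetical: the cutoff depends on a second column, 'peak_hour'
def conditions2(energy, peak_hour):
    limit = 300 if peak_hour else 400   # stricter threshold during peak hours
    if energy > limit: return "High"
    elif energy > 200: return "Medium"
    else: return "Low"

func2 = np.vectorize(conditions2)
print(func2([450, 250, 100], [True, False, False]))   # ['High' 'Medium' 'Low']
# df_energy["energy_class2"] = func2(df_energy["consumption_energy"], df_energy["peak_hour"])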
I would use the cut() method here, which will generate a very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
WARNING: Be careful with NaNs
Always be careful that if your data has missing values, np.where may be tricky to use and may inadvertently give you the wrong result.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
(np.where(df.consumption_energy < 200, 'low', 'medium')))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can use one more nested np.where for medium versus NaN, which would be ugly.
IMHO the best way to go is pd.cut. It deals with NaNs and is easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
(np.where(df.age <20, 'child',
np.where(df.age.isnull(), np.nan, 'medium'))))
# outputs
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium
Let's start by creating a dataframe with 1,000,000 random numbers between 0 and 1000 to be used as a test:
df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A quick description of the dataframe:
print(df_energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])
Using Python's built-in functions
def energy_class(x):
    if x > 400:
        return 'high'
    elif x > 200:
        return 'medium'
    else:
        return 'low'
df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, measuring time with time.perf_counter() (for other ways to measure execution time see this), pandas.cut was the fastest approach.
method time
0 np.where() 0.124139
1 np.select() 0.155879
2 numpy.vectorize() 0.452789
3 pandas.cut() 0.046143
4 Python's built-in functions 0.138021
5 lambda function 0.19081
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
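The benchmark script itself isn't shown; a minimal harness along these lines (a sketch, not the original script) reproduces the comparison pattern with time.perf_counter():
import time
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1_000_000)})

def with_cut(df):
    return pd.cut(df['consumption_energy'], bins=[0, 200, 400, 1000],
                  labels=['low', 'medium', 'high'])

def with_select(df):
    return np.select([df['consumption_energy'] > 400, df['consumption_energy'] > 200],
                     ['high', 'medium'], default='low')

# time each approach once on the same data
for name, fn in [('pandas.cut()', with_cut), ('np.select()', with_select)]:
    start = time.perf_counter()
    fn(df_energy)
    print(f'{name}: {time.perf_counter() - start:.6f} s')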
Try this; even if consumption_energy contains nulls, don't worry about it.
def egy_class(x):
    '''
    This function assigns classes as per the energy consumed.
    '''
    return ('high' if x > 400 else
            'low' if x < 200 else 'medium')

chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
I second using np.vectorize. It is much faster than np.where and the code is cleaner too. You can definitely tell the speed-up with larger data sets. You can use a dictionary for your conditions as well as for the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1': 'high',
           'Condition2': 'medium',
           'Condition3': 'low',
           'Condition4': 'lowest'}

def Conditions(dfSeries_element, dictionary):
    '''
    dfSeries_element is an element from df_series
    dictionary: the dictionary of your conditions with their outcomes
    '''
    if dfSeries_element in dictionary.keys():
        return dictionary[dfSeries_element]

def VectorizeConditions():
    func = np.vectorize(Conditions)
    result_vector = func(df['Series'], row_dic)
    df['new_Series'] = result_vector

# running the below function will apply multi-conditional formatting to your df
VectorizeConditions()
myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))
when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.
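A runnable sketch of the same nesting (the myassign frame and its points values are assumed sample data, not from the original post):
import numpy as np
import pandas as pd

myassign = pd.DataFrame({'points': [95, 72, 30, 91, 50]})   # assumed sample data
myassign['assign3'] = np.where(myassign['points'] > 90, 'genius',
                      np.where(myassign['points'] > 50, 'good', 'bad'))
print(myassign)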
I'd like to sum the values grouped by consecutive positive and negative flows and then compare them to figure out the largest negative and largest positive flows.
I think itertools is probably the way to do this but I can't figure it out.
# create a data frame that shows week and value
n_rows = 30
dftest = pd.DataFrame({'week': pd.date_range('1/4/2019', periods=n_rows, freq='W'),
                       'value': np.random.randint(-100, 100, size=(n_rows))})

# flag positives and negatives
def flowFinder(row):
    if row['value'] > 0:
        return "Positive"
    else:
        return "Negative"

dftest['flag'] = dftest.apply(flowFinder, axis=1)
dftest
In this example df, you'd determine that rows 15-19 add up to 249, which is the max value of all the positive flows. The max negative flow is row 5 with -98.
Edit by Scott Boston
It is best if you add code that generates your dataframe instead of a link to a picture.
df = pd.DataFrame({'week':pd.date_range('2019-01-06',periods=21, freq='W'),
'value':[64,43,94,-19,3,-98,1,80,-7,-43,45,58,27,29,
-4,20,97,30,22,80,-95],
'flag':['Positive']*3+['Negative']+['Positive']+['Negative']+
['Positive']*2+['Negative']*2+['Positive']*4+
['Negative']+['Positive']*5+['Negative']})
You can try this:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min','max'])
Output:
min -98
max 249
Name: value, dtype: int64
Using rename:
df.groupby((df['flag'] != df['flag'].shift()).cumsum())['value'].sum().agg(['min','max'])\
.rename(index={'min':'Negative','max':'Positive'})
Output:
Negative -98
Positive 249
Name: value, dtype: int64
Update per the comment:
df_out = df.groupby((df['flag'] != df['flag'].shift()).cumsum())[['value','week']]\
           .agg({'value':'sum','week':'last'})
df_out.loc[df_out.agg({'value':['idxmin','idxmax']}).squeeze().tolist()]
Output:
value week
flag
4 -98 2019-02-10
9 249 2019-05-19
I have the following sample data frame:
column1,column2,column3
tom,0100,544
tim,0101,514
ben,0899,1512
The third column contains the UserAccountControl flag, and each line represents one user entry. The flag values are cumulative.
This means that for a disabled user account, the UserAccountControl is set to 514 (2 + 512). In my example, tim is disabled.
I would like to create a new column for each flag, holding 1 if the flag is set and 0 if not.
For the above example, the output will look like:
column1 column2 column3 DISABELDACCOUNT NORMALUSER PASSWORDNOTREQ TEMP_DUPLICATE_ACCOUNT SPECIALUSER
tom 100 544 0 1 1 0 0
tim 100 512 0 1 0 0 0
ben 899 1512 0 1 0 0 1
Here is my Python code - but it doesn't work for my dataframe; it only works with one row ...
#!/bin/python
import pandas as pd
from pandas import DataFrame
import numpy as np
def get_flags(number):
    df['DISABELDACCOUNT'] = 0
    df['NORMALUSER'] = 0
    df['PASSWORDNOTREQ'] = 0
    df['TEMP_DUPLICATE_ACCOUNT'] = 0
    df['SPECIALUSER'] = 0
    while number > 0:
        if number >= 1000:
            df['SPECIALUSER'] = 1
            number = number - 1000
            continue
        elif number >= 512:
            df['NORMALUSER'] = 1
            number = number - 512
            continue
        elif number >= 256:
            df['TEMP_DUPLICATE_ACCOUNT'] = 1
            number = number - 256
            continue
        elif number >= 32:
            df['PASSWORDNOTREQ'] = 1
            number = number - 32
            continue
        elif number >= 2:
            df['TEMP_DUPLICATE_ACCOUNT'] = 1
            number = number - 2
            continue
df = pd.read_csv('data2.csv')
df['column3'].apply(get_flags)
Thanks a lot in advance!
Not sure why the column names differ from the Microsoft documentation that you quoted, but assuming you are fine with renaming the columns according to those docs, you can make use of numpy's bitwise_and:
df = pd.read_csv('data2.csv')
flags = {
    'SCRIPT'           : 0x0001,
    'ACCOUNTDISABLE'   : 0x0002,
    'HOMEDIR_REQUIRED' : 0x0008,
    'LOCKOUT'          : 0x0010,
    'PASSWD_NOTREQD'   : 0x0020,
    # .... (add more flags here as required, I just copy-pasted from the docs)
}

for (f, mask) in flags.items():
    df[f] = np.bitwise_and(df['column3'], mask) / mask

print(df)
This outputs:
column1 column2 column3 SCRIPT ACCOUNTDISABLE HOMEDIR_REQUIRED LOCKOUT PASSWD_NOTREQD
0 tom 100 544 0.0 0.0 0.0 0.0 1.0
1 tim 101 514 0.0 1.0 0.0 0.0 0.0
2 ben 899 1512 0.0 0.0 1.0 0.0 1.0
Incidentally, checking for flags that have been packed together into a single integer by testing it against bitmasks is a pretty common pattern.
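For instance, the disabled flag in 514 can be tested directly with a bitwise AND:
uac = 514                    # 512 (NORMAL_ACCOUNT) + 2 (ACCOUNTDISABLE)
print(bool(uac & 0x0002))    # True  -> account is disabled
print(bool(uac & 0x0020))    # False -> PASSWD_NOTREQD is not set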
You can't use the function you created to do what you want. Every time you do, for example, df['SPECIALUSER'] = 1, it assigns 1 to the whole column, not only to the row you think you are targeting.
To assign each value to the correct row you have to build each column separately instead:
df['SPECIALUSER'] = np.where(df['column3'] >= 1000, 1, 0)
df['NORMALUSER'] = np.where((df['column3'] - 1000) >= 512, 1, 0)
...
I didn't understand exactly the logic you use to assign 1 and 0, but if you correct that and repeat what I wrote above for all the columns you need, you should be able to get the result you are looking for.
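Since the flag values are powers of two, a hedged sketch that combines this per-column idea with a bitwise test (rather than subtraction) could look like this; it assumes column3 is an integer column:
import numpy as np

# one column per flag: 1 if the bit is set, 0 otherwise
df['ACCOUNTDISABLE'] = np.where(df['column3'] & 0x0002, 1, 0)
df['NORMAL_ACCOUNT'] = np.where(df['column3'] & 0x0200, 1, 0)
df['PASSWD_NOTREQD'] = np.where(df['column3'] & 0x0020, 1, 0)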
I have a pandas df with 3 columns:
Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413
...
996 436.97 446.468790 426.600543
997 438.16 446.461401 426.599265
998 437.00 446.093899 426.641434
999 437.52 446.024365 426.631635
1000 437.75 446.114093 426.715907
Objective:
For every row, I need to test whether any of the next 30 rows' Close price touches the top or bottom barrier of the starting row. E.g., start from row index 0 and test if Close (441.86) is greater than Top_Barrier (441.96) or lower than Bottom_Barrier (426.36); if it is greater than Top_Barrier, return 1, and if it is lower than Bottom_Barrier, return -1. Otherwise move to the next row: at index 1 the Close price is 448.95, but it is still tested against the barrier prices from index 0, i.e., Top_Barrier of 441.96 and Bottom_Barrier of 426.36. This loop continues until index 29, and returns 0 if the Close price never touches the barriers. The next rolling loop then starts from index 1 and runs until index 30, and so on.
Attempts:
I tried using .rolling.apply with the following function but I just could not resolve the errors. Happy to explore any other methods as long as it achieve my objective stated above. Thanks!
def tbl_rolling(x):
    start_i = x.index[0]
    for i in range(len(x)):
        # the barrier freezes at index 0
        if x.loc[i, 'Close'] > x.loc[start_i, 'Top_Barrier']:
            return 1
        elif x.loc[i, 'Close'] < x.loc[start_i, 'Bottom_Barrier']:
            return -1
    return 0
The following then throws IndexingError: Too many indexers
test = df.rolling(30).apply(tbl_rolling, raw=False)
You can try something like this if your dataset isn't very big:
df = df.reset_index().assign(key=1)

def f(x):
    cond1 = x['Close_x'] > x['Top_Barrier_y'].max()
    cond2 = x['Close_x'] < x['Bottom_Barrier_y'].min()
    return np.select([cond1, cond2], [1, -1], default=0)[0]

df.merge(df, on='key').query('index_y <= index_x').groupby('index_x').apply(f)
Output:
index_x
0 0
1 1
2 1
3 1
4 1
996 0
997 0
998 0
999 0
1000 0
dtype: int64
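For comparison, a more literal (unvectorized) sketch of the stated objective, checking up to the next 30 closes against the barriers frozen at each starting row; the function name and window argument are illustrative, not from the answer above:
import numpy as np

def touch_labels(frame, window=30):
    close = frame['Close'].to_numpy()
    top = frame['Top_Barrier'].to_numpy()
    bottom = frame['Bottom_Barrier'].to_numpy()
    labels = np.zeros(len(frame), dtype=int)
    for i in range(len(frame)):
        # barriers are frozen at the starting row i
        for c in close[i:i + window]:
            if c > top[i]:
                labels[i] = 1
                break
            if c < bottom[i]:
                labels[i] = -1
                break
    return labels

# df['touch'] = touch_labels(df)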