I have a dataframe like this:
id value
111 0.222
222 2.253
333 0.444
444 21.256
....
I want to add a new column flag: for the first half of the rows, set flag to 0, and for the rest set it to 1:
id value flag
111 0.222 0
222 2.253 0
333 0.444 0
444 21.256 0
...
8888 1212.500 1
9999 0.025 1
What's the best way to do this? I tried the following:
df['flag'][:int(len(df) / 2)] = 0
df['flag'][int(len(df) / 2):] = 1
But this gave me KeyError: 'flag'. Do I first need to create an empty column named flag? Can someone help please? Thanks.
Even if you create an empty column first, you would still get a warning/error due to chained indexing. Assign the whole column at once instead:
df['flag'] = (np.arange(len(df)) >= (len(df)//2)).astype(int)
Or
l = len(df) // 2
df['flag'] = [0] * l + [1] * (len(df) - l)
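For example, on a small four-row frame (hypothetical data, echoing the question's sample), the first approach produces:

import numpy as np
import pandas as pd

# hypothetical sample data based on the question
df = pd.DataFrame({'id': [111, 222, 333, 444],
                   'value': [0.222, 2.253, 0.444, 21.256]})
# first half of the rows gets 0, the rest gets 1
df['flag'] = (np.arange(len(df)) >= len(df) // 2).astype(int)
print(df)
#     id   value  flag
# 0  111   0.222     0
# 1  222   2.253     0
# 2  333   0.444     1
# 3  444  21.256     1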
I have a df
Side  ref_price  price  price_diff
   0        100    110
   1        110    100
I want to compute price_diff based on the Side values:
if side == 0:
    df['price_diff'] = df['ref_price'] * df['price']
elif side == 1:
    df['price_diff'] = df['ref_price'] * df['price'] * -1
I tried:
df.loc[df.Side == 0, 'price_diff'] = df['price'] * df['ref_price']
but it is not working and throws errors.
You could use "Side" column as a condition in numpy.where:
df['price_diff'] = np.where(df['Side'].astype(bool), df['ref_price']*df['price']*-1, df['ref_price']*df['price'])
or, in this specific case, use the "Side" values as the power of -1:
df['price_diff'] = df['ref_price']*df['price']*(-1)**df['Side']
Output:
Side ref_price price price_diff
0 0 100 110 11000
1 1 110 100 -11000
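For reference, a self-contained sketch of the power-of-(-1) trick, assuming a reconstruction of the question's two-row frame:

import pandas as pd

# hypothetical reconstruction of the question's frame
df = pd.DataFrame({'Side': [0, 1], 'ref_price': [100, 110], 'price': [110, 100]})
# (-1)**0 == 1 keeps the sign, (-1)**1 == -1 flips it
df['price_diff'] = df['ref_price'] * df['price'] * (-1) ** df['Side']
print(df)
#    Side  ref_price  price  price_diff
# 0     0        100    110       11000
# 1     1        110    100      -11000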
You can use np.where:
df['price_diff'] = np.where(df['side'] == 0,
                            df['ref_price'] * df['price'],
                            df['ref_price'] * df['price'] * -1)
print(df)
# Output
side ref_price price price_diff
0 0 100 110 11000
1 1 110 100 -11000
I have a data frame where I build a third column (4) from the absolute difference of two columns, and I am trying to get the row holding the minimum (or second-smallest) value of that column, but I get an error. Is there a better method to get the row with the minimum value of column 4?
df2 = df[[2,3]]
df2[4] = np.absolute(df[2] - df[3])
#lowest = df.iloc[df[6].min()]
2 3 4
0 -111 -104 7
1 -130 110 240
2 -105 -112 7
3 -118 -100 18
4 -147 123 270
5 225 -278 503
6 102 -122 224
desired result:
   2    3  4
2 -105 -112 7
Get the difference as a Series, apply Series.abs, and then compare to the minimal value using boolean indexing:
s = (df[2] - df[3]).abs()
df = df[s == s.min()]
If you want a new column for the difference:
df['diff'] = (df[2] - df[3]).abs()
df = df[df['diff'] == df['diff'].min()]
Another idea is to get the index of the minimal value with Series.idxmin and then select with DataFrame.loc; to get a one-row DataFrame, double brackets [[]] are necessary:
s = (df[2] - df[3]).abs()
df = df.loc[[s.idxmin()]]
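A quick check with the question's data (a sketch; note that rows 0 and 2 tie at the minimum difference of 7, so the boolean-indexing variant returns both, while idxmin picks only the first):

import pandas as pd

# hypothetical reconstruction of columns 2 and 3 from the question
df = pd.DataFrame({2: [-111, -130, -105, -118, -147, 225, 102],
                   3: [-104, 110, -112, -100, 123, -278, -122]})
s = (df[2] - df[3]).abs()
print(df[s == s.min()])
#      2    3
# 0 -111 -104
# 2 -105 -112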
EDIT:
For more dynamic code that converts columns to integers where possible, use:
def int_if_possible(x):
    try:
        return x.astype(int)
    except Exception:
        return x

df = df.apply(int_if_possible)
I have a DataFrame with segments,timestamps and different columns
Segment  Timestamp   Value1  Value2  Value2_mean
0        2018-11...  180     156     135
0                    170     140     135
0                                    135
1
1
...
I want to group this DataFrame by 'Segment', get the first Timestamp in each segment as soon as the interval condition below is met, and then the time interval in seconds for that segment. Because the condition involves several columns, I don't think a plain aggregate works.
value2_mean - std(value2) <= value1 <= value2_mean + std(value2)
It should look like this:
Segment Interval[s]
0 10
1 19
2 6
3 ...
I tried something like this:
grouped = dataSeg.groupby(['Segment'])

def grouping(df):
    a = np.array(df['Value1'])
    b = np.array(df['Value2'])
    c = np.array(df['Value2_mean'])
    d = np.array(df['Timestamp'])
    for x in a:
        categories = np.logical_and(
            (c - np.std(b) <= x),
            (c + np.std(b) >= x))
        if np.any(categories):
            return d[categories] - d[0]

grouped.apply(grouping)
This does not work the way I want it to. Any suggestions would be appreciated!
Something like this? I didn't test it thoroughly.
def calc(grp):
    if grp.Value1.sub(grp.Value2_mean).abs().lt(grp.Value2.std()).any():
        return grp["Timestamp"].iloc[-1] - grp["Timestamp"].iloc[0]
    return np.nan
df.groupby("Segment").apply(calc)
I have the following sample data frame:
column1,column2,column3
tom,0100,544
tim,0101,514
ben,0899,1512
The third column contains the UserAccountControl flag, and each line represents one user entry. The flags are cumulative.
For example, for a disabled user account, UserAccountControl is set to 514 (2 + 512). In my example, tim is disabled.
I would like to create a new column for each flag, assigning 1 if the flag is set and 0 if not.
For the above example, the output will look like:
column1 column2 column3 DISABELDACCOUNT NORMALUSER PASSWORDNOTREQ TEMP_DUPLICATE_ACCOUNT SPECIALUSER
tom     100     544     0               1          1              0                      0
tim     101     514     1               1          0              0                      0
ben     899     1512    0               1          0              0                      1
Here is my Python code, but it didn't work for my dataframe. It works only for one row ...
#!/bin/python
import pandas as pd
from pandas import DataFrame
import numpy as np
def get_flags(number):
    df['DISABELDACCOUNT'] = 0
    df['NORMALUSER'] = 0
    df['PASSWORDNOTREQ'] = 0
    df['TEMP_DUPLICATE_ACCOUNT'] = 0
    df['SPECIALUSER'] = 0
    while number > 0:
        if number >= 1000:
            df['SPECIALUSER'] = 1
            number = number - 1000
            continue
        elif number >= 512:
            df['NORMALUSER'] = 1
            number = number - 512
            continue
        elif number >= 256:
            df['TEMP_DUPLICATE_ACCOUNT'] = 1
            number = number - 256
            continue
        elif number >= 32:
            df['PASSWORDNOTREQ'] = 1
            number = number - 32
            continue
        elif number >= 2:
            df['DISABELDACCOUNT'] = 1
            number = number - 2
            continue
df = pd.read_csv('data2.csv')
df['column3'].apply(get_flags)
Thanks a lot in advance!
Not sure why the column names differ from the Microsoft documentation that you quoted. But assuming that you are fine with renaming the columns according to those docs, you can make use of numpy's bitwise_and:
df = pd.read_csv('data2.csv')
flags = {
    'SCRIPT': 0x0001,
    'ACCOUNTDISABLE': 0x0002,
    'HOMEDIR_REQUIRED': 0x0008,
    'LOCKOUT': 0x0010,
    'PASSWD_NOTREQD': 0x0020,
    # .... (add more flags here as required, I just copy-pasted from the docs)
}

for (f, mask) in flags.items():
    df[f] = np.bitwise_and(df['column3'], mask) / mask
print(df)
This outputs:
column1 column2 column3 SCRIPT ACCOUNTDISABLE HOMEDIR_REQUIRED LOCKOUT PASSWD_NOTREQD
0 tom 100 544 0.0 0.0 0.0 0.0 1.0
1 tim 101 514 0.0 1.0 0.0 0.0 0.0
2 ben 899 1512 0.0 0.0 1.0 0.0 1.0
Incidentally, checking for flags that have been OR-ed together into a single number by using a bitmask is a pretty common pattern.
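For a single value, the same check is a plain bitwise AND; for example, with tim's 514 (values taken from the flag discussion above):

uac = 514  # 512 (NORMAL_ACCOUNT) + 2 (ACCOUNTDISABLE)
print(bool(uac & 0x0002))  # True  -> account is disabled
print(bool(uac & 0x0020))  # False -> PASSWD_NOTREQD is not set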
You can't use the function you created to do what you want. Every time you do, for example, df['SPECIALUSER'] = 1, it assigns 1 to the whole column, not only to the row you think you are targeting.
To assign each value to the correct row, you have to build each column separately instead:
df['SPECIALUSER'] = np.where(df['column3'] >= 1000, 1, 0)
df['NORMALUSER'] = np.where((df['column3'] - 1000) >= 512, 1, 0)
...
I didn't understand exactly the logic you use to assign 1 and 0, but if you correct that and repeat what I wrote above for all the columns you need, you should be able to get the result you are looking for.
Found a solution using .fillna
As you can guess, my title is already confusing, and so am I!
I have a dataframe like this
Index Values
0 NaN
1 NaN
...................
230 350.21
231 350.71
...................
1605 922.24
Between 230 and 1605 I have values, but not for the first 229 entries. So I calculated the slope to approximate the missing data and stored it in 'slope'.
Y1 = df['Values'].min()
X1ID = df['Values'].idxmin()
Y2 = df['Values'].max()
X2ID = df['Values'].idxmax()
slope = (Y2 - Y1)/(X2ID - X1ID)
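With the sample data above (min 350.21 at index 230, max 922.24 at index 1605, assuming the values rise monotonically so the min/max sit at the ends), this works out to:

slope = (922.24 - 350.21) / (1605 - 230)  # = 572.03 / 1375 ≈ 0.41602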
In essence I want to take the .min of Values, subtract the slope, and insert the new value at the index before the previous .min. However, I am completely lost. I tried something like this:
df['Values2'] = df['Values'].min().apply(lambda x: x.min() - slope)
But that is obviously rubbish. I would greatly appreciate some advice.
EDIT
So after trying multiple ways I found a crude solution that at least works for me.
loopcounter = 0
missingValue = []
missingindex = []
missingindex.append(loopcounter)
missingValue.append(Y1)
for minValue in missingValue:
    minValue = minValue - slope
    missingValue.append(minValue)
    loopcounter += 1
    missingindex.append(loopcounter)
    if loopcounter == 230:
        break
del missingValue[0]
missingValue.reverse()
del missingindex[-1]
First I created two lists, one for the missing values and one for the indices.
Afterwards I added my minimum value (Y1) to the value list and started my loop.
I wanted the loop to stop after 230 iterations (the number of missing values).
Each iteration subtracts the slope from the last item in the list, starting with the minimum value, while also appending the counter to the missingindex list.
Deleting the first value and reversing the order puts the list into the correct order.
missValue = dict(zip(missingindex,missingValue))
I then combined the two lists into a dictionary
df['Values'] = df['Values'].fillna(missValue)
Afterwards I used the .fillna function to fill up my dataframe with the dictionary.
This worked for me. I know it's probably not the most elegant solution...
I would like to thank everyone that invested their time in trying to help me, thanks a lot.
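For reference, the same back-fill can also be written without an explicit loop; a vectorized sketch, assuming Y1, slope, and X1ID as defined earlier:

import numpy as np

# back-fill the first X1ID rows by stepping the slope down from the minimum Y1:
# row 0 gets Y1 - X1ID*slope, ..., row X1ID-1 gets Y1 - slope
df.loc[:X1ID - 1, 'Values'] = Y1 - slope * np.arange(X1ID, 0, -1)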
Check this. However, I feel you would have to put this in a loop, as the insertion and the min calculation have to be re-done after every step:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=('Values',), data=[
    np.nan,
    np.nan,
    350.21,
    350.71,
    922.24,
])
Y1 = df['Values'].min()
X1ID = df['Values'].idxmin()
Y2 = df['Values'].max()
X2ID = df['Values'].idxmax()
slope = (Y2 - Y1)/(X2ID - X1ID)
line = pd.DataFrame(data=[Y1-slope], columns=('Values',), index=[X1ID])
df2 = pd.concat([df.loc[:X1ID-1], line, df.loc[X1ID:]]).reset_index(drop=True)
print(df2)
The insert logic is provided here: Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
I think you can use loc with interpolate:
print(df)
Values
Index
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 NaN
230 350.21
231 350.71
1605 922.24
#add value 0 to index = 0
df.at[0, 'Values'] = 0
#add value Y1 - slope (349.793978) to max NaN value
df.at[X1ID-1, 'Values'] = Y1 - slope
print(df)
Values
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
print(df.loc[0:X1ID-1, 'Values'])
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
Name: Values, dtype: float64
#filter values by indexes and interpolate
df.loc[0:X1ID-1, 'Values'] = df.loc[0:X1ID-1, 'Values'].interpolate(method='linear')
print(df)
Values
Index
0 0.000000
1 49.970568
2 99.941137
3 149.911705
4 199.882273
5 249.852842
6 299.823410
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
I will revise this a little bit:
df['Values2'] = df['Values']
df.loc[df.Values2.isnull(), 'Values2'] = Y1 - slope
EDIT
Or try to put this in a loop like below. This will recursively fill in all values until it reaches the beginning of the series:
def fix_rec(series):
    Y1 = series.min()
    X1ID = series.idxmin()
    Y2 = series.max()
    X2ID = series.idxmax()
    slope = (Y2 - Y1) / (X2ID - X1ID)
    if X1ID == 0:  # termination condition
        return series
    series.loc[X1ID - 1] = Y1 - slope
    return fix_rec(series)
call it like this:
df['values2'] = df['values']
fix_rec(df.values2)
I hope that helps!
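A quick sanity check on a toy frame (hypothetical data; fix_rec mutates the Series it receives, so pass a copy if the original should stay intact):

import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [np.nan, np.nan, 350.21, 350.71, 922.24]})
s = df['values'].copy()
fix_rec(s)            # fills the leading NaNs backwards from the minimum
df['values2'] = s
print(df['values2'])
# 0   -221.820
# 1     64.195
# 2    350.210
# 3    350.710
# 4    922.240
# Name: values2, dtype: float64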