I have a dataframe and I want to make a new column that is 0 if any one of about 20 conditions are true, and 1 otherwise. My current code is:
conditions=[(lastYear_sale['Low_Balance_Flag'==0])&(lastYear_sale.another_Flag==0),(lastYear_sale['Low_Balance_Flag'==1])&(lastYear_sale.another_Flag==1)]
choices=[1,0]
lastYear_sale['eligible']=np.select(conditions,choices,default=0)
So here is a simplified version of the dataframe I have, that looks a little like:
data = {'ID':['a', 'b', 'c', 'd'],
'Low_Balance_Flag':[1, 0, 1, 0], 'another_Flag':[0,0,1,1]}
dfr = pd.DataFrame(data)
I would like to add a column called eligible that is 0 if Low_Balance_Flag or another_Flag is 1, and 1 only if all of the flag columns are 0. My attempt raises KeyError: False, but I can't see what is wrong. Thanks for any suggestions! :)
Edit: so the output I'd be looking for in this case would be:
ID Low_Balance another_Flag Eligible
a 1 0 0
b 0 0 1
c 1 1 0
d 0 1 0
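Your conditions are basically what you need; they just have to be written properly. The KeyError: False comes from a misplaced bracket: in lastYear_sale['Low_Balance_Flag'==0] the comparison 'Low_Balance_Flag'==0 is evaluated first and yields False, so pandas looks up a column named False. The closing bracket belongs before the ==. I assume your conditions are the two clauses separated by the comma, and that lastYear_sale is the same data frame as dfr. Then: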
conditions=((lastYear_sale['Low_Balance_Flag']==0)&(lastYear_sale.another_Flag==0))|((lastYear_sale['Low_Balance_Flag']==1)&(lastYear_sale.another_Flag==1))
print((~conditions).astype(int))
0 1
1 0
2 0
3 0
dtype: int64
If your conditions are dynamic and you need to build them in code, you can build a string expression and evaluate it with pandas.DataFrame.eval (or filter rows by it with pandas.DataFrame.query).
Edit: I still assume dfr is the same as lastYear_sale. Also, the data in dfr does not match the data in the expected output.
#Use either of these
conditions = ~((lastYear_sale['Low_Balance_Flag']==1)|(lastYear_sale.another_Flag==1))
conditions = ~lastYear_sale.eval('Low_Balance_Flag==1|another_Flag==1')
dfr['eligible'] = conditions.astype(int)
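Since the question mentions about 20 flags, here is a minimal sketch of how the same idea might scale, assuming every condition is of the form "this flag column equals 1". The list flag_cols is a hypothetical name, not something from the question; extend it with the real column names.

```python
import pandas as pd

# Sample frame from the question
dfr = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'Low_Balance_Flag': [1, 0, 1, 0],
    'another_Flag': [0, 0, 1, 1],
})

# Hypothetical list of flag columns; extend it to cover all ~20 flags
flag_cols = ['Low_Balance_Flag', 'another_Flag']

# True for rows where at least one flag equals 1
any_flag_set = dfr[flag_cols].eq(1).any(axis=1)

# eligible is 1 only when no flag is set
dfr['eligible'] = (~any_flag_set).astype(int)
print(dfr)
```

This avoids writing out 20 clauses joined with |; conditions that are not simple "equals 1" checks would still need their own boolean expression.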
Related
I am currently working on a dataset which has information on total sales for each product id and product sub category. For example, suppose there are three products: 1, 2 and 3. There are three product sub categories, A, B and C, and each product may have records in one, two or all three of them. For instance, I have included a sample table below:
Now, I would like to add a flag column 'Flag' which assigns 1 or 0 to each product id depending on whether that product id contains a record of product sub category 'C'. If it does contain 'C', then assign 1 to the flag column. Otherwise, assign 0. Below is the desired output.
I am currently not able to do this in pandas. Could you help me out? Thank you so much!
Use str.contains and apply. apply calls the lambda function on every value in the ID column.
import pandas as pd
from io import StringIO

txt="""ID,Sub-category,Sales
1,A,100
1,B,101
1,C,102
2,B,100
2,C,101
3,A,102
3,B,100"""
df = pd.read_table(StringIO(txt), sep=',')
#print(df)
list_id=list(df[df['Sub-category'].str.contains('C')]['ID'])
df['flag']=df['ID'].apply(lambda x: 1 if x in list_id else 0 )
print(df)
output:
ID Sub-category Sales flag
0 1 A 100 1
1 1 B 101 1
2 1 C 102 1
3 2 B 100 1
4 2 C 101 1
5 3 A 102 0
6 3 B 100 0
Try this:
Flag = []
for i in dataFrame["Product sub-category"]:
    if i == "C":
        Flag.append(1)
    else:
        Flag.append(0)
So you have a list called "Flag" and can add it to your dataframe.
You can add a temporary column, isC to check for your condition. Then check for the number of isC's inside every "Product Id" group (with .groupby(...).transform).
check = (
    df.assign(isC=lambda df: df["Product Sub-category"] == "C")
    .groupby("Product Id").isC.transform("sum")
)
df["Flag"] = (check > 0).astype(int)
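For reference, a self-contained sketch of the same group-wise idea without the temporary column. The sample values are borrowed from the other answer, and the column names "Product Id" and "Product Sub-category" are assumed from the snippet above, since the original table was not posted as text.

```python
import pandas as pd

# Sample data; column names are assumptions based on the snippet above
df = pd.DataFrame({
    "Product Id": [1, 1, 1, 2, 2, 3, 3],
    "Product Sub-category": ["A", "B", "C", "B", "C", "A", "B"],
    "Sales": [100, 101, 102, 100, 101, 102, 100],
})

# Row-level test: is this row's sub-category "C"?
is_c = df["Product Sub-category"].eq("C")

# Broadcast "any row in this product's group is C" back to every row
df["Flag"] = is_c.groupby(df["Product Id"]).transform("any").astype(int)
print(df)
```

Using "any" instead of "sum" skips the extra (check > 0) comparison, but both give the same flag.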
I have a dataset, shown below.
What I would like to do is three things.
Step 1: AA to CC is an index, however, happy to keep in the dataset for the future purpose.
Step 2: Count 0 value to each row.
Step 3: If 0 is more than 20% in the row, which means more than 2 in this case because DD to MM is 10 columns, remove the row.
So I took a clumsy approach to the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result, shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5, as shown below.
df2 = df.drop([5], axis=0)
print(df2)
This works, even though it is not an elegant way to go.
However, if I import my dataset with header=0, then this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I would like to write a code with loop, count and drop functions, what does the code look like?
You can just continue using boolean indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
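Note that len(df.columns) also counts the label columns (AA to CC in the question). A minimal sketch of restricting the count to the value columns, assuming the values are numeric and using a small made-up frame with hypothetical column names:

```python
import pandas as pd

# Hypothetical frame: two label columns followed by five value columns
df = pd.DataFrame({
    "AA": ["x", "y", "z"],
    "BB": ["p", "q", "r"],
    "DD": [1, 0, 3],
    "EE": [2, 0, 0],
    "FF": [5, 0, 0],
    "GG": [4, 7, 2],
    "HH": [9, 0, 6],
})

value_cols = ["DD", "EE", "FF", "GG", "HH"]

# Share of zeros per row, counted over the value columns only
zero_share = df[value_cols].eq(0).mean(axis=1)

# Keep rows where at most 20 % of the values are zero
filtered = df[zero_share <= 0.20]
print(filtered)
```

If the values were read as strings, eq("0") would be needed instead of eq(0).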
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an excel file. However, constructing a dummy df
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step1, assigning the index can be done using the .set_index() method as per below
df.set_index(['index1','index2','index3'],inplace=True)
Instead of filtering manually, you can use the result you got from df_bool.sum(axis=1) directly when filtering the dataframe, as below:
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
Using that, you can drop those rows. With a 20% threshold you would use:
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's difficult to answer without seeing what the file or dataframe looks like.
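One plausible cause of the header=0 behaviour, assuming the file starts with a row of text labels: with header=None that row is read as data, so every column holds strings and comparing with "0" works; with header=0 the columns are parsed as numbers, and no number equals the string "0". The CSV text below is a hypothetical stand-in for dataset.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for dataset.csv: one header row, then numeric data
csv_text = "DD,EE,FF\n0,1,2\n3,0,0\n"

# header=None treats the header line as data, so every column holds text
as_text = pd.read_csv(io.StringIO(csv_text), header=None)
print(as_text.dtypes)
print((as_text == "0").sum(axis=1))      # string comparison finds the zeros

# header=0 consumes the header line, so the columns are parsed as integers
as_numbers = pd.read_csv(io.StringIO(csv_text), header=0)
print(as_numbers.dtypes)
print((as_numbers == "0").sum(axis=1))   # no integer equals the string "0"
print((as_numbers == 0).sum(axis=1))     # comparing with the number works
```

If this is what happens with the real file, switching the comparison to the number 0 should be enough.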
I have a dataframe (df) which looks like:
0 1 2 3
0 BBG.apples.S BBG.XNGS.bananas.S 0
1 BBG.apples.S BBG.XNGS.oranges.S 0
2 BBG.apples.S BBG.XNGS.pairs.S 0
3 BBG.apples.S BBG.XNGS.mango.S 0
4 BBG.apples.S BBG.XNYS.mango.S 0
5 BBG.XNGS.bananas.S BBG.XNGS.oranges.S 0
6 BBG.XNGS.bananas.S BBG.XNGS.pairs.S 0
7 BBG.XNGS.bananas.S BBG.XNGS.kiwi.S 0
8 BBG.XNGS.oranges.S BBG.XNGS.pairs.S 0
9 BBG.XNGS.oranges.S BBG.XNGS.kiwi.S 0
10 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
11 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
12 BBG.XNGS.peaches.S BBG.XNGS.strawberrys.S 0
13 BBG.XNGS.peaches.S BBG.XNGS.kiwi.S 0
I am trying to update a value (first row, third column) in the dataframe using:
for index, row in df.iterrows():
    status = row[3]
    if int(status) == 0:
        df[index]['3'] = 1
but when I print the dataframe out it remains unchanged.
What am I doing wrong?
Replace your last line by:
df.at[index,'3'] = 1
Obviously as mentioned by others you're better off using a vectorized expression instead of iterating, especially for large dataframes.
You can't modify a data frame by iterating like that. See here.
If you only want to modify the element at [1, 3], you can access it directly:
df.loc[1, '3'] = 1
If you're trying to turn every 0 in column 3 to a 1, try this:
df.loc[df['3'] == 0, '3'] = 1
EDIT: In addition, the docs for iterrows say that you'll often get a copy back, which is why the operation fails.
If you are trying to update the third column for all rows based on the row having a certain value, as shown in your example code, then it would be much easier to use the where method on the column:
df.loc[:,'3'] = df['3'].where(df['3']!=0, 1)
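A small self-contained sketch of the vectorized update, using a made-up three-row frame and assuming the column labels are the strings '1', '2' and '3' (the question mixes 3 and '3', so adjust to whatever df.columns actually contains):

```python
import pandas as pd

# Hypothetical frame shaped like the question's, with string column labels
df = pd.DataFrame({
    '1': ['BBG.apples.S', 'BBG.apples.S', 'BBG.XNGS.bananas.S'],
    '2': ['BBG.XNGS.bananas.S', 'BBG.XNGS.oranges.S', 'BBG.XNGS.kiwi.S'],
    '3': [0, 5, 0],
})

# Boolean mask of the rows to change
is_zero = df['3'] == 0

# One .loc call selects rows and column together, so it writes to df itself
df.loc[is_zero, '3'] = 1
print(df)
```

The point of the single .loc call is that there is no intermediate object that might be a copy, which is what goes wrong with chained indexing such as df[index]['3'].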
Try updating the row using .loc or .iloc (depending on your needs).
For example, in this case:
if int(status) == 0:
    df.loc[index, '3'] = 1
I'm working with a dataset with a large number of predictors, and want to easily test different composite variable groupings by using a control file. For starters, the control file would indicate whether or not to include a variable. Here's an example:
control = pd.DataFrame({'Variable': ['Var1','Var2','Var3'],
'Include': [1,0,1]})
control
Out[48]:
Include Variable
0 1 Var1
1 0 Var2
2 1 Var3
data = pd.DataFrame({'Sample':['a','b','c'],
'Var1': [1,0,0],
'Var2': [0,1,0],
'Var3': [0,0,1]})
data
Out[50]:
Sample Var1 Var2 Var3
0 a 1 0 0
1 b 0 1 0
2 c 0 0 1
So the result after processing should be a new dataframe which looks like data, but drops the Var2 column:
data2
Out[51]:
Sample Var1 Var3
0 a 1 0
1 b 0 0
2 c 0 1
I can get this to work by selectively dropping columns using .iterrows():
data2 = data.copy()
for index, row in control.iterrows():
    if row['Include'] != 1:
        z = row['Variable']
        data2.drop(z, axis=1, inplace=True)
This works, but it seems there should be a way to do this on the whole dataframe at once. Something like:
data2 = data[control['Include'] == 1]
However, this filters out rows based on the 'Include' value, not columns.
Any suggestions appreciated.
Select the necessary headers from the control frame, then index data with the resulting list of column names:
headers = control[control['Include']==1]['Variable']
all_headers = ['Sample'] + list(headers)
data[all_headers]
# Sample Var1 Var3
#0 a 1 0
#1 b 0 0
#2 c 0 1
A side note: Consider using boolean True and False instead of 0s and 1s in the Include column, if possible.
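To illustrate the side note, here is a sketch of the same selection with a boolean Include column. The frames are the ones from the question, with Include converted to True/False by hand.

```python
import pandas as pd

control = pd.DataFrame({'Variable': ['Var1', 'Var2', 'Var3'],
                        'Include': [True, False, True]})
data = pd.DataFrame({'Sample': ['a', 'b', 'c'],
                     'Var1': [1, 0, 0],
                     'Var2': [0, 1, 0],
                     'Var3': [0, 0, 1]})

# With a boolean column, Include can be used directly as the row mask
keep = control.loc[control['Include'], 'Variable']

# Always keep the identifier column, then the selected variables
data2 = data[['Sample', *keep]]
print(data2)
```

The comparison with 1 disappears, and a typo such as 2 in the control file can no longer be silently treated as "exclude".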
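This should be a pretty fast solution using numpy and reconstruction. Note that searchsorted assumes the column names of data are already sorted, as they are in this example.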
# get data columns values which is a numpy array
dcol = data.columns.values
# find the positions where control Include are non-zero
non0 = np.nonzero(control.Include.values)
# slice control.Variable to get names of Variables to include
icld = control.Variable.values[non0]
# search the column names of data for the included Variables
# and the Sample column to get the positions to slice
srch = dcol.searchsorted(np.append('Sample', icld))
# reconstruct the dataframe using the srch slice we've created
pd.DataFrame(data.values[:, srch], data.index, dcol[srch])
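If the column names are not sorted, a position lookup with Index.get_indexer may be a safer variation on the same idea. This is a sketch, not a benchmarked replacement; the column order below is deliberately shuffled, and the names wanted and pos are only for illustration.

```python
import pandas as pd

control = pd.DataFrame({'Variable': ['Var1', 'Var2', 'Var3'],
                        'Include': [1, 0, 1]})

# Same data as in the question, but with the columns in unsorted order
data = pd.DataFrame({'Var3': [0, 0, 1],
                     'Sample': ['a', 'b', 'c'],
                     'Var1': [1, 0, 0],
                     'Var2': [0, 1, 0]})

# Names to keep: the identifier column plus every included variable
wanted = ['Sample', *control.loc[control['Include'] == 1, 'Variable']]

# Positions of those names in data.columns (-1 would mark a missing name)
pos = data.columns.get_indexer(wanted)

# Positional selection keeps each column's original dtype
data2 = data.iloc[:, pos]
print(data2)
```

Selecting with iloc also avoids going through data.values, which turns a mixed frame into a single object array.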
I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace whitespace with np.nan, but that is difficult since the words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide some code for both alternatives? I have not managed to code them myself.
Additionally, I would like to add a column that contains the group means, so that the final dataframe looks like this:
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish this?
Many thanks. Please let me know if you need further information.
IIUC, you could use str.strip and then check if the stripped string is empty.
Then perform the groupby operations: fill the NaNs within each group with ffill and calculate the means with groupby.transform, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].ffill()
df[3] = df.groupby(0)[2].transform('mean')
df
Note: If you must fill NaN values with the first element of each group, you can do this instead:
df.groupby(0)[1].transform(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we drop all NaN values first. Otherwise we would get a TypeError, because the column contains both floats and strings and a float has no len().
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step as it only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted as the groupby object returns a series after fillna which could be assigned back to column 1.
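Putting the steps together, here is a self-contained sketch on the frame from the question. It swaps the dropna/apply step for Series.mask, which is a substitution on my part and not what the code above uses; the result should be the same.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2],
                   [101, ' test1', 3], [101, ' ', 4]])

# Strip outer whitespace, then turn strings that are now empty into NaN
cleaned = df[1].str.strip()
df[1] = cleaned.mask(cleaned == '')

# Forward-fill within each group of column 0
df[1] = df.groupby(0)[1].ffill()

# Per-group mean of column 2, broadcast back to every row
df[3] = df.groupby(0)[2].transform('mean')
print(df)
```

Inner whitespace such as the space in "tes t" is left alone, because str.strip only removes leading and trailing whitespace.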