How to use both & and | operators in a pandas .loc() function? - python

I have a question regarding a complex .loc I'd like to run on my df to mutate a new column. Let's say I have the following df:
x y z
0 1 0 2
1 0 0 0
2 1 1 2
3 0 0 0
4 1 1 2
5 0 0 2
6 1 1 0
7 0 0 2
And I want to use both an & and a | in my .loc. I know very well how to use one or the other, but a certain task I'm trying to complete involves the use of both. Basically, I want to find the rows that meet the following condition:
(x = 1 or y = 1) and z = 2
and make a new column that consists of a 1 if these conditions are met, and a 0 if they aren't. Like so:
x y z test
0 1 0 2 1
1 0 0 0 0
2 1 1 2 1
3 0 0 0 0
4 1 1 2 1
5 0 0 2 0
6 1 1 0 0
7 0 0 2 0
Again, I know how to run a loc consisting of one or the other, but not both & and |. Before posting, I tried the following code out to no avail:
df['test'] = 0
df.loc[((df['x'] == 1) | (df['y'] == 1)) & (df['z'] == 2),'test'] = 1
I thought I was super clever by including an extra set of () around the | condition, but alas it did not work. However, this code does work just fine when I'm using one operator or the other, just not both. I would really appreciate any help. Thanks!
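For reference, here is a minimal sketch that rebuilds the sample frame above and applies the same mask. Because & binds more tightly than |, the extra parentheses around the | condition are needed. On this sample the posted expression does appear to produce the expected column, so the failure may come from something outside the snippet (for example, columns stored as strings rather than integers; that is only a guess).

```python
import pandas as pd

# Sample frame copied from the question
df = pd.DataFrame({
    "x": [1, 0, 1, 0, 1, 0, 1, 0],
    "y": [0, 0, 1, 0, 1, 0, 1, 0],
    "z": [2, 0, 2, 0, 2, 2, 0, 2],
})

# & binds more tightly than |, so the "or" part gets its own parentheses
either_is_one = (df["x"] == 1) | (df["y"] == 1)
mask = either_is_one & (df["z"] == 2)

df["test"] = 0
df.loc[mask, "test"] = 1
# Equivalent one-liner: df["test"] = mask.astype(int)
print(df)
```

Naming the intermediate mask (either_is_one is just an illustrative name) makes it easier to print each part and see which condition is not matching.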

Related

How to Invert column values in pandas - pythonic way?

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'Label':[1,1,1,0,0]})
My objective is to
a) replace 0s as 1s AND 1s as 0s in Label column
I was trying something like the below
cdf.assign(invert_label=cdf.Label.loc[::-1].reset_index(drop=True)) #not work
cdf['invert_label'] = np.where(cdf['Label']==0, '1', '0')
but this doesn't work. It reverses the order
I expect my output to be like as shown below
Id Label
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
You can compare with 0, so 0 gives True and anything else gives False, then convert to integers to map True, False to 1, 0:
print (cdf['Label'].eq(0))
0 False
1 False
2 False
3 True
4 True
Name: Label, dtype: bool
cdf['invert_label'] = cdf['Label'].eq(0).astype(int)
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
Another idea is use mapping:
cdf['invert_label'] = cdf['Label'].map({1:0, 0:1})
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
One maybe obvious answer might be to use 1-value:
cdf['Label2'] = 1-cdf['Label']
output:
Id Label Label2
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
You could map the not function as well -
import operator
cdf['Label'].map(operator.not_).astype('int')
Another way, which I am adding as a separate answer because it is probably not "pythonic" enough (in the sense that it is not very explicit), is to use the bitwise xor:
cdf['Label'] ^ 1
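As a quick side-by-side check, the sketch below runs the approaches from the answers above on the sample frame from the question and collects the results in one frame. The column names of the comparison frame (eq, map, subtract, not_, xor) are made up for illustration.

```python
import operator
import pandas as pd

cdf = pd.DataFrame({"Id": [1, 2, 3, 4, 5], "Label": [1, 1, 1, 0, 0]})

# One column per approach; all of them should hold the inverted label
inverted = pd.DataFrame({
    "eq": cdf["Label"].eq(0).astype(int),
    "map": cdf["Label"].map({1: 0, 0: 1}),
    "subtract": 1 - cdf["Label"],
    "not_": cdf["Label"].map(operator.not_).astype(int),
    "xor": cdf["Label"] ^ 1,
})
print(inverted)
```

Note that these only agree when Label holds nothing but 0 and 1. With other values, map({1: 0, 0: 1}) gives NaN and 1 - value gives negative numbers, while eq(0) still returns 0 or 1.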

pivot long form categorical data by group and dummy code categorical variables

For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.
I would like to generate a pivoted dataframe as follows:
To get the desired result shown above, I have tried basically in various ways to combine groupby/pivot_table functions with a call to get_dummies() function in pandas. Example:
data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()
However, none of my attempts have worked thus far.
Can somebody please help me generate the desired result?
One way of doing this is to get the cross-tabulation, and then force all values greater than 0 to become 1, while keeping all 0's as they are:
TL;DR
out = (
pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
.gt(0)
.astype(int)
.reset_index()
)
Breaking it all down:
Create Data
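import numpy as np
import pandas as pd
# "item" is drawn at random, so the rows printed below will differ between runs
df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})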
print(df)
group1 group2 item
0 a 1 cd
1 a 2 ef
2 a 3 gh
3 a 1 ef
4 b 2 zx
5 b 3 ab
6 b 1 ab
7 b 2 gh
8 c 3 gh
9 c 1 cd
10 c 2 ef
11 c 3 gh
Cross Tabulation
This returns a frequency table indicating how often each of the categories are observed together:
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
print(crosstab)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 2 0
Coerce Counts to Dummy Codes
Since we want to dummy code, and not count the co-occurrence of categories, we can use a quick trick: test whether each value is greater than 0 with gt(0), then turn the resulting booleans into 1/0 with astype(int):
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
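The coercion step can be sketched as follows. To make it reproducible, the frame is typed in by hand from the printed sample above instead of being drawn at random; the name dummies is only illustrative.

```python
import pandas as pd

# Same rows as the printed sample, entered by hand so the result is fixed
df = pd.DataFrame({
    "group1": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "group2": [1, 2, 3] * 4,
    "item": ["cd", "ef", "gh", "ef", "zx", "ab",
             "ab", "gh", "gh", "cd", "ef", "gh"],
})

crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])

# gt(0) yields booleans; astype(int) turns True/False into 1/0
dummies = crosstab.gt(0).astype(int)
print(dummies)
```

The only cell that changes on this sample is (c, 3) / gh, which goes from a count of 2 to a dummy code of 1.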
You can do it in one line via several tricks.
1) Unique index count in conjunction with casting as bool
This works in any case, even when you do not have any column besides those used for index, columns and values. The code counts the unique row indices at each index-column intersection and returns 1 if the count is more than 0, else 0.
df.reset_index().pivot_table(index=['cust_id','purchase_qtr'],
columns='purchase_item',
values='index',
aggfunc='nunique', fill_value=0)\
.astype(bool).astype(int)
2) Checking if any other column is not null
Use this if you have other columns besides index, columns and values and want to use them for intuition, like purchase_date in your case. It is more intuitive because you can "read" it like: check per customer per quarter if the purchase date of the item is not null, and parse the result as an integer.
df.pivot_table(index=['cust_id','purchase_qtr'],
columns='purchase_item',values='purchase_date',
aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
.astype(int)
3) Seeing len of elements falling in each index-column intersection
This takes the len of the elements falling in each index-column intersection and returns 1 if it is more than 0, else 0. Same intuitive approach:
df.pivot_table(index=['cust_id','purchase_qtr'],
columns='purchase_item',
values='purchase_date',
aggfunc=len, fill_value=0)\
.astype(bool).astype(int)
All return desired dataframe:
Note that you should use crosstab only when you don't already have a dataframe, as it calls pivot_table internally.
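Since the original data is not shown here, the following is only a sketch on a small hypothetical purchase table (the values are invented; the column names follow the question). It runs the len-based pivot_table variant next to the crosstab variant so the two can be compared.

```python
import pandas as pd

# Hypothetical purchase data; only the column names come from the question
df = pd.DataFrame({
    "cust_id": [1, 1, 1, 2, 2, 2],
    "purchase_qtr": ["Q1", "Q1", "Q2", "Q1", "Q3", "Q3"],
    "purchase_item": ["apple", "pear", "apple", "pear", "pear", "pear"],
    "purchase_date": ["2016-01-05", "2016-02-11", "2016-04-20",
                      "2016-03-01", "2016-08-15", "2016-09-02"],
})

# Count rows per (customer, quarter, item), then collapse counts to 1/0
via_pivot = (
    df.pivot_table(index=["cust_id", "purchase_qtr"], columns="purchase_item",
                   values="purchase_date", aggfunc=len, fill_value=0)
    .astype(bool).astype(int)
)

# Same idea through crosstab
via_crosstab = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0).astype(int)
)
print(via_pivot)
```

On this toy table both variants give the same 1/0 grid; customer 2 bought "pear" twice in Q3 and still gets a single 1.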

apply function in specific range in row

my input:
index frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
I also have two objects, start_frame and end_frame, which are pandas Series. start_frame looks like this:
index frame
3 3
and for end frame:
index frame
4 5
My problem is applying a function to a specific column (user1) and to specific rows, where the row values come from start_frame and end_frame.
I expect output like this:
frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 0
4 4 1 0
5 5 1 0
I tried this, but it sets the whole column to ones, or gives some other output that is not what I want:
def my_func(x):
x=x+1
return x
df['user1']=df['user1'].between(df['frame']==3, df['frame']==5, inclusive=False).apply(lambda x: add_one(x))
I tried another piece of code:
df['user1']=df.apply(lambda row: 1 if row['frame'] in (3,5) else 0, axis=1)
But it returns 1 only in rows 3 and 5. How can I insert a range here in place of (3,5)?
So I have two questions. First and most important: how do I apply my_func exactly to the rows I need? Second: how do I use my start_frame and end_frame objects instead of manually inserting the values in the function?
Thank you
Updated:
arr_rang = range(3,6)
df['user1']=df.apply(lambda row: 1 if row['frame'] in (arr_rang) else 0, axis=1)
Now it returns 1 in frames 3, 4, 5, which is what I need. But I still don't understand how to use my end_frame and start_frame objects.
Let's concatenate start_frame and end_frame, since they have common columns, then check the values using isin(), and finally change the values using boolean masking and the loc accessor:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
s=pd.concat([start_frame, end_frame])
mask=(df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask,'user1']=df.loc[mask,'user1']+1
#you can also use np.where() in place of loc accessor
output of df:
index frame user1 user2
0 0 0 0 0
1 1 1 0 0
2 2 2 0 0
3 3 3 1 0
4 4 4 1 0
5 5 5 1 0
Update:
use:
mask=df['frame'].between(3,5)
df.loc[mask,'user1']=df.loc[mask,'user1']+1
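To use start_frame and end_frame instead of the hard-coded 3 and 5, one option is to pull the boundary values out of them first. This is a sketch under an assumption: both objects are taken to be one-row frames with a 'frame' column, which may not match the real objects exactly.

```python
import pandas as pd

df = pd.DataFrame({"frame": range(6), "user1": 0, "user2": 0})

# Assumed shapes: one-row frames holding the boundary in a 'frame' column
start_frame = pd.DataFrame({"frame": [3]}, index=[3])
end_frame = pd.DataFrame({"frame": [5]}, index=[4])

def my_func(x):
    return x + 1

# Take the scalar boundaries out of the two objects
start = start_frame["frame"].iloc[0]
end = end_frame["frame"].iloc[0]

# between() includes both endpoints by default
mask = df["frame"].between(start, end)
df.loc[mask, "user1"] = my_func(df.loc[mask, "user1"])
print(df)
```

Here my_func receives the selected slice of the column as a whole Series, so there is no need for a row-by-row apply.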
Did you try
def putHello(row):
row["hello"] = "world"
return row
data.iloc[5:7].apply(putHello,axis=1)
The output would look something like this
See the pandas documentation for iloc and apply.

I have a dataset whose columns are words. How can I add the same columns to each other?

I have a dataset whose features are words, such as "see", "saw", "go", "play", etc. I am trying to do some preprocessing on the columns, like stemming. I want to add columns with the same name or the same meaning together and then drop the column that was added, like below.
For example, I have a dataset like,
see go see
0 0 0 1
1 2 1 3
2 0 1 1
3 0 0 0
and I want to add one "see" to another "see", and drop one of them, like below,
see go
0 1 0
1 5 1
2 1 1
3 0 0
How can I do this?
df.groupby(lambda x:x, axis=1).sum()
go see
0 0 1
1 1 5
2 1 1
3 0 0
You could use stack, groupby and then unstack:
res = df.stack().groupby(level=[0, 1]).sum().unstack()
print(res)
Output
go see
0 0 1
1 1 5
2 1 1
3 0 0
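Recent pandas versions deprecate groupby(..., axis=1), so here is a sketch of a variant that avoids it by transposing, grouping on the (former column) labels, and transposing back. This is a swapped-in alternative, not the method from the answers above, and it assumes the duplicated columns are numeric.

```python
import pandas as pd

# Frame from the question, with the duplicated "see" column
df = pd.DataFrame(
    [[0, 0, 1], [2, 1, 3], [0, 1, 1], [0, 0, 0]],
    columns=["see", "go", "see"],
)

# Column labels become the index after .T, so level=0 groups equal names
res = df.T.groupby(level=0).sum().T
print(res)
```

As with the other answers, the resulting columns come out sorted by name (go, see) rather than in their original order.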

Converting indicator numbers to binary values

I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
The objects(rows) are each given up to 5 codes shown by the five columns in the first df.
I instead want a binary representation of which codes each object has. As shown in the second df.
The functions in pandas or scikit-learn for dummy values take into account which position the code is written in; this is unimportant here.
The attempts I have with my own code have not worked due to my inexperience in python and pandas.
This case is different from others I have seen on stack overflow as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
for row in range(len(df_codes)):
if int(colname) in df_codes.iloc[[row]]:
df_bin[colname][row]=1
This is one of the attempts I made so far.
You can try stack then str.get_dummies
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() does the same
s=df.stack().loc[lambda x : x!=0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
.value_counts()
.unstack(fill_value=0)
[range(1,6)]
.add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
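For a runnable version of the crosstab approach, the sketch below rebuilds the code table from the question. The clip(upper=1) step is an addition that is not in the answer above: it keeps the result binary even if a row were to list the same code twice.

```python
import pandas as pd

# Code table from the question (0 means "no code in this slot")
df = pd.DataFrame(
    [[1, 4, 2, 0, 0], [3, 2, 1, 5, 0], [2, 3, 0, 0, 0]],
    columns=["code1", "code2", "code3", "code4", "code5"],
)

# Long format: one row per (original row, slot) pair
long = df.reset_index().melt("index")

# Count each code per original row, then drop the placeholder 0
counts = pd.crosstab(long["index"], long["value"]).drop(0, axis=1)

# Cap counts at 1 and label the columns has1 ... has5
has = counts.clip(upper=1).add_prefix("has")
print(has)
```

Only codes that actually occur somewhere in the data get a column, so a code that no row uses would be missing from the result.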
