Replacing values greater than 1 in a large pandas dataframe - python

I'm trying to replace all numbers greater than 1 with 1, while keeping the original 1s and 0s untouched, across the entire dataframe with minimal effort. Any support is appreciated!!
My dataframe looks something like this but contains way more columns and rows.
Report No Apple Orange Lemon Grape Pear
One 5 0 2 1 1
Two 1 1 0 3 2
Three 0 0 2 1 3
Four 1 1 3 0 0
Five 4 0 0 1 1
Six 1 3 1 2 0
Desired Output:
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0

You can try the following approaches.
Using a boolean mask:
df.set_index('Report No', inplace=True)
df[df > 1] = 1
df.reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
Or use this if you have some non-numeric columns; no need to use set_index and reset_index. The private _get_numeric_data method is roughly equivalent to the public df.select_dtypes('number'):
val = df._get_numeric_data()
val[val > 1] = 1
df
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
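Since _get_numeric_data is a private pandas method, a sketch of the same idea using only the public API (assuming every numeric column should be capped) would be:
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].mask(df[num_cols] > 1, 1)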
df.mask
df.set_index('Report No', inplace=True)
df.mask(df > 1, 1).reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
df.where
df[df.columns[1:]] = df.iloc[:, 1:].where(df.iloc[:, 1:] <= 1, 1)
Note that DataFrame.where keeps values where the condition is True and replaces the rest, so the condition here is the opposite of the one used with the boolean mask above.
np.select
This can be helpful when dealing with multiple conditions, for example if you want to convert values less than 0 to 0 and values greater than 1 to 1.
df.set_index('Report No', inplace=True)
condlist = [df >= 1, df <= 0]  # you can have more conditions and add choices accordingly
choice = [1, 0]  # len(condlist) should be equal to len(choice)
df.loc[:] = np.select(condlist, choice)
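Values matching none of the conditions fall back to np.select's default argument (0 unless specified); with integer data every value here matches one of the two conditions, so spelling it out is optional:
df.loc[:] = np.select(condlist, choice, default=0)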
As Jan mentioned, you can use df.clip.
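A minimal sketch, assuming 'Report No' is still an ordinary column and all counts are non-negative:
df.set_index('Report No').clip(upper=1).reset_index()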
Not recommended, but you can try this for fun, using df.astype:
df.set_index('Report No', inplace=True)
df.astype('bool').astype('int')
NOTE: This only converts falsy values to False and truthy values to True, i.e. 0 becomes False and anything other than 0, including negative numbers, becomes True.
s = pd.Series([1,-1,0])
s.astype('bool')
0 True
1 True
2 False
dtype: bool
s.astype('bool').astype('int')
0 1
1 1
2 0
dtype: int32
np.sign
This works when the values lie in [0, n], i.e. there are no negative values.
df.loc[:] = np.sign(df)
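For example, on a small Series:
np.sign(pd.Series([0, 1, 5]))
0    0
1    1
2    1
dtype: int64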

Use pandas.DataFrame.clip:
new_df = df.clip(0, 1)
EDIT: To exclude the first column by name (this will edit the DataFrame in-place):
mask = df.columns != "Report No"
df.loc[:, mask] = df.loc[:, mask].clip(0, 1)

A simple way is to go through the columns of the dataframe and transform each one using the where function of numpy (which has to be imported). You pass it the condition, the value for when the condition is satisfied, and the value for when it is not. In your example it would look like this:
import numpy as np

for x in df.keys()[1:]:
    df[x] = np.where(df[x] > 1, 1, df[x])
Note that the loop skips the first key because its values are not numeric.

Related

How to invert column values in pandas - pythonic way?

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 1, 1, 0, 0]})
My objective is to:
a) replace 0s with 1s AND 1s with 0s in the Label column
I was trying something like the below
cdf.assign(invert_label=cdf.Label.loc[::-1].reset_index(drop=True)) # does not work
cdf['invert_label'] = np.where(cdf['Label']==0, '1', '0')
but this doesn't work; it reverses the order.
I expect my output to be as shown below:
Id Label
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
You can compare with 0, so 0 yields True and anything non-zero yields False; converting to integers then maps True and False to 1 and 0:
print (cdf['Label'].eq(0))
0 False
1 False
2 False
3 True
4 True
Name: Label, dtype: bool
cdf['invert_label'] = cdf['Label'].eq(0).astype(int)
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
Another idea is to use mapping:
cdf['invert_label'] = cdf['Label'].map({1:0, 0:1})
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
One perhaps obvious answer is to use 1 - value:
cdf['Label2'] = 1-cdf['Label']
output:
Id Label Label2
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
You could map the not function (operator.not_) as well:
import operator
cdf['Label'].map(operator.not_).astype('int')
Another way, added as a separate answer since it is probably not "pythonic" enough (in the sense that it is not very explicit), is to use the bitwise xor:
cdf['Label'] ^ 1
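which, for the sample Label column, returns:
0    0
1    0
2    0
3    1
4    1
Name: Label, dtype: int64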

pivot long form categorical data by group and dummy code categorical variables

For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code it as 1/0, based on whether or not a customer purchased it in each of the 4 quarters of 2016.
I would like to generate a pivoted dataframe as follows:
To get the desired result shown above, I have tried various ways to combine the groupby/pivot_table functions with a call to the get_dummies() function in pandas. Example:
data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()
However, none of my attempts have worked thus far.
Can somebody please help me generate the desired result?
One way of doing this is to get the cross-tabulation, and then force all values > 1 to become 1, while keeping all 0s as they are:
TL;DR
out = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0)
    .astype(int)
    .reset_index()
)
Breaking it all down:
Create Data
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})
print(df)
group1 group2 item
0 a 1 cd
1 a 2 ef
2 a 3 gh
3 a 1 ef
4 b 2 zx
5 b 3 ab
6 b 1 ab
7 b 2 gh
8 c 3 gh
9 c 1 cd
10 c 2 ef
11 c 3 gh
Cross Tabulation
This returns a frequency table indicating how often the categories are observed together:
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
print(crosstab)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 2 0
Coerce Counts to Dummy Codes
Since we want to dummy code, and not count the co-occurrence of categories, we can use a quick trick: force all values greater than 0 to become 1 with .gt(0) followed by .astype(int).
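Applied to the crosstab from above:
dummies = crosstab.gt(0).astype(int)
print(dummies)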
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
You can do this in one line via several tricks.
1) Unique index count in conjunction with casting as bool
This works even when you have no columns besides the index, columns and values. The idea is to count the unique indices in each index-column intersection and return 1 if the count is more than 0, else 0.
df.reset_index().pivot_table(index=['cust_id', 'purchase_qtr'],
                             columns='purchase_item',
                             values='index',
                             aggfunc='nunique', fill_value=0)\
  .astype(bool).astype(int)
2) Checking if any other column is not null
If you have other columns besides the index, columns and values AND want to use them for intuition, like purchase_date in your case, this is more intuitive because you can "read" it: check per customer per quarter whether the purchase date of the item is not null, and parse the result as an integer.
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item', values='purchase_date',
               aggfunc=lambda x: all(pd.notna(x)), fill_value=0)\
  .astype(int)
3) Counting the elements in each index-column intersection
This takes the len of the elements falling in each index-column intersection and returns 1 if it is more than 0, else 0. The same intuitive approach:
df.pivot_table(index=['cust_id', 'purchase_qtr'],
               columns='purchase_item',
               values='purchase_date',
               aggfunc=len, fill_value=0)\
  .astype(bool).astype(int)
All of these return the desired dataframe.
Note that you should only use crosstab when you don't already have a dataframe, as it calls pivot_table internally.
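For completeness, a sketch of the get_dummies route the question attempted, written against the demo data above (the group1/group2/item names come from that example, not from the original data):
out = (
    pd.get_dummies(df.set_index(["group1", "group2"])["item"])
      .groupby(level=[0, 1])
      .max()          # collapse duplicate (group1, group2) rows into presence flags
      .astype(int)
      .reset_index()
)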

Combine values in pandas dataframe to string

I have a dataframe similar to this:
Male Over18 Single
0 0 0 1
1 1 1 1
2 0 0 1
I would like an extra column containing a comma-separated string of the column names where the value is 1:
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male, Over18, Single
2 0 0 1 Single
Hope there is someone out there who can help :)
One idiomatic pandas way is to perform a dot product with the column headers; with 0/1 data, each 1 contributes its column name (plus a trailing comma) to the row's sum:
df['CombinedString'] = df.dot(df.columns+',').str.rstrip(',')
df
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
Another method would be to use .stack() and groupby.agg()
df['CombinedString'] = df.mask(df.eq(0)).stack().reset_index(1)\
                         .groupby(level=0)['level_1'].agg(','.join)
print(df)
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
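A more explicit, if slower, row-wise sketch; the column list is assumed to match the example:
cols = ['Male', 'Over18', 'Single']
df['CombinedString'] = df[cols].apply(lambda row: ','.join(row.index[row == 1]), axis=1)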

Converting indicator numbers to binary values

I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 0 0 1
2 0 1 1 0 0
The objects (rows) are each given up to 5 codes, shown by the five columns in the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The functions in pandas and scikit-learn for dummy values take into account which position a code is written in, which is unimportant here.
The attempts I have made with my own code have not worked due to my inexperience in python and pandas.
This case is different from others I have seen on stack overflow as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row] = 1
This is one of the attempts I made so far.
You can try stack then str.get_dummies:
s = df.stack().loc[lambda x: x != 0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
   .value_counts()
   .unstack(fill_value=0)
   [range(1, 6)]
   .add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
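A plain-NumPy sketch of the same transformation, assuming codes run from 1 to 5 and 0 means "no code":
import numpy as np
import pandas as pd

codes = np.asarray(df)                                # shape (n_rows, 5), values 0..5
out = np.zeros((len(df), 5), dtype=int)               # one column per possible code
rows = np.repeat(np.arange(len(df)), codes.shape[1])  # row index of each flattened entry
flat = codes.ravel()
mask = flat > 0                                       # skip the 0 placeholders
out[rows[mask], flat[mask] - 1] = 1
df_bin = pd.DataFrame(out, columns=[f'has{i}' for i in range(1, 6)])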

Drop Rows of an id after a particular column value in Pandas

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
i.e.
1 0 --> gets removed since this row appears after id 1 already had a status of 1
How can I implement this efficiently, since I have a very large (200 GB+) dataset?
Thanks for your help.
Here's an idea:
You can create a dict with the first index where the status is 1 for each Id (drop_duplicates keeps the first such row per Id):
d = df.loc[df["Status"] == 1].drop_duplicates("Id")
d = dict(zip(d["Id"], d.index))
Then you create a column holding that first status=1 index for each Id:
df["first"] = df["Id"].map(d)
Finally you keep every row whose index is at most the value in the first column; rows whose Id never reaches status 1 have NaN there and must be kept as well:
df = df.loc[(df.index <= df["first"]) | df["first"].isna()]
EDIT: Revisiting this question a month later, there is actually a much simpler way with groupby and cumsum: group by Id, take the cumulative sum of Status excluding the current row (so the first 1 itself survives), and keep the rows where that running total is still 0:
df[(df.groupby('Id')['Status'].cumsum() - df['Status']) < 1]
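On the sample data, the running total of earlier statuses per Id looks like this, so only the stray index-6 row is dropped:
prior = df.groupby('Id')['Status'].cumsum() - df['Status']
print(prior.tolist())  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]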
The best way I have found is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(group):
    # each group is the sub-DataFrame for one Id
    indexless = group.reset_index(drop=True)
    ones = indexless[indexless['Status'] == 1]
    if len(ones) > 0:
        return indexless.iloc[:ones.index[0] + 1]
    else:
        return indexless

df.groupby('Id').apply(remove).reset_index(drop=True)
Output:
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 3 0
8 3 0
Use groupby with cumsum to find the rows from the first status of 1 onward:
res = df.groupby('Id', group_keys=False).apply(lambda x: x[x.Status.cumsum() > 0])
res
Id Status
4 1 1
6 1 0
Then exclude the indices where Status == 0 in that result:
not_select_id = res[res.Status==0].index
df[~df.index.isin(not_select_id)]
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
