Combine values in pandas dataframe to string - python

I have a dataframe similar to this:
Male Over18 Single
0 0 0 1
1 1 1 1
2 0 0 1
I would like an extra column containing a comma-separated string of the column names where the value is 1:
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male, Over18, Single
2 0 0 1 Single
Hope there is someone out there who can help :)

One pandaic way is to perform a pandas dot product with the column headers:
df['CombinedString'] = df.dot(df.columns+',').str.rstrip(',')
df
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single

Another method would be to use .stack() and groupby.agg()
df['CombinedString'] = df.mask(df.eq(0)).stack().reset_index(1)\
.groupby(level=0)['level_1'].agg(','.join)
print(df)
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
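Both answers above join the names with a bare comma, while the question asks for ", " as the separator. A minimal sketch of one way to get that, assuming the three indicator columns from the question (the names flags and CombinedString's construction here are illustrative, not taken from either answer):

```python
import pandas as pd

df = pd.DataFrame({"Male": [0, 1, 0], "Over18": [0, 1, 0], "Single": [1, 1, 1]})

# Boolean frame: True where the indicator is set.
flags = df[["Male", "Over18", "Single"]].eq(1)

# For each row, pick the column names under the True cells and join them.
df["CombinedString"] = [", ".join(flags.columns[row]) for row in flags.to_numpy()]
print(df)
```

Joining the selected names directly avoids having to strip a trailing separator afterwards.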

Related

How to multiply every column in one dataframe with all columns in other dataframe

I have two dataframes X_dummy and X_var, where X_dummy contains dummies and looks like this:
dummy1 dummy2
1 0
0 1
1 0
The X_var dataframe contains variables and looks like this:
var1 var2
4 2
10 5
1 1
Now I want to create a dataframe containing the cellwise product of every column from X_dummy with the complete X_var dataframe. Hence, my resulting dataframe should look like, X_result:
var1dummy1 var2dummy1 var1dummy2 var2dummy2
4 2 0 0
0 0 10 5
1 1 0 0
Does anyone know how to do this without using multiple for loops?
You could use something like NumPy broadcasting (here df1 is X_dummy and df2 is X_var):
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:,None]).T)
new
Out[161]:
0 1 2 3
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
# Optionally name the columns:
# new.columns = pd.MultiIndex.from_product([df1.columns, df2.columns]).map('_'.join)
Try:
pd.concat([(df1[i]*df2[j]).rename(f'{i}{j}') for i in df1 for j in df2], axis=1)
Output:
dummy1var1 dummy1var2 dummy2var1 dummy2var2
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
You can definitely do it with one loop:
dummies = X_dummy.astype(bool)
pd.concat([X_var.loc[dummies[c]] for c in dummies], axis=1).fillna(0).astype(int)
# var1 var2 var1 var2
#0 4 2 0 0
#1 0 0 10 5
#2 1 1 0 0
Note that because one of your dataframes contains dummies, you do not need multiplication at all.
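For reference, here is a self-contained sketch of the broadcasting idea that also produces the column names asked for in the question. The data is copied from the question; the helper names product and columns are only illustrative.

```python
import numpy as np
import pandas as pd

X_dummy = pd.DataFrame({"dummy1": [1, 0, 1], "dummy2": [0, 1, 0]})
X_var = pd.DataFrame({"var1": [4, 10, 1], "var2": [2, 5, 1]})

# (rows, n_dummy, 1) * (rows, 1, n_var) -> (rows, n_dummy, n_var)
product = X_dummy.to_numpy()[:, :, None] * X_var.to_numpy()[:, None, :]

# Flattening the last two axes keeps the dummy index as the slower one,
# so the order is var1dummy1, var2dummy1, var1dummy2, var2dummy2.
columns = [f"{v}{d}" for d in X_dummy.columns for v in X_var.columns]
X_result = pd.DataFrame(
    product.reshape(len(X_var), -1), columns=columns, index=X_var.index
)
print(X_result)
```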

I have a dataset whose columns are words. How can I add the same columns to each other?

I have a dataset whose features are words, such as "see", "saw", "go", "play", etc. I am doing some preprocessing on the columns, such as stemming. I want to add together columns that have the same name (or the same meaning) and then drop the redundant column.
For example, I have a dataset like this:
see go see
0 0 0 1
1 2 1 3
2 0 1 1
3 0 0 0
and I want to add one "see" to the other "see" and drop one of them, like this:
see go
0 1 0
1 5 1
2 1 1
3 0 0
How can I do this?
df.groupby(lambda x:x, axis=1).sum()
go see
0 0 1
1 1 5
2 1 1
3 0 0
You could use stack, groupby and then unstack:
res = df.stack().groupby(level=[0, 1]).sum().unstack()
print(res)
Output
go see
0 0 1
1 1 5
2 1 1
3 0 0

Converting indicator numbers to binary values

I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
The objects (rows) are each given up to 5 codes, shown by the five columns in the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The dummy-value functions in pandas and scikit-learn take into account which position a code is written in, which is unimportant here.
The attempts I have made with my own code have not worked due to my inexperience with Python and pandas.
This case is different from others I have seen on Stack Overflow, as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row]=1
This is one of the attempts I made so far.
You can try stack, then str.get_dummies (sum(level=0) was removed in pandas 2.0, so group on the level instead):
s = df.stack().loc[lambda x: x != 0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
.value_counts()
.unstack(fill_value=0)
[range(1,6)]
.add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
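If you prefer to stay in NumPy, the same result can be sketched with a broadcast comparison. This assumes the codes are the integers 1 to 5 and that 0 means "no code"; df_codes and df_bin are the names used in the asker's attempt.

```python
import numpy as np
import pandas as pd

df_codes = pd.DataFrame(
    [[1, 4, 2, 0, 0], [3, 2, 1, 5, 0], [2, 3, 0, 0, 0]],
    columns=[f"code{i}" for i in range(1, 6)],
)

codes = np.arange(1, 6)  # assumed set of valid codes

# Compare every cell with every code: shape (rows, positions, codes).
# A row "has" a code if it appears in any position.
has = (df_codes.to_numpy()[:, :, None] == codes).any(axis=1)

df_bin = pd.DataFrame(
    has.astype(int), index=df_codes.index, columns=[f"has{c}" for c in codes]
)
print(df_bin)
```

Because the comparison ignores which column a code sits in, the position of a code has no effect on the result.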

Replacing values greater 1 in a large pandas dataframe

I'm trying to replace all numbers greater than 1 with 1 while keeping the original 1s and 0s untouched in the entire dataframe with the minimal effort. Any support is appreciated!!
My dataframe looks something like this but contains way more columns and rows.
Report No Apple Orange Lemon Grape Pear
One 5 0 2 1 1
Two 1 1 0 3 2
Three 0 0 2 1 3
Four 1 1 3 0 0
Five 4 0 0 1 1
Six 1 3 1 2 0
Desired Output:
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
You can try this.
Using boolean mask
df.set_index('Report No',inplace=True)
df[df > 1] = 1
df.reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
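Or use this if you have some non-numeric columns; there is no need for set_index and reset_index. _get_numeric_data() is a private method that selects the same columns as df.select_dtypes('number'), and whether assigning into val also updates df can depend on the pandas version.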
val = df._get_numeric_data()
val[val > 1] = 1
df
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
df.mask
df.set_index('Report No',inplace=True)
df.mask(df>1,1).reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
np.where
df[df.columns[1:]] = np.where(df.iloc[:, 1:] > 1, 1, df.iloc[:, 1:])
np.select
This can be helpful when dealing with multiple conditions, for example if you want to convert values of 0 or less to 0 and values of 1 or more to 1.
df.set_index('Report No', inplace=True)
condlist = [df >= 1, df <= 0] #you can have more conditions and add choices accordingly.
choice = [1, 0] #len(condlist) should be equal to len(choice).
df.loc[:] = np.select(condlist, choice)
As Jan mentioned, use df.clip.
Not recommended, but you can try this for fun: using df.astype.
df.set_index('Report No',inplace=True)
df.astype('bool').astype('int')
NOTE: This only converts falsy values to False and truthy values to True, i.e. 0 becomes False and anything other than 0 becomes True, even negative numbers.
s = pd.Series([1,-1,0])
s.astype('bool')
0 True
1 True
2 False
dtype: bool
s.astype('bool').astype('int')
0 1
1 1
2 0
dtype: int32
np.sign
This works when all values lie in [0, n], i.e. there are no negative values.
df.loc[:] = np.sign(df)
Use pandas.DataFrame.clip:
new_df = df.clip(0, 1)
EDIT: To exclude the first column by name (this will edit the DataFrame in-place)
mask = df.columns != "Report No"
df.loc[:, mask] = df.loc[:, mask].clip(0, 1)
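As an end-to-end sketch of the clip approach on the data from the question: this version selects the numeric columns by dtype instead of by name, which is an assumption about how you want to exclude "Report No".

```python
import pandas as pd

df = pd.DataFrame({
    "Report No": ["One", "Two", "Three", "Four", "Five", "Six"],
    "Apple": [5, 1, 0, 1, 4, 1],
    "Orange": [0, 1, 0, 1, 0, 3],
    "Lemon": [2, 0, 2, 3, 0, 1],
    "Grape": [1, 3, 1, 0, 1, 2],
    "Pear": [1, 2, 3, 0, 1, 0],
})

# Cap only the numeric columns at 1; existing 0s and 1s are left untouched.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].clip(upper=1)
print(df)
```

Using clip(upper=1) rather than clip(0, 1) leaves any negative values as they are, which matches the question's request to change only numbers greater than 1.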
A simple way is to loop over the keys (column names) of the dataframe and update each column with NumPy's where function (NumPy has to be imported). We pass that function the condition as its first argument, followed by the values to use when the condition is satisfied and when it is not. In your example it would look like this:
for x in df.keys()[1:]:
    df[x] = np.where(df[x] > 1, 1, df[x])
Note that the loop skips the first key because its values are not integers.

Returning the Position of a Pandas Dataframe Entry when first Value in the Row equals 1

As described above, I want to get the position index of a dataframe entry based on a condition. It should look something like this:
import pandas as pd
a = [[1,0,0,1],[0,1,0,1],[0,0,0,1]]
df = pd.DataFrame(a)
df
Out[61]:
0 1 2 3
0 1 0 0 1
1 0 1 0 1
2 0 0 0 1
I want to create a new column that returns the position of the first 1 in the corresponding row. The end result should look like this:
Out[62]:
0 1 2 3 New
0 1 0 0 1 0
1 0 1 0 1 1
2 0 0 0 1 3
This is my first question on Stack Overflow, so sorry if I made any formal mistakes while asking it.
Any help is appreciated.
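One possible approach, offered only as a sketch: build a boolean mask of the 1s and take the position of the first True in each row with NumPy's argmax. The -1 used for rows that contain no 1 at all is an assumption, since the question does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

a = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
df = pd.DataFrame(a)

# True where the entry equals 1.
mask = df.eq(1).to_numpy()

# argmax returns the position of the first True in each row
# (and 0 for an all-False row, hence the extra any() check).
first = mask.argmax(axis=1)
df["New"] = np.where(mask.any(axis=1), first, -1)  # -1 is a hypothetical "not found" marker
print(df)
```

This returns integer positions, so it also works when the column labels are not 0, 1, 2, and so on.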
