I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Each object (row) is given up to 5 codes, shown by the five columns in the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The functions in pandas or scikit-learn for dummy values take into account which position the code is written in; this is unimportant here.
My attempts with my own code have not worked, due to my inexperience with Python and pandas.
This case is different from others I have seen on Stack Overflow, as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row]=1
This is one of the attempts I made so far.
You can try stack followed by str.get_dummies (sum(level=0) was removed in pandas 2.0, so group on the index level instead):
s=df.stack().loc[lambda x : x!=0].astype(str).str.get_dummies().groupby(level=0).sum().add_prefix('Has')
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
   .value_counts()
   .unstack(fill_value=0)
   [range(1,6)]
   .add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
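Note that value_counts and crosstab count occurrences, so a code that appeared twice in one row would give 2 rather than 1. A minimal self-contained sketch of the crosstab approach that guards against this, assuming the sample data from the question and codes that range from 1 to 5 (the names long, counts and has are made up for illustration):

```python
import pandas as pd

# Sample data from the question; 0 means "no code in this slot".
df = pd.DataFrame(
    [[1, 4, 2, 0, 0], [3, 2, 1, 5, 0], [2, 3, 0, 0, 0]],
    columns=["code1", "code2", "code3", "code4", "code5"],
)

# Long format: one row per (original row, code slot).
long = df.reset_index().melt("index")

# Count how often each code value occurs per original row.
counts = pd.crosstab(long["index"], long["value"])

has = (
    counts.drop(columns=0, errors="ignore")                # 0 is padding, not a code
    .reindex(columns=list(range(1, 6)), fill_value=0)      # keep codes that never occur
    .clip(upper=1)                                         # counts -> 0/1 flags
    .add_prefix("has")
)
print(has)
```

The reindex step is there so that a code missing from the whole dataset still gets an all-zero column instead of being dropped.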
I have a dataframe similar to this:
Male Over18 Single
0 0 0 1
1 1 1 1
2 0 0 1
I would like an extra column containing a comma-separated string of the column names where the value is 1:
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male, Over18, Single
2 0 0 1 Single
Hope there is someone out there who can help :)
One pandaic way is to perform a pandas dot product with the column headers:
df['CombinedString'] = df.dot(df.columns+',').str.rstrip(',')
df
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
Another method would be to use .stack() and groupby.agg()
df['CombinedString'] = df.mask(df.eq(0)).stack().reset_index(1)\
.groupby(level=0)['level_1'].agg(','.join)
print(df)
Male Over18 Single CombinedString
0 0 0 1 Single
1 1 1 1 Male,Over18,Single
2 0 0 1 Single
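Both answers above join with "," only, whereas the requested output uses ", " (comma plus space). A minimal row-wise sketch that produces that exact separator, assuming the three flag columns from the question (flags is a made-up helper name). It is slower than the dot-product approach on large frames, but easy to read:

```python
import pandas as pd

df = pd.DataFrame({"Male": [0, 1, 0], "Over18": [0, 1, 0], "Single": [1, 1, 1]})

# Boolean mask of the flag columns only, so the new column is never included.
flags = df[["Male", "Over18", "Single"]].eq(1)

# For each row, keep the labels where the flag is True and join them.
df["CombinedString"] = flags.apply(lambda row: ", ".join(row[row].index), axis=1)
print(df)
```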
I have two dataframes X_dummy and X_var, where X_dummy contains dummies and looks like this:
dummy1 dummy2
1 0
0 1
1 0
The X_var dataframe contains variables and looks like this:
var1 var2
4 2
10 5
1 1
Now I want to create a dataframe containing the cellwise product of every column from X_dummy with the complete X_var dataframe. Hence, my resulting dataframe X_result should look like this:
var1dummy1 var2dummy1 var1dummy2 var2dummy2
4 2 0 0
0 0 10 5
1 1 0 0
Does anyone know how to do this without using multiple for loops?
Something like NumPy broadcasting (here df1 is X_dummy and df2 is X_var):
new = pd.DataFrame(np.concatenate(df2.T.values * df1.T.values[:,None]).T)
new
Out[161]:
0 1 2 3
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
##new.columns = pd.MultiIndex.from_product([df1.columns,df2.columns]).map('_'.join)
Try:
pd.concat([(df1[i]*df2[j]).rename(f'{i}{j}') for i in df1 for j in df2], axis=1)
Output:
dummy1var1 dummy1var2 dummy2var1 dummy2var2
0 4 2 0 0
1 0 0 10 5
2 1 1 0 0
You can definitely do it with one loop:
dummies = X_dummy.astype(bool)
pd.concat([X_var.loc[dummies[c]] for c in dummies], axis=1).fillna(0).astype(int)
# var1 var2 var1 var2
#0 4 2 0 0
#1 0 0 10 5
#2 1 1 0 0
Note that because one of your dataframes contains dummies, you do not need multiplication at all.
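For reference, the broadcasting idea can be spelled out step by step. This is only a sketch, assuming the two frames from the question share the same row order; prod and cols are made-up names, and the column labels follow the var1dummy1 pattern asked for:

```python
import pandas as pd

X_dummy = pd.DataFrame({"dummy1": [1, 0, 1], "dummy2": [0, 1, 0]})
X_var = pd.DataFrame({"var1": [4, 10, 1], "var2": [2, 5, 1]})

# Shapes: (n, n_dummy, 1) * (n, 1, n_var) -> (n, n_dummy, n_var)
prod = X_dummy.to_numpy()[:, :, None] * X_var.to_numpy()[:, None, :]

# Flatten the last two axes; the dummy index varies slowest, the var index fastest.
cols = [f"{v}{d}" for d in X_dummy.columns for v in X_var.columns]
X_result = pd.DataFrame(
    prod.reshape(len(X_var), -1), index=X_var.index, columns=cols
)
print(X_result)
```

No Python-level loop over the data is involved; the only comprehension runs over column names.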
I have a dataset whose features are words, such as "see", "saw", "go", "play" etc. I am trying to do some preprocessing on the columns, like stemming. I want to add columns that are the same (or have the same meaning) to each other and then drop the column that was added, as shown below.
For example, I have a dataset like,
see go see
0 0 0 1
1 2 1 3
2 0 1 1
3 0 0 0
and I want to add one "see" to another "see", and drop one of them, like below,
see go
0 1 0
1 5 1
2 1 1
3 0 0
How can I do this?
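Group the columns by name and sum them (transposing avoids groupby(axis=1), which is deprecated in recent pandas versions):
df.T.groupby(level=0).sum().T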
go see
0 0 1
1 1 5
2 1 1
3 0 0
You could use stack, groupby and then unstack:
res = df.stack().groupby(level=[0, 1]).sum().unstack()
print(res)
Output
go see
0 0 1
1 1 5
2 1 1
3 0 0
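If the columns to merge do not yet share a name (for example "saw" and "see" before stemming), one option is to rename them to a common label first and then sum. A minimal sketch, where the "saw" column and the stem_map dictionary are hypothetical stand-ins for whatever the stemmer produces:

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1], [2, 1, 3], [0, 1, 1], [0, 0, 0]],
    columns=["see", "go", "saw"],
)

# Hypothetical mapping from each word to its stem; unmapped words are kept as-is.
stem_map = {"saw": "see"}

merged = (
    df.rename(columns=stem_map)      # columns become: see, go, see
    .T.groupby(level=0, sort=False)  # group the (transposed) duplicate labels
    .sum()
    .T                               # back to the original orientation
)
print(merged)
```

Passing sort=False keeps the columns in order of first appearance (see, go) instead of sorting them alphabetically.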
As described above, I want to get the positional index of a DataFrame entry based on a condition. The data looks something like this:
import pandas as pd
a = [[1,0,0,1],[0,1,0,1],[0,0,0,1]]
df = pd.DataFrame(a)
df
Out[61]:
0 1 2 3
0 1 0 0 1
1 0 1 0 1
2 0 0 0 1
I want to create a new column that returns the position of the first 1 in the corresponding row. So the end result should look like this:
Out[62]:
0 1 2 3 New
0 1 0 0 1 0
1 0 1 0 1 1
2 0 0 0 1 3
This is my first question on Stack Overflow, so sorry if I made any formal mistakes while asking it.
Any help is appreciated.
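One possible way to get this, sketched with NumPy's argmax on a boolean mask. The -1 fallback for rows without any 1 is an assumption added here, since the question does not say what should happen in that case:

```python
import numpy as np
import pandas as pd

a = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
df = pd.DataFrame(a)

# True where the entry equals 1.
mask = df.eq(1).to_numpy()

# argmax returns the position of the first True in each row
# (and 0 for rows that are all False, hence the explicit check).
first = mask.argmax(axis=1)
df["New"] = np.where(mask.any(axis=1), first, -1)
print(df)
```

If column labels rather than positions are wanted, df.eq(1).idxmax(axis=1) on the original columns returns the label of the first match instead.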
OK, I admit I had trouble formulating a good title for this, so I will try to give an example.
This is my sample dataframe:
df = pd.DataFrame([
(1,"a","good"),
(1,"a","good"),
(1,"b","good"),
(1,"c","bad"),
(2,"a","good"),
(2,"b","bad"),
(3,"a","none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg(id="count")
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although what I will need later on is that e.g. the id would be repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (with eval added as the last grouping key) or value_counts, followed by unstack:
df1 = (df.groupby(["id", "type", 'eval'])
         .size()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = (df.groupby(["id", "type"])['eval']
         .value_counts()
         .unstack(fill_value=0)
         .rename_axis(None, axis=1))
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But writing this straight to Excel keeps the MultiIndex, so the repeated id values are not filled in on every row:
df1.to_excel('file.xlsx')
So call reset_index before writing:
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forgot the id count column. Its name would duplicate the id index level, so it needs a different name such as id1:
df1.insert(0, 'id1', df1.sum(axis=1))
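Putting the pieces together, a self-contained sketch of the whole pipeline on the sample data from the question. The names counts and flat are made up for illustration, and the Excel export is left as a comment because it needs an Excel writer engine such as openpyxl to be installed:

```python
import pandas as pd

df = pd.DataFrame(
    [
        (1, "a", "good"),
        (1, "a", "good"),
        (1, "b", "good"),
        (1, "c", "bad"),
        (2, "a", "good"),
        (2, "b", "bad"),
        (3, "a", "none"),
    ],
    columns=["id", "type", "eval"],
)

# One column per eval value, counting rows per (id, type).
counts = (
    df.groupby(["id", "type", "eval"])
    .size()
    .unstack(fill_value=0)
    .rename_axis(None, axis=1)
)

# Total number of rows per (id, type); must be added before reset_index.
counts.insert(0, "id1", counts.sum(axis=1))

# Flatten the MultiIndex so id and type are repeated on every row.
flat = counts.reset_index()
print(flat)

# flat.to_excel("file.xlsx", index=False)  # optional export
```

Inserting the total before reset_index matters, because afterwards sum(axis=1) would also pick up the id column.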