I am new to pandas. I am trying to expand the items of a list column into separate columns of the dataframe. I have been struggling for hours but could not do it.
MWE
import numpy as np
import pandas as pd
df = pd.DataFrame({
'X': [10,20,30,40,50],
'Y': [list('abd'), list(), list('ab'),list('abefc'),list('e')]
})
print(df)
X Y
0 10 [a, b, d]
1 20 []
2 30 [a, b]
3 40 [a, b, e, f, c]
4 50 [e]
How can I get a result like this:
X a b c d e
0 10 1 1 0 1 0
1 20 0 0 0 0 0
2 30 1 1 0 0 0
3 40 1 1 1 0 1
4 50 0 0 0 0 1
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df[mlb.classes_] = mlb.fit_transform(df['Y'])
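Put together with the question's frame, this approach runs end-to-end like so (a self-contained sketch; scikit-learn must be installed):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'X': [10, 20, 30, 40, 50],
    'Y': [list('abd'), list(), list('ab'), list('abefc'), list('e')],
})

mlb = MultiLabelBinarizer()
# The right-hand side is evaluated first, so mlb.classes_ is already
# fitted when it is used as the column labels
df[mlb.classes_] = mlb.fit_transform(df['Y'])
df = df.drop(columns='Y')
```

Note that this keeps the extra `f` column, since `f` appears in one of the lists; drop it afterwards if you only want `a` through `e`.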
Pandas alternative
df.join(df['Y'].explode().str.get_dummies().groupby(level=0).max())
X Y a b c d e f
0 10 [a, b, d] 1 1 0 1 0 0
1 20 [] 0 0 0 0 0 0
2 30 [a, b] 1 1 0 0 0 0
3 40 [a, b, e, f, c] 1 1 1 0 1 1
4 50 [e] 0 0 0 0 1 0
You can try pandas.Series.str.get_dummies
out = df[['X']].join(df['Y'].apply(','.join).str.get_dummies(sep=','))
print(out)
X a b c d e f
0 10 1 1 0 1 0 0
1 20 0 0 0 0 0 0
2 30 1 1 0 0 0 0
3 40 1 1 1 0 1 1
4 50 0 0 0 0 1 0
My straightforward solution:
Check whether the current column is in your Y list, otherwise add a 0:
for col in ['a', 'b', 'c', 'd', 'e']:
    df[col] = pd.Series([1 if col in df["Y"][x] else 0 for x in range(len(df.index))])
df = df.drop('Y', axis=1)
print(df)
Edit: Okay, the groupby is cleaner
Related
I have a dataframe with 45 columns. Most are string values, so I'm trying to use pd.get_dummies to turn the strings into numbers with df = pd.get_dummies(df, drop_first=True); however, the columns without string values are removed from my dataframe. I don't want to have to type out 40 or so column names. How can I iterate over every column, ignoring ones without strings, and still have them remain after the get_dummies call?
Columns can be filtered by dtype to determine programmatically which columns to pass to get_dummies, namely only the object or category columns:
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
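An equivalent and arguably more readable filter uses select_dtypes instead of comparing dtypes by hand (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': pd.Series(['x', 'y'], dtype='category'),
    'C': [0.5, 1.5],
    'D': ['u', 'v'],
})

# select_dtypes returns only the object/category columns,
# whose names are then passed to get_dummies
cat_cols = df.select_dtypes(include=['object', 'category']).columns
new_df = pd.get_dummies(df, columns=cat_cols)
```

The numeric columns A and C pass through untouched, exactly as in the dtype-comparison version.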
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
n = 10
df = pd.DataFrame({
'A': np.random.randint(1, 100, n),
'B': pd.Series(np.random.choice(list("ABCD"), n), dtype='category'),
'C': np.random.random(n) * 100,
'D': np.random.choice(list("EFGH"), n)
})
new_df = pd.get_dummies(
df,
columns=df.columns[(df.dtypes == 'object') | (df.dtypes == 'category')]
)
df:
A B C D
0 79 A 76.437261 G
1 62 D 11.090076 E
2 17 B 20.415475 E
3 74 B 11.909536 E
4 9 D 87.790307 G
5 63 A 52.367529 E
6 28 D 49.213600 F
7 31 A 73.187110 H
8 81 B 1.458075 H
9 8 D 9.336303 H
df.dtypes:
A int32
B category
C float64
D object
dtype: object
new_df:
A C B_A B_B B_D D_E D_F D_G D_H
0 79 76.437261 1 0 0 0 0 1 0
1 62 11.090076 0 0 1 1 0 0 0
2 17 20.415475 0 1 0 1 0 0 0
3 74 11.909536 0 1 0 1 0 0 0
4 9 87.790307 0 0 1 0 0 1 0
5 63 52.367529 1 0 0 1 0 0 0
6 28 49.213600 0 0 1 0 1 0 0
7 31 73.187110 1 0 0 0 0 0 1
8 81 1.458075 0 1 0 0 0 0 1
9 8 9.336303 0 0 1 0 0 0 1
Suppose, I have the following dataframe:
A B C D E F
1 1 1 0 0 0
0 0 0 0 0 0
1 1 0.9 1 0 0
0 0 0 0 -1.95 0
0 0 0 0 2.75 0
1 1 1 1 1 1
I want to select rows whose values in columns C, D, E and F consist only of zeros and ones, with at least one of each. For this example, the expected output is
A B C D E F
1 1 1 0 0 0
How can I do this with considering a range of columns in pandas?
Thanks in advance.
Let's try boolean indexing with loc to filter the rows:
c = ['C', 'D', 'E', 'F']
df.loc[df[c].isin([0, 1]).all(1) & df[c].eq(0).any(1) & df[c].eq(1).any(1)]
Result:
A B C D E F
0 1 1 1.0 0 0.0 0
Try apply and loc:
print(df.loc[df.apply(lambda x: sorted(x.drop_duplicates().tolist()) == [0, 1], axis=1)])
Output:
A B C D E F
0 1 1 1.0 0 0.0 0
I would like to 'OR' the rows together,
for example,
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
and the expected output will be like this
result 0 1 1 0 1 1
I know only how to sum it.
df.loc['result'] = df.sum()
but in this case I would like to do OR.
Thank you in advance.
You can reduce with any along the first axis (axis=0, down the rows).
>>> df
    A  B  C  D  E  F  G
r0  0  1  1  0  0  1  0
r1  0  0  0  0  0  0  0
r2  0  0  1  0  1  0  1
>>> df.loc['result'] = df.any(axis=0).astype(int)
>>> df
        A  B  C  D  E  F  G
r0      0  1  1  0  0  1  0
r1      0  0  0  0  0  0  0
r2      0  0  1  0  1  0  1
result  0  1  1  0  1  1  1
... assuming that in your output you forgot the last column.
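If you want a literal bitwise OR rather than any, numpy's ufunc reduction gives the same result for 0/1 integer data (a sketch using the question's frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 1, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 0, 1]],
                  index=['r0', 'r1', 'r2'],
                  columns=list('ABCDEFG'))

# element-wise OR down the rows; for 0/1 data this is
# equivalent to df.any(axis=0).astype(int)
df.loc['result'] = np.bitwise_or.reduce(df.to_numpy(), axis=0)
```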
Hi everybody, I need some help with Python.
I'm working with an Excel file with several rows; some of these rows have zero values in all the columns, so I need to delete those rows.
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
a 0 1 5 0
d 0 0 0 1
e 1 0 0 1
I thought of something like showing the rows that do not contain zeros, but it does not work because it deletes all the rows, both with and without zeros:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC != 0) & (df.TM != 0) & (df.Lease != 0) & (df.Maint != 0) & (df.Support != 0) & (df.Other != 0)]
Then I thought of just showing the rows that are all zeros:
In
id a b c d
a 0 1 5 0
b 0 0 0 0
c 0 0 0 0
d 0 0 0 1
e 1 0 0 1
Out
id a b c d
b 0 0 0 0
c 0 0 0 0
So I made a little change and now I have something like this:
path = '/Users/arronteb/Desktop/excel/ejemplo1.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx,'Sheet1')
df_zero = df[(df.OTC == 0) & (df.TM == 0) & (df.Lease == 0) & (df.Maint == 0) & (df.Support == 0) & (df.Other == 0)]
This way I get just the all-zero rows. I need a way to remove these 2 rows from the original input and receive the output without those rows. Thanks, and sorry for the bad English; I'm working on that too.
Given your input you can group by whether all the columns are zero or not, then access them, eg:
groups = df.groupby((df.drop('id', axis=1) == 0).all(axis=1))
all_zero = groups.get_group(True)
non_all_zero = groups.get_group(False)
For this dataframe:
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 f 0 0 0 0 0
6 g 0 2 1 0 2
7 h 0 0 0 0 0
8 i 1 2 2 0 2
9 j 2 2 1 2 1
Temporarily set the index:
df = df.set_index('id')
Drop rows containing all zeros and reset the index:
df = df[~(df==0).all(axis=1)].reset_index()
df
Out:
id a b c d e
0 a 2 0 2 0 1
1 b 1 0 1 1 1
2 c 1 0 0 0 1
3 d 2 0 2 0 2
4 e 0 0 0 0 2
5 g 0 2 1 0 2
6 i 1 2 2 0 2
7 j 2 2 1 2 1
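The same filter can also be written without touching the index at all, by masking on the non-id columns directly (a sketch using the question's original frame):

```python
import pandas as pd

df = pd.DataFrame({'id': list('abcde'),
                   'a': [0, 0, 0, 0, 1],
                   'b': [1, 0, 0, 0, 0],
                   'c': [5, 0, 0, 0, 0],
                   'd': [0, 0, 0, 1, 1]})

# keep rows where at least one non-id column is nonzero
out = df[df.drop(columns='id').ne(0).any(axis=1)].reset_index(drop=True)
```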
I have a DataFrame of authors and their papers:
author paper
0 A z
1 B z
2 C z
3 D y
4 E y
5 C y
6 F x
7 G x
8 G w
9 B w
I want to get a matrix of how many papers each pair of authors has together.
A B C D E F G
A
B 1
C 1 1
D 1 0 1
E 0 0 1 1
F 0 0 0 0 0
G 0 1 0 0 0 1
Is there a way to transform the DataFrame using pandas to get this results? Or is there a more efficient way (like with numpy) to do this so that it is scalable?
get_dummies, which I first reached for, isn't as convenient here as hoped; it needed an extra groupby. Instead, it's actually simpler to add a dummy column or use a custom aggfunc. For example, if we start from a df like this (note that I've added an extra paper a so that there's at least one pair who've written more than one paper together):
>>> df
author paper
0 A z
1 B z
2 C z
[...]
10 A a
11 B a
We can add a dummy tick column, pivot, and then use the "it's simply a dot product" observation from this question:
>>> df["dummy"] = 1
>>> dm = df.pivot(index="author", columns="paper").fillna(0)
>>> dout = dm.dot(dm.T)
>>> dout
author A B C D E F G
author
A 2 2 1 0 0 0 0
B 2 3 1 0 0 0 1
C 1 1 2 1 1 0 0
D 0 0 1 1 1 0 0
E 0 0 1 1 1 0 0
F 0 0 0 0 0 1 1
G 0 1 0 0 0 1 2
where the diagonal counts how many papers an author has written. If you really want to obliterate the diagonal and above, we can do that too:
>>> dout.values[np.triu_indices_from(dout)] = 0
>>> dout
author A B C D E F G
author
A 0 0 0 0 0 0 0
B 2 0 0 0 0 0 0
C 1 1 0 0 0 0 0
D 0 0 1 0 0 0 0
E 0 0 1 1 0 0 0
F 0 0 0 0 0 0 0
G 0 1 0 0 0 1 0
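As a variation on the same dot-product idea, pd.crosstab builds the author-by-paper incidence matrix directly, so no dummy column is needed (a sketch using the question's original ten rows, without the extra paper a):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'author': list('ABCDECFGGB'),
                   'paper':  list('zzzyyyxxww')})

# crosstab counts (author, paper) occurrences: the incidence matrix
m = pd.crosstab(df['author'], df['paper'])

# m @ m.T counts shared papers; the diagonal is papers per author
co = m.dot(m.T)

# keep only the strictly lower triangle, as in the expected output
lower = pd.DataFrame(np.tril(co.to_numpy(), k=-1),
                     index=co.index, columns=co.columns)
```

Since the incidence matrix has one row per author, this scales the same way as the pivot approach but reads a little more directly.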