I am trying to create new dataframes df_A, df_B and df_C from an existing dataframe df, based on the categorical values (A, B and C) in the column Category.
This doesn't work:
df_A = {n: df.ix[rows]
        for n, rows in enumerate(df.groupby('Category').groups)}
Here I get the error "KeyError: 'A'"
(Note: A is one of the categories.)
This doesn't work either:
df_A = np.where(df['Category']=='A')).copy()
Here I get the error: "syntax error"
Finally, this doesn't work:
df_A = np.where(raw[raw['Category']=='A']).copy()
"AttributeError: 'tuple' object has no attribute 'copy'"
Thank You
It seems you need boolean indexing first, because Category is a column, not an index. If you need a dictionary:
df2 = {n: data[ data['Category'] == rows]
for n, rows in enumerate(data.groupby('Category').groups)}
Or try removing .groups:
df2 = {n: rows[1] for n, rows in enumerate(data.groupby('Category'))}
Sample:
data = pd.DataFrame({'Category':['A','A','D'],
                     'B':[4,5,6],
                     'C':[7,8,9]})
print (data)
B C Category
0 4 7 A
1 5 8 A
2 6 9 D
df2 = {n: rows[1] for n, rows in enumerate(data.groupby('Category'))}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
df2 = {n: data[data['Category'] == rows]
       for n, rows in enumerate(data.groupby('Category').groups)}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
Solution without groupby:
df2 = {n: data[data['Category'] == rows] for n, rows in enumerate(data['Category'].unique())}
print (df2)
{0: B C Category
0 4 7 A
1 5 8 A, 1: B C Category
2 6 9 D}
print (df2[0])
B C Category
0 4 7 A
1 5 8 A
But if you need a dict of DataFrames keyed by the Category value:
dfs = {n: rows for n, rows in data.groupby('Category')}
print (dfs)
{'A': B C Category
0 4 7 A
1 5 8 A, 'D': B C Category
2 6 9 D}
print (dfs['A'])
B C Category
0 4 7 A
1 5 8 A
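If you really want separate variables like df_A, df_B and df_C, here is a minimal sketch (assuming the categories 'A', 'B' and 'C' all occur in df):
dfs = {category: group for category, group in df.groupby('Category')}

# dict.get returns None instead of raising KeyError if a category is missing
df_A = dfs.get('A')
df_B = dfs.get('B')
df_C = dfs.get('C')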
I have two dataframes with hundreds of columns.
Some have the same name, some do not.
I want the two dataframes to have the columns with same name listed in the same order.
Typically, if those were the only columns, I would do:
df2 = df2.filter(df1.columns)
However, because there are columns with different names, this would eliminate all columns in df2 that do not exist in df1.
How do I order all common columns in the same order without losing the columns that are not in common? Those not in common must be kept in their original order. Because I have hundreds of columns, I cannot do it manually and need a quick solution like filter. Note that though there are similar questions, they do not deal with the case where some columns are in common and some are different.
Example:
df1.columns = A,B,C,...,Z,1,2,...,1000
df2.columns = Z,K,P,T,...,01,02,...,01000
I want to reorder the columns for df2 to be:
df2.columns = A,B,C,...,Z,01,02,...,01000
Try set operations on column names, like intersection and difference.
Set up an MRE (minimal reproducible example):
>>> df1
A B C D
0 2 7 7 5
1 6 8 4 2
>>> df2
C B E F
0 8 7 3 2
1 8 6 5 8
c0 = df1.columns.intersection(df2.columns)
c1 = df1.columns.difference(df2.columns)
c2 = df2.columns.difference(df1.columns)
df1 = df1[c0.tolist() + c1.tolist()]
df2 = df2[c0.tolist() + c2.tolist()]
Output:
>>> df1
B C A D
0 7 7 2 5
1 8 4 6 2
>>> df2
B C E F
0 7 8 3 2
1 6 8 5 8
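Wrapped as a small reusable helper (hypothetical name align_common_first; the same set logic as above):
def align_common_first(a, b):
    # common columns first, then each frame's own extra columns
    common = a.columns.intersection(b.columns)
    only_a = a.columns.difference(b.columns)
    only_b = b.columns.difference(a.columns)
    return a[common.tolist() + only_a.tolist()], b[common.tolist() + only_b.tolist()]

df1, df2 = align_common_first(df1, df2)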
Assume you also want to keep the columns that are not in common in their original positions:
# make a copy of df2 column names
new_cols = df2.columns.values.copy()
# reorder common column names in df2 to be same order as df1
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
# reorder columns using new_cols
df2[new_cols]
Example:
df1 = pd.DataFrame([[1,2,3,4,5]], columns=list('badfe'))
df2 = pd.DataFrame([[1,2,3,4,5]], columns=list('fsxad'))
df1
b a d f e
0 1 2 3 4 5
df2
f s x a d
0 1 2 3 4 5
new_cols = df2.columns.values.copy()
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
df2[new_cols]
a s x d f
0 4 2 3 5 1
You can do this using pd.Index.intersection, pd.Index.difference and pd.Index.union:
i = df1.columns.intersection(df2.columns, sort=False).union(
    df2.columns.difference(df1.columns), sort=False
)
out = df2.loc[:,i]
df1 = pd.DataFrame(columns=list("ABCEFG"))
df2 = pd.DataFrame(columns=list("ECDAFGHI"))
print(df1)
print(df2)
i = df2.columns.intersection(df1.columns, sort=False).union(
    df2.columns.difference(df1.columns), sort=False
)
print(df2.loc[:,i])
Empty DataFrame
Columns: [A, B, C, E, F, G]
Index: []
Empty DataFrame
Columns: [E, C, D, A, F, G, H, I]
Index: []
Empty DataFrame
Columns: [A, C, E, F, G, D, H, I]
Index: []
I have a dataframe:
df = [A B C D E_p0 E_p1 E_p2 K_p0 K_p1 K_2
      a 2 r 4 3    6    1    9    5    1
      e g 1 d 5    8    2    7    1    4]
And I want to group columns based on the prefix and aggregate them with a function, such as mean, max, or RMS.
So, for example, if my function is max, the output is:
df = [A B C D E K
      a 2 r 4 6 9
      e g 1 d 8 7]
You can move the columns without a separator to the index, then group the remaining columns by prefix with a lambda function and an aggregate function like max:
m = df.columns.str.contains('_')
df = (df.set_index(df.columns[~m].tolist())
        .groupby(lambda x: x.split('_')[0], axis=1)
        .max()
        .reset_index())
print (df)
A B C D E K
0 a 2 r 4 6 9
1 e g 1 d 8 7
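Note that axis=1 in groupby is deprecated in recent pandas releases; a rough equivalent (a sketch, assuming the same df) groups the transposed frame by the column-name prefix instead:
m = df.columns.str.contains('_')
out = df.set_index(df.columns[~m].tolist())
# transpose, group the former columns by their prefix, then transpose back
out = out.T.groupby(lambda x: x.split('_')[0]).max().T.reset_index()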
Solution with custom function:
import numpy as np

def rms(x):
    return np.sqrt(np.sum(x**2, axis=1) / len(x.columns))
m = df.columns.str.contains('_')
df1 = (df.set_index(df.columns[~m].tolist())
         .groupby(lambda x: x.split('_')[0], axis=1)
         .agg(rms)
         .reset_index())
print (df1)
A B C D E K
0 a 2 r 4 3.915780 5.972158
1 e g 1 d 5.567764 4.690416
I have dataframes:
df1:
   A  B  C  D  E
0  1  2  3  4  5
1  1  3  4  5  0
2  3  1  2  3  5
3  2  3  1  2  6
4  2  5  1  2  3
df2:
   K  L  M  N
0  1  3  4  2
1  1  2  5  3
2  3  2  3  1
3  1  4  5  0
4  2  2  3  6
5  2  1  2  7
What I need to do is match column A of df1 with column K of df2, column C of df1 with column L of df2, and column D of df1 with column M of df2. If the values match, the corresponding value of N in df2 should be assigned to a new column F in df1. The output should be:
   A  B  C  D  E  F
0  1  2  3  4  5  2
1  1  3  4  5  0  0
2  3  1  2  3  5  1
3  2  3  1  2  6  7
4  2  5  1  2  3  7
Use DataFrame.merge with a left join, renaming the df2 columns so they match:
df = df1.merge(df2.rename(columns={'K':'A','L':'C','M':'D', 'N':'F'}), how='left')
print (df)
A B C D E F
0 1 2 3 4 5 2
1 1 3 4 5 0 0
2 3 1 2 3 5 1
3 2 3 1 2 6 7
4 2 5 1 2 3 7
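If some rows in df1 had no match, the left join would leave NaN in F. A small follow-up, assuming 0 is the desired fill value:
df['F'] = df['F'].fillna(0)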
Alternatively, compare the two frames row by row after joining them (note this assumes the matching rows are already aligned by position):
df3 = df1.join(df2)
F = []
for _, row in df3.iterrows():
    if row['A'] == row['K'] and row['C'] == row['L'] and row['D'] == row['M']:
        F.append(row['N'])
    else:
        F.append(0)
df1['F'] = F
df1
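A vectorized version of the same row-aligned check (a sketch, assuming df1 and df2 as above):
df3 = df1.join(df2)
match = df3['A'].eq(df3['K']) & df3['C'].eq(df3['L']) & df3['D'].eq(df3['M'])
df1['F'] = df3['N'].where(match, 0)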
I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows is correct and most rows are correct,
i.e. number of rows in SubDF + number of rows in DF2 = number of rows in DF,
but I get rows with NaN values that do not exist in the original DF.
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values; to double-check, I ran DF.dropna() beforehand and the result still produced NaNs.
You need merge with an outer join and boolean indexing, because DataFrame.isin needs both values and index to match:
DF = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
                      'B':[6],
                      'C':[9],
                      'D':[5],
                      'E':[6],
                      'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
# returns every row: isin aligns on index, so the SubDF row never matches
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
The merge with an outer join and indicator=True works:
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
Another way, borrowing the setup from @jezrael:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
                    'B':[6],
                    'C':[9],
                    'D':[5],
                    'E':[6],
                    'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]
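A shorter variant that preserves the original order directly (a sketch using Index.difference with sort=False):
df_extract = df.loc[df.index.difference(sub.index, sort=False)]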
I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n DataFrame where:
1) The index represents people's names
2) The column headers are the same people's names in the same order
3) Each cell of the DataFrame is the average number of times they email each other each day.
How would I transform that DataFrame into a DataFrame with 3 columns, where:
1) Column 1 would be the index of the n by n DataFrame
2) Column 2 would be the column headers of the n by n DataFrame
3) Column 3 would be the cell value corresponding to those two names, i.e. the index/column-header combination from the n by n DataFrame
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, as in the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
## df of all relationships to build
# (the original snippet referenced an undefined SO_df; assuming here that
#  df1's index supplies the same unique field names, so the example runs)
flds = pd.Series(df1.index)
combos = []
for L in range(0, len(flds)+1):
    for subset in permutations(flds, L):
        if len(subset) == 2:
            combos.append(subset)
        if len(subset) > 2:
            break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
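To match rel_df exactly, you could then drop the self-pairs and rename the columns (a sketch building on the reset df1 and the melt call above):
out = pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
out = out[out['index'] != out['variable']]
out.columns = ['fld1', 'fld2', 'value']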
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove self-pairs (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
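If you prefer a clean 0..n-1 index on the final frame (a small optional follow-up):
df = df.reset_index(drop=True)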