I have a dataframe like the following:
and I want to group the columns like the following.
I tried to use MultiIndex, but it won't work.
You can use pandas.MultiIndex.from_arrays to manually craft your custom index:
import pandas as pd

new_level = ['GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2', 'GROUP3', 'GROUP3']  # one label per column
df.columns = pd.MultiIndex.from_arrays([new_level, df.columns])
example input:
   A  B  C  D  E
0  X  X  X  X  X
output:
>>> df.columns = pd.MultiIndex.from_arrays([[1, 1, 2, 2, 3], df.columns])
>>> df
   1     2     3
   A  B  C  D  E
0  X  X  X  X  X
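Once the columns are a MultiIndex, selecting a top-level label returns the whole group at once, e.g. continuing the example above:
>>> df[1]
   A  B
0  X  X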
I have a data frame with:
A  B  C
1  3  6
I want to take the 2 columns and create column D that reads {"A":"1", "C":"6"}
new dataframe output would be:
A  B  C  D
1  3  6  {"A":"1", "C":"6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this takes all columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you asked, but you can convert your two columns into a dict; then, if you want to export the data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}
For example, export the column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
    {
        "A":1,
        "C":6
    }
]
Use a subset inside the lambda function:
df['D'] = df.apply(lambda x: x[['A','C']].to_json(), axis=1)
Or select the columns before apply:
df['D'] = df[['A','C']].apply(lambda x: x.to_json(), axis=1)
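Either way, each cell of D ends up holding a JSON string; for the example frame above, the result of the second variant looks like this:
>>> df[['A','C']].apply(lambda x: x.to_json(), axis=1)
0    {"A":1,"C":6}
dtype: object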
If possible, create dictionaries instead; to_dict(orient='records') returns one dict per row, so the list assigns cleanly to the new column:
df['D'] = df[['A','C']].to_dict(orient='records')
print (df)
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}
I have a sample dataframe that looks like below. I'd like to eventually group row 1 and row 3 together, since they contain identical items in different columns.
x    y    count
a,b  b,a  5
a,c  c,a  2
b,a  a,b  1
I've spent a lot of time trying to solve this, but have not encountered a good solution yet. What steps should I take to reach the below final dataframe?
x    y    count
a,b  b,a  5+1
a,c  c,a  2
You can try:
df.groupby(
    (df.x + df.y).str.replace(',', '').apply(lambda x: ''.join(sorted(x)))
).agg({'x': 'first', 'y': 'first', 'count': 'sum'}).reset_index(drop=True)
OUTPUT:
     x    y  count
0  a,b  b,a      6
1  a,c  c,a      2
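To see what the grouping key looks like, here is a quick sketch using the example frame: the two columns are concatenated, commas stripped, and the characters sorted, so both orderings of a pair collapse to the same string:
>>> (df.x + df.y).str.replace(',', '').apply(lambda x: ''.join(sorted(x)))
0    aabb
1    aacc
2    aabb
dtype: object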
Slightly different approach.
Sort x and y row-wise using np.sort on axis=1:
cols = ['x', 'y']
df[cols] = np.sort(df[cols].values, axis=1)
     x    y  count
0  a,b  b,a      5
1  a,c  c,a      2
2  a,b  b,a      1
Then a standard groupby aggregate:
df = df.groupby(cols, as_index=False).aggregate(count=('count', 'sum'))
     x    y  count
0  a,b  b,a      6
1  a,c  c,a      2
Complete Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'x': ['a,b', 'a,c', 'b,a'],
    'y': ['b,a', 'c,a', 'a,b'],
    'count': [5, 2, 1]
})
cols = ['x', 'y']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.groupby(cols, as_index=False).aggregate(count=('count', 'sum'))
Let's say I have a dataframe like this:
df = pd.DataFrame({'foo':[1, 2], 'bar': [3, 4], 'xyz': [5, 6]})
   bar  foo  xyz
0    3    1    5
1    4    2    6
I now want to put the column whose name contains 'oo' in the first position (i.e. at index 0); there is always exactly one column matching this pattern.
I currently solve this using filter twice and a concat:
pd.concat([df.filter(like='oo'), df.filter(regex='^((?!(oo)).)*$')], axis=1)
which gives the desired output:
   foo  bar  xyz
0    1    3    5
1    2    4    6
I am wondering whether there is a more efficient way of doing this.
Use list comprehensions, join the lists together, and select the subset:
a = [x for x in df.columns if 'oo' in x]
b = [x for x in df.columns if 'oo' not in x]
df = df[a + b]
print (df)
   foo  bar  xyz
0    1    3    5
1    2    4    6
What about:
df[sorted(df, key=lambda x: x not in df.filter(like="oo").columns)]
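This works because False sorts before True and Python's sort is stable; a quick sketch of the key values for the example frame (columns bar, foo, xyz):
>>> [x not in df.filter(like="oo").columns for x in df]
[True, False, True]
so foo (key False) moves to the front while bar and xyz keep their relative order.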
Using pop:
cols = list(df)
col_oo = [col for col in df.columns if 'oo' in col]
cols.insert(0, cols.pop(cols.index(col_oo[0])))
df = df.loc[:, cols]  # .ix is removed in modern pandas
Or using regex:
import re
col_oo = [col for col in cols if re.search('oo', col)]
I have a dataframe and want to eliminate duplicate rows, that have same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t
Rows [1] and [2] have the values {x,y,e,f}, but arranged in a cross, i.e. if you exchanged columns c,d with a,b in row [2], you would have a duplicate.
I want to drop such rows and keep only one of them, to get the final output:
df_new
Out[20]:
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
How can I efficiently achieve that?
I think you need to filter by boolean indexing, with a mask created by numpy.sort combined with duplicated; to invert the mask, use ~:
import numpy as np

df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
['e' 'f' 'x' 'y']
['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool
print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
Here's another solution, with a for loop:

data = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0
new = []
for row in data:
    if not new:
        new.append(row)
    else:
        # keep the row only if none of its values appear in any row kept so far
        if not any(c in nrow for nrow in new for c in row):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
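For the example frame this keeps rows 1 and 3 (note the index is rebuilt from 0):
print(new_df)
#    a  b  c  d
# 0  x  y  e  f
# 1  w  v  s  t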
Use sorting (np.sort), then flag the duplicates (.duplicated()), and use that boolean mask to drop (df.drop) the duplicated rows:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
df = df.drop(df.index[df_duplicated])
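which leaves the same result as the boolean-indexing answer above:
print(df)
#    a  b  c  d
# 1  x  y  e  f
# 3  w  v  s  t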