I have a dataframe like the following:
and I want to group the columns like the following.
I tried to use MultiIndex, but it won't work.
You can use pandas.MultiIndex.from_arrays to manually craft your custom index:
import pandas as pd

new_level = ['GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2', 'GROUP3', 'GROUP3']  # one label per column
df.columns = pd.MultiIndex.from_arrays([new_level, df.columns])
example input:
   A  B  C  D  E
0  X  X  X  X  X
output:
>>> df.columns = pd.MultiIndex.from_arrays([[1, 1, 2, 2, 3], df.columns])
>>> df
   1     2     3
   A  B  C  D  E
0  X  X  X  X  X
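Once the columns are a MultiIndex, selecting a top-level label returns the whole group at once, e.g. continuing the example above:
>>> df[1]
   A  B
0  X  X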
I have a data frame with:
A  B  C
1  3  6
I want to take the 2 columns and create column D that reads {"A":"1", "C":"6"}
new dataframe output would be:
A  B  C  D
1  3  6  {"A":"1", "C":"6"}
I have the following code:
df['D'] = df.apply(lambda x: x.to_json(), axis=1)
but this takes all columns, while I only need columns A and C and want to leave B out of the JSON that is created.
Any tips on just targeting the two columns would be appreciated.
It's not exactly what you asked, but you can convert your two columns into a dict; then, if you want to export the data in JSON format, use df['D'].to_json():
df['D'] = df[['A', 'C']].apply(dict, axis=1)
print(df)
# Output
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}
For example, export the column D as JSON:
print(df['D'].to_json(orient='records', indent=4))
# Output
[
    {
        "A":1,
        "C":6
    }
]
Use a subset inside the lambda function:
df['D'] = df.apply(lambda x: x[['A','C']].to_json(), axis=1)
Or select the columns before apply:
df['D'] = df[['A','C']].apply(lambda x: x.to_json(), axis=1)
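Either way, each cell of D ends up holding a JSON string; for the example frame above, the result of the second variant looks like this:
>>> df[['A','C']].apply(lambda x: x.to_json(), axis=1)
0    {"A":1,"C":6}
dtype: object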
If possible, create dictionaries instead; to_dict(orient='records') returns one dict per row, so the list assigns cleanly to the new column:
df['D'] = df[['A','C']].to_dict(orient='records')
print (df)
   A  B  C                 D
0  1  3  6  {'A': 1, 'C': 6}
I have a sample dataframe that looks like below. I'd like to eventually group row 1 and row 3 together, since they contain identical items in different columns.
x    y    count
a,b  b,a  5
a,c  c,a  2
b,a  a,b  1
I've spent a lot of time trying to solve this, but have not encountered a good solution yet. What steps should I take to reach the below final dataframe?
x    y    count
a,b  b,a  5+1
a,c  c,a  2
You can try:
df.groupby(
    (df.x + df.y).str.replace(',', '').apply(lambda x: ''.join(sorted(x)))
).agg({'x': 'first', 'y': 'first', 'count': 'sum'}).reset_index(drop=True)
OUTPUT:
     x    y  count
0  a,b  b,a      6
1  a,c  c,a      2
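To see what the grouping key looks like, here is a quick sketch using the example frame: the two columns are concatenated, commas stripped, and the characters sorted, so both orderings of a pair collapse to the same string:
>>> (df.x + df.y).str.replace(',', '').apply(lambda x: ''.join(sorted(x)))
0    aabb
1    aacc
2    aabb
dtype: object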
Slightly different approach.
Sort x and y row-wise using np.sort on axis=1:
cols = ['x', 'y']
df[cols] = np.sort(df[cols].values, axis=1)
     x    y  count
0  a,b  b,a      5
1  a,c  c,a      2
2  a,b  b,a      1
Then a standard groupby aggregate:
df = df.groupby(cols, as_index=False).aggregate(count=('count', 'sum'))
     x    y  count
0  a,b  b,a      6
1  a,c  c,a      2
Complete Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'x': ['a,b', 'a,c', 'b,a'],
    'y': ['b,a', 'c,a', 'a,b'],
    'count': [5, 2, 1]
})
cols = ['x', 'y']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.groupby(cols, as_index=False).aggregate(count=('count', 'sum'))
Let's say I have a dataframe like this:
df = pd.DataFrame({'foo':[1, 2], 'bar': [3, 4], 'xyz': [5, 6]})
   bar  foo  xyz
0    3    1    5
1    4    2    6
I now want to put the column whose name contains 'oo' in the first position (i.e. at index 0); there is always exactly one column matching this pattern.
I currently solve this using filter twice and a concat:
pd.concat([df.filter(like='oo'), df.filter(regex='^((?!(oo)).)*$')], axis=1)
which gives the desired output:
   foo  bar  xyz
0    1    3    5
1    2    4    6
I am wondering whether there is a more efficient way of doing this.
Use list comprehensions, join the lists together, and select the subset:
a = [x for x in df.columns if 'oo' in x]
b = [x for x in df.columns if 'oo' not in x]
df = df[a + b]
print (df)
   foo  bar  xyz
0    1    3    5
1    2    4    6
What about:
df[sorted(df, key=lambda x: x not in df.filter(like="oo").columns)]
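This works because False sorts before True and Python's sort is stable; a quick sketch of the key values for the example frame (columns bar, foo, xyz):
>>> [x not in df.filter(like="oo").columns for x in df]
[True, False, True]
so foo (key False) moves to the front while bar and xyz keep their relative order.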
Using pop:
cols = list(df)
col_oo = [col for col in df.columns if 'oo' in col]
cols.insert(0, cols.pop(cols.index(col_oo[0])))
df = df.loc[:, cols]  # .ix is removed in modern pandas
Or using regex:
import re
col_oo = [col for col in cols if re.search('oo', col)]
I have a dataframe and want to eliminate duplicate rows, that have same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
   a  b  c  d
1  x  y  e  f
2  e  f  x  y
3  w  v  s  t
Rows [1] and [2] have the values {x,y,e,f}, but arranged in a cross, i.e. if you exchanged columns c,d with a,b in row [2], you would have a duplicate.
I want to drop such rows and keep only one of them, to get the final output:
df_new
Out[20]:
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
How can I efficiently achieve that?
I think you need to filter by boolean indexing, with a mask created by numpy.sort combined with duplicated; to invert the mask, use ~:
import numpy as np

df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
   a  b  c  d
1  x  y  e  f
3  w  v  s  t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
['e' 'f' 'x' 'y']
['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
   0  1  2  3
1  e  f  x  y
2  e  f  x  y
3  s  t  v  w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1    False
2     True
3    False
dtype: bool
print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1     True
2    False
3     True
dtype: bool
Here's another solution, with a for loop:

data = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0
new = []
for row in data:
    if not new:
        new.append(row)
    else:
        # keep the row only if none of its values appear in any row kept so far
        if not any(c in nrow for nrow in new for c in row):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
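For the example frame this keeps rows 1 and 3 (note the index is rebuilt from 0):
print(new_df)
#    a  b  c  d
# 0  x  y  e  f
# 1  w  v  s  t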
Use sorting (np.sort), then flag the duplicates (.duplicated()), and use that boolean mask to drop (df.drop) the duplicated rows:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
df = df.drop(df.index[df_duplicated])
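which leaves the same result as the boolean-indexing answer above:
print(df)
#    a  b  c  d
# 1  x  y  e  f
# 3  w  v  s  t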