I have a sample data set:
import pandas as pd
data = {
    'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
    'count': [1, 12, 34, 52, 3, 2],
    'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']
}
df = pd.DataFrame(data)
it looks like this:
columA columnB count
1A a 1
2A dd 12
3A dd 34
4A ee 52
5A d 3
6A f 2
Update: the combined 2A and 3A name should be something arbitrary, like 'SAB' or '2A plus 3A'; I used '2A|3A' as the example and it confused some people.
I want to sum the count of rows 2A and 3A and give the combined row the name SAB.
desired output:
columA columnB count
1A a 1
SAB dd 46
4A ee 52
5A d 3
6A f 2
We can use a groupby on columnB:
data = {'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
        'count': [1, 12, 34, 52, 3, 2],
        'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']}
df = pd.DataFrame(data)
df.groupby('columnB').agg({'count': 'sum', 'columA': 'sum'})
columA count
columnB
a 1A 1
d 5A 3
dd 2A3A 46
ee 4A 52
f 6A 2
If you want a separator in the combined label, you can write a function like so:
def join_by_pipe(s):
    return '|'.join(s)
df.groupby('columnB').agg({'count': 'sum', 'columA': join_by_pipe})
columA count
columnB
a 1A 1
d 5A 3
dd 2A|3A 46
ee 4A 52
f 6A 2
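To land exactly on the desired output (the arbitrary SAB label, with rows kept in their original order), a possible follow-up to the answer above is the sketch below; the sort=False flag and the replace mapping are my additions, not part of the original answer.

import pandas as pd

df = pd.DataFrame({'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
                   'count': [1, 12, 34, 52, 3, 2],
                   'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']})

# sort=False keeps groups in first-occurrence order, so the rows
# come out as 1A, SAB, 4A, 5A, 6A like the desired output
out = (df.groupby('columnB', as_index=False, sort=False)
         .agg({'columA': '|'.join, 'count': 'sum'}))

# swap the joined label for the arbitrary name from the update
out['columA'] = out['columA'].replace({'2A|3A': 'SAB'})
print(out[['columA', 'columnB', 'count']])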
I have a dataset in which some columns have lookup values. There are several such columns in the dataset. I need to expand these columns so that the column name consists of the name of the column itself and the keys in the dict.
Example df:
df
col1 col2 col3
a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
Desired result:
df_res
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
a 1a 2a 1a 2a
b 1b 2b 1a 2a
c 1c 2c 1a 2a
How can I do that?
If the columns contain dictionaries, not strings, use a list comprehension with json_normalize:
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'dict'>
dfs = [pd.json_normalize(df.pop(x)).add_prefix(f'{x}_') for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Solution with strings that can be converted to dictionaries:
print (df)
col1 col2 col3
0 a {'key_1': '1a', 'key_2': '2a'} {'key_3': '1a', 'key_4': '2a'}
1 b {'key_1': '1b', 'key_2': '2b'} {'key_3': '1a', 'key_4': '2a'}
2 c {'key_1': '1c', 'key_2': '2c'} {'key_3': '1a', 'key_4': '2a'}
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
<class 'str'>
import ast
dfs = [pd.json_normalize(df.pop(x).apply(ast.literal_eval)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
EDIT: Solution for original format with custom function:
print (df)
col1 col2 col3
0 a '{key_1: 1a, key_2: 2a}' '{key_3: 1a, key_4: 2a}'
1 b '{key_1: 1b, key_2: 2b}' '{key_3: 1a, key_4: 2a}'
2 c '{key_1: 1c, key_2: 2c}' '{key_3: 1a, key_4: 2a}'
cols = ['col2','col3']
print (type(df['col2'].iat[0]))
# <class 'str'>
f = lambda s: dict(item.split(': ') for item in s.strip("{'}").split(', '))
dfs = [pd.json_normalize(df.pop(x).apply(f)).add_prefix(f'{x}_')
for x in cols]
df = df.join(pd.concat(dfs, axis=1))
print (df)
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
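As an alternative to the hand-rolled parser, and assuming PyYAML is available (my assumption, not part of the answer above), strings like {key_1: 1a, key_2: 2a} happen to be valid YAML flow mappings, so yaml.safe_load can convert them; a minimal sketch:

import yaml

cols = ['col2', 'col3']
# strip any literal surrounding quotes, then parse each cell as YAML
dfs = [pd.json_normalize(df.pop(x).str.strip("'").apply(yaml.safe_load))
         .add_prefix(f'{x}_')
       for x in cols]
df = df.join(pd.concat(dfs, axis=1))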
I would use the df[col].apply(pd.Series) method to achieve this. It would then look something like this:
def explode_dictcol(df, col):
    temp = df[col].apply(pd.Series)
    temp = temp.rename(columns={cc: col + '_' + cc for cc in temp.columns})
    return temp
df = pd.concat([df, explode_dictcol(df, 'col2'), explode_dictcol(df, 'col3')], axis=1)
df = df.drop(columns=['col2', 'col3'])
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
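A quick performance note on this answer: .apply(pd.Series) builds one Series per row and gets slow on large frames; constructing the expanded frame in one shot from the list of dicts is usually faster. A sketch of an equivalent (the name explode_dictcol_fast is mine, and it assumes the column really holds dicts):

def explode_dictcol_fast(df, col):
    # one DataFrame constructor call instead of a per-row apply
    temp = pd.DataFrame(df[col].tolist(), index=df.index)
    return temp.add_prefix(col + '_')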
If the columns are strings, the following will do the work:
df_new = pd.DataFrame(data = [
[row['col1'],
row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
]).rename(columns = {0: 'col1', 1: 'col2_key_1', 2: 'col2_key_2', 3: 'col3_key_3', 4: 'col3_key_4'})
[Out]:
col1 col2_key_1 col2_key_2 col3_key_3 col3_key_4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
Notes:
Check the data type with
print(type(df['col2'][0]))
# or
print(type(df['col2'].iat[0]))
The first part of the proposed solution
df_new = pd.DataFrame(data = [
[row['col1'], row['col2'].split(':')[1].split(',')[0].strip(),
row['col2'].split(':')[2].split('}')[0].strip(),
row['col3'].split(':')[1].split(',')[0].strip(),
row['col3'].split(':')[2].split('}')[0].strip()]
for index, row in df.iterrows()
])
gives the following output
0 1 2 3 4
0 a 1a 2a 1a 2a
1 b 1b 2b 1a 2a
2 c 1c 2c 1a 2a
which is almost the desired result; that is why one has to chain .rename() to make sure the column names are as the OP wants.
I have a dataframe like this, and I can group it by the library and sample columns to create new columns:
df = pd.DataFrame({'barcode': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
                   'library': ['l1', 'l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2'],
                   'sample': ['s1', 's1', 's2', 's2', 's1', 's1', 's2', 's2'],
                   'category': ['c1', 'c2', 'c1', 'c2', 'c1', 'c2', 'c1', 'c2'],
                   'count': [10, 21, 13, 54, 51, 16, 67, 88]})
df
barcode library sample category count
0 b1 l1 s1 c1 10
1 b2 l1 s1 c2 21
2 b1 l1 s2 c1 13
3 b2 l1 s2 c2 54
4 b1 l2 s1 c1 51
5 b2 l2 s1 c2 16
6 b1 l2 s2 c1 67
7 b2 l2 s2 c2 88
I used grouping to reduce the dimensions of the df:
grp = df.groupby(['library', 'sample'])
df = grp.get_group(('l1', 's1')).rename(columns={"count": "l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count'] = grp.get_group(('l1', 's2'))[['count']].values
df['l2_s1_count'] = grp.get_group(('l2', 's1'))[['count']].values
df['l2_s2_count'] = grp.get_group(('l2', 's2'))[['count']].values
df = df.drop(['sample', 'library'], axis=1)
result:
  barcode category  l1_s1_count  l1_s2_count  l2_s1_count  l2_s2_count
0      b1       c1           10           13           51           67
1      b2       c2           21           54           16           88
I think there should be a neater way to do this transformation, for example with a pivot table, which I failed to get working. Could you please suggest how this could be done with pivot_table?
Thanks.
Try the pivot_table function as below. It will produce a multi-index result, which then needs to be flattened:
df2 = pd.pivot_table(df, index=['barcode', 'category'], columns=['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode_ category_ s1_l1 s1_l2 s2_l1 s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Or even without values='count':
df2 = pd.pivot_table(df, index=['barcode', 'category'], columns=['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode__ category__ count_s1_l1 count_s1_l2 count_s2_l1 count_s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Pick whichever suits your preference.
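If you want the exact l1_s1_count names from the question, one possible tweak (a sketch; putting library before sample and the rename expression are my additions) is:

df2 = (pd.pivot_table(df, index=['barcode', 'category'],
                      columns=['library', 'sample'], values='count')
         .reset_index())
# flattened tuples look like ('barcode', '') or ('l1', 's1');
# append _count only to the real value columns
df2.columns = [f'{a}_{b}_count' if b else a
               for a, b in df2.columns.to_flat_index()]
print(df2)

  barcode category  l1_s1_count  l1_s2_count  l2_s1_count  l2_s2_count
0      b1       c1           10           13           51           67
1      b2       c2           21           54           16           88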
df1 = pd.DataFrame({
    'Year': ["1A", "2A", "3A", "4A", "5A"],
    'Tval1': [1, 9, 8, 1, 6],
    'Tval2': [34, 56, 67, 78, 89]
})
I want to reshape it so the second value column is moved under each individual row; the desired output is shown in the answer below.
The idea is to get the numbers from the Year column, then set new column names, and reshape with DataFrame.stack:
df1['Year'] = df1['Year'].str.extract(r'(\d+)', expand=False)
df = df1.set_index('Year')
#add letters by length of columns, working for 1 to 26 columns A-Z
import string
df.columns = list(string.ascii_uppercase[:len(df.columns)])
#here working same like
#df.columns = ['A','B']
df = df.stack().reset_index(name='Val')
df['Year'] = df['Year'] + df.pop('level_1')
print (df)
Year Val
0 1A 1
1 1B 34
2 2A 9
3 2B 56
4 3A 8
5 3B 67
6 4A 1
7 4B 78
8 5A 6
9 5B 89
Another idea with DataFrame.melt:
df = (df1.replace({'Year': {'A':''}}, regex=True)
.rename(columns={'Tval1':'A','Tval2':'B'})
.melt('Year'))
df['Year'] = df['Year'] + df.pop('variable')
print (df)
Year value
0 1A 1
1 2A 9
2 3A 8
3 4A 1
4 5A 6
5 1B 34
6 2B 56
7 3B 67
8 4B 78
9 5B 89
Try the below code. I split the frame into two dataframes, then concatenated them after changing the Year suffix from 'A' to 'B'.
import pandas as pd
df = pd.DataFrame(data=dict(Year=['1A', '2A', '3A'], val1=[1, 2, 3], val2=[4,5,6]))
df1 = df.drop(columns=['val2'])
df2 = df.drop(columns=['val1'])
columns = ['Year', 'val']
df1.columns = columns
df2.columns = columns
df2['Year'] = df2['Year'].str.replace('A', 'B')
pd.concat([df1, df2]).reset_index(drop=True)
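Another possible route, not used in the answers above, is pd.wide_to_long; the Val_A/Val_B renames and the 'letter' column below are my own scaffolding, applied to the question's df1:

tmp = df1.rename(columns={'Tval1': 'Val_A', 'Tval2': 'Val_B'})
tmp['Year'] = tmp['Year'].str.extract(r'(\d+)', expand=False)
# stubnames='Val' picks up Val_A and Val_B; suffix lands in 'letter'
out = (pd.wide_to_long(tmp, stubnames='Val', i='Year', j='letter',
                       sep='_', suffix='[A-Z]')
         .reset_index())
out['Year'] = out['Year'] + out['letter']
out = out.drop(columns='letter').sort_values('Year', ignore_index=True)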
Column A 2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
Above is the dataframe I want to create.
The first row represents the field names. The logic I want to employ is as follows:
If the column name appears in the row's "Column A" value, then 1, otherwise 0.
I have scoured Google looking for code answering a question similar to mine, so I could test it out and reverse-engineer a solution, but I have not been able to find anything. Otherwise I would post some code I attempted, but I honestly have no idea where to start.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row =['2C 1B D2 6F ABC', '2C 1248 Bulers']
In [40]: columns=['2C', 'GAD', 'D2', '6F', 'ABCDE']
In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row],
    ...:                   index=row, columns=columns)
In [42]: df
Out[42]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
If you want a pure pandas approach, you can build pd.Series() objects from the lists above to preserve the columns and rows, then use Series.apply and Series.str.contains to get the desired result:
In [72]: row, columns = pd.Series(row), pd.Series(columns)
In [73]: data = columns.apply(row.str.contains).astype(int).transpose()
In [74]: df = pd.DataFrame(data.values, index = ['2C 1B D2 6F ABC','2C 1248 Bulers'], columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
2C GAD D2 6F ABCDE
2C 1B D2 6F ABC 1 0 1 1 0
2C 1248 Bulers 1 0 0 0 0
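A third possible sketch (my own variation, not from the answer above): build the frame column by column with one vectorized str.contains per column name; regex=False treats each name as a literal substring:

row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])
columns = ['2C', 'GAD', 'D2', '6F', 'ABCDE']

# .values avoids index alignment against the string index below
df = pd.DataFrame({c: row.str.contains(c, regex=False).astype(int).values
                   for c in columns}, index=row)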
I have a relatively big dataframe (1.5 GB), and I want to group rows by ID and order rows by column VAL in ascending order within each group.
df =
ID VAL COL
1A 2 BB
1A 1 AA
2B 2 CC
3C 3 SS
3C 1 YY
3C 2 XX
This is the expected result:
df =
ID VAL COL
1A 1 AA
1A 2 BB
2B 2 CC
3C 1 YY
3C 2 XX
3C 3 SS
This is what I tried, but it runs for a very long time. Is there a faster solution?
df = df.groupby("ID").apply(pd.DataFrame.sort, 'VAL')
If you have a big df and speed is important, try a little NumPy:
# note order of VAL first, then ID is intentional
# np.lexsort sorts by right most column first
df.iloc[np.lexsort((df.VAL.values, df.ID.values))]
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
super charged
v = df.values
i, j = np.searchsorted(df.columns.values, ['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)
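One caveat with the super-charged version: np.searchsorted assumes df.columns is sorted, and here 'COL' < 'ID' < 'VAL' alphabetically, so it happens to work. A safer variant (my tweak) uses Index.get_indexer, which has no ordering requirement:

v = df.values
# get_indexer returns positional indices regardless of column order
i, j = df.columns.get_indexer(['VAL', 'ID'])
s = np.lexsort((v[:, i], v[:, j]))
pd.DataFrame(v[s], df.index[s], df.columns)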
sort_values on ['ID', 'VAL'] should give you:
In [39]: df.sort_values(by=['ID', 'VAL'])
Out[39]:
ID VAL COL
1 1A 1 AA
0 1A 2 BB
2 2B 2 CC
4 3C 1 YY
5 3C 2 XX
3 3C 3 SS
Time it for your use-case
In [89]: dff.shape
Out[89]: (12000, 3)
In [90]: %timeit dff.sort_values(by=['ID', 'VAL'])
100 loops, best of 3: 2.62 ms per loop
In [91]: %timeit dff.iloc[np.lexsort((dff.VAL.values, dff.ID.values))]
100 loops, best of 3: 8.8 ms per loop
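For what it's worth, on recent pandas (1.0 or newer, an assumption about your environment) sort_values can also renumber the rows in the same call, avoiding the shuffled index shown above:

import pandas as pd

df = pd.DataFrame({'ID': ['1A', '1A', '2B', '3C', '3C', '3C'],
                   'VAL': [2, 1, 2, 3, 1, 2],
                   'COL': ['BB', 'AA', 'CC', 'SS', 'YY', 'XX']})

# ignore_index=True resets the index to 0..n-1 after sorting
out = df.sort_values(['ID', 'VAL'], ignore_index=True)
print(out)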