Ordering a dataframe with a key function and multiple columns - python

I have the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
print(df)
print(df.sort_values(by=['col5']))
print(df.sort_values(by=['col2']))
print(df.sort_values(by='col2', key=lambda col: col.abs()))
So far so good.
However, I would like to order the dataframe by two columns: first col6, then col5, with the following conditions:
col6 only has to consider 4 decimals (meaning that 1.00005 and 1.00001 should be considered equal)
col6 should be compared by absolute value (meaning 1.00005 counts as less than -2.12132)
So the desired output would be:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
How can I combine the usage of keys with multiple columns?

If you want to use arbitrary conditions on different columns, the easiest (and most efficient) is to use numpy.lexsort:
import numpy as np
out = df.iloc[np.lexsort([df['col5'].abs(), df['col6'].round(4)])]
NB: unlike sort_values, lexsort puts the keys with the highest priority at the end of the list.
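A quick illustration of that priority order (a standalone sketch, not tied to the question's data):
import numpy as np

a = np.array([2, 1, 2, 1])  # intended primary key
b = np.array([9, 8, 7, 6])  # intended secondary key
order = np.lexsort([b, a])  # the last key in the list has the highest priority
print(order)                # [3 1 2 0]: sorted by a first, ties broken by b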
If you really want to use sort_values, you can use a custom function that chooses the operation to apply depending on the Series name:
def sorter(s):
    funcs = {
        'col5': lambda s: s.abs(),
        'col6': lambda s: s.round(4)
    }
    return funcs[s.name](s) if s.name in funcs else s

out = df.sort_values(by=['col6', 'col5'], key=sorter)
Output:
col1 col2 col3 col4 col5 col6
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
1 A -1 1 B 1 1.00001
4 D 7 2 e 7 1.00003
0 A 2 0 a 2 1.00005
Provided example
Reading the question and the provided example again, I think you might want:
df.iloc[np.lexsort([df['col5'], np.trunc(df['col6'].abs()*10**4)/10**4])]
Output:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
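The same truncation can also be plugged back into the sort_values route shown above, if you prefer that interface (a sketch combining the two snippets above, not part of the original code):
def sorter(s):
    funcs = {
        'col6': lambda s: np.trunc(s.abs() * 10**4) / 10**4
    }
    return funcs[s.name](s) if s.name in funcs else s

out = df.sort_values(by=['col6', 'col5'], key=sorter)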

round() should not be used to truncate, because round(1.00005, 4) gives 1.0001, not 1.0000.
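To illustrate the difference between truncating and rounding (a minimal sketch; trunc4 is a hypothetical helper, not part of the proposed code below):
import math

def trunc4(x):
    # drop, rather than round, everything past the 4th decimal
    return math.trunc(x * 10**4) / 10**4

print(trunc4(1.00005))   # 1.0 -> groups with 1.00001, as the question requires
print(trunc4(-2.12137))  # -2.1213 (truncation is toward zero)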
Proposed code:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
# drop the 5th decimal by removing the last character of the string representation
# (works here because every col6 value has exactly 5 decimals)
r = df.sort_values(by=['col6', 'col5'],
                   key=lambda c: c.apply(lambda x: abs(float(str(x)[:-1]))) if c.name == 'col6' else c)
print(r)
Result:
col1 col2 col3 col4 col5 col6
1 A -1 1 B 1 1.00001
0 A 2 0 a 2 1.00005
4 D 7 2 e 7 1.00003
5 C 4 3 F 4 -2.12135
3 NaN -8 4 D 8 -2.12137
2 B 9 9 c 9 -2.12132
Other coding style inspired by Mozway
I have read Mozway's inspiring answer. Very interesting, but since s is a Series, you should use the following script:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, -1, 9, -8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
    'col5': [2, 1, 9, 8, 7, 4],
    'col6': [1.00005, 1.00001, -2.12132, -2.12137, 1.00003, -2.12135]
})
def truncate(x):
    s = str(x).split('.')
    s[1] = s[1][:4]
    return '.'.join(s)

def sorter(s):
    funcs = {
        'col5': lambda s: s,
        'col6': lambda s: s.apply(lambda x: abs(float(truncate(x))))
    }
    return funcs[s.name](s) if s.name in funcs else s

out = df.sort_values(by=['col6', 'col5'], key=sorter)
print(out)


Pandas dataframe create new row for every value over multiple columns [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 1 year ago.
I have a pandas dataframe with unique values in the ID column.
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'STAT': ['X', 'X', 'X'],
                   'IN1': [1, 3, 7],
                   'IN2': [2, 5, 8],
                   'IN3': [3, 6, 9]})
I need to create a new dataframe where I have a row for each value in IN1, IN2 and IN3 with corresponding ID and STAT:
df_new = pd.DataFrame({'IN': [1, 2, 3, 3, 5, 6, 7, 8, 9],
                       'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                       'STAT': ['X', 'X', 'X', 'X', 'X', 'X', 'X', 'X', 'X']})
You can use pandas.wide_to_long:
(pd.wide_to_long(df, ['IN'], j='to_drop', i='ID')
   .droplevel('to_drop')
   .sort_index()
   .reset_index()
)
Output:
ID STAT IN
0 A X 1
1 A X 2
2 A X 3
3 B X 3
4 B X 5
5 B X 6
6 C X 7
7 C X 8
8 C X 9
You can use melt:
df.melt(id_vars=['ID','STAT'], value_name='IN')
Gives:
ID STAT variable IN
0 A X IN1 1
1 B X IN1 3
2 C X IN1 7
3 A X IN2 2
4 B X IN2 5
5 C X IN2 8
6 A X IN3 3
7 B X IN3 6
8 C X IN3 9
To match the desired output, drop the variable column and sort by ID:
(df.melt(id_vars=['ID','STAT'], value_name='IN')
.sort_values(by='ID')
.drop('variable', axis=1)
)
Gives essentially the same result (up to index and column order); the sketch below tidies those up.
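If the exact layout of df_new matters (column order and a fresh index), a small follow-up helps; this is a sketch reusing the melt call above, with kind='stable' to keep the IN1/IN2/IN3 order within each ID:
out = (df.melt(id_vars=['ID', 'STAT'], value_name='IN')
         .sort_values(by='ID', kind='stable')
         .drop('variable', axis=1)
         .reset_index(drop=True)
         [['IN', 'ID', 'STAT']])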

multiply 2 columns in 2 dfs if they match the column name

I have 2 dfs with some shared colnames.
I tried this; it worked only when the colnames in the national df are not repeated.
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].values
I tried to use the same code on a df that has several repeated names, but I got the following error: 'shapes (26,33) and (1,26) not aligned: 33 (dim 1) != 1 (dim 0)'. In the second df, 33 columns share the same name, and each needs to be multiplied elementwise with one column of the first df.
This code does not work either, as there are repeated colnames in urban.columns.
[np.matrix(urban[col].values) * np.matrix(F[col2].values) for col in urban.columns for col2 in F.columns if col == col2]
Reproducible code
df1 = pd.DataFrame({
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})
Hopefully the working example below helps. Please provide a minimal reproducible example in your question, with input code and desired output like I have provided; see how to ask a good pandas question:
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6]})
print(df1)
df2 = pd.DataFrame({
    'FX Rate': [1.5, 2.0, 3.0, 5.0, 10.0]})
print(df2)
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
for col in ['Col1', 'Col2']:
    df1[col] = df1[col] * df2['FX Rate']
df1
Input df1:
Product Col1 Col2
0 AA 1 2
1 AA 2 4
2 BB 1 2
3 BB 2 4
4 BB 3 6
Input df2:
FX Rate
0 1.5
1 2.0
2 3.0
3 5.0
4 10.0
Out[1]:
Product Col1 Col2
0 AA 1.5 3.0
1 AA 4.0 8.0
2 BB 3.0 6.0
3 BB 10.0 20.0
4 BB 30.0 60.0
You can't multiply two DataFrames if they have different shapes, but if you want to multiply anyway, you can use transpose:
out = {}
for col in national.columns:
    for col2 in F.columns:
        if col == col2:
            out[col] = national[col].values * F[col2].T.values
You can get the common columns of the 2 dataframes, then multiply the 2 dataframes with simple multiplication. Then join back the column(s) that exist only in df1 to the multiplication result, as follows:
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
Demo
df1 = pd.DataFrame({
    'Product': ['AA', 'AA', 'BB', 'BB', 'BB'],
    'Col1': [1, 2, 1, 2, 3],
    'Col2': [2, 4, 2, 4, 6],
    'Col3': [7, 4, 2, 8, 6]})
df2 = pd.DataFrame({
    'Col1': [1.5, 2.0, 3.0, 5.0, 10.0],
    'Col2': [1, 0.0, 4.0, 5.0, 7.0]})
common_cols = df1.columns.intersection(df2.columns)
df1_only_cols = df1.columns.difference(common_cols)
df1_out = df1[df1_only_cols].join(df1[common_cols] * df2[common_cols])
df1 = df1_out.reindex_like(df1)
print(df1)
Product Col1 Col2 Col3
0 AA 1.5 2.0 7
1 AA 4.0 0.0 4
2 BB 3.0 8.0 2
3 BB 10.0 20.0 8
4 BB 30.0 42.0 6
A friend of mine sent this solution, which works just as I wanted.
out = urban.copy()
for col in urban.columns:
    for col2 in F.columns:
        if col == col2:
            out.loc[:, col] = urban.loc[:, [col]].values * F.loc[:, [col2]].values

Filter rows of DataFrame by list of tuples

Let's assume I have the following DataFrame:
dic = {'a': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
       'b': [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
       'c': ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
       'd': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df
Out[10]:
a b c d
0 1 1 f 10
1 1 1 f 20
2 2 1 f 30
3 2 1 e 40
4 2 2 f 50
5 2 2 f 60
6 3 1 f 70
7 3 1 e 80
8 3 2 f 90
9 3 2 f 100
In the following, I want to take the values of columns a and b where c == 'e', and use those values to select the respective rows of df (which would filter rows 2, 3, 6, 7). The idea is to create a list of tuples and index df by that list:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new
Out[13]:
c d
a b
1 1 f 10
1 f 20
2 1 f 30
1 e 40
2 f 50
2 f 60
3 1 f 70
1 e 80
2 f 90
2 f 100
list_tup
Out[14]: [(2, 1), (3, 1)]
df.loc[list_tup]
This results in a TypeError: unhashable type: 'writeable void-scalar', which I don't understand. Any suggestions? I'm pretty new to Python and pandas, so I assume I'm missing something fundamental.
I believe it's better to groupby().transform() and boolean indexing in this use case:
valids = (df['c'].eq('e')               # check if `c` is 'e'
          .groupby([df['a'], df['b']])  # group by `a` and `b`
          .transform('any')             # if True occurs in a group, use the same label for all of its rows
         )
# filter with boolean indexing
df[valids]
Output:
a b c d
2 2 1 f 30
3 2 1 e 40
6 3 1 f 70
7 3 1 e 80
A similar idea with groupby().filter(), which is more readable but can be slightly slower:
df.groupby(['a','b']).filter(lambda x: x['c'].eq('e').any())
You could try an inner join.
import pandas as pd
dic = {'a': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
       'b': [1, 1, 1, 1, 2, 2, 1, 1, 2, 2],
       'c': ['f', 'f', 'f', 'e', 'f', 'f', 'f', 'e', 'f', 'f'],
       'd': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]}
df = pd.DataFrame(dic)
df.merge(df.loc[df['c']=='e', ['a','b']], on=['a','b'])
Output
a b c d
0 2 1 f 30
1 2 1 e 40
2 3 1 f 70
3 3 1 e 80
Maybe try a MultiIndex:
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Full code:
list_tup = list(df.loc[df['c'] == 'e', ['a','b']].to_records(index=False))
df_new = df.set_index(['a', 'b']).sort_index()
df_new.loc[pd.MultiIndex.from_tuples(list_tup)]
Outputs:
c d
a b
2 1 f 30
1 e 40
3 1 f 70
1 e 80
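As an aside, the TypeError in the question comes from the numpy record scalars that to_records returns, which are not hashable; converting them to plain Python tuples also makes .loc work directly (a sketch reusing the names above):
list_tup = [tuple(r) for r in df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False)]
df_new.loc[list_tup]  # plain tuples are hashable, so this selects the same rows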
Let us try
m = df[['a', 'b']].apply(tuple, axis=1).isin(
    df.loc[df['c'] == 'e', ['a', 'b']].to_records(index=False).tolist())
out = df[m].copy()

Reorder Multiindex Pandas Dataframe

I would like to reorder the columns in a dataframe, and keep the underlying values in the right columns.
For example, this is the dataframe I have:
import numpy as np
import pandas as pd

cols = [['Three', 'Two'], ['A', 'D', 'C', 'B']]
header = pd.MultiIndex.from_product(cols)
df = pd.DataFrame([[1, 4, 3, 2, 5, 8, 7, 6]]*4, index=np.arange(1, 5), columns=header)
df.loc[:, ('One', 'E')] = 9
df.loc[:, ('One', 'F')] = 10
>>> df
And I would like to change it as follows:
# 'codes' was named 'labels' in older pandas versions
header2 = pd.MultiIndex(levels=[['One', 'Two', 'Three'], ['E', 'F', 'A', 'B', 'C', 'D']],
                        codes=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2], [0, 1, 2, 3, 4, 5, 2, 3, 4, 5]])
df2 = pd.DataFrame([[9, 10, 1, 2, 3, 4, 5, 6, 7, 8]]*4, index=np.arange(1, 5), columns=header2)
>>> df2
First, define a categorical ordering on the top level. Then, call sort_index on the first axis with both levels.
v = pd.Categorical(df.columns.get_level_values(0),
                   categories=['One', 'Two', 'Three'],
                   ordered=True)
v2 = pd.Categorical(df.columns.get_level_values(1),
                    categories=['E', 'F', 'C', 'B', 'A', 'D'],
                    ordered=True)
df.columns = pd.MultiIndex.from_arrays([v, v2])
df = df.sort_index(axis=1, level=[0, 1])
df
One Two Three
E F C B A D C B A D
1 9 10 7 6 5 8 3 2 1 4
2 9 10 7 6 5 8 3 2 1 4
3 9 10 7 6 5 8 3 2 1 4
4 9 10 7 6 5 8 3 2 1 4
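If only the top level needs reordering and each group's inner order can stay as it is, a level-aware reindex may be enough; a sketch, assuming the df built in the question:
df2 = df.reindex(columns=['One', 'Two', 'Three'], level=0)  # reorders level 0 only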

Python Pandas concatenate strings and numbers into one string

I am working with a pandas dataframe and trying to concatenate multiple strings and numbers into one string.
This works:
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'c']})
df1.apply(lambda x: ', '.join(x), axis=1)
0 a, a
1 b, b
2 c, c
How can I make this work just like df1?
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x), axis=1)
TypeError: ('sequence item 0: expected str instance, int found', 'occurred at index 2')
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(3, 3)),
columns=list('abc')
)
print(df)
a b c
0 0 2 7
1 3 8 7
2 0 6 8
You can use astype(str) ahead of the join:
df.astype(str).apply(', '.join, axis=1)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Using a comprehension
pd.Series([', '.join(l) for l in df.values.astype(str).tolist()], df.index)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Another option: after casting to str, append the separator to every cell, sum across each row, then strip the trailing separator:
In [75]: df2
Out[75]:
Col1 Col2 Col3
0 a a x
1 b b y
2 1 1 2
In [76]: df2.astype(str).add(', ').sum(1).str[:-2]
Out[76]:
0 a, a, x
1 b, b, y
2 1, 1, 2
dtype: object
You have to convert the columns to strings first.
import pandas as pd
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x.astype('str')), axis=1)
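A closely related idiom (a sketch along the same lines as the astype answers above) is to aggregate row-wise after the cast:
import pandas as pd

df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
out = df2.astype(str).agg(', '.join, axis=1)  # cast every cell to str, then join each row
print(out)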
