How to sort pandas DataFrame with a key? - python

I'm looking for a way to sort a pandas DataFrame with a custom key. pd.DataFrame.sort_values doesn't accept a key function. I could convert the column to a list and pass a key to the sorted function, but that would be slow. The other option seems to involve a categorical index, but I don't have a fixed number of rows, so I don't know whether a categorical index is applicable.
Here is an example of the kind of data I want to sort:
Input DataFrame:
  clouds  fluff
0    {[}      1
1    >>>      2
2     {1      3
3    123      4
4  AAsda      5
5    aad      6
Output DataFrame:
  clouds  fluff
0    >>>      2
1    {[}      1
2     {1      3
3    123      4
4    aad      6
5  AAsda      5
The rule for sorting (in priority order):
1. Special characters first (sorted by ASCII value among themselves)
2. Then numbers
3. Then lowercase letters (lexicographical)
4. Then uppercase letters (lexicographical)
In plain Python I'd do it like this:
from functools import cmp_to_key
def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1
Case:
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
Answer:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?

As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
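where kf is the key function. Unlike the key of the builtin sorted, it is not called once per element: it receives the whole column as a Series and must return a Series of the same length, whose values are then used for the ordering.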
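For the ordering asked about in the question, one possible kf is sketched below. This is only an illustration and not part of the original answer: the names char_class, _TABLE and kf are made up here, and the sketch assumes the column holds plain strings of printable ASCII characters. Each character is remapped so that ordinary string comparison puts special characters first, then digits, then lowercase, then uppercase.

```python
import pandas as pd

# Printable ASCII characters in their natural (code point) order.
_PRINTABLE = [chr(i) for i in range(32, 127)]

def char_class(ch):
    """Rank of the character class: special < digit < lowercase < uppercase."""
    if ch.isdigit():
        return 1
    if ch.islower():
        return 2
    if ch.isupper():
        return 3
    return 0

# Characters in the desired order; the r-th one is translated to the r-th
# printable character, so plain string comparison reproduces that order.
_ORDERED = sorted(_PRINTABLE, key=lambda ch: (char_class(ch), ord(ch)))
_TABLE = str.maketrans("".join(_ORDERED), "".join(_PRINTABLE))

def kf(series):
    """Vectorized key: takes a Series of strings, returns a Series of sort keys."""
    return series.map(lambda s: s.translate(_TABLE))

df = pd.DataFrame({
    "clouds": ["{[}", ">>>", "{1", "123", "AAsda", "aad"],
    "fluff": [1, 2, 3, 4, 5, 6],
})
result = df.sort_values(by="clouds", key=kf, ignore_index=True)
print(result)
```

Characters outside printable ASCII are left untranslated by this table, so their position relative to the four classes is not defined by the sketch.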

As of pandas 1.2.0, I did this:
import numpy as np
import pandas as pd
df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])
# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
}
def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))
new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df) # DD, AA, aa, dd
Of course, you can also write it as a one-liner, though it may be harder to read:
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
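As a variation on the answer above, the dictionary lookup can also be done with Series.map, which accepts a dict directly. This is a sketch rather than the answerer's code; the function name rank_by_rule is made up, and 1000 is kept as the fallback rank for values missing from rule.

```python
import pandas as pd

df = pd.DataFrame(["aa", "dd", "DD", "AA"], columns=["data"])
rule = {"DD": 1, "AA": 10, "aa": 20, "dd": 30}

def rank_by_rule(series):
    # Series.map looks every value up in the dict; values missing from the
    # dict become NaN and are then given the fallback rank 1000.
    return series.map(rule).fillna(1000)

new_df = df.sort_values(by="data", key=rank_by_rule)
print(new_df["data"].tolist())
```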

This might be useful, though I'm still not sure about special characters. Can they actually be sorted?
import pandas as pd
a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance,args = [int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
                 df[df['lower'] == True].sort_values(by=['a']),
                 df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
   a  upper  lower    int
3  1    NaN    NaN   True
0  2    NaN    NaN   True
6  3    NaN    NaN   True
4  a  False   True  False
5  b  False   True  False
2  c  False   True  False
8  A   True  False  False
1  B   True  False  False
7  C   True  False  False
You can also do it in one step, without creating the new True/False columns:
a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance,args = [int])].sort_values(by=['a']),
                 df[df['a'].str.islower() == True].sort_values(by=['a']),
                 df[df['a'].str.isupper() == True].sort_values(by=['a'])])
print(df2)
   a
3  1
0  2
6  3
4  a
5  b
2  c
8  A
1  B
7  C

This seems to work:
import numpy as np
from typing import Callable
from pandas import DataFrame
def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """ Sort a dataframe from a column using the key """
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
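On pandas 1.1 or later, a per-element key like the one used in these tests can probably be handed to sort_values directly by lifting it with Series.map. The sketch below assumes that approach; the function name sort_by_elementwise_key is hypothetical and this is not the answerer's implementation.

```python
import pandas as pd

def sort_by_elementwise_key(dataframe, column, key):
    """Sort by `column`, applying the scalar function `key` to each value."""
    return dataframe.sort_values(
        by=column,
        key=lambda s: s.map(key),  # lift the scalar key to a Series -> Series function
        kind="mergesort",          # stable, like the builtin sorted()
        ignore_index=True,         # renumber the rows 0..n-1
    )

dataframe = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3},
                          {'a': 2, 'b': 1, 'c': 1},
                          {'a': 3, 'b': 4, 'c': 0}])
by_b_desc = sort_by_elementwise_key(dataframe, column='b', key=lambda x: -x)
print(by_b_desc)
```

Because the rows are selected rather than rebuilt from a NumPy array, column dtypes are kept even when the columns have mixed types.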

Related

Replacing multiple columns with values in pandas

I am replacing values in multiple columns with the pd.DataFrame.replace method. However, nothing is updated inside my loop, and I cannot understand why.
For example:
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 2, 2],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
operators = { 'A':{ 0 : 2 } , 'B': { 5 : 8 }, 'C': None }
for keys, values in operators.items():
    if values == None:
        continue
    else:
        for existing, new in values.items():
            if keys == 'A' and new is not None:
                print(keys, existing, new)
                df.replace({keys: existing}, new)
            elif keys == 'B' and new is not None:
                df.replace({keys: existing}, new)
            else:
                df.replace({keys: existing}, new)
Printing the DataFrame afterwards shows exactly the same values as before.
You need to store your replacements: by default, replace returns a new DataFrame and leaves df untouched. That means you should do one of the following:
1. Pass inplace=True:
df.replace({keys: existing}, new, inplace=True)
2. Assign the result back to df, without inplace:
df = df.replace({keys: existing}, new)
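A minimal sketch of the difference, using a smaller DataFrame than the one in the question (the variable names unchanged and changed are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [5, 6, 7]})

# Result discarded: df itself is left unchanged.
df.replace({'A': 0}, 2)
unchanged = df['A'].tolist()

# Result assigned back: the replacement is kept.
df = df.replace({'A': 0}, 2)
changed = df['A'].tolist()

print(unchanged, changed)
```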
Loop
Anyway, why do you need that monstrous 👾 loop?
You can simply pass a dict to pandas.DataFrame.replace, and it works like a charm.
It does not support 'C': None though: the dictionary should either map value to value for all columns, as in {0: 2, 5: 8}, or map each column to a {value: value} dictionary, as you have for 'A' and 'B'.
Let's get rid of your None and do the replacement in one go:
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 2, 2],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
operators = { 'A':{ 0 : 2 } , 'B': { 5 : 8 }, 'C': None }
# df = df.replace({k: v for k, v in operators.items() if v is not None})
df.replace({k: v for k, v in operators.items() if v is not None}, inplace=True)
print(df)
Output:
   A  B  C
0  2  8  a
1  1  6  b
2  2  7  c
3  2  8  d
4  2  9  e

Multiply pandas series by dictionary values

I have a pandas Series
A    3
B    5
and a dictionary
dic = {'A': 4, 'B': 3}
I want to multiply each element of the Series by the dictionary value with the matching key, so that the outcome is
A    12
B    15
Is this possible?
I've tried
s = s.mul(s.map(dic))
This should do it if you convert your pandas input series into a DataFrame:
df = pd.DataFrame({
    'a': [1,2,3,4,5],
    'b': [5,4,3,3,4],
    'c': [3,2,4,3,10],
    'd': [3, 2, 1, 1, 1]
})
params = {'a': 2.5, 'b': 3.0, 'c': 1.3, 'd': 0.9}
df1 = df.assign(**params).mul(df)
Just make a Series out of your dictionary and you can multiply the two directly:
s1 = pd.Series({'A':3 , 'B':5})
s2 = pd.Series({'A':4 , 'B':3})
s3 = s1*s2
print(s3)
A    12
B    15
dtype: int64
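The attempt in the question, s.map(dic), looks up the Series values (3 and 5) in the dictionary rather than the index labels, which is why it does not produce the factors. A sketch of the index-aligned multiplication follows; the extra label 'C' and the names factors, aligned and filled are added here only to show what happens when a key is missing.

```python
import pandas as pd

s = pd.Series({'A': 3, 'B': 5, 'C': 2})
dic = {'A': 4, 'B': 3}

factors = pd.Series(dic)

# Multiplication aligns on the index; 'C' has no factor, so it becomes NaN.
aligned = s * factors

# fill_value=1 treats a missing factor as 1, leaving 'C' unchanged.
filled = s.mul(factors, fill_value=1)
print(filled)
```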
Depending on what you want to achieve, this function may also help.
def multiply(series, dictionary):
    r = {}
    for k in series.keys():
        if k in dictionary:
            r[k] = series[k] * dictionary[k]
    return pd.Series(r)

Count of rows linked by ids in Pandas dataframe

I have a table of ids and previous ids (see image 1). I want to count the total number of unique ids linked in one chain. For example, if we take the latest id as the 'parent', the result for the example data below would be something like image 2, where 'a' is linked to 5 ids in total (a, b, c, d and e) and 'w' is linked to 4 ids (w, x, y and z). In practice, I am dealing with randomly generated ids, not sequenced letters.
Python Code to produce example dataframes:
import pandas as pd
raw_data = pd.DataFrame([['a','b'], ['b','c'], ['c', 'd'],['d','e'],['e','-'],
                         ['w','x'], ['x', 'y'], ['y','z'], ['z','-']], columns=['id', 'previous_id'])
output = pd.DataFrame([['a',5],['w',4]], columns = ['parent_id','linked_ids'])
Use convert_matrix.from_pandas_edgelist with connected_components first, then build a dictionary d that maps each id to the number of its component. Counting the ids per component with Counter gives the mapp dictionary (id to chain size). Finally, keep the first id of each component by filtering with Series.map and Series.duplicated, and add the new column with Series.map and mapp:
df1 = raw_data[raw_data['previous_id'].ne('-')]
import networkx as nx
from collections import Counter
g = nx.from_pandas_edgelist(df1,'id','previous_id')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}
c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)
{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}
df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
              .rename(columns={'id':'parent_id'})
              .assign(linked_ids = lambda x: x['parent_id'].map(mapp)))
print (df)
  parent_id  linked_ids
0         a           5
5         w           4
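If networkx is not available, the chains can also be walked with a plain dictionary. The sketch below is an alternative to the answer above, not a restatement of it, and it assumes that every id has at most one previous_id and that '-' marks the end of a chain. The names previous, referenced and chain_length are made up for the example.

```python
import pandas as pd

raw_data = pd.DataFrame(
    [['a', 'b'], ['b', 'c'], ['c', 'd'], ['d', 'e'], ['e', '-'],
     ['w', 'x'], ['x', 'y'], ['y', 'z'], ['z', '-']],
    columns=['id', 'previous_id'])

previous = dict(zip(raw_data['id'], raw_data['previous_id']))
referenced = set(raw_data['previous_id'])

# A parent is an id that no other row points to as its previous_id.
parents = [i for i in raw_data['id'] if i not in referenced]

def chain_length(start):
    """Count the ids reached by following previous_id links from start."""
    seen = set()
    current = start
    # Stop at the end marker (not a key of `previous`) or if a cycle shows up.
    while current in previous and current not in seen:
        seen.add(current)
        current = previous[current]
    return len(seen)

output = pd.DataFrame({
    'parent_id': parents,
    'linked_ids': [chain_length(p) for p in parents],
})
print(output)
```

Unlike the connected-components approach, this only counts ids reachable from each parent, so it gives different results if chains branch or merge.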

drop non-json object rows from python dataframe column

I have a DataFrame in which one column contains both JSON objects and strings. I want to get rid of the rows that do not contain JSON objects.
Below is what my DataFrame looks like:
import pandas as pd
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]})
print(df)
How should I remove the rows that contain only strings, so that afterwards I can apply the code below to this column to convert the JSON objects into separate DataFrame columns?
from pandas.io.json import json_normalize
df = json_normalize(df['A'])
print(df)
I think I would prefer to use an isinstance check:
In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
Out[11]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
If you want to include numbers too, you can do:
In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
Out[12]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
Adjust this to whichever types you want to include...
The last step: json_normalize takes a list of JSON objects. For whatever reason a Series is no good (it gives a KeyError), so convert the column to a list and you're good to go:
In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
In [22]: json_normalize(list(df1["A"]))
Out[22]:
     a    b    c    d     e     f
0  5.0  6.0  8.0  NaN   NaN   NaN
1  NaN  NaN  NaN  9.0  10.0  11.0
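Put together for the data in the question, the filter and the normalization might look like the sketch below. It uses pd.json_normalize, the top-level function available since pandas 1.0; the names is_dict and flat are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"a": 9, "b": 10, "c": 11}]})

# Boolean mask: True only where the cell holds a dict.
is_dict = df['A'].map(lambda value: isinstance(value, dict))

# Pass a plain list of dicts to json_normalize.
flat = pd.json_normalize(df.loc[is_dict, 'A'].tolist())
print(flat)
```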
df[df.applymap(np.isreal).sum(1).gt(0)]
Out[794]:
                            A
2    {'a': 5, 'b': 6, 'c': 8}
5  {'d': 9, 'e': 10, 'f': 11}
If you want an ugly solution that also works, here's a function I created that finds the rows that contain only strings and returns the df without those rows. (Since your df has only one column, you'll just get a DataFrame containing one column with all dicts.) From there, you'll want to use
df = json_normalize(df['A'].values) instead of just df = json_normalize(df['A']).
For a single-column DataFrame:
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
def delete_strings(df):
    nrows = df.shape[0]
    rows_to_keep = []
    for row in np.arange(nrows):
        if type(df.iloc[row, 0]) == dict:
            # add the row number to the list of rows to keep
            # if the row contains a dict
            rows_to_keep.append(row)
    # return only rows with dicts, as a one-column DataFrame
    # ([0] rather than 0, so that df['A'] still works afterwards)
    return df.iloc[rows_to_keep, [0]]
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                         {"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#   a   b   c
#0  5   6   8
#1  9  10  11
For a multi-column df (also works with a single-column df):
def delete_rows_of_strings(df):
    rows = df.shape[0]  # number of rows in df
    cols = df.shape[1]  # number of columns in df
    rows_to_keep = []  # list to track rows to keep
    for row in np.arange(rows):  # for every row in the dataframe
        # num_string will count the number of strings in the row
        num_string = 0
        for col in np.arange(cols):  # for each column in the row...
            # if the value is a string, add one to num_string
            if type(df.iloc[row, col]) == str:
                num_string += 1
        # if num_string, the number of strings in the row,
        # isn't equal to the number of columns in the row...
        if num_string != cols:  # ...add that row number to the list of rows to keep
            rows_to_keep.append(row)
    # return the df with rows containing at least one non-string
    return df.iloc[rows_to_keep, :]
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
                   'B' : ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
#                          A                         B
#0                     hello                        hi
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup
#3                       usa                   america
#4                     india                     china
print(delete_rows_of_strings(df))
#                          A                         B
#1                     world  {'a': 5, 'b': 6, 'c': 8}
#2  {'a': 5, 'b': 6, 'c': 8}                       sup

python: How can combine rows in dataframe

I tried to combine rows with the apply function in a DataFrame but couldn't.
I would like to combine rows into one list when their (c1, c2) values are the same.
For example:
Dataframe df1
   c1 c2  c3
0   0  x  {'a':1 ,'b':2}
1   0  x  {'a':3 ,'b':4}
2   0  y  {'a':5 ,'b':6}
3   0  y  {'a':7 ,'b':8}
4   2  x  {'a':9 ,'b':10}
5   2  x  {'a':11 ,'b':12}
expected result
Dataframe df1
   c1 c2  c3
0   0  x  [{'a':1 ,'b':2},{'a':3 ,'b':4}]
1   0  y  [{'a':5 ,'b':6},{'a':7 ,'b':8}]
2   2  x  [{'a':9 ,'b':10},{'a':11,'b':12}]
Source Pandas DF:
In [20]: df
Out[20]:
   c1 c2                  c3
0   0  x    {'a': 1, 'b': 2}
1   0  x    {'a': 3, 'b': 4}
2   0  y    {'a': 5, 'b': 6}
3   0  y    {'a': 7, 'b': 8}
4   2  x   {'a': 9, 'b': 10}
5   2  x  {'a': 11, 'b': 12}
Solution:
In [21]: df.groupby(['c1','c2'])['c3'].apply(list).to_frame('c3').reset_index()
Out[21]:
   c1 c2                                       c3
0   0  x     [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1   0  y     [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2   2  x  [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
NOTE: I would recommend avoiding non-scalar values in pandas DataFrame cells, as this can cause various difficulties and performance issues.
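For reference, here is a self-contained version of the solution above that builds the source DataFrame first. The variable name combined is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'c1': [0, 0, 0, 0, 2, 2],
    'c2': ['x', 'x', 'y', 'y', 'x', 'x'],
    'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4},
           {'a': 5, 'b': 6}, {'a': 7, 'b': 8},
           {'a': 9, 'b': 10}, {'a': 11, 'b': 12}],
})

# Group on (c1, c2), collect each group's c3 values into a list,
# then turn the group keys back into ordinary columns.
combined = df.groupby(['c1', 'c2'])['c3'].apply(list).to_frame('c3').reset_index()
print(combined)
```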
