I have a list of dictionaries (row data) like below:
from typing import List, Dict, Any
testDict: List[Dict[str, Any]] = list(
(
{"A": 0.1, "B": 1, "E": "ABE"},
{"A": 0.11, "B": 20, "C": 0.2, "E": "ABCE"},
{"A": 0.11, "B": 3, "D": 33, "E": "ABDE"},
{"A": 0.13, "B": 40, "C": 0.5, "D": 23, "E": "ABCDE"},
)
)
testDict
[{'A': 0.1, 'B': 1, 'E': 'ABE'},
{'A': 0.11, 'B': 20, 'C': 0.2, 'E': 'ABCE'},
{'A': 0.11, 'B': 3, 'D': 33, 'E': 'ABDE'},
{'A': 0.13, 'B': 40, 'C': 0.5, 'D': 23, 'E': 'ABCDE'}]
I want to convert this testDict to a pandas dataframe. So, I did this:
testDf: pd.DataFrame = pd.json_normalize(data=testDict, max_level=1)
testDf
A B E C D
0 0.10 1 ABE NaN NaN
1 0.11 20 ABCE 0.2 NaN
2 0.11 3 ABDE NaN 33.0
3 0.13 40 ABCDE 0.5 23.0
However, I want the relative order of the keys to be maintained in the column names like [A, B, C, D, E] (or [A, B, D, C, E] only if I don't have the last entry).
The actual data has 100K such rows, with 256 distinct keys in total. Is there an easy way to achieve this? Or do I need to merge the key names, merge-sort style, to build the column order and use that?
Update 1:
I'm not looking for how to lexicographically sort the columns here. In each dict, the keys are already in a specific order. For any row, the relative order of the keys should remain the same. My sample list could be the following:
[{'XYZ': 0.1, 'ABC': 1, 'PQR': 'ABE'},
{'XYZ': 0.11, 'ABC': 20, 'KLM': 0.2, 'PQR': 'ABCE'},
{'XYZ': 0.11, 'ABC': 3, 'DEF': 33, 'PQR': 'ABDE'},
{'XYZ': 0.13, 'ABC': 40, 'KLM': 0.5, 'DEF': 23, 'PQR': 'ABCDE'}]
In this case, final column order should be ['XYZ', 'ABC', 'KLM', 'DEF', 'PQR']
You can just build the dataframe from testDict and explicitly declare the order of the columns:
df = pd.DataFrame(testDict, columns=['A', 'B', 'C', 'D', 'E'])
this prints as:
A B C D E
0 0.10 1 NaN NaN ABE
1 0.11 20 0.2 NaN ABCE
2 0.11 3 NaN 33.0 ABDE
3 0.13 40 0.5 23.0 ABCDE
Alternatively, if you do not want to explicitly declare the columns, you can sort them by name:
df = pd.DataFrame(testDict).sort_index(axis=1)
Okay, with your Update 1 I get it.
Note that you need Python 3.7 or later; before that, the order of the keys in a dictionary is not guaranteed by the language (in CPython 3.6 it is preserved only as an implementation detail).
That said, you need to go through all the dictionaries and mark the maximum index of each key, then just sort them in ascending position order.
Then you can use the resulting list to specify the order of the columns.
You could do this in the following way:
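import operator
colIndxs = {}
for dy in testDict:
    i = 0
    for key in dy.keys():
        # remember the largest position at which each key has been seen
        if key in colIndxs:
            colIndxs[key] = max(colIndxs[key], i)
        else:
            colIndxs[key] = i
        i += 1
# sort the keys by that position (the sort is stable, so ties keep first-seen order)
cols = list( dict( sorted(colIndxs.items(),key=operator.itemgetter(1))) )
df = pd.DataFrame(testDict, columns=cols)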
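The maximum-index approach above is a heuristic: when a key appears at very different positions in different rows (say, last in one long row and first in a short one), it can end up placed after a key it should precede. A stricter alternative, sketched here as an assumption and not as part of the original answer, is to treat every pair of adjacent keys in a row as an ordering constraint and topologically sort the keys with the standard library's graphlib (Python 3.9+). The variable names are illustrative only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

rows = [
    {"XYZ": 0.1, "ABC": 1, "PQR": "ABE"},
    {"XYZ": 0.11, "ABC": 20, "KLM": 0.2, "PQR": "ABCE"},
    {"XYZ": 0.11, "ABC": 3, "DEF": 33, "PQR": "ABDE"},
    {"XYZ": 0.13, "ABC": 40, "KLM": 0.5, "DEF": 23, "PQR": "ABCDE"},
]

sorter = TopologicalSorter()
for row in rows:
    keys = list(row)
    for key in keys:
        sorter.add(key)  # register the key even if it has no neighbours
    for before, after in zip(keys, keys[1:]):
        sorter.add(after, before)  # 'before' must come earlier than 'after'

# static_order() raises graphlib.CycleError if two rows disagree on the order
cols = list(sorter.static_order())
print(cols)
```

The resulting list can then be passed as columns=cols to pd.DataFrame, as in the answer above. When the rows leave the order of two keys undetermined, any order consistent with the constraints may be returned.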
I have a table of ids and previous ids (see image 1). I want to count the total number of unique ids linked in one chain. E.g. if we take the latest id as the 'parent', then the result for the example data below would be something like Image 2, where 'a' is linked to 5 ids in total (a, b, c, d & e) and 'w' is linked to 4 ids (w, x, y & z). In practice, I am dealing with randomly generated ids, not sequenced letters.
Python Code to produce example dataframes:
import pandas as pd
raw_data = pd.DataFrame([['a','b'], ['b','c'], ['c', 'd'],['d','e'],['e','-'],
['w','x'], ['x', 'y'], ['y','z'], ['z','-']], columns=['id', 'previous_id'])
output = pd.DataFrame([['a',5],['w',4]], columns = ['parent_id','linked_ids'])
First build a graph with nx.from_pandas_edgelist and find the chains with nx.connected_components. Then create a dictionary d that maps each id to its component number, keep the first row of each component by combining Series.map with Series.duplicated, and finally add the new column with Series.map, using mapp, a dictionary of component sizes built with Counter:
df1 = raw_data[raw_data['previous_id'].ne('-')]
import networkx as nx
from collections import Counter
g = nx.from_pandas_edgelist(df1,'id','previous_id')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}
c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)
{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}
df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
.rename(columns={'id':'parent_id'})
.assign(linked_ids = lambda x: x['parent_id'].map(mapp)))
print (df)
parent_id linked_ids
0 a 5
5 w 4
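If adding networkx as a dependency is not an option, the same grouping can be sketched with a small union-find in plain Python. This is an illustrative alternative and not part of the answer above; the edge list mirrors raw_data, '-' is assumed to mean "no previous id", and the helper names find and union are hypothetical.

```python
from collections import Counter

# (id, previous_id) pairs, mirroring raw_data; '-' marks the end of a chain
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "-"),
         ("w", "x"), ("x", "y"), ("y", "z"), ("z", "-")]

parent = {}

def find(node):
    """Return the representative of node's group, registering node if new."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node

def union(a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for id_, previous in edges:
    find(id_)
    if previous != "-":
        union(id_, previous)

# size of each group, keyed by its representative
sizes = Counter(find(node) for node in parent)

# a 'parent' id is one that never appears as somebody's previous_id
previous_ids = {previous for _, previous in edges}
linked = {id_: sizes[find(id_)] for id_, _ in edges if id_ not in previous_ids}
print(linked)
```

The linked dictionary maps each latest id to the number of ids in its chain, which is the same information as the parent_id / linked_ids table above.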
I have a dictionary that contains the names of various players with all values set to None like so...
players = {'A': None,
'B': None,
'C': None,
'D': None,
'E': None}
A pandas data frame (df_1) that contains the keys, i.e. player names
col_0 col_1 col_2
----- ----- -----
0 A B C
1 A E D
2 C B A
and a dataframe (df_2) that contains their scores in corresponding matches
score_0 score_1 score_2
----- ----- -----
0 1 10 2
1 6 15 7
2 8 1 9
Hence, total score of A is..
1 + 6 + 9 = 16
(0, score_0) + (1, score_0) + (2, score_2)
and I would like to map all the players(A, B, C..) to their total score in the dictionary of players that I had created earlier.
Here's some code that I wrote...
for player in players:
    players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
    players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
    players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
print(players)
This produces the desired result, but I am wondering if a faster, more pandas like way is available. Any help would be appreciated.
Use pandas stack: flatten both dataframes, then group the stacked scores by the stacked player names and sum:
s=df_2.stack().groupby(df_1.stack().values).sum()
s
A 16
B 11
C 10
D 7
E 15
dtype: int64
s.to_dict()
{'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}
You can generate such dictionary with:
import numpy as np
result = { k: np.nansum(df_2[df_1 == k]) for k in players }
For the given sample data, this will return:
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0}
If no values exist for a given key, it maps to zero. For example, if we add a key R to the players:
>>> players['R'] = None
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0, 'R': 0.0}
Or we can boost efficiency by first extracting numpy arrays out of the data frames:
arr_2 = df_2.values
arr_1 = df_1.values
result = { k: arr_2[arr_1 == k].sum() for k in players }
Benchmarks
If we define functions f (the original implementation), g (this implementation), and h (@WeNYoBen's implementation), and use timeit to measure the time for 1,000 calls with the given sample data, I get the following on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (which is unfortunately a bit busy at the moment):
>>> df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'], 'col_1': ['B', 'E', 'B'], 'col_2': ['C', 'D', 'A']})
>>> df_2 = pd.DataFrame({'score_0': [1, 6, 8], 'score_1': [10, 15, 1], 'score_2': [2, 7, 9]})
>>> def f():
...     for player in players:
...         players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
...         players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
...         players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
...     return players
...
>>> def g():
...     arr_2 = df_2.values
...     arr_1 = df_1.values
...     return { k: arr_2[arr_1 == k].sum() for k in players }
...
>>> def h():
...     return df_2.stack().groupby(df_1.stack().values).sum().to_dict()
...
>>> timeit(f, number=1000)
47.23081823496614
>>> timeit(g, number=1000)
0.32561282289680094
>>> timeit(h, number=1000)
8.169926556991413
The most important optimization is probably to use the numpy array instead of performing the calculations at the pandas level.
I'm looking for a way to sort a pandas DataFrame with a custom key. pd.DataFrame.sort_values doesn't accept a key function. I can convert the column to a list and pass a key to the sorted function, but that will be slow. The other way seems to involve a categorical index. I don't have a fixed number of rows, so I don't know if a categorical index is applicable.
I have given an example case of what kind of data I want to sort:
Input DataFrame:
clouds fluff
0 {[} 1
1 >>> 2
2 {1 3
3 123 4
4 AAsda 5
5 aad 6
Output DataFrame:
clouds fluff
0 >>> 2
1 {[} 1
2 {1 3
3 123 4
4 aad 6
5 AAsda 5
The rule for sorting (priority):
First, special characters (sorted by ASCII among themselves)
Next, numbers
Next, lower-case letters (lexicographical)
Next, upper-case letters (lexicographical)
In plain python I'd do it like
from functools import cmp_to_key
def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1
Case
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
Answer:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?
As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
where kf is the key function. It is applied to the whole column at once: it accepts a Series and must return a Series of the same length.
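As a rough sketch of what such a key function could look like for the rule in the question (special characters, then digits, then lower case, then upper case), each string can be mapped to a tuple of per-character ranks. The names char_rank and clouds_key are made up for this illustration, and the grouping relies on str.isdigit, str.islower and str.isupper, which is an assumption about how the four classes are meant to be defined.

```python
import pandas as pd

def char_rank(ch: str) -> tuple:
    """Rank one character: specials < digits < lower case < upper case."""
    if ch.isdigit():
        group = 1
    elif ch.islower():
        group = 2
    elif ch.isupper():
        group = 3
    else:
        group = 0  # everything else counts as a special character
    return (group, ord(ch))  # ties inside a group fall back to the code point

def clouds_key(series: pd.Series) -> pd.Series:
    """Key for sort_values: takes the column and returns a same-length Series."""
    return series.map(lambda text: tuple(char_rank(ch) for ch in text))

df = pd.DataFrame({
    "clouds": ["{[}", ">>>", "{1", "123", "AAsda", "aad"],
    "fluff": [1, 2, 3, 4, 5, 6],
})

result = df.sort_values(by="clouds", key=clouds_key).reset_index(drop=True)
print(result)
```

Python compares the tuples element by element, so strings are ordered character by character under the custom ranking. Because the key is built with Series.map it is not vectorized, so it may be slow on very large columns.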
As of pandas 1.2.0, I did this:
import numpy as np
import pandas as pd
df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])
# This is the sorting rule
rule = {
"DD": 1,
"AA": 10,
"aa": 20,
"dd": 30,
}
def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))
new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df) # DD, AA, aa, dd
Of course, you can also write it as a one-liner, but it may be harder to understand:
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
This might be useful, though I'm still not sure about special characters. Can they actually be sorted?
import pandas as pd
a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance,args = [int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
df[df['lower'] == True].sort_values(by=['a']),
df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
a upper lower int
3 1 NaN NaN True
0 2 NaN NaN True
6 3 NaN NaN True
4 a False True False
5 b False True False
2 c False True False
8 A True False False
1 B True False False
7 C True False False
You can also do it in one step, without creating the new True/False columns:
a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance,args = [int])].sort_values(by=['a']),
df[df['a'].str.islower() == True].sort_values(by=['a']),
df[df['a'].str.isupper() == True].sort_values(by=['a'])])
a
3 1
0 2
6 3
4 a
5 b
2 c
8 A
1 B
7 C
This seems to work:
from typing import Callable
import numpy as np
from pandas import DataFrame
def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """ Sort a dataframe from a column using the key """
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
I have a dataframe in which one column contains both json objects and strings. I want to get rid of the rows that do not contain json objects.
Below is what my dataframe looks like:
import pandas as pd
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]})
print(df)
How should I remove the rows that contain only strings, so that afterwards I can apply the code below to this column to convert the json objects into separate columns of the dataframe:
from pandas.io.json import json_normalize
df = json_normalize(df['A'])
print(df)
I think I would prefer to use an isinstance check:
In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
Out[11]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
If you want to include numbers too, you can do:
In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
Out[12]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
Adjust this to whichever types you want to include...
For the last step, json_normalize takes a list of json objects. For whatever reason a Series is no good (and gives the KeyError), but you can make it a list and you're good to go:
In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
In [22]: json_normalize(list(df1["A"]))
Out[22]:
a b c d e f
0 5.0 6.0 8.0 NaN NaN NaN
1 NaN NaN NaN 9.0 10.0 11.0
df[df.applymap(np.isreal).sum(1).gt(0)]
Out[794]:
A
2 {'a': 5, 'b': 6, 'c': 8}
5 {'d': 9, 'e': 10, 'f': 11}
If you want an ugly solution that also works, here's a function I created that finds the rows that contain only strings and returns the df minus those rows. (Since your df has only one column, you'll just get a dataframe containing 1 column with all dicts.) Then, from there, you'll want to use
df = json_normalize(df['A'].values) instead of just df = json_normalize(df['A']).
For a single column dataframe...
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
def delete_strings(df):
    nrows = df.shape[0]
    rows_to_keep = []
    for row in np.arange(nrows):
        if type(df.iloc[row,0]) == dict:
            # add the row number to the list of rows to keep if the row contains a dict
            rows_to_keep.append(row)
    # return only the rows with dicts, as a one-column dataframe so that df['A'] still works
    return df.iloc[rows_to_keep,[0]]
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                         {"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#   a   b   c
#0  5   6   8
#1  9  10  11
For a multi-column df (also works with a single column df):
def delete_rows_of_strings(df):
    rows = df.shape[0] # number of rows in df
    cols = df.shape[1] # number of columns in df
    rows_to_keep = [] # list to track rows to keep
    for row in np.arange(rows): # for every row in the dataframe
        # num_string will count the number of strings in the row
        num_string = 0
        for col in np.arange(cols): # for each column in the row...
            # if the value is a string, add one to num_string
            if type(df.iloc[row,col]) == str:
                num_string += 1
        # if num_string, the number of strings in the row,
        # isn't equal to the number of columns in the row...
        if num_string != cols: # ...add that row number to the list of rows to keep
            rows_to_keep.append(row)
    # return the df with rows containing at least one non-string
    return(df.iloc[rows_to_keep,:])
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
'B' : ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
# A B
#0 hello hi
#1 world {'a': 5, 'b': 6, 'c': 8}
#2 {'a': 5, 'b': 6, 'c': 8} sup
print(delete_rows_of_strings(df))
# A B
#1 world {'a': 5, 'b': 6, 'c': 8}
#2 {'a': 5, 'b': 6, 'c': 8} sup