I have a dictionary that contains the names of various players with all values set to None like so...
players = {'A': None,
           'B': None,
           'C': None,
           'D': None,
           'E': None}
a pandas DataFrame (df_1) that contains the keys, i.e. the player names,
  col_0 col_1 col_2
  ----- ----- -----
0     A     B     C
1     A     E     D
2     C     B     A
and a DataFrame (df_2) that contains their scores in the corresponding matches:
  score_0 score_1 score_2
  ------- ------- -------
0       1      10       2
1       6      15       7
2       8       1       9
Hence, the total score of A is
1 + 6 + 9 = 16
i.e. (row 0, score_0) + (row 1, score_0) + (row 2, score_2),
and I would like to map all the players (A, B, C, ...) to their total scores in the dictionary of players that I created earlier.
Here's some code that I wrote...
for player in players:
    players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
    players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
    players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
print(players)
This produces the desired result, but I am wondering if a faster, more pandas-like way is available. Any help would be appreciated.
Pandas stack: usually we can groupby after flattening the DataFrames.
s = df_2.stack().groupby(df_1.stack().values).sum()
s
A    16
B    11
C    10
D     7
E    15
dtype: int64
s.to_dict()
{'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}
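If players can contain names that never appear in df_1, a hedged variant is to reindex the grouped result on the dictionary keys, so absent players map to 0 instead of being dropped (a minimal sketch, assuming the same df_1/df_2 as above):
s = df_2.stack().groupby(df_1.stack().values).sum()
# Missing players get a 0 total instead of disappearing from the result.
print(s.reindex(players, fill_value=0).to_dict())
# {'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}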
You can generate such a dictionary with:
import numpy as np
result = { k: np.nansum(df_2[df_1 == k]) for k in players }
For the given sample data, this will return:
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0}
If no values exist for a given key, it will map to zero. For example, if we add a key R to the players:
>>> players['R'] = None
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0, 'R': 0.0}
Or we can boost efficiency by first extracting numpy arrays out of the data frames:
arr_2 = df_2.values
arr_1 = df_1.values
result = { k: arr_2[arr_1 == k].sum() for k in players }
Benchmarks
If we define functions f (the original implementation), g (this implementation), and h (@WeNYoBen's implementation), and we use timeit to measure the time for 1,000 calls with the given sample data, I get the following for an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (that is unfortunately a bit busy at the moment):
>>> df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'], 'col_1': ['B', 'E', 'B'], 'col_2': ['C', 'D', 'A']})
>>> df_2 = pd.DataFrame({'score_0': [1, 6, 8], 'score_1': [10, 15, 1], 'score_2': [2, 7, 9]})
>>> def f():
...     for player in players:
...         players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
...         players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
...         players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
...     return players
...
>>> def g():
...     arr_2 = df_2.values
...     arr_1 = df_1.values
...     result = { k: arr_2[arr_1 == k].sum() for k in players }
...     return result
...
>>> def h():
...     return df_2.stack().groupby(df_1.stack().values).sum().to_dict()
...
>>> timeit(f, number=1000)
47.23081823496614
>>> timeit(g, number=1000)
0.32561282289680094
>>> timeit(h, number=1000)
8.169926556991413
The most important optimization is probably to use the numpy array instead of performing the calculations at the pandas level.
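Going one step further, the per-player Python loop can be removed entirely by flattening once and letting NumPy aggregate all players in a single pass. This is an optional sketch (not part of the original answers), assuming the same df_1/df_2 as in the benchmarks:
import numpy as np

# Flatten both frames once, then sum scores per unique name in one pass.
names = df_1.to_numpy().ravel()
scores = df_2.to_numpy().ravel()
uniq, inv = np.unique(names, return_inverse=True)
result = dict(zip(uniq, np.bincount(inv, weights=scores)))
# {'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0}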
Related
I have a pandas series
A 3
B 5
and a dictionary
dic = {'A': 4, 'B': 3}
I want to match this series against the dictionary by its keys and multiply by the corresponding values.
So the outcome is
A 12
B 15
Is this possible?
I've tried
s = s.mul(s.map(dic))
This should do it if you convert your pandas input series into a DataFrame:
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [5, 4, 3, 3, 4],
    'c': [3, 2, 4, 3, 10],
    'd': [3, 2, 1, 1, 1]
})
params = {'a': 2.5, 'b': 3.0, 'c': 1.3, 'd': 0.9}
df1 = df.assign(**params).mul(df)
Just make a Series out of your dictionary and you can multiply the two straightforwardly:
s1 = pd.Series({'A': 3, 'B': 5})
s2 = pd.Series({'A': 4, 'B': 3})
s3 = s1 * s2
print(s3)
A    12
B    15
dtype: int64
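One caveat: the multiplication aligns on the index, so a key present on only one side yields NaN. If that matters, mul with a fill_value changes the behavior. A small sketch (the 'C' key is a hypothetical extra entry):
s1 = pd.Series({'A': 3, 'B': 5, 'C': 7})
s2 = pd.Series({'A': 4, 'B': 3})
print((s1 * s2)['C'])                  # nan: 'C' is missing from s2
print(s1.mul(s2, fill_value=1)['C'])   # 7.0: missing keys treated as 1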
Depending on what you want to achieve, this function may also help.
def multiply(series, dictionary):
    r = {}
    for k in series.keys():
        if k in dictionary:
            r[k] = series[k] * dictionary[k]
    return pd.Series(r)
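For example, used with the data from the question (a sketch; note that keys present in the series but absent from the dictionary are silently dropped):
import pandas as pd

s = pd.Series({'A': 3, 'B': 5})
dic = {'A': 4, 'B': 3}
print(multiply(s, dic))
# A    12
# B    15
# dtype: int64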
I have a table of ids and previous ids (see Image 1). I want to count the total number of unique ids linked in one chain. E.g. if we take the latest id as the 'parent', then the result for the example data below would be something like Image 2, where 'a' is linked to 5 ids in total (a, b, c, d & e) and 'w' is linked to 4 ids (w, x, y & z). In practice, I am dealing with randomly generated ids, not sequenced letters.
Python Code to produce example dataframes:
import pandas as pd
raw_data = pd.DataFrame([['a', 'b'], ['b', 'c'], ['c', 'd'], ['d', 'e'], ['e', '-'],
                         ['w', 'x'], ['x', 'y'], ['y', 'z'], ['z', '-']],
                        columns=['id', 'previous_id'])
output = pd.DataFrame([['a', 5], ['w', 4]], columns=['parent_id', 'linked_ids'])
Use networkx.from_pandas_edgelist with connected_components first, then create a dictionary for mapping, get the first mapped value per group via Series.map filtered by Series.duplicated, and finally add the new column with Series.map using a Counter of the mapping dictionary:
df1 = raw_data[raw_data['previous_id'].ne('-')]
import networkx as nx
from collections import Counter
g = nx.from_pandas_edgelist(df1,'id','previous_id')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}
c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)
{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}
df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
              .rename(columns={'id': 'parent_id'})
              .assign(linked_ids=lambda x: x['parent_id'].map(mapp)))
print (df)
  parent_id  linked_ids
0         a           5
5         w           4
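As a quick sanity check (a sketch, reusing the output frame defined in the question):
# Should match the expected `output` frame once the index is reset.
print(df.reset_index(drop=True).equals(output))  # True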
I have a series of filenames containing key,value pairs. For example, filename1 contains:
A : U
B : 10
C : checksum1
I would like to get a set of values based on a selection of unique values of other keys.
For example, if my key values in files can be represented like:
A B C D
-------------------------
U 10 checksum1 filename1
U 10 checksum2 filename2
U 20 checksum3 filename3
V 20 checksum4 filename4
V 20 checksum5 filename5
I would like to obtain:
t = table.unique_values_for(["A","B"])
# [("U",10), ("U",20), ("V,20")]
t.result_for_unique(["C","D"])
# [
#   [(checksum1, filename1), (checksum2, filename2)],  <- result for ("U", 10)
#   [(checksum3, filename3)],                          <- result for ("U", 20)
#   [(checksum4, filename4), (checksum5, filename5)]   <- result for ("V", 20)
# ]
I have tried with plain dicts, pandas, astropy.table.
This is one of the tests I have tried so far:
import numpy as np
from astropy.table import Table
# get_fits_header is a project-specific helper (not shown here)

class minidb():
    def __init__(self, pattern):
        if isinstance(pattern, str):
            pattern = [pattern]
        self.pattern = pattern
        self.heads = [ get_fits_header(f, fast=True) for f in pattern ]
        keys = self.heads[0].keys()
        values = [ [ h.get(k) for h in self.heads ] for k in keys ]
        dic = dict(zip(keys, values))
        dic["ARP FILENAME"] = pattern  # adding filename
        self.dic = dic
        self.table = Table(dic)  # original
        self.data = self.table
        self.unique = None
        self.names = None

    def unique_for(self, keys):
        # if isinstance(keys, str):
        #     keys = [keys]
        self.data = self.table.group_by(keys)
        self.unique = self.data.groups.keys.as_array().tolist()
        return self.unique

    def names_for(self, keys):
        if isinstance(keys, str):
            keys = [keys]
        self.names = [ np.array(g[keys]).tolist() for g in self.data.groups ]
        self.data = self.table[keys]
        return self.names
Pandas can do this easily using groupby:
In [1]: df = pd.DataFrame([
...: dict(A='U', B=10, C=1, D=1),
...: dict(A='U', B=10, C=2, D=2),
...: dict(A='U', B=20, C=3, D=3),
...: dict(A='V', B=20, C=4, D=4),
...: dict(A='V', B=20, C=5, D=5)
...: ])
In [2]: list(df.groupby(['A', 'B']))
Out[2]:
[(('U', 10),
A B C D
0 U 10 1 1
1 U 10 2 2),
(('U', 20),
A B C D
2 U 20 3 3),
(('V', 20),
A B C D
3 V 20 4 4
4 V 20 5 5)]
Each element in that list is a tuple of the key (the values of "A" and "B") and a dataframe (technically a view into the original dataframe) containing just the rows that have those values for "A" and "B". You can loop over the grouped results and extract whatever information you want from "C" and "D", just as you would normally get data out of a dataframe.
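For instance, to reproduce the nested result from the question, one sketch (assuming the df built above) is:
# Collect the ('C', 'D') pairs for every unique ('A', 'B') combination.
res = {key: list(zip(grp['C'], grp['D'])) for key, grp in df.groupby(['A', 'B'])}
# -> {('U', 10): [(1, 1), (2, 2)], ('U', 20): [(3, 3)], ('V', 20): [(4, 4), (5, 5)]}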
You can use itertools.groupby in order to group the data by the first two elements. This requires the data to be already ordered by their keys; if it isn't you can use sorted beforehand.
import itertools as it
data = [
    ('U', 10, 'checksum1', 'filename1'),
    ('U', 10, 'checksum2', 'filename2'),
    ('U', 20, 'checksum3', 'filename3'),
    ('V', 20, 'checksum4', 'filename4'),
    ('V', 20, 'checksum5', 'filename5'),
]
result = [list(g) for k, g in it.groupby(data, lambda x: x[:2])]
result = [[x[2:] for x in group] for group in result] # optionally drop the first two items
print(result)
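As mentioned, groupby only merges adjacent equal keys, so if the data is not pre-sorted, a one-line sort (a sketch) restores that invariant first:
data = sorted(data, key=lambda x: x[:2])  # make equal (A, B) keys adjacent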
astropy.table can do this as well, in a manner similar to pandas DataFrame:
>>> text = """A B C D
... -------------------------
... U 10 checksum1 filename1
... U 10 checksum2 filename2
... U 20 checksum3 filename3
... V 20 checksum4 filename4
... V 20 checksum5 filename5"""
...
>>> dat = Table.read(text, format='ascii', data_start=2)
>>> dat
<Table length=5>
A B C D
str1 int64 str9 str9
---- ----- --------- ---------
U 10 checksum1 filename1
U 10 checksum2 filename2
U 20 checksum3 filename3
V 20 checksum4 filename4
V 20 checksum5 filename5
>>> list(dat.group_by(['A', 'B']).groups)
[<Table length=2>
A B C D
str1 int64 str9 str9
---- ----- --------- ---------
U 10 checksum1 filename1
U 10 checksum2 filename2,
<Table length=1>
A B C D
str1 int64 str9 str9
---- ----- --------- ---------
U 20 checksum3 filename3,
<Table length=2>
A B C D
str1 int64 str9 str9
---- ----- --------- ---------
V 20 checksum4 filename4
V 20 checksum5 filename5]
Generalizing @a_guest's answer, I had:
data = [
    ('U', 10, 'checksum1', 'filename1'),
    ('U', 10, 'checksum2', 'filename2'),
    ('U', 20, 'checksum3', 'filename3'),
    ('V', 20, 'checksum4', 'filename4'),
    ('V', 20, 'checksum5', 'filename5'),
]
data = [dict(zip(('A', 'B', 'C', 'D'), x)) for x in data]
# [{'A': 'U', 'B': 10, 'C': 'checksum1', 'D': 'filename1'},
# {'A': 'U', 'B': 10, 'C': 'checksum2', 'D': 'filename2'},
# {'A': 'U', 'B': 20, 'C': 'checksum3', 'D': 'filename3'},
# {'A': 'V', 'B': 20, 'C': 'checksum4', 'D': 'filename4'},
# {'A': 'V', 'B': 20, 'C': 'checksum5', 'D': 'filename5'}]
Then, "grouping by" A and B:
keys=["A","B"]
result = [ list(g) for k,g in it.groupby(data, lambda x: (tuple(x[k] for k in keys)) ) ]
# [[{'A': 'U', 'B': 10, 'C': 'checksum1', 'D': 'filename1'},
# {'A': 'U', 'B': 10, 'C': 'checksum2', 'D': 'filename2'}],
# [{'A': 'U', 'B': 20, 'C': 'checksum3', 'D': 'filename3'}],
# [{'A': 'V', 'B': 20, 'C': 'checksum4', 'D': 'filename4'},
# {'A': 'V', 'B': 20, 'C': 'checksum5', 'D': 'filename5'}]]
and "extracting" C and D for these groups:
names=["C","D"]
res2 = [ [ tuple(x[n] for n in names) for x in r] for r in result ]
# [[('checksum1', 'filename1'), ('checksum2', 'filename2')],
# [('checksum3', 'filename3')],
# [('checksum4', 'filename4'), ('checksum5', 'filename5')]]
So a representation can be:
values = [ tuple(d.get(k) for k in keys) for d in data ]
res3 = dict(zip(dict.fromkeys(values), res2))  # dict.fromkeys keeps first-occurrence order; a plain set() could scramble the pairing
# {('U', 10): [('checksum1', 'filename1'), ('checksum2', 'filename2')],
# ('U', 20): [('checksum3', 'filename3')],
# ('V', 20): [('checksum4', 'filename4'), ('checksum5', 'filename5')]}
I don't know if my list comprehensions can be simplified or whether the itertools import can be avoided. I am new to this.
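One possible simplification (a sketch under the same assumptions, using the data, keys, and names defined above) builds the final mapping in a single pass with collections.defaultdict, avoiding both itertools and the pre-sorting requirement:
from collections import defaultdict

# Build the {(A, B): [(C, D), ...]} mapping in one pass; no sorting needed.
res = defaultdict(list)
for d in data:
    res[tuple(d[k] for k in keys)].append(tuple(d[n] for n in names))
print(dict(res))
# {('U', 10): [('checksum1', 'filename1'), ('checksum2', 'filename2')],
#  ('U', 20): [('checksum3', 'filename3')],
#  ('V', 20): [('checksum4', 'filename4'), ('checksum5', 'filename5')]}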
I'm looking for a way to sort a pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function. I can convert the column to a list and apply a key to the sorted function, but that will be slow. The other way seems to be something related to a categorical index. I don't have a fixed number of rows, so I don't know if a categorical index would be applicable.
I have given an example case of what kind of data I want to sort:
Input DataFrame:
clouds fluff
0 {[} 1
1 >>> 2
2 {1 3
3 123 4
4 AAsda 5
5 aad 6
Output DataFrame:
clouds fluff
0 >>> 2
1 {[} 1
2 {1 3
3 123 4
4 aad 6
5 AAsda 5
The rules for sorting (in priority order):
First, special characters (sorted by ASCII among themselves);
next, numbers;
next, lower-case alphabets (lexicographical);
next, upper-case alphabets (lexicographical).
In plain python I'd do it like
from functools import cmp_to_key
def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1
For example:
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
Answer:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?
As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
where kf is a key function that operates on a Series: it accepts a Series and returns a Series.
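For the ordering described in the question, one hypothetical kf (a sketch: it ranks by character class first and ASCII within the class, which may differ in fine detail from the question's example output) could be:
import pandas as pd

# A sketch of a hypothetical kf: map each value to a (class_rank, text) tuple,
# so pandas sorts by character class first and by the text itself second.
def kf(col: pd.Series) -> pd.Series:
    def rank(v):
        v = str(v)
        if not v[0].isalnum():
            return (0, v)   # special characters first
        if v[0].isdigit():
            return (1, v)   # then numbers
        if v[0].islower():
            return (2, v)   # then lower case
        return (3, v)       # capitals last
    return col.map(rank)

df = pd.DataFrame({'clouds': ['{[}', '>>>', '{1', '123', 'AAsda', 'aad'],
                   'fluff': [1, 2, 3, 4, 5, 6]})
print(df.sort_values(by='clouds', key=kf))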
As of pandas 1.2.0, I did this:
import numpy as np
import pandas as pd
df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])
# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
}
def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))
new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df) # DD, AA, aa, dd
Of course, you can also do it in one line, but it may be harder to read :)
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
This might be useful, though I am still not sure about special characters: can they actually be sorted?
import pandas as pd
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance, args=[int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
                 df[df['lower'] == True].sort_values(by=['a']),
                 df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
   a  upper  lower    int
3  1    NaN    NaN   True
0  2    NaN    NaN   True
6  3    NaN    NaN   True
4  a  False   True  False
5  b  False   True  False
2  c  False   True  False
8  A   True  False  False
1  B   True  False  False
7  C   True  False  False
You can also do it in one step, without creating the new True/False columns:
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance, args=[int])].sort_values(by=['a']),
                 df[df['a'].str.islower() == True].sort_values(by=['a']),
                 df[df['a'].str.isupper() == True].sort_values(by=['a'])])
   a
3  1
0  2
6  3
4  a
5  b
2  c
8  A
1  B
7  C
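Note the explicit == True: on the mixed column, .str.isupper() returns NaN for the non-string entries, and NaN == True evaluates to False, which is exactly what drops those rows from each mask. A quick check (a sketch, reusing the df defined above):
# Non-strings come back as NaN, so `== True` filters them out of the mask.
print(df['a'].str.isupper().tolist())
# [nan, True, False, nan, False, False, nan, True, True]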
This seems to work:
from typing import Callable

import numpy as np
from pandas import DataFrame

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """Sort a dataframe by a column using the key."""
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
I want to create a matrix.
Input:
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
    ...
]
Output:
     a  p  cat  g
1st  2  0    0  1
2nd  5  3    4  0
This is my code, but I think it's not smart, and it will be very slow when the data size is huge.
Are there any good ways to do this?
Thank you.
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # ['a', 'p', 'g', 'cat']

### Create matrix ###
result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[key_map.index(k)] = v
    result.append(matrix)

print(result)
# [[2, 0, 0, 1], [5, 3, 4, 0]]
Edited
By #wwii advice. Use Pandas looks good:
from pandas import DataFrame
result = DataFrame(data, index=range(len(data)))
print(result.fillna(0).astype(int).to_numpy().tolist())  # as_matrix() is deprecated
# [[2, 0, 1, 0], [5, 4, 0, 3]]
You can use a set comprehension to generate the key_map:
key_map = list({key for row in data for key in row})
Here is a partial answer. I couldn't get the columns in the order specified - it is limited by how the keys get ordered in the set, key_map. It uses string formatting to line the data up - you can play around with the spacing to fit larger or smaller numbers.
# ordinal from
# http://code.activestate.com/recipes/576888-format-a-number-as-an-ordinal/
from ordinal import ordinal
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]

### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map)  # ['a', 'p', 'g', 'cat']

# strings to format the output
header = '{: >10}{: >8}{: >8}{: >8}'.format(*key_map)
line_fmt = '{: <8}{: >2}{: >8}{: >8}{: >8}'
print(header)

def ordered_data(d, keys):
    """Returns an ordered list of dictionary values.

    returns 0 if key not in d
    d --> dict
    keys --> list of keys
    returns list
    """
    return [d.get(key, 0) for key in keys]

for i, thing in enumerate(data):
    print(line_fmt.format(ordinal(i + 1), *ordered_data(thing, key_map)))
Output
         a       p       g     cat
1st      2       0       1       0
2nd      5       3       0       4
It might be worthwhile to dig into the Pandas docs and look at its DataFrame - it might make life easier.
I second the answer using the Pandas dataframes. However, my code should be a bit simpler than yours.
In [1]: import pandas as pd
In [5]: data = [{'a': 2, 'g': 1},{'p': 3, 'a': 5, 'cat': 4}]
In [6]: df = pd.DataFrame(data)
In [7]: df
Out[7]:
   a  cat    g    p
0  2  NaN    1  NaN
1  5    4  NaN    3
In [9]: df = df.fillna(0)
In [10]: df
Out[10]:
   a  cat  g  p
0  2    0  1  0
1  5    4  0  3
I did my coding in IPython, which I highly recommend!
To save to csv, just use an additional line of code:
df.to_csv('filename.csv')
I am a freshie in Python; here are just some suggestions that may hopefully be helpful :)
key_map = []
for row in data:
    key_map.extend(row.keys())
key_map = list(set(key_map))
You can change the middle part of your code to the above, which will save you some time in building key_map.
In your case, union has to scan through each row to find the distinct items.