Hierarchical summing in python

Given the following array:
a = []
a.append({'c': 1, 'v': 10, 'p': 4})
a.append({'c': 2, 'v': 10, 'p': 4})
a.append({'c': 3, 'v': 10, 'p': None})
a.append({'c': 4, 'v': 0, 'p': None})
a.append({'c': 5, 'v': 10, 'p': 1})
a.append({'c': 6, 'v': 10, 'p': 1})
where c = code, v = value and p = parent.
The table looks like this:
c   v    p
1        4
2   10   4
3   10
4
5   10   1
6   10   1
I have to add to each parent's value the values of its children.
The expected result table would be:
c   v    p
1   20   4
2   10   4
3   10
4   30
5   10   1
6   10   1
How to achieve this?

First, you should derive another dictionary, mapping parents to lists of their children, instead of children to their parents. You can use collections.defaultdict for this.
from collections import defaultdict
children = defaultdict(list)
for d in a:
    children[d["p"]].append(d["c"])
Also, I suggest another dictionary, mapping codes to their values, so you don't have to search the entire list each time:
values = {}
for d in a:
    values[d["c"]] = d["v"]
Now you can very easily define a recursive function for calculating the total value. Note, however, that this will do some redundant calculations. If your data is much larger, you might want to circumvent this by using a bit of memoization.
def total_value(x):
    v = values[x]
    for c in children[x]:
        v += total_value(c)
    return v
Finally, using this function in a dict comprehension gives you the total values for each code:
>>> {x: total_value(x) for x in values}
{1: 30, 2: 10, 3: 10, 4: 40, 5: 10, 6: 10}
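The memoization mentioned above could look like the following minimal sketch. It uses functools.lru_cache so that each subtree total is computed only once, and it assumes the parent links contain no cycles. The name total_value_cached is illustrative and not part of the original answer.

```python
from collections import defaultdict
from functools import lru_cache

# Sample data from the question (c = code, v = value, p = parent).
a = [
    {'c': 1, 'v': 10, 'p': 4},
    {'c': 2, 'v': 10, 'p': 4},
    {'c': 3, 'v': 10, 'p': None},
    {'c': 4, 'v': 0, 'p': None},
    {'c': 5, 'v': 10, 'p': 1},
    {'c': 6, 'v': 10, 'p': 1},
]

children = defaultdict(list)  # parent code -> list of child codes
values = {}                   # code -> own value
for d in a:
    children[d['p']].append(d['c'])
    values[d['c']] = d['v']

@lru_cache(maxsize=None)
def total_value_cached(x):
    # Each subtree total is computed once and then served from the cache.
    # Assumption: the hierarchy has no cycles, otherwise this recurses forever.
    return values[x] + sum(total_value_cached(c) for c in children[x])

totals = {x: total_value_cached(x) for x in values}
print(totals)  # {1: 30, 2: 10, 3: 10, 4: 40, 5: 10, 6: 10}
```

With the cache in place, asking for the total of code 4 also fills in the totals of codes 1, 2, 5 and 6, so later lookups for those codes do no further recursion.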

Related

How to generate the component id in the Networkx graph?

I have a large Graph Network generated using Networkx package.
Here I'm adding a sample
import networkx as nx
import pandas as pd
G = nx.path_graph(4)
nx.add_path(G, [10, 11, 12])
I'm trying to create a dataframe with the columns Node, degree, ComponentID and Components.
I created the degrees using
degrees = list(nx.degree(G))
data = pd.DataFrame([list(d) for d in degrees], columns=['Node', 'degree']).sort_values('degree', ascending=False)
and extracted the components using
Gcc = sorted(nx.connected_components(G), key=len, reverse=True)
Gcc
[{0, 1, 2, 3}, {10, 11, 12}]
I'm not sure how to add the component ID and the components to the data.
Required output:
   Node  degree  ComponentID    Components
1     1       2            1  {0, 1, 2, 3}
2     2       2            1  {0, 1, 2, 3}
5    11       2            2  {10, 11, 12}
0     0       1            1  {0, 1, 2, 3}
3     3       1            1  {0, 1, 2, 3}
4    10       1            2  {10, 11, 12}
6    12       1            2  {10, 11, 12}
How to generate the component ids and add them to the nodes and degrees?

Create triplets of (Node, ComponentID, Components) by enumerating over the list of connected components, then build a new dataframe from these triplets and merge it with the existing dataframe on Node:
df = pd.DataFrame([(n, i, c) for i, c in enumerate(Gcc, 1) for n in c],
                  columns=['Node', 'ComponentID', 'Components'])
data = data.merge(df, on='Node')
Alternatively, you can use map instead of merge to create the ComponentID and Components columns individually:
d = dict(enumerate(Gcc, 1))
data['ComponentID'] = data['Node'].map({n:i for i,c in d.items() for n in c})
data['Components'] = data['ComponentID'].map(d)
print(data)
   Node  degree  ComponentID    Components
1     1       2            1  {0, 1, 2, 3}
2     2       2            1  {0, 1, 2, 3}
5    11       2            2  {10, 11, 12}
0     0       1            1  {0, 1, 2, 3}
3     3       1            1  {0, 1, 2, 3}
4    10       1            2  {10, 11, 12}
6    12       1            2  {10, 11, 12}
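The enumeration step can be illustrated without pandas. The following is a minimal sketch that hard-codes the component list shown in the question instead of calling networkx; the names node_to_id and rows are illustrative only.

```python
# Components as shown in the question, largest first.
Gcc = [{0, 1, 2, 3}, {10, 11, 12}]

# Number the components from 1 and map every node to its component ID.
node_to_id = {n: i for i, c in enumerate(Gcc, 1) for n in c}

# One (Node, ComponentID, Components) triplet per node, sorted by node.
rows = [(n, node_to_id[n], Gcc[node_to_id[n] - 1]) for n in sorted(node_to_id)]
for row in rows:
    print(row)
```

Each triplet corresponds to one row of the merged dataframe, minus the degree column.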

Count of rows linked by ids in Pandas dataframe

I have a table of ids and previous ids (see image 1). I want to count the number of unique ids linked in one chain. For example, if we take the latest id as the 'parent', then the result for the example data below would be something like Image 2, where 'a' is linked to 5 ids in total (a, b, c, d and e) and 'w' is linked to 4 ids (w, x, y and z). In practice, I am dealing with randomly generated ids, not sequenced letters.
Python Code to produce example dataframes:
import pandas as pd
raw_data = pd.DataFrame([['a','b'], ['b','c'], ['c', 'd'],['d','e'],['e','-'],
['w','x'], ['x', 'y'], ['y','z'], ['z','-']], columns=['id', 'previous_id'])
output = pd.DataFrame([['a',5],['w',4]], columns = ['parent_id','linked_ids'])

First build a graph with nx.from_pandas_edgelist and extract its connected_components, then create a dictionary mapping each id to its component number. Keep the first id of each component by filtering with Series.map and Series.duplicated, and finally add the new column with Series.map, using a Counter to build the mapp dictionary of component sizes:
df1 = raw_data[raw_data['previous_id'].ne('-')]
import networkx as nx
from collections import Counter
g = nx.from_pandas_edgelist(df1,'id','previous_id')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}
print (d)
{'c': 0, 'e': 0, 'b': 0, 'd': 0, 'a': 0, 'y': 1, 'x': 1, 'w': 1, 'z': 1}
c = Counter(d.values())
mapp = {k: c[v] for k, v in d.items()}
print (mapp)
{'c': 5, 'e': 5, 'b': 5, 'd': 5, 'a': 5, 'y': 4, 'x': 4, 'w': 4, 'z': 4}
df = (raw_data.loc[~raw_data['id'].map(d).duplicated(), ['id']]
.rename(columns={'id':'parent_id'})
.assign(linked_ids = lambda x: x['parent_id'].map(mapp)))
print (df)
parent_id linked_ids
0 a 5
5 w 4
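For chains as simple as the ones in the example, where each id has at most one previous_id, the counts can be cross-checked without networkx. This is a sketch under that assumption (and assuming no cycles); the names previous, heads and chain_length are illustrative.

```python
# (id, previous_id) pairs from the question; '-' marks the end of a chain.
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('e', '-'),
         ('w', 'x'), ('x', 'y'), ('y', 'z'), ('z', '-')]

# id -> previous id, leaving out the end-of-chain markers.
previous = {i: p for i, p in pairs if p != '-'}

# A chain head is an id that never appears as someone's previous_id.
heads = [i for i, _ in pairs if i not in previous.values()]

def chain_length(head):
    # Follow previous_id links until the chain ends, counting ids on the way.
    count = 1
    while head in previous:
        head = previous[head]
        count += 1
    return count

linked = {h: chain_length(h) for h in heads}
print(linked)  # {'a': 5, 'w': 4}
```

The connected-components approach is more general, since it also handles ids that are linked from several places.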

Fill dictionary values as the sum of values from a pandas dataframe

I have a dictionary that contains the names of various players with all values set to None like so...
players = {'A': None,
'B': None,
'C': None,
'D': None,
'E': None}
A pandas data frame (df_1) that contains the keys, i.e. player names
   col_0  col_1  col_2
   -----  -----  -----
0  A      B      C
1  A      E      D
2  C      B      A
and a dataframe (df_2) that contains their scores in corresponding matches
   score_0  score_1  score_2
   -------  -------  -------
0  1        10       2
1  6        15       7
2  8        1        9
Hence, total score of A is..
1 + 6 + 9 = 16
(0, score_0) + (1, score_0) + (2, score_2)
and I would like to map all the players(A, B, C..) to their total score in the dictionary of players that I had created earlier.
Here's some code that I wrote...
for player in players:
    players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
    players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
    players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
print(players)
This produces the desired result, but I am wondering if a faster, more pandas-like way is available. Any help would be appreciated.

Use pandas stack: after flattening both dataframes, you can group the scores by the player names and sum them.
s = df_2.stack().groupby(df_1.stack().values).sum()
s
A 16
B 11
C 10
D 7
E 15
dtype: int64
s.to_dict()
{'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}
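To see what flattening and grouping amounts to, here is a plain-Python sketch of the same idea using the sample data as nested lists. The names names, scores and totals are stand-ins for df_1, df_2 and the resulting dictionary.

```python
from collections import defaultdict

# Row-wise contents of df_1 (player names) and df_2 (scores).
names = [['A', 'B', 'C'],
         ['A', 'E', 'D'],
         ['C', 'B', 'A']]
scores = [[1, 10, 2],
          [6, 15, 7],
          [8, 1, 9]]

totals = defaultdict(int)
for name_row, score_row in zip(names, scores):
    # Pair each cell of df_1 with the cell at the same position in df_2.
    for player, score in zip(name_row, score_row):
        totals[player] += score

print(dict(sorted(totals.items())))
# {'A': 16, 'B': 11, 'C': 10, 'D': 7, 'E': 15}
```

The pandas version does the same pairing by position, which is why it groups by the stacked values of df_1 rather than by its index.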

You can generate such dictionary with:
import numpy as np
result = { k: np.nansum(df_2[df_1 == k]) for k in players }
For the given sample data, this will return:
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0}
If no values exist for a given key, it will map to zero. For example, if we add a key R to the players:
>>> players['R'] = None
>>> { k: np.nansum(df_2[df_1 == k]) for k in players }
{'A': 16.0, 'B': 11.0, 'C': 10.0, 'D': 7.0, 'E': 15.0, 'R': 0.0}
Or we can boost efficiency by first extracting numpy arrays out of the data frames:
arr_2 = df_2.values
arr_1 = df_1.values
result = { k: arr_2[arr_1 == k].sum() for k in players }
Benchmarks
If we define functions f (the original implementation), g (this implementation) and h (@WeNYoBen's implementation), and use timeit to measure the time for 1,000 calls with the given sample data, I get the following on an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz (which is unfortunately a bit busy at the moment):
>>> df_1 = pd.DataFrame({'col_0': ['A', 'A', 'C'], 'col_1': ['B', 'E', 'B'], 'col_2': ['C', 'D', 'A']})
>>> df_2 = pd.DataFrame({'score_0': [1, 6, 8], 'score_1': [10, 15, 1], 'score_2': [2, 7, 9]})
>>> from timeit import timeit
>>> def f():
...     for player in players:
...         players[player] = df_2.loc[df_1['col_0'] == player, 'score_0'].sum()
...         players[player] += df_2.loc[df_1['col_1'] == player, 'score_1'].sum()
...         players[player] += df_2.loc[df_1['col_2'] == player, 'score_2'].sum()
...     return players
...
>>> def g():
...     arr_2 = df_2.values
...     arr_1 = df_1.values
...     return { k: arr_2[arr_1 == k].sum() for k in players }
...
>>> def h():
...     return df_2.stack().groupby(df_1.stack().values).sum().to_dict()
...
>>> timeit(f, number=1000)
47.23081823496614
>>> timeit(g, number=1000)
0.32561282289680094
>>> timeit(h, number=1000)
8.169926556991413
The most important optimization is probably to use the numpy array instead of performing the calculations at the pandas level.

Indexing for a changing number python

I'm trying to figure out the index for this program. I want it to print a number for each letter entered in the input, for example the string "Jon" would be:
"10 15 14"
but I keep getting an error with the for loop I created with the indexes. If anyone has any thoughts on how to fix this it would be great help!
a = 1
b = 2
c = 3
d = 4
e = 5
f = 6
g = 7
h = 8
i = 9
j = 10
k = 11
l = 12
m = 13
n = 14
o = 15
p = 16
q = 17
r = 18
s = 19
t = 20
u = 21
v = 22
w = 23
x = 24
y = 25
z = 26
name = input("Name: ")
lowercase = name.lower()
print("Your 'cleaned up' name is:", lowercase)
print("Your 'cleaned up name reduces to:")
length = len(name)
name1 = 0
for x in range(name[0], name[length]):
    print(name[name1])
    name1 += 1

You could save yourself all those variables, and not even need a dictionary by just utilizing ord here and calculating the numerical position in the alphabet:
Example: taking the letter c, the following should give us 3:
>>> ord('c') - 96
3
ord will:
Return the integer ordinal of a one-character string.
The 96 is used because of where the lowercase letters sit in the ASCII table: ord('a') is 97.
With that in mind, when you enter a word, using your example "Jon":
name = input("enter name")
print(*(ord(c) - 96 for c in name.lower()))
# 10 15 14
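The constant 96 is simply ord('a') - 1, which the following sketch spells out. The helper name letter_positions is illustrative, and it assumes the input contains only the ASCII letters a to z.

```python
def letter_positions(name):
    # ord('a') is 97, so subtracting ord('a') and adding 1 maps a->1 ... z->26.
    return [ord(c) - ord('a') + 1 for c in name.lower()]

print(*letter_positions("Jon"))  # 10 15 14
```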

You can store each letter and the indices in a dict so you can easily retrieve the ones in name:
>>> from string import ascii_lowercase
>>> letters = dict(zip(ascii_lowercase, range(1, len(ascii_lowercase) + 1)))
>>> for c in name.lower():
...     print(letters[c])
If you want the indices lined up in a single string:
>>> print(" ".join(str(letters[c]) for c in name.lower()))
10 15 14

You are currently passing characters to range(). range(name[0], name[length]) with a name of 'Jon' is equivalent to range('J', 'n')... or it would be if strings were 1-indexed. Unfortunately for this code snippet, a sequence does not have an element with an index equal to the sequence's length. The last element of a three-character string has an index of two, not three. Your algorithm also has zero interaction with the letter values you defined above. It has little chance of succeeding.
Rather than defining each letter's value separately, store it in a string and then look up each letter's index in that string:
name = input('Name: ')
s = 'abcdefghijklmnopqrstuvwxyz'
print(*(s.index(c)+1 for c in name.lower()))
A generator that produces the index of each of the name's characters in the master string (plus one, because you want it one-indexed) is unpacked and sent to print(), which, with the default separator of a space, produces the desired output.

Rather than define 26 different variables, how about using a dictionary? Then you can write something like:
mapping = {
'a': 1,
# etc
}
name_in_numbers = ' '.join(str(mapping[letter]) for letter in name.lower())
Note that this will break for any input that doesn't only contain letters.
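One way to cope with such input is to skip characters that are not in the mapping. The sketch below assumes that silently dropping non-letters is acceptable; name_to_numbers is a hypothetical helper name.

```python
from string import ascii_lowercase

# Map 'a'..'z' to 1..26.
mapping = {letter: i for i, letter in enumerate(ascii_lowercase, 1)}

def name_to_numbers(name):
    # Characters without an entry in the mapping (spaces, digits, punctuation)
    # are skipped instead of raising a KeyError.
    return ' '.join(str(mapping[ch]) for ch in name.lower() if ch in mapping)

print(name_to_numbers("Jon"))      # 10 15 14
print(name_to_numbers("J. o-n!"))  # 10 15 14
```

If invalid characters should be reported rather than ignored, the membership test can be replaced by an explicit check that raises an error.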

First, use a dict to store the mapping from characters to integers. You can use the string module to get all the lowercase characters, which makes your code less error-prone. Second, you just need to iterate over the characters of the lowercased string and look up each one's value in the mapping:
import string
mapping = dict(zip(string.ascii_lowercase, range(1, len(string.ascii_lowercase)+1)))
name = "Anmol"
lowercase = name.lower()
print("Your 'cleaned up' name is:", lowercase)
print("Your 'cleaned up name reduces to:")
for char in lowercase:
    print(mapping[char], end=' ')

# Use a data structure so as not to pollute the namespace with 26 variable names; a dictionary does well for this
letter_values = {}
for i in range(97, 123):
    letter_values[chr(i)] = i - 96
print(letter_values)
name = input("Name: ")
name = name.lower()
for ch in name:
    print(letter_values[ch], end=' ')
Output:
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
Name: Jon
10 15 14

Personally, I prefer that you build a mapping of char:index, that you can always refer to later, this way:
>>> ascii_chrs = 'abcdefghijklmnopqrstuvwxyz'
>>> d = {x:i for i,x in enumerate(ascii_chrs, 1)}
>>> d
{'q': 17, 'i': 9, 'u': 21, 'x': 24, 'a': 1, 's': 19, 'm': 13, 'n': 14, 'e': 5, 'v': 22, 'b': 2, 'p': 16, 'g': 7, 'o': 15, 'j': 10, 't': 20, 'h': 8, 'f': 6, 'r': 18, 'y': 25, 'c': 3, 'k': 11, 'd': 4, 'z': 26, 'w': 23, 'l': 12}
>>>
>>> word = 'Salam'
>>> print(*(d[c] for c in word.lower()))
19 1 12 1 13

How to count each group of values in a dictionary in Python 3?

I have a dictionary with multiple values under multiple keys. I do NOT want a single sum of the values. I want to find a way to find the sum for each key.
The file is tab delimited, with an identifier being a combination of two of these items, Btarg. There are multiple values for each of these identifiers.
Here is a test file, with the desired result below:
Pattern  Item     Abundance
1        Ant      2
2        Dog      10
3        Giraffe  15
1        Ant      4
2        Dog      5
Here are the expected results:
Pattern1Ant, 6
Pattern2Dog, 15
Pattern3Giraffe, 15
This is what I have so far:
for line in K:
    if "pattern" in line:
        find = line
        Bsplit = find.split("\t")
        Buid = Bsplit[0]
        Borg = Bsplit[1]
        Bnum = (Bsplit[2])
        Btarg = Buid[:-1] + "//" + Borg
        if Btarg not in dict1:
            dict1[Btarg] = []
        dict1[Btarg].append(Bnum)
#The following used to work
#for key in dict1.iterkeys():
    #dict1[key] = sum(dict1[key])
#print (dict1)
How do I make this work in Python 3 without the error message "Unsupported operand type(s) for +: 'int' and 'list'"?
Thanks in advance!

Use from collections import Counter
From the documentation:
c = Counter('gallahad')
Counter({'a': 3, 'l': 2, 'h': 1, 'g': 1, 'd': 1})
Responding to your comment, now I think I know what you want, although I don't know what structure you have your data in. I will take for granted that you can organize your data like this:
In [41]: d
Out[41]: [{'Ant': 2}, {'Dog': 10}, {'Giraffe': 15}, {'Ant': 4}, {'Dog': 5}]
First create a defaultdict
from collections import defaultdict
a = defaultdict(int)
Then start counting:
In [42]: for each in d:
   ....:     for key, value in each.items():
   ....:         a[key] += value
Result:
In [43]: a
Out[43]: defaultdict(<class 'int'>, {'Ant': 6, 'Dog': 15, 'Giraffe': 15})
UPDATE 2
Supposing you can get your data in this format:
In [20]: d
Out[20]: [{'Ant': [2, 4]}, {'Dog': [10, 5]}, {'Giraffe': [15]}]
In [21]: from collections import defaultdict
In [22]: a = defaultdict(int)
In [23]: for each in d:
   ....:     for key, values in each.items():
   ....:         a[key] = sum(values)
   ....:
In [24]: a
Out[24]: defaultdict(<class 'int'>, {'Ant': 6, 'Dog': 15, 'Giraffe': 15})
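Applied to the tab-delimited data from the question, the same defaultdict idea might look like the sketch below. It assumes three tab-separated columns with a header row, builds each key by concatenating the word "Pattern", the pattern number and the item name (as in the expected results), and converts the abundance to int before adding so that numbers rather than strings are summed. The sample lines are inlined here instead of being read from a file.

```python
from collections import defaultdict

# Inlined stand-in for the lines of the tab-delimited test file.
lines = [
    "Pattern\tItem\tAbundance",
    "1\tAnt\t2",
    "2\tDog\t10",
    "3\tGiraffe\t15",
    "1\tAnt\t4",
    "2\tDog\t5",
]

totals = defaultdict(int)
for line in lines[1:]:  # skip the header row
    pattern, item, abundance = line.rstrip("\n").split("\t")
    # Convert to int so the values are added rather than collected as strings.
    totals["Pattern" + pattern + item] += int(abundance)

for key, total in totals.items():
    print(f"{key}, {total}")
# Pattern1Ant, 6
# Pattern2Dog, 15
# Pattern3Giraffe, 15
```

Summing while reading avoids building the intermediate lists at all, so there is no later step where sum() has to cope with string values.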
