DataFrame to a dictionary - Python

I have a very large dataframe, a sample of which looks like this:
df = pd.DataFrame({'From':['a','b','c','a','d'], 'To':['b', 'c', 'a', 'd', 'e'], 'Rates':[1e-4, 2.3e-2, 1e-2, 100, 70]})
In [121]: df
Out[121]:
  From To     Rates
0    a  b    0.0001
1    b  c    0.0230
2    c  a    0.0100
3    a  d  100.0000
4    d  e   70.0000
The end result I would like is a dictionary that looks like this:
{('a', 'b'): 0.0001,
('a', 'd'): 100.0,
('b', 'c'): 0.023,
('c', 'a'): 0.01,
('d', 'e'): 70.0}
The following code works but it is very inefficient for a large df.
from_comps = list(df['From'])
to_comps = list(df['To'])
transfer_rates = {}
for from_comp in from_comps:
    for to_comp in to_comps:
        try:
            transfer_rates[from_comp, to_comp] = df.loc[(df['From'] == from_comp) & (df['To'] == to_comp)]['Rates'].values[0]
        except:
            pass
Is there a more efficient way of doing this?

Given the input provided, it's far simpler to use the built-in to_dict() method. Note that for a more complex dataset, this might require more tweaking.
df = pd.DataFrame({'From':['a','b','c','a','d'], 'To':['b', 'c', 'a', 'd', 'e'], 'Rates':[1e-4, 2.3e-2, 1e-2, 100, 70]})
df.set_index(['From','To']).to_dict()
{'Rates': {('a', 'b'): 0.0001,
('b', 'c'): 0.023,
('c', 'a'): 0.01,
('a', 'd'): 100.0,
('d', 'e'): 70.0}}
df.set_index(['From','To']).to_dict()['Rates']
{('a', 'b'): 0.0001,
('b', 'c'): 0.023,
('c', 'a'): 0.01,
('a', 'd'): 100.0,
('d', 'e'): 70.0}

We can also use the to_records method to get the desired results.
{(item.From, item.To): item.Rates for item in df.to_records(index=False)}
{('a', 'b'): 0.0001,
('b', 'c'): 0.023,
('c', 'a'): 0.01,
('a', 'd'): 100.0,
('d', 'e'): 70.0}
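A closely related variant (my addition, not part of the original answer) uses itertuples, which yields namedtuples directly and skips the conversion to a NumPy record array:
# assumes the same df as above; works because the column names are valid identifiers
{(row.From, row.To): row.Rates for row in df.itertuples(index=False)}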

You could use df.to_dict and pivot_table:
df['key'] = list(zip(df['From'], df['To']))
df[['key', 'Rates']].pivot_table(columns='key', values='Rates').to_dict()
{('a', 'b'): {'Rates': 0.0001}, ('a', 'd'): {'Rates': 100.0}, ('b', 'c'): {'Rates': 0.023}, ('c', 'a'): {'Rates': 0.01}, ('d', 'e'): {'Rates': 70.0}}
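Note that this produces nested {'Rates': ...} values. If you only want the flat mapping, a plain zip (my addition, not from the answers above) avoids both the helper column and the pivot:
dict(zip(zip(df['From'], df['To']), df['Rates']))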


Is there a way to decrease run time for nested for loop?

I am trying to calculate the total_score of each derived path. However, my current nested for loop takes forever to run when there are more than 90 nodes, but runs fine for just 20 nodes. I would need your help on how to make this code efficient enough for bigger graphs.
I managed to retrieve all the possible paths (7 in total) from start node a to end node i. They are assigned to a list called paths:
[['a', 'b', 'f', 'i'], # 1st path
['a', 'y', 'f', 'i'], # 2nd path
['a', 'b', 'd', 'y', 'f', 'i'],
['a', 'b', 'o', 'd', 'y', 'f', 'i'],
['a', 'b', 'd', 'o', 'y', 'f', 'i'],
['a', 'b', 'p', 'd', 'o', 'y', 'f', 'i'],
['a', 'b', 'p', 's', 'o', 'y', 'f', 'i']]
This is the graph: the keys are nodes, and each value maps a neighbouring node to the weight of the edge connecting them.
graph = {'b': {'f': 0.1, 'o': 0.4, 'd': 0.3},
         'y': {'f': 0.3, 'o': 0.1},
         'i': {'d': 0.7, 'z': 0.3},
         'm': {'o': 0.8},
         'd': {'y': 0.6, 'o': 0.1},
         'o': {'d': 0.5, 'm': 0.9},
         'z': {'d': 0.1, 'o': 0.2, 'y': 0.4, 'o': 0.1, 'i3': 0.6},
         'o': {'y': 0.8},
         'a': {'b': 1.0, 'y': 0.5}, 'f': {'i': 0.3}}
For example, node b is directed to node f (weight 0.1), node o (weight 0.4) and node d (weight 0.3).
total_score = the product of all edge weights along the path
For example, for the 1st path (a, b, f, i), the total cost is 1.0 (a to b) * 0.1 (b to f) * 0.3 (f to i) = 0.03.
This is my current nested for loop code:
def total_scores(graph, paths):
    scores = []
    score = 1.0
    for i in range(len(paths)):            # go into each possible path
        path = paths[i]
        for j in range(len(path) - 1):     # go from 1st to last node in the path
            first = paths[i][j]            # current node
            next = paths[i][j + 1]         # next node in the same path
            score *= graph[first][next]
        scores.append(round(score, 5))
        score = 1.0
    return scores

total_scores(graph, paths)
I have thought about recursion, but I doubt it will help much in terms of runtime. Any help is appreciated!
Python's lists are very flexible and very inefficient for this sort of computation. Replace your character node identifiers by numbers, and store the whole thing in a numpy array. That will speed it up by many factors.
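A minimal sketch of that idea (the array layout and all the names are my own, not from the answer):
import numpy as np

# Map node labels to integer ids and keep the weights in a dense matrix;
# missing edges stay 0, so the incomplete graph above would first need its
# gaps filled in (the next answer fills them with weight 1).
nodes = sorted({n for path in paths for n in path})
idx = {n: i for i, n in enumerate(nodes)}

w = np.zeros((len(nodes), len(nodes)))
for src, nbrs in graph.items():
    for dst, weight in nbrs.items():
        if src in idx and dst in idx:
            w[idx[src], idx[dst]] = weight

def total_scores_np(w, paths):
    # fancy indexing fetches all edge weights of a path in one call
    return [float(np.prod(w[[idx[a] for a in path[:-1]],
                            [idx[b] for b in path[1:]]]))
            for path in paths]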
Although it contains a nested loop, your code already has O(n) time complexity, where n is the total number of edges in all paths, so if we limit ourselves to the standard library there's not much that can be done.
I came up with a ~40% faster version, but I doubt you can do much better than that. The idea is to store the edges instead of the nodes in the paths, and modify the graph structure accordingly (note that some values were missing in the graph you posted; I added them with a weight of 1):
epaths = [
[('a', 'b'), ('b', 'f'), ('f', 'i')],
[('a', 'y'), ('y', 'f'), ('f', 'i')],
[('a', 'b'), ('b', 'd'), ('d', 'y'), ('y', 'f'), ('f', 'i')],
[('a', 'b'), ('b', 'o'), ('o', 'd'), ('d', 'y'), ('y', 'f'), ('f', 'i')],
[('a', 'b'), ('b', 'd'), ('d', 'o'), ('o', 'y'), ('y', 'f'), ('f', 'i')],
[('a', 'b'), ('b', 'p'), ('p', 'd'), ('d', 'o'), ('o', 'y'), ('y', 'f'), ('f', 'i')],
[('a', 'b'), ('b', 'p'), ('p', 's'), ('s', 'o'), ('o', 'y'), ('y', 'f'), ('f', 'i')]]
tgraph = {
('a', 'b'): 1.0,
('a', 'y'): 0.5,
('b', 'd'): 0.3,
('b', 'f'): 0.1,
('b', 'o'): 0.4,
('b', 'p'): 1.0,
('d', 'o'): 0.1,
('d', 'y'): 0.6,
('f', 'i'): 0.3,
('i', 'd'): 0.7,
('i', 'z'): 0.3,
('m', 'o'): 0.8,
('o', 'd'): 1.0,
('o', 'y'): 0.8,
('p', 'd'): 1.0,
('p', 's'): 1.0,
('s', 'o'): 1.0,
('y', 'f'): 0.3,
('y', 'o'): 0.1,
('z', 'd'): 0.1,
('z', 'i3'): 0.6,
('z', 'o'): 0.1,
('z', 'y'): 0.4}
from functools import reduce
from operator import mul

def total_scores(graph, paths):
    return [reduce(mul, (graph[edge] for edge in path), 1) for path in paths]

total_scores(tgraph, epaths)
However, I feel the best approach would be to compute the scores while building the paths: for instance, we are multiplying in the weight of ('a', 'b') six times, while if you do a BFS you will only take it into account once. A rough sketch of that idea follows.
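This sketch is my own (a recursive DFS rather than the BFS mentioned above, over the original node-keyed graph); it enumerates paths and their scores in a single pass:
def scored_paths(graph, node, end, score=1.0, path=()):
    # extend the partial path; a shared prefix is scored once per branch
    path = path + (node,)
    if node == end:
        yield list(path), score
        return
    for nbr, weight in graph.get(node, {}).items():
        if nbr not in path:                  # avoid cycles
            yield from scored_paths(graph, nbr, end, score * weight, path)

# e.g. list(scored_paths(graph, 'a', 'i')) yields (path, total_score) pairs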

Pandas get cell index by header and row names

I would like to access cells by header and row names (the size can vary).
df = pandas.read_excel('file.xlsx')
print(df)
  id  A  B  C
0  D  1  2  3
1  E  4  5  6
2  F  7  8  1
# into this...
{
("A", "D") : "1",
("B", "D") : "2",
...
}
If order is not important, create a MultiIndex Series by DataFrame.set_index with DataFrame.unstack, and then use Series.to_dict:
d = df.set_index('id').unstack().to_dict()
print (d)
{('A', 'D'): 1, ('A', 'E'): 4, ('A', 'F'): 7, ('B', 'D'): 2,
('B', 'E'): 5, ('B', 'F'): 8, ('C', 'D'): 3, ('C', 'E'): 6, ('C', 'F'): 1}
If order is important (python 3):
d = df.set_index('id').unstack().sort_index(level=1).to_dict()
print (d)
{('A', 'D'): 1, ('B', 'D'): 2, ('C', 'D'): 3, ('A', 'E'): 4, ('B', 'E'): 5,
('C', 'E'): 6, ('A', 'F'): 7, ('B', 'F'): 8, ('C', 'F'): 1}
import numpy as np

df1 = df.set_index('id')
cols = np.tile(df1.columns, len(df1))          # A B C A B C ... once per row
rows = np.repeat(df1.index, len(df1.columns))  # D D D E E E ... once per column
vals = np.ravel(df1)                           # values flattened row by row
d = {(a, b): v for a, b, v in zip(cols, rows, vals)}
print(d)
{('A', 'D'): 1, ('B', 'D'): 2, ('C', 'D'): 3, ('A', 'E'): 4, ('B', 'E'): 5,
('C', 'E'): 6, ('A', 'F'): 7, ('B', 'F'): 8, ('C', 'F'): 1}

How can I replace all the values of a Python dictionary with a range of values?

I have the following dictionary:
mydict = {('a', 'b'): 28.379,
('c', 'd'): 32.292,
('e', 'f'): 61.295,
('g', 'h'): 112.593,
('i', 'j'): 117.975}
And I would like to replace all the values with a range from 1 to 5, but keep the order of the keys. As a result, I would get this:
mydict = {('a', 'b'): 1,
('c', 'd'): 2,
('e', 'f'): 3,
('g', 'h'): 4,
('i', 'j'): 5}
The length of the dictionary is actually 22000, so I need a range from 1 to 22000.
How can I do it?
Thanks in advance.
Using enumerate to iterate on the keys, you can do:
mydict = {('a', 'b'): 28.379,
('c', 'd'): 32.292,
('e', 'f'): 61.295,
('g', 'h'): 112.593,
('i', 'j'): 117.975}
for i, key in enumerate(mydict):  # iterates on the keys
    mydict[key] = i

print(mydict)
# {('a', 'b'): 0, ('c', 'd'): 1, ('e', 'f'): 2, ('g', 'h'): 3, ('i', 'j'): 4}
Important note: dicts are only officially ordered since Python 3.7 (and in the CPython implementation since 3.6), so this wouldn't make much sense with older versions of Python.
To answer your comment: enumerate takes an optional second parameter start (which defaults to 0).
So, if you want to start at 1, just do:
for i, key in enumerate(mydict, start=1):  # iterates on the keys
    mydict[key] = i
The simplest approach is to create another dictionary from the keys of the previous one.
mydict2 = dict()
for i, key in enumerate(mydict):
    mydict2[key] = i + 1
You can do this with a one-liner which is more compact:
mydict = {('a', 'b'): 28.379,
('c', 'd'): 32.292,
('e', 'f'): 61.295,
('g', 'h'): 112.593,
('i', 'j'): 117.975}
{k: i for i, (k, v) in enumerate(mydict.items())}
Pandas solution for this:
import pandas as pd
a = pd.DataFrame(mydict, index=[0]).T
a[0] = list(range(0,len(a)))
a.to_dict()[0]
# {('a', 'b'): 0, ('c', 'd'): 1, ('e', 'f'): 2, ('g', 'h'): 3, ('i', 'j'): 4}
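The question asks for a range starting at 1 rather than 0; presumably shifting the range is the only change needed:
a[0] = list(range(1, len(a) + 1))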
This can be done gracefully with dict.update and itertools.count, and explicit loops can be avoided:
>>> mydict = {('a', 'b'): 28.379,
... ('c', 'd'): 32.292,
... ('e', 'f'): 61.295,
... ('g', 'h'): 112.593,
... ('i', 'j'): 117.975}
>>> from itertools import count
>>> mydict.update(zip(mydict, count(1)))
>>> mydict
{('a', 'b'): 1, ('c', 'd'): 2, ('e', 'f'): 3, ('g', 'h'): 4, ('i', 'j'): 5}

How to convert a list into a dictionary which uses tuple as a key

I'd like to read an Excel table with pandas and create a list of tuples. Then, I want to convert the list into a dictionary which has a tuple as its key. How can I do that?
Here is the table that I am reading:
A B 0.6
A C 0.7
C D 1.0
C A 1.2
D B 0.7
D C 0.6
Here is how I read my table:
import pandas as pd
df = pd.read_csv("my_file_name.csv", header=None)
my_tuple = [tuple(x) for x in df.values]
Now, I want to have the following structure.
my_data = {("A", "B"): 0.6,
("A", "C"): 0.7,
("C", "D"): 1,
("C", "A"): 1.2,
("D", "B"): 0.7,
("D", "C"): 0.6}
Use set_index and to_dict (this assumes the columns are named, e.g. after df.columns = ['A', 'B', 'C']):
df.set_index(['A', 'B'])['C'].to_dict()
{('A', 'B'): 0.6,
('A', 'C'): 0.7,
('C', 'A'): 1.2,
('C', 'D'): 1.0,
('D', 'B'): 0.7,
('D', 'C'): 0.6}
Option 2: another solution using zip
dict(zip(df[['A', 'B']].apply(tuple, axis=1), df['C']))
Option 3:
k = df[['A', 'B']].to_records(index=False).tolist()
dict(zip(k, df['C']))
A comprehension will work well for smaller frames:
{(a, b): c for a, b, c in df.values}
#{('A', 'B'): 0.6,
# ('A', 'C'): 0.7,
# ('C', 'A'): 1.2,
# ('C', 'D'): 1.0,
# ('D', 'B'): 0.7,
# ('D', 'C'): 0.6}
If having issues with ordering:
from collections import OrderedDict
d = OrderedDict(((a, b), c) for a, b, c in df.values)
#OrderedDict([(('A', 'B'), 0.6),
# (('A', 'C'), 0.7),
# (('C', 'D'), 1.0),
# (('C', 'A'), 1.2),
# (('D', 'B'), 0.7),
# (('D', 'C'), 0.6)])
Jan - here's one idea: just create a key column using the pandas apply function to generate a tuple of your first 2 columns, then zip them up to a dict.
import pandas as pd
df = pd.read_clipboard(header=None)  # header=None so the first row is kept as data
df.columns = ['first', 'second', 'value']
df.head()
def create_key(row):
    return (row['first'], row['second'])
df['key'] = df.apply(create_key, axis=1)
dict(zip(df['key'], df['value']))
{('A', 'B'): 0.6,
 ('A', 'C'): 0.7,
 ('C', 'A'): 1.2,
 ('C', 'D'): 1.0,
 ('D', 'B'): 0.7,
 ('D', 'C'): 0.6}
This is less concise than Vaishali's answer but gives you more of an idea of the steps.
vals1 = df['A'].values
vals2 = df['B'].values
vals3 = df['C'].values

dd = {}
for i in range(len(vals1)):
    key = (vals1[i], vals2[i])
    value = vals3[i]
    dd[key] = value
{('A', 'B'): '0.6',
('A', 'C'): '0.7',
('C', 'D'): '1.0',
('C', 'A'): '1.2',
('D', 'B'): '0.7',
('D', 'C'): '0.6'}
If you want to keep it simple, this version does not import anything like pandas:
def change_csv(filename):
    result = {}
    with open(filename, 'r') as file_pointer:  # closes the file automatically
        for each_line in file_pointer:
            a, b, c = each_line.strip().split(" ")
            result[a, b] = c
    return result
The output is:
{('A', 'B'): '0.6', ('A', 'C'): '0.7', ('C', 'D'): '1.0', ('C', 'A'): '1.2', ('D', 'B'): '0.7', ('D', 'C'): '0.6'}
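Note that the values come out as strings here; if you need numbers like the pandas answers produce, converting in the loop should be enough:
result[a, b] = float(c)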

any value in certain key in python dictionary

In Python, I have a dictionary:
d = {('A', 'A', 'A'): 1, ('A', 'A', 'B'): 1, ('A', 'A', 'C'): 1, ('A', 'B', 'A'): 2, ('A', 'B', 'C'): 2, ...}
Is there a simple way to change the values (to 10, for example) when the key is of the form ('A', 'A', _), where _ can be any character A-Z?
So, it will look like {('A', 'A', 'A'):10, ('A', 'A', 'B'):10, ('A', 'A', 'C'):10, ('A', 'B', 'A'): 2, ('A', 'B', 'C'):2, ...} at the end.
For now, I'm using a loop with a variable x for ('A', 'A', x), but I'm wondering if Python has a built-in way to do this.
Thanks for the tips.
Just check the first two elements of each tuple, the last is irrelevant unless you specifically want to make sure it is also a letter:
for k in d:
    if k[0] == "A" and k[1] == "A":
        d[k] = 10

print(d)
{('A', 'B', 'A'): 2, ('A', 'B', 'C'): 2, ('A', 'A', 'A'): 10, ('A', 'A', 'C'): 10, ('A', 'A', 'B'): 10}
If the last element must also actually be alpha then use str.isalpha:
d = {('A', 'A', '!'):1, ('A', 'A', 'B'):1, ('A', 'A', 'C'):1, ('A', 'B', 'A'): 2, ('A', 'B','C'):2}
for k in d:
    if all((k[0] == "A", k[1] == "A", k[2].isalpha())):
        d[k] = 10

print(d)
{('A', 'B', 'A'): 2, ('A', 'B', 'C'): 2, ('A', 'A', '!'): 1, ('A', 'A', 'C'): 10, ('A', 'A', 'B'): 10}
There is no syntax where d[('A', 'A', _)] = 10 will work, but you could hack a functional approach using map in Python 2:
d = {('A', 'A', 'A'):1, ('A', 'A', 'B'):1, ('A', 'A', 'C'):1, ('A', 'B', 'A'): 2, ('A', 'B','C'):2}
map(lambda k: d.__setitem__(k, 10) if ((k[0], k[1]) == ("A", "A")) else k, d)
print(d)
{('A', 'B', 'A'): 2, ('A', 'B', 'C'): 2, ('A', 'A', 'A'): 10, ('A', 'A', 'C'): 10, ('A', 'A', 'B'): 10}
Or including isalpha:
d = {('A', 'A', '!'):1, ('A', 'A', 'B'):1, ('A', 'A', 'C'):1, ('A', 'B', 'A'): 2, ('A', 'B','C'):2}
map(lambda k: d.__setitem__(k, 10) if ((k[0], k[1],k[2].isalpha()) == ("A", "A",True)) else k, d)
print(d)
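Note that in Python 3, map is lazy, so neither of these calls would update d unless the map object is consumed (e.g. with list(...)). A dict comprehension fed to update is the more portable spelling of the same idea:
d.update({k: 10 for k in d if k[0] == "A" and k[1] == "A"})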
How about something like this, using the re module:
import re

for item in d.keys():
    if re.match(r"\('A', 'A', '[A-Z]'\)", str(item)):
        d[item] = 10
This is another method. It returns a list of None in the console, but updates the values as a side effect:
[d.update({y: 10}) for y in [x for x in d.keys() if re.match(r"\('A', 'A', '[A-Z]'\)", str(x))]]
