DataFrame to dictionary with different numbers of values per key - pandas / Python

I am trying to create a dictionary from a DataFrame where the key sometimes has multiple values.
For example:
df
ID  value
A      10
B      45
C      20
C      30
D      20
E      10
E      70
E     110
F      20
And I want the dictionary to look like:
dic = {'A': 10,
       'B': 45,
       'C': [20, 30],
       'D': 20,
       'E': [10, 70, 110],
       'F': 20}
I tried using the following code:
dic=df.set_index('ID').T.to_dict('list')
But it returned a dictionary with only one value per ID:
{'A': 10,
'B': 45,
'C': 30,
'D': 20,
'E': 110,
'F': 20}
I'm assuming the right way to go about it is with some kind of loop appending to an empty dictionary, but I'm not sure what the proper syntax would be.
My actual DataFrame is much longer than this, so what would I use to convert the DataFrame to the dictionary?
Thanks!

Using a small example dataframe:
df = pd.DataFrame({'ID': ['A', 'B', 'B'], 'value': [1, 2, 3]})
group the values into lists per ID, then zip the IDs and their lists back into a dictionary:
df_tmp = df.groupby('ID')['value'].apply(list).reset_index()
dict(zip(df_tmp['ID'], df_tmp['value']))
which outputs
{'A': [1], 'B': [2, 3]}
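Applied to the frame in the question, a minimal sketch along the same lines; the final dict comprehension, which unwraps single-element lists back to scalars, is only there to reproduce the exact mixed output shown above and is an assumption about what you want:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C', 'C', 'D', 'E', 'E', 'E', 'F'],
                   'value': [10, 45, 20, 30, 20, 10, 70, 110, 20]})

# group every ID's values into a list
grouped = df.groupby('ID')['value'].apply(list)

# unwrap single-element lists so lone values stay scalars
dic = {k: v[0] if len(v) == 1 else v for k, v in grouped.items()}
# {'A': 10, 'B': 45, 'C': [20, 30], 'D': 20, 'E': [10, 70, 110], 'F': 20}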

Related

Is it possible to recognize unique values within the keys in the dictionaries?

I have
df.shape
> (12702, 27)
df['x'][0]
>{'a': '123',
 'b': '214',
 'c': '654'}
I try:
df['x'].unique()
>TypeError: unhashable type: 'list'
is it possible to recognize unique values within the keys in the dictionaries?
or should I use dummies?
(Providing an answer similar to https://stackoverflow.com/a/12897477/15744261)
Edited:
Sorry for the multiple edits, I misunderstood your question.
From your snippet it looks like your df['x'] is returning a list of dictionaries. If what you're asking for is all the unique values across some of the dictionaries, you can combine the keys using list(my_dict) (which returns a list of a dict's keys) and then wrap the combined list in a set to keep only the unique values. Example:
values = set(list(df['x'][0]) + list(df['x'][1]) + ... )
If you need unique keys across all of these dictionaries, you could get a little more creative with list comprehension to compile all the keys and then wrap that in a set for the unique values.
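As a minimal sketch of that idea (a set comprehension rather than a list comprehension, and it assumes every entry of df['x'] is a plain dict):

# iterating a dict yields its keys, so this collects every key that appears in any row
unique_keys = {key for row in df['x'] for key in row}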
Old answer:
For a list you can simply convert to a set which will remove the duplicates:
values = set(df['x'][0])
If you want to use these values as a list you can convert that set into a list as well:
list_values = list(values)
Or in one line:
values = list(set(df['x'][0]))
Keep in mind, this is certainly not the most efficient way to do this. I'm sure there are better ways to do it if you're dealing with a large amount of data.
It seems that you want to find the unique keys across all the dictionaries in this column. This can be done easily with functools.reduce. I've generated some sample data:
import pandas as pd
import random
possible_keys = 'abcdefg'
df = pd.DataFrame({'x': [{key: 1 for key in random.choices(possible_keys, k=3)} for _ in range(10)]})
This dataframe looks like this:
x
0 {'c': 1, 'a': 1}
1 {'b': 1, 'd': 1, 'c': 1}
2 {'d': 1, 'b': 1}
3 {'b': 1, 'f': 1, 'e': 1}
4 {'a': 1, 'd': 1, 'c': 1}
5 {'g': 1, 'b': 1}
6 {'d': 1}
7 {'e': 1}
8 {'c': 1, 'd': 1, 'f': 1}
9 {'b': 1, 'a': 1, 'f': 1}
Now the actual meat of the answer:
from functools import reduce
reduce(set.union, df['x'], set())
Results in:
{'a', 'b', 'c', 'd', 'e', 'f', 'g'}

How to convert two columns' values into a key-value pair dictionary?

How can I convert the values from the columns of the DataFrame below into a key-value pair dictionary like {"a": 29, "b": 1042, "c": 2928, "d": 4492}?
event_type count
0 a 29
1 b 1042
2 c 2928
3 d 4492
Create a Series and convert it to a dict:
d = df.set_index('event_type')['count'].to_dict()
print (d)
{'a': 29, 'b': 1042, 'c': 2928, 'd': 4492}
One way is using zip:
dict(zip(*df.values.T))
# {'a': 29, 'b': 1042, 'c': 2928, 'd': 4492}
If the dataframe contains more than these two columns, zip the two relevant columns explicitly:
dict(zip(df['event_type'], df['count']))
# {'a': 29, 'b': 1042, 'c': 2928, 'd': 4492}
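A small sketch of why the explicit column version matters; the third column here is purely hypothetical, added for illustration:

import pandas as pd

df = pd.DataFrame({'event_type': ['a', 'b', 'c', 'd'],
                   'count': [29, 1042, 2928, 4492],
                   'extra': [0, 1, 2, 3]})   # hypothetical extra column

dict(zip(df['event_type'], df['count']))
# {'a': 29, 'b': 1042, 'c': 2928, 'd': 4492}

# dict(zip(*df.values.T)) would now feed 3-tuples into dict() and raise a ValueError,
# because unpacking the transposed array yields one iterable per column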

Python: merge rows of text with the same category

I'm new to Python, can anyone help me with this?
For example, I have a data frame:
data = pd.DataFrame({'a': [1,1,2,2,2,3], 'b': [12,22,23,34,44,55], 'c': ['a','','','','c',''], 'd': ['','b','b','a','a','']})
I want to group by a, ignore the differences in b, and merge the text in c and d, so the result is
data = {'a': [1,2,3], 'c': ['a','c',''], 'd': ['b','baa','']}
How can I do this?
Your question is a bit difficult to understand, but if I guess correctly, this could be a solution.
data = {'a': [1,1,2,2,2,3], 'b': [12,22,23,34,44,55], 'c':['a','','','','c',''], 'd':['','b','b','a','a','']}
new_data = {k: list(set(v)) for k, v in data.items()}
{'a': [1, 2, 3],
'b': [34, 12, 44, 55, 22, 23],
'c': ['', 'a', 'c'],
'd': ['', 'a', 'b']}
You need groupby + agg
data.groupby('a').agg({'b': 'sum', 'c': lambda x: ''.join(x), 'd': lambda x: ''.join(x)}).reset_index()
Out[54]:
   a    d  c    b
0  1    b  a   34
1  2  baa  c  101
2  3           55
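If you actually want the output shown in the question, with b dropped entirely, a minimal variation of the same groupby/agg idea (a sketch, not the answerer's code):

import pandas as pd

data = pd.DataFrame({'a': [1, 1, 2, 2, 2, 3],
                     'b': [12, 22, 23, 34, 44, 55],
                     'c': ['a', '', '', '', 'c', ''],
                     'd': ['', 'b', 'b', 'a', 'a', '']})

# group on a and concatenate the text columns; b is simply left out of the aggregation
merged = data.groupby('a', as_index=False).agg({'c': ''.join, 'd': ''.join})
#    a  c    d
# 0  1  a    b
# 1  2  c  baa
# 2  3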

Python deduplicate records - dedupe

I want to use https://github.com/datamade/dedupe to deduplicate some records in Python. Looking at their examples:
data_d = {}
for row in data:
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
    row_id = int(row['id'])
    data_d[row_id] = dict(clean_row)
the dictionary consumes quite a lot of memory compared to, e.g., a dictionary created by pandas out of a pd.DataFrame, or even a normal pd.DataFrame.
If this format is required, how can I efficiently convert a pd.DataFrame to such a dictionary?
Edit:
Example of what pandas generates:
{'column1': {0: 1389225600000000000,
1: 1388707200000000000,
2: 1388707200000000000,
3: 1389657600000000000,....
Example of what dedupe expects:
{'1': {'column1': 1389225600000000000, 'column2': "ddd"},
 '2': {'column1': 1111, 'column2': "ddd"}, ...}
It appears that df.to_dict(orient='index') will produce the representation you are looking for:
import pandas
data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']
df = pandas.DataFrame(data, columns=columns)
df.to_dict(orient='index')
results in
{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
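If dedupe really needs string row ids like the '1' and '2' in the snippet above (an assumption based on that example, not something the library necessarily requires), a small follow-up is to convert the keys:

data_d = {str(k): v for k, v in df.to_dict(orient='index').items()}
# {'0': {'a': 1, 'b': 2, 'c': 3}, '1': {'a': 4, 'b': 5, 'c': 6}}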
You can try something like this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [6,7,8,9,10]})
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
print(df.T.to_dict())
{0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}
This is the same output as in @chthonicdaemon's answer, so that answer is probably better. I am using pandas.DataFrame.T to transpose the index and columns.
A Python dictionary is not required; you just need an object that allows indexing by column name, i.e. row['col_name'].
So, assuming data is a pandas dataframe, you should just be able to do something like:
data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row
That said, the memory overhead of python dicts is not going to be where you have memory bottlenecks in dedupe.

How to retrieve and store multiple values from a Python DataFrame?

I have the following Dataframe that represents a From-To distance matrix between pairs of points. I have predetermined "trips" that visit specific pairs of points that I need to calculate the total distance for.
For example,
Trip 1 = [A:B] + [B:C] + [B:D] = 6 + 5 + 8 = 19
Trip 2 = [A:D] + [B:E] + [C:E] = 6 + 15 + 3 = 24
import pandas as pd
graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph).T
df.to_excel('file.xls')
I have many "trips" that I need to repeat this process for and then need to store the values in a row in a new Dataframe that I can export to excel. I know I can use df.at[A,'B'] to retrieve specific values in the Dataframe but how can retrieve multiple values, sum them, store in new Dataframe, and then repeat for the enxt trip.
Thank you in advance for any help or guidance,
I think if you don't transpose then maybe an unstack will help?
import pandas as pd
graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph)
df = df.unstack()
df.index.names = ['start','finish']
# a list of tuples to represent the trip(s)
trip1 = [('A','B'),('B','C'),('B','D')]
trip2 = [('A','D'),('B','E'),('C','E')]
trips = [trip1,trip2]
my_trips = {}
for trip in trips:
    my_trips[str(trip)] = df.loc[trip].sum()
distance_df = pd.DataFrame(my_trips,index=['distance']).T
distance_df
                                      distance
[('A', 'B'), ('B', 'C'), ('B', 'D')]        19
[('A', 'D'), ('B', 'E'), ('C', 'E')]        24
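An alternative sketch that stays closer to the question's own setup, using the transposed frame and .at lookups, then collecting each trip total as a row of a new DataFrame; the trip names and output file name are made up for illustration, and writing to Excel assumes an engine such as openpyxl is installed:

import pandas as pd

graph = {'A': {'A': 0, 'B': 6, 'C': 10, 'D': 6, 'E': 7},
         'B': {'A': 10, 'B': 0, 'C': 5, 'D': 8, 'E': 15},
         'C': {'A': 40, 'B': 30, 'C': 0, 'D': 9, 'E': 3}}
df = pd.DataFrame(graph).T    # rows are the "from" points, as in the question

trips = {'Trip 1': [('A', 'B'), ('B', 'C'), ('B', 'D')],
         'Trip 2': [('A', 'D'), ('B', 'E'), ('C', 'E')]}

# look up each leg with .at, sum per trip, and collect the totals in one new frame
totals = pd.DataFrame({'trip': list(trips),
                       'distance': [sum(df.at[a, b] for a, b in legs)
                                    for legs in trips.values()]})
totals.to_excel('trip_distances.xlsx', index=False)    # hypothetical output file name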
