Change pandas df to python dict

I would like to convert the following pandas DataFrame object to a Python dictionary, and then get the values for energy_emob_BB and energy_emob_BE.
emob_energy = pd.DataFrame(session.query(abbb_emob.id,
                                         abbb_emob.scenario,
                                         abbb_emob.region,
                                         abbb_emob.energy).filter(
                           abbb_emob.scenario == 'ES2030').all())
energy_emob_BB = float(emob_energy.query('region=="BB"')['energy'])
energy_emob_BE = float(emob_energy.query('region=="BE"')['energy'])
Here is some output of the DataFrame and the two values I need to get from it:
(Pdb) emob_energy
id scenario region energy
0 1 ES2030 BB 2183000.0
1 2 ES2030 BE 1298333.0
(Pdb) energy_emob_BB
2183000.0
(Pdb) energy_emob_BE
1298333.0
How would I make .query work with a Python dict?

pandas.DataFrame.to_dict renders the data as a dict of dicts by default.
df_dict = emob_energy.to_dict()
print(df_dict['energy'])  # {0: 2183000.0, 1: 1298333.0}
If you want them as a dict of lists:
df_dict = emob_energy.to_dict(orient='list')
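For the lookup the question actually needs, a dict keyed by region may be more direct. A minimal sketch, assuming the frame shown in the pdb output above:

import pandas as pd

# stand-in for the queried data, taken from the pdb output above
emob_energy = pd.DataFrame({'id': [1, 2],
                            'scenario': ['ES2030', 'ES2030'],
                            'region': ['BB', 'BE'],
                            'energy': [2183000.0, 1298333.0]})

# region -> energy as a plain dict; plain key lookups replace the .query calls
energy_by_region = emob_energy.set_index('region')['energy'].to_dict()
energy_emob_BB = energy_by_region['BB']  # 2183000.0
energy_emob_BE = energy_by_region['BE']  # 1298333.0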

Related

Nested dictionary with key: list[key:value] pairs to dataframe

I'm currently struggling to create a dataframe from a dictionary that is nested like {key1: [{key: value}, {key: value}, ...], key2: [{key: value}, {key: value}, ...]}.
I want this to go into a dataframe where the values of key1 and key2 are the index, while the nested key:value pairs in the lists become the column names and record values.
For each key1, key2, etc. the list of key:value pairs can differ in size. Example data:
some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
                                                  {'FileTime': '1362642335'},
                                                  {'Size': '1016439'},
                                                  {'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
                                                  {'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
                                                  {'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
                                                  {'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
             '000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
                                                  {'FileTime': '1384948248'},
                                                  {'Size': '873514'},
                                                  {'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
                                                  {'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
                                                  {'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
                                                  {'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
                                                  {'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
                                                  {'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}
The final result should look like this:
Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62
I've tried to parse the dict directly to pandas using a comprehension as suggested in Creating dataframe from a dictionary where entries have different lengths, and also tried to flatten the dict further before parsing it to pandas, as in Flatten nested dictionaries, compressing keys. Both to no avail.
Here you go.
You do not need the keys of the outer dict, because each FILE_ID is also available in the inner dicts.
You then need to merge each list of single-pair dicts into one dict; I did that with update.
Then we turn each merged dict into a pd.Series,
and concat the series into a dataframe.
In [39]: seriess = []
    ...: for values in some_dict.values():
    ...:     d = {}
    ...:     for thing in values:
    ...:         d.update(thing)
    ...:     s = pd.Series(d)
    ...:     seriess.append(s)
    ...:

In [40]: pd.concat(seriess, axis=1).T
Out[40]:
FILE_ID FileTime Size ... TITLE DatePosted Comment
0 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 ... 2096158F036B0F8ACF6F766A9B61A58B NaN NaN
1 000031EA51DA11E397D30050568F1BAB 1384948248 873514 ... DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
Let's try the following code:
dfs = []
for k in some_dict.keys():
    dfs.append(pd.DataFrame.from_records(some_dict[k]))
new_df = pd.concat(dfs, ignore_index=True)
final_result = (new_df
                .groupby(new_df['FILE_ID'].notna().cumsum())
                .first())
Output
FILE_ID FileTime Size DocType_Code Filenr DateRegistered TITLE DatePosted Comment
FILE_ID
1 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 AF3BD580734A77068DD083389AD7FDAF F682B798EC9481FF031C4C12865AEB9A FAC4F7F9C3217645C518D5AE473DCB1E 2096158F036B0F8ACF6F766A9B61A58B None None
2 000031EA51DA11E397D30050568F1BAB 1384948248 873514 28F404E9F3C394518AF2FD6A043D3A81 13A6A062672A88DE75C4D35917F3C415 8DD4262899F20DE45F09F22B3107B026 DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
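A more compact variant of the same flattening idea is a nested dict comprehension plus from_dict; a sketch, assuming some_dict as defined in the question:

# merge each list of single-pair dicts into one dict per FILE_ID
flat = {k: {kk: vv for d in v for kk, vv in d.items()}
        for k, v in some_dict.items()}

# orient='index' makes the outer keys the index; missing fields become NaN
df = pd.DataFrame.from_dict(flat, orient='index')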

Dataframe to dictionary, values came out scrambled

I have a dataframe that contains two columns that I would like to convert into a dictionary to use as a map.
I have tried multiple ways of converting, but my dictionary values always come out in the wrong order.
My python version is 3 and Pandas version is 0.24.2.
This is what the first few rows of my dataframe looks like:
geozip.head()
Out[30]:
Geoid ZIP
0 100100 36276
1 100124 36310
2 100460 35005
3 100460 35062
4 100460 35214
I would like my dictionary to look like this:
{100100: 36276,
100124: 36310,
100460: 35005,
100460: 35062,
100460: 35214,...}
But instead my output comes up with the wrong values for the keys:
{100100: 98520,
100124: 36310,
100460: 57520,
100484: 35540,
100676: 19018,
100820: 57311,
100988: 15483,
101132: 36861,...}
I tried this first but the dictionary came out unordered:
geozipmap = geozip.set_index('Geoid')['ZIP'].to_dict()
Then I tried converting the two columns into lists first and then to a dictionary, but the same problem occurred:
geoid = geozip.Geoid.tolist()
zipcode = geozip.ZIP.tolist()
geozipmap = dict(zip(geoid, zipcode))
I tried converting to OrderedDict and that didn't work either.
Then I've tried:
geozipmap = {k: v for k, v in zip(geoid, zipcode)}
I've also tried:
geozipmap = {}
for index, g in enumerate(geoid):
    geozipmap[geoid[index]] = zipcode[index]
I've also tried the answers suggested:
panda dataframe to ordered dictionary
None of these work. Really not sure what is going on?
Try defaultdict: a plain dict cannot hold duplicate keys (later values overwrite earlier ones), so if the same key has multiple values you can collect them in a list:
from collections import defaultdict

df = pd.DataFrame(data={"Geoid": [100100, 100124, 100460, 100460, 100460],
                        "ZIP": [36276, 36310, 35005, 35062, 35214]})
data_dict = defaultdict(list)
for k, v in zip(df['Geoid'], df['ZIP']):
    data_dict[k].append(v)
print(data_dict)
defaultdict(<class 'list'>, {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]})
Will this work for you?
dfG = df['Geoid'].values
dfZ = df['ZIP'].values
for g, z in zip(dfG, dfZ):
    print(str(g) + ':' + str(z))
This gives the output below (but as printed strings, not a dict):
100100:36276
100124:36310
100460:35005
100460:35062
100460:35214
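If a plain dict is still the goal, a groupby-based sketch over the same geozip frame collects every ZIP per Geoid without losing duplicates:

# Geoid -> list of ZIPs, e.g. {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]}
geozipmap = geozip.groupby('Geoid')['ZIP'].apply(list).to_dict()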

Cannot compare types 'ndarray(dtype=int64)' and 'int64'

I have a column in my dataframe with barcodes and created a dictionary to map barcodes to item ids.
I am creating a new column:
df['item_id'] = df['bar_code']
And a dictionary (built out of a second dataframe, imdb):
keys = (int(i) for i in imdb['bar_code'])
values = (int(i) for i in imdb['item_id'])
map_barcode = dict(zip(keys, values))
map_barcode (first 5 entries, e.g.):
{0: 1000159,
9000000017515: 11,
7792690324216: 16,
7792690324209: 20,
70942503334: 33}
And then I map the item ids with the dict:
df = df.replace({'item_id':map_barcode})
Here I am hoping to obtain the item ids in the column
(Going back to the dict examples:)
df['item_id'][0] = 1000159
df['item_id'][1] = 11
df['item_id'][2] = 16
df['item_id'][3] = 20
df['item_id'][4] = 33
But end up getting this error:
Cannot compare types 'ndarray(dtype=int64)' and 'int64'
I tried changing the types in the dictionary to np.int64:
keys = (np.int64(i) for i in imdb['bar_code'])
values = (np.int64(i) for i in imdb['item_id'])
map_barcode = dict(zip(keys, values))
But got the same error.
Is there anything I am missing here?
replace example
Firstly, I cannot reproduce your error. This works fine:
import pandas as pd

map_dict = {0: 1000159, 9000000017515: 11, 7792690324216: 16, 7792690324209: 20, 70942503334: 33}
df = pd.DataFrame({'item_id': [0, 7792690324216, 70942503334, 9000000017515, -1, 7792690324209]})
df = df.replace({'item_id': map_dict})
Result:
item_id
0 1000159
1 16
2 33
3 11
4 -1
5 20
Use map + fillna instead
Secondly, manually iterating Pandas series within generator expressions is relatively expensive. In addition, replace is inefficient when mapping via a dictionary.
In fact, creating a dictionary is not even necessary. There are optimized series-based methods for these tasks:
map_series = imdb[['bar_code', 'item_id']].astype(int).set_index('bar_code')['item_id']
df['item_id'] = df['item_id'].map(map_series).fillna(df['item_id'])
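A self-contained sketch of that map + fillna pattern, on made-up data (the frame names mirror the question):

import pandas as pd

imdb = pd.DataFrame({'bar_code': [9000000017515, 7792690324216],
                     'item_id': [11, 16]})
df = pd.DataFrame({'item_id': [9000000017515, 7792690324216, 123]})

map_series = imdb.set_index('bar_code')['item_id']
# matched bar codes are replaced; unmatched ones (123) keep their original value
df['item_id'] = df['item_id'].map(map_series).fillna(df['item_id'])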
See also:
Replace values in a pandas series via dictionary efficiently
This answer on why you shouldn't ideally use zip with NumPy arrays

How to implement a select-like function

I got a dataset in python and the structure of it is like
Tree Species      number of trunks
----------------------------------
Acer rubrum       1
Quercus bicolor   1
Quercus bicolor   1
aabbccdd          0
and I have a question of can I implement a function similar to
Select sum(number of trunks)
from trees.data['Number of Trunks']
where x = trees.data["Tree Species"]
group by trees.data["Tree Species"]
in python? x is an array containing five elements:
x = array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
           'Quercus rubra', 'Quercus bicolor'], dtype='<U16')
What I want to do is map each element in x to trees.data["Tree Species"] and calculate the sum of the number of trunks; it should return an array of
array = (sum_num(Acer rubrum), sum_num(Acer saccharum), sum_num(Acer saccharinum),
         sum_num(Quercus rubra), sum_num(Quercus bicolor))
Do you want to look at Python Pandas? That will allow you to do something like
df.groupby('Tree Species')['Number of Trunks'].sum()
Please note that df here is whatever variable name you read your data frame into. I would recommend you look at pandas and lambda functions too.
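To get the array aligned with x that the question asks for, one sketch (assuming df holds the tree data and x is the array from the question):

import numpy as np

x = np.array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
              'Quercus rubra', 'Quercus bicolor'])

sums = df.groupby('Tree Species')['Number of Trunks'].sum()
# align the sums to x; species absent from the data get 0
result = sums.reindex(x, fill_value=0).to_numpy()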
You can do something like this:
import pandas as pd
df = pd.DataFrame()
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1,1,1,0]
df["Tree Species"] = tree_species
df["Number of Trunks"] = no_of_trunks
df.groupby('Tree Species').sum() #This will create a pandas dataframe
df.groupby('Tree Species')['Number of Trunks'].sum() #This will create a pandas series.
You can do the same thing by just using dictionaries too:
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1, 1, 1, 0]
d = {}
for key, trunk in zip(tree_species, no_of_trunks):
    if key not in d:
        d[key] = 0
    d[key] += trunk
print(d)
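The same accumulation can be written a little more compactly with dict.get, which avoids the membership test:

d = {}
for key, trunk in zip(tree_species, no_of_trunks):
    d[key] = d.get(key, 0) + trunk
print(d)  # {'Acer rubrum': 1, 'Quercus bicolor': 2, 'aabbccdd': 0}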

Turning a collections Counter into a dictionary

I have a collection outcome resulting from the function:
Counter(df.email_address)
it returns each individual email address with the count of its repetitions.
Counter({nan: 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
What I want to do is use it as if it were a dictionary and create a pandas dataframe out of it, with two columns: one for the email addresses and one for the associated counts.
I tried with:
dfr = repeaters.from_dict(repeaters, orient='index')
but i got the following error:
AttributeError: 'Counter' object has no attribute 'from_dict'
It makes me think that Counter is not the dictionary it looks like. Any idea on how to get it into a df?
d = {}
cnt = Counter(df.email_address)
for key, value in cnt.items():
    d[key] = value
EDIT
Or, as @Trif Nefzger suggested:
d = dict(Counter(df.email_address))
As ajcr wrote in the comments, from_dict is a method that belongs to DataFrame, so you can write the following to achieve your goal:
from collections import Counter
import pandas as pd

repeaters = Counter({"nan": 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
dfr = pd.DataFrame.from_dict(repeaters, orient='index')
print(dfr)
Output:
testorders@worldstores.co.uk       1
nan                             1618
store@kiddicare.com              265
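In pandas 0.23+ you can also name the value column directly in the same call:

dfr = pd.DataFrame.from_dict(repeaters, orient='index', columns=['count'])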
Alternatively you could use pd.Series.value_counts, which returns a Series object.
df.email_address.value_counts(dropna=False)
Sample output:
b@y.com    2
a@x.com    1
NaN        1
dtype: int64
This is not exactly what you asked for but looks like what you'd like to achieve.
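To get the two-column layout the question describes (addresses and counts as columns rather than an index), one sketch built on value_counts:

dfr = (df.email_address
         .value_counts(dropna=False)
         .rename_axis('email_address')
         .reset_index(name='count'))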
