Dataframe to dictionary, values came out scrambled - python

I have a dataframe that contains two columns that I would like to convert into a dictionary to use as a map.
I have tried multiple ways of converting, but my dictionary values always come up in the wrong order.
My python version is 3 and Pandas version is 0.24.2.
This is what the first few rows of my dataframe look like:
geozip.head()
Out[30]:
    Geoid    ZIP
0  100100  36276
1  100124  36310
2  100460  35005
3  100460  35062
4  100460  35214
I would like my dictionary to look like this:
{100100: 36276,
100124: 36310,
100460: 35005,
100460: 35062,
100460: 35214,...}
But instead my output came up with the wrong order for the values:
{100100: 98520,
100124: 36310,
100460: 57520,
100484: 35540,
100676: 19018,
100820: 57311,
100988: 15483,
101132: 36861,...}
I tried this first but the dictionary came out unordered:
geozipmap = geozip.set_index('Geoid')['ZIP'].to_dict()
Then I tried converting the two columns into lists first and then converting to a dictionary, but the same problem occurred:
geoid = geozip.Geoid.tolist()
zipcode = geozip.ZIP.tolist()
geozipmap = dict(zip(geoid, zipcode))
I tried converting to OrderedDict and that didn't work either.
Then I've tried:
geozipmap = {k: v for k, v in zip(geoid, zipcode)}
I've also tried:
geozipmap = {}
for index, g in enumerate(geoid):
    geozipmap[geoid[index]] = zipcode[index]
I've also tried the answers suggested:
panda dataframe to ordered dictionary
None of these work. I'm really not sure what is going on.

Try defaultdict; if the same key has multiple values, you can collect them in a list:
from collections import defaultdict
import pandas as pd

df = pd.DataFrame(data={"Geoid": [100100, 100124, 100460, 100460, 100460],
                        "ZIP": [36276, 36310, 35005, 35062, 35214]})
data_dict = defaultdict(list)
for k, v in zip(df['Geoid'], df['ZIP']):
    data_dict[k].append(v)
print(data_dict)
defaultdict(<class 'list'>, {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]})
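If you prefer to stay in pandas, a groupby one-liner on the same df gives an equivalent result (a minimal sketch):
geozipmap = df.groupby('Geoid')['ZIP'].apply(list).to_dict()
print(geozipmap)  # {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]}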

Will this work for you?
dfG = df['Geoid'].values
dfZ = df['ZIP'].values
for g, z in zip(dfG, dfZ):
    print(str(g) + ':' + str(z))
This gives the output below (but the values are strings):
100100:36276
100124:36310
100460:35005
100460:35062
100460:35214
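If you need an actual mapping rather than printed strings, a minimal sketch along the same lines, collecting the duplicate Geoids into lists:
geozipmap = {}
for g, z in zip(dfG, dfZ):
    # setdefault creates an empty list the first time a Geoid is seen
    geozipmap.setdefault(int(g), []).append(int(z))
print(geozipmap)  # {100100: [36276], 100124: [36310], 100460: [35005, 35062, 35214]}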

Related

Python - Get the top 5 items from dictionary type column pandas dataframe

I have a dataframe in which one of the columns is a dictionary. I'm getting a huge number of items inside that dictionary, which is causing memory problems. The solution was to keep only the first 10 items from that dictionary. I already have the code, but it gives an error:
TypeError: '<' not supported between instances of 'dict' and 'dict'
I made a sample code just to show you my problem:
import pandas as pd
import datetime
res = pd.DataFrame([])
res_tmp = pd.DataFrame([])
d = {'club': ['A1', 'B1'], 'score': [3, 4]}
df = pd.DataFrame(data=d)
for index, row in df.iterrows():
    total = int(row['score']) * -1
    res_tmp = res_tmp.append({'today': str(datetime.datetime.now()), 'total': total}, ignore_index=True)
    res = res.append({'club': row['club'], 'details': res_tmp.to_dict('dict')}, ignore_index=True)
res['details'] = res['details'].apply(lambda y: (sorted(y.items(), key=lambda x: x[1]))[:1])
What am I doing wrong? Note: in the example I have just two rows, which is why I take the top 1 instead of the top 10.
Thanks!
As the error message tells you, there is no defined ordering for dicts. If you want to sort the dicts, you must provide a key function that defines the sort order. You extracted the values, but you also have to convert each dict to some type that does have < defined. For instance:
key = lambda x: list(x[1].values())
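For instance, with comparable inner values, sorting a dict of dicts with that key might look like this (a toy example, not your actual data):
details = {'a': {0: 4}, 'b': {0: 2}, 'c': {0: 9}}
top1 = sorted(details.items(), key=lambda x: list(x[1].values()))[:1]
print(top1)  # [('b', {0: 2})]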

Formatting Multiple Columns in a Pandas Dataframe

I have a dataframe I'm working with that has a large number of columns, and I'm trying to format them as efficiently as possible. I have a bunch of columns that all end in .pct that need to be formatted as percentages, some that end in .cost that need to be formatted as currency, etc.
I know I can do something like this:
cost_calc.style.format({'c.somecolumn.cost' : "${:,.2f}",
'c.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",
'e.somecolumn.cost' : "${:,.2f}",...
and format each column individually, but I was hoping there was a way to do something similar to this:
cost_calc.style.format({'*.cost' : "${:,.2f}",
'*.pct' : "{:,.2%}",...
Any ideas? Thanks!
The first way doesn't seem bad if you can automatically build that dictionary... you can generate a list of all columns fitting the *.cost description with something like
costcols = [x for x in df.columns.values if x[-5:] == '.cost']
then build your dict like:
formatdict = {}
for costcol in costcols: formatdict[costcol] = "${:,.2f}"
then as you suggested:
cost_calc.style.format(formatdict)
You can easily add the .pct cases similarly. Hope this helps!
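For completeness, the .pct columns could be handled the same way (a sketch reusing the formatdict built above and the percentage format from your question):
pctcols = [x for x in cost_calc.columns.values if x[-4:] == '.pct']
for pctcol in pctcols: formatdict[pctcol] = "{:,.2%}"
cost_calc.style.format(formatdict)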
I would use regex with dict comprehensions:
import re
mylist = cost_calc.columns
r = re.compile(r'.*cost')
cost_cols = {key: "${:,.2f}" for key in mylist if r.match(key)}
r = re.compile(r'.*pct')
pct_cols = {key: "{:,.2%}" for key in mylist if r.match(key)}
cost_calc.style.format({**cost_cols, **pct_cols})
note: the {**cost_cols, **pct_cols} merge requires Python 3.5 or later

How to implement a select-like function

I have a dataset in Python and its structure is like this:
Tree Species       number of trunks
-----------------------------------
Acer rubrum        1
Quercus bicolor    1
Quercus bicolor    1
aabbccdd           0
and I would like to know whether I can implement a function similar to
Select sum(number of trunks)
from trees.data['Number of Trunks']
where x = trees.data["Tree Species"]
group by trees.data["Tree Species"]
in Python? x is an array containing five elements:
x = array(['Acer rubrum', 'Acer saccharum', 'Acer saccharinum',
'Quercus rubra', 'Quercus bicolor'], dtype='<U16')
What I want to do is map each element in x to trees.data["Tree Species"] and calculate the sum of the number of trunks; it should return an array of
array = (sum_num(Acer rubrum), sum_num(Acer saccharum), sum_num(Acer saccharinum),
         sum_num(Quercus rubra), sum_num(Quercus bicolor))
Do you want to look at Python pandas? That will allow you to do something like
df.groupby('Tree Species')['Number of Trunks'].sum()
Please note that here df is whatever variable name you read your data frame into. I would recommend looking at pandas and lambda functions too.
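To get the sums back in the order of your x array (a sketch, assuming df holds your data; species missing from the data come back as 0):
sums = df.groupby('Tree Species')['Number of Trunks'].sum()
result = sums.reindex(x, fill_value=0).to_numpy()  # one sum per species in x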
You can do something like this:
import pandas as pd
df = pd.DataFrame()
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1,1,1,0]
df["Tree Species"] = tree_species
df["Number of Trunks"] = no_of_trunks
df.groupby('Tree Species').sum() #This will create a pandas dataframe
df.groupby('Tree Species')['Number of Trunks'].sum() #This will create a pandas series.
You can do the same thing by just using dictionaries too:
tree_species = ["Acer rubrum", "Quercus bicolor", "Quercus bicolor", "aabbccdd"]
no_of_trunks = [1,1,1,0]
d = {}
for key, trunk in zip(tree_species, no_of_trunks):
    if key not in d:
        d[key] = 0
    d[key] += trunk
print(d)
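To produce the array in the order of your x from that dictionary (a small sketch; species not present in the data count as 0):
result = [d.get(species, 0) for species in x]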

Turning a collections Counter into a dictionary

I have a collections Counter result coming from the call:
Counter(df.email_address)
It returns each individual email address with the count of its repetitions.
Counter({nan: 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
What I want to do is use it as if it were a dictionary and create a pandas dataframe out of it with two columns, one for the email addresses and one for the associated counts.
I tried with:
dfr = repeaters.from_dict(repeaters, orient='index')
but i got the following error:
AttributeError: 'Counter' object has no attribute 'from_dict'
It makes me think that Counter is not a dictionary as it appears to be. Any idea on how to turn it into a df?
d = {}
cnt = Counter(df.email_address)
for key, value in cnt.items():
    d[key] = value
EDIT
Or, as @Trif Nefzger suggested:
d = dict(Counter(df.email_address))
as ajcr wrote at the comment, from_dict is a method that belongs to dataframe and thus you can write the following to achieve your goal:
from collections import Counter
import pandas as pd
repeaters = Counter({"nan": 1618, 'store@kiddicare.com': 265, 'testorders@worldstores.co.uk': 1})
dfr = pd.DataFrame.from_dict(repeaters, orient='index')
print(dfr)
Output:
testorders@worldstores.co.uk     1
nan                           1618
store@kiddicare.com            265
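If you want the email addresses as a regular column rather than the index, a small follow-up sketch:
dfr = dfr.reset_index()
dfr.columns = ['email_address', 'count']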
Alternatively you could use pd.Series.value_counts, which returns a Series object.
df.email_address.value_counts(dropna=False)
Sample output:
b@y.com    2
a@x.com    1
NaN        1
dtype: int64
This is not exactly what you asked for but looks like what you'd like to achieve.
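If you do want the two-column dataframe you described from that Series, one possible sketch:
counts = df.email_address.value_counts(dropna=False)
dfr = counts.rename_axis('email_address').reset_index(name='count')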

How to replace comma with dash using python pandas?

I have a file like this:
name|count_dic
name1 |{'x1':123,'x2,bv.':435,'x3':4}
name2|{'x2,bv.':435,'x5':98}
etc.
I am trying to load the data into a dataframe and count the number of keys in the count_dic. The problem is that the dict items are separated by commas and some of the keys also contain commas. I am looking for a way to replace the commas in the keys with '-' and then be able to separate the different key,value pairs in the count_dic. Something like this:
name|count_dic
name1 |{'x1':123,'x2-bv.':435,'x3':4}
name2|{'x2-bv.':435,'x5':98}
etc.
This is what I have done.
df = pd.read_csv('file', names=['name', 'count_dic'], delimiter='|')
data = json.loads(df.count_dic)
and I get the following error:
TypeError: the JSON object must be str, not 'Series'
Does anybody have any suggestions?
You can use ast.literal_eval as a converter when loading the dataframe, as it appears your data is more Python-dict-like than JSON (JSON uses double quotes), e.g.:
import pandas as pd
import ast
df = pd.read_csv('file', delimiter='|', converters={'count_dic': ast.literal_eval})
Gives you a DF of:
name count_dic
0 name1 {'x2,bv.': 435, 'x3': 4, 'x1': 123}
1 name2 {'x5': 98, 'x2,bv.': 435}
Since count_dic now actually holds dicts, you can apply len to get the number of keys, e.g.:
df.count_dic.apply(len)
Results in:
0 3
1 2
Name: count_dic, dtype: int64
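If you still want the commas inside the keys replaced with '-', as in your title, one way once the column holds real dicts (a sketch):
df.count_dic = df.count_dic.apply(lambda d: {k.replace(',', '-'): v for k, v in d.items()})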
Once df is defined as above (with count_dic still holding the raw strings):
# get a value to play around with
td = df.iloc[0].count_dic
td
# that looks like a dict definition... evaluate it?
eval(td)
eval(td).keys()  # yup!
# apply to the whole df (wrapped in list() so it also works in Python 3, where map is lazy)
df.count_dic = list(map(eval, df.count_dic))
# and a hint towards your key-counting
list(map(lambda i: i.keys(), df.count_dic))
