How to replace empty values with reference to another dataframe? - python

I have 2 data frames. One is a reference table with two columns: code and name. The other one holds lists of dictionaries. The second data frame has code filled in, but some names are empty strings. I am thinking of performing 2 for loops to get to each dictionary, but I am new to this and unsure how to get the value from the reference table.
Started with something like this:
for i in sample:
    for j in i:
        if j['name'] == '':
            (j['code'])
I am unsure how to proceed with the code. I think there is a very simple way to do this with the .map() function. Can someone help?
Reference table: [image]
Edit needed table: [image]

It seems to me that in this particular case you're using Pandas only to work with Python data structures. If that's the case, it would make sense to ditch Pandas altogether and just use Python data structures - usually, it results in more idiomatic and readable code that often performs better than Pandas with dtype=object.
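For illustration, here is a minimal pure-Python sketch of that idea (assuming the reference data is small enough to hold in a plain dict keyed by code, and that sample is a list of lists of dicts as in your loop):

name_by_code = {8: 'Human development',
                1: 'Economic management',
                6: 'Social protection and risk management'}

sample = [[{'code': 8, 'name': ''}],
          [{'code': 1, 'name': ''}],
          [{'code': 6, 'name': ''}]]

for row in sample:
    for d in row:
        if not d['name']:
            # Look the missing name up in the reference dict.
            d['name'] = name_by_code[d['code']]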
In any case, here's the code:
import pandas as pd

sample_name = pd.DataFrame(dict(code=[8, 1, 6],
                                name=['Human development',
                                      'Economic management',
                                      'Social protection and risk management']))
# We just need a Series.
sample_name = sample_name.set_index('code')['name']

sample = pd.Series([[dict(code=8, name='')],
                    [dict(code=1, name='')],
                    [dict(code=6, name='')]])

def fix_dict(d):
    # Fill in the name from the reference Series when it is empty.
    if not d['name']:
        d['name'] = sample_name.at[d['code']]
    return d

def fix_dicts(dicts):
    return [fix_dict(d) for d in dicts]

result = sample.map(fix_dicts)
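A quick sanity check on the result (given the sample data above):

print(result[0])  # -> [{'code': 8, 'name': 'Human development'}]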

Related

Removing nested brackets from a pandas dataframe?

So I am trying to convert a .mat file into a dataframe in order to run some data analysis. After converting it, I have a dataframe structure (see 1), but I have no idea how to remove the brackets from the objects in the dataframe. I have tried utilizing:
mdataframe['0'] = mdataframe['0'].str[0]
and
mdataframe['0'] = mdataframe['0'].str.get(0)
in an attempt to fix the 0th column, to no avail. Any help and guidance would be appreciated.
Thank you!
Thank you for your question. It is indeed a very interesting subject.
Personally, I have never seen a problem like yours; nevertheless, it is quite straightforward to solve your DataFrame conversion problem.
First of all, you need two steps:
1. Have a squashing function that will be applied to each entry of your table (i.e., DataFrame). This function must act like a dimensionality reducer. Since we don't know how many dimensions to expect in each cell, the function has to be capable of calling itself (a recursive function).
2. Apply the squashing function to each entry of the table, and return the converted table.
Therefore, by following steps 1 and 2, I have created a code snippet that generates a DataFrame similar to your example and squashes its cells accordingly.
Code Snippet
import numpy as np
import pandas as pd
from typing import Any, List
from numbers import Number

def generateDataFrameWithNestedListsInItsCells() -> pd.DataFrame:
    df = pd.DataFrame.from_records([[[["a"]]], [["b"]], [[["c"]]], [["b"]]])
    return df

def squashList(element: List[Number]) -> Any:
    # Flatten the cell into a 1-D object array.
    array = np.ravel(np.asanyarray(element, dtype=object))
    # If the first entry is still a list, keep squashing recursively.
    if isinstance(array[0], list):
        return squashList(array[0])
    return array[0]

if __name__ == "__main__":
    df = generateDataFrameWithNestedListsInItsCells()
    df2 = df.applymap(squashList)
Notice that the df instance has your nested lists, while its converted form (i.e., df2) has the correct entries.
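For instance, printing the converted frame should give:

print(df2)
#    0
# 0  a
# 1  b
# 2  c
# 3  b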
I hope that this example helps you in your research.
Sincerely,

Storing a dataframe in dictionary, weird output in dictionary

I have a function that returns a dataframe to my main. I am trying to store these dataframes in a dictionary, in order to retrieve them again later.
When I run this:
sa_wp5 = get_SA_WP5_value('testfile.txt')
template_dict["SAWP5Country Name"] = sa_wp5
my output looks like the following:
{'SAWP5Country Name': 1 2
0 Australia 047}
where I would rather the output just be the variable itself containing the dataframe.
What am I doing wrong here?
Nothing is wrong here. It's just a matter of formatting, due to the default __str__() output of a DataFrame object. If the output looks messy, try printing your dict this way:
for key, df in template_dict.items():
    print("%s:" % key)
    print(df.to_string())
    print("-------")
You can use Bunch to store all sorts of objects for easy retrieval.
from sklearn.datasets.base import Bunch
Then create a variable using the Bunch() constructor:
a = Bunch(df1 = df_template.copy(), df2 = df_other_df.copy())
Then you can simply call them as such:
a.df1
a.df1['col1']
df = a.df2
etc.
It's really effective for storage of objects.

How to convert Multilevel Dictionary with Irregular Data to Desired Format

Dict = {'Things' : {'Car':'Lambo', 'Home':'NatureVilla', 'Gadgets':{'Laptop':{'Programs':{'Data':'Excel', 'Officework': 'Word', 'Coding':{'Python':'PyCharm', 'Java':'Eclipse', 'Others': 'SublimeText'}, 'Wearables': 'SamsungGear', 'Smartphone': 'Nexus'}, 'clothes': 'ArmaaniSuit', 'Bags':'TravelBags'}}}}
d = {(i, j, k, l, m, n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()
     }
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception and pass it, but that results in a SyntaxError. So maybe I have lost direction. Is there any other module or way I can index this irregular data and put it in a table somehow? I have a chunk of raw data like this.
What I am trying to do:
I want to use this data for printing in a QTableView, which is from PyQt5 (I am making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought till now:
Maybe I can append all this data to MySQL. But when this data updates from the API, only the values will change; the keys will stay the same. And it will require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any help will be appreciated. Thanks for reading the question.
Your data is not actually a uniform 6-level dictionary like the one in the 3-level example you referenced. The difference is that your dictionary has data on multiple different levels: e.g. the value 'Lambo' is on the second level of the hierarchy, with key ('Things', 'Car'), while 'Eclipse' is on the sixth level, with key ('Things', 'Gadgets', 'Laptop', 'Programs', 'Coding', 'Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key positions at deeper levels for shallow values like 'Lambo'.
By the way, maybe this is not actually a solution to your problem; perhaps you need a more appropriate UI widget, like a TreeView, to work with this kind of hierarchical data. But I will try to directly address your exact question.
Unfortunately, there seems to be no easy way to reference all the different level values uniformly in one simple dict or list comprehension statement.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): there are no values of i, j, k, l, m, n that allow you to get 'Lambo', because to get 'Lambo' you just use Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is to extract the second-level data, then the third-level data, and so on, and combine them together.
E.g. to extract second level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
                 not isinstance(Dict[k1][k2], dict)}
Later you can combine it all together with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is to use some recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
           sum([flatten_dict(v, key_prefix + [k], max_deep)
                for k, v in d.items() if isinstance(v, dict)], [])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
I actually get this result with your data: [image]
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that may be what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
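For example, a minimal sketch (raw here is a hypothetical variable holding the JSON text returned by the API):

import json
import pandas as pd

data = json.loads(raw)         # parse the JSON text into plain Python dicts/lists
df = pd.json_normalize(data)   # flatten nested keys into dotted column names

(In older pandas versions, json_normalize lives in pandas.io.json rather than at the top level.)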

How to feed array of user_ids to flickr.people.getInfo()?

I have been working on extracting flickr users' locations (not latitude and longitude, but the person's country) by using their user_ids. I have made a dataframe (here's the dataframe) consisting of photo id, owner and a few other columns. My attempt was to feed each owner to a flickr.people.getInfo() query by iterating over the owner column of the dataframe. Here is my attempt:
for index, row in df.iterrows():
    A = np.array(df["owner"])
    for i in range(len(A)):
        B = flickr.people.getInfo(user_id=A[i])
Unfortunately, it returns only 1 result. After careful examination I've found that it belongs to the last user in the dataframe. My dataframe has 250 observations. I don't know how I could extract the others.
Any help is appreciated.
It seems like you forgot to store the results while iterating over the dataframe. I haven't used the API, but I think this snippet should do it:
result_dict = {}
for idx, owner in df['owner'].iteritems():
    result_dict[owner] = flickr.people.getInfo(user_id=owner)
The results are stored in a dictionary where the user id is the key.
EDIT:
Since it is JSON, you can use the read_json function to parse the result.
Example:
import json

result_list = []
for idx, owner in df['owner'].iteritems():
    result_list.append(pd.read_json(json.dumps(flickr.people.getInfo(user_id=owner)),
                                    orient='index'))
    # You may have to set the orient parameter.
    # Options are: 'split', 'records', 'index'. Default is 'index'.
Note: I switched the dictionary to a list, since it is more convenient.
Afterwards you can concatenate the resulting pandas Series together like this:
df = pd.concat(result_list, axis=1).transpose()
I added the transpose() since you probably want the ID as the index.
Afterwards you should be able to sort by the column 'location'.
Hope that helps.
The canonical way to achieve that is to use an apply. It will be much more efficient.
import pandas as pd
import numpy as np

np.random.seed(0)

# A function to simulate the call to the API
def get_user_info(id):
    return np.random.randint(id, id + 10)

# Some test data
df = pd.DataFrame({'id': [0, 1, 2], 'name': ['Pierre', 'Paul', 'Jacques']})

# Here the call is made for each ID
df['info'] = df['id'].apply(get_user_info)

#    id     name  info
# 0   0   Pierre     5
# 1   1     Paul     1
# 2   2  Jacques     5
Note, another way to write the same thing is
df['info'] = df['id'].map(lambda x: get_user_info(x))
Before calling the method, have the following lines first.
from flickrapi import FlickrAPI
flickr = FlickrAPI(FLICKR_KEY, FLICKR_SECRET, format='parsed-json')

What is a proper idiom in pandas for creating dataframes from the output of an apply function on a df?

Edit --- I've made some progress and discovered the drop_duplicates method in pandas, which saves me from the custom duplicate-removal functions I created.
This changes the question in a couple of ways, because it changes my initial requirements.
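(For reference, a minimal drop_duplicates sketch; the 'link' column name here is hypothetical:)

# Drop rows that share the same entry link, keeping the first occurrence.
df = df.drop_duplicates(subset=['link'])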
One of the operations I need to conduct is grabbing the latest feed entries --- the feed urls exist in a column in a data frame. Once I've done the apply, I get feed objects back:
import pandas as pd
import feedparser
import datetime
df_check_feeds = pd.DataFrame({'account_name':['NYTimes', 'WashPo'],'feed_url':['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml', 'http://feeds.washingtonpost.com/rss/homepage'], 'last_update':['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = pd.DataFrame(df_check_feeds.feed_url.apply(lambda feed_url: feedparser.parse(feed_url)))
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So now I'm stuck with the feed entries in the "entries" column. I'd like to create two new data frames in one apply method and concatenate the two frames immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    df_temp['account_name'] = df_check_feeds.ix[index, 'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)
df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that), but I feel there is a better, more succinct, pandas-idiomatic way of writing this statement.
A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
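Note that groupby/apply moves account_name into the resulting index; if you want it back as a regular column, a follow-up like this should work:

df_total_results = df_total_results.reset_index(level='account_name')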
