Filling in blank dictionary values based on other key value pairs - python

I have a df that contains a column ['mjtheme_namecode'] which is in dictionary form containing a code and a name. The codes all have numbers but some of the names are missing. I would like to fill in the missing name values based on other pairs with the same code. Here is the df column in question:
import pandas as pd
import json
import numpy as np
from pandas.io.json import json_normalize
df = pd.read_json('data/world_bank_projects.json')
print(df['mjtheme_namecode'].head(15))
0 [{'code': '8', 'name': 'Human development'}, {...
1 [{'code': '1', 'name': 'Economic management'},...
2 [{'code': '5', 'name': 'Trade and integration'...
3 [{'code': '7', 'name': 'Social dev/gender/incl...
4 [{'code': '5', 'name': 'Trade and integration'...
5 [{'code': '6', 'name': 'Social protection and ...
6 [{'code': '2', 'name': 'Public sector governan...
7 [{'code': '11', 'name': 'Environment and natur...
8 [{'code': '10', 'name': 'Rural development'}, ...
9 [{'code': '2', 'name': 'Public sector governan...
10 [{'code': '10', 'name': 'Rural development'}, ...
11 [{'code': '10', 'name': 'Rural development'}, ...
12 [{'code': '4', 'name': ''}]
13 [{'code': '5', 'name': 'Trade and integration'...
14 [{'code': '6', 'name': 'Social protection and ...
Name: mjtheme_namecode, dtype: object
I know I could make the column a separate df and then ffill, but I think I would have to reindex, so I don't think I could put it back in place after that. I'm thinking ideally I'd make a list (with no duplicates) of only dict items with both codes and names then use that list to iterate over the dictionary in a for loop where name becomes the matching value from the non-duplicate list I created. Does this make sense? Not sure how to go about it.

You can take a similar approach of creating a new DataFrame, but then transition back:
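# Flatten every {'code': ..., 'name': ...} dict into one row of a helper DataFrame
theme = pd.DataFrame([val for pair in df['mjtheme_namecode'].values for val in pair])
# Drop blank names and duplicates, leaving a code -> name lookup dict
mapper = theme.drop_duplicates().replace('', np.nan).dropna().set_index('code').name.to_dict()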
Using a list comprehension to put it all together:
s = pd.Series(
    [[{'code': i['code'], 'name': mapper[i['code']]}
      for i in t] for t in df.mjtheme_namecode]
)
s.head(13)
0 [{'code': '8', 'name': 'Human development'}, {...
1 [{'code': '1', 'name': 'Economic management'},...
2 [{'code': '5', 'name': 'Trade and integration'...
3 [{'code': '7', 'name': 'Social dev/gender/incl...
4 [{'code': '5', 'name': 'Trade and integration'...
5 [{'code': '6', 'name': 'Social protection and ...
6 [{'code': '2', 'name': 'Public sector governan...
7 [{'code': '11', 'name': 'Environment and natur...
8 [{'code': '10', 'name': 'Rural development'}, ...
9 [{'code': '2', 'name': 'Public sector governan...
10 [{'code': '10', 'name': 'Rural development'}, ...
11 [{'code': '10', 'name': 'Rural development'}, ...
12 [{'code': '4', 'name': 'Financial and private ...
dtype: object
As you can see, the last row (row 12) has been correctly filled in, as have the others, and you can reassign this to your original DataFrame.
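The same lookup can also be built without the intermediate DataFrame, which is close to what the question describes. The following is only a minimal sketch: `rows` is a hypothetical stand-in for `df['mjtheme_namecode'].tolist()`, and it assumes every code has a non-empty name somewhere in the data (codes that never do keep their blank name).

```python
# Hypothetical sample standing in for df['mjtheme_namecode'].tolist()
rows = [
    [{'code': '8', 'name': 'Human development'}, {'code': '10', 'name': ''}],
    [{'code': '1', 'name': ''}],
    [{'code': '1', 'name': 'Economic management'},
     {'code': '10', 'name': 'Rural development'}],
]

# code -> name, built only from entries whose name is not blank
mapper = {d['code']: d['name'] for row in rows for d in row if d['name']}

# Rebuild every row; fall back to the existing name if a code was never named
filled = [
    [{'code': d['code'], 'name': mapper.get(d['code'], d['name'])} for d in row]
    for row in rows
]
print(filled[1])
```

Because `filled` keeps the original order and length, it could be put back with something like `df['mjtheme_namecode'] = pd.Series(filled, index=df.index)`.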

Related

Merging two lists of dictionaries using common id key

I have the following two lists of dictionaries:
list_1 = [
{'id': '1', 'name': 'Johnny Johson1'},
{'id': '2', 'name': 'Johnny Johson2'},
{'id': '1', 'name': 'Johnny Johson1'},
{'id': '3', 'name': 'Johnny Johson3'},
]
list_2 = [
{'id': '1', 'datetime': '2020-01-06T12:30:00.000Z'},
{'id': '2', 'datetime': '2020-01-06T14:00:00.000Z'},
{'id': '1', 'datetime': '2020-01-06T15:30:00.000Z'},
{'id': '3', 'datetime': '2020-01-06T15:30:00.000Z'},
]
Essentially, I would like no loss of data even on duplicate IDs, as they represent different events (there is a separate ID for that, but it is not needed to demonstrate the problem). If an ID appears in one list but not in the other, disregard that ID altogether.
Ideally, I would like to end up with the following (from the amalgamation of the two lists):
list_3 = [
{'id': '1', 'name': 'Johnny Johson1', 'datetime': '2020-01-06T12:30:00.000Z'},
{'id': '2', 'name': 'Johnny Johson2', 'datetime': '2020-01-06T14:00:00.000Z'},
{'id': '1', 'name': 'Johnny Johson1', 'datetime': '2020-01-06T15:30:00.000Z'},
{'id': '3', 'name': 'Johnny Johson3', 'datetime': '2020-01-06T15:30:00.000Z'},
]
You can use the following list comprehension, which applies the double-asterisk dictionary unpacking syntax to pairs of elements obtained with zip(). This has the effect of combining the two dictionaries into one. Note that zip() pairs elements by position, so this assumes both lists are in the same order.
list_3 = [{**x, **y} for x, y in zip(list_1, list_2)]
Output:
>>> list_3
[{'id': '1', 'name': 'Johnny Johson1', 'datetime': '2020-01-06T12:30:00.000Z'},
{'id': '2', 'name': 'Johnny Johson2', 'datetime': '2020-01-06T14:00:00.000Z'},
{'id': '1', 'name': 'Johnny Johson1', 'datetime': '2020-01-06T15:30:00.000Z'},
{'id': '3', 'name': 'Johnny Johson3', 'datetime': '2020-01-06T15:30:00.000Z'}]
Note that this approach requires at least Python 3.5.
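If the two lists are not guaranteed to line up by position, or some IDs exist in only one of them, a pairing by ID may be safer. The sketch below is one possible way to do that, not part of the original answer. It assumes the k-th occurrence of an ID in list_1 belongs with the k-th occurrence of that ID in list_2; the entries with IDs '4' and '5' are hypothetical, added to show unmatched IDs being dropped.

```python
from collections import defaultdict, deque

list_1 = [
    {'id': '1', 'name': 'Johnny Johson1'},
    {'id': '2', 'name': 'Johnny Johson2'},
    {'id': '1', 'name': 'Johnny Johson1'},
    {'id': '3', 'name': 'Johnny Johson3'},
    {'id': '4', 'name': 'Only in list_1'},  # hypothetical unmatched entry
]
list_2 = [
    {'id': '5', 'datetime': '2020-01-06T09:00:00.000Z'},  # hypothetical unmatched entry
    {'id': '1', 'datetime': '2020-01-06T12:30:00.000Z'},
    {'id': '2', 'datetime': '2020-01-06T14:00:00.000Z'},
    {'id': '1', 'datetime': '2020-01-06T15:30:00.000Z'},
    {'id': '3', 'datetime': '2020-01-06T15:30:00.000Z'},
]

# Queue the list_2 events per id, preserving their order of appearance
pending = defaultdict(deque)
for event in list_2:
    pending[event['id']].append(event)

list_3 = []
for person in list_1:
    queue = pending.get(person['id'])
    if queue:  # skip ids with no remaining partner in list_2
        list_3.append({**person, **queue.popleft()})

print(list_3)
```

Events left over in `pending` (here the one with ID '5') are simply never emitted.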

Web scraping using Beautifulsoup to collect dropdown values

I am new to Python, trying to get a list of all the drop down values from the following website "https://www.sfma.org.sg/member/category" but failing to do so.
The below code is producing an empty list
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re
import pandas as pd
page = "https://www.sfma.org.sg/member/category"
information = requests.get(page)
soup = BeautifulSoup(information.content, 'html.parser')
categories = soup.find_all('select', attrs={'class' :'w3-select w3-border'})
The desired output is the below list :-
['Alcoholic Beverage','Beer','Bottled Beverage',..........,'Trader','Wholesaler']
Thanks !!
The options are loaded through JavaScript, but the data is embedded in the page source. With some crude regexes you can extract it:
import re
import json
import requests
url = 'https://www.sfma.org.sg/member/category/'
text = requests.get(url).text
d = re.findall(r'var\s*cObject\s*=\s*(.*)\s*;', text)[0]
d = re.sub(r'(\w+)(?=:)', r'"\1"', d)
d = json.loads(d.replace("'", '"'))
from pprint import pprint
pprint(d, width=200)
Prints:
{'category': [{'cat_type': '1', 'id': '1', 'name': 'Alcoholic Beverage', 'permalink': 'alcoholic-beverage', 'status': '2'},
{'cat_type': '1', 'id': '2', 'name': 'Beer', 'permalink': 'beer', 'status': '2'},
{'cat_type': '1', 'id': '3', 'name': 'Bottled Beverage', 'permalink': 'bottled-beverage', 'status': '2'},
{'cat_type': '1', 'id': '4', 'name': 'Canned Beverage', 'permalink': 'canned-beverage', 'status': '2'},
{'cat_type': '1', 'id': '5', 'name': 'Carbonated Beverage', 'permalink': 'carbonated-beverage', 'status': '2'},
{'cat_type': '1', 'id': '6', 'name': 'Cereal / Grain Beverage', 'permalink': 'cereal-grain-beverage', 'status': '2'},
{'cat_type': '1', 'id': '7', 'name': 'Cider', 'permalink': 'cider', 'status': '2'},
{'cat_type': '1', 'id': '8', 'name': 'Coffee', 'permalink': 'coffee', 'status': '2'},
{'cat_type': '1', 'id': '9', 'name': 'Distilled Water', 'permalink': 'distilled-water', 'status': '2'},
{'cat_type': '1', 'id': '10', 'name': 'Fruit / Vegetable Juice', 'permalink': 'fruit-vegetable-juice', 'status': '2'},
{'cat_type': '1', 'id': '11', 'name': 'Herbal Beverage', 'permalink': 'herbal-beverage', 'status': '2'},
{'cat_type': '1', 'id': '12', 'name': 'Instant Beverage', 'permalink': 'instant-beverage', 'status': '2'},
{'cat_type': '1', 'id': '13', 'name': 'Milk', 'permalink': 'milk', 'status': '2'},
{'cat_type': '1', 'id': '14', 'name': 'Mineral Water', 'permalink': 'mineral-water', 'status': '2'},
...and so on.
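To see what each of the three steps does without hitting the network, here is a small offline sketch. The `sample_html` string is a hypothetical stand-in shaped like the output printed above (with fewer keys), not the real page source.

```python
import json
import re

# Hypothetical stand-in for the page source
sample_html = """<script>
var cObject = {category: [{id: '1', name: 'Alcoholic Beverage'}, {id: '2', name: 'Beer'}]};
</script>"""

# 1. Capture everything between "var cObject =" and the last ";" on that line
raw = re.findall(r'var\s*cObject\s*=\s*(.*)\s*;', sample_html)[0]
# 2. Wrap bare keys (a word directly followed by ":") in double quotes
quoted = re.sub(r'(\w+)(?=:)', r'"\1"', raw)
# 3. JSON requires double-quoted strings, so swap the single quotes
data = json.loads(quoted.replace("'", '"'))

names = [c['name'] for c in data['category']]
print(names)
```

This is why the regexes are "crude": a value containing a word followed by a colon (such as a URL) or an apostrophe would be mangled by steps 2 or 3 and make json.loads fail.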
EDIT: To print just names of categories, you can do this:
for c in d['category']:
    print(c['name'])
Prints:
Alcoholic Beverage
Beer
Bottled Beverage
Canned Beverage
Carbonated Beverage
Cereal / Grain Beverage
Cider
...
Manufacturer
Restaurant
Retail Outlet
Supplier
Trader
Wholesaler
This is not really a proper question but still.
categories = soup.find("select", attrs={"name": "ctype"}).find_all('option')
result = [cat.get_text() for cat in categories]

Python with Json, If Statement

I have the JSON data below and a list.
I want a for loop or if statement that does the following:
if label in selected_size:
fsize = id
selected_size[]
in selected size:
[7, 7.5, 4, 4.5]
in json:
removed
print(json_data)
for size in json_data:
if ['label'] in select_size:
fsize = ['id']
print(fsize)
I have no idea how to do it.
json_data is a list of dicts, so loop over the list and index each dict by key. The labels are strings, so convert them to float before comparing them with the numbers in select_size, for example:
json_data = [{'id': '91', 'label': '10.5', 'price': '0', 'oldPrice': '0', 'products': ['81278']}, {'id': '150', 'label': '9.5', 'price': '0', 'oldPrice': '0', 'products': ['81276']}, {'id': '28', 'label': '4', 'price': '0', 'oldPrice': '0', 'products': ['81270']}, {'id': '29', 'label': '5', 'price': '0', 'oldPrice': '0', 'products': ['81271']}, {'id': '22', 'label': '8', 'price': '0', 'oldPrice': '0', 'products': ['81274']}, {'id': '23', 'label': '9', 'price': '0', 'oldPrice': '0', 'products': ['81275']}, {'id': '24', 'label': '10', 'price': '0', 'oldPrice': '0', 'products': ['81277']}, {'id': '25', 'label': '11', 'price': '0', 'oldPrice': '0', 'products': ['81279']}, {'id': '26', 'label': '12', 'price': '0', 'oldPrice': '0', 'products': ['81280']}]
fsize = []
select_size = [7, 7.5, 4, 4.5]
select_size = [float(i) for i in select_size]  # convert all select_size values to float
for size in json_data:
    if float(size['label']) in select_size:  # labels are strings, so convert before comparing
        fsize.append(size['id'])  # collect the matching id
print(fsize)  # prints ['28'], since only the label '4' matches

Join on non-unique second id - Python

I am trying to join one list of dictionaries to another. I have two keys: one that is unique and another that is not. I want to join information on the non-unique key and leave all information as it is on the unique key, i.e. the number of unique IDs has to stay the same.
Any ideas to how I can achieve this?
This is the first dictionary:
names = [
{'id': '1', 'name': 'Peter', 'category_id': '25'},
{'id': '2', 'name': 'Jim', 'category_id': '20'},
{'id': '3', 'name': 'Toni', 'category_id': '20'}
]
This is the second dictionary:
categories = [
{'category_id': '25', 'level': 'advanced'},
{'category_id': '20', 'level': 'beginner'}
]
And this is what I am trying to achieve:
all = [
{'id': '1', 'name': 'Peter', 'category_id': '25', 'level': 'advanced'},
{'id': '2', 'name': 'Jim', 'category_id': '20', 'level': 'beginner'},
{'id': '3', 'name': 'Toni', 'category_id': '20', 'level': 'beginner'}
]
EDIT:
names = [
{'id': '1', 'name': 'Peter', 'category_id': '25'},
{'id': '2', 'name': 'Jim', 'category_id': '20'},
{'id': '3', 'name': 'Toni', 'category_id': '20'}
]
categories = [
{'category_id': '25', 'level': 'advanced'},
{'category_id': '20', 'level': 'beginner'}
]
def merge_lists(l1, l2, key):
    merged = {}
    for item in l1+l2:
        if item[key] in merged:
            merged[item[key]].update(item)
        else:
            merged[item[key]] = item
    return merged.values()
courses = merge_lists(names, categories, 'category_id')
print(courses)
gives:
([{'id': '1', 'name': 'Peter', 'category_id': '25', 'level': 'advanced'},
{'id': '3', 'name': 'Toni', 'category_id': '20', 'level': 'beginner'}])
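merge_lists keys its result on category_id, so Jim and Toni, who share category '20', collapse into a single entry. Instead, create a mapping from category_id to the additional field(s), then combine the dictionaries in a loop, e.g.: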
cat = {d["category_id"]: d for d in categories}
res = []
for name in names:
    x = name.copy()
    x.update(cat[name["category_id"]])
    res.append(x)
In Python 3.5+ you can use the cool new syntax:
cat = {d["category_id"]: d for d in categories}
res = [{**name, **cat[name["category_id"]]} for name in names]
Consider what you really want to do: add the level associated with each category to the names dict. So first, create a mapping from the categories to the associated levels:
cat_dict = {d['category_id']: d['level'] for d in categories}
It's then a trivial transformation on each dict in the names list:
for d in names:
    d['level'] = cat_dict[d['category_id']]
The resulting names list is:
[{'category_id': '25', 'id': '1', 'level': 'advanced', 'name': 'Peter'},
{'category_id': '20', 'id': '2', 'level': 'beginner', 'name': 'Jim'},
{'category_id': '20', 'id': '3', 'level': 'beginner', 'name': 'Toni'}]
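Both answers raise a KeyError if a name refers to a category_id that is missing from categories, and the second one modifies names in place. If either matters, a variant along these lines may help. It is a sketch only: the entry for 'Ana' with category '99' is hypothetical, added to show the missing-category case, and using None as the fallback is an assumption about what you want there.

```python
names = [
    {'id': '1', 'name': 'Peter', 'category_id': '25'},
    {'id': '2', 'name': 'Jim', 'category_id': '20'},
    {'id': '3', 'name': 'Toni', 'category_id': '20'},
    {'id': '4', 'name': 'Ana', 'category_id': '99'},  # hypothetical: no such category
]
categories = [
    {'category_id': '25', 'level': 'advanced'},
    {'category_id': '20', 'level': 'beginner'},
]

# category_id -> level lookup
level_by_category = {c['category_id']: c['level'] for c in categories}

# Build new dicts so `names` is left untouched; unknown categories get None
merged = [
    {**n, 'level': level_by_category.get(n['category_id'])}
    for n in names
]
print(merged)
```

The result always has exactly one entry per item in names, so the number of unique IDs is preserved.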

Extracting complex data from a DataFrame

I have to analyze some complex data which is in a Pandas DataFrame. I am not aware of the exact structure of the data inside the DataFrame. I have pulled the data from a JSON file and used head() to look at the top-level data.
If I want to extract the group, manufacturer, or nutrients into a separate DataFrame, how should I go about doing that so that I can do some statistical analysis?
import json
import pandas as pd
with open("nutrients.json") as f:
    # the file holds one JSON object per line
    objects = [json.loads(line) for line in f]
df = pd.DataFrame(objects)
print(df.head())
group manufacturer \
0 Dairy and Egg Products
1 Dairy and Egg Products
2 Dairy and Egg Products
3 Dairy and Egg Products
4 Dairy and Egg Products
meta \
0 {'langual': [], 'nitrogen_factor': '6.38', 're...
1 {'langual': [], 'nitrogen_factor': '6.38', 're...
2 {'langual': [], 'nitrogen_factor': '6.38', 're...
3 {'langual': [], 'nitrogen_factor': '6.38', 're...
4 {'langual': [], 'nitrogen_factor': '6.38', 're...
name \
0 {'long': 'Butter, salted', 'sci': '', 'common'...
1 {'long': 'Butter, whipped, with salt', 'sci': ...
2 {'long': 'Butter oil, anhydrous', 'sci': '', '...
3 {'long': 'Cheese, blue', 'sci': '', 'common': []}
4 {'long': 'Cheese, brick', 'sci': '', 'common':...
nutrients \
0 [{'code': '203', 'value': '0.85', 'units': 'g'...
1 [{'code': '203', 'value': '0.85', 'units': 'g'...
2 [{'code': '203', 'value': '0.28', 'units': 'g'...
3 [{'code': '203', 'value': '21.40', 'units': 'g...
4 [{'code': '203', 'value': '23.24', 'units': 'g...
portions
0 [{'g': '227', 'amt': '1', 'unit': 'cup'}, {'g'...
1 [{'g': '151', 'amt': '1', 'unit': 'cup'}, {'g'...
2 [{'g': '205', 'amt': '1', 'unit': 'cup'}, {'g'...
3 [{'g': '28.35', 'amt': '1', 'unit': 'oz'}, {'g...
4 [{'g': '132', 'amt': '1', 'unit': 'cup, diced'...
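One possible way to get the nested nutrients into a flat, analyzable shape is to build one row per nutrient and carry the parent's group and manufacturer along. The sketch below is an illustration under assumptions: the two sample records are hypothetical stand-ins for `objects`, using only keys visible in the output above, and the blank manufacturer is assumed to be an empty string.

```python
# Hypothetical stand-in for the `objects` list loaded from nutrients.json
objects = [
    {'group': 'Dairy and Egg Products', 'manufacturer': '',
     'nutrients': [{'code': '203', 'value': '0.85', 'units': 'g'}]},
    {'group': 'Dairy and Egg Products', 'manufacturer': '',
     'nutrients': [{'code': '203', 'value': '21.40', 'units': 'g'}]},
]

# One flat row per nutrient, tagged with the parent's group and manufacturer
nutrient_rows = [
    {'group': obj['group'], 'manufacturer': obj['manufacturer'], **nutrient}
    for obj in objects
    for nutrient in obj['nutrients']
]

# The values are strings in the source data, so convert before doing statistics
values = [float(row['value']) for row in nutrient_rows]
print(nutrient_rows[0], sum(values) / len(values))
```

Passing `nutrient_rows` to `pd.DataFrame` would then give a separate DataFrame with one nutrient per row, ready for grouping by group or code.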
