I have a data structure that looks something like this:
my_data = [
('Continent1','Country1','State1'),
('Continent1','Country1','State2'),
('Continent1','Country2','State1'),
('Continent1','Country2','State2'),
('Continent1','Country2','State3','City1',11111)
]
The input is not limited to State; it can be narrowed down further, following the hierarchy
Continent ==> Country ==> State ==> City ==> Zip (with State, City and Zip being optional fields).
I wish to convert it to JSON like the following, containing only the fields supplied in the input:
{
"Regions": [{
"Continent": "Continent1",
"Country": "Country1",
"State": "state1"
}, {
"Continent": "Continent1",
"Country": "Country1",
"State": "state2"
}, {
"Continent": "Continent1",
"Country": "Country2",
"State": "state1"
}, {
"Continent": "Continent1",
"Country": "Country2",
"State": "state2"
}, {
"Continent": "Continent1",
"Country": "Country2",
"State": "state3",
"City": "City1",
"zip": "11111"
}]
}
Any pseudocode or approach that supports rows with a varying number of fields would be appreciated.
keys = ["Continent", "Country", "State", "City", "Zip"]
transformed_data = {
"Regions": [dict(zip(keys, row)) for row in my_data]
}
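For reference, here is a self-contained sketch of the zip approach above, serialized with json.dumps. It relies on zip stopping at the shorter of its two arguments, so a three-element row only produces the Continent, Country and State keys. Note the assumptions: the key name "Zip" and the integer zip code come from the keys list and the sample input, whereas the expected output in the question uses "zip" and a string.

```python
import json

my_data = [
    ('Continent1', 'Country1', 'State1'),
    ('Continent1', 'Country1', 'State2'),
    ('Continent1', 'Country2', 'State1'),
    ('Continent1', 'Country2', 'State2'),
    ('Continent1', 'Country2', 'State3', 'City1', 11111),
]

keys = ["Continent", "Country", "State", "City", "Zip"]

# zip() stops at the shorter sequence, so missing optional fields are simply omitted
transformed_data = {
    "Regions": [dict(zip(keys, row)) for row in my_data]
}

print(json.dumps(transformed_data, indent=4))
```

If the zip code has to be a string, it could be converted with str() before serializing.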
I'm trying to parse this JSON object into multiple events, as that is the expected input for an ETL tool. I know this is quite straightforward with loops, if statements and explicitly defined search fields for each event. That method is not feasible here because I have multiple heavily nested JSON objects, and I would prefer to let Python recursion handle the heavy lifting. The following is a sample object, which consists of strings, a list and dicts (it covers most use cases in the data I have).
{
"event_name": "restaurants",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
And I want to convert it to the following list of dictionaries:
[
{
"event_name": "restaurants",
"properties": {
"restaurant_id": "41009112",
"name": "Mangal Kebab Turkish Restaurant",
"cuisine": "Turkish",
"_id": "5a9909384309cf90b5739342",
"borough": "Queens"
}
},
{
"event_name": "restaurant_address",
"properties": {
"zipcode": "11104",
"ref_id": "41009112",
"street": "Queens Boulevard",
"building": "4620"
}
},
{
"event_name": "restaurant_address_coord",
"ref_id": "41009112",
"0": -73.9180155,
"1": 40.7427742
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1414540800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "0"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1397692800000,
"ref_id": "41009112",
"score": 10,
"grade": "A",
"index": "1"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1381276800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "2"
}
}
]
Most importantly, these events will be broken up into independent structured tables, and to conduct joins we need to create primary keys / unique identifiers. So each deeply nested dictionary should carry its parent's id as a ref_id field. In this case ref_id = restaurant_id from its parent dictionary.
Most of the examples on the internet flatten the whole object so it can be normalized into a dataframe, but to utilise this ETL tool to its full potential it would be ideal to solve this problem via recursion and output a list of dictionaries.
This is what one might call a brute force method. Create a translator function to move each item into the correct part of the new structure (like a schema).
# input dict
d = {
"event_name": "demo",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
def convert_structure(d: dict):
    '''Convert the input dict to the new structure.'''
    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template
    # return a list
    return [e]

# run & print
x = convert_structure(d)
print(x)
The result (for the part done) looks like this:
[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]
If a pattern is identified, then the above could be improved...
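One pattern that could be used is recursion: scalar values stay in the current event, and every nested dict or list item becomes its own event that carries the parent's id as ref_id. The following is only a sketch under assumptions, not a definitive implementation: the function name split_events and the id_key parameter are hypothetical, child event names are built by joining the parent name and the key (so they come out as "restaurants_address" rather than the "restaurant_address" shown in the question), and list items that are not dicts are ignored.

```python
def split_events(name, props, id_key="restaurant_id", ref_id=None, index=None):
    """Recursively split a nested dict into a flat list of event dicts."""
    flat = {}
    children = []
    for key, value in props.items():
        if isinstance(value, (dict, list)):
            children.append((key, value))   # handled after this level's event
        else:
            flat[key] = value               # scalar stays in this event
    if ref_id is not None:
        flat["ref_id"] = ref_id
    if index is not None:
        flat["index"] = str(index)
    events = [{"event_name": name, "properties": flat}]

    # Children reference this level's own id if it has one, else the inherited id
    child_ref = props.get(id_key, ref_id)
    for key, value in children:
        child_name = f"{name}_{key}"
        if isinstance(value, dict):
            events.extend(split_events(child_name, value, id_key, child_ref))
        else:
            for i, item in enumerate(value):
                if isinstance(item, dict):  # non-dict list items are skipped in this sketch
                    events.extend(split_events(child_name, item, id_key, child_ref, i))
    return events


sample = {
    "event_name": "restaurants",
    "properties": {
        "restaurant_id": "41009112",
        "name": "Mangal Kebab Turkish Restaurant",
        "address": {
            "building": "4620",
            "coord": {"0": -73.9180155, "1": 40.7427742},
        },
        "grades": [
            {"grade": "A", "score": 12},
            {"grade": "A", "score": 10},
        ],
    },
}

events = split_events(sample["event_name"], sample["properties"])
for event in events:
    print(event)
```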
I want to exchange values between two JSON documents, but their keys are different from each other. I don't know how to transfer the values between them.
sample json1: A
{
"contact_person":"Mahmut Kapur",
"contact_people": [
{
"email": "m#gmail.com",
"last_name": "Kapur"
}
],
"addresses": [
{
"city": "istanbul",
"country": "CA",
"first_name": "Mahmut",
"street1": "adres 1",
"zipcode": "34678",
"id": "5f61f72b8348230004f149fd"
}
],
"created_at": "2020-09-16T07:29:47.244-04:00",
"updated_at": "2020-09-16T07:32:50.567-04:00"
}
sample json2: B
In this example, each value is the key (or key path) to look up in JSON A.
{
"Customer":{
"DisplayName":"contact_person",
"PrimaryEmailAddr":{
"Address":"contact_people/email"
},
"FamilyName":"contact_people/last_name",
"BillAddr":{
"City":"addresses/city",
"CountrySubDivisionCode":"addresses/country",
"Line1":"addresses/street1",
"PostalCode":"addresses/zipcode",
"Id":"addresses/id"
},
"GivenName":"addresses/first_name",
"MetaData":{
"CreateTime":"created_at",
"LastUpdatedTime":"updated_at"
}
}
}
The outcome needs to be:
{
"Customer":{
"DisplayName":"Mahmut Kapur",
"PrimaryEmailAddr":{
"Address":"m#gmail.com"
},
"FamilyName":"Kapur",
"BillAddr":{
"City":"istanbul",
"CountrySubDivisionCode":"CA",
"Line1":"adres 1",
"PostalCode":"34678",
"Id":"5f61f72b8348230004f149fd"
},
"GivenName":"Mahmut",
"MetaData":{
"CreateTime":"2020-09-16T07:29:47.244-04:00",
"LastUpdatedTime":"2020-09-16T07:32:50.567-04:00"
}
}
}
So the important thing here is to match the keys. I hope I was able to explain my problem.
This code does the job. Someone may be able to make it shorter. It basically walks the nested dicts in b down to the leaf level and replaces each leaf with the value it points to in a.
a={
"contact_person":"Mahmut Kapur",
"contact_people": [
{
"email": "m#gmail.com",
"last_name": "Kapur"
}
],
"addresses": [
{
"city": "istanbul",
"country": "CA",
"first_name": "Mahmut",
"street1": "adres 1",
"zipcode": "34678",
"id": "5f61f72b8348230004f149fd"
}
],
"created_at": "2020-09-16T07:29:47.244-04:00",
"updated_at": "2020-09-16T07:32:50.567-04:00",
}
b={
"Customer":{
"DisplayName":"contact_person",
"PrimaryEmailAddr":{
"Address":"contact_people/email"
},
"FamilyName":"contact_people/last_name",
"BillAddr":{
"City":"addresses/city",
"CountrySubDivisionCode":"addresses/country",
"Line1":"addresses/street1",
"PostalCode":"addresses/zipcode",
"Id":"addresses/id"
},
"GivenName":"addresses/first_name",
"MetaData":{
"CreateTime":"created_at",
"LastUpdatedTime":"updated_at"
}
}
}
import json

for keys in b:
    if isinstance(b[keys], dict):
        for items in b[keys]:
            if isinstance(b[keys][items], dict):
                for leaf in b[keys][items]:
                    if "/" in b[keys][items][leaf]:
                        # "list_key/field": take the field from the first list element in a
                        getter = b[keys][items][leaf].split("/")
                        b[keys][items][leaf] = a[getter[0]][0][getter[1]]
                    else:
                        b[keys][items][leaf] = a[b[keys][items][leaf]]
            else:
                if "/" in b[keys][items]:
                    getter = b[keys][items].split("/")
                    b[keys][items] = a[getter[0]][0][getter[1]]
                else:
                    b[keys][items] = a[b[keys][items]]
    else:
        if "/" in b[keys]:
            getter = b[keys].split("/")
            b[keys] = a[getter[0]][0][getter[1]]
        else:
            b[keys] = a[b[keys]]

print(json.dumps(b, indent=4))
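The three nested loops above repeat the same lookup at each depth, so the idea can possibly be shortened with recursion. The sketch below is one way to do it under a few assumptions: the function name resolve is hypothetical, every leaf in the template is a string, a "list_key/field" path always refers to the first element of the list, and a new dict is returned instead of modifying b in place.

```python
def resolve(template, source):
    """Return a copy of template with every leaf replaced by its value from source."""
    if isinstance(template, dict):
        # Recurse into nested dicts, however deep they go
        return {key: resolve(value, source) for key, value in template.items()}
    if "/" in template:
        # "list_key/field": take the field from the first element of the list
        list_key, field = template.split("/", 1)
        return source[list_key][0][field]
    return source[template]


source = {
    "contact_person": "Mahmut Kapur",
    "contact_people": [{"email": "m#gmail.com", "last_name": "Kapur"}],
    "created_at": "2020-09-16T07:29:47.244-04:00",
}
template = {
    "Customer": {
        "DisplayName": "contact_person",
        "PrimaryEmailAddr": {"Address": "contact_people/email"},
        "MetaData": {"CreateTime": "created_at"},
    }
}

result = resolve(template, source)
print(result)
```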
Currently I am using ParseHub to scrape some basic data about a list of countries; the JSON file for this can be seen below. I also want to scrape the current time of each country, which means going to another website where such information can be found, but the list of countries on that website is in a completely different order, meaning each country would end up with the incorrect time.
Is there a way I can scrape the time of each country and have it appended to the correct country's JSON object, or am I thinking about this the wrong way?
country.json
{
"country": [
{
"name": "China",
"pop": "1,438,801,917",
"area": "9,706,961 km²",
"growth": "0.39%",
"worldPer": "18.47%",
"rank": "1"
},
{
"name": "India",
"pop": "1,378,687,736",
"area": "3,287,590 km²",
"growth": "0.99%",
"worldPer": "17.70%",
"rank": "2"
},
{
"name": "United States",
"pop": "330,812,025",
"area": "9,372,610 km²",
"growth": "0.59%",
"worldPer": "4.25%",
"rank": "3"
}
]
}
time.json
{
"country": [
{
"name": "china",
"time": "18:36"
}
]
}
How would I go about adding this data to the China object in country.json?
Try this:
import json

with open('country.json') as f1, open('time.json') as f2:
    country = json.load(f1)
    time = json.load(f2)

# index the countries by lower-cased name so the lookup ignores case
country = {x['name'].lower(): x for x in country['country']}
for y in time['country']:
    if y['name'].lower() in country:
        country[y['name'].lower()]['time'] = y['time']

country = {'country': list(country.values())}
with open('country.json', 'w') as fw:
    json.dump(country, fw)
Output:
country.json
{
"country": [
{
"name": "China",
"pop": "1,438,801,917",
"area": "9,706,961 km²",
"growth": "0.39%",
"worldPer": "18.47%",
"rank": "1",
"time": "18:36"
},
{
"name": "India",
"pop": "1,378,687,736",
"area": "3,287,590 km²",
"growth": "0.99%",
"worldPer": "17.70%",
"rank": "2"
},
{
"name": "United States",
"pop": "330,812,025",
"area": "9,372,610 km²",
"growth": "0.59%",
"worldPer": "4.25%",
"rank": "3"
}
]
}
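The same name-based merge can be tried without any files. This is a minimal in-memory sketch; the function name merge_times is hypothetical, and it assumes that country names only differ by letter case between the two sources. Entries without a matching time are left unchanged, and the original order of the country list is preserved.

```python
def merge_times(countries, times):
    """Attach each time entry to the country with the same name, ignoring case."""
    by_name = {c["name"].lower(): c for c in countries}
    for entry in times:
        match = by_name.get(entry["name"].lower())
        if match is not None:
            match["time"] = entry["time"]  # updates the dict inside countries
    return countries


countries = [
    {"name": "China", "rank": "1"},
    {"name": "India", "rank": "2"},
]
times = [
    {"name": "china", "time": "18:36"},
    {"name": "atlantis", "time": "00:00"},  # no match, so it is ignored
]

merged = merge_times(countries, times)
print(merged)
```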
How can we convert this to a dataframe? I have tried multiple ways to achieve it. I tried it with a JSON file from W3Schools and that works correctly. I am new to Python; any recommendations on this?
The JSON format is:
[
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
The expected output is :
I tried to do a conversion, but I am receiving an error and it does not return the whole data:
with open('data/history.city.list.json') as f:
    data = json.load(f)
But I am not able to load it as data. This is what I have tried:
_id = []
country = []
coord_lat = []
coord_lon = []
counter = 0
for i in data:
    _id.append(data[counter]['id'])
    country.append(data[counter]['city']['country'])
    coord_lat.append(data[counter]['city']['coord']['lon'])
    coord_lat.append(data[counter]['city']['coord']['lat'])
    counter += 1
When I tried to print it as a dataframe:
df = pd.DataFrame({'Longtitude' : coord_lat , 'Latitude' : coord_lat})
df.head(10)
This produces a dataframe, but as soon as I add 'Country' to pd.DataFrame(), it raises ValueError: arrays must all be same length.
I understand that the country column does not match the other columns in length, but can we achieve this, and is there a simpler way to do it?
You can use json_normalize() as described here:
import pandas as pd
d = [
{
"id": 14256,
"city": {
"id": {
"$numberLong": "14256"
},
"name": "Azadshahr",
"findname": "AZADSHAHR",
"country": "IR",
"coord": {
"lon": 48.570728,
"lat": 34.790878
},
"zoom": {
"$numberLong": "10"
}
}
},
{
"id": {
"$numberLong": "465726"
},
"city": {
"id": {
"$numberLong": "465726"
},
"name": "Zadonsk",
"findname": "ZADONSK",
"country": "RU",
"coord": {
"lon": 38.926102,
"lat": 52.3904
},
"zoom": {
"$numberLong": "16"
}
}
}
]
pd.json_normalize(d)  # in pandas versions before 1.0 this was pd.io.json.json_normalize(d)
Output:
id city.id.$numberLong city.name city.findname city.country city.coord.lon city.coord.lat city.zoom.$numberLong id.$numberLong
0 14256.0 14256 Azadshahr AZADSHAHR IR 48.570728 34.790878 10 NaN
1 NaN 465726 Zadonsk ZADONSK RU 38.926102 52.390400 16 465726
The column names do not match your expected output, but you can change that easily with df.columns = ['Id', 'city', ... 'Zoom']
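As for the ValueError in the question: the loop appends both lon and lat to coord_lat, so that list ends up twice as long as country, and coord_lon stays empty. A sketch that avoids parallel lists altogether is to build one dict per record. The helper names unwrap and to_rows are hypothetical, and the column names are only a guess at the expected output, which is not shown. The resulting list of dicts can be passed directly to pd.DataFrame(rows).

```python
def unwrap(value):
    """Return an int for {"$numberLong": "..."} wrappers, otherwise the value unchanged."""
    if isinstance(value, dict) and "$numberLong" in value:
        return int(value["$numberLong"])
    return value


def to_rows(data):
    """Build one flat dict per record, so every column has the same length."""
    rows = []
    for record in data:
        city = record["city"]
        rows.append({
            "Id": unwrap(record["id"]),
            "Country": city["country"],
            "Longitude": city["coord"]["lon"],
            "Latitude": city["coord"]["lat"],
        })
    return rows


data = [
    {"id": 14256,
     "city": {"country": "IR", "coord": {"lon": 48.570728, "lat": 34.790878}}},
    {"id": {"$numberLong": "465726"},
     "city": {"country": "RU", "coord": {"lon": 38.926102, "lat": 52.3904}}},
]

rows = to_rows(data)
print(rows)
```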
I have a large list of transactions that I want to categorize.
It looks like this:
transactions: [
{
"id": "20200117-16045-0",
"date": "2020-01-17",
"creationTime": null,
"text": "SuperB Vesterbro T 74637",
"originalText": "SuperB Vesterbro T 74637",
"details": null,
"category": null,
"amount": {
"value": -160.45,
"currency": "DKK"
},
"balance": {
"value": 12572.68,
"currency": "DKK"
},
"type": "Card",
"state": "Booked"
},
{
"id": "20200117-4800-0",
"date": "2020-01-17",
"creationTime": null,
"text": "Rent 45228",
"originalText": "Rent 45228",
"details": null,
"category": null,
"amount": {
"value": -48.00,
"currency": "DKK"
},
"balance": {
"value": 12733.13,
"currency": "DKK"
},
"type": "Card",
"state": "Booked"
},
{
"id": "20200114-1200-0",
"date": "2020-01-14",
"creationTime": null,
"text": "Superbest 86125",
"originalText": "SUPERBEST 86125",
"details": null,
"category": null,
"amount": {
"value": -12.00,
"currency": "DKK"
},
"balance": {
"value": 12781.13,
"currency": "DKK"
},
"type": "Card",
"state": "Booked"
}
]
I loaded in the data like this:
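import json
import pandas as pd
from pandas import json_normalize

def load_transactions():  # function name assumed; the original snippet only showed the body
    with open('transactions.json') as transactions:
        file = json.load(transactions)
    data = json_normalize(file)['transactions'][0]
    return pd.DataFrame(data)
So far I have the following categories that I want to group the transactions by: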
CATEGORIES = {
'Groceries': ['SuperB', 'Superbest'],
'Housing': ['Insurance', 'Rent']
}
Now I would like to loop through each row in the DataFrame and group each transaction.
I would like to do this, by checking if text contains one of the values from the CATEGORIES dictionary.
If so, that transaction should get categorized as the key of the CATEGORIES dictionary - for instance Groceries.
How do I do this most efficiently?
IIUC,
we can create a pipe-delimited pattern from each list in your dictionary and do the assignment with .loc:
print(df)
for k, v in CATEGORIES.items():
    pat = '|'.join(v)
    df.loc[df['text'].str.contains(pat), 'category'] = k
print(df[['text', 'category']])
text category
0 SuperB Vesterbro T 74637 Groceries
1 Rent 45228 Housing
2 Superbest 86125 Groceries
A more efficient solution:
We create a single list of all your values and extract them with str.extract. At the same time we invert your dictionary, so each value becomes a key that maps to its category, which we then map onto the target dataframe.
words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        mapping_dict[item] = k

ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)
print(df)
text category
0 SuperB Vesterbro T 74637 Groceries
1 Rent 45228 Housing
2 Superbest 86125 Groceries
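To see what the single-pattern approach does without pandas, here is a sketch of the same idea using only the standard re module. The function name categorize is hypothetical, and re.escape is an addition not present in the answer above: it is there on the assumption that a keyword might contain regex metacharacters. Matching is case-sensitive, as in the pandas version.

```python
import re

CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent'],
}

# Invert the dictionary: each keyword maps to its category
keyword_to_category = {
    keyword: category
    for category, keywords in CATEGORIES.items()
    for keyword in keywords
}

# One alternation pattern covering every keyword
pattern = re.compile("|".join(re.escape(k) for k in keyword_to_category))


def categorize(text):
    """Return the category of the first keyword found in text, or None."""
    match = pattern.search(text)
    return keyword_to_category[match.group(0)] if match else None


texts = ["SuperB Vesterbro T 74637", "Rent 45228", "Superbest 86125", "Unknown shop"]
categories = [categorize(t) for t in texts]
print(categories)
```

If one keyword were a prefix of another, the order of the alternatives would decide which one matches first, so sorting the keywords longest-first may be worth considering in that case.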