I am iterating over an XML file to insert some of its attributes inside a JSON to develop a Corpus. For some reason, when inserting the date and the body of the XML it always inserts the same text inside the JSON line
The XML file format (I'm trying to get the title, timestamp and text from all the pages):
<page>
<title>MediaWiki:Monobook.js</title>
<ns>8</ns>
<id>4</id>
<revision>
<id>45582</id>
<parentid>45581</parentid>
<timestamp>2007-09-20T03:20:46Z</timestamp>
<contributor>
<username>Superzerocool</username>
<id>404</id>
</contributor>
<comment>Limpiando</comment>
<model>javascript</model>
<format>text/javascript</format>
<text bytes="34" xml:space="preserve">/* Utilizar MediaWiki:Common.js */</text>
<sha1>5dy7xlkatqg5epjzrq48br6yeh5uu34</sha1>
</revision>
</page>
My code:
if __name__ == "__main__":
path = 'local path'
tree = ET.parse(path)
root = tree.getroot()
with open("C:/Users/User/Documents/Github/News-Corpus/corpus.json", "w") as f:
for page in tqdm(root.findall('page')):
title = page.find('title').text
dictionary = {}
dictionary["title"] = title
for revision in root.iter('revision'):
timestamp = revision.find('timestamp').text
dictionary["timestamp"] = timestamp
body = revision.find('text').text
dictionary["body"] = body
f.write(json.dumps(dictionary))
f.write("\n")
The output I get:
{"title": "MediaWiki:Monobook.js", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Administrators", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allmessages", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allmessagestext", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allpagessubmit", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
As you can see, I always get the same timestamp and the same body, does anyone know why this happens? Help is much appreciated.
I tried with \\n, but it didn't worked, so I think with another reserved word could work, but I can not find the good one.
table_config = [
{
'dbName': f'gomez_datalake_{env}_{team}_{dataset}_db',
'table': 'ConFac',
'partitionKey': 'DL_PERIODO',
'schema': [
['TIPO_DE_VALOR', 'STRING', 2, None,
"CÓDIGO DEL PARÁMETRO DE SISTEMA."
"EJEMPLOS:"
"UF: VALOR DE LA UF"
"IP: VALOR DEL IPC"
"MO: MONEDA"
"IV: VALOR DEL VA"
"UT: VALOR DEL UTM"],
['ORIGEN', 'STRING', 4, None, "IDENTIFICADOR DE USUARIO"]
]
I'm trying to get some data from a government website and store them in two different tables. One will contain the filenames and the release date (let's call this filename) and one will contain the actual data and the key to join with the filename (let's call this datasplit)
These data come into a JSON file I saved from the web page (I don't have an API for that). Here's a little example of how the JSON file looks:
{
"filename": [
{
"id": 2,
"nome": "Societa' controllate di fatto dalla Presidenza del Consiglio dei Ministri e dai Ministeri",
"aggiornamento": "04-02-2020",
"datasplit": [
{
"cf": "00081070591",
"den": "SIOG SOCIETA'ITALIANA OLEODOTTI DI GAETA SPA IN AMM.NE STRAORDINARIA",
"dm": "1513641600"
},
{
"cf": "00103540829",
"den": "INDUSTRIA SICILIANA ACIDO FOSFORICO S.P.A.IN LIQUIDAZIONE",
"dm": "1513641600"
}
]
},
{
"id": 1,
"nome": "Enti o societa' controllate dalle Amministrazioni Centrali",
"aggiornamento": "30-10-2019",
"datasplit": [
{
"cf": "00049100522",
"den": "MPS TENIMENTI POGGIO BONELLI E CHIGI SARACINI - SOC. AGRICOLA SPA",
"dm": "1513641600"
},
{
"cf": "00051010528",
"den": "SOCIETA' AGRICOLA SUVIGNANO S.R.L.",
"dm": "1513641600"
}
]
},
{
"id": 4,
"nome": "Societa' quotate inserite nell'indice FTSE MIB della Borsa italiana",
"aggiornamento": "19-12-2017",
"datasplit": [
{
"cf": "00079760328",
"den": "ASSICURAZIONI GENERALI S.P.A.",
"dm": "1513641600"
},
{
"cf": "00222620163",
"den": "FRENI BREMBO - SPA",
"dm": "1513641600"
}
]
}
]
}
So what I'd like to get is the filename table with id, nome, aggiornamento fields and the datasplit table with id, aggiornamento, cf, den, dm
What I've done so far is to get the JSON file (saving it locally) from the webpage and read it in my python program.
# this works
import json
sqlstatement = ''
with open('splitdata.json', 'r') as f: #this is where I saved the website content I want to Import
jsondata = json.loads(f.read())
I was trying to build something that would go through the json file and build some INSERT INTO table_name SQL queries to later execute them and finally have my data in the database.
So, my problem is how to read the nested JSON first and how to insert the data in my database second (if you have a better solution than creating and running the SQL script).
When trying to cycle inside the JSON it seems that it only finds one element.
for json in jsondata:
keylist = "("
valuelist = "("
firstPair = True
for key, value in jsondata.items():
if not firstPair:
keylist += ", "
valuelist += ", "
firstPair = False
keylist += key
if type(value) in (str, unicode):
valuelist += "'" + value + "'"
else:
valuelist += str(value)
keylist += ")"
valuelist += ")"
sqlstatement += "INSERT INTO " + TABLE_NAME + " " + keylist + " VALUES " + valuelist + "\n"
print(sqlstatement)
I know the code is incomplete to generate the correct SQL statements but i need help on how to get to the nested part of the JSON, the datasplit field. Could it be it's not treating it as a dictionary? If so, how can i eventually fix it?
You can attempt to solve the problem using Pandas which has SQL capabilities:
import pandas as pd
for key, val in json_data.items():
df = pd.json_normalize(val)
df = df.explode('datasplit')
df[['cf', 'den', 'dm']] = df.datasplit.apply(pd.Series)
df = df.drop('datasplit', axis=1)
df.to_sql(<name>, <con>)
This is what each datasplit (df) looks like:
id nome aggiornamento \
0 2 Societa' controllate di fatto dalla Presidenza... 04-02-2020
0 2 Societa' controllate di fatto dalla Presidenza... 04-02-2020
1 1 Enti o societa' controllate dalle Amministrazi... 30-10-2019
1 1 Enti o societa' controllate dalle Amministrazi... 30-10-2019
2 4 Societa' quotate inserite nell'indice FTSE MIB... 19-12-2017
2 4 Societa' quotate inserite nell'indice FTSE MIB... 19-12-2017
cf den dm
0 00081070591 SIOG SOCIETA'ITALIANA OLEODOTTI DI GAETA SPA I... 1513641600
0 00103540829 INDUSTRIA SICILIANA ACIDO FOSFORICO S.P.A.IN L... 1513641600
1 00049100522 MPS TENIMENTI POGGIO BONELLI E CHIGI SARACINI ... 1513641600
1 00051010528 SOCIETA' AGRICOLA SUVIGNANO S.R.L. 1513641600
2 00079760328 ASSICURAZIONI GENERALI S.P.A. 1513641600
2 00222620163 FRENI BREMBO - SPA 1513641600
I am not sure where you are stuck at, but perhaps you aren't familiar with how dictionaries work?
jsondata['filename'] # has everything you want. you array of data.
# get the first element in the array
jsondata['filename'][0]
# result: {'id': 2, 'nome': "Societa' controllate di fatto dalla Presidenza del Consiglio dei Ministri e dai Ministeri", 'aggiornamento': '04-02-2020', 'datasplit': [{'cf': '00081070591', 'den': "SIOG SOCIETA'ITALIANA OLEODOTTI DI GAETA SPA IN AMM.NE STRAORDINARIA", 'dm': '1513641600'}, {'cf': '00103540829', 'den': 'INDUSTRIA SICILIANA ACIDO FOSFORICO S.P.A.IN LIQUIDAZIONE', 'dm': '1513641600'}]}
# second element
jsondata['filename'][1]
# result: {'id': 1, 'nome': "Enti o societa' controllate dalle Amministrazioni Centrali", 'aggiornamento': '30-10-2019', 'datasplit': [{'cf': '00049100522', 'den': 'MPS TENIMENTI POGGIO BONELLI E CHIGI SARACINI - SOC. AGRICOLA SPA', 'dm': '1513641600'}, {'cf': '00051010528', 'den': "SOCIETA' AGRICOLA SUVIGNANO S.R.L.", 'dm': '1513641600'}]}
# access `datasplit` directly like
jsondata['filename'][0]['datasplit']
# result: [{'cf': '00081070591', 'den': "SIOG SOCIETA'ITALIANA OLEODOTTI DI GAETA SPA IN AMM.NE STRAORDINARIA", 'dm': '1513641600'}, {'cf': '00103540829', 'den': 'INDUSTRIA SICILIANA ACIDO FOSFORICO S.P.A.IN LIQUIDAZIONE', 'dm': '1513641600'}]
# or in for loop
for data in jsondata['filename']:
data['datasplit']
I have the following Python code to read a JSON file:
import json
from pprint import pprint
with open('traveladvisory.json') as json_data:
print 'json data ',json_data
d = json.load(json_data)
json_data.close()
Below is a piece of the 'traveladvisory.json' file opened with this code. The variable 'd' does print out all the JSON data. But I can't seem to get the syntax correct to read all of the 'country-eng' and 'advisory-text' fields and their data and print it out. Can someone assist? Here's a piece of the json data (sorry, can't get it pretty printed):
{
"metadata":{
"generated":{
"timestamp":1475854624,
"date":"2016-10-07 11:37:04"
}
},
"data":{
"AF":{
"country-id":1000,
"country-iso":"AF",
"country-eng":"Afghanistan",
"country-fra":"Afghanistan",
"advisory-state":3,
"date-published":{
"timestamp":1473866215,
"date":"2016-09-14 11:16:55",
"asp":"2016-09-14T11:16:55.000000-04:00"
},
"has-advisory-warning":1,
"has-regional-advisory":0,
"has-content":1,
"recent-updates-type":"Editorial change",
"eng":{
"name":"Afghanistan",
"url-slug":"afghanistan",
"friendly-date":"September 14, 2016 11:16 EDT",
"advisory-text":"Avoid all travel",
"recent-updates":"The Health tab was updated - travel health notices (Public Health Agency of Canada)."
},
"fra":{
"name":"Afghanistan",
"url-slug":"afghanistan",
"friendly-date":"14 septembre 2016 11:16 HAE",
"advisory-text":"\u00c9viter tout voyage",
"recent-updates":"L'onglet Sant\u00e9 a \u00e9t\u00e9 mis \u00e0 jour - conseils de sant\u00e9 aux voyageurs (Agence de la sant\u00e9 publique du Canada)."
}
},
"AL":{
"country-id":4000,
"country-iso":"AL",
"country-eng":"Albania",
"country-fra":"Albanie",
"advisory-state":0,
"date-published":{
"timestamp":1473350931,
"date":"2016-09-08 12:08:51",
"asp":"2016-09-08T12:08:51.8301256-04:00"
},
"has-advisory-warning":0,
"has-regional-advisory":1,
"has-content":1,
"recent-updates-type":"Editorial change",
"eng":{
"name":"Albania",
"url-slug":"albania",
"friendly-date":"September 8, 2016 12:08 EDT",
"advisory-text":"Exercise normal security precautions (with regional advisories)",
"recent-updates":"An editorial change was made."
},
"fra":{
"name":"Albanie",
"url-slug":"albanie",
"friendly-date":"8 septembre 2016 12:08 HAE",
"advisory-text":"Prendre des mesures de s\u00e9curit\u00e9 normales (avec avertissements r\u00e9gionaux)",
"recent-updates":"Un changement mineur a \u00e9t\u00e9 apport\u00e9 au contenu."
}
},
"DZ":{
"country-id":5000,
"country-iso":"DZ",
"country-eng":"Algeria",
"country-fra":"Alg\u00e9rie",
"advisory-state":1,
"date-published":{
"timestamp":1475593497,
"date":"2016-10-04 11:04:57",
"asp":"2016-10-04T11:04:57.7727548-04:00"
},
"has-advisory-warning":0,
"has-regional-advisory":1,
"has-content":1,
"recent-updates-type":"Full TAA review",
"eng":{
"name":"Algeria",
"url-slug":"algeria",
"friendly-date":"October 4, 2016 11:04 EDT",
"advisory-text":"Exercise a high degree of caution (with regional advisories)",
"recent-updates":"This travel advice was thoroughly reviewed and updated."
},
"fra":{
"name":"Alg\u00e9rie",
"url-slug":"algerie",
"friendly-date":"4 octobre 2016 11:04 HAE",
"advisory-text":"Faire preuve d\u2019une grande prudence (avec avertissements r\u00e9gionaux)",
"recent-updates":"Les pr\u00e9sents Conseils aux voyageurs ont \u00e9t\u00e9 mis \u00e0 jour \u00e0 la suite d\u2019un examen minutieux."
}
},
}
}
Assuming d contains the json Data
for country in d["data"]:
print "Country :",country
#country.get() gets the value of the key . the second argument is
#the value returned in case the key is not present
print "country-eng : ",country.get("country-eng",0)
print "advisory-text(eng) :",country["eng"].get("advisory-text",0)
print "advisory-text(fra) :",country["fra"].get("advisory-text",0)
This worked for me:
for item in d['data']:
print d['data'][item]['country-eng'], d['data'][item]['eng']['advisory-text']
If I am understanding your question. Here's how to do it:
import json
with open('traveladvisory.json') as json_data:
d = json.load(json_data)
# print(json.dumps(d, indent=4)) # pretty-print data read
for country in d['data']:
print(country)
print(' country-eng: {}'.format(d['data'][country]['country-eng']))
print(' advisory-state: {}'.format(d['data'][country]['advisory-state']))
Output:
DZ
country-eng: Algeria
advisory-state: 1
AL
country-eng: Albania
advisory-state: 0
AF
country-eng: Afghanistan
advisory-state: 3
Then the code bellow work for python2 (if you need the python3 version please ask)
You have to use the load function from the json module
#-*- coding: utf-8 -*-
import json # import the module we need
with open("traveladvisory.json") as f: # f for file
d = json.load(f) # d is a dictionnary
for key in d['data']:
print d['data'][key]['country-eng']
print d['data'][key]['eng']['advisory-text']
It is good practice to use the with statement when dealing with file objects.
This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way.
Also, the json is wrong, you have to remove to comma from line 98:
{
"metadata":{
"generated":{
"timestamp":1475854624,
"date":"2016-10-07 11:37:04"
}
},
"data":{
"AF":{
"country-id":1000,
"country-iso":"AF",
"country-eng":"Afghanistan",
"country-fra":"Afghanistan",
"advisory-state":3,
"date-published":{
"timestamp":1473866215,
"date":"2016-09-14 11:16:55",
"asp":"2016-09-14T11:16:55.000000-04:00"
},
"has-advisory-warning":1,
"has-regional-advisory":0,
"has-content":1,
"recent-updates-type":"Editorial change",
"eng":{
"name":"Afghanistan",
"url-slug":"afghanistan",
"friendly-date":"September 14, 2016 11:16 EDT",
"advisory-text":"Avoid all travel",
"recent-updates":"The Health tab was updated - travel health notices (Public Health Agency of Canada)."
},
"fra":{
"name":"Afghanistan",
"url-slug":"afghanistan",
"friendly-date":"14 septembre 2016 11:16 HAE",
"advisory-text":"\u00c9viter tout voyage",
"recent-updates":"L'onglet Sant\u00e9 a \u00e9t\u00e9 mis \u00e0 jour - conseils de sant\u00e9 aux voyageurs (Agence de la sant\u00e9 publique du Canada)."
}
},
"AL":{
"country-id":4000,
"country-iso":"AL",
"country-eng":"Albania",
"country-fra":"Albanie",
"advisory-state":0,
"date-published":{
"timestamp":1473350931,
"date":"2016-09-08 12:08:51",
"asp":"2016-09-08T12:08:51.8301256-04:00"
},
"has-advisory-warning":0,
"has-regional-advisory":1,
"has-content":1,
"recent-updates-type":"Editorial change",
"eng":{
"name":"Albania",
"url-slug":"albania",
"friendly-date":"September 8, 2016 12:08 EDT",
"advisory-text":"Exercise normal security precautions (with regional advisories)",
"recent-updates":"An editorial change was made."
},
"fra":{
"name":"Albanie",
"url-slug":"albanie",
"friendly-date":"8 septembre 2016 12:08 HAE",
"advisory-text":"Prendre des mesures de s\u00e9curit\u00e9 normales (avec avertissements r\u00e9gionaux)",
"recent-updates":"Un changement mineur a \u00e9t\u00e9 apport\u00e9 au contenu."
}
},
"DZ":{
"country-id":5000,
"country-iso":"DZ",
"country-eng":"Algeria",
"country-fra":"Alg\u00e9rie",
"advisory-state":1,
"date-published":{
"timestamp":1475593497,
"date":"2016-10-04 11:04:57",
"asp":"2016-10-04T11:04:57.7727548-04:00"
},
"has-advisory-warning":0,
"has-regional-advisory":1,
"has-content":1,
"recent-updates-type":"Full TAA review",
"eng":{
"name":"Algeria",
"url-slug":"algeria",
"friendly-date":"October 4, 2016 11:04 EDT",
"advisory-text":"Exercise a high degree of caution (with regional advisories)",
"recent-updates":"This travel advice was thoroughly reviewed and updated."
},
"fra":{
"name":"Alg\u00e9rie",
"url-slug":"algerie",
"friendly-date":"4 octobre 2016 11:04 HAE",
"advisory-text":"Faire preuve d\u2019une grande prudence (avec avertissements r\u00e9gionaux)",
"recent-updates":"Les pr\u00e9sents Conseils aux voyageurs ont \u00e9t\u00e9 mis \u00e0 jour \u00e0 la suite d\u2019un examen minutieux."
}
}
}
}
How could I print which rooms are connected to the "de Lobby"? The things I tried returned string erros or other errors.
kamers = {
1 : { "naam" : "de Lobby" ,
"trap" : 2,
"gangrechtdoor" : 3 } ,
2 : { "naam" : "de Trap" ,
"lobby" : 1,
"note" : "Terwijl je de trap oploopt hoor je in de verte Henk van Ommen schreeuwen" } ,
3 : { "naam" : "de Gang rechtdoor" ,
"lobby" : 1,
"gymzaal" : 4,
"concergie" : 5,
"gangaula" : 6 } ,
This prints where you are, but as you can see, not which rooms are connected.
print("Hier ben je: " + kamers[currentKamer]["naam"])
print("hier kan je naartoe: ")
Does this do what you want?
kamers = {
1: {"naam": "de Lobby",
"trap": 2,
"gangrechtdoor": 3},
2: {"naam": "de Trap",
"lobby": 1,
"note": "Terwijl je de trap oploopt hoor je in de verte Henk van Ommen schreeuwen"},
3: {"naam": "de Gang rechtdoor",
"lobby": 1,
"gymzaal": 4,
"concergie": 5,
"gangaula": 6}}
def find_connected_rooms(room_name, rooms):
room_number = next(room_number for room_number, props in rooms.items() if props['naam'] == room_name)
for room_props in rooms.values():
if room_number in room_props.values():
yield room_props['naam']
if __name__ == '__main__':
for connected_room in find_connected_rooms('de Lobby', kamers):
print(connected_room)
Output
de Trap
de Gang rechtdoor
Question is not quite clear but I assume you are looking for the items which has lobby key or any key with 1 as value
kamers[1] is lobby and it is "naam" is "de Lobby".
so This gets by if items inside has value 1 (Lobby's key)
[i for i in kamers.values() if 1 in i.values()]
or you can check if the key 'lobby' exists
[i for i in kamers.values() if i.get('lobby',None) ]
to get the name of the rooms you can replace for "i" with i['naam']
[i['naam'] for i in kamers.values() if i.get('lobby',None) ]
which returns
['de Trap', 'de Gang rechtdoor']