I'm learning Python on the job and need help improving my solution.
I need to load XML data into BigQuery.
I have it working, but I'm not sure I've done it in a sensible way.
I call an API that returns an XML structure.
I use ElementTree to parse the XML and use tree.iter() to return the tags and text from the XML.
Printing my tags and text with:
for node in tree.iter():
    print(f'{node.tag}, {node.text}')
Returns:
Tag Text
Responses None
Response None
ResponseId 393
ResponseText Please respond “Has this loaded”
ResponseType single
ResponseStatus 0
The Responses tag appears only once per API call, but Response through ResponseStatus are repeating groups, with ResponseId as the key for each group. Each call returns fewer than 100 repeating groups.
There is a key returned in the header, Response_key, that is the parent of all ResponseIds.
My aim is to take this data, convert to JSON and stream to BigQuery.
The table structure I need is:
ResponseKey, ResponseID, Response, ResponseText, ResponseType, ResponseStatus
The approach I use is:
Use tree.iter() to loop and create a list:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
Use itertools to group the list (I found this step difficult):
r = 'Response'
response_split = [list(y) for x, y in itertools.groupby(node_list, lambda z: z == r) if not x]
which returns:
[['Responses', None], [None, 'ResponseId', '393', 'ResponseText', 'Please respond "Has this loaded"', 'ResponseType', 'single', 'ResponseStatus', '0'], [None, 'ResponseId', '394', 'ResponseText', 'Please confirm "Connection made"', 'ResponseType', 'single', 'ResponseStatus', '0']]
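In case it helps to see the idiom in isolation, here is a toy sketch of the split-on-sentinel groupby pattern used above (the sentinel value and sample list are made up). Note the sentinel must compare equal to the delimiter items exactly, so a stray trailing space in r would silently break the grouping.

```python
import itertools

# Toy data: 'a' acts as the delimiter, like the 'Response' tag above
items = ['a', 1, 2, 'a', 3, 4]

# groupby buckets consecutive items by whether they equal the sentinel;
# "if not is_sep" drops the delimiter runs, leaving only the groups between them
chunks = [list(group)
          for is_sep, group in itertools.groupby(items, lambda v: v == 'a')
          if not is_sep]
print(chunks)  # [[1, 2], [3, 4]]
```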
Load into a pandas DataFrame and remove any double quotes in case they cause BigQuery any issues.
Add ResponseKey as a column to the DataFrame.
Convert the DataFrame to JSON and pass it to load_table_from_json.
It works, but I'm not sure it is sensible.
Any suggested improvements would be appreciated.
Here is a sample of the XML:
{"GetResponses":"<Responses><Response><ResponseId>393938<\/ResponseId><ResponseText>Please respond to the following statement:\"The assigned task was easy to complete\"<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393939<\/ResponseId><ResponseText>Did you save your datafor later? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393940<\/ResponseId><ResponseText>Did you notice how much it cost to find the item? How much was it?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393941<\/ResponseId><ResponseText>Did you select ‘signature on form’? Why\/why not?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393942<\/ResponseId><ResponseText>Was it easy to find thethe new page? Why\/why not?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393943<\/ResponseId><ResponseText>Please enter your email. 
So that we can track your responses, we need you to provide this for each task.<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>text<\/ResponseType><ResponseStatus>1<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393944<\/ResponseId><ResponseText>Why didn't you save your datafor later?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393945<\/ResponseId><ResponseText>Why did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393946<\/ResponseId><ResponseText>Did you save your datafor later?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393947<\/ResponseId><ResponseText>Why didn't you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393948<\/ResponseId><ResponseText>Why did you select 'signature on form'?<\/ResponseText><ResponseType>text<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>4444449<\/ResponseId><ResponseText>Did you select ‘signature on form’?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393950<\/ResponseId><ResponseText>Why wasn't it easy to find thethe new page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><Response><ResponseId>393951<\/ResponseId><ResponseText>Was it easy to find thethe new 
page?<\/ResponseText><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>0<\/ExtendedType><\/Response><Response><ResponseId>393952<\/ResponseId><ResponseText>Please enter your email addressSo that we can track your responses, we need you to provide this for each task<\/ResponseText><ResponseShortCode>email<\/ResponseShortCode><ResponseType>single<\/ResponseType><ResponseStatus>0<\/ResponseStatus><ExtendedType>4<\/ExtendedType><\/Response><\/Responses>"}
A sample JSON without all the processing steps:
node_list = []
for node in tree.iter():
    node_list.append(node.tag)
    node_list.append(node.text)
json_format = json.dumps(node_list)
print(json_format)
["Responses", null, "Response", null, "ResponseId", "393938", "ResponseText", "Please respond to the following statement: \"The assigned task was easy to complete\"", "ResponseType", "single", "ResponseStatus", "0", "ExtendedType", "0"]
I'm not sure what the required output is, but this is one way of doing it:
import xml.etree.ElementTree as ET
import json

p = r"d:\tmp.xml"
tree = ET.parse(p)
root = tree.getroot()

json_dict = {}
json_dict[root.tag] = root.text
json_dict['response_list'] = []
for node in root:
    tmp_dict = {}
    for response_info in node:
        tmp_dict[response_info.tag] = response_info.text
    json_dict['response_list'].append(tmp_dict)

with open(r'd:\out.json', 'w') as of:
    json.dump(json_dict, of)
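Since each Response element already groups its own fields, one simpler route (a sketch, not the asker's code; the inline XML and the ResponseKey value are made-up placeholders for the API payload and header key) is to build one dict per Response directly with ElementTree and skip the flat list, groupby, and pandas steps entirely:

```python
import xml.etree.ElementTree as ET

# Inline sample standing in for the API payload
xml_payload = """<Responses>
  <Response>
    <ResponseId>393</ResponseId>
    <ResponseText>Please respond "Has this loaded"</ResponseText>
    <ResponseType>single</ResponseType>
    <ResponseStatus>0</ResponseStatus>
  </Response>
</Responses>"""

response_key = "ABC123"  # placeholder for the Response_key from the header

root = ET.fromstring(xml_payload)
rows = []
for response in root.iter('Response'):
    # one dict per repeating group: tag -> text
    row = {child.tag: (child.text or '').strip() for child in response}
    row['ResponseKey'] = response_key
    rows.append(row)

print(rows)
```

A list of dicts like rows can be passed straight to client.load_table_from_json(rows, table_id), so no DataFrame round-trip is needed.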
Related
I'm iterating through a nested JSON tree with a pandas DataFrame. The issue I'm having is more or less simple to solve, but I'm out of ideas. When I'm traversing through the nested JSON tree I get to a part where I can't get out of it and continue on another branch (i.e. when I reach Placeholder 1 I can't return and continue with Placeholder 2, see the JSON below). Here is my code so far:
import pandas as pd

def recursiveImport(df):
    for row, _ in enumerate(df):
        # Get ID, Name, Type
        id = df['ID'].values[row]
        name = df['Name'].values[row]
        type = df['Type'].values[row]
        # Iterate through Value
        if type == 'struct':
            for i in df.at[row, 'Value']:
                df = pd.json_normalize(i)
                recursiveImport(df)
        elif type != 'struct':
            value = df['Value'].values[row]
            print(f'Value: {value}')
    return

data = pd.read_json('work_gmt.json', orient='records')
print(data)
recursiveImport(data)
And the (minified) data I'm using for this is below (you can use an online JSON viewer to get a better look):
[{"ID":11,"Name":"Data","Type":"struct","Value":[[{"ID":0,"Name":"humidity","Type":"u32","Value":0},{"ID":0,"Name":"meta","Type":"struct","Value":[{"ID":0,"Name":"height","Type":"e32","Value":[0,0]},{"ID":0,"Name":"voltage","Type":"u16","Value":0},{"ID":0,"Name":"Placeholder 1","Type":"u16","Value":0}]},{"ID":0,"Name":"Placeholder 2","Type":"struct","Value":[{"ID":0,"Name":"volume","Type":"struct","Value":[{"ID":0,"Name":"volume profile","Type":"struct","Value":[{"ID":0,"Name":"upper","Type":"u8","Value":0},{"ID":0,"Name":"middle","Type":"u8","Value":0},{"ID":0,"Name":"down","Type":"u8","Value":0}]}]}]}]]}]
I tried using an indexed approach to keep track of each branch, but that didn't work for me. Perhaps I have to use a stack/queue to keep track? Thanks in advance!
Cheers!
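One way to get the backtracking for free is to recurse over the plain parsed JSON (lists and dicts) instead of re-normalizing each level into a DataFrame; when a branch finishes, the call stack unwinds to the parent and traversal continues with the next sibling (e.g. Placeholder 2). A minimal sketch over a small fragment shaped like the data above:

```python
import json

def walk(node, out):
    """Depth-first over the ID/Name/Type/Value structure, collecting leaves."""
    if isinstance(node, list):
        for item in node:
            walk(item, out)
    elif isinstance(node, dict):
        if node.get('Type') == 'struct':
            walk(node.get('Value', []), out)  # descend, then return to the caller
        else:
            out.append((node.get('Name'), node.get('Value')))

fragment = json.loads(
    '[{"ID":0,"Name":"meta","Type":"struct","Value":'
    '[{"ID":0,"Name":"voltage","Type":"u16","Value":0},'
    '{"ID":0,"Name":"Placeholder 1","Type":"u16","Value":0}]},'
    '{"ID":0,"Name":"Placeholder 2","Type":"u16","Value":7}]'
)
leaves = []
walk(fragment, leaves)
print(leaves)  # [('voltage', 0), ('Placeholder 1', 0), ('Placeholder 2', 7)]
```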
I'm making a script with Python to search for competitors with a Google API.
Just for you to see how it works:
First I make a request and save the data as JSON:
# make the http GET request to Scale SERP
api_result = requests.get('https://api.scaleserp.com/search', params)
# Save data inside Json
dados = api_result.json()
Then I create some lists to hold position, title, domain and things like that, and a for loop to append my competitors' information to those lists:
# Create the lists
sPositions = []
sDomains = []
sUrls = []
sTitles = []
sDescription = []
sType = []
# Loop to look for information about competitors
for sCompetitors in dados['organic_results']:
    sPositions.append(sCompetitors['position'])
    sDomains.append(sCompetitors['domain'])
    sUrls.append(sCompetitors['link'])
    sTitles.append(sCompetitors['title'])
    sDescription.append(sCompetitors['snippet'])
    sType.append(sCompetitors['type'])
The problem is that not every entry of my JSON is going to have the same keys. Some of them won't have the "domain" value. So I need something like: when there is no 'domain' value, append 'no domain' to the sDomains list.
I'm glad if anyone could help.
Thanks!!
You should use the get method for dicts so you can set a default value in case the key doesn't exist:
for sCompetitors in dados['organic_results']:
    sPositions.append(sCompetitors.get('position', 'no position'))
    sDomains.append(sCompetitors.get('domain', 'no domain'))
    sUrls.append(sCompetitors.get('link', 'no link'))
    sTitles.append(sCompetitors.get('title', 'no title'))
    sDescription.append(sCompetitors.get('snippet', 'no snippet'))
    sType.append(sCompetitors.get('type', 'no type'))
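A quick self-contained illustration of the .get behavior, plus a variant that keeps each result as one dict instead of six parallel lists (the sample records here are made up):

```python
# Made-up stand-ins for two entries of dados['organic_results']
results = [
    {'position': 1, 'title': 'A', 'domain': 'a.com'},
    {'position': 2, 'title': 'B'},  # no 'domain' key
]

# dict.get returns its second argument when the key is missing
rows = [{'position': r.get('position', 'no position'),
         'domain': r.get('domain', 'no domain'),
         'title': r.get('title', 'no title')} for r in results]

print(rows[0]['domain'])  # a.com
print(rows[1]['domain'])  # no domain
```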
Trying to extract coin names, prices, and market caps from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td' and then planned on using a for loop to look for specific class names, append those to a new list, and print that list, but it returns a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console, even though I'm not asking to see the data type. I'm very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. Still trying to get my head around all of this stuff.
As @RJAdriaansen mentioned, you don't need to scrape the website when they provide an API. Here is how you do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you json data. Now you can grab all you need by accessing correct keys:
final_list = []
temp = []
for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] gives you a list with the price and market cap of each crypto
    for quote in each_crypto['quotes']:
        # assuming you want the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
Final result would look like this:
[
['Bitcoin', 34497.01819639692, 646704595579.0485],
['Ethereum', 2195.11816422801, 255815488972.87268],
['Tether', 1.0003936138399, 62398426501.02027],
['Binance Coin', 294.2550537711805, 45148405357.003],
...
]
I'm new to APIs and web development, so I'm sorry if my question is very basic :(.
I want to create a web browser of food recipes based on the ingredients they contain. I'm using two query URLs to obtain the information because I need to access two JSON files: the first to obtain the id of each recipe based on the ingredient searched by the user, and the second to obtain the information for each recipe based on the id returned by the first URL.
The code I have is this one:
# Function that returns the ids of recipes containing the word queried by the user.
def ids(query):
    try:
        api_key = os.environ.get("API_KEY")
        response = requests.get(f"https://api.spoonacular.com/recipes/autocomplete?apiKey={api_key}&query={urllib.parse.quote_plus(query)}")
        response.raise_for_status()
    except requests.RequestException:
        return response
    try:
        ids = []
        quotes = response.json()
        for quote in quotes:
            ids.append(quote['id'])
        return ids
    except (KeyError, TypeError, ValueError):
        return None
# Save, inside a list named "ids", the ids of recipes containing the ingredient chicken
ids = ids("chicken")

# Function that returns the different recipe options based on the ids.
def lookup(ids):
    for ID in ids:
        try:
            api_key = os.environ.get("API_KEY")
            response = requests.get(f"https://api.spoonacular.com/recipes/{ID}/information?apiKey={api_key}&includeNutrition=false")
            response.raise_for_status()
        except requests.RequestException:
            return response
The main issue I have is that I don't know how to store the information returned in response. As you may notice, inside the "lookup" function I use a loop to get the responses for every ID contained in the list ids; since I obtain one response per ID, if I have 6 ids I'll obtain 6 different responses with 6 different JSON payloads.
Finally, the info I want to store is this:
quote = response.json()
results = {'id': quote["id"], 'title': quote["title"], 'url': quote["sourceUrl"]}
This is the link to a sample of the data and the URL used to obtain the JSON:
https://spoonacular.com/food-api/docs#Get-Recipe-Information
I'm stuck trying to store this information, located inside the different JSON files, in a dictionary using Python.
Any kind of help would be amazing!!
You would best use a dict for this, with a structure matching the recipes you get back.
Assuming the API returns name, duration, and difficulty, that these are fields you will use later, and that your program also saves other data besides recipes, you could use a dict. If that is not the case, simply use a list of dicts, each representing a single recipe.
# just a dummy setup to simulate getting different recipes back from the API
one_response = {"name": "Chicken and Egg", "duration": 14, "difficulty": "easy"}
another_response = {"name": "Chicken square", "duration": 100, "difficulty": "hard"}

def get_recipe(id):
    if id == 1:
        return one_response
    else:
        return another_response

ids = [1, 2]
# Here there could be other information as well that you capture somewhere else.
# If you don't have any, simply use a list with the recipe dicts inside.
queried_recipes = {"recipes": []}
for i in ids:
    # Here you simply add a recipe to your recipes dict
    queried_recipes["recipes"].append(get_recipe(i))
print(queried_recipes)
OUT: {'recipes': [{'name': 'Chicken and Egg', 'duration': 14, 'difficulty': 'easy'}, {'name': 'Chicken square', 'duration': 100, 'difficulty': 'hard'}]}
print(queried_recipes["recipes"][0]["duration"])
OUT: 14
You may want to use https://spoonacular.com/food-api/docs#Get-Recipe-Information-Bulk instead. That will get you all the information you want in one JSON document without having to loop through repeated calls to https://api.spoonacular.com/recipes/{ID}/information.
However, to answer the original question:
def lookup(ids):
    api_key = os.environ.get("API_KEY")
    results = []
    for ID in ids:
        response = requests.get(f"https://api.spoonacular.com/recipes/{ID}/information?apiKey={api_key}&includeNutrition=false")
        response.raise_for_status()
        quote = response.json()
        result = {'id': quote["id"], 'title': quote["title"], 'url': quote["sourceUrl"]}
        results.append(result)
    return results
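For reference, a sketch of what the bulk variant might look like; the informationBulk path and the comma-separated ids parameter are taken from the docs linked above, and this has not been run against the live API:

```python
import os
import requests

def build_bulk_params(ids, api_key):
    # The bulk endpoint takes one comma-separated "ids" query parameter
    return {"apiKey": api_key, "ids": ",".join(str(i) for i in ids)}

def lookup_bulk(ids):
    api_key = os.environ.get("API_KEY")
    response = requests.get("https://api.spoonacular.com/recipes/informationBulk",
                            params=build_bulk_params(ids, api_key))
    response.raise_for_status()
    # One request returns one JSON array with an object per recipe
    return [{'id': q["id"], 'title': q["title"], 'url': q["sourceUrl"]}
            for q in response.json()]
```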
This is what I want in an external XML file: through a for loop, add an entry with the same tags in <Data>, as <Availability> and <Price>, like this:
<UpdateInventoryRequest>
<StartDate>21/12/2015</StartDate>
<RoomId>1</RoomId>
<Data>
<Availability>1</Availability>
<Price>100</Price>
<Availability>3</Availability>
<Price>120</Price>
</Data>
</UpdateInventoryRequest>
And this is my code now; every time it returns the same value in all fields:
from lxml import etree

# Create Xml
root = etree.Element("UpdateInventoryRequest")
doc = etree.ElementTree(root)
root.append(etree.Element("StartDate"))
root.append(etree.Element("RoomId"))
root.append(etree.Element("Data"))
data_root = root[2]
data_root.append(etree.Element("Availability"))
data_root.append(etree.Element("Price"))

# Xml in code
def buildXmlUpdate(dfrom, roomId, ldays):
    start_date_sard = dfrom
    roomId = str(roomId)
    room_id_sard = roomId
    for n in ldays:
        print(dfrom, roomId, n)
        ldays[-1]['avail'] = str(ldays[-1]['avail'])
        ldays[-1]['price'] = str(ldays[-1]['price'])
        availability_in_data = ldays[-1]['avail']
        price_in_data = ldays[-1]['price']
        root[0].text = start_date_sard
        root[1].text = room_id_sard
        data_root[0].text = availability_in_data
        data_root[1].text = price_in_data

# here execute the function
buildXmlUpdate('21/12/2015', 1, [{'avail': 1, 'price': 100}, {'avail': 3, 'price': 120}])
doc.write('testoutput.xml', pretty_print=True)
If it's the case that you want your script to build an XML packet as you've shown, there are a few issues.
You're doing a lot of swapping of variables around, simply to convert them to strings - for the most part you can just use the Python string conversion (str()) on demand.
In your loop, the data you are trying to deal with is in the variable n, however, when you are pulling data out, it's from the variable ldays, which means the data you are trying to put into your XML is the same, regardless of the number of times you go through the loop.
You've built an XML object with a single "Availability" element, and a single "Price" element, so there is no way, given the code you presented, you are ever going to generate multiple "Availability" and "Price" elements as in your sample XML file.
This isn't necessarily the best way to do things, but here is a potential solution, utilizing the paradigms you've already established:
from lxml import etree

def buildXmlUpdate(dfrom, roomId, ldays):
    root = etree.Element("UpdateInventoryRequest")
    root.append(etree.Element("StartDate"))
    root[-1].text = dfrom
    root.append(etree.Element("RoomId"))
    root[-1].text = str(roomId)
    root.append(etree.Element("Data"))
    dataroot = root[-1]
    for item in ldays:
        dataroot.append(etree.Element("Availability"))
        dataroot[-1].text = str(item['avail'])
        dataroot.append(etree.Element("Price"))
        dataroot[-1].text = str(item['price'])
    return root

myroot = buildXmlUpdate('21/12/2015', 1, [{'avail': 1, 'price': 100}, {'avail': 3, 'price': 120}])
print(etree.tostring(myroot, pretty_print=True).decode())
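As one more variant of the same idea, SubElement creates and attaches a child in a single call, which removes the root[-1] indexing; shown here with the standard library's xml.etree.ElementTree, whose Element/SubElement interface matches lxml.etree for this usage:

```python
import xml.etree.ElementTree as ET

def build_xml_update(dfrom, room_id, ldays):
    root = ET.Element("UpdateInventoryRequest")
    ET.SubElement(root, "StartDate").text = dfrom
    ET.SubElement(root, "RoomId").text = str(room_id)
    data = ET.SubElement(root, "Data")
    for item in ldays:
        # one Availability/Price pair per entry in ldays
        ET.SubElement(data, "Availability").text = str(item['avail'])
        ET.SubElement(data, "Price").text = str(item['price'])
    return root

xml_bytes = ET.tostring(build_xml_update(
    '21/12/2015', 1,
    [{'avail': 1, 'price': 100}, {'avail': 3, 'price': 120}]))
print(xml_bytes.decode())
```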
Again, this is only one possible way to do this; there are certainly more approaches you could take.
And if you haven't already, I might suggest going through the LXML Tutorial and trying the different things they go through there, as it may help you find better ways to do what you want.