python parsing soup protocol response header and value

python parsing soup protocol response header and value - python

Trying to write the question one more time because the first time the problem was not clear.
I'm trying to extract output from a SOAP protocol using Python.
The output is structured with a lot of subcategories and this is where I can't extract the information properly.
The output I have is very long, so I will bring here a short example from soapenv:Body
the original code (soap body)
<mc:ocb-rule id="boic"><mc:id>boic</mc:id><mc:ocb-conditions><mc:rule-deactivated>true</mc:rule-deactivated><mc:international>true</mc:international></mc:ocb-conditions><mc:cb-actions><mc:allow>false</mc:allow></mc:cb-actions></mc:ocb-rule>
As you can see I also used the command
xmltodict.parse(soap_response)
for turning the output into a dictionary
OrderedDict([('#id',
'boic'),
('mc:id',
'boic'),
('mc:ocb-conditions',
OrderedDict([('mc:rule-deactivated',
'true'),
('mc:international',
'true')])),
('mc:cb-actions',
OrderedDict([('mc:allow',
'false')]))])
If there is an easier way, I would appreciate guidance
As I mentioned, my goal is ultimately to get each category its own value, when if there is a subcategory, a separate category will be added.
In the end I will have to put all the values into the Data frame and display all the categories and their values
for example :
table example
I really hope that this time I was able to explain myself properly
Thanks in advance
i am trying to parsing soup response and insert all value and header to data frame

Related

how to get nested data with pandas and request

I'm going crazy trying to get data through an API call using request and pandas. It looks like it's nested data, but I cant get the data i need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
above is the api documentation. I'm just trying to keep it simple and get the itemnumber and qtyremainingtoship, but i cant even figure out how to access the nested data. I'm trying to use DataFrame to get it, but am just lost. any help would be appreciated. i keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax

I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data and I'm assuming the you have it as a dictionary type already.
From what I can tell I don't think you should be using pandas, unless its some downstream requirement in the task you are doing. But to get the ItemNumber & QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note I wouldn't name any variables json because thats also the name of a popular library in python for parsing json, so that will be confusing to future readers and will clash with the name if you end up having to import the json library.

Using scrapy to extract and structure table data

I'm new to python and scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straight forward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of 良 (good), 悪 (bad), 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of review due to the use of generic divs, , /n and other formatting.
Any help would be appreciated.
Here's the site and code I've tried for the grabbing them, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[#valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。'........and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[#valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[#valign="top"]//text()')[0].get()
But that only provides each line at a time, with the \n at the front. As with (1), a line by line rendering makes it difficult to attribute reviews to reviewers and the appropriate section in their review.
From these (2) seems the closest to what I want, and I was hoping I could get some direction in how to grab each section for each review without the markup. I was thinking that since these sections come in sets of 3, if these could be put in a list that would make pulling them easier in the future (i.e. all "good" reviews follow 0, 0+3; all "bad" ones 1, 1+3 ... etc.)...but first I need to actually get the elements.
I've thought about, and tried, iterating over each line with an "if" conditional (something like:)
i = 0
if i <= len(response.xpath('//tr//td[#valign="top"]//text()').getall()):
yield {response.xpath('//tr//td[#valign="top"]')[i].get()}
i + 1
to pull these out, but I'm a bit lost on how to implement something like this. Not sure where it should go. I've briefly looked at Item Loader, but as I'm new to this, I'm still trying to figure it out.
Here's the block where the review code is.
def parse(self, response):
for table in response.xpath('body'):
yield {
#code for other elements in review
'date': response.xpath('//td//div[#align="left"]//text()').getall(),
'name': response.xpath('//td//div[#align="right"]//text()').getall(),
#this includes the above elements, and is regualr enough I can systematically extract what I want
'categories': response.xpath('//tr//td[#class="koumoku"]//text()').getall(),
'scores': response.xpath('//tr//td[#class="tokuten_k"]//text()').getall(),
'play_time': response.xpath('//td[#align="right"]//span[#id="setumei"]//text()').getall(),
#reviews code here
}

Pretty simple task using a part of text as anchor (I used string to get text content for a whole td):
for review_node in response.xpath('//table[#width="645"]'):
good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
bad= review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
...............

Exporting data from Yelp API to csv file

I'm a beginner programmer trying to work with the Yelp API and I've been able to pull the information I need but I can't figure out how to only export a single part of the address into my csv file. This is the code I am working with:
**Convert the JSON string to a dictionary*
business_data = response.json()
c = csv.writer(open('testing.csv', 'a'), lineterminator ='\n')
for biz in business_data['businesses']:
c.writerow([biz['name'], biz['location'], biz['phone'], biz['url']])
In the last line of code, in the for loop, I want to be able to target a specific element of the 'location' like this:
#c.writerow([biz['name'], biz['address1'], biz['city'], biz['state'], biz['zip_code'], biz['phone'], biz['url']])
on the Yelp website it shows that I can target these specifics but I just can't seem to figure out how with c.writerow()
Yelp shows that I can target these like this:
businesses[x].location.address1
businesses[x].location.address2
businesses[x].location.city

From the response, biz['location'] is a python dictionary meaning it consists of key, value pairs.
You can validate this by printing type(biz['location']). To answer your question, all you need is to call the dict key & write the value into the file.
c.writerow([biz['name'], biz['location']['address1'], biz['location']['city'], biz['location']['state'], biz['location']['zip_code'] biz['phone'], biz['url']])

Save string to use multiple times in code to avoid fetching the same information multiple times

I am working on a code where I fetch information from an online database through the api they provide. The information I receive is a large block of data arranged in square- and curly brackets. As of now, I fetch the whole block of data every time I want a specific parameter. It becomes a problem as I am limited in how much I can send queries through their api. I would therefore like to fetch the large block of data one time and save it as a variable in the code that I then reference to when I want specific parts of it. I will then be able to get all the specific parts of the data with only one request.
As of now the code I have looks like this:
from monkeylearn import MonkeyLearn
ml = MonkeyLearn('my personal API-key')
model_id = 'the id of the model'
text1 = 'This is the text that is analyzed by monkeylearn'
data = ['first text', {'text': text1, 'external_id': 'ANY_ID'}, '']
response = ml.extractors.extract(model_id, data).body
company_name_tag = response[1]['extractions'][0]['tag_name']
company_name = response[1]['extractions'][0]['extracted_text']
response contains all the information I get from the request so I would like to only have to fetch it once. As of now, if I were to print(company_name_tag) and print(company_name) it would fetch the data through the api two times. This leads to me reaching my limit on the api a lot faster than necessary.
I appreciate all help with this issue!

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, lets take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so i can associate the IndexMap numbers to the variationValues one.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_frist()
import re
# capture everything between {}
data = re.findall(script, '(\{.+?\}_')
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is it there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimentionIndexMap? (maybe using some regular expressions in the middle to get some data out of a big string). Or explain a little bit what happens with the json part.
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap

I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is if there are any errors when accessing a particular "box" inside you json object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slash a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in an look at structure this way to truly find what you're looking for.

variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to json as use them combine as you wish.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python parsing soup protocol response header and value - python

Related

how to get nested data with pandas and request

Using scrapy to extract and structure table data

Exporting data from Yelp API to csv file

Save string to use multiple times in code to avoid fetching the same information multiple times

Scraping data from a http & javaScript site

Categories

Resources