Scrape a webpage using scrapy into tab-delimited format - python

I would like to scrape and parse the data on these two pages: here and here into a tab-delimited format using scrapy. I ran these commands:
scrapy shell
fetch("https://www.drugbank.ca/drugs/DB04899")
print(response.text)
My two questions:
1. For example, for this page, when I type:
response.css(".sequence::text").extract()
[u'>DB04899: Natriuretic peptides B\nSPKMVQGSGCFGRKMDRISSSSGLGCKVLRRH']
But then when I type:
>>> response.css(".synonyms::text").extract()
[]
>>> response.css(".Synonyms::text").extract()
[]
But you can see that there are synonyms listed on the webpage, so the output should not be empty. Can someone show me what I'm doing wrong? (I also tried other class names such as synonym, Synonym, etc.)
When I type response.css(".targets::text").extract(), the output is [u'Targets (3)']. I'm wondering how I can actually parse the data within this section, but I guess this is related to not using the right selectors, as in question 1 above.
2. This question is vague/advanced for me at the moment: is it possible to just scrape the whole page in one go, instead of having to know each individual tag? The output would then be a dictionary called 'identification' with Name, accession number, type, etc. as keys; a dictionary called 'pharmacology' with indication, structured indication, etc. as keys; another dictionary called 'interactions'; another called 'pharmacoeconomics'; and so on, one dictionary per page section?
Thanks

There really are no elements with a synonyms or Synonyms class attribute value on the page.
You can get to the synonyms by "going to the right" of the dt element containing the "Synonyms" text, using the following-sibling axis:
In [2]: response.xpath("//dt[. = 'Synonyms']/following-sibling::dd/ul/li/text()").extract()
Out[2]:
['BNP',
 'Brain natriuretic peptide 32',
 'Natriuretic peptides B',
 'Nesiritide recombinant']
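For the second question, the same following-sibling trick can be generalized: the page's sections are dl lists of dt/dd pairs, so you can build one dictionary per section without knowing each label in advance. A minimal sketch, assuming the page keeps that dl/dt/dd layout (the out.tsv name and the key=value cell format are illustrative):

import csv

def parse_dl(dl):
    # Pair each <dt> label with the text of the <dd> immediately following it.
    pairs = {}
    for dt in dl.xpath('./dt'):
        label = dt.xpath('normalize-space(.)').extract_first()
        value = dt.xpath('normalize-space(following-sibling::dd[1])').extract_first()
        pairs[label] = value
    return pairs

sections = [parse_dl(dl) for dl in response.xpath('//dl')]

# Dump each section as one tab-delimited row of key=value cells.
with open('out.tsv', 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    for section in sections:
        writer.writerow(['{}={}'.format(k, v) for k, v in section.items()])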

Related

Using scrapy to extract and structure table data

I'm new to Python and scrapy and thought I'd try out a simple review site to scrape. While most of the site structure is straightforward, I'm having trouble extracting the content of the reviews. This portion is visually laid out in sets of 3 (the text to the right of the 良 (good), 悪 (bad), and 感 (impressions) fields), but I'm having trouble pulling this content and associating it with a reviewer or section of a review due to the use of generic divs, \n characters, and other formatting.
Any help would be appreciated.
Here's the site, and the code I've tried for grabbing the reviews, with some results.
http://www.psmk2.net/ps2/soft_06/rpg/p3_log1.html
(1):
response.xpath('//tr//td[@valign="top"]//text()').getall()
This returns the entire set of reviews, but it contains newline markup and, more of a problem, it renders each line as a separate entry. Due to this, I can't figure out where the good, bad, and impression portions end, nor can I easily parse each separate review as entry length varies.
['\n弱点をついた時のメリット、つかれたときのデメリットがはっきりしてて良い', '\nコミュをあげるのが楽しい',
'\n仲間が多くて誰を連れてくか迷う', '\n難易度はやさしめなので遊びやすい', '\nタルタロスしかダンジョンが無くて飽きる。', ... and so forth
(2) As an alternative, I tried:
response.xpath('//tr//td[@valign="top"]')[0].get()
Which actually comes close to what I'd like, save for the markup. Here it seems that it returns the entire field of a review section. Every third element should be the "good" points of each separate review (I've replaced the <> with () to show the raw return).
(td valign="top")\n精一杯考えました(br)\n(br)\n戦闘が面白いですね\n主人公だけですが・・・・(br)\n従来のプレスターンバトルの進化なので(br)\n(br)\n以上です(/td)
(3) Figuring I might be able to get just the text, I then tried:
response.xpath('//tr//td[@valign="top"]//text()')[0].get()
But that only provides one line at a time, with the \n at the front. As with (1), a line-by-line rendering makes it difficult to attribute reviews to reviewers and to the appropriate section of their review.
Of these, (2) seems the closest to what I want, and I was hoping I could get some direction on how to grab each section of each review without the markup. I was thinking that since these sections come in sets of 3, putting them in a list would make pulling them easier later (i.e. all "good" sections at indices 0, 0+3, ...; all "bad" ones at 1, 1+3, ...; etc.), but first I need to actually get the elements; see the slicing sketch below.
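For reference, that sets-of-3 idea maps directly onto Python list slicing. A small sketch, assuming the td cells really do repeat in good/bad/impressions order (the cells still carry markup, as in (2)):

cells = response.xpath('//tr//td[@valign="top"]').getall()
good = cells[0::3]         # every third cell, starting at index 0
bad = cells[1::3]          # every third cell, starting at index 1
impressions = cells[2::3]  # every third cell, starting at index 2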
I've thought about, and tried, iterating over each element with an index and a conditional (something like:)
i = 0
while i < len(response.xpath('//tr//td[@valign="top"]').getall()):
    yield {'review': response.xpath('//tr//td[@valign="top"]')[i].get()}
    i += 1
to pull these out, but I'm a bit lost on how to implement something like this. Not sure where it should go. I've briefly looked at Item Loader, but as I'm new to this, I'm still trying to figure it out.
Here's the block where the review code is.
def parse(self, response):
    for table in response.xpath('body'):
        yield {
            # code for other elements in the review
            'date': response.xpath('//td//div[@align="left"]//text()').getall(),
            'name': response.xpath('//td//div[@align="right"]//text()').getall(),
            # this includes the above elements, and is regular enough that I can
            # systematically extract what I want
            'categories': response.xpath('//tr//td[@class="koumoku"]//text()').getall(),
            'scores': response.xpath('//tr//td[@class="tokuten_k"]//text()').getall(),
            'play_time': response.xpath('//td[@align="right"]//span[@id="setumei"]//text()').getall(),
            # reviews code here
        }
This is a pretty simple task using a part of the text as an anchor (I used string() to get the text content of the whole td):
for review_node in response.xpath('//table[@width="645"]'):
    good = review_node.xpath('string(.//td[b[starts-with(., "良")]]/following-sibling::td[1])').get()
    bad = review_node.xpath('string(.//td[b[starts-with(., "悪")]]/following-sibling::td[1])').get()
    ...
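For completeness, a sketch of how the third field and the yielded item might look (the 感 anchor and the key names are assumptions based on the question, not part of the original answer):

    impressions = review_node.xpath('string(.//td[b[starts-with(., "感")]]/following-sibling::td[1])').get()
    yield {
        'good': good,
        'bad': bad,
        'impressions': impressions,
    }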

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page, and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I want to scrape every variant of shoe size and color. That data can be found by opening the source code and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimensionIndexMap so I can associate the index-map numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way:
import re
import json

script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get everything in a 'script' tag as a string and then grab everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull pieces out of a big string)? Or could someone explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close, Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming that where you may be finding things challenging is when there are errors accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see which boxes are within one another, which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and combine them as you wish.
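A sketch of that last step, pairing each product code with its human-readable size and color (the size_name and color_name keys, and the two-dimension [size, color] layout, are assumptions based on the question's description of the page):

import json

values = json.loads(variationValues)             # e.g. {"size_name": [...], "color_name": [...]}
index_map = json.loads(asinToDimensionIndexMap)  # e.g. {"B01KWIUH5M": [0, 0]}

for asin, (size_idx, color_idx) in index_map.items():
    # [0, 0] -> size '8M US', color 'Teal' in the example from the question
    print(asin, values['size_name'][size_idx], values['color_name'][color_idx])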

Clean API results to get the headlines of news articles?

I have been having trouble finding a way to pull specific text out of the Guardian API results for my dissertation. I have managed to get all my text into Python, but how do you then clean it to get, say, just the headlines of the news articles?
This is a snippet of the API result that I want to pull info from:
{
  "response": {
    "status": "ok",
    "userTier": "developer",
    "total": 1869990,
    "startIndex": 1,
    "pageSize": 10,
    "currentPage": 1,
    "pages": 186999,
    "orderBy": "newest",
    "results": [
      {
        "id": "sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "type": "liveblog",
        "sectionId": "sport",
        "sectionName": "Sport",
        "webPublicationDate": "2016-07-09T13:21:36Z",
        "webTitle": "Tour de France 2016: stage eight – live!",
        "webUrl": "https://www.theguardian.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "apiUrl": "https://content.guardianapis.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
        "isHosted": false
      },
      {
        "id": "sport/live/2016/jul/09/serena-williams-v-angelique-kerber-wimbledon-womens-final-live",
        "type": "liveblog",
        "sectionId": "sport",
        "sectionName": "Sport",
        "webPublicationDate": "2016-07-09T13:21:02Z",
        "webTitle": "Serena Williams v Angelique Kerber: Wimbledon women's final –
...
I'm hoping the OP adds the code they used to the question.
One solution in Python: whatever you get back (from the methods offered by the requests module?) will either already be a deeply nested structure you can index into, or something you can easily map to such a structure via json.loads(the_string_you_displayed).
Sample:
d = json.loads(the_string_you_displayed)
head_line = d['response']['results'][0]['webTitle']
This would put into head_line the value stored in the first dict (index 0) of the results "array" of the response entry's value. (The question was updated, so the full path is now visible.)
That is, in case I read the given sample snippet correctly and it was merely cut during copy and paste here, since the sample as given is invalid JSON.
If the text does not represent valid JSON, the task will depend on sifting through the text via substring or pattern matching and may well be very brittle ...
Update: So assuming the full response structure is stored inside a variable named data:
result_seq = data['response']['results'] # yields a list here
headlines = [result['webTitle'] for result in result_seq]
The last line works like so: it is a list comprehension that compactly creates a list from all entries result in result_seq by picking the value of the key webTitle from each dict.
An explicit for-loop solution picking them all would be:
result_seq = data['response']['results']
headlines = []
for result in result_seq:
    headlines.append(result['webTitle'])
This does not check for errors (such as result dicts without a webTitle key), but Python will raise a matching exception, and one can decide whether to wrap the processing in a try/except block or hope for the best ...
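A minimal sketch of that defensive variant, assuming results without a webTitle key should simply be skipped:

result_seq = data['response']['results']
headlines = []
for result in result_seq:
    try:
        headlines.append(result['webTitle'])
    except KeyError:
        pass  # skip results that lack a 'webTitle' key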

Use BeautifulSoup to Iterate over XML to pull specific tags and store in variable

I'm fairly new to programming and have been trying to find a solution for this, but all I can find are bits and pieces, with no real luck putting it all together.
I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually. So I'm trying to increase efficiency a bit with a scraping program.
Let's say for example that I was looking at this type of test data to experiment with:
<AllergyList>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Flagyl (metronidazole)</Name>
      <Drug>
        <NDCID>00025182151,00025182131,00025182150</NDCID>
      </Drug>
    </Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Deleted>n</Deleted>
    <Status>
      <Active/>
    </Status>
    <ExternalID/>
    <Patient>
      <ExternalID/>
      <FirstName>Testcase</FirstName>
      <LastName>casetest</LastName>
    </Patient>
    <Allergen>
      <Name>Bactrim (sulfamethoxazole-trimethoprim)</Name>
      <Drug>
        <NDCID>13310014501,49999023220</NDCID>
      </Drug>
    </Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
  <Number>2</Number>
</AllergyList>
I've been trying to pull the <Name> tag from between the multiple <Allergen> tags, as well as the respective data from between the <OnsetDate> and <Reaction> tags, while storing the results in respective variables.
So for example I would want to pull Flagyl (metronidazole), difficulty breathing, 02/02/2013, then Bactrim (sulfamethoxazole-trimethoprim), swelling, 05/03/2002, and so on, placing them in separate variables that I can use later.
Pulling the first set from the <Allergen> tag is easy, but I'm having trouble figuring out how to iterate over the XML and store the pulled data in variables. I've been trying to use a for loop while storing the data in a list, but the way I've been writing it, I always pull the same data over and over, depending on the number of iterations I calculate from the len() function, and I have so far failed to store any of it.
I've been racking my brain about this for a while now and I think I may just not be that smart so any help or even pointing me in the right direction would be immensely appreciated.
This seems a simple task because there aren't many nested tags:
from bs4 import BeautifulSoup
import sys

soup = BeautifulSoup(open(sys.argv[1], 'r'), 'xml')

allergies = []
for allergy in soup.find_all('Allergy'):
    d = {
        'name': allergy.Allergen.Name.string,
        'reaction': allergy.Reaction.string,
        'on_set_date': allergy.OnsetDate.string,
    }
    allergies.append(d)

## Use 'allergies' array of dictionaries as you want.
## Example:
print(allergies[1]['reaction'])
Run it with the xml file as argument:
python3 script.py xmlfile
And this test yields:
swelling
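If, as in the question, separate variables per allergy are wanted, a small usage sketch (the variable names are illustrative):

for entry in allergies:
    name = entry['name']          # e.g. 'Flagyl (metronidazole)'
    reaction = entry['reaction']  # e.g. 'difficulty breathing'
    onset = entry['on_set_date']  # e.g. '02/02/2013'
    # ... use name, reaction, and onset here ...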

Can a formfield be selected w/mechanize based on the type of the field (eg. TextControl, TextareaControl)?

I'm trying to parse an HTML form using mechanize. The form itself has an arbitrary number of hidden fields, and the field names and ids are randomly generated, so I have no obvious way to directly select them. Clearly using a name or id is out, and due to the random number of hidden fields I cannot select them by sequence number either, since that always changes too.
However, there are always two TextControl fields right after each other, and below that a TextareaControl. These are the 3 fields I need access to; basically I need to parse their names and all is well. I've been looking through the mechanize documentation for the past couple of hours and haven't come up with anything that seems able to do this, however simple it should seem to be (to me anyway).
I have come up with an alternate solution that involves making a list of the form controls, iterating through it to find the controls whose string representation contains 'Text', returning a new list of those, and then finally stripping out the name with a regular expression. While this works, it seems unnecessary, and I'm wondering if there's a more elegant solution. Thanks guys.
edit: Here's what I'm currently doing to extract that info, if anyone's curious. I think I'll probably just stick with this. It seems unnecessary, but it gets the job done, and it's nothing intensive, so I'm not worried about efficiency.
def formtextFieldParse(browser):
    '''Expects a mechanize.Browser object with a form already selected. Parses
    through the fields, returning a tuple of the names of those fields. There
    SHOULD only be 3 fields: 2 text followed by 1 textarea, corresponding to
    Posting Title, Specific Location, and Posting Description.'''
    import re
    pattern = r'\(.*\)'
    fields = str(browser).split('\n')
    textfields = []
    for field in fields:
        if 'Text' in field:
            textfields.append(field)
    titleFieldName = re.findall(pattern, textfields[0])[0][1:-2]
    locationFieldName = re.findall(pattern, textfields[1])[0][1:-2]
    descriptionFieldName = re.findall(pattern, textfields[2])[0][1:-2]
    return titleFieldName, locationFieldName, descriptionFieldName
I don't think mechanize has the exact functionality you require; could you use mechanize to get the HTML page, then parse the latter with, for example, BeautifulSoup?
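A rough sketch of that suggestion, assuming html holds the page source (e.g. from browser.response().read()) and that the form really does contain exactly two text inputs followed by one textarea:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
text_inputs = soup.find_all('input', attrs={'type': 'text'})
textarea = soup.find('textarea')

# Two text inputs (Posting Title, Specific Location), then the textarea (Posting Description).
title_name, location_name = [inp.get('name') for inp in text_inputs[:2]]
description_name = textarea.get('name')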
