I'm trying to get the town and state for a given zip code using the following site:
http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go
Using the following code I get all the tr tags:
import sys
import os
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)
print soup.find_all('tr')
How do I find a particular tr tag? In examples like "How to find tag with particular text with Beautiful Soup?", you already know the text you are looking for. What do I use if I don't know the text ahead of time?
EDIT
I've now added the following and get nowhere:
for tag in soup.find_all(re.compile("^td align=")):
    print (tag.name)
After looking at the HTML of the site you provided, I'd say the best way to locate the data is by its text rather than by class, id, etc.
First, identify the header row by searching for the keyword "Mailing"; the row that contains the content you want is the next tr after it.
Here is my code:
import urllib2, re, bs4
soup = bs4.BeautifulSoup(urllib2.urlopen("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go"))
# find the header, then find the next tr, which contains your data
tr = soup.find(text=re.compile("Mailing")).find_next("tr")
name, code, zip = [ td.text.strip() for td in tr.find_all("td")]
print name
print code
print zip
After you print them out they look like this:
New York
NY
10023
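For anyone who wants to try the "find the header row by its text, then take the next row" idea without network access, here is a minimal sketch. It swaps BeautifulSoup for the standard library's html.parser so that it runs with no third-party packages, and the sample table (including the header labels other than "Mailing") is a made-up stand-in for the real page, so treat it as an illustration rather than the site's actual markup.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the stripped text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup: only the word "Mailing" is taken from the answer above.
sample_html = """
<table>
  <tr><th>Mailing Name</th><th>State Code</th><th>ZIP Code</th></tr>
  <tr><td align="center">New York</td><td align="center">NY</td><td align="center">10023</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(sample_html)

# Locate the header row by its text, then read the row that follows it.
header_index = next(
    i for i, row in enumerate(collector.rows)
    if any('Mailing' in cell for cell in row)
)
name, state, zip_code = collector.rows[header_index + 1]
print(name, state, zip_code)
```

The point of the sketch is that nothing depends on the position of the table in the page, only on the header text, which is the same design choice the BeautifulSoup version makes with find_next("tr").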
I would navigate to that point in the HTML source with a mix of find() and find_all() calls, because I cannot differentiate it from the other <td> elements by position, attributes or anything else:
import sys
import os
from bs4 import BeautifulSoup
import requests
l = list()
r = requests.get("http://www.zip-info.com/cgi-local/zipsrch.exe?zip=10023&Go=Go")
data = r.text
soup = BeautifulSoup(data)
for table in soup.find('table'):
    center = table.find_all('center')[3]
    for tr in center.find_all('tr')[-1]:
        l.append(tr.string)
print(l[0:-1])
Run it like:
python script.py
That yields:
[u'New York', u'NY']
There is a website where I need to obtain the owners of this item from an online-game item and from research, I need to do some 'web scraping' to get this data. But, the information is in a Javascript document/code, not an easily parseable HTML document like bs4 shows I can easily extract information from. So, I need to get a variable in this javascript document (contains a list of owners of the item I'm looking at) and make it into a usable list/json/string I can implement in my program. Is there a way I can do this? if so, how can I?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas) #prints the sections of the website content that have ja
To scrape a JavaScript variable, BeautifulSoup alone is not enough; a regular expression (re) is needed to pull the assignment out of the script text.
Then use ast.literal_eval to convert the string representation of the dict into a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
ownership_data = re.search(r'ownership_data\s+=\s+.*;', soup.text).group(0)
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
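Both answers above depend on the captured text being exactly the value: splitting on '=' breaks if the value itself contains '=', and a greedy (.*) also captures the trailing semicolon, which json.loads rejects. The following is a sketch of one way to be more forgiving, assuming the variable is assigned on a single line. The script text and the helper name extract_js_value are made up for illustration; the variable names mirror the answers, but the values are not the real page's data.

```python
import ast
import json
import re

# Hypothetical stand-in for the text of a <script> element.
script_text = """
var history_data = {"num_points": 2, "favorited": [10, 12]}; var other = 1;
var ownership_data = {'id': 1029025, 'num_points': 2, 'timestamps': [1491004800, 1491091200]};
"""


def extract_js_value(source, name):
    """Return the value assigned to the JavaScript variable `name`."""
    match = re.search(r'\b' + re.escape(name) + r'\s*=\s*', source)
    if match is None:
        raise ValueError('variable not found: ' + name)
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it, so a trailing ";" does no harm.
        value, _end = json.JSONDecoder().raw_decode(source, match.end())
        return value
    except json.JSONDecodeError:
        # Not strict JSON (for example single-quoted strings): fall back to
        # evaluating the rest of the line as a Python literal.
        line = source[match.end():].splitlines()[0]
        return ast.literal_eval(line.rstrip().rstrip(';'))


history = extract_js_value(script_text, 'history_data')
ownership = extract_js_value(script_text, 'ownership_data')
print(history['favorited'])
print(ownership['timestamps'][0])
```

Note that ast.literal_eval only understands Python literals, so a value containing JavaScript's true, false or null would still need json or some extra handling.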
I am trying to get a value from a webpage. In the source of the page, the data sits inside a CDATA section and is passed to a jQuery call. I have managed to write the code below, which returns a large amount of text; the script at index 21 contains the information I need. However, this output is large and not in a format I understand. Within it I need to isolate and output "redshift":"0.06", but I don't know how. What is the best way to do this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the page source of the URL, you will find two script elements that contain CDATA, but only the one you are interested in also contains jQuery, so you can select the script element on that basis. After that, you need to do some cleaning to get rid of the CDATA markers and the jQuery call. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
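One caveat with the chained replace calls is that .replace(');', '') removes every ");" in the script, including any that happen to sit inside a JSON string. A possible alternative, sketched below, slices between the jQuery.extend(Drupal.settings, prefix and the last ");". The script text is a shortened, made-up stand-in for the real element, and the "note" field is hypothetical, added only to show a ");" inside a string.

```python
import json

# Hypothetical, shortened stand-in for the text of the matching <script> element.
script_text = (
    '<!--//--><![CDATA[//><!--\n'
    'jQuery.extend(Drupal.settings, {"objectFlot": {"plotMain1": '
    '{"params": {"redshift": "0.06", "note": "f(x);"}}}});\n'
    '//--><!]]>'
)

PREFIX = 'jQuery.extend(Drupal.settings,'


def drupal_settings(text):
    """Parse the JSON argument passed to jQuery.extend(Drupal.settings, ...)."""
    start = text.index(PREFIX) + len(PREFIX)
    # rindex finds the closing ");" of the call, not one inside a string value.
    end = text.rindex(');')
    return json.loads(text[start:end])


settings = drupal_settings(script_text)
print(settings['objectFlot']['plotMain1']['params']['redshift'])
```

Because the CDATA markers lie outside the sliced range, they never need to be stripped separately.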
How would the value for the "tier1Category" be extracted from the source of this page?
https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product
soup.find('script')
returns only a subset of the source, and the following returns another source within that code.
json.loads(soup.find("script", type="application/ld+json").text)
Bitto and I have similar approaches to this, however I prefer to not rely on knowing which script contains the matching pattern, nor the structure of the script.
import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs
def nested_dict_iter(nested):
    # walk nested dicts and yield every non-dict (key, value) pair
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
    if 'tier1Category' in script.text:
        j = json.loads(script.text[str(script.text).index('{'):str(script.text).rindex(';')])
        for k,v in list(nested_dict_iter(j)):
            if k == 'tier1Category':
                print(v)
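The generator above descends into nested dicts only, so a key that sits inside a list of dicts would be missed. If the page's JSON ever nests the value that way, a variant along these lines might help. The helper name find_key and the sample data are hypothetical; only the product/results/productInfo/tier1Category path and the "Medicines & Treatments" value come from the answers in this thread.

```python
def find_key(node, target):
    """Yield every value stored under `target`, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield value
            # Keep descending: the value may itself contain the key.
            yield from find_key(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, target)


# Hypothetical data; the 'related' list and 'Vitamins' are made up.
sample = {
    'product': {
        'results': {
            'productInfo': {'tier1Category': 'Medicines & Treatments'},
            'related': [{'productInfo': {'tier1Category': 'Vitamins'}}],
        }
    }
}

matches = list(find_key(sample, 'tier1Category'))
print(matches)
```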
Here are the steps I used to get the output:
1. Use find_all and get the 10th script tag. This script tag contains the tier1Category value.
2. Take the script text from the first occurrence of { to the last occurrence of ; . This gives us proper JSON text.
3. Load the text using json.loads.
4. Study the structure of the JSON to find the path to the tier1Category value.
Code:
import json
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
script_text=soup.find_all('script')[9].text
start=str(script_text).index('{')
end=str(script_text).rindex(';')
proper_json_text=script_text[start:end]
our_json=json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])
Output:
Medicines & Treatments
I think you can use an id. I assume tier 1 is after shop in the navigation tree. Otherwise, I don't see that value in that script tag. I see it in an ordinary script (without the script[type="application/ld+json"] ) tag but there are a lot of regex matches for tier 1
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)
I'm trying to programmatically extract text from this webpage which describes a genome assembly in the public archive:
http://www.ebi.ac.uk/ena/data/view/ERS019623
I have thousands of assemblies that I want to track down and extract the study accession, which is the code on the far left of the table beginning with "PRJ". The URL for each of these assemblies is of the same format as the one above, i.e. "http://www.ebi.ac.uk/ena/data/view/ERS******". I have the ERS code for each of my assemblies so I can construct the URL for each one.
I've tried a few different methods. Firstly, if you add "&display=XML" to the end of the URL it prints the XML (or at least I'm presuming that it's printing the XML for the entire page, because the problem is that the study accession "PRJ******" is nowhere to be seen here). I had used this to extract another code that I needed from the same webpage, the run accession, which is always of the format "ERR******", using the below code:
import urllib2
from bs4 import BeautifulSoup
import re
import csv
with open('/Users/bj5/Desktop/web_scrape_test.csv','rb') as f:
    reader = csv.reader(f) #opens csv containing list of ERS numbers
    for row in reader:
        sample = row[0] #reads index 0 (1st column) of the current row
        ERSpage = "http://www.ebi.ac.uk/ena/data/view/" + sample + "&display=xml" #creates URL using the ERS number from the current row
        page = urllib2.urlopen(ERSpage) #opens url and assigns it to variable page
        soup = BeautifulSoup(page, "html.parser") #parses the html/xml from page and assigns it to variable called soup
        page_text = soup.text #returns text from variable soup, i.e. no tags
        ERS = re.search('ERS......', page_text, flags=0).group(0) #returns first ERS followed by six wildcards
        ERR = re.search('ERR......', page_text, flags=0).group(0) #returns first ERR followed by six wildcards
        print ERS + ',' + ERR + ',' + "http://www.ebi.ac.uk/ena/data/view/" + sample #prints ERS,ERR,URL
This worked very well, but as the study accession is not in the XML I can't use it to access this.
I also attempted to use BeautifulSoup again to download the HTML by doing this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
BASE_URL = "http://www.ebi.ac.uk/ena/data/view/ERS019623"
def get_category_links(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    print soup
get_category_links(BASE_URL)
But again I can't see the study accession in the output from this either...
I have also attempted to use a different python module, lxml, to parse the XML and HTML but haven't had any luck there either.
When I right click and inspect element on the page I can find the study accession by doing ctrl+F -> PRJ.
So my question is this: what is the code that I'm looking at in inspect element, XML or HTML (or something else)? Why does it look different to the code that prints in my console when I try and use BeautifulSoup to parse HTML? And finally how can I scrape the study accessions (PRJ******) from these webpages?
(I've only been coding for a couple of months and I'm entirely self-taught so apologies for the slightly confused nature of this question but I hope I've got across what it is that I'm trying to do. Any suggestions or advice would be much appreciated.)
from bs4 import BeautifulSoup
import requests
import re
r = requests.get('http://www.ebi.ac.uk/ena/data/view/ERS019623&display=xml')
soup = BeautifulSoup(r.text, 'lxml')
ERS = soup.find('primary_id').text
ERR = soup.find('id', text=re.compile(r'^ERR')).text
url = 'http://www.ebi.ac.uk/ena/data/view/{}'.format(ERS)
print(ERS, ERR, url)
out:
ERS019623 ERR048142 http://www.ebi.ac.uk/ena/data/view/ERS019623
bs4 can parse an XML file as well; just treat it like HTML and search by tag name, so there is no need to use a regex to extract the info.
I also found a TEXT download link:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,tax_id,scientific_name,instrument_model,library_layout,fastq_ftp,fastq_galaxy,submitted_ftp,submitted_galaxy,sra_ftp,sra_galaxy,cram_index_ftp,cram_index_galaxy&download=txt
The fields parameter of this link can be changed to get only the data you want, like this:
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS019623&result=read_run&fields=study_accession&download=txt
By doing so, you can get all your data in a text file.
In your sample, soup is a BeautifulSoup object: a representation of the parsed document.
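Since there are thousands of ERS codes to look up, the link above could be built per accession and the response parsed as a table. The sketch below assumes the report comes back as tab-separated text with a header row, which is worth checking against a real download. The helper names are made up, and the sample response uses PRJ_PLACEHOLDER because the real study accession is not shown in this thread.

```python
import csv
import io
from urllib.parse import urlencode

FILEREPORT_URL = 'http://www.ebi.ac.uk/ena/data/warehouse/filereport'


def filereport_url(accession, fields):
    """Build a filereport link for one accession and a list of field names."""
    params = {
        'accession': accession,
        'result': 'read_run',
        'fields': ','.join(fields),
        'download': 'txt',
    }
    # safe=',' keeps the commas between field names unescaped, as in the links above.
    return FILEREPORT_URL + '?' + urlencode(params, safe=',')


def parse_report(text):
    """Parse the report, assuming tab-separated columns with a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter='\t'))


url = filereport_url('ERS019623', ['study_accession'])
print(url)

# Hypothetical response body; PRJ_PLACEHOLDER is not a real accession.
sample_report = 'study_accession\trun_accession\nPRJ_PLACEHOLDER\tERR048142\n'
rows = parse_report(sample_report)
print(rows[0]['study_accession'], rows[0]['run_accession'])
```

In a real run, the text passed to parse_report would come from something like requests.get(url).text for each ERS code in the CSV.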
If you want to print the entire HTML of the document, you can call print(soup.prettify()) or if you want the text within it print(soup.get_text()).
The soup object has other possibilities to access parts of the document you are interested in: to navigate the parsed tree, to search in it ...
I have the following HTML pattern that I want to scrape using BeautifulSoup. The HTML pattern is:
<a href="link">TITLE</a>
I want to grab TITLE and the information that is displayed in the link. That is, if you click the link there is a description of the TITLE. I want that description.
I started with just trying to grab the title with the following code:
import urllib
from bs4 import BeautifulSoup
import re
webpage = urllib.urlopen("http://urlofinterest")
title = re.compile('<a>(.*)</a>')
findTitle = re.findall(title,webpage)
print findTitle
My output is:
% python beta2.py
[]
So this is obviously not even finding the title. I even tried <a href>(.*)</a> and that didn't work. Based on my reading of the documentation, I thought BeautifulSoup would grab whatever text is between the tags I give it, in this case <a> and </a>, so what am I doing wrong?
How come you're importing BeautifulSoup and then not using it at all?
webpage = urllib.urlopen("http://urlofinterest")
urlopen returns a file-like object, so you'll want to read the data from it:
webpage = urllib.urlopen("http://urlofinterest").read()
Something like this (it should get you to a point where you can go further):
>>> blah = '<a href="link">TITLE</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(blah) # change to webpage later
>>> for tag in soup('a', href=True):
...     print tag['href'], tag.string
...
link TITLE
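As for why the original regex returned an empty list: the pattern <a>(.*)</a> only matches an opening tag that is literally "<a>", and the real tag carries an href attribute. The short sketch below shows the difference using the standard re module. The markup is assumed from the output above ("link TITLE"), and the second pattern is only an illustration; a parser such as BeautifulSoup is still the more robust choice for real pages.

```python
import re

# Assumed markup, inferred from the output "link TITLE" above.
html = '<a href="link">TITLE</a>'

# The question's pattern needs a bare "<a>" tag, so the href attribute defeats it.
strict = re.findall(r'<a>(.*)</a>', html)

# Allowing attributes inside the opening tag makes the pattern match.
loose = re.findall(r'<a\s[^>]*>(.*?)</a>', html)

print(strict)
print(loose)
```

The same reasoning explains why <a href>(.*)</a> failed: it looks for the literal text "<a href>", which never appears once the attribute has a value.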