I'm trying to develop an automated script to download the following data file to a utility server and then perform ETL-related processing on it. Looking for Pythonic suggestions. I'm not familiar with the current best options for this type of process among urllib, urllib2, Beautiful Soup, requests, mechanize, Selenium, etc.
The Website
"Full Replacement Monthly NPI File"
The Monthly Data File
The file name (and subsequent url) changes monthly.
Here is my current approach thus far:
from bs4 import BeautifulSoup
import urllib
import urllib2
soup = BeautifulSoup(urllib2.urlopen('http://nppes.viva-it.com/NPI_Files.html').read())
download_links = []
for link in soup.findAll(href=True):
    urls = link.get('href', '/')
    download_links.append(urls)

target_url = download_links[2]
urllib.urlretrieve(target_url, "NPI.zip")
I am not anticipating the content on this clunky govt. site to change, so I thought just selecting the 3rd element of the scraped URL list would be good enough. Of course, if my entire approach is wrongheaded, I welcome correction (data analytics is my personal forte). Also, if I am using outdated libraries, unpythonic practices, or low-performance options, I definitely welcome the newer and better!
In general, requests is the easiest way to get webpages.
If the name of the data files follows the pattern NPPES_Data_Dissemination_<Month>_<year>.zip, which seems logical, you can request that directly:
import requests
url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}_{}.zip"
r = requests.get(url.format("March", 2015))
The raw bytes of the ZIP are then in r.content (for a binary file like this, use r.content rather than r.text).
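Since the file is a large binary ZIP, here is a minimal sketch of streaming it straight to disk (the month and year values are just examples):
import requests
url = "http://nppes.viva-it.com/NPPES_Data_Dissemination_{}_{}.zip"
# stream=True avoids holding the whole (large) ZIP in memory at once
with requests.get(url.format("March", 2015), stream=True) as r:
    r.raise_for_status()
    with open("NPI.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)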
If the data-file name is less certain, you can get the webpage and use a regular expression to search for links to zip files:
In [1]: import requests
In [2]: r = requests.get('http://nppes.viva-it.com/NPI_Files.html')
In [3]: import re
In [4]: re.findall('http.*NPPES.*\.zip', r.text)
Out[4]:
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
'http://nppes.viva-it.com/NPPES_Deactivated_NPI_Report_031015.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']
The regular expression in In[4] basically says to find strings that start with "http", contain "NPPES" and end with ".zip".
This isn't specific enough. Let's change the regular expression as shown below:
In [5]: re.findall('http.*NPPES_Data_Dissemination.*\.zip', r.text)
Out[5]:
['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_030915_031515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_031615_032215_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_032315_032915_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_033015_040515_Weekly.zip',
'http://nppes.viva-it.com/NPPES_Data_Dissemination_100614_101214_Weekly.zip']
This gives us the URL of the file we want, but also the URLs of the weekly files.
In [6]: fileURLS = re.findall('http.*NPPES_Data_Dissemination.*\.zip', r.text)
Let's filter out the weekly files:
In [7]: [f for f in fileURLS if 'Weekly' not in f]
Out[7]: ['http://nppes.viva-it.com/NPPES_Data_Dissemination_March_2015.zip']
This is the URL you seek. The whole scheme does depend on how regular the file names stay, though. You can also add a flag such as re.IGNORECASE to the regular expression searches to ignore the case of the letters, which makes the matching more forgiving.
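Putting the pieces together, here is a rough sketch of the whole scrape-filter-download flow, assuming the link and file-name patterns stay as shown above:
import re
import requests

# Scrape the index page and keep only the monthly (non-weekly) dissemination files
page = requests.get('http://nppes.viva-it.com/NPI_Files.html')
file_urls = re.findall(r'http.*NPPES_Data_Dissemination.*\.zip', page.text, re.IGNORECASE)
monthly = [u for u in file_urls if 'weekly' not in u.lower()]

if monthly:
    target_url = monthly[0]
    # Save under the original file name, e.g. NPPES_Data_Dissemination_March_2015.zip
    with open(target_url.rsplit('/', 1)[-1], 'wb') as f:
        f.write(requests.get(target_url).content)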
Related
There is a URL with a .bin attachment in my HTML file. My goal is to extract the full link with my Python script. I am running this script across many HTML files, and the location of the .bin URL may change. If I were able to get the index of the beginning of the URL and the end, I could extract it that way.
I tried doing a word search through the HTML files, but there are a few .bin URLs and I only want the first one. Any ideas, or any other methods, would be appreciated.
import urllib.request, urllib.error, urllib.parse
html_link = "www.mywebsitelink.com"
response = urllib.request.urlopen(html_link)
webContent = response.read()
I suggest you look at using Regex.
In your example, you will probably be looking for something like:
http[^\s"]+\.bin
(line anchors such as ^ and $ won't help here, since the URL sits in the middle of an HTML page rather than on a line of its own).
You can test this out and explore what each part of the Regex expression means using this helpful tool: regex101
Your code would probably look something like this:
import re

# response.read() returns bytes, so decode before searching; re.search returns only the first match
match = re.search(r'http[^\s"]+\.bin', webContent.decode("utf-8", errors="ignore"))
bin_url = match.group(0) if match else None
What I am trying to do is the following. There is this web page: http://xml.buienradar.nl .
From that, I want to extract a value every n minutes, preferably with Python. Let's say the windspeed at the Gilze-Rijen station. That is located on this page at:
<buienradarnl>.<weergegevens>.<actueel_weer>.<weerstations>.<weerstation id="6350">.<windsnelheidMS>4.80</windsnelheidMS>
Now, I can find loads of questions with answers that use Python to read a local XML file. But, I would rather not need to wget or curl this page every couple of minutes.
Obviously, I'm not very familiar with this.
There must be a very easy way to do this. The answer either escapes me or is drowned in all the answers that solve problems with a local file.
I would use urllib2 and BeautifulSoup.
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup
req = Request("http://xml.buienradar.nl/")
response = urlopen(req)
output = response.read()
soup = BeautifulSoup(output)
print soup.prettify()
Then you can traverse the output like you were suggesting:
soup.buienradarnl.weergegevens (etc)
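As a concrete sketch of that idea (using requests instead of urllib2, and BeautifulSoup's "xml" parser, which needs lxml installed; the tag names come from the XML path in the question):
import time
import requests
from bs4 import BeautifulSoup

def gilze_rijen_wind_speed():
    # Fetch the XML feed and parse it with the XML parser (requires lxml)
    response = requests.get("http://xml.buienradar.nl/")
    soup = BeautifulSoup(response.content, "xml")
    # <weerstation id="6350"> is the Gilze-Rijen station from the question
    station = soup.find("weerstation", id="6350")
    return float(station.find("windsnelheidMS").text)

# Poll every n minutes, e.g. n = 10
while True:
    print(gilze_rijen_wind_speed())
    time.sleep(10 * 60)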
I've been reading up on parsing XML with Python all day, but looking at the site I need to extract data from, I'm not sure if I'm barking up the wrong tree. Basically, I want to get the 13-digit barcodes from a supermarket website (found in the name of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images; the barcode for the first item is 0000003235676. However, when I look at the page source (I assume this is the best way to extract all of the barcodes in one go with Python, urllib and BeautifulSoup), all of the barcodes are on one line (line 12), and the data doesn't seem to be structured as I would expect in terms of elements and attributes.
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand the DOM tree is not the same thing as the raw source, but why are they so different? When I go to Inspect Element in Firefox, the data seems structured as I would expect.
Apologies if this comes across as totally stupid; thanks in advance.
Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse
url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()
# with open('/tmp/test.html', 'w') as f:
# f.write(content)
# Useful for debugging off-line:
# with open('/tmp/test.html', 'r') as f:
# content = f.read()
soup = bs.BeautifulSoup(content)
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
    href = tag['src']
    scheme, netloc, path, query, fragment = urlparse.urlsplit(href)
    barcodes.add(path.split('\\')[1])

print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])
Since your site uses JavaScript to format its content, you might find it useful to switch from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user in a web browser. This GitHub project seems to solve your task.
Another option is to filter the JSON data out of the page's JavaScript and read the values directly from there.
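As a rough sketch of that second option, assuming the Product({...}) lines shown in the question still appear verbatim in the page source, with the 13-digit code embedded in imageURL:
import re
import requests

url = ("http://www.tesco.com/groceries/SpecialOffers/"
       "SpecialOfferDetail/Default.aspx?promoId=A31033985")
html = requests.get(url).text

# Each product looks like: new TESCO.sites.UI.entities.Product({... imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg" ...});
# so the 13-digit code can be captured straight out of the imageURL value.
barcodes = set(re.findall(r'imageURL:"[^"]*/pi/\d+/(\d{13})/', html))
print(barcodes)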
I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a Python script to do this? I have intermediate knowledge of Python.
I'm just looking for a bit of guidance; please point me towards a wiki or library to accomplish this.
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page where the file is hosted, and clicking there downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
# Open in binary mode ('wb'), since the download is not text
with open(localFilePath, 'wb') as f:
    f.write(urlopen(remoteFilePath).read())
Hope that helps
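As a rough illustration of that flow, here is a sketch using requests and BeautifulSoup in place of mechanize; the link filters and the .mp3 extension are hypothetical placeholders, since the page layout isn't shown here:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX = "http://www.emuparadise.me/soundtracks/highquality/index.php"
index_soup = BeautifulSoup(requests.get(INDEX).text, "html.parser")

# Hypothetical filter: keep only the album links you actually want
album_links = [urljoin(INDEX, a["href"])
               for a in index_soup.find_all("a", href=True)
               if "/soundtracks/" in a["href"]]

for album_url in album_links:
    album_soup = BeautifulSoup(requests.get(album_url).text, "html.parser")
    # Hypothetical filter: pick out the actual file links on the album page
    for a in album_soup.find_all("a", href=True):
        if a["href"].endswith(".mp3"):
            file_url = urljoin(album_url, a["href"])
            with open(file_url.rsplit("/", 1)[-1], "wb") as f:
                f.write(requests.get(file_url).content)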
Make a URL request for the page. Once you have the source, filter it to extract the URLs.
The files you want to download are URLs that contain a specific extension, so you can do a regular expression search for all URLs that match your criteria.
After filtering, make a URL request for each matched URL's data and write it to disk.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls, for which the scraping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
    #Function to handle downloading of files.
    #Arg: url => a String
    #Output: Boolean to signify if the file has been written to disk
    #Validation of the url assumed, for the sake of keeping the illustration short
    urlAddInfo = urllib.urlopen(targetUrl)
    data = urlAddInfo.read()
    fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text after the last slash '/'
    if not fileNameSearch:
        sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
        return False
    fileName = fileNameSearch.groups(1)[0]
    with open(fileName, "wb") as f:
        f.write(data)
        sys.stderr.write("Wrote %s to disk\n"%(fileName))
    return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python 2.x. However, I wrote a crawler that works on any version starting from 2.x.
Why yes! Five years later and not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code examples here, because I mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either Python 2 or Python 3, urllib[n]* is what you're going to want to use to pull something down from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests, with docs here. requests is the gold standard for gettin' stuff off the web with Python. I suggest you use it.
uplink, with docs here. It's relatively new and aimed at more programmatic client interfaces.
aiohttp via asyncio, with docs here. asyncio is only included in Python >= 3.5, and it's also extra confusing. That said, if you're willing to put in the time, it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree API is what the pip-installable lxml package builds on and makes easier to use. If you want to reinvent the wheel and write a bunch of your own complicated logic, the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml, with docs here. As I said before, lxml provides the same ElementTree-style interface in a friendlier form and implements all those parsing tools you'd otherwise need to build yourself. However, as you can see by visiting the docs, it's still not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links, not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs, because it allows you to download them in parallel** (a minimal sketch follows the footnotes below).
* - (where n was decided on from Python version to Python version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
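For orientation only, here is a minimal sketch of the aiohttp option from Segment 3, assuming Python >= 3.7 and placeholder URLs:
import asyncio
import aiohttp

# Placeholder list; swap in the links you scraped in Segment 2
URLS = [
    "http://example.com/file1.zip",
    "http://example.com/file2.zip",
]

async def fetch(session, url):
    # Download one URL and return its raw bytes
    async with session.get(url) as resp:
        return await resp.read()

async def main():
    async with aiohttp.ClientSession() as session:
        # gather() schedules all the downloads concurrently
        return await asyncio.gather(*(fetch(session, url) for url in URLS))

payloads = asyncio.run(main())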
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file
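For instance, a minimal sketch of that split, calling wget from Python and then parsing the saved page (the URL and file name here are just placeholders):
import subprocess
from bs4 import BeautifulSoup

# Let wget handle the download (assumes wget is installed and on the PATH)
subprocess.check_call(["wget", "-O", "page.html", "http://example.com/index.html"])

# Then parse the saved file with BeautifulSoup
with open("page.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
print([a["href"] for a in soup.find_all("a", href=True)])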
I'm currently working on a school project whose goal is to analyze scam mails with the Natural Language Toolkit package. Basically, I want to compare scams from different years and try to find a trend: how has their structure changed over time?
I found a scam-database: http://www.419scam.org/emails/
I would like to download the content of the links with python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links, but I don't know yet how I can download the content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have href's from the links, you essentially need to open all of those links, and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web-crawler. If you can use other people's code, you may find something that fits by googling "python Web crawler." I've looked at a few of those, and they are straightforward enough, but may be overkill for this task. Most web-crawlers use recursion to traverse the full tree of a given site. It looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or give you a sense of how the web crawling is done:
from BeautifulSoup import BeautifulSoup
import urllib2, re
emailContents = []
def analyze_emails():
    # This function and any sub-routines would analyze the emails after they are loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # Open, soup, and parse the page.
    # Looks like the email itself is in a "blockquote" tag, so that may be the starting place.
    # From there you'll need to create arrays and/or dictionaries of the emails' contents to do your analysis on, e.g. emailContents

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    email_page_links = []  # add your own code here to filter the list-page soup to get all the relevant links to actual email pages
    for link in email_page_links:
        parse_email_page(link['href'])

def main():
    html = urllib2.urlopen('http://www.419scam.org/emails/').read()
    soup = BeautifulSoup(html)
    # I use '20' to filter links since all the relevant links seem to have a 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))
    for link in links:
        parse_list_page(link['href'])
    analyze_emails()

if __name__ == "__main__":
    main()