So I wrote a crawler for a friend that goes through a large list of web pages that are search results, pulls all the links off each page, checks whether they're already in the output file, and adds them if they're not there. It took a lot of debugging, but it works great! Unfortunately, the little bugger is really picky about which anchor tags it deems important enough to add.
Here's the code:
#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function included in urlparse
import urllib2
import requests  # not necessary, but kept here in case of future additions

urls_filename = "myurls.txt"    # input text file: the list of URLs to scan
output_filename = "output.txt"  # output file that you will export to Excel
keyword = "skin"                # optional keyword, not used by this script

with open(urls_filename, "r") as f:
    url_list = f.read()  # read the whole input file into one string

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # split the file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # this (attempts to) present the program as a Firefox browser
        try:
            response = urllib2.urlopen(url)  # open the URL from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()      # the raw HTML of the opened page
        soup = BeautifulSoup(page)  # feed the HTML to BeautifulSoup so it can parse it
        urls_all = soup('a')        # collect all the anchor tags on the page
        for link in urls_all:
            if 'href' in dict(link.attrs):
                url = urljoin(url, link['href'])  # resolve a relative link, e.g. "/support/contactus.html", against the domain
                if url.find("'") != -1:
                    continue  # skip hrefs containing quotes (usually JavaScript)
                url = url.split('#')[0]  # drop fragment identifiers
                if url[0:4] == 'http' and url not in output_filename:  # keep only web pages not already in the list
                    f.write(url + "\n")  # if it's not in the list, write it to output_filename
It works great except for the following link:
https://research.bidmc.harvard.edu/TVO/tvotech.asp
This link has a number of links like "tvotech.asp?Submit=List&ID=796" and it's straight up ignoring them. The only anchor that goes into my output file is the main page itself. It's bizarre, because looking at the source code their anchors are pretty standard: they have 'a' and 'href', and I see no reason bs4 would just pass them over and only include the main link. Please help. I've tried removing http from line 30, or changing it to https, and that just removes all the results; not even the main page comes into the output.
That's because one of the links there has a mailto in its href. It then gets assigned to the url parameter and breaks the rest of the links as well, because they don't pass the url[0:4] == 'http' condition. It looks like this:
mailto:research#bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
You should either filter it out, or stop reusing the same url variable inside the loop; note the change to url1:
for link in urls_all:
    if 'href' in dict(link.attrs):
        url1 = urljoin(url, link['href'])  # resolve the relative link against the domain
        if url1.find("'") != -1:
            continue  # skip hrefs containing quotes
        url1 = url1.split('#')[0]
        if url1[0:4] == 'http' and url1 not in output_filename:  # check that the item is a web page and not already listed
            f.write(url1 + "\n")
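A variant of the same fix is to filter by URL scheme rather than by string position; here is a minimal Python 3 sketch. The collect_links helper is illustrative, not part of the original script, and the seen set also sidesteps the url not in output_filename check, which tests against the filename string rather than the file's contents:

```python
from urllib.parse import urljoin, urlparse

def collect_links(base_url, hrefs, seen=None):
    """Resolve hrefs against base_url, keeping only new http(s) links."""
    seen = set() if seen is None else seen
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href).split('#')[0]  # resolve and drop fragments
        if urlparse(absolute).scheme in ('http', 'https') and absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out

links = collect_links('https://research.bidmc.harvard.edu/TVO/tvotech.asp',
                      ['tvotech.asp?Submit=List&ID=796',   # relative query link
                       'mailto:research@bidmc.harvard.edu',  # filtered out by scheme
                       '/TVO/tvotech.asp#top'])             # fragment stripped
```

The mailto link is dropped by the scheme check instead of silently poisoning the loop variable.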
I have around 900 pages, and each page contains 10 buttons (each button has a PDF). I want to download all the PDFs: the program should browse to all the pages and download the PDFs one by one.
My code only searches for .pdf, but my hrefs do not contain .pdf; the pages are addressed by page_no (1 to 900). This is the website:
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
and below is an example of the link text:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href associated with the links you call buttons, then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is, anchor (a) tags whose direct parent element has the class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest keeping a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this dict during your loop over the desired number of pages.
As there are over 800 pages, I have chosen to add an additional termination page count variable called end_number. I don't want to loop over all the pages, so this allows me an early exit. You can remove this parameter if desired.
Next, you need to determine the actual number of pages. For this you can use the following CSS selector to get the Last pagination link, extract its data-ci-pagination-page value, and convert it to an integer. This then becomes num_pages (the number of pages) at which to terminate your loop:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements share a parent with the class pagination; in other words, the anchor in the last li, which is the last page link in the pagination element.
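Both selectors can be sanity-checked against a small inline snippet; the HTML below is a simplified stand-in for the real page's markup, not a copy of it:

```python
from bs4 import BeautifulSoup

html = """
<p class="bid_no pull-left"><a href="/showbidDocument/2993132">GEM/2021/B/1804626</a></p>
<ul class="pagination">
  <li><a data-ci-pagination-page="2" href="?page_no=2">2</a></li>
  <li><a data-ci-pagination-page="836" href="?page_no=836">Last</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

bid_link = soup.select_one('.bid_no > a')  # the anchor inside the bid-number paragraph
last_page = soup.select_one('.pagination li:last-of-type > a')  # anchor in the last li
num_pages = int(last_page['data-ci-pagination-page'])
```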
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop over the key/value pairs, issue requests for the content, and write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary, to optimize what is an I/O-bound process; then loop over that dictionary writing to disk, perhaps with a multi-processing approach, to optimize the more CPU-bound part.
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could currently end up with something like (836 * 10) + 836 requests.
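The asynchronous idea could be sketched with a thread pool; fetch here is a stand-in that you would replace with something like lambda u: s.get(u).content for real use:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(pdf_links, fetch, max_workers=8):
    """Issue all downloads concurrently; returns {name: content}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves order, so results line up with the dict's keys
        contents = pool.map(fetch, pdf_links.values())
        return dict(zip(pdf_links.keys(), contents))

# Example run with a dummy fetch function instead of real network calls.
result = fetch_all({'GEM_2021_B_1804626': 'https://example.com/a.pdf'},
                   fetch=lambda url: b'%PDF stub for ' + url.encode())
```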
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        # print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:
import requests
from bs4 import BeautifulSoup

url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            # if you have a full link
            href = pdf.get('href')
            # if you have a link without the full path, like /showbidDocument/2993132
            # href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)
I would love some assistance with an issue I'm currently having.
I'm working on a little Python scanner as a project.
The libraries I'm currently importing are:
requests
BeautifulSoup
re
tld
The exact issue regards the 'scope' of the scanner. I'd like to pass a URL to the code and have the scanner grab all the anchor tags from the page, but only the ones relevant to the base URL, ignoring out-of-scope links and also subdomains.
Here is my current code. I'm by no means a programmer, so please excuse sloppy, inefficient code.
import requests
from bs4 import BeautifulSoup
import re
from tld import get_tld, get_fld

# This grabs the URL
print("Please type in a URL:")
URL = input()

# This strips out everything, leaving only the first-level domain (future scope function)
def strip_domain(URL):
    global domain_name
    domain_name = get_fld(URL)

strip_domain(URL)

# This makes the request and cleans up the source code
def connection(URL):
    r = requests.get(URL)
    status = r.status_code
    sourcecode = r.text
    soup = BeautifulSoup(sourcecode, features="html.parser")
    cleanupcode = soup.prettify()

    # This strips the anchor tags and adds them to the links list
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    # This writes our clean anchor tags to a file
    with open('source.txt', 'w') as f:
        for item in links:
            f.write("%s\n" % item)

connection(URL)
The exact code issue is around the "for link in soup.findAll" section. I have been trying to parse the list for anchor tags that only contain the base domain, which is the global var "domain_name", so that only the relevant links get written to the source.txt file:
google.com accepted
google.com/file accepted
maps.google.com not written
If someone could assist me or point me in the right direction, I'd appreciate it. I was also thinking it would be possible to write every link to the source.txt file and then alter it afterwards, removing the 'out of scope' links, but I really thought it more beneficial to do it without having to create additional code.
Additionally, I'm not the strongest with regex, but here is something that may help. This is some regex code to catch all variations of http, www, https:
(^http:\/\/+|www.|https:\/\/)
To this I was going to append:
.*{}'.format(domain_name)
I provide two different situations, because I don't agree that the href value is simply xxx.com. In practice you will encounter three or four or more kinds of href values, such as /file, folder/file, etc. So you have to transform relative paths into absolute paths; otherwise you cannot gather all of the URLs.
Regex: (\/{2}([w]+.)?)([a-z.]+)(?=\/?)
(\/{2}([w]+.)?) matches the non-main part starting from //
([a-z.]+)(?=\/?) matches the specified characters until we hit /; we ought not to use .* (it over-matches)
My Code
import re

_input = "http://www.google.com/blabla"
all_part = re.findall(r"(\/{2}([w]+.)?)([a-z.]+)(?=\/?)", _input)[0]
_partA = all_part[2]            # google.com
_partB = "".join(all_part[1:])  # www.google.com
print(_partA, _partB)

site = [
    "google.com",
    "google.com/file",
    "maps.google.com"
]
href = [
    "https://www.google.com",
    "https://www.google.com/file",
    "http://maps.google.com"
]

for ele in site:
    if re.findall("^{}/?".format(_partA), ele):
        print(ele)

for ele in href:
    if re.findall("{}/?".format(_partB), ele):
        print(ele)
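If regex proves brittle, the scoping can also be done by comparing hosts directly with the standard library. A sketch follows; note that for multi-part suffixes like .co.uk, the get_fld function from tld that the question already imports would be more robust than this bare comparison:

```python
from urllib.parse import urlparse

def in_scope(link, base_domain):
    """True when the link's host is exactly base_domain or www.base_domain."""
    host = urlparse(link).netloc.lower()
    return host == base_domain or host == 'www.' + base_domain

checks = [in_scope(u, 'google.com') for u in (
    'https://www.google.com',       # accepted
    'https://www.google.com/file',  # accepted
    'http://maps.google.com',       # subdomain, rejected
)]
```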
I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from these links and storing it in a text file.
I am currently stuck with 2 issues:
1. Only the first few links are scraped. I'm unable to extract links from the other pages (the website contains a load-more button), and I don't know how to use the XHR object in the code.
2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information, and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
from pprint import pprint
import requests
import lxml
import csv
import urllib2
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content, "lxml")
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]

pprint(get_url_for_search_key('digital advertising'))

with open('ctp_output.csv', 'w+') as f:
    f.write('\n'.join(get_url_for_search_key('digital advertising')))
    f.seek(0)
Reading the CSV file, scraping the respective content, and storing it in a .txt file:
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
Regarding your second problem, your mode is off: you'll need to convert w+ to a+. In addition, your indentation is off.
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
The + suffix will create the file if it doesn't exist. However, w+ will erase all existing contents before writing on each open, whereas a+ will append to the file if it exists, or create it if it does not.
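The difference is easy to demonstrate with a throwaway file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

for _ in range(2):              # 'w+' truncates on every open
    with open(path, 'w+') as f:
        f.write('line\n')
with open(path) as f:
    truncated = f.read()        # only the last write survives

for _ in range(2):              # 'a+' appends instead
    with open(path, 'a+') as f:
        f.write('line\n')
with open(path) as f:
    appended = f.read()         # previous content plus two more lines
```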
For your first problem, there's no option but to switch to something that can automate clicking browser buttons and whatnot. You'd have to look at Selenium. The alternative is to manually search for that button, extract the URL from its href or text, and then make a second request. I leave that to you.
If there are more pages with results, observe what changes in the URL when you manually click to go to the next page of results. I can guarantee 100% that a small piece of the URL will have either a subpage number or some other variable encoded in it that relates strictly to the subpage. Once you have figured out the combination, you just fit it into a for loop where you put a .format() into the URL that you want to scrape, and keep navigating this way through all the subpages of the results.
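For example, if the subpage number turns out to live in a query parameter, the loop could look like the sketch below. The page parameter name and the search URL are placeholders; use whatever you actually observed in the address bar:

```python
base = 'http://www.marketing-interactive.com/?s=digital+advertising&page={}'

# Build one URL per subpage; 4 would be replaced by the real last page number.
urls = [base.format(n) for n in range(1, 4)]
for url in urls:
    # response = requests.get(url)  # scrape each subpage here
    print(url)
```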
As for what the last subpage number is: you have to inspect the HTML code of the site you are scraping, find the variable responsible for it, and extract its value. See if there is "class": "Page" or equivalent in their code; it may contain the number you will need for your for loop.
Unfortunately there is no magic navigate-through-subresults option, but this gets pretty close :).
Good luck.
I'm trying to find a specific CSS media query (@media only screen) in the CSS files of websites, using a crawler in Python 2.7.
Right now I can crawl websites/URLs (from a CSV file) to find specific keywords in their HTML source code, using the following code:
import urllib2

keyword = ['keyword to find']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')
f.close()
However, I would now like to crawl the websites/URLs from the CSV file to find the @media only screen query in their CSS files/source code. What should my code look like?
So, you have to:
1° read the CSV file and put each URL in a Python list;
2° loop over this list, go to the pages, and extract the list of CSS links. You need an HTML parser, for example BeautifulSoup;
3° browse the list of links and extract the item you need. There are CSS parsers like tinycss or cssutils, but I've never used them. A regex can maybe do the trick, even if this is probably not recommended;
4° write the results.
Since you know how to read the CSV (PS: no need to close the file with f.close() when you use the with open method), here is a minimal suggestion for operations 2 and 3. Feel free to adapt it to your needs and to improve it. I used Python 3, but I think it works on Python 2.7.
import re
import requests
from bs4 import BeautifulSoup

url_list = ["https://76crimes.com/2014/06/25/zambia-to-west-dont-watch-when-we-jail-lgbt-people/"]

for url in url_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        css_links = [link["href"] for link in soup.findAll("link") if "stylesheet" in link.get("rel", [])]
        print(css_links)
    except Exception as e:
        print(e, url)
        pass

css_links = ["https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c"]

# your regular expression
pattern = re.compile(r'@media only screen.+?\}')

for url in css_links:
    try:
        response = requests.get(url).text
        media_only = pattern.findall(response)
        print(media_only)
    except Exception as e:
        print(e, url)
        pass
My code successfully scrapes the tr align="center" tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.
However, there are multiple pages available at the site above, and I would like to be able to scrape them all.
For example, with the URL above, when I click the link to "page 2" the overall URL does NOT change. I looked at the page source and saw JavaScript code that advances to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code, which works for page 1 only:
import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that come and go from the page-change action when you click the link to view the other pages. The way to check this is to use Chrome's inspection tool (press F12) or install the Firebug extension in Firefox; I will be using Chrome's inspection tool in this answer.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, only one request will appear, and it's a POST method; all the other elements will quickly follow and fill the page.
Click on that POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).
Whenever the URL has variables like page numbers, location markers, or categories, more often than not the site uses query strings. Long story short, it's similar to an SQL query (actually, it sometimes is an SQL query) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find them.
As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests, because these are the kind of requests made when you submit forms, log in to websites, etc.; basically, anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.
What the above paragraph basically means is that you can (but not always) append the form data to your URL, and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.
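Since the form data here is a single pageNum field, building the final URL can be sketched with urlencode; the parameter values below are taken from the question's URL:

```python
from urllib.parse import urlencode

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm'
# The existing query-string parameters plus the pageNum field from the form data.
params = {'campId': 1, 'termId': 201501, 'subjId': 'ACCY', 'pageNum': 2}
page_url = base_url + '?' + urlencode(params)
```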
Test if it works by adding it to the URL.
Et voilà, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there; the only things remaining are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)

# Use regex to isolate only the links of the page numbers, the ones you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page; otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because of Python's range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
We use a regular expression to get the proper links, then build a list of URL strings using a list comprehension, and finally iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. I also updated both the above and below code so as not to error out when only a single page is available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the ones you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because of Python's range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')