Web scraping every forum post (Python, BeautifulSoup)

Hello once again, fellow Stack'ers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members, I managed to get as far as mining through all pages for a certain topic, gathering the date, title, and link for each post.
I also have a separate script I am now struggling to implement (for every link found, Python creates a new soup for it, scrapes through all the posts, and then goes back to the previous link).
I would really appreciate any tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but it seems right to me after checking through it multiple times.
Here's the code snippet:
link += (div.get('href'))
savedData += "\n" + title + ", " + link
tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 3:
    for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
        for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += (tempNext.get('href'))
        post = ""
        post += tempRow.get_text(strip=True)
        postData += post + "\n"
    tempNumber += 1
    tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
    tempSoup = make_soup(tempNewUrl)
    print(tempNewUrl)

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
My main issue so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts in a forum thread.
This is the output I'm getting:
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
1
So it does seem to find the correct links for the new pages and scrape them; however, on the next iteration it prints the new dates AND the same exact pages. There's also a really weird 10-12 second delay after the last link is printed; only then does it hop down to print the number 1 and then bash out all the new dates.
But after going to the next forum thread's link, it scrapes the same exact data every time.
Sorry if it looks really messy; it is sort of a side project and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about my possibly wrong logic would be greatly appreciated!

So after spending a little bit more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all pages, and continues on with the next link.
This is the fixed code, if anyone can make use of it.
link += (div.get('href'))
savedData += "\n" + title + ", " + link
soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)

while tempNumber < 4:
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        post = ""
        post += postScrape.get_text(strip=True)
        postData += post + "\n"
        print(post)
    for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
        tempNextPage = ""
        tempNextPage += (tempNext.get('href'))
        print(tempNextPage)
    soup3 = ""
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
    tempNumber += 1

tempNumber = 1
number += 1
print(number)
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)
All I had to do was separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.
The non-working bit: the first 2 threads of the provided link have multiple pages of posts; the following 10+ threads do not. I cannot figure out a way to check the value of for tempNext in soup3.find_all(title=re.compile("^Next Page -")): outside of the loop to see whether it's empty or not. If it does not find a next-page element/href, it just reuses the last one. But if I reset the value after each run, it no longer mines each page. A solution that just created another problem :D.
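One way around that last problem, as a minimal sketch (assuming the same make_soup() helper and vBulletin markup as above): treat find_all() as the list it returns, and break out of the paging loop when that list is empty, instead of reusing the stale tempNextPage value.

base = 'http://www.automotiveforums.com/vbulletin/'
soup3 = make_soup(base + link)
while True:
    # scrape every post on the current page
    for postScrape in soup3.find_all(id=re.compile("^td_post_")):
        postData += postScrape.get_text(strip=True) + "\n"
    # find_all() returns a list; an empty list means this is the last page
    next_links = soup3.find_all(title=re.compile("^Next Page -"))
    if not next_links:
        break
    soup3 = make_soup(base + next_links[0].get('href'))

This also removes the hard-coded page counter, so threads with any number of pages are handled.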

Many thanks, dear Norbis, for sharing your ideas, insights, and concepts.
Since you offer only a snippet, I'll just try to provide an approach that shows how to log in to a phpBB forum using a payload:
import requests

forum = "the forum name"  # base URL of the forum, ending with a slash
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'username': 'username', 'password': 'password',
           'redirect': 'index.php', 'sid': '', 'login': 'Login'}

session = requests.Session()  # a Session keeps the login cookie for later requests
r = session.post(forum + "ucp.php?mode=login", headers=headers, data=payload)
print(r.text)
But wait: instead of manipulating the website with requests, we can also use browser automation; mechanize offers this. That way we don't have to manage the session ourselves and only need a few lines of code to craft each request.
An interesting example is on GitHub: https://github.com/winny-/sirsi/blob/317928f23847f4fe85e2428598fbe44c4dae2352/sirsi/sirsi.py#L74-L211
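For completeness, a minimal mechanize sketch of the same login (this assumes the login form is the first form on the page and that its fields are named username and password, which you'd verify against the actual markup; the forum URL is a placeholder):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots.txt for this sketch
br.open("https://example.com/forum/ucp.php?mode=login")  # hypothetical forum URL
br.select_form(nr=0)          # assumes the login form is the first form on the page
br["username"] = "username"
br["password"] = "password"
response = br.submit()
print(response.read())

mechanize keeps cookies between requests automatically, which is exactly the session bookkeeping the requests version does by hand.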


BeautifulSoup unable to get data from mds-data-table on Morningstar

I'm trying to get the dividend information from Morningstar.
The following code works for scraping info from Finviz, but the dividend information there is not the same as on my broker's platform.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if (show):
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return(test)

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar will only load the tables after running a script, to prevent this exact type of scraping behavior. If you view the source immediately on load, you won't be able to see any HTML.
What you're going to want to do is find the JavaScript code that is loading the elements, and point bs4 at that instead. You'll have to poke around the files, but somewhere deep in those JS files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
from urllib.request import urlopen
import logging
import time

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit  # abort on an unknown exchange (a return inside a larger function in the original)

time_now = int(time.time())
time_delay = int(time.time() + 150)
morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting since, but you can find this and much more in my GitHub project wxStocks.
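Note that urlopen() returns an HTTPResponse object, so the print above only shows the object itself, not the report; to inspect the payload you would still read and decode it. A minimal sketch, assuming the endpoint still responds:

body = morningstar_raw.read().decode('utf-8')
print(body[:500])  # first 500 characters of the jsonp-wrapped report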

Sorting audiobooks on Audible.com by release date when using Python requests library

I am trying to reproduce the result of "Scraping and Exploring the Entire English Audible Catalog" by Toby Manders, and to add results for the books released after that article was published. The idea is to take Manders' dataset and add equivalent fields for all the new audiobooks from the past year or so, with as few HTTP requests to Audible as possible. I'm using a different Python library than Manders, and Audible has also changed a bit since that piece was published.
Manders' approach of getting paged results for each category view is working so far, but my HTTP request is not sorting the results by release date. Here is my code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.audible.com/search?pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&feature_six_browse-bin=9178177011&pageSize=50'
r = requests.get(base_url)
html = BeautifulSoup(r.text)

# get category list, and links
cat_tuples = []
for cat in html.find('div', {'class': 'categories'}).find_all('li', {'class': 'bc-list-item'}):
    a = cat.find('a')
    mytuple = (a.text, 'https://audible.com' + a['href'] + '&sort=pubdate-desc-rank')
    cat_tuples.append(mytuple)

# each tuple has a format like this ... ('Arts & Entertainment',
# 'https://audible.com/search?feature_six_browse-bin=9178177011&node=2226646011&pageSize=50&pf_rd_p=7fe4387b-4762-42a8-8d9a-a63254c74bb2&pf_rd_r=C7ENYKDADHMCH4KY12D4&ref=a_search_l1_feature_five_browse-bin_6&sort=pubdate-desc-rank')

# request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
# results should start with '2Pac in the Studio' but instead it's 'Can't Hurt Me: Master Your Mind and Defy the Odds'
Adding sort=pubdate-desc-rank to the request URL appears to work in Chrome, but not with Python. I have tried changing the User-Agent in my code as well, but that didn't work.
Note: I would describe Audible.com as generally unfriendly to scraping, but I don't see a clear prohibition against it. My interest is purely informational, and I do not seek to profit from gathering these results.
I took a fresh look at my code this morning and discovered that the solution to this one is a silly coding error on my part. I'm leaving it up in case anyone else has a similar issue. These lines of code:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r.text)
Should be as follows:
#request first page of first category
r_page = requests.get(cat_tuples[0][1])
html_page = BeautifulSoup(r_page.text)
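As a quick sanity check that the sort took effect, one could print the first result's title; a hedged sketch (the tag and class here are guesses about Audible's current result markup, so inspect the page and adjust):

first = html_page.find('h3', {'class': 'bc-heading'})  # hypothetical selector
if first:
    print(first.get_text(strip=True))  # should now be the newest release in the category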

Python - How would I go about getting a block of text from an HTML document

https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html
This is the page I am trying to parse. It's from a government site, and in my experience those are not known for keeping their certificates up to date, so your browser will probably warn you that it isn't safe. All I want is this part: http://imgur.com/a/BL14W.
Edit: Sorry for the lack of information. I started asking this question, then got called away at work. It's no excuse, but when I came back it was time to go home, so I just kind of hit submit.
I have already tried doing it more "manually", but apparently not all of the documents came out exactly the same. Here is what I tried:
def table_parser(page):
    file = open(page)
    table = []
    num = 0
    for line in file:
        if 'Grade' in line:
            num += 1
        if num > 0:
            num += 1
        if 3 <= num < 21:
            line = line.rstrip()
            if line != '':
                split_line = line.split(' ')
                split_line = [x for x in split_line if x != '']
                strip_line = split_line[:16]
                table.append(strip_line)
    WG = []
    WL = []
    WS = []
    for l in table:
        WG.append((l[1:6]))
        WL.append(l[6:11])
        WS.append(l[11:16])
    file.close()
    # Return 3 lists for the 3 charts I want
    return WG, WL, WS
This is what I used; it got about half of the 65k files I started with mostly right. I passed the returned lists into CSV writers to store them until I can get them all cleaned up. I know there is probably a better way, but I came up with this before I could wrap my head around BeautifulSoup. I don't necessarily want the code written for me, just pointers on where to start. I tried to find documentation on BeautifulSoup but couldn't figure out where to start for what I need.
Your question is a little vague, so I'll try my best to help you.
1. Install Beautiful Soup 4
To get a block of text from a webpage, you will need the external library BeautifulSoup4 (BS4). Once it's downloaded and installed, first import BS4 using from bs4 import BeautifulSoup and import urllib.request. Then simply set up BS4 using soup = BeautifulSoup("", "html.parser").
2. Download Webpage
Downloading a webpage is simple: just use site_download = urllib.request.urlopen(url). In your case, simply replace url with the URL you provided here. Then we need to read what we've downloaded using site_read = site_download.read().decode('utf-8'), followed by soup = BeautifulSoup(site_read, "html.parser").
3. Get Block of Text
You can get text in many different ways, so I'll show you a few examples.
To get the first instance of <p> tag (paragraph) text:
text = soup.find("p")
text = text.getText()
To get all instances of the <p> tag:
paragraphs = soup.findAll("p")
text = " ".join(p.getText() for p in paragraphs)
To get text from a specific class:
text = soup.find(attrs={"class": "class_name_here"})
text = text.getText()
4. Further Info
More information on how to get different types of tags and other things you can do with BS4 can be found in the BeautifulSoup documentation.
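Putting steps 1-3 together for the page in the question, a minimal end-to-end sketch (the unverified SSL context is only there because of the certificate problems mentioned above; drop it if the certificate checks out):

import ssl
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html'
ctx = ssl._create_unverified_context()  # the site's certificate may be invalid
site_read = urllib.request.urlopen(url, context=ctx).read().decode('utf-8')
soup = BeautifulSoup(site_read, "html.parser")

# print every paragraph of text in the document
for p in soup.find_all("p"):
    print(p.getText())

From there you would narrow the search to whichever tag actually contains the pay table, which you can identify by inspecting the page source.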

Checking ALL links within links from a source HTML, Python

My code is meant to take a link passed at the command prompt, get the HTML code of the webpage at that link, search that HTML code for links on the webpage, and then repeat these steps for the links found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max number of visits it can do is 100.
If a website has an error, a None value is returned.
Python 3 is what I am using.
e.g.:
s = readwebpage(url)... # This line of code gets the HTML code for the link (url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html. My code reads all of these, but it does not visit these links individually to search for more links. If it did, it should search through these links and find a link that ends in p10.html, and then report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and that's giving me a hard time.
My code:
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0

while (url_list and AmountVisited < maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http", url)  # Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s)  # Creates a list of all links in HTML code starting with http or https
        while urls_list:
            insert = urls_list.pop()
            while (insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
        checkedURLs = insert
Please help :)
Here is the code you wanted. However, please stop using regexes to parse HTML; BeautifulSoup is the way to go for that.
import re
from urllib.request import urlopen

def readwebpage(url):
    print("testing ", url)
    return urlopen(url).read().decode('utf-8')

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while (yet_to_visit and AmountVisited < maxhits):
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html == None:
        print("* bad reference to http", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # Creates a list of every href value in the HTML code
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
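Since the advice above is to prefer BeautifulSoup over regexes, the extraction step could equally be written as follows (a sketch, assuming bs4 is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]  # every href on the page

This also handles single-quoted and unquoted attributes, which the lookbehind regex misses.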
Not Python, but since you mentioned you aren't tied strictly to regex, I think you might find wget useful for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursive depth to 10, meaning wget will only go as deep as 10 levels in, this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".
I'd change the regex to urls_list = re.findall(r'href="(.*)"', s), also known as "match anything in quotes, after href=". If you absolutely need to ensure http[s]://, use r'href="(https?://.*)"' (s? means one or zero s).
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, while it's not technically necessary in your case because re caches patterns, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
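A short sketch of that final pattern with re.compile (s is the HTML string from the question; the variable names are just illustrative):

import re

link_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')
urls_list = [m.group(2) for m in link_re.finditer(s)]  # group 2 is the URL between the quotes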

How do I know when I'm done crawling a domain?

I've written a function in Python that gets all the links on a page.
Then, I run that function for all of the links that first function returned.
My question is, if I were to keep on doing this using CNN as my starting point, how would I know when I had crawled all (or most) of CNN's webpages?
Here's the code for the crawler.
from mechanize import Browser  # assumed import: the br.* calls below match mechanize's API

base_url = "http://www.cnn.com"
title = "cnn"
my_file = open(title + ".txt", "w")

def crawl(site):
    seed_url = site
    br = Browser()
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.open(seed_url)
    link_bank = []
    for link in br.links():
        if link.url[0:4] == "http":
            link_bank.append(link.url)
        if link.url[0] == "/":
            url = link.url
            if url.find(".com") == -1:
                if url.find(".org") == -1:
                    link_bank.append(base_url + link.url)
                else:
                    link_bank.append(link.url)
            else:
                link_bank.append(link.url)
        if link.url[0] == "#":
            link_bank.append(base_url + link.url)
    link_bank = list(set(link_bank))
    for link in link_bank:
        my_file.write(link + "\n")
    return link_bank

my_file.close()
I did not look into your code specifically, but you should look up how to implement a breadth-first search, and additionally store already-visited URLs in a set. If you find a new URL on the currently visited page and it isn't in the set already, append it to the list of URLs to visit.
You might need to ignore the query string (everything after the question mark in a URL).
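A minimal sketch of that idea (the get_links() helper is hypothetical and stands in for the "get all the links on a page" function the question describes):

from collections import deque

def bfs_crawl(start_url, get_links, max_pages=100000):
    # Breadth-first crawl: visit pages level by level, remembering seen URLs in a set.
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        for link in get_links(page):
            link = link.split('?', 1)[0]  # ignore the query string, as suggested above
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

The crawl is finished when the queue empties: every reachable page has been visited and yielded no unseen URLs.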
The first thing that comes to mind is to keep a set of visited links. Each time you request a link, add it to the set; before requesting a link, check that it isn't in the set already.
Another point is that you are actually reinventing the wheel here: the Scrapy web-scraping framework has a link-extraction mechanism built in, and it's worth using.
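For illustration, a minimal Scrapy spider sketch (the names are placeholders; Scrapy's built-in duplicate filter takes care of the visited set for you):

import scrapy

class CnnSpider(scrapy.Spider):
    name = "cnn"
    start_urls = ["http://www.cnn.com"]
    allowed_domains = ["cnn.com"]  # keeps the crawl on the one domain

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # response.follow resolves relative URLs; already-seen
            # requests are dropped by the scheduler's dupefilter
            yield response.follow(href, callback=self.parse)

Run it with scrapy runspider cnn_spider.py. The crawl is "done" when the scheduler runs out of new, unseen URLs within the allowed domain.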
Hope that helps.
