Data Scraping / Regex Expression Error (Python)

I'm trying to scrape data from a table in a website. I can pull the data in, in the form of source code. But in my program, I get the error: TypeError: replace_with() takes exactly 2 arguments (3 given)
import urllib2
import bs4
import re

page_content = ""
for i in range(1,11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    page_content += page.read()

soup = bs4.BeautifulSoup(page_content)
tables = soup.find_all('tr')

file = open('crime_data.csv', 'w+')
for i in tables:
    i = i.replace_with('</td>' , (',')) # this is where I get the error
    i = re.sub(r'<.?td[^>]*>','',i)
    file.write(i + '\n')
Why is it giving me that error?
Also, in essence, I'm trying to take the data from the table and put it into a csv file. Any and all help would be greatly appreciated!

That replace_with function does not do what you appear to want it to. The linked docs state that PageElement.replace_with() removes a tag or string from the tree, and replaces it with the tag or string of your choice.
From your code it looks more like you want to replace the whole end tag </td> with a , in an effort to get some sort of comma-separated data.
Perhaps you should instead just use the get_text method on your <td> elements, and format them from there:
for i in tables:
    file.write(i.get_text(',').strip() + '\n')
file.close() ####### <----- VERY IMPORTANT TO CLOSE FILES
Note
I tested your code out and you are not really scraping what you are after. I played around with it and came up with this:
import urllib2
import bs4

def scrape_crimes(html, write_headers):
    soup = bs4.BeautifulSoup(html)  # make the soup
    table = soup.find_all('table', class_=('cbResultSetTable',))  # search for the exact table you want; there are multiple nested tables on the pages you are scraping
    if len(table) > 0:  # if the table is found
        table = table[0]  # set the table to the first result
    else:
        return  # no table found, no use scraping
    with open('crime_data.csv', 'a') as f:  # open the file to append content
        trs = table.find_all('tr')  # get all the rows in the table
        if write_headers:  # if we request that headers are written
            for th in trs[0].find_all('th'):  # write each header followed by a comma
                f.write(th.get_text(strip=True).encode('utf-8') + ',')  # ensure the data is writable by calling encode
            f.write('\n')  # write a newline
        for tr in trs:  # for each table row in the table
            tds = tr.find_all('td')  # get all the td elements
            if len(tds) > 0:  # if there are td elements (not true for header rows)
                for td in tds:  # for each td element
                    f.write(td.get_text(strip=True).encode('utf-8') + ',')  # add the data followed by a comma
                f.write('\n')  # finish the row off with a newline

open('crime_data.csv', 'w').close()  # clear the file before running
for i in range(1,11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    scrape_crimes(page.read(), i == 1)  # hand the work off to a function; the second argument is only True the first time,
                                        # which ensures that you will get headers only at the top of your output file
I removed the use of the re library because, in general, regex and HTML do not play nicely together; the short explanation being: HTML is not a regular language.
I also switched from using this coding pattern:
file = open('file_name','w')
# do stuff
file.close()
to this preferred pattern:
with open('file_name','w') as f:
    # do stuff
In the first example it is common to forget to close the file, which you did forget in your provided code. The second pattern handles the close for you, so no worries there. Also, it is not good practice to give your variables the same names as Python built-ins (like file).
I changed your script's pattern from combining all of the pages' HTML to scraping each page one by one, because combining them first is not a good idea: you could run into memory issues if you were doing this with large pages. It is usually better to handle the data in chunks.
The next thing I did was look at the HTML of the page you were scraping. You were pulling all <tr> elements, but had you closely inspected the page, you would have seen that the table you are after is itself nested inside a <tr> of an outer table, giving you one big nasty block of text as a "result". Using bs4's optional class_ argument to ask for a specific class on the table element leads straight to the data you are after.
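To make the difference concrete, here is a minimal sketch with a toy nested-table page (the markup is made up; only the cbResultSetTable class name comes from the real page):

import bs4

html = "<table><tr><td><table class='cbResultSetTable'><tr><td>data</td></tr></table></td></tr></table>"  # nested tables, like the Caspio page
soup = bs4.BeautifulSoup(html)
print len(soup.find_all('tr'))                                # 2: rows from both the outer and inner table
print len(soup.find_all('table', class_='cbResultSetTable'))  # 1: only the table that actually holds the data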
The next thing I noticed was that the table headers would get pulled for every page, sprinkling your results with this redundant information. You would only want to pull this info the first time, so I added some logic for that.
I switched to using the .get_text method instead of the regex/replace_with combo you had because of the above explanations. The get_text method returns unicode, however, so I added the call to .encode('utf-8'), which ensures the data is writable. I also specified the strip=True argument to get rid of any pesky white-space characters on the data. The reasoning behind this: you load the whole bs4 library, why not use it? The good people who write that library spent a lot of time taking care of parsing the text so you don't have to waste time doing it.
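As a side note, if you also want proper quoting, the standard library's csv module can take over the commas and newlines for you. A sketch of the same write step under that assumption, reusing the trs variable from the function above:

import csv

with open('crime_data.csv', 'a') as f:
    writer = csv.writer(f)
    for tr in trs:
        tds = tr.find_all('td')
        if len(tds) > 0:
            # writerow adds the commas, any needed quoting, and the trailing newline
            writer.writerow([td.get_text(strip=True).encode('utf-8') for td in tds])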
Hope this was helpful! Happy scraping!

Related

Python- Regular Expression outputting last occurrence [HTML Scraping]

I'm web scraping from a local archive of techcrunch.com. I'm using regex to sort through and grab every heading for each article; however, my output only ever shows the last occurrence.
def extractNews():
    selection = listbox.curselection()
    if selection == (0,):
        # Read the webpage:
        response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
        html = response.read()
        match = findall((r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>'), str(html)) # use [-2] for position after )
        if match:
            for link, title in match:
                variable = "%s" % (title)
            print(variable)
and the current output is
Heetch raises $12 million to reboot its ridesharing service
which is the last heading of the entire webpage (the last occurrence). Each article block on the page consists of the same code for the heading:
<h2 class="post-title">Heetch raises $12 million to reboot its ridesharing service</h2>
I cannot see why it keeps returning this last match. I have run it through websites such as https://regex101.com/ and it tells me that I only have one match, which is not the one being output by my program. Any help would be greatly appreciated.
EDIT: If anyone is aware of a way to display each matched result SEPARATELY between different <h1></h1> tags when writing to a .html file, it would mean a lot :) I am not sure if this is right but I think you use [-#] for the position/match being referred to?
The regex is fine, but your problem is in the loop here.
if match:
    for link, title in match:
        variable = "%s" % (title)
Your variable is overwritten in each iteration. That's why you only see its value for the last iteration of the loop.
You could do something along these lines:
if match:
    variableList = []
    for link, title in match:
        variable = "%s" % (title)
        variableList.append(variable)
    print(variableList)
Also, generally, I would recommend against using regex to parse html (as per the famous answer).
If you haven't already familiarised yourself with BeautifulSoup, you should. Here is a non-regex solution using BeautifulSoup to dig out all h2 post-titles from your html page.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.findAll('h2', {'class':'post-title'})
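From there, pulling the titles (and links, if you want them) out of that result set is just a loop. A small sketch, reusing the soup object from above and assuming each heading wraps its text in an <a> tag as your regex expects:

for h2 in soup.findAll('h2', {'class': 'post-title'}):
    a = h2.find('a')                   # the anchor inside the heading, if there is one
    if a is not None:
        print(a.get_text(strip=True))  # the headline text
        print(a.get('href'))           # the link target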

Python - capture ALL tables from an HTML page

I have emails with embedded HTML tables, and I have code that uses BeautifulSoup to extract the tables and the data within them; my problem is that sometimes it only succeeds in capturing one table when there are more.
The code I normally run on these table is:
with open(file_path) as in_f:
    msg = email.message_from_file(in_f)
    html_msg = msg.get_payload(1)
    body = html_msg.get_payload(decode=True)
    html = body.decode()
    table = bs4.BeautifulSoup(html).find("table")
    data = [[cell.text.strip() for cell in row.find_all("td")] for row in table.find_all("tr")]
But for this email, and some others like it, I only successfully extract the first Package. I've tried changing one line to table = bs4.BeautifulSoup(html).find_all("table") but find_all doesn't work there.
I'm a novice when it comes to BeautifulSoup so any help would be appreciated, thanks.
I think I see what you are doing wrong;
if you do
table = bs4.BeautifulSoup(html).find("table")
it returns a Tag (i.e. one element). If instead you do
tables = bs4.BeautifulSoup(html).find_all("table")
it returns a ResultSet (basically a list of tables). So far so good! The problem comes in the next line, when you try to treat the ResultSet as if it were a single Tag:
... for row in tables.find_all("tr") # Can't do this!
tables is not a single element (which would have a .find_all method); it is a list of elements (which doesn't), hence the AttributeError. Instead, you have to iterate over each table, like so:
tables = bs4.BeautifulSoup(html).find_all("table")
data = []
for table in tables: # <-- extra level of iteration!
    for row in table.find_all("tr"):
        data.append([cell.text.strip() for cell in row.find_all("td")])
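If you prefer the one-liner style of your original code, a nested comprehension over both tables and rows does the same thing (just a sketch of the equivalent form):

data = [[cell.text.strip() for cell in row.find_all("td")]
        for table in tables
        for row in table.find_all("tr")]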
Hope that helps!

python lxml.html: pull preceding text in html docstring

I'm trying to identify a given <table> element based on the text that precedes it in the html document.
My current method is to stringify each html table element and search for its text index within the file text:
filing_text = request.urlopen(url).read()
# some text cleanup here to make lxml's output match the .read() content
ref_text = lxml.html.tostring(filing_text).upper().\
    replace(b" ", b"&NBSP;")

tbl_count = 0
for tbl in self.filing_tree.iterfind('.//table'):
    text_ind = ref_text.find(lxml.html.tostring(tbl).\
        upper().replace(b" ", b"&NBSP;"))
    start_text = lxml.html.tostring(tbl)[0:50]
    tbl_count += 1
    print('tbl: %s; position: %s; %s' % (tbl_count, text_ind, start_text))
Given the starting index of the table element, I can then search x characters preceding for text that may identify help to identify the table's content.
Two concerns with this approach:
Since the tag density (i.e., how much of the filing text is markup versus content) varies from url to url, it's hard to standardize my search range in the preceding text: 2500 characters of HTML may encompass 300 characters of actual content, or 2000.
Serializing and searching once per table element seems rather inefficient. It adds more overhead to a webscraping workflow than I'd like
Question: Is there a better way to do this? Is there an lxml method that can extract text content prior to a given element? I'm imagining something like itertext() that goes backwards from the element, recursively through the html docstring.
Use Beautiful Soup. Just a snippet to get you started:
>>> from bs4 import BeautifulSoup
>>> stupid_html = "<html><p> Hello </p><table> </table></html>"
>>> soup = BeautifulSoup(stupid_html)
>>> list_of_tables = soup.find_all("table")
>>> print( list_of_tables[0].previous )
Hello
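If one element of context is not enough, BeautifulSoup can also keep walking backwards through the document with find_all_previous. A small sketch (the limit of 5 text nodes is an arbitrary choice, adjust to taste):

>>> html = "<html><p>Quarterly results</p><p>All figures in USD</p><table></table></html>"
>>> soup = BeautifulSoup(html)
>>> table = soup.find("table")
>>> bits = table.find_all_previous(text=True, limit=5)   # preceding text nodes, nearest first
>>> print( " | ".join(b.strip() for b in reversed(bits) if b.strip()) )
Quarterly results | All figures in USD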

Unscriptable Int Error for String Slice

I'm writing a webscraper and I have a table full of links to .pdf files that I want to download, save, and later analyze. I was using beautiful soup and I had soup find all the links. They are normally beautiful soup tag objects, but I've turned them into strings. The string is actually a bunch of junk with the link text buried in the middle. I want to cut out that junk and just leave the link. Then I will turn these into a list and have python download them later. (My plan is for python to keep a list of the pdf link names to keep track of what it's downloaded and then it can name the files according to those link names or a portion thereof).
But the .pdfs come in variable name-lengths, e.g.:
I_am_the_first_file.pdf
And_I_am_the_seond_file.pdf
and as they exist in the table, they have a bunch of junk text:
a href = ://blah/blah/blah/I_am_the_first_file.pdf[plus other annotation stuff that gets into my string accidentally]
a href = ://blah/blah/blah/And_I_am_the_seond_file.pdf[plus other annotation stuff that gets into my string accidentally]
So I want to cut ("slice") the front part and the last part off of the string and just leave the string that points to my url (so what follows is the desired output for my program):
://blah/blah/blah/I_am_the_first_file.pdf
://blah/blah/blah/And_I_am_the_seond_file.pdf
As you can see, though, the second file has more characters in the string than the first. So I can't do:
string[9:40]
or whatever because that would work for the first file but not for the second.
So i'm trying to come up with a variable for the end of the string slice, like so:
string[9:x]
wherein x is the location in the string that ends in '.pdf' (and my thought was to use the string.index('.pdf') function to do this).
But this fails, because I get an error when trying to use a variable to do this
("TypeError: 'int' object is unsubscriptable")
Probably there's an easy answer and a better way to do this other than messing with strings, but you guys are way smarter than me and I figured you'd know straight off.
Here's my full code so far:
import urllib, urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id = 'searchResults')
# "search results" is just what the table I was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
    # this says find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
    if 'pdf' in pdf_link_string:
        # some links in the table are .html and I don't want those, I just want the pdfs.
        end_of_link = pdf_link_string.index('.pdf')
        # I want to know where the .pdf file extension ends because that's the end of the link, so I'll slice backward from there
        just_the_link = end_of_link[9:end_of_link]
        # here, the first 9 characters are junk "a href = yadda yadda yadda". So I'm setting a variable that starts just after that junk and goes to the .pdf (I realize that I will actually have to do .pdf + 3 or something to get to the end of the string, but this makes it easier for now).
        print just_the_link
        # I debug by print statement because I'm an amateur
The line (second from the bottom) that reads:
just_the_link = end_of_link[9:end_of_link]
returns an error (TypeError: 'int' object is unsubscriptable)
Also, the ":" should be a hypertext transfer protocol colon, but it won't let me post that because newbs can't post more than 2 links, so I took them out.
just_the_link = end_of_link[9:end_of_link]
This is your problem, just like the error message says. end_of_link is an integer -- the index of ".pdf" in pdf_link_string, which you calculated in the preceding line. So naturally you can't slice it. You want to slice pdf_link_string.
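In other words, something like this should behave the way you intended (a sketch; the 9 is the junk offset your own code already assumes, and the + 4 keeps the four characters of '.pdf'):

end_of_link = pdf_link_string.index('.pdf')         # where '.pdf' starts
just_the_link = pdf_link_string[9:end_of_link + 4]  # slice the string, not the integer
print just_the_link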
Sounds like a job for regular expressions:
import urllib, urllib2, re
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("mywebsite.com")
soup = BeautifulSoup(page)
table_with_my_pdf_links = soup.find('table', id = 'searchResults')
# "search results" is just what the table I was looking for happened to be called.
for pdf_link in table_with_my_pdf_links.findAll('a'):
    # this says find all the links and loop over them
    pdf_link_string = str(pdf_link)
    # turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)
    if 'pdf' in pdf_link_string:
        pdfURLPattern = re.compile(r"""://(\w+/)+\S+\.pdf""")
        pdfURLMatch = pdfURLPattern.search(pdf_link_string)
        # If there is no match then search() returns None, otherwise the whole group (group(0)) returns the URL of interest.
        if pdfURLMatch:
            print pdfURLMatch.group(0)
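A side note on the "better way to do this other than messing with strings" part of the question: since each pdf_link is already a BeautifulSoup tag, you can skip the string handling entirely and read its href attribute. A sketch, assuming the links carry ordinary href attributes:

for pdf_link in table_with_my_pdf_links.findAll('a'):
    href = pdf_link.get('href')          # the link target, or None if the tag has no href
    if href and href.endswith('.pdf'):
        print href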

Extract data from a website's list, without superfluous tags

Working code: Google dictionary lookup via python and beautiful soup -> simply execute and enter a word.
I've quite simply extracted the first definition from a specific list item. However to get plain data, I've had to split my data at the line break, and then strip it to remove the superfluous list tag.
My question is, is there a method to extract the data contained within a specific list without doing my above string manipulation - perhaps a function in beautiful soup that I have yet to see?
This is the relevant section of code:
# Retrieve HTML and parse with BeautifulSoup.
doc = userAgentSwitcher().open(queryURL).read()
soup = BeautifulSoup(doc)
# Extract the first list item -> and encode it.
definition = soup('li', limit=2)[0].encode('utf-8')
# Format the return as word:definition removing superfluous data.
print word + " : " + definition.split("<br />")[0].strip("<li>")
I think you are looking for findAll(text=True); this will extract the text from the tags:
definitions = soup('ul')[0].findAll(text=True)
This will return a list of all the text contents, broken at the tag boundaries.
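A minimal sketch of what that looks like, with a made-up snippet standing in for the Google result:

from BeautifulSoup import BeautifulSoup  # or from bs4, whichever you are already using

doc = "<ul><li>noun</li><li>a word or phrase<br />used as an example</li></ul>"
soup = BeautifulSoup(doc)
print soup('ul')[0].findAll(text=True)
# prints something like: [u'noun', u'a word or phrase', u'used as an example']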
