I am new to Python and I am trying to pull XML files from a website and load them into a database. I have been using the Beautiful Soup module in Python, but I cannot pull in the specific XML file that I want.
In the website source code it looks as follows:
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
The following shows the code I have in Python. This brings back everything with an 'href' attribute, whereas I want to filter the files on the 'Report I want name dddddddd'. I have tried regular expressions such as 'href=\s\w+', but to no avail as it returns None. Any help is appreciated.
from bs4 import BeautifulSoup
import urllib
import re

webpage = "http://www.example.com"
response = urllib.urlopen(webpage).read()
soup = BeautifulSoup(response)
for link in soup.find_all('a'):
    print(link.get('href'))
When I use find_all('href') it pulls back the entire string, whereas I want to filter out just the XML part. I have tried variations such as find_all('href\MarketReports') and find_all('href\w+'), but these return None when I run the code.
I'm not entirely clear on exactly what you're looking for, but if I understand correctly, you only want to get ReportName.XML, in which case it would be:
find('a').text
If you're looking for "/MarketReports/ReportName.XML", then it would be:
find('a').attrs['href']
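If the goal is to keep only the links whose file names match a pattern, BeautifulSoup also accepts a compiled regex as an attribute filter. A minimal sketch, assuming the report names follow the 'Report I want' pattern mentioned in the question:

import re
import urllib
from bs4 import BeautifulSoup

response = urllib.urlopen("http://www.example.com").read()
soup = BeautifulSoup(response)

# href=re.compile(...) keeps only the <a> tags whose href matches the pattern
for link in soup.find_all('a', href=re.compile(r"Report I want\w+\.XML")):
    print(link.get('href'))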
I used the following code and it was able to find the reports as I needed them. The Google presentation below was a great help, along with jdotjdot's input:
http://www.youtube.com/watch?v=kWyoYtvJpe4
The code that I used to find my XML was:
import re
import urllib

webpage = "http://www.example.com"
response = urllib.urlopen(webpage).read()
# match only the report files whose names start with "Report I want"
print re.findall(r"Report I want\w+[.]XML", response)
For a school course we are learning advanced Python, to get a first idea about web scraping and similar things. I got an exercise where I have to extract the values v1 and v2 from the following line of an HTML page. I tried looking this up but couldn't find anything really specific. If it is inappropriate for SO, just delete it.
The HTML part:
{"v1":"first","ex":"first_soup","foo":"0","doo":"0","v1":["second"]}
so afterwards, when I want to show the values, it should look like:
print(v1)
first
print(v2)
second
I tried to get the values just by slicing the whole line like this:
v1 = htmltext[7:12]
v2 = htmltext[60:66]
but in this case I am not using the bs4 module, which is the recommended approach. I would be very grateful if someone could show me how.
What you are seeing there is not HTML but JSON. In this case it makes no sense to use BeautifulSoup's HTML parser; you may want to use the standard json library instead, like so:
import json
json_Dict=json.loads(str(soup))
Then you can index it using the headers (or keys):
json_Dict["v1"]
>>>"first"
I decided to view a website's source code and chose a class, "expanded". I wanted to print out all of its contents with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.quora.com/How-can-I-write-a-bot-using-Python")
soup = BeautifulSoup(page.content, 'html.parser')
print soup.find_all(class_='expanded')
but it simply prints out:
[]
Please help me detect what's wrong.
I already saw this thread and tried following what the answer said but it did not help me since this error appears in the terminal:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
When searching for a class value, you should pass it in like this:
soup.find_all(attrs={"class":"expanded"})
That being said, I don't see anything in the source code of that site with a class called "expanded". The closest thing I could find was class='ui_qtext_expanded'. If that is what you are trying to find, you need to include the whole string.
soup.find_all(attrs={"class":"ui_qtext_expanded"})
Venturing into the world of Python. I've done the Codecademy course and trawled through Stack and YouTube, but I'm hitting an issue I can't solve.
I'm attempting a simple print of a table located on Wikipedia. Failing miserably at writing my own code, I decided to use a tutorial example and build off it. However, this isn't working and I haven't the foggiest idea why.
This is the code, with the appropriate link included. My end result is an empty list "[ ]". I'm using PyCharm 2017.2, beautifulsoup 4.6.0, requests 2.18.4 and Python 3.6.2. Any advice appreciated. For reference, the tutorial website is here.
import requests
from bs4 import BeautifulSoup
WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["sortable", "plainrowheaders"]}
wikitables = soup.findAll("table", table_classes)
print(wikitables)
You can accomplish that using regular expressions.
You get the site content from requests.get(WIKI_URL).content.
See the source code of the site to see how Wikipedia presents tables in HTML.
Find a regular expression that can match a whole table (something like <table>(?P<table>.*?)</table>). What this does is capture everything between the <table> and </table> tokens. The Python re module documentation is a good reference; take a look at re.findall().
Now you are left with the table data. You can use regular expressions again to get the data for each row, then a regex on each row to get the columns. re.findall() is the key again.
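A rough sketch of that approach (a fragile technique: non-greedy matches break on nested tables, and the BeautifulSoup route is generally safer):

import re
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_volcanoes_by_elevation"
html = requests.get(WIKI_URL).text

# re.DOTALL lets '.' cross newlines so one match can span a whole table
for table in re.findall(r"<table[^>]*>(.*?)</table>", html, re.DOTALL):
    for row in re.findall(r"<tr[^>]*>(.*?)</tr>", table, re.DOTALL):
        # <td> or <th> cells; cell contents still include markup
        cells = re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", row, re.DOTALL)
        print(cells)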
I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?
from bs4 import BeautifulSoup, CData
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)
The problem appears to be that the default parser doesn't parse CDATA properly. If you specify the correct parser, the CDATA shows up:
soup = BeautifulSoup(txt,'html.parser')
For more information on parsers, see the docs.
I got onto this by using the diagnose function, which the docs recommend:
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Using the diagnose() function gives you output of how the different parsers see your html, which enables you to choose the right parser for your use case.
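For completeness, here is the corrected snippet from the question with the parser specified (expected output shown in the comment; other parsers such as lxml may handle the CDATA section differently):

from bs4 import BeautifulSoup, CData

txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''

# html.parser keeps the CDATA section as a CData node
soup = BeautifulSoup(txt, 'html.parser')
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)   # CData contents: 'some data here'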
I need help extracting information from a webpage. I give the URL, and then I need to extract information like contact number, address, href, name of person, etc. I am able to extract the page source completely for a provided URL with known tags, but I need generic code to extract this data from any URL. I used a regex to extract emails, for example:
import urllib
import re

#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
pattern=re.compile(regex)
print pattern
while i<len(urls):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives me an empty list. Any help extracting all the info I described above would be highly appreciated.
The idea is to give a URL and then extract all information like name, phone number, email, address etc. in JSON or XML format. Thank you all in advance!
To start with, you need to fix your regex: in a plain Python string, '\b' is the backspace character, not the regex word boundary.
The easy way to fix this is to use a raw string r'' instead:
regex=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
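To see the difference (a small illustration, not from the original answer):

import re

print(len('\b'))    # 1 -- plain string: the single backspace character
print(len(r'\b'))   # 2 -- raw string: backslash + 'b', the regex word boundary

print(re.findall('\bfoo\b', 'foo bar'))    # [] -- looks for backspace+foo+backspace
print(re.findall(r'\bfoo\b', 'foo bar'))   # ['foo']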
Meanwhile, I have managed to get it working after some small modifications (beware that I am working with Python 3.4.2):

import urllib.request
import re

#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}'
pattern=re.compile(regex)
print(pattern)
while i<len(urls):
    htmlfile=urllib.request.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext.decode())
    print(titles)
    i+=1
The result is:
['townshipclerk@plainsboronj.com', 'acancro@plainsboronj.com', ...]
Good luck
I think you're on the wrong track here: you have an HTML file from which you are trying to extract information. You have started by filtering on the '@' sign to find e-mail addresses (hence your choice of regular expressions). However, other things like names and phone numbers are not recognisable by regular expressions alone, so another approach might be useful. The documentation at "https://docs.python.org/3/library/html.parser.html" explains how to parse HTML files; in my opinion this will be a better approach for solving your needs.
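Along those lines, a minimal sketch with html.parser (the mailto: heuristic is an illustrative assumption, not something from the original answer; it only finds addresses that the page exposes as links):

from html.parser import HTMLParser
import urllib.request

class ContactParser(HTMLParser):
    """Collects e-mail addresses from mailto: links in <a> tags."""
    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and value.startswith('mailto:'):
                    self.emails.append(value[len('mailto:'):])

html = urllib.request.urlopen("http://www.plainsboronj.com/content/departmental-directory").read().decode()
parser = ContactParser()
parser.feed(html)
print(parser.emails)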