For a school course we are learning advanced Python, to get a first idea about web scraping and similar topics. I got an exercise where I have to extract the values v1 and v2 from the following line of an HTML page. I tried looking it up but couldn't find anything really specific. If it is inappropriate for SO, just delete it.
The HTML part
{"v1":"first","ex":"first_soup","foo":"0","doo":"0","v2":["second"]}
So afterwards, when I want to show the values, it should look like this:
print(v1)
first
print(v2)
second
I tried to get the values just by slicing the whole line like this:
v1=htmltext[7:12]
v2=htmltext[60:66]
but in this case I am not using the bs4 module, which is the recommended way. I would be very grateful if someone could teach me...
What you are seeing there is not HTML but JSON. In this case it makes no sense to use BeautifulSoup's HTML parser; you may want to use the standard json library instead, like so:
import json
json_Dict=json.loads(str(soup))
Then you can index it using its keys:
json_Dict["v1"]
>>>"first"
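Note that in the sample line the second value is stored inside a list, so it needs one extra indexing step. A minimal sketch, assuming the raw text is exactly the JSON string from the question (with the second key taken to be v2, matching the expected output):

```python
import json

# The raw line as it appears in the page source (assumed verbatim)
raw = '{"v1":"first","ex":"first_soup","foo":"0","doo":"0","v2":["second"]}'

data = json.loads(raw)

v1 = data["v1"]     # a plain string
v2 = data["v2"][0]  # the value is a list, so take its first element

print(v1)  # first
print(v2)  # second
```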
So with requests and lxml I have been trying to create a small API that, given certain parameters, would download a timetable from a certain website (this one). The thing is, I am a complete newbie at stuff like this, and aside from the hours I can't seem to get anything else.
I've been messing around with XPath code, but mostly what I get is a simple []. I've been trying to get the first line of classes that corresponds to the first line of hours (8.00-8.30), which should probably appear as something like [,,,Introdução à Gestão,].
import requests
from lxml import html

page = requests.get('https://fenix.iscte-iul.pt/publico/siteViewer.do?method=roomViewer&roomName=2E04&objectCode=4787574275047425&executionPeriodOID=4787574275047425&selectedDay=1542067200000&contentContextPath_PATH=/estudante/consultar/horario&_request_checksum_=ae083a3cc967c40242304d1f720ad730dcb426cd')
tree = html.fromstring(page.content)
class_block_one = tree.xpath('//table[@class="timetable"]/tbody/tr[1]/td[@class=*]/a/abbr//text()')
print(class_block_one)
To get the required text from the first (actually second) row, you can try the XPath below:
'//table[@class="timetable"]//tr[2]/td/a/abbr//text()'
You can also get the values from all rows:
for row in tree.xpath('//table[@class="timetable"]//tr'):
    print(row.xpath('./td/a/abbr//text()'))
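As a self-contained illustration of that row-by-row pattern (using a made-up minimal timetable instead of the real Fenix page):

```python
import lxml.html

# A made-up two-row timetable standing in for the real page content
page = """
<table class="timetable">
  <tr><td><a><abbr>8.00-8.30</abbr></a></td></tr>
  <tr><td><a><abbr>Introdução à Gestão</abbr></a></td></tr>
</table>
"""

tree = lxml.html.fromstring(page)
for row in tree.xpath('//table[@class="timetable"]//tr'):
    # Each row yields a list of the abbreviation texts in its cells
    print(row.xpath('./td/a/abbr//text()'))
```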
I need help extracting information from a webpage. I give the URL and then need to extract information like contact number, address, href, name of person, etc. I am able to extract the page source completely for a given URL with known tags, but I need generic code to extract this data from any URL. I used regex to extract emails, for example:
import urllib
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
pattern=re.compile(regex)
print pattern
while i<len(urls):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives me an empty list. Any help to extract all the info as described above would be highly appreciated.
The idea is to give a URL and then extract all information like name, phone number, email, address, etc. in JSON or XML format. Thank you all in advance!
To start with, you need to fix your regex: \ needs to be escaped in Python strings, and the easy way to do that is to use a raw string r'' instead:
regex=r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
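To see the difference the raw string makes, here is a small sketch with made-up sample text: without the r prefix, "\b" is parsed as a backspace character instead of a word boundary, so the pattern never matches.

```python
import re

text = "Contact us at townshipclerk@example.com or admin@example.org."

# Without the raw prefix, "\b" becomes the backspace character \x08
broken = re.findall('\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b', text)

# With r'' the \b word boundaries reach the regex engine intact
fixed = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b', text)

print(broken)  # []
print(fixed)   # ['townshipclerk@example.com', 'admin@example.org']
```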
Meanwhile, I have managed to get it working after some small modifications (beware that I am working with Python 3.4.2):
import urllib.request
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}'
pattern=re.compile(regex)
print(pattern)
while i<len(urls):
    htmlfile=urllib.request.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext.decode())
    print(titles)
    i+=1
The result is:
['townshipclerk@plainsboronj.com', 'acancro@plainsboronj.com', ...]
Good luck
I think you're on the wrong track here: you have an HTML file from which you are trying to extract information. You started by filtering on the '@' sign to find e-mail addresses (hence your choice of regular expressions). However, other things like names, phone numbers, and so on are not recognisable with regular expressions, so another approach might be useful. At https://docs.python.org/3/library/html.parser.html there is an explanation of how to parse HTML files; in my opinion this is a better approach for your needs.
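For example, a minimal html.parser sketch that pulls link targets out of markup (the sample HTML and the mailto address are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p>Call us: <a href="mailto:clerk@example.com">Clerk</a></p>')
print(parser.links)  # ['mailto:clerk@example.com']
```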
I'm trying to use Beautiful Soup to isolate a specific <table> element and put it in a new file. The table has an id, ModelTable, and I can find it using soup.select("#ModelTable") ("soup" being the imported file).
However, I'm having trouble figuring out how to get the element into a new file. Simply writing it to a new file (as in write(soup.select("#ModelTable"))) doesn't work, as it's not a string object, and converting it with str() results in a string enclosed in brackets.
Ideally I'd like to be able to export the isolated element after running it through .prettify() so that I can get a good HTML file right off the bat. I know I must be missing something obvious... any hints?
You need to iterate over the contents of the returned object. Your question also taught me that BS4's .select uses CSS selectors, which is fantastic.
with open('file_output.html', 'w') as f:
    for tag in soup.select("#ModelTable"):
        f.write(tag.prettify())
I am trying to parse a website for
blahblahblah
<a href="THIS IS WHAT I WANT">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah
(there are many of these, and I want all of them in some tokenized form). Unfortunately the HTML is very large and a little complicated, so trying to crawl down the tree might take me some time to just sort out the nested elements. Is there an easy way to just retrieve this?
Thanks!
If you just want the href values of the <a> tags, then use:
data = """blahblahblah
<a href="THIS IS WHAT I WANT">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""
import lxml.html
tree = lxml.html.fromstring(data)
print tree.xpath('//a/@href')
# ['THIS IS WHAT I WANT']
I am new to Python and I am trying to pull in XML files from a website and load them into a database. I have been using the Beautiful Soup module in Python, but I cannot pull in the specific XML file that I want.
In the website source code it looks as follows:
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
<a href="/MarketReports/ReportName.XML">ReportName.XML</a>
The following shows the code I have in Python. This brings back everything with an 'href' tag, whereas I want to filter the files on the 'Report I want name dddddddd'. I have tried using regular expressions such as 'href=\s\w+', for example, but to no avail, as it returns None. Any help is appreciated.
from bs4 import BeautifulSoup
import urllib
import re
webpage=("http://www.example.com")
response=urllib.urlopen(webpage).read()
soup=BeautifulSoup(response)
for link in soup.find_all('a'):
    print(link.get('href'))
When I use findall('href') it pulls back the entire string, but I want to filter just the XML aspect. I have tried variations of the code, such as findall('href\MarketReports') and findall('href\w+'), but this returns None when I run the code.
Any help is appreciated
I'm not entirely clear on exactly what you're looking for, but if I understand correctly, you only want to get ReportName.XML, in which case it would be:
find('a').text
If you're looking for "/MarketReports/ReportName.XML", then it would be:
find('a').attrs['href']
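A small self-contained sketch showing both accessors, using a made-up anchor mimicking the report links:

```python
from bs4 import BeautifulSoup

# A made-up anchor standing in for the report links on the real page
soup = BeautifulSoup(
    '<a href="/MarketReports/ReportName.XML">ReportName.XML</a>',
    "html.parser",
)

link = soup.find("a")
print(link.text)           # ReportName.XML
print(link.attrs["href"])  # /MarketReports/ReportName.XML
```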
I used the following code and it was able to find the reports as I needed them. The Google presentation was a great help, along with jdotjdot's input:
http://www.youtube.com/watch?v=kWyoYtvJpe4
The code that I used to find my XML was
import re
import urllib
webpage=("http://www.example.com")
response=urllib.urlopen(webpage).read()
print re.findall(r"Report I want\w+[.]XML",response)
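As a self-contained sketch of that filtering approach (the page text and report name below are made up, and this version uses Python 3 syntax):

```python
import re

# Made-up page source containing one matching report link among other markup
response = (
    '<a href="/MarketReports/Report I want20230101.XML">report</a>'
    '<a href="/MarketReports/Other.XML">other</a>'
)

# Keeps only filenames of the form "Report I want<word chars>.XML"
print(re.findall(r"Report I want\w+[.]XML", response))
# ['Report I want20230101.XML']
```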