How to see the output after getting the data in web scraping? - python

I wrote some code to scrape data from a site and got the result I want (the results are just numbers).
The question is: how can I show the results as output on my website? I am using Python and HTML in VS Code. Here is the scraping code:
import requests
from bs4 import BeautifulSoup
getpage= requests.get('https://www.worldometers.info/coronavirus/country/austria/')
getpage_soup= BeautifulSoup(getpage.text, 'html.parser')
Get_Total_Deaths_Recoverd_Cases= getpage_soup.findAll('div', {'class':'maincounter-number'})
for para in Get_Total_Deaths_Recoverd_Cases:
    print (para.text)
Get_Active_And_Closed= BeautifulSoup(getpage.text, 'html.parser')
All_Numbers2= Get_Active_And_Closed.findAll('div', {'class':'number-table-main'})
for para2 in All_Numbers2:
    print (para2.text)
and these are the results that I want to show on the website:
577,007
9,687
535,798
31,522
545,485

I'm not sure what to call this approach, but it works: build the HTML as a string in Python and write it to a file.
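import os
# keep the scraped values as strings so the thousands separators survive
lst = [
    '577,007', '9,687', '535,798', '31,522', '545,485'
]
p = ''
for num in lst:
    p += f'<p>{num}</p>\n'
html = f"""
<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
{p}
</body>
</html>
"""
with open('web.html', 'w') as file:
    file.write(html)
# os.startfile exists only on Windows; elsewhere open web.html manually
os.startfile('web.html')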
You don't have to use os.startfile (it is only available on Windows), but you get the idea: you now have a web.html file that you can display on your website. You can also start from a regular HTML document: either copy it into the html variable and put {p} wherever you want the values, or put a marker such as {%%} somewhere in the HTML file (similar in spirit to the template tags Django uses), then read the whole file, replace the marker with your value, and write the result back.
This works, but if you use a framework such as Django or Flask, there are easier ways to do it there.
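The marker-replacement idea can be sketched with string.Template from the standard library. This is only a sketch under assumptions: the names PAGE_TEMPLATE and render_page and the $title and $rows placeholders are made up for illustration, and in practice the template text could be read from an existing .html file instead of being defined inline.

```python
import html
from string import Template

# Hypothetical template text; it could equally be read from an .html file.
PAGE_TEMPLATE = Template("""<!doctype html>
<html>
<head><title>$title</title></head>
<body>
$rows
</body>
</html>
""")


def render_page(values, title="Test"):
    """Return the page HTML with one <p> element per value."""
    # Escape each value so characters such as < or & cannot break the markup.
    rows = "\n".join(f"<p>{html.escape(str(value))}</p>" for value in values)
    return PAGE_TEMPLATE.substitute(title=title, rows=rows)


if __name__ == "__main__":
    page = render_page(['577,007', '9,687', '535,798', '31,522', '545,485'])
    with open('web.html', 'w') as file:
        file.write(page)
```

Keeping the values as strings preserves the thousands separators exactly as they were scraped.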

Related

Extract string from <script> - BeautifulSoup python

I'm trying to write a Python script to extract some information from a webmail site, and I need to follow a redirection.
My code:
br1 = mechanize.Browser()
br1.set_handle_robots(False)
br1.set_cookiejar(cj)
br1.open("LOGIN URL")
br1.select_form(nr=0)
br1.form['username'] = mail_site
br1.form['password'] = pw_site
res1 = br1.submit()
html = res1.read()
print html
The result is not what I expect: it contains only a redirection script.
As far as I can tell, I have to extract information from this script to follow the redirection.
In my case, I have to extract the jsessionid from the script.
The script is:
<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>
If I'm not mistaken, I have to build a regex.
I've tried many things without success.
Does anyone have an idea?
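You can capture it with a regular expression:
import re
# script_ is assumed to hold the text of the <script> element shown above
get_jsession = re.search(r'jsessionid=([A-Za-z0-9.]+)', script_)
print(get_jsession.group(1))
# prints: 1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA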
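If the goal is to follow the redirection rather than only read the session id, a variation is to capture the whole target of location.replace and resolve it against the login URL. The following is a minimal sketch, assuming the script always has the form shown in the question; extract_redirect and REDIRECT_RE are illustrative names, and the base URL passed in would be whatever the login URL is.

```python
import re
from urllib.parse import urljoin

# Matches the first argument of location.replace('...') in the redirect script.
REDIRECT_RE = re.compile(r"location\.replace\('([^']+)'\)")


def extract_redirect(script_text, base_url):
    """Return (absolute_redirect_url, jsessionid), or (None, None) if absent."""
    match = REDIRECT_RE.search(script_text)
    if match is None:
        return None, None
    target = match.group(1)
    session = re.search(r"jsessionid=([A-Za-z0-9.]+)", target)
    session_id = session.group(1) if session else None
    # Resolve the relative target against the page that returned the script.
    return urljoin(base_url, target), session_id
```

The returned URL could then be opened with the same browser object so that the cookies set during login are reused.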

Parse and extract URLs inside HTML web content without using the BeautifulSoup or urllib libraries

I am new to Python, so apologies if my question is very basic. In my program, I need to parse an HTML page and extract all of the links inside it. Assume my page content is as below:
<html><head><title>Fakebook</title><style TYPE="text/css"><!--
#pagelist li { display: inline; padding-right: 10px; }
--></style></head><body><h1>testwebapp</h1><p>Home</p><hr/><h1>Welcome to testwebapp</h1><p>Random URLs!</p><ul><li>Rennie Tach</li><li>Pid Ko</li><li>Ler She</li><li>iti Sar</li><li><a </ul>
<p>Page 1 of 2
<ul id="pagelist"><li>
1
</li><li>2</li><li>next</li><li>last</li></ul></p>
</body></html>
Now I need to parse this content and extract all of the links in it. In other words, I need the content below to be extracted from the page:
/testwebapp/847945358/
/testwebapp/848854776/
/testwebapp/850558104/
/testwebapp/851635068/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
I searched a lot about parsing web pages with Python, but most of the answers I found use libraries such as urllib, urllib2, BeautifulSoup, or requests, which I cannot use in my program because my application will run on a machine where these libraries are not installed. So I need to parse the web content manually. My idea was to save the page content in a string, split the string on spaces into an array of strings, then check each item and, if it contains /testwebapp/ or the fri keyword, save it in an array. But when I use the command below to split the string containing my page content, I get an error:
arrayofwords_fromwebpage = (webcontent_saved_in_a_string).split(" ")
and the error is:
TypeError: a bytes-like object is required, not 'str'
Is there a quick and efficient way to parse and extract the links inside an HTML page without using any library such as urllib, urllib2, or BeautifulSoup?
If all you need is to find every URL using only Python, this function will help:
def search(html):
    HREF = 'a href="'
    res = []
    s, e = 0, 0
    while True:
        s = html.find(HREF, e)
        if s == -1:
            break
        # the URL ends at the next double quote, even if other attributes follow
        e = html.find('"', s + len(HREF))
        res.append(html[s+len(HREF):e])
    return res
You can use something from the standard library, namely HTMLParser.
I subclass it for your purpose by watching for 'a' tags. When the parser encounters one it looks for the 'href' attribute and, if it's present, it prints its value.
To execute it, I instantiate the subclass, then give its feed method the HTML that you presented in your question.
You can see the results at the end of this answer.
>>> from html.parser import HTMLParser
>>> class SharoozHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a':
...             attrs = {k: v for (k, v) in attrs}
...             if 'href' in attrs:
...                 print (attrs['href'])
...
>>> parser = SharoozHTMLParser()
>>> parser.feed(open('temp.htm').read())
/testwebapp/
/testwebapp/847945358/
/testwebapp/848854776/
/testwebapp/850558104/
/testwebapp/851635068/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
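A variation on the HTMLParser approach collects the links in a list instead of printing them. It also decodes the page first when it arrives as bytes, which is presumably the cause of the TypeError in the question (calling split(" ") with a str separator on a bytes object). This is a sketch only: LinkCollector and extract_links are illustrative names, and UTF-8 is an assumed encoding.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


def extract_links(page):
    """Return all link targets in page, which may be bytes or str."""
    if isinstance(page, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        page = page.decode('utf-8', errors='replace')
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links
```

The result can then be filtered with an ordinary list comprehension, for example keeping only the entries that start with /testwebapp/.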

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I encountered is that a large part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
For example, I want to extract the first two meanings of a given word, so for this URL: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse((urllib2.urlopen(url)))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on. The only way I have been able to crawl it is with curl, but that is sloppy; I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests
url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
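For the first suggestion (sending a User-Agent header without switching libraries), a Python 3 sketch with urllib.request could look like the following. The USER_AGENT value and the helper names build_request and fetch are placeholders for illustration, and whether the site accepts a particular User-Agent string is not something this sketch can guarantee.

```python
import urllib.request

# Hypothetical User-Agent string; any browser-like value could be substituted.
USER_AGENT = 'Mozilla/5.0 (compatible; example-crawler)'


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch(url):
    """Download url and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

The bytes returned by fetch can be passed to lh.fromstring in the same way as response.content in the requests version.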

Getting the title using urllib

I am supposed to write code that goes to a web site and gets its title. Here is the code I have:
import urllib.request
def findTitle(url):
    urllib.request.Request(url)
    #open url
    urllib.request.urlopen(url)
    urllib.request.urlopen(url).read().decode('utf-8')
    #set same variable equal to the end of <title> tag
    endTitlePos = url.find("<title>")
    #set variable equal to starting position of <title> tag
    startTitlePos = url.find("<title>", endTitlePos)
    startTitlePos += len("<title>")
    #set new variable equal to </title>
    TitleContent=url.find("</title>",startTitlePos)
    #return slice of output between the two variables
    title = url[startTitlePos:endTitlePos]
    content_list=[]
    content_list.append(title)
    return content_list
def main():
    url="https://google.com/search"
    print(findTitle(url))
main()
We are using Google as an example. It is supposed to print just "google", but currently it prints "['//google.com/searc']". I am curious what I am missing here. It seems very simple, but I don't know why it prints the URL rather than the title. Also, how do I turn the result from a list into a string?
There are several ways to get data from web pages. The best is to use BeautifulSoup, but in your case the string split() method works well:
import urllib.request
def findTitle(url):
    # decode the bytes returned by read() so that we split real text
    webpage = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    title = webpage.split('<title>')[1].split('</title>')[0]
    return title
>>> print(findTitle('http://www.google.com'))
Google
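As for why the original code printed part of the URL: every find() call and the final slice were applied to the url string, not to the downloaded page, and the decoded page was never stored in a variable. Both find() calls therefore returned -1, so the slice became url[6:-1]. A corrected version of the same find-and-slice idea is sketched below; title_from_html is an illustrative name, and it takes the page text as input so the logic can be tried without a network connection. It returns a string directly, which also answers the list-versus-string part of the question.

```python
def title_from_html(html):
    """Return the text between <title> and </title>, or None if absent."""
    start = html.find('<title>')
    if start == -1:
        return None
    # Move past the opening tag so the slice starts at the title text.
    start += len('<title>')
    end = html.find('</title>', start)
    if end == -1:
        return None
    return html[start:end].strip()
```

To use it on a live page, pass it the decoded result of urllib.request.urlopen(url).read().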

How do I print a line following a line containing certain text in a saved file in Python?

I have written a Python program to find the carrier of a cell phone given the number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to look up) and saves it as carrier.html. In the source, the carrier is on the line after the [div class="carrier_result"] tag. (Read [ and ] as < and >; Stack Overflow thought I was trying to format using HTML and would not display it.)
My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC
What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.
Sample code:
import urllib2, BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()
You should be using an HTML parser such as BeautifulSoup or lxml instead.
To get the next line, you can use:
htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        nextline = next(htmlsource)
        print(nextline)
A "better" way is to split on </div> and then pick out the part you want, because sometimes the content you want is on the same line as the tag, in which case using next() gives the wrong result. For example:
data=open("carrier.html").read().split("</div>")
for item in data:
    if '<div class="carrier_result">' in item:
        print(item.split('<div class="carrier_result">')[-1].strip())
By the way, if possible, try to use Python's own web modules, such as urllib or urllib2, instead of calling an external wget.
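The "line after the marker" logic can also be wrapped in a small function that accepts any iterable of lines, which makes it easy to try without the saved file. This is a sketch: line_after is an illustrative name, and like the loop above it assumes the carrier really is on the line following the tag.

```python
def line_after(lines, marker):
    """Return the stripped line following the first line containing marker."""
    iterator = iter(lines)
    for line in iterator:
        if marker in line:
            # next() with a default avoids StopIteration on the last line.
            following = next(iterator, None)
            return following.strip() if following is not None else None
    return None


if __name__ == "__main__":
    with open('carrier.html') as htmlsource:
        print(line_after(htmlsource, '<div class="carrier_result">'))
```

It returns None both when the marker is missing and when the marker is on the last line, so the caller can tell that nothing was found.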
