How to parse the title from sites using Python?

I am new to Python and cannot work out how to implement the following task.
There is a txt file with about 10,000 domains stored in upper case. I need to:
- convert the domain names to lower case
- add the string 'http://' at the beginning of each domain so that it can be passed to requests
- make a loop so that the parser collects the title from each domain (site)
- write everything to a file as a table with two fields | site url | site title |
Here is what I have so far:
import requests
from bs4 import BeautifulSoup as bs

f = open(r'file.txt', 'r+')
a = []
for i in f:
    a.append(i.lower().replace('\n', ''))
    a[-1] = 'http://' + a[-1]
f.close()

title_list = []
for url in a:
    try:
        r = requests.get(url)
        page = bs(r.content, 'html.parser')
        title = page.find('title')
        title_list.append(url)
        title_list.append(title.text.replace('\n', ''))
    except Exception as e:
        print(e)
print(title_list)
I don't know how to implement checking for server errors.

You can do something similar to this.
import urllib2
from BeautifulSoup import BeautifulSoup

file = open('urllist.txt', 'r')
urlList = file.readlines()
file.close()

titles = []
for url in urlList:
    # strip the trailing newline left by readlines() before building the request URL
    soup = BeautifulSoup(urllib2.urlopen('https://' + url.lower().strip()))
    titles.append(soup.title.string)
Note:
- 'urllist.txt' is the file containing the URLs
- titles will contain the list of website titles
Hope this helps :)
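Since the question also asks how to check for server errors and how to end up with a two-column | site url | site title | table, here is a minimal Python 3 sketch using requests. raise_for_status() turns 4xx/5xx responses into exceptions and the timeout guards against hanging connections; the output file name 'titles.csv' is just an assumption.

import csv
import requests
from bs4 import BeautifulSoup

with open('file.txt') as f:
    urls = ['http://' + line.strip().lower() for line in f if line.strip()]

# 'titles.csv' is an assumed output name; two columns: site url, site title
with open('titles.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['site url', 'site title'])
    for url in urls:
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
        except requests.RequestException as e:
            print(url, e)
            continue
        soup = BeautifulSoup(r.content, 'html.parser')
        title = soup.title.get_text(strip=True) if soup.title else ''
        writer.writerow([url, title])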

Related

Script to extract links from page and check the domain

I'm trying to write a script that iterates through a list of web pages, extracts the links from each page and checks each link to see if they are in a given set of domains. I have the script set up to write two files - pages with links in the given domains are written to one file while the rest are written to the other. I'm essentially trying to sort the pages based on the links in the pages. Below is my script, but it doesn't look right. I'd appreciate any pointers to achieve this (I'm new at this, as you can probably tell).
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.rose.com', 'https://www.pink.com']

for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')

    f = open('links_good.txt', 'w')
    g = open('links_need_update.txt', 'w')

    for link in soup.find_all('a'):
        data = link.get('href')
        check_url = re.compile(r'(www.x.com)+ | (www.y.com)')
        invalid = check_url.search(data)
        if invalid == None
            g.write(urls[i])
            g.write('\n')
        else:
            f.write(urls[i])
            f.write('\n')
There are some very basic problems with your code:
- if invalid == None is missing a : at the end, and should really be if invalid is None: anyway
- not all <a> elements will have an href, so you need to deal with those, or your script will fail
- the regex has some issues (you probably don't want to repeat that first URL, and the parentheses are pointless)
- you write the URL to the file every time you find a problematic link, but you only need to write it once if the page has a problem at all; or perhaps you wanted a full list of all the problematic links?
- you rewrite the files on every iteration of your for loop, so you only get the final result
Fixing all that (and using a few arbitrary URLs that work):
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

f = open('links_good.txt', 'w')
g = open('links_need_update.txt', 'w')

for i in range(len(urls)):
    grab = requests.get(urls[i])
    soup = BeautifulSoup(grab.text, 'html.parser')
    for link in soup.find_all('a'):
        data = link.get('href')
        if data is not None:
            check_url = re.compile('gamespot.com|pcgamer.com')
            result = check_url.search(data)
            if result is None:
                # if there's no result, the link doesn't match what we need, so write it and stop searching
                g.write(urls[i])
                g.write('\n')
                break
    else:
        f.write(urls[i])
        f.write('\n')
However, there are still a lot of issues:
- you open file handles but never close them; use with instead
- you loop over a list using an index, which isn't needed; loop over urls directly
- you compile the regex for efficiency, but do so on every iteration, countering the effect
The same code with those problems fixed:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    # if there's no result, the link doesn't match what we need, so write it and stop searching
                    g.write(url)
                    g.write('\n')
                    break
        else:
            f.write(url)
            f.write('\n')
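Note that the else above belongs to the inner for loop, not to the if: Python's for ... else runs the else block only when the loop finishes without hitting break, which is why each URL is written to links_good.txt only when no problematic link was found. A tiny standalone illustration:

for n in [1, 3, 5]:
    if n % 2 == 0:
        print('found an even number')
        break
else:
    # runs because the loop completed without hitting break
    print('no even number found')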
Or, if you want to list all the problematic URLs on the sites:
import requests
from bs4 import BeautifulSoup
import re

urls = ['https://www.gamespot.com', 'https://www.pcgamer.com']

with open('links_good.txt', 'w') as f, open('links_need_update.txt', 'w') as g:
    check_url = re.compile('gamespot.com|pcgamer.com')
    for url in urls:
        grab = requests.get(url)
        soup = BeautifulSoup(grab.text, 'html.parser')
        good = True
        for link in soup.find_all('a'):
            data = link.get('href')
            if data is not None:
                result = check_url.search(data)
                if result is None:
                    g.write(f'{url},{data}\n')
                    good = False
        if good:
            f.write(url)
            f.write('\n')

Web Scraping - Python - need assistance

This is my first time posting, so apologies if there are any errors. I currently have a file with a list of URLs, and I am trying to create a Python program which will go to each URL, grab the text from the HTML page and save it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I am unsure how to solve. I am looking for a better way to do this; I have posted my code below.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.request import Request
import datefinder
from dateutil.parser import parse
import json
import re
import random
import time
import scrapy
import requests
import urllib
import os.path
from os import path

# extracts page contents using BeautifulSoup
def page_extract(url):
    req = Request(url,
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

# opens file that contains the links
file1 = open('links.txt', 'r')
lines = file1.readlines()

# for loop that iterates through the list of urls I have
for i in range(0, len(lines)):
    fileName = str(i) + ".txt"
    url = str(lines[i])
    print(i)
    try:
        # if the scraping is successful I would like it to save the text contents in a text file
        # with the text file name being the index
        soup2 = page_extract(url)
        text = soup2.text
        f = open("Politifact Files/" + fileName, "x")
        f.write(str(text))
        f.close()
        print(url)
    except:
        # otherwise save it to another folder which contains all the sites that threw an error
        f = open("Politifact Files Not Completed/" + fileName, "x")
        f.close()
        print("NOT DONE: " + url)
Thanks @Thierry Lathuille and @Dr Pi for your responses. I was able to find a solution to this problem by looking into Python libraries that can scrape the important text off a webpage. I came across one called 'trafilatura' which is able to accomplish this task. The documentation for this library is at https://pypi.org/project/trafilatura/.
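For reference, a minimal sketch of how trafilatura is typically used, following its PyPI documentation (the URL below is only a placeholder):

import trafilatura

# download the page, then extract the main readable text
downloaded = trafilatura.fetch_url('https://example.com/some-article')  # placeholder URL
if downloaded is not None:
    text = trafilatura.extract(downloaded)
    print(text)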

Extract CSS media queries from websites with python 2.7

I'm trying to find a specific CSS media query (@media only screen) in the CSS files of websites by using a crawler in Python 2.7.
Right now I can crawl websites/URLs (from a CSV file) to find specific keywords in their HTML source code using the following code:
import urllib2

keyword = ['keyword to find']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain.strip())
            response = urllib2.urlopen(req)
            html_content = response.read()
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')
f.close()
However, I would now like to crawl the websites/URLs (from the CSV file) to find the @media only screen query in their CSS files/source code. What should my code look like?
So, you have to:
1° read the CSV file and put each URL in a Python list;
2° loop over this list, go to each page and extract the list of CSS links. You need an HTML parser, for example BeautifulSoup;
3° browse the list of links and extract the item you need. There are CSS parsers like tinycss or cssutils, but I've never used them. A regex can maybe do the trick, even if this is probably not recommended;
4° write the results.
Since you know how to read the CSV (PS: no need to close the file with f.close() when you use the with open method), here is a minimal suggestion for operations 2 and 3. Feel free to adapt it to your needs and to improve it. I used Python 3, but I think it works on Python 2.7.
import re
import requests
from bs4 import BeautifulSoup

url_list = ["https://76crimes.com/2014/06/25/zambia-to-west-dont-watch-when-we-jail-lgbt-people/"]

for url in url_list:
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        css_links = [link["href"] for link in soup.findAll("link") if "stylesheet" in link.get("rel", [])]
        print(css_links)
    except Exception as e:
        print(e, url)
        pass

css_links = ["https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c"]

# your regular expression
pattern = re.compile(r'@media only screen.+?\}')

for url in css_links:
    try:
        response = requests.get(url).text
        media_only = pattern.findall(response)
        print(media_only)
    except Exception as e:
        print(e, url)
        pass
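As an alternative to the regex from step 3°, here is a rough, untested sketch with cssutils; treat the exact API usage as an assumption and check that library's documentation:

import cssutils
import requests

css_text = requests.get("https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=d9243128ba1c").text
sheet = cssutils.parseString(css_text)

for rule in sheet.cssRules:
    # CSSMediaRule objects expose the media query text, e.g. "only screen and (max-width: 600px)"
    if rule.type == rule.MEDIA_RULE and 'only screen' in rule.media.mediaText:
        print(rule.media.mediaText)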

Scraping data from multiple places on same HTML with Beautiful Soup (Python)

I have problems with scraping data from a certain URL with Beautiful Soup.
I've successfully made the part where my code opens a text file with a list of URLs and goes through them.
The first problem that I encounter is when I want to go through two separate places on the HTML page.
With the code that I have written so far, it only goes through the first "class" and just doesn't want to search and scrape the other one that I defined.
The second issue is that I can only get data if I run my script in the terminal with:
python mitel.py > mitel.txt
The output that I get is not the one that I want. I am just looking for two strings from it, but I cannot find a way to extract them.
Finally, there's no way I can get my results to write to CSV.
I only get the last string of the last URL from the URL list into my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class": ["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason, I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages, and if it's needed, I'll try to somehow paste it, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!
Well, I managed to get some results with the help of @furas and Google.
With this code, I can get all "a" elements from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or a phone number.
Sorting and the rest of the work was also done in Excel... I guess I am too much of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

finalt = []

proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append(target.text.strip())
        print finalt

with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't get it to read proxies from a .txt list.
Not bad for a first attempt at Python scraping, but far from efficient and far from the way I imagined it.
Maybe your selector is wrong; try this:
for target in soup.findAll(True, {"class": ["tel-number", "tel-result-main"]}):
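Equivalently, since the question already imports bs4, the same filter can be written with the class_ keyword (just a sketch of the same idea):

# matches elements whose class is either "tel-number" or "tel-result-main"
for target in soup.find_all(class_=["tel-number", "tel-result-main"]):
    print(target.text.strip())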

How to collect a continuous set of webpages using python?

https://example.net/users/x
Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract the contents from every URL using Beautiful Soup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
content = urlopen(re.compile(r"https://example.net/users/[0-9]//"))
soup = BeautifulSoup(content)
Is this the right approach? I have to perform two things:
- Get a continuous set of URLs
- Extract & store the retrieved contents from every page/URL
UPDATE:
I only have to get one particular value from each of the webpages.
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})

for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
You could try this:
import urllib2
import shutil

urls = []
for i in range(10):
    urls.append('https://www.example.org/users/' + str(i))

def getUrl(urls):
    for url in urls:
        # Only a file_name based on url string
        file_name = url.replace('https://', '').replace('.', '_').replace('/', '_')
        response = urllib2.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
If you just need the contents of a web page, you could probably use lxml, from which you could parse the content. Something like:

import requests
from lxml import etree

r = requests.get('https://example.net/users/x')
# etree.HTML() is forgiving enough to parse real-world HTML
dom = etree.HTML(r.text)

# parse something, e.g. the text of an <h1 class="title"> element
title = dom.xpath('//h1[@class="title"]')[0].text
Additionally, if you are scraping tens or hundreds of thousands of pages, you might want to look into something like grequests, which lets you make multiple asynchronous HTTP requests.
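A rough sketch of what that could look like with grequests (the URL pattern is taken from the question; treat the details as an assumption and check the project's README):

import grequests

# build the full list of profile URLs (1 to 200000, as in the question)
urls = ['https://example.net/users/{}'.format(i) for i in range(1, 200001)]

# send the requests concurrently, at most 20 in flight at a time
reqs = (grequests.get(u) for u in urls)
for response in grequests.map(reqs, size=20):
    if response is not None and response.ok:
        print(response.url, len(response.text))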
