Web scraping Arabic text with Python

I am trying to scrape the website "http://norumors.net/?post_type=rumors?post_type=rumors" to get only the news headlines and put them in a CSV file using BeautifulSoup and Python. This is the code I am using after looking at the HTML source ("view-source:http://norumors.net/?post_type=rumors?post_type=rumors"):
import urllib.request, sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet = 1
upperframe = []
for page in range(1, pagesToGet+1):
    print('processing page :', page)
    url = 'http://norumors.net/?post_type=rumors/?page=' + str(page)
    print(url)
    # an exception might be thrown, so the code should be in a try-except block
    try:
        # use the browser to get the url. This is a suspicious command that might blow up.
        page = requests.get(url)  # this might throw an exception if something goes wrong
    except Exception as e:  # this describes what to do if an exception is thrown
        error_type, error_obj, error_info = sys.exc_info()  # get the exception information
        print('ERROR FOR LINK:', url)  # print the link that caused the problem
        print(error_type, 'Line:', error_info.tb_lineno)  # print error info and the line that threw the exception
        continue  # ignore this page; abandon it and move on
    time.sleep(2)
    soup = BeautifulSoup(page.text, 'html.parser')
    frame = []
    links = soup.find_all('li', attrs={'class': 'o-listicle__item'})
    print(len(links))
    filename = "NEWS.csv"
    f = open(filename, "w", encoding='utf-8')
    headers = "Statement,Link\n"
    f.write(headers)
    for j in links:
        Statement = j.find("div", attrs={'class': 'row d-flex'}).text.strip()
        # Link = "http://norumors.net/"
        Link += j.find("div", attrs={'class': 'col-lg-4 col-md-4 col-sm-6 col-xs-6'}).find('a')['href'].strip()
        frame.append((Statement, Link))
        f.write(Statement.replace(",", "^") + "," + Link + "," + Date.replace(",", "^") + "," + Source.replace(",", "^") + "," + Label.replace(",", "^") + "\n")
    upperframe.extend(frame)
    f.close()
data = pd.DataFrame(upperframe, columns=['Statement', 'Link'])
data.head()
But after I run the code, the pandas DataFrame and the CSV file are both empty. Any suggestion why that is? Note that I want to get the text between the tags.

If I understand correctly, you want to get the text of the news headlines and the href links to those news items, and write them to a CSV file. The problem with your code is that for j in links: is never executed, because soup.find_all('li', attrs={'class':'o-listicle__item'}) returns an empty list. Be careful with the names and classes of the tags you are searching for. The code below gets the news texts and their links, and it also writes them to the CSV file using pd.DataFrame.
import urllib.request, sys, time
from bs4 import BeautifulSoup
import requests
import pandas as pd

pagesToGet = 1
for page in range(1, pagesToGet+1):
    print('processing page :', page)
    url = 'http://norumors.net/?post_type=rumors/?page=' + str(page)
    print(url)
    # an exception might be thrown, so the code should be in a try-except block
    try:
        # use the browser to get the url. This is a suspicious command that might blow up.
        page = requests.get(url)  # this might throw an exception if something goes wrong
    except Exception as e:  # this describes what to do if an exception is thrown
        error_type, error_obj, error_info = sys.exc_info()  # get the exception information
        print('ERROR FOR LINK:', url)  # print the link that caused the problem
        print(error_type, 'Line:', error_info.tb_lineno)  # print error info and the line that threw the exception
        continue  # ignore this page; abandon it and move on
    soup = BeautifulSoup(page.text, 'html.parser')
    texts = []
    links = []
    filename = "NEWS.csv"
    f = open(filename, "w", encoding='utf-8')
    Statement = soup.find("div", attrs={'class': 'row d-flex'})
    divs = Statement.find_all("div", attrs={'class': 'col-lg-4 col-md-4 col-sm-6 col-xs-6'})
    for div in divs:
        txt = div.find("img", attrs={'class': 'rumor__thumb'})
        texts.append(txt['alt'])
        lnk = div.find("a", attrs={'class': 'rumor--archive'})
        links.append(lnk['href'])
    data = pd.DataFrame(list(zip(texts, links)), columns=['Statement', 'Link'])
    data.to_csv(f, encoding='utf-8', index=False)
    f.close()
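As a side note, if you don't need the file handle for anything else, pandas can write straight to a file path; a minimal equivalent of the last three lines (an optional simplification, not required) would be:

data = pd.DataFrame(list(zip(texts, links)), columns=['Statement', 'Link'])
data.to_csv("NEWS.csv", encoding='utf-8', index=False)  # to_csv accepts a path directly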

Related

Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an IndexError.
CODE:
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
>>>
The reason you're getting the error is that it can't find a <b> tag there.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs

MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"

with open("output.txt", "a", encoding="utf-8") as f:
    for page in range(203):
        if page == 0:
            req = requests.get(MAIN_URL.format(page))
        else:
            req = requests.get(URL.format(page))
        soup = bs(req.text, "html.parser")
        print(f"page # {page}, Getting: {req.url}")
        book_name = (
            tag.get_text(strip=True)
            for tag in soup.select(
                "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
            )
        )
        f.seek(0)
        f.write("\n".join(book_name) + "\n")
I suggest changing your parser to html5lib (pip install html5lib); I just think it's better. Second, it's better not to call .find() directly on your soup object, because tag names and classes can be duplicated elsewhere on the page, so you might end up reading data from an HTML block that doesn't actually contain what you want. It's better to inspect the page first, find the block of markup that holds the data you're after, and scrape inside that block; it's easier that way and avoids more errors.
What I did here is inspect the elements and find the block that holds your data: it is a div with class mb-40 content-box, and that is where all the names you are trying to get live. Luckily that class is unique and no other element shares the same tag and class, so we can just .find() it directly.
The value of trs is then simply the tr tags inside that block. (Note that those <tr> tags sit inside a <table> tag, but the good thing is they are the only <tr> tags on the page, so there is no risk of picking up rows from another table with the same class.) Those <tr> tags contain the names you want. You may ask why there is a [1:]: it starts at index 1 so the table header on the website is not included.
Then just loop through those tr tags and get the text. As for why your error happens: it is an index-out-of-range error because you are accessing an item of a .find_all() result list that is out of bounds. That happens when no matching data is found, and it can easily happen when you .find() directly on your soup variable, because there can be tags with the same class values but different content inside them. You expect to scrape one particular part of the website but actually scrape a different part, which is why you get no data and wonder what went wrong.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')
div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr", class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError you're running into means that, in this case, the b-tag you are looking for doesn't contain the information you want (there is no such element).
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# This is where your titles will be saved. Change as needed.
PATH = '/tmp/title_file.txt'

page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')

# Here your titles will be stored before writing to file
all_titles = []

for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles, each on a new line
    f.write('\n'.join(all_titles))

BS4 python script crashes immediately when running, but looking at the code it should be fine

I wrote the following code. It's probably not the prettiest, but I tried. When I run it, it creates the links.txt file, but the script itself exits immediately without showing any error in the cmd window. I tried researching BS4 and I really think this should work.
This was the initial script I was trying to get working so I could eventually change it to only scrape the links within the 'card' class, but since it cannot even scrape all the links, I want to understand what I did wrong.
import requests
import time
from bs4 import BeautifulSoup
import sys

sys.stdout = open("links.txt", "a")

for x in range(0, 10):
    try:
        URL = f'https://wesbite.com/downloads/{x}/'
        page = requests.get(URL)
        time.sleep(5)
        soup = BeautifulSoup(html, 'html.parser')
        links_with_text = []
        for a in soup.find_all('a', href=True):
            if a.text:
                links_with_text.append(a['href'])
        print(links_with_text)
    except:
        continue
Example of the card class I was eventually trying to scrape:
<div class="card-content">
    <div class="center">
        <a target="_blank" href="https://website.com/username/">username</a>
I took your suggestions, removed the bare except, and realized that my indents were inconsistent. After fixing that and changing it to page.text, it seems to work. The code is below:
import requests
import time
from bs4 import BeautifulSoup
import sys

sys.stdout = open("links.txt", "a")

for x in range(0, 10):
    try:
        URL = f'https://wesbite.com/downloads/{x}/'
        page = requests.get(URL)
        time.sleep(5)
        soup = BeautifulSoup(page.text, 'html.parser')
        links_with_text = []
        for a in soup.find_all('a', href=True):
            if a.text:
                links_with_text.append(a['href'])
        print(links_with_text)
    except Exception as e:
        print('something went wrong')
The html variable in BeautifulSoup(html, 'html.parser') is not defined in the code you've posted; my guess is that it raises a NameError, which is then suppressed by your bare except block. Remove the try...except and run it again: exceptions are helpful information, and suppressing them this way prevents you from finding the problem.
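If you want to keep the loop running while still seeing what went wrong, one option (a sketch, not part of the original answer) is to print the full traceback for each failed page instead of silently continuing:

import traceback
import requests
from bs4 import BeautifulSoup

for x in range(0, 10):
    try:
        page = requests.get(f'https://wesbite.com/downloads/{x}/')
        soup = BeautifulSoup(page.text, 'html.parser')
        print([a['href'] for a in soup.find_all('a', href=True) if a.text])
    except Exception:
        traceback.print_exc()  # show the full error for this page, then move on
        continue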

My script does not search all links, what to do?

I'm building a script to scan a website, capture its URLs, and test whether each one is working or not. The problem is that the script only looks at the URLs on the website's home page and ignores the rest. How do I capture the URLs of all the pages linked from the site?
My code is below:
import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})

for link in links:
    result = (link["href"])
    req = Request(result)
    try:
        response = urlopen(req)
        pass
    except HTTPError as e:
        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.write('{} \n'.format(e.reason))
                archive.write('{}'.format(e.code))
                archive.close()
        else:
            # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
                archive.write(result)
                archive.write('\n')
                archive.close()
The main reason this doesn't work is that you put both the OK and the ERROR writes inside the except block.
This means that only URLs that actually raise an exception will be stored.
In general, my advice would be to spray some print statements into the different stages of the script - or use an IDE that lets you step through the code at runtime, line by line. That makes stuff like this so much easier to debug.
PyCharm is free and allows you to do so. Give that a try.
So - I haven't worked with urllib much, but I use requests a lot (python -m pip install requests). A quick refactor using it would look something like this:
import requests
from bs4 import BeautifulSoup
import re

url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a", attrs={'href': re.compile('^(http://)')})

for link in links:
    href = link["href"]
    print("Testing for URL {}".format(href))
    try:
        # since you only want the status code, there's no need to pull the entire html of the site - use HEAD instead of GET
        r = requests.head(href)
        status = r.status_code
        # 404 etc. will not raise an exception
        error = None
    except Exception as e:
        # these exceptions will not have a status_code
        status = None
        error = e
    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
Hope this makes sense.
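For comparison, a minimal fix to the original urllib version (a sketch based on the diagnosis above, reusing the links list and the Request/urlopen/HTTPError imports from the question's code) is simply to move the OK write out of the except block and into an else clause of the try:

for link in links:
    result = link["href"]
    req = Request(result)
    try:
        response = urlopen(req)
    except HTTPError as e:
        # only requests that actually failed end up here
        with open("Document_ERROR.txt", 'a') as archive:
            archive.write(result + '\n')
            archive.write('{} \n'.format(e.reason))
            archive.write('{}'.format(e.code))
    else:
        # runs only when urlopen succeeded
        with open("Document_OK.txt", 'a') as archive:
            archive.write(result + '\n')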

Looping url from csv file for scrape html with python

I'm learning to use Python to scrape websites (an online store).
I created a block of code to scrape a website, where the URLs to scrape are in a CSV file that I load.
However, when I run it, the loop only works once, for one of the lines; it doesn't reach the end of the URL list in the CSV and doesn't continue to the next URL.
disc_information = html.find('div', class_='alert alert-info global-promo').text.strip().strip('\n')
AttributeError: 'NoneType' object has no attribute 'text'
How do I keep going when an error occurs because the element is not found in the HTML?
Below is the Python code I use; please help me make the scraping loop run to the end of the URL list.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import pandas

with open('Url Torch.csv', 'rt') as f:
    data = csv.reader(f, delimiter=',')
    for row in data:
        URL_GO = row[2]

def variable_Scrape(url):
    try:
        cookies = dict(cookie="............")
        request = requests.get(url, cookies=cookies)
        html = BeautifulSoup(request.content, 'html.parser')
        title = html.find('div', class_='title').text.strip().strip('\n')
        desc = html.find('div', class_='content').text
        link = html.find_all('img', class_='lazyload slide-item owl-lazy')
        normal_price = html.find('div', class_='amount public').text.strip().strip('\n')
        disc_information = html.find('div', class_='alert alert-info global-promo').text.strip().strip('\n')
    except AttributeError as e:
        print(e)
        #ConnectionAbortedError
        return False
    else:
        print(title)
        #print(desc)
        #print(link)
    finally:
        print(title)
        print(desc)
        print(link)
        print('Finally.....')

variable_Scrape(URL_GO)
You should call variable_Scrape(URL_GO) inside the for loop, and make sure you define the function before it is used.
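In other words, a sketch of the restructured script (keeping the function body from the question unchanged and assuming the URL is still in the third CSV column) would look like this:

from bs4 import BeautifulSoup
import requests
import csv

def variable_Scrape(url):
    # same body as in the question: fetch the page with cookies, parse it with
    # BeautifulSoup, and extract title/desc/link/prices inside try/except AttributeError
    ...

with open('Url Torch.csv', 'rt') as f:
    data = csv.reader(f, delimiter=',')
    for row in data:
        URL_GO = row[2]
        variable_Scrape(URL_GO)  # call the function for every row, not once after the loop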

Unnamed error using Urllib2 and Beautiful soup

This code block always ends up in the "except" branch. No specific error is shown in my terminal. What am I doing wrong?
Any help is appreciated!
from bs4 import BeautifulSoup
import csv
import urllib2

# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = urllib2.open("http://siph0n.net")
    soup = BeautifulSoup(page, 'lxml')
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except:
    print("Error during fetch.")
    exit()
"No specific error is shown in my terminal"
That's because your except block is shadowing it. Either remove the try/except or print the exception in the except block:
try:
    .
    .
    .
except Exception as ex:
    print(ex)
Note that catching the general Exception type is generally a bad idea; your except blocks should always catch as specific an exception type as possible.
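For example (a sketch assuming you stay on Python 2's urllib2, as your imports suggest), you could catch the specific error that urlopen raises:

import urllib2

try:
    page = urllib2.urlopen("http://siph0n.net")
except urllib2.URLError as ex:  # raised by urlopen on network and HTTP failures
    print(ex)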
You can also use requests to get the data:
from bs4 import BeautifulSoup
import requests
import csv

# get page source and create a BeautifulSoup object based on it
try:
    print("Fetching page.")
    page = requests.get("http://siph0n.net")
    soup = BeautifulSoup(page.text, 'lxml')  # parse the response body, not the Response object
    # specify tags the parameters are stored in
    metaData = soup.find_all("a")
except Exception as ex:
    print(ex)
