Getting strange characters instead of Arabic letters when scraping an Arabic website - python

I would like to scrape this site: http://waqfeya.com/book.php?bid=1
but when I do, I get characters like these: ÇáÞÑÂä ÇáßÑíã
This is what my script looks like:
import requests
from bs4 import BeautifulSoup
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
soup = BeautifulSoup(source.text, 'lxml')
print(soup)
I tried these things, but they didn't work for me:
source.encoding = 'utf-8'
and this:
source.encoding = 'ISO-8859-1'
and also this:
soup = BeautifulSoup(source.text, from_encoding='ISO-8859-1')
But none of them worked for me.

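Use urlopen instead of requests:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 location of urlopen
BASE_URL = "http://waqfeya.com/book.php?bid=1"
# Pass the raw response to BeautifulSoup so it can detect the encoding itself
response = urlopen(BASE_URL)
soup = BeautifulSoup(response, 'lxml')
print(soup)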

Sometimes Requests gets the encoding wrong. For this site, we can get the correct encoding from the page source.
You can assign the encoding with source.encoding = 'windows-1256' before using source.text in BeautifulSoup.
import requests
BASE_URL = "http://waqfeya.com/book.php?bid=1"
source = requests.get(BASE_URL)
print(source.encoding)
print(source.apparent_encoding)
source.encoding='windows-1256'
print(source.text)
I was able to get all the Arabic characters properly.
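To see why the override works, here is a minimal offline sketch. The HTML bytes below are a made-up stand-in for the page (the real markup is not reproduced here), and declared_charset is a hypothetical helper, not part of Requests or BeautifulSoup. It reads the charset declared in the markup and shows that decoding windows-1256 bytes as ISO-8859-1 produces the garbled text from the question.

```python
import re

# The phrase from the question, written with escapes to keep the code points explicit.
arabic = "\u0627\u0644\u0642\u0631\u0622\u0646 \u0627\u0644\u0643\u0631\u064a\u0645"

# Hypothetical stand-in for the raw bytes of a page served as windows-1256.
raw = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=windows-1256"></head><body>'
    + arabic.encode("windows-1256")
    + b"</body></html>"
)

def declared_charset(data, default="utf-8"):
    """Return the charset named in the first 2048 bytes of the markup, if any."""
    match = re.search(rb"charset=[\"']?([A-Za-z0-9_-]+)", data[:2048])
    return match.group(1).decode("ascii") if match else default

charset = declared_charset(raw)
wrong = raw.decode("ISO-8859-1")  # what a wrong guess produces
right = raw.decode(charset)       # what the declared charset produces

print(charset)
print("ÇáÞÑÂä ÇáßÑíã" in wrong)
print(arabic in right)
```

With a real response, the value returned by such a helper could be assigned to source.encoding before reading source.text.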

Related

How to put web-scraped data into a list

This is the code I used to get the data from a website with all the possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You do not need BeautifulSoup for this; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you would like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
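The response here is plain text rather than HTML, which is why splitting on whitespace is enough. A small offline sketch, using a short hard-coded sample in place of the downloaded text (the sample string and the five-letter check are assumptions added for illustration):

```python
# Stand-in for requests.get(url).text: whitespace-separated words.
sample_text = "women\nnikau\nswack\nfeens\n"

# split() with no arguments splits on any whitespace and drops empty strings.
word_list = sample_text.split()

# Optional sanity check for a Wordle clone: keep only five-letter alphabetic words.
word_list = [w for w in word_list if len(w) == 5 and w.isalpha()]

print(word_list)  # ['women', 'nikau', 'swack', 'feens']
```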

How do I clean up Beautiful Soup's output

I am trying to scrape a book from a website and while parsing it with Beautiful Soup I noticed that there were some errors. For example this sentence:
"You have more… direct control over your skaa here. How many woul "Oh, a half dozen or so,"
The "more…" and " woul" are both errors that occurred somewhere in the script.
Is there any way to automatically clean up mistakes like this?
Example code of what I have is below.
import requests
from bs4 import BeautifulSoup
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())
trin = soup.tr.get_text()
final = str(trin)
print(final)
You need to convert (unescape) the HTML entities, as detailed here. To apply this to your situation and retain the text, you can use stripped_strings:
import requests
from bs4 import BeautifulSoup
import html
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')
for r in soup.select_one('table tr').stripped_strings:
    s = html.unescape(r)
    print(s)
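As a small illustration of what html.unescape does, assuming the garbled spots come from HTML entities left in the text (the sample string is made up, not taken from the page):

```python
import html

# Hypothetical raw text containing HTML entities.
raw = "You have more&hellip; direct control &amp; &quot;a half dozen&quot;"

# Named and numeric entities are replaced by the characters they stand for.
clean = html.unescape(raw)
print(clean)
```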

How can I print any website content? (Using something like my code)

I want to open a website, get its content, store it in a variable and print it:
from urllib.request import urlopen
url = any_website
content = urlopen(url).read().decode('utf-8')
print(content)
The expected result is that I get what is written on the page.
In Python, there are several libraries you may be interested in. Here is an example of printing a page's contents to get you started:
from bs4 import BeautifulSoup as soup
import requests
url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
page = requests.get(url)
page_html = (page.content)
page_soup = soup(page_html, "html.parser")
print (page_soup)
With urlopen, you may try the following (Python 3):
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://en.wikipedia.org/wiki/List_of_multinational_corporations"
r = urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")  # name the parser explicitly
print(type(soup))
print(soup.prettify()[0:1000])
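Hard-coding decode('utf-8') fails on pages served in another encoding. A hedged sketch that uses the charset from the Content-Type header when the server sends one; fetch_text and its fallback parameter are hypothetical names, and falling back to UTF-8 with errors="replace" is an assumption, not a rule:

```python
from urllib.request import urlopen

def fetch_text(url, fallback="utf-8"):
    """Download url and decode it with the charset from the Content-Type header."""
    with urlopen(url) as response:
        raw = response.read()
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")

# Usage (needs network access):
# content = fetch_text("https://en.wikipedia.org/wiki/List_of_multinational_corporations")
# print(content[:1000])
```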

How to fix Cyrillic characters while web-scraping with Python

I'm scraping a Cyrillic website with Python using BeautifulSoup, but I'm having some trouble. Every word shows up like this:
СилÑановÑка Ðавкова во Ðази
I also tried some other Cyrillic websites, and they work fine.
My code is this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://').text
soup = BeautifulSoup(source, 'lxml')
print(soup.prettify())
How should I fix it?
Requests fails to detect the encoding as UTF-8, so override it manually:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://time.mk/') # don't convert to text just yet
# print(source.encoding)
# prints out ISO-8859-1
source.encoding = 'utf-8' # override encoding manually
soup = BeautifulSoup(source.text, 'lxml') # this will now decode utf-8 correctly
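Requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, which matches the value in the commented-out print above. A sketch of making the override conditional instead of unconditional; pick_encoding is a hypothetical helper, and the header values and sample word are made-up examples:

```python
def pick_encoding(content_type, detected, default="utf-8"):
    """Return an encoding to force, or None to keep the one from the header."""
    if "charset=" in content_type.lower():
        return None  # the server declared a charset; trust it
    return detected or default  # otherwise prefer the detected encoding

# With a real response this could be used as:
#   forced = pick_encoding(source.headers.get("Content-Type", ""),
#                          source.apparent_encoding)
#   if forced:
#       source.encoding = forced

# Offline demonstration of the failure and the fix.
original = "Здраво"
garbled = original.encode("utf-8").decode("ISO-8859-1")  # wrong decoding
repaired = garbled.encode("ISO-8859-1").decode("utf-8")  # undo it

print(pick_encoding("text/html", "utf-8"))
print(repaired == original)
```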

How to open with urllib, link parsed by BeautifulSoup?

I use Python 3, Beautiful Soup 4 and urllib for parsing some HTML.
I need to parse some pages, get some links from these pages, and then parse the pages behind those links. I tried to do it like this:
import urllib.request
import urllib
from bs4 import BeautifulSoup
with urllib.request.urlopen("https://example.com/mypage?myparam=%D0%BC%D0%B2") as response:
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')
total = soup.find(attrs={"class":"item_total"})
link = u"https://example.com" + total.find('a').get('href')
with urllib.request.urlopen(link) as response:
    exthtml = BeautifulSoup(response.read(), 'html.parser')
But urllib can't open the second link, because it is not encoded like the first link. It contains symbols from different languages and white space.
I tried to encode the URL like this:
link = urllib.parse.quote("https://example.com" + total.find('a').get('href'))
But it encodes all the symbols. How can I get a proper URL from bs and make the request?
UPD:
An example of the second link, produced by
link = u"https://example.com" + total.find('a').get('href')
is
"https://example.com/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"
You should just URL-encode your link:
link = "https://example.com" + urllib.parse.quote(total.find('a').get('href'))
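One caveat: quote() with its default safe='/' also escapes ?, =, & and %, so the query-string delimiters and the existing %2F escapes in the example href would be encoded again. A minimal sketch that keeps those characters; the safe set chosen here is an assumption that fits the example link, and the href is derived from the link shown in the question:

```python
from urllib.parse import quote

BASE = "https://example.com"
# href as it might come back from total.find('a').get('href')
href = "/mypage?p1url=www.example.net%2Fthatpage%2F01234&text=абвгд еёжз ийклмно"

# Leave URL delimiters and already-present percent-escapes untouched;
# only spaces and non-ASCII characters get percent-encoded.
link = BASE + quote(href, safe="/?=&%")

print(link)
```

This would not be right for an href that contains a literal % that is not part of an escape sequence; in that case the query string needs to be split and re-encoded field by field.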
