Python Web Scraping: Problems parsing Chinese characters with Beautiful Soup/requests

I am scraping a Chinese website, and usually there is no problem parsing the Chinese characters, which I use to find specific URLs with the pattern function within bs4.
However, for this particular Chinese website the soup cannot be parsed properly.
Below is the code I use to set up the soup:
import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")
An example of the printed soup is the following:
[screenshot: current soup output, with garbled characters]
Note: I had to add a picture as Stack thought my post was spam :)
The above should have looked like the following:
[screenshot: properly decoded soup with Chinese characters]
I wonder if I have to specify some kind of encoding within the request, or perhaps something within the soup, but so far I have not found anything that works.
Thanks in advance!

I don't know Chinese. Does this give the desired results?
import requests
from bs4 import BeautifulSoup as bs
start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")
print(soup)
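An alternative, in case hard-coding the decode call feels awkward: requests lets you override the response encoding before reading r.text, so the decoding happens inside requests itself. A minimal sketch of that approach, assuming the site really serves GBK/GB2312 content:

import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
r.encoding = 'gbk'  # override the (often wrong) encoding requests guessed from the headers
soup = bs(r.text, "html.parser")  # r.text is now decoded as GBK
print(soup.title)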

Related

How do I log data from a live website using beautiful soup

Hello, I am trying to use Beautiful Soup and requests to log the data coming from an anemometer which updates live every second. The link to the website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the screenshot below, from inspection of the HTML in my browser:
[screenshot: highlighted element in the browser inspector]
I have written the code below to try to print out the data; however, when I run it, it prints None. I think this means that the soup object doesn't in fact contain the whole HTML page. Upon printing soup.prettify() I cannot find the same id="js-2-text" I find when inspecting the HTML in my browser. If anyone has any ideas why this might be or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from an external URL, so BeautifulSoup doesn't see it. You can try the API URL the page is connecting to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
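Since the reading updates every second, you can wrap the same request in a loop to log it continuously. A small sketch building on the answer above (the API URL and payload are unchanged):

import time
import requests
from bs4 import BeautifulSoup

api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}

while True:
    soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
    # the response wraps a comma-separated reading in a <csv> tag
    _, direction, metres_per_second, *_ = soup.csv.text.split(",")
    print(direction, metres_per_second)
    time.sleep(1)  # poll once per second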

Struggling with BeautifulSoup and tags

I hate to trouble anyone with this, but I've been on this issue for days.
Basically, I want to scrape the Psychological Torture Methods from this web page: https://en.m.wikipedia.org/wiki/List_of_methods_of_torture
This is the exact information I would like to acquire:
Ego-Fragmentation
Learned Helplessness
Chinese water torture
Welcome parade (torture)
And below is my code:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)
html_soup = BeautifulSoup(page.content, 'html.parser')
type(html_soup)
print (html_soup.find("div", class_="mw-parser-output").find_all(text=True, recursive=False) )
I'm sure there is an easy fix to this that I can't see. Once you look at the site's HTML, you'll probably find the answer.
Best wishes, truly.
Have a Beautiful day!
HomeMadeMusic.
Try this. Your expected output is under the section with class mf-section-1:
from bs4 import BeautifulSoup
import requests
URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)
html_soup = BeautifulSoup(page.content, 'html.parser')
print(html_soup.prettify())
print ([x.text for x in html_soup.find("section", class_="mf-section-1").find_all('a')])
When in doubt, brute force it and pretend like you'll come back to it later
from bs4 import BeautifulSoup
import requests
URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)
html_soup = BeautifulSoup(page.content, 'html.parser')
sections = html_soup.find_all("section")
torture_methods = sections[1].find_all("li")
torture_method_names = list(map(lambda x: x.text, torture_methods))
print(torture_method_names)
Prints:
['Ego-Fragmentation', 'Learned Helplessness', 'Chinese water torture', 'Welcome parade (torture)']
There were a couple of issues here:
First, the recursive=False parameter means that you will only get the text that is directly inside the node you selected. You won't get the text from its subnodes.
As there is no text directly inside this div element, the method returns an empty list.
Second, the div you selected doesn't contain only the "Psychological Torture Methods" section, but also the other sections of the page, as well as the disclaimer displayed at the beginning of the article.
To get the information you need, you should only get the content of the section node whose class is mf-section-1.
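To see the first point concretely, here is a tiny self-contained demo (the HTML is made up for illustration) of how recursive=False changes what find_all(text=True) returns:

from bs4 import BeautifulSoup

html = "<div><p>inner text</p></div>"  # no text sits directly inside the div
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

print(div.find_all(text=True, recursive=False))  # [] -- the text belongs to <p>, a subnode
print(div.find_all(text=True))                   # ['inner text'] -- recursive search reaches it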
Solution
I just tweaked your code to print the information you needed. I had to use the lstrip method to remove an unnecessary line break.
from bs4 import BeautifulSoup
import requests
URL = 'https://en.m.wikipedia.org/wiki/List_of_methods_of_torture'
page = requests.get(URL)
html_soup = BeautifulSoup(page.content, 'html.parser')
print(''.join(html_soup.find("section", class_="mf-section-1").find_all(text=True)).lstrip("\n"))
Output
Ego-Fragmentation
Learned Helplessness
Chinese water torture
Welcome parade (torture)

Matching a specific piece of text in a title using Beautiful Soup

Basically, I want to find all links that contain certain key terms. In my case, the titles of the links I want come in this form: abc... (common text), dce... (common text), ... I want to take all of the links containing "(common text)" and put them in a list. I got the code working and I understand how to find all links. However, I converted the links to strings to find the "(common text)". I know that this isn't good practice, and I am not sure how to use Beautiful Soup to find this common element without converting to a string. The issue here is that the titles I am searching for are not all the same. Here's what I have so far:
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
links_length = len(links)
string_links = []
targetlist = []
for a in range(links_length):
    string_links.append(str(links[a]))
    if '(common text)' in string_links[a]:
        targetlist.append(string_links[a])
NOTE: I am looking for the simplest method using Beautiful Soup to accomplish this. Any help will be appreciated.
Without the actual website and the actual output you want, it's very difficult to say exactly what you need, but here is a "cleaner" solution using a list comprehension:
from bs4 import BeautifulSoup
import requests
import webbrowser
url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")
links = soup.find_all('a', limit=4000)
targetlist = [str(link) for link in links if "(common text)" in str(link)]
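If you want to avoid converting tags to strings entirely, as the question asks, find_all also accepts a string argument (a substring, regex, or function) matched against a tag's text. A sketch of that approach, assuming "(common text)" appears in the visible link title rather than in an attribute:

from bs4 import BeautifulSoup
import requests

url = 'website.com'
http = requests.get(url)
soup = BeautifulSoup(http.content, "lxml")

# match <a> tags whose string contains the marker; note this only matches
# tags with a single text child, since that is when .string is set
targets = soup.find_all('a', string=lambda s: s and '(common text)' in s, limit=4000)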

Finding name and codes of all airports

I am trying to scrape data to get the text I need. I want to find the line that says Aberdeen and all lines after it which contain the airport info. Here is a pic of the HTML hierarchy:
[screenshot: HTML hierarchy from the browser inspector]
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if you're curious. I am new to scraping, obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
Why are people suggesting Selenium? This page doesn't load the data dynamically ... requests + re is all you need; you don't even need Beautiful Soup:
import re
import requests

data = requests.get('http://www.airportcodes.org/').text  # .text (str), so the str regex pattern matches
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
The regex just looks for a run of alphabetic characters (also allowing commas and spaces) followed by exactly three uppercase letters in parentheses.
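Each match comes back as a (name, code) tuple, so you can, for example, turn the result into a lookup table. A small usage sketch (the captured names may carry stray leading spaces, hence the strip):

airports = {name.strip(): code for name, code in cities_and_codes}
print(len(airports), "airports found")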

Scraping site returns different href for a link

In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Then I used the following code to get the href with Python:
import requests
from bs4 import BeautifulSoup

html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the class name 'result__a', so don't expect the first link you see to be the first one you get.
The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL.
For example:
"/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
The above href contains two params, namely kh and uddg.
uddg is the actual link you need I suppose.
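For instance, pulling uddg out of the example href above with urllib (a quick sketch):

from urllib.parse import urlparse, parse_qs

href = "/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
query = urlparse(href).query       # 'kh=-1&uddg=https%3A%2F%2Fwww.example.com'
print(parse_qs(query)['uddg'][0])  # 'https://www.example.com' -- parse_qs percent-decodes values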
The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', attrs={'class': 'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
