Beautifulsoup: Removing German Umlauts - python

I'm new to all of this, so I need a little bit of help. For a uni project I am trying to extract ingredients from a website. In general the code works as it should, but I just don't know how to get "Bärlauch" instead of "B%C3%A4rlauch" in the end.
I used BeautifulSoup with the following code:
URL = [...]
links = []
for url in range(0,10):
    req = requests.get(URL[url])
    soup = bs(req.content, 'html.parser')
    for link in soup.findAll('a'):
        links.append(str(link.get('href')))
I don't get why it doesn't work as it should, even though the encoding is already UTF-8.
Maybe someone knows better.
Thanks!

The href values are URL-encoded (percent-encoded), so decode them with urllib.parse.unquote. Also, what requests.get returns is a response, not a request, so name it accordingly.
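import urllib.parse
import requests
from bs4 import BeautifulSoup as bs

URLS = [...]
links = []
for url in URLS:
    response = requests.get(url)
    soup = bs(response.content, 'html.parser')
    # href=True skips <a> tags without an href, which would make unquote() fail on None
    for link in soup.find_all('a', href=True):
        links.append(urllib.parse.unquote(link.get('href')))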
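To see the decoding step on its own, here is a minimal sketch that needs no network access. It only uses the sample string from the question; the variable names are just for illustration.

```python
from urllib.parse import quote, unquote

# The percent-encoded value from the question: %C3%A4 is the UTF-8 encoding of "ä".
encoded = "B%C3%A4rlauch"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way, which is what produced the href in the first place.
round_trip = quote(decoded)
print(round_trip)
```

So the page encoding was never the problem: the umlaut is escaped inside the URL itself, and unquote() reverses that.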

Related

how to tell if a page has valid content using python without opening the page

I am working on this Python script to check Facebook pages. It looks something like this:
url = 'https://www.facebook.com/photo/?fbid=3063889{}'.format(i)
urlHandler = urllib.request.urlopen(url)
html = urlHandler.read()
print(html)
Is there any way to tell whether a page has content or not without opening the link?
Like the difference between this and this.
And this is using BeautifulSoup:
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')
for link in soup.find_all('img'):
    print(link.get('src'))

How to read link from beautifulsoup output python

I am trying to pass a link I extracted from beautifulsoup.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links[1])
This is the link I am wanting.
Output: https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip
Now I am trying to pass this link through so I can download the contents.
# make a folder if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
# pass the url
url = r'link from beautifulsoup result needs to go here'
response = requests.get(url, stream = True)
# extract contents
with zipfile.ZipFile(io.BytesIO(response.content)) as zf:
    for elem in zf.namelist():
        zf.extract(elem, '../data')
My overall goal is to take the link that I scraped and place it in the url variable, because the link on this website is always changing. I want to make it dynamic so that I don't have to search for the link manually and update it every time it changes. I hope this makes sense, and I appreciate any help I can get.
If I manually enter the link as follows, I know it works:
url = r'https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip'
If I can get my code to pass exactly that, I know it'll work. I'm just stuck on how to accomplish this.
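I think you can do it with the find_all() method in Beautiful Soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
for a in soup.find_all('a', href=True):
    # keep the download link; without a filter, url would end up as the last href on the page
    if a['href'].endswith('.zip'):
        url = a['href']
        break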
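Since the question already builds a links list, another way to make this dynamic is to pick the entry by what it looks like rather than by position. The sketch below is only an illustration: first_zip_link is a hypothetical helper name, and it assumes the download you want is the first link whose path ends in .zip.

```python
from urllib.parse import urlparse

def first_zip_link(links):
    """Return the first URL whose path ends in .zip, or None if there is none.

    Hypothetical helper, not part of requests or BeautifulSoup.
    """
    for link in links:
        # Compare against the path only, so a query string after ".zip" does not matter.
        if urlparse(link).path.lower().endswith(".zip"):
            return link
    return None

# Stand-in for the scraped list from the question (both URLs appear in the question).
links = [
    "https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources",
    "https://ed-public-download.app.cloud.gov/downloads/CollegeScorecard_Raw_Data_07202021.zip",
]

url = first_zip_link(links)
print(url)
```

The resulting url can then be passed to requests.get(url, stream=True) exactly as in the download code above. Checking for None first avoids requesting a missing link if the page layout changes.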

Soup works on one IMDb page but not on another. How to solve?

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
#using unique tag for each movie in the respective link
print(movie_div1)
#empty list
print(movie_div)
#gives perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.
Unfortunately, the div you want is rendered by JavaScript, so you can't get it by scraping the raw HTML response.
You can get the movies you want from the JSON that your browser receives, so you won't need to scrape the page with BeautifulSoup, which also makes your script much faster.
The second option is using Selenium.
Good luck.
As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded within the HTML itself and is later rendered to HTML by JavaScript in the browser (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
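The string slicing above depends on the JSON being followed by exactly ");\n". As a slightly more forgiving variant, the standard library's JSONDecoder.raw_decode can read one JSON value starting at a given offset and ignore whatever follows it. This is a sketch on a made-up sample string: the title id and title are placeholders, and extract_pushed_json is a hypothetical helper name.

```python
import json

# Made-up stand-in for the page source; only the push(...) wrapper mirrors the answer above.
SAMPLE_HTML = (
    '<span class="ab_widget"><script>'
    'IMDbReactInitialState.push({"titles": {"tt0000001": '
    '{"primary": {"title": "Example Movie"}}}});\n'
    '</script></span>'
)

def extract_pushed_json(text, marker="IMDbReactInitialState.push("):
    """Decode the JSON value that directly follows marker in text (hypothetical helper)."""
    start = text.find(marker)
    if start == -1:
        raise ValueError("marker not found in page source")
    # raw_decode parses one JSON document starting at the given index
    # and returns it together with the index where it ended.
    obj, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return obj

state = extract_pushed_json(SAMPLE_HTML)
titles = [item["primary"]["title"] for item in state["titles"].values()]
print(titles)
```

Whether the real page still uses this marker is something to check in the page source; the helper raises ValueError if it does not find it.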

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it never fails to return with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)
The data is loaded dynamically from a script tag, so, as in the other answer, you can grab it from that tag. Target the tag by its id, pull out the relevant JSON, take the HTML from that JSON, and then parse that HTML, which is what would have been loaded dynamically on the page (at this point you can use your original class selector).
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
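If you want to try the same idea offline, here is a rough sketch using only the standard library on a made-up page. The sample markup only imitates the structure described above (a script tag with id json-user holding JSON with a page.html string). ScriptById and embedded_profile_html are hypothetical names, and the real page may be laid out differently.

```python
import json
from html.parser import HTMLParser

# Made-up sample: the profile HTML sits as an escaped string inside JSON inside a script tag.
SAMPLE_PAGE = r'''
<html><body>
<script id="json-user" type="application/json">
{"page": {"html": "<div class=\"bbcode bbcode--profile-page\">a lot of text<\/div>"}}
</script>
</body></html>
'''

class ScriptById(HTMLParser):
    """Collect the text content of the <script> tag that has the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside = False

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate them.
        if self.inside:
            self.chunks.append(data)

def embedded_profile_html(page_html):
    parser = ScriptById("json-user")
    parser.feed(page_html)
    parser.close()
    data = json.loads("".join(parser.chunks))
    return data["page"]["html"]

inner_html = embedded_profile_html(SAMPLE_PAGE)
print(inner_html)
```

The printed string is ordinary HTML again, so it can be handed to BeautifulSoup and queried with the original class selector, as in the answer above.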
You could try this:
import re
import requests
from bs4 import BeautifulSoup
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script",{"id":re.compile(r"json-user")})
result = re.findall('raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is a string inside the script tag with id='json-user', but that's why you can't get its value by selecting the div with class='bbcode bbcode--profile-page'.
Hope this helps.

Extracting specific elements from list python 2.7

I am working on a bot that extracts the URLs from a specific page. I have extracted all the links and put them in a list, but I can't seem to get the real URLs (the ones that lead to other sites and start with http or https) out of the list and append them to another list, or delete the ones that don't start with http. Thanks in advance.
import urllib2
import requests
from bs4 import BeautifulSoup
def main():
    #get all the links from bing about cancer
    site = "http://www.bing.com/search?q=cancer&qs=n&form=QBLH&pq=cancer&sc=8-4&sp=-1&sk=&cvid=E56491F36028416EB41694212B7C33F2"
    urls =[]
    true_links = []
    r = requests.get(site)
    html_content = r.content
    soup = BeautifulSoup(html_content, 'html.parser')
    links = soup.find_all("a")
    for link in links:
        link = link.get("href")
        urls.append(str(link))
        #urls.append(link.get("href"))
    #print map(str, urls)
    #REMOVE GARBAGE LINKS
    print len(urls)
    print urls
main()
You can use urlparse.urljoin:
link = urlparse.urljoin(site, link.get("href"))
This will create absolute URLs out of relative ones.
You should also be using html_content = r.text instead of html_content = r.content. r.text takes care of using the proper encoding.
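For the filtering part of the question, a possible sketch is below. Note the assumptions: it is written for Python 3, where the urlparse module from Python 2.7 lives in urllib.parse; absolute_http_links is a hypothetical helper name; and the hrefs list is made-up sample data, not real Bing output.

```python
from urllib.parse import urljoin, urlparse

def absolute_http_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http/https results (hypothetical helper)."""
    result = []
    for href in hrefs:
        if not href:
            # link.get("href") returns None for <a> tags without an href
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme in ("http", "https"):
            result.append(absolute)
    return result

# Made-up hrefs of the kinds a results page typically contains.
hrefs = ["/images?q=cancer", "https://www.cancer.org/", None, "javascript:void(0);"]

links = absolute_http_links("http://www.bing.com/search?q=cancer", hrefs)
print(links)
```

If you only want links that were already absolute in the page, skip urljoin and test href.startswith(("http://", "https://")) instead.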
