Weird text indentation when web scraping with BeautifulSoup4 in Python

I'm trying to scrape GitHub.
This is the code:
import requests as req
from bs4 import BeautifulSoup
urls = [
"https://github.com/moom825/Discord-RAT",
"https://github.com/freyacodes/Lavalink",
"https://github.com/KagChi/lavalink-railways",
"https://github.com/KagChi/lavalink-repl",
"https://github.com/Devoxin/Lavalink.py",
"https://github.com/karyeet/heroku-lavalink"]
r = req.get(urls[0])
soup = BeautifulSoup(r.content,"lxml")
title = str(soup.find("p",attrs={"class":"f4 mt-3"}).text)
print(title)
When I run the program I don't get any errors, but the indentation of the printed text is very weird.
Can anyone please help me with this problem?
I'm using Replit.

GitHub has a really good API, which may be a better fit than scraping.
If you want to keep scraping, call .strip() after .text; it removes the leading and trailing whitespace.
import requests as req
from bs4 import BeautifulSoup
urls = [
"https://github.com/moom825/Discord-RAT",
"https://github.com/freyacodes/Lavalink",
"https://github.com/KagChi/lavalink-railways",
"https://github.com/KagChi/lavalink-repl",
"https://github.com/Devoxin/Lavalink.py",
"https://github.com/karyeet/heroku-lavalink"]
r = req.get(urls[0])
soup = BeautifulSoup(r.content,"lxml")
title = str(soup.find("p",attrs={"class":"f4 mt-3"}).text.strip())
print(title)
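To see what .strip() does and does not remove, here is a minimal offline sketch. The HTML snippet and its text are made up for illustration (they are not the real GitHub markup); only the class name is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the text is invented, only the class name comes from the question.
SAMPLE_HTML = """
<p class="f4 mt-3">
      A sample   project
      description
    </p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tag = soup.find("p", attrs={"class": "f4 mt-3"})

raw = tag.text                          # keeps the newlines and indentation from the HTML
stripped = tag.text.strip()             # removes whitespace at both ends only
collapsed = " ".join(tag.text.split())  # also collapses whitespace runs inside the text

print(repr(raw))
print(repr(stripped))
print(repr(collapsed))
```

If the text spans several lines inside the tag, .strip() alone leaves the inner line breaks in place, so the split/join variant may be closer to what you want.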

Related

Extracting json when web scraping

I was following a Python guide on web scraping and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.
from bs4 import BeautifulSoup
import json
import re
import requests
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
Error Message:
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'
Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/
The main issue, in my opinion, is that you should add a user-agent header to your request so that you get the expected HTML:
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
Note: First of all, always take a closer look at your soup to check whether the expected information is actually there.
Example
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
# A browser-like user-agent, so the server returns the expected HTML
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
# Find the <script> tag that assigns root.App.main
script = soup.find('script', text=re.compile(r'root\.App\.main'))
# Capture the JSON object literal and parse it
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1))
json_text
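The AttributeError in the question comes from soup.find returning None. A minimal sketch of the same extraction with explicit None checks is below. It runs against an inline HTML sample that is invented for illustration (the real Yahoo page is much larger and may have changed), and the function name extract_app_main is hypothetical.

```python
import json
import re

from bs4 import BeautifulSoup

# Invented sample page; it only mimics the "root.App.main = {...};" pattern.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
(function (root) {
root.App = {};
root.App.main = {"context": {"symbol": "AAPL"}};
}(this));
</script>
</body></html>
"""

PATTERN = re.compile(r"^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$", re.MULTILINE)


def extract_app_main(html):
    """Return the parsed root.App.main object, or None if it is not in the page."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", string=re.compile(r"root\.App\.main"))
    if script is None:
        # The server sent different HTML (for example, a consent or error page).
        return None
    match = PATTERN.search(script.string)
    if match is None:
        return None
    return json.loads(match.group(1))


print(extract_app_main(SAMPLE_HTML))
print(extract_app_main("<html><body>blocked</body></html>"))
```

Returning None instead of raising makes it easier to log the raw response and see what the server actually sent.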

How do I log data from a live website using beautiful soup

Hello, I am trying to use Beautiful Soup and requests to log the data coming from an anemometer, which updates live every second. The link to the website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the image, taken from inspecting the HTML in my browser.
I have written the code below to try to print out the data, but when I run it, it prints: None. I think this means that the soup object doesn't in fact contain the whole HTML page? When I print soup.prettify(), I cannot find the id=js-2-text that I see when inspecting the HTML in my browser. If anyone has any ideas why this might be or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from an external URL, so BeautifulSoup doesn't see it in the page HTML. You can try the API URL the page connects to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
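The parsing and unit conversion can be separated from the network call, which makes them easier to check. Below is a minimal sketch; the sample string is an assumption about the payload layout (first field ignored, then direction, then speed in m/s), inferred from the unpacking in the answer, and the name parse_measurement is hypothetical.

```python
MS_TO_KNOTS = 1.9438445  # conversion factor used in the answer above


def parse_measurement(csv_text):
    """Split a comma-separated reading into (direction, m/s, knots)."""
    _, direction, metres_per_second, *_ = csv_text.strip().split(",")
    speed = float(metres_per_second)
    return int(direction), speed, speed * MS_TO_KNOTS


# Assumed payload layout; the real device response may contain other fields.
direction, speed, knots = parse_measurement("1,210,006.58,0")
print(direction, speed, round(knots, 2))
```

To log continuously, you could call the request and this function in a loop with time.sleep(1) between iterations and append each reading to a file.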

Python Couldn't parse HTML from URL

I have tried the code below:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://myip.ms/'
page = 1
req = requests.get(URL + str(page))
soup = bs (req.text, 'html.parser')
print (soup)
This code works for some websites but not for most others, such as myip.ms.
Works for me. But what are you actually trying to achieve here? Your code appends "1" to the end of the URL and then requests it. If that page doesn't exist on the server, you will get an error. In this case https://myip.ms/1 exists, but it is no surprise that other pages could give you errors.
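One way to make this kind of problem visible is to build the URL explicitly and check the HTTP status before parsing. This is a minimal sketch, not a fix for any particular site; build_page_url and fetch_soup are hypothetical helper names, and the user-agent header and timeout value are assumptions.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def build_page_url(base_url, page):
    """Join the base URL and the page number, so the final URL is easy to print."""
    return urljoin(base_url, str(page))


def fetch_soup(url, timeout=10):
    """Fetch a page and parse it, raising an HTTPError for 4xx/5xx responses."""
    response = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


print(build_page_url("https://myip.ms/", 1))
```

Calling fetch_soup(build_page_url("https://myip.ms/", 1)) then either returns a soup or raises with the status code, rather than silently parsing an error page.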

Decode a web page using request and BeautifulSoup package

I am working on a Python practice question. The question is "Use the BeautifulSoup and requests Python packages to print out a list of all the article titles on the New York Times homepage."
Below is my solution, but it doesn't give any output. I am using Jupyter Notebook, and when I run the code below it does nothing. My kernel is working properly, so the problem must be in my code.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url= 'https://www.nytimes.com/'
r=requests.get(base_url)
soup=BeautifulSoup(urlopen(base_url))
get_titles=soup.find_all(class_="css-1vctqli esl82me2" )
print()
for title in get_titles:
    print(title.text)
Where did you get that class name? It is not the right one.
You need to replace css-1vctqli esl82me2 with css-1j836f9 esl82me3:
import requests
from bs4 import BeautifulSoup
base_url = 'https://www.nytimes.com/'
# Fetch the page once with requests (the extra urlopen call was redundant)
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')
# Class names as given in this answer; the site may have changed them since
get_titles = soup.find_all(class_="css-1j836f9 esl82me3")
print()
for title in get_titles:
    print(title.text)
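A side note on matching two classes at once: passing a string with a space to class_ matches only that exact attribute value, while a CSS selector matches both classes in any order. The sketch below uses invented inline HTML that reuses the class names from the answer; it is not the real New York Times markup.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; only the class names come from the answer.
SAMPLE_HTML = """
<h3 class="css-1j836f9 esl82me3">First headline</h3>
<h3 class="esl82me3 css-1j836f9">Second headline</h3>
<h3 class="css-1j836f9">Third headline</h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact string match on the class attribute: order and spacing must be identical.
exact = [t.get_text(strip=True) for t in soup.find_all(class_="css-1j836f9 esl82me3")]

# CSS selector: the tag must carry both classes, in any order.
both = [t.get_text(strip=True) for t in soup.select(".css-1j836f9.esl82me3")]

print(exact)
print(both)
```

Generated class names like these tend to change between site deployments, so an empty result usually means the class should be checked in the browser again.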

BeautifulSoup does not work for some web sites

I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this website, it prints an empty list. What could the problem be? I am running on Ubuntu 12.04.
Actually, there are quite a few bugs in BeautifulSoup that might cause unexpected errors. I had a similar issue when working on Apache using the lxml parser.
So just try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a mistake in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
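For readers on Python 3 and bs4, the parser suggestion can be tried offline by running the same markup through each installed parser and comparing the results. This is a minimal sketch on an invented, deliberately unclosed HTML fragment; links_per_parser is a hypothetical helper name, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented fragment with unclosed tags, to show how parsers cope with broken HTML.
BROKEN_HTML = '<p><a href="/a">A</a><a href="/b">B'


def links_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: list of hrefs} for every parser that is installed."""
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional extras; skip them if missing.
            continue
        results[parser] = [a.get("href") for a in soup.find_all("a")]
    return results


for name, hrefs in links_per_parser(BROKEN_HTML).items():
    print(name, hrefs)
```

If one parser returns an empty list where another finds links, the page markup is probably malformed in a way that parser handles differently.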
