Retrieving the text output of a html website using bs4 - python

I am trying to extract the text of the match names I have scraped.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.betexplorer.com/odds-movements/soccer/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
times = soup.select('span.table-main__time') #good
matches = soup.find_all("td",class_ ="table-main__tt")
I have located the tag/class, and it seems the value I want to retrieve is the text of the a tag that carries the href. The output I wish to achieve here is 'Can Tho - Long An'.
This is a dynamic webpage, so it's likely that the same match won't appear in your output, but I am looking for pointers on how I can extract just the text and not the whole HTML.
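A minimal sketch of one way to keep only the text, assuming each `td.table-main__tt` cell wraps the match name in an `a` tag. The HTML fragment and its href are made-up stand-ins for the live page, whose markup may differ, so the snippet runs without a network call:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the live page; the real markup may differ.
sample_html = """
<table>
  <tr>
    <td class="table-main__tt">
      <span class="table-main__time">10:00</span>
      <a href="/hypothetical/match/url/">Can Tho - Long An</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select the <a> inside each matching cell and keep only its text, not the tag.
match_names = [a.get_text(strip=True) for a in soup.select("td.table-main__tt a")]
print(match_names)
```

On the real page, the same list comprehension could be applied to the soup built from `res.content`.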

Related

Can't scrape <h3> tag from page

It seems I can scrape any tag and class except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name = "h3" , class_ = "jsx-4245974604")
movies_text=[]
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned, this is dynamic content that is generated only when the page runs in a browser. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable, you can see that the class isn't there. But if you simply want the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the picture (and also in many other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text=[]
for item in movies:
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, print the initial HTML that soup received and check whether the information you need is actually in it.
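The "check the raw HTML first" step can also be done programmatically. This is a minimal sketch: `find_missing_markers` is a hypothetical helper, and `raw_html` is a made-up stand-in for `response.text`:

```python
def find_missing_markers(raw_html, markers):
    """Return the markers (class names, ids, text) that do not occur in raw_html."""
    return [marker for marker in markers if marker not in raw_html]


# Hypothetical static HTML standing in for response.text.
raw_html = '<div><img class="jsx-952983560" alt="Example Movie"></div>'

missing = find_missing_markers(raw_html, ["jsx-4245974604", "jsx-952983560"])
print(missing)  # markers listed here are not in the static HTML
```

Any marker reported as missing is probably added by JavaScript after the page loads, so BeautifulSoup will not see it in the response from requests.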

Is there a way I can extract a list from a javascript document?

I need to obtain the owners of an online-game item from a website, and from my research I need to do some 'web scraping' to get this data. But the information is in JavaScript code, not in an easily parseable HTML document of the kind bs4 handles well. So I need to read a variable in this JavaScript (it contains a list of owners of the item I'm looking at) and turn it into a usable list/JSON/string for my program. Is there a way to do this? If so, how?
I've attached an image of the variable I need when viewing the page source of the site I'm on.
My current code:
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
datas = soup.find_all("script")
print(datas) #prints the sections of the website content that have JavaScript
To scrape a JavaScript variable, BeautifulSoup alone is not enough; a regular expression (re) is needed to pull the value out of the script text.
Use ast.literal_eval to convert the string representation of the dict into a dict.
from bs4 import BeautifulSoup
import requests
import re
import ast
html = requests.get('https://www.rolimons.com/item/1029025').content #the item webpage
soup = BeautifulSoup(html, "lxml")
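# Search the serialized markup: str(soup) includes <script> contents, which soup.text may omit in newer bs4 releases.
ownership_data = re.search(r'ownership_data\s+=\s+.*;', str(soup)).group(0)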
ownership_data_dict = ast.literal_eval(ownership_data.split('=')[1].strip().replace(';', ''))
print(ownership_data_dict)
Output:
> {'id': 1029025, 'num_points': 1616, 'timestamps': [1491004800,
> 1491091200, 1491177600, 1491264000, 1491350400, 1491436800,
> 1491523200, 1491609600, 1491696000, 1491782400, 1491868800,
> 1491955200, 1492041600, 1492128000, 1492214400, 1492300800,
> 1492387200, 1492473600, 1492560000, 1492646400, 1492732800,
> 1492819200, ...}
import requests
import json
import re
r = requests.get('...')
m = re.search(r'var history_data\s+=\s+(.*)', r.text)
print(json.loads(m.group(1)))
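The two answers above differ mainly in the parser. `ast.literal_eval` accepts Python literal syntax, while `json.loads` requires strict JSON (double-quoted strings, `true`/`false`/`null`). Here is a minimal offline sketch of the regex-plus-JSON route; the page source, the variable name `owner_data`, and its values are all made up for illustration:

```python
import json
import re

# Hypothetical page source; the variable name and values are made up.
page_source = """
<script>
var item_id = 42;
var owner_data = {"id": 42, "owners": ["alice", "bob"], "counts": [1, 2]};
</script>
"""

# Capture the object literal between "var owner_data =" and the closing "};".
match = re.search(r"var\s+owner_data\s*=\s*(\{.*?\});", page_source, re.DOTALL)
owner_data = json.loads(match.group(1)) if match else None
print(owner_data)
```

The non-greedy pattern stops at the first `};`, so it would need adjusting if that sequence can appear inside the data itself.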

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the page source, the data sits in a CDATA section inside a jQuery call. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
If you view the page source of the URL, you will find two script elements that contain CDATA, but only the one you are interested in also contains jQuery, so you can select the script element on that basis. After that, you need to do some cleaning to get rid of the CDATA markers and the jQuery wrapper. Then, with the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
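As an alternative to the chain of `replace` calls, a single regular expression can capture the JSON argument directly. This is a sketch, not the answer's method. The script body below is a trimmed, made-up stand-in modeled on the wrapper described above, reusing the `redshift` value and key path from this thread:

```python
import json
import re

# Hypothetical script body modeled on the wrapper described above; heavily trimmed.
script_text = """
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, {"objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});
//--><!]]>
"""

# Capture the JSON object passed as the second argument to jQuery.extend(...).
match = re.search(
    r"jQuery\.extend\(Drupal\.settings,\s*(\{.*\})\);", script_text, re.DOTALL
)
settings = json.loads(match.group(1))
redshift = settings["objectFlot"]["plotMain1"]["params"]["redshift"]
print(redshift)
```

The greedy `.*` runs to the last `});` in the text, which suits a script element holding a single `jQuery.extend` call; a script with several calls would need a narrower pattern.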

Finding name and codes of all airports

I am trying to scrape data to get the text I need. I want to find the line that says aberdeen and all lines after it which contain the airport info. Here is a pic of the html hierarchy:
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. Here is a link to the data if curious. I am new to scraping obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
Why are people suggesting Selenium? This page doesn't load its data dynamically. requests + re is all you need; you don't even need Beautiful Soup.
import re
import requests
data = requests.get('http://www.airportcodes.org/').text  # .text returns str, which the str pattern below requires
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
The pattern matches a run of letters (commas and spaces included) followed by exactly 3 uppercase letters in parentheses.
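To see what the pattern returns without hitting the site, here is a small offline sketch. The `text` value is a hand-written excerpt in the page's "City (CODE)" format, not the real page content:

```python
import re

# Hand-written sample in the "City (CODE)" format; the real page lists many more airports.
text = "Aberdeen, SD (ABR)\nAberdeen, Scotland (ABZ)\nAbidjan, Ivory Coast (ABJ)"

pattern = re.compile(r"([A-Za-z, ]+)\(([A-Z]{3})\)")

# findall returns one (name, code) tuple per match because the pattern has two groups.
airports = [(name.strip(), code) for name, code in pattern.findall(text)]
print(airports)
```

Note that the character class contains only letters, commas, and spaces, so names with hyphens, apostrophes, or accented letters would be cut short.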

Extract element from HTML with Python's BeautifulSoup library

I'm looking to extract data from Instagram and record the time of the post without using auth.
The below code gives me the HTML of the pages from the IG post, but I'm not able to extract the time element from the HTML.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot
To extract the time, you can use the HTML tag and its class:
time = soup.find("time", {"class": "_1o9PC Nzb55"}).text  # find returns one tag; findAll returns a list, which has no .text
I'm guessing that the picture you've shared is a browser inspector screenshot. Although inspecting the page is a good starting point for web scraping, you should check what BeautifulSoup actually receives. If you print soup, you will see that the data you are looking for is JSON inside a script tag. So your code, and any other solution that targets the time tag, won't work with BS4. You could try Selenium instead.
Anyway, here is a BeautifulSoup pseudo-solution using the Instagram account from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time
url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
pattern = re.compile(r"window\._sharedData\ = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))
times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp'])))
The times variable is the number of timestamps the JSON contains. It may look like hell, but it's just a matter of patiently following the JSON structure and indexing accordingly.
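The long index chain is easier to read if the nested list is bound to a name once and then iterated directly. This is a minimal sketch on a made-up, heavily trimmed payload. The key names follow the structure above, the timestamps are invented, and UTC is used here instead of local time so the result does not depend on the machine:

```python
import json
from datetime import datetime, timezone

# Hypothetical, heavily trimmed payload; the timestamps are made up.
raw = (
    '{"user": {"edge_owner_to_timeline_media": {"edges": ['
    '{"node": {"taken_at_timestamp": 0}}, '
    '{"node": {"taken_at_timestamp": 86400}}]}}}'
)
data = json.loads(raw)

# Bind the nested list once, then loop over it instead of indexing with range().
edges = data["user"]["edge_owner_to_timeline_media"]["edges"]
post_times = [
    datetime.fromtimestamp(edge["node"]["taken_at_timestamp"], tz=timezone.utc)
    .strftime("%Y-%m-%d %H:%M:%S")
    for edge in edges
]
print(post_times)
```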
