Using Python to Scrape Nested Divs and Spans in Twitter? - python

I'm trying to scrape the likes and retweets from the results of a Twitter search.
After running the Python below, I get an empty list, []. I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back.
The code I'm using is:
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
I can successfully save the html to file using this code. It is missing large amounts of information when I search the text, such as the class names I am looking for...
So (part of) the problem is apparently in accurately accessing the source code.
filename = 'newfile2.txt'
with open(filename, 'w') as handle:
    handle.write(data)
This screenshot shows the span that I'm trying to scrape.
I've looked at this question, and others like it, but I'm not quite getting there.
How can I use BeautifulSoup to get deeply nested div values?

It seems that your GET request returns valid HTML but with no tweet elements in the #timeline element. However, adding a user agent to the request headers seems to remedy this.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
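Before adjusting selectors, it can help to confirm whether the class name is in the raw response at all, which is what the saved file in the question revealed. A minimal sketch, assuming the HTML is already held in a string; `missing_markers` and both sample strings are hypothetical and are not Twitter's real markup:

```python
def missing_markers(html, markers):
    """Return the markers (class names, ids, ...) that never appear in the raw HTML."""
    return [marker for marker in markers if marker not in html]


# Hypothetical stand-ins for a response with and without tweet elements.
with_tweets = '<span class="ProfileTweet-actionCountForPresentation">12</span>'
without_tweets = '<div id="timeline"></div>'

wanted = ["ProfileTweet-actionCountForPresentation"]
print(missing_markers(with_tweets, wanted))     # nothing missing
print(missing_markers(without_tweets, wanted))  # the class name is absent
```

If the marker is missing from `r.text` itself, no selector will find it, and the request (headers, URL) is the thing to change rather than the parsing code.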

Related

Why do I always get an empty list when selecting a tag or CSS property with Python?

I just started studying Python, requests and BeautifulSoup.
I'm using VS Code and my Python version is 3.10.8.
I want to get the HTML of the 'taw' element from a Google results page, but I can't get it. The result is always an empty list.
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?client=safari&rls=en&q=프로그래밍+공부&ie=UTF-8&oe=UTF-8'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
find = soup.select('#taw')
print(find)
Here is the HTML containing the 'taw' element that I tried to get:
(Sorry for using an image instead of code.)
The taw element contains Google's ad section, and I want to scrape it. I tried other CSS selectors and tags, but the result is always an empty list. I also tried soup.find, but I got None.
For various reasons, you don't always get exactly the same HTML via Python's requests.get as what you see in your browser. Sometimes it's because of blockers or JavaScript loading, but for this specific page and element, it's just that Google formats the response a bit differently based on the source of the request. Try adding some headers:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.google.com/search?client=safari&rls=en&q=프로그래밍+공부&ie=UTF-8&oe=UTF-8'
response = requests.get(url, headers=headers)
try:
    response.raise_for_status()  # just a good habit to check; raises on a 4xx/5xx status
except requests.HTTPError as reqErr:
    print(f'!"{reqErr}" - while getting ', url)
soup = BeautifulSoup(response.content, 'html.parser')
find = soup.select('#taw')
if not find:  # save html to check [IN AN EDITOR, NOT a browser] if expected elements are missing
    hfn = 'x.html'
    with open(hfn, 'wb') as f:
        f.write(response.content)
    print(f'saved html to "{hfn}"')
print(find)
The reqErr and if not find.... parts are just to help understand why in case you don't get the expected results. They're helpful for debugging in general for requests+bs4 scraping attempts.
The printed output I got with the code above was:
[<div id="taw"><div data-ved="2ahUKEwjvjrj04ev7AhV3LDQIHaXeDCkQL3oECAcQAg" id="oFNiHe"></div><div id="tvcap"></div></div>]
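When the saved HTML is large, listing the ids it actually contains can be quicker than reading it in an editor. A rough sketch using only the standard library's html.parser (swapped in here for BeautifulSoup so it runs without extra packages); `IdCollector` and `list_ids` are hypothetical names, and the sample string is a trimmed copy of the output above:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def list_ids(html):
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


sample = '<div id="taw"><div id="oFNiHe"></div><div id="tvcap"></div></div>'
print(list_ids(sample))
```

Running `list_ids` on the text of the saved x.html shows at a glance whether `taw` was in the response at all.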

I can't access the text in the span using BeautifulSoup

Hi everyone, I receive an error message when executing this code:
from bs4 import BeautifulSoup
import requests
import html.parser
from requests_html import HTMLSession
session = HTMLSession()
response = session.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all("tr")
for table in tables:
    movie_name = table.find("span", class_ = "secondaryInfo")
    print(movie_name)
output:
movie_name = table.find("span", class_ = "secondaryInfo").text
AttributeError: 'NoneType' object has no attribute 'text'
Your loop starts with the first row, which is the header. The header has no span with that class (it doesn't list the prices), so find returns None and .text raises the error. An alternative is to simply exclude the header with a CSS selector of nth-child(n+2). You also only need requests.
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht")
soup = BeautifulSoup(response.content, 'html.parser')
for row in soup.select('tr:nth-child(n+2)'):
    movie_name = row.find("span", class_ = "secondaryInfo")
    print(movie_name.text)
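Another option is to keep every row and skip the ones where nothing matched. The sketch below illustrates that None check with the standard library's xml.etree.ElementTree on a small, well-formed, made-up table, as a stand-in for BeautifulSoup; `secondary_info` and the sample rows are hypothetical. The same `is None` test applies to the result of `row.find(...)` in BeautifulSoup, which also returns None when nothing matches.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one header row without the span, two data rows with it.
SAMPLE_TABLE = """<table>
  <tr><th>Title</th><th>Weekend</th></tr>
  <tr><td>Movie A <span class="secondaryInfo">(2021)</span></td></tr>
  <tr><td>Movie B <span class="secondaryInfo">(2020)</span></td></tr>
</table>"""


def secondary_info(table_xml):
    root = ET.fromstring(table_xml)
    values = []
    for row in root.iter("tr"):
        span = row.find(".//span[@class='secondaryInfo']")
        if span is None:
            # Header row: nothing matched, so skip it instead of calling .text
            continue
        values.append(span.text)
    return values


print(secondary_info(SAMPLE_TABLE))
```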
Just use the SelectorGadget Chrome extension to grab a CSS selector by clicking on the desired element in your browser, without inventing anything superfluous. However, it doesn't always work well when the HTML structure is messy.
You're looking for this:
for result in soup.select(".titleColumn a"):
    movie_name = result.text
Also, there's no need to use HTMLSession if you don't need to persist certain parameters across requests to the same host (website).
Full code:
from bs4 import BeautifulSoup
import requests
# user-agent is used to act as a real user visit
# this could reduce the chance (a little bit) of being blocked by a website
# and prevent from IP limit block or permanent ban
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get("https://www.imdb.com/chart/boxoffice/?ref_=nv_ch_cht", headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for result in soup.select(".titleColumn a"):
    movie_name = result.text
    print(movie_name)
# output
'''
Eternals
Dune: Part One
No Time to Die
Venom: Let There Be Carnage
Ron's Gone Wrong
The French Dispatch
Halloween Kills
Spencer
Antlers
Last Night in Soho
'''
P.S. There's a dedicated web scraping blog of mine. If you need to parse search engines, have a try using SerpApi.
Disclaimer, I work for SerpApi.

Trying to access hidden <div> tags when web scraping in python

I'm trying to extract some data from a website by web scraping with Python, but some of the div tags are not expanding to show the data that I want.
This is my code.
import requests
from bs4 import BeautifulSoup as soup
uq_url = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
}
web_r = requests.get(uq_url, headers=headers)
web_soup = soup(web_r.text, 'html.parser')
print(web_soup.prettify())
This is what the code will scrape but it won't extract any of the data in the div with id="app". It's supposed to have a lot of data in there like the second picture. Any help would be appreciated.
All of that content is present within a script tag, as shown in your image. You can use a regex to pull out the appropriate JavaScript object, then handle the unquoted keys with hjson in order to convert it to JSON. Then extract whatever you want:
import requests, re, hjson
from bs4 import BeautifulSoup as bs #there is some data as embedded html you may wish to parse later from json
r = requests.get('https://my.uq.edu.au/programs-courses/requirements/program/2451/2021', headers = {'User-Agent':'Mozilla/5.0'})
data = hjson.loads(re.search(r'window\.AppData = ([\s\S]+?);\n' , r.text).group(1))
# hjson.dumpsJSON(data['programRequirements'])
core_courses = data['programRequirements']['payload']['components'][1]['payload']['body'][0]['body']
for course in core_courses:
    if 'curriculumReference' in course:
        print(course['curriculumReference'])
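The regex step can be tried offline on a small sample. This is only a sketch under stated assumptions: the sample page is made up, `extract_app_data` is a hypothetical helper, and the embedded object here is strict JSON so the standard json module can load it. If the keys are unquoted, as on the page in the question, hjson is still needed as in the answer above.

```python
import json
import re

# Made-up page: the data lives in a script tag, not in the #app div.
SAMPLE_HTML = """<html><body><div id="app"></div>
<script>
window.AppData = {"programRequirements": {"payload": {"title": "Example Program"}}};
</script>
</body></html>"""


def extract_app_data(html, variable="window.AppData"):
    """Return the object assigned to `variable`, or None if it is not in the page."""
    pattern = re.escape(variable) + r"\s*=\s*([\s\S]+?);\s*\n"
    match = re.search(pattern, html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_app_data(SAMPLE_HTML)
print(data["programRequirements"]["payload"]["title"])
```

Returning None when the assignment is missing makes it easy to tell "the page layout changed" apart from "the JSON path is wrong".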

Python 3 BeautifulSoup Scraping Content After "Read More" Text

I've recently started looking into purchasing some land, and I'm writing a little app to help me organize details in Jira/Confluence to help me keep track of who I've talked to and what I talked to them about in regards to each parcel of land individually.
So, I wrote this little scraper for landwatch(dot)com:
[url is just a listing on the website]
from bs4 import BeautifulSoup
import requests
def get_property_data(url):
    headers = ({'User-Agent':
                'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
    response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
    soup = BeautifulSoup(response.text, 'html5lib')
    title = soup.find_all(class_='b442a')[0].text
    details = soup.find_all('p', class_='d19de')
    price = soup.find_all('div', class_='_260f0')[0].text
    deets = []
    for i in range(len(details)):
        if details[i].text != '':
            deets.append(details[i].text)
    detail = ''
    for i in deets:
        detail += '<p>' + i + '</p>'
    return [title, detail, price]
Everything works great except that the class d19de has a ton of values hidden behind the Read More button.
While Googling away at this, I discovered How to Scrape reviews with read more from Webpages using BeautifulSoup, however I either don't understand what they're doing well enough to implement it, or this just doesn't work anymore:
import requests ; from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get("http://www.mouthshut.com/product-reviews/Lakeside-Chalet-Mumbai-reviews-925017044").text, "html.parser")
for title in soup.select("a[id^=ctl00_ctl00_ContentPlaceHolderFooter_ContentPlaceHolderBody_rptreviews_]"):
    items = title.get('href')
    if items:
        broth = BeautifulSoup(requests.get(items).text, "html.parser")
        for item in broth.select("div.user-review p.lnhgt"):
            print(item.text)
Any thoughts on how to bypass that Read More button? I'm really hoping to do this in BeautifulSoup, and not selenium.
Here's an example URL for testing: https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403
That data is present within a script tag. Here is an example of extracting that content, parsing with json, and outputting land description info as a list:
from bs4 import BeautifulSoup
import requests, json
url = 'https://www.landwatch.com/huerfano-county-colorado-recreational-property-for-sale/pid/410454403'
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
response = requests.get(url, headers=headers) # Maybe request Url with read more already gone
soup = BeautifulSoup(response.text, 'html5lib')
all_data = json.loads(soup.select_one('[type="application/ld+json"]').string)
details = all_data['description'].split('\r\r')
You may wish to examine what else is in that script tag:
from pprint import pprint
pprint(all_data)
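The same idea can be tested without a network request. The sketch below pulls application/ld+json blocks out of a made-up page with the standard library's html.parser (used in place of BeautifulSoup so it runs without extra packages) and splits the description on '\r\r' as in the answer; `JsonLdCollector`, `extract_json_ld`, and the listing text are all hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Parse the contents of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


def extract_json_ld(html):
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


# Made-up listing; a real page carries many more fields.
listing = {"@type": "Product", "description": "40 acres.\r\rPower at the road."}
SAMPLE_HTML = ('<html><head><script type="application/ld+json">'
               + json.dumps(listing) + '</script></head><body></body></html>')

details = extract_json_ld(SAMPLE_HTML)[0]["description"].split("\r\r")
print(details)
```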

Why is the data retrieved showing as blank instead of outputting the correct numbers?

I can't seem to see what is missing. Why is the response not printing the ASINs?
import requests
import re
urls = [
    'https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2',
    'https://www.amazon.com/s?k=ps4+game&ref=nb_sb_noss_2'
]
for url in urls:
    content = requests.get(url).content
    decoded_content = content.decode()
    asins = set(re.findall(r'/[^/]+/dp/([^"]+)', decoded_content))
    print(asins)
Output:
set()
set()
[Finished in 0.735s]
Regular expressions should not be used to parse HTML; answers to questions like this on Stack Overflow consistently recommend against it. It is difficult to write a regular expression robust enough to get the data-asin value from each <div>. The BeautifulSoup library makes this task easier. But if you must use regex, this code will return everything inside the body tags:
re.findall(r'<body.*?>(.+?)</body>', decoded_content, flags=re.DOTALL)
Also, print decoded_content and read the HTML. You might not be receiving the same website that you see in the web browser. Using your code I just get an error message from Amazon or a small test to see if I am a robot. If you do not have real headers attached to your request, big websites like Amazon will not return the page you want. They try to prevent people from scraping their site.
Here is some code that works using the BeautifulSoup library. You need to install the library first: pip3 install bs4.
from bs4 import BeautifulSoup
import requests
def getAsins(url):
    headers = requests.utils.default_headers()
    headers.update({'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36','Accept-Language': 'en-US, en;q=0.5'})
    decoded_content = requests.get(url, headers=headers).content.decode()
    soup = BeautifulSoup(decoded_content, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
'''
result = getAsins('https://www.amazon.com/s?k=xbox+game&ref=nb_sb_noss_2')
print(result)
{None: 'B07RBN5C9C', '8652921a-81ee-4e15-b12d-5129c3d35195': 'B07P15JL3T', 'cb25b4bf-efc3-4bc6-ae7f-84f69dcf131b': 'B0886YWLC9', 'bc730e28-2818-472d-bc03-6e9fb97dcaad': 'B089F8R7SQ', '339c4ca0-1d24-4920-be60-54ef6890d542': 'B08GQW447N', '4532f725-f416-4372-8aa0-8751b2b090cc': 'B08DD5559K', 'a0e17b74-7457-4df7-85c9-5eefbfe4025b': 'B08BXHCQKR', '52ef86ef-58ac-492d-ad25-46e7bed0b8b9': 'B087XR383W', '3e79c338-525c-42a4-80da-4f2014ed6cf7': 'B07H5VVV1H', '45007b26-6d8c-4120-9ecc-0116bb5f703f': 'B07DJW4WZC', 'dc061247-2f4c-4f6b-a499-9e2c2e50324b': 'B07YLGXLYQ', '18ff6ba3-37b9-44f8-8f87-23445252ccbd': 'B01FST8A90', '6d9f29a1-9264-40b6-b34e-d4bfa9cb9b37': 'B088MZ4R82', '74569fd0-7938-4375-aade-5191cb84cd47': 'B07SXMV28K', 'd35cb3a0-daea-4c37-89c5-db53837365d4': 'B07DFJJ3FN', 'fc0b73cc-83dd-44d9-b920-d08f07be76eb': 'B07KYC1VL7', 'eaeb69d1-a2f9-4ea4-ac97-1d9a955d706b': 'B076PRWVFG', '0aafbb75-1bac-492c-848e-a046b2de9978': 'B07Q47W1B4', '9e373245-9e8b-4564-a32f-42baa7b51d64': 'B07C4SGGZ2', '4af7587a-98bf-41e0-bde6-2a2fad512d95': 'B07SJ2T3CW', '8635a92e-22a7-4474-a27d-3db75c75e500': 'B08D44W56B', '49d752ce-5d68-4323-be9b-3cbb34c8b562': 'B086JQGB7W', '6398531f-6864-4c7b-9879-84ee9de57d80': 'B07XD3TK36'}
'''
If you are reading html from a file then:
from bs4 import BeautifulSoup
import requests
def getAsins(location_to_file):
    with open(location_to_file) as file:  # close the file once parsing is done
        soup = BeautifulSoup(file, 'html.parser')
    asins = {}
    for asin in soup.find_all('div'):
        if asin.get('data-asin'):
            asins[asin.get('data-uuid')] = asin.get('data-asin')
    return asins
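The attribute lookup itself can be checked on a tiny inline sample before pointing it at a saved page. A minimal sketch using only the standard library's html.parser rather than BeautifulSoup; `AsinCollector` and `get_asins` are hypothetical names, and the sample markup is made up apart from one ASIN copied from the output above:

```python
from html.parser import HTMLParser


class AsinCollector(HTMLParser):
    """Collect non-empty data-asin values from <div> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.asins = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        asin = dict(attrs).get("data-asin")
        if asin:  # skips divs without the attribute and divs where it is empty
            self.asins.append(asin)


def get_asins(html):
    collector = AsinCollector()
    collector.feed(html)
    collector.close()
    return collector.asins


sample = ('<div data-asin="B07RBN5C9C"></div>'
          '<div data-asin=""></div>'
          '<div class="other"></div>')
print(get_asins(sample))
```

Returning a list rather than a dict keyed by data-uuid is a deliberate choice in this sketch: several divs that lack a data-uuid would all land under the single key None in a dict, and only the last one would survive.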
