How to get twitter profile name using python BeautifulSoup module?

I'm trying to get a Twitter profile name from the profile URL with BeautifulSoup in Python,
but whatever HTML tags I use, I'm not able to get the name. What HTML tags can I use to
get the profile name from a Twitter user page?
import requests
from bs4 import BeautifulSoup

url = 'https://twitter.com/twitterID'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Find the display name
name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
if name_element is not None:
    display_name = name_element.text
else:
    display_name = "error"

html = requests.get(url).text
Twitter profile pages cannot be scraped with a plain requests call like this, because their contents are loaded with JavaScript [via the API], as you can see if you check the fetched HTML or preview the page source in your browser's network logs.
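You can confirm this from Python itself; a minimal sketch (checking for the data-testid="UserName" container that the selectors below rely on):
import requests
from bs4 import BeautifulSoup

html = requests.get('https://twitter.com/twitterID').text
soup = BeautifulSoup(html, 'html.parser')
# the profile markup is rendered client-side, so this will most likely print None
print(soup.find('div', {'data-testid': 'UserName'}))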
name_element = soup.find('span', {'class':'css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0'})
display_name = name_element.text
Even after fetching the right HTML, calling .find like that will result in display_name containing 'To view keyboard shortcuts, press question mark' or 'Don’t miss what’s happening' because there are 67 span tags with that class. Calling .find_all(....)[6] might work but it's definitely not a reliable approach. You should instead consider using .select with CSS selectors to target the name.
name_element = soup.select_one('div[data-testid="UserName"] span>span')
The .find equivalent would be
# name_element = soup.find('div', {'data-testid': 'UserName'}).span.span ## too many weak points
name_element = soup.find(lambda t: t.name == t.parent.name == 'span' and t.find_parent('div', {'data-testid': 'UserName'}))
but I find .select much more convenient.
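For completeness, a minimal sketch of the original display_name logic rewritten with that selector (this still assumes soup was built from fully rendered HTML, e.g. fetched with Selenium as in the next section, not from requests.get(url).text):
# assumes `soup` was parsed from fully rendered profile HTML
name_element = soup.select_one('div[data-testid="UserName"] span>span')
display_name = name_element.text if name_element is not None else "error"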
Selenium Example
Using two functions I often use for scraping - linkToSoup_selenium (which takes a URL and returns a BeautifulSoup object after using selenium and bs4 to load and parse the HTML), and selectForList (which extracts details from bs4 Tags based on selectors [like in the selectors dictionary below])
Setup:
# imports ## PASTE FROM https://pastebin.com/kEC9gPC8
# def linkToSoup_selenium... ## PASTE FROM https://pastebin.com/kEC9gPC8
# def selectForList... ## PASTE FROM https://pastebin.com/ZnZ7xM6u
## JUST FOR REDUCING WHITESPACE - not important for extracting information ##
def miniStr(o): return ' '.join(w for w in str(o).split() if w)
profileUrls = ['https://twitter.com/twitterID', 'https://twitter.com/jokowi', 'https://twitter.com/sep_colin']
# ptSel = 'article[data-testid="tweet"]:has(div[data-testid="socialContext"])'
# ptuaSel = 'div[data-testid="User-Names"]>div>div>div>a'
selectors = {
    'og_url': ('meta[property="og\:url"][content]', 'content'),
    'name_span': 'div[data-testid="UserName"] span>span',
    'name_div': 'div[data-testid="UserName"]',
    # 'handle': 'div[data-testid="UserName"]>div>div>div+div',
    'description': 'div[data-testid="UserDescription"]',
    # 'location': 'span[data-testid="UserLocation"]>span',
    # 'url_href': ('a[data-testid="UserUrl"][href]', 'href'),
    # 'url_text': 'a[data-testid="UserUrl"]>span',
    # 'birthday': 'span[data-testid="UserBirthdate"]',
    # 'joined': 'span[data-testid="UserJoinDate"]>span',
    # 'following': 'div[data-testid="UserName"]~div>div>a[href$="\/following"]',
    # 'followers': 'div[data-testid="UserName"]~div>div>a[href$="\/followers"]',
    # 'pinnedTweet_uname': f'{ptSel} div[data-testid="User-Names"] span>span',
    # 'pinnedTweet_handl': f'{ptSel} {ptuaSel}:not([aria-label])',
    # 'pinnedTweet_pDate': (f'{ptSel} {ptuaSel}[aria-label]', 'aria-label'),
    # 'pinnedTweet_text': f'{ptSel} div[data-testid="tweetText"]',
}
def scrapeTwitterProfile(profileUrl, selRef=selectors):
    soup = linkToSoup_selenium(profileUrl, ecx=[
        'div[data-testid="UserDescription"]'  # wait for user description to load
        # 'article[data-testid="tweet"]'  # wait for tweets to load
    ], tmout=3, by_method='css', returnErr=True)
    if not isinstance(soup, str):
        return selectForList(soup, selRef)
    return {'Error': f'failed to scrape {profileUrl} - {soup}'}
Setting returnErr=True returns the error message (a string instead of the BeautifulSoup object) if anything goes wrong. ecx should be set based on which part/s you want to load (it's a list, so it can contain multiple selectors). tmout doesn't have to be passed (the default is 25sec), but if it is, it should be adjusted for the other arguments and your own device and browser speeds - on my browser, tmout=0.01 is enough to load user details, but loading the first tweets takes at least tmout=2.
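If you'd rather not paste in those helper functions, the fetch-and-wait step can be approximated with plain Selenium; a rough sketch (assuming a local Chrome/chromedriver setup), mirroring the ecx selector and tmout used above:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_profile_soup(profileUrl, timeout=3):
    driver = webdriver.Chrome()
    try:
        driver.get(profileUrl)
        # wait until the user description has rendered before grabbing the page source
        WebDriverWait(driver, timeout).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'div[data-testid="UserDescription"]')))
        return BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()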
I wrote scrapeTwitterProfile mostly so that I could get tuDets [below] in one line. The for-loop after that is just for printing the results.
tuDets = [scrapeTwitterProfile(url) for url in profileUrls]
for url, d in zip(profileUrls, tuDets):
    print('\nFrom', url)
    for k, v in d.items(): print(f'\t{k}: {miniStr(v)}')
snscrape Example
snscrape has a Twitter module that can be used to access Twitter data without having registered for the API yourself. The example below prints a similar output to the previous example, but is faster.
# import snscrape.modules.twitter as sns_twitter
# def miniStr(o): return ' '.join(w for w in str(o).split() if w)
# profileIDs = [url.split('twitter.com/', 1)[-1].split('/')[0] for url in profileUrls]
profileIDs = ['twitterID', 'jokowi', 'sep_colin']
keysList = ['username', 'id', 'displayname', 'description', 'url']
for pid in profileIDs:
    tusRes, defVal = sns_twitter.TwitterUserScraper(pid).entity, 'no such attribute'
    print('\nfor ID', pid)
    for k in keysList: print('\t', k, ':', miniStr(getattr(tusRes, k, defVal)))
You can get most of the attributes in .entity with .__dict__, or print them all with something like
print('\n'.join(f'{a}: {miniStr(v)}' for a, v in [
    (n, getattr(tusRes, n)) for n in dir(tusRes)
] if not (a[:1] == '_' or callable(v))))
See this example from this tutorial if you are interested in scraping tweets as well.
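A minimal sketch along those lines (the tweet text attribute has been renamed across snscrape versions, so it's looked up defensively here):
import itertools
import snscrape.modules.twitter as sns_twitter

# print the three most recent tweets from one profile
for tweet in itertools.islice(sns_twitter.TwitterUserScraper('jokowi').get_items(), 3):
    text = getattr(tweet, 'rawContent', getattr(tweet, 'content', ''))
    print(tweet.date, miniStr(text))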

Related

Isolating a link with beautifulsoup

I have to scrape through the text of a website: link. I created a set using beautifulsoup of all the links on the page and then eventually I want to iterate through the set.
import requests
from bs4 import BeautifulSoup
url = 'https://crmhelpcenter.gitbook.io/wahi-digital/getting-started/readme'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
check = []
for link in links:
    link = 'https://crmhelpcenter.gitbook.io' + link.get('href')
    check.append(link)
print(check)
With this method it is not adding the sub-links of some of the links in the sidebar. I could loop through each page and add the links accordingly, but then I have to go through each link again and check whether it is already included in a set, which makes it expensive. Is there any way I can instead just isolate the "next" link that is on each page and go through that recursively till I reach the end?
Is there any way I can instead just isolate the "next" link that is on each page and go through that recursively till I reach the end?
If you mean the "Next" navigation buttons at the bottom of each page [shown as images in the original post], then you can look for a tags with data-rnwi-handle="BaseCard" that [because the "Previous" button has the same attribute] contain "Next" as the first [stripped] string (see aNxt below). You don't necessarily need recursion - since each page has at most one "Next" button, a while loop should suffice:
# from urllib.parse import urljoin # [ if you use it ]
rootUrl = 'https://crmhelpcenter.gitbook.io'
nxtUrl = f'{rootUrl}/wahi-digital/getting-started/readme'
nextUrls = [nxtUrl]
# allUrls = [nxtUrl] # [ if you want to collect ]
while nxtUrl:
    resp = requests.get(nxtUrl)
    print([len(nextUrls)], resp.status_code, resp.reason, 'from', resp.url)
    soup = BeautifulSoup(resp.content, 'html.parser')

    ### EXTRACT ANY PAGE DATA YOU WANT TO COLLECT ###
    # pgUrl = {urljoin(nxtUrl, a["href"]) for a in soup.select('a[href]')}
    # allUrls += [l for l in pgUrl if l not in allUrls]

    aNxt = [a for a in soup.find_all(
        'a', {'href': True, 'data-rnwi-handle': 'BaseCard'}
    ) if list(a.stripped_strings)[:1] == ['Next']]
    # nxtUrl = urljoin(nxtUrl, aNxt[0]["href"]) if aNxt else None
    nxtUrl = f'{rootUrl}{aNxt[0]["href"]}' if aNxt else None
    nextUrls.append(nxtUrl)  # the last item will [most likely] be None

# if nxtUrl is None: nextUrls = nextUrls[:-1]  # remove last item if None
On Colab this took about 3min to run and collect 344 [+1 for None] items in nextUrls and 2879 in allUrls; omitting or keeping allUrls does not seem to make any significant difference in this duration, since most of the delay is due to the requests (and some due to parsing).
You can also try to scrape all ~3k links with this queue-based crawler. [It took about 15min in my colab notebook.] The results of that, as well as nextUrls and allUrls have been uploaded to this spreadsheet.

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but don't know how to code it. Basically, I want to only display new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing that with another GET request (2nd variable) that maybe runs first thing in the morning. Alternatively, compare both text values instead of the len. If it doesn't match, then it shows the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser
def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code up to listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape
from bs4 import BeautifulSoup
import requests
import json
lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
As long as they don't tend to recycle the listing IDs, this should be fairly reliable. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
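If you'd rather automate that housekeeping, here's a small sketch (max_saved is an arbitrary choice, and listing_ids.json is the lFilename used above) that caps how many IDs stay saved:
import json

max_saved = 500  # arbitrary cap on how many listing IDs to keep
ids = json.load(open('listing_ids.json', 'r'))
with open('listing_ids.json', 'w') as f:
    json.dump(ids[-max_saved:], f)  # keep only the most recently added IDs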
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas
lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')
def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets
soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])
prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.
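As an example, a minimal sketch of inspecting the log later (the column names are the ones written by the code above):
import pandas

log = pandas.read_csv('listings.csv')
latest = log['last_scraped'].max()
# show only the listings flagged as new on the most recent scrape
print(log[(log['last_scraped'] == latest) & log['new_listing']][['listing_id', 'title']])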
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif not latestGuitar in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

Why is Python requests returning a different text value to what I get when I navigate to the webpage by hand?

I am trying to build a simple 'stock-checker' for a T-shirt I want to buy. Here is the link: https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069
As you can see, I am presented with 'Coming Soon' text, whereas usually if an item is in stock, it will show 'Add To Cart'.
I thought the simplest way would be to use requests and beautifulsoup to isolate this <button> tag, and read the value of text. If it eventually says 'Add To Cart', then I will write the code to email/message myself it's back in stock.
However, here's the code I have so far, and you'll see that the response says the button text contains 'Add To Cart', which is not what the website actually shows.
import requests
import bs4
URL = 'https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069'
def check_stock(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, "html.parser")
    buttons = soup.find_all('button', {'name': 'add'})
    return buttons

if __name__ == '__main__':
    buttons = check_stock(URL)
    print(buttons[0].text)
All the data is available as JSON inside a <script> tag, so we need to get that and extract the information we need. Let's use a simple slice by index to get clean JSON:
import requests
import json
url = 'https://yesfriends.co/products/mens-t-shirt-black'
response = requests.get(url)
index_start = response.text.index('product:', 0) + len('product:')
index_finish = response.text.index(', }', index_start)
json_obj = json.loads(response.text[index_start:index_finish])
for variant in json_obj['variants']:
    available = 'IN STOCK' if variant['available'] else 'OUT OF STOCK'
    print(variant['id'], variant['option1'], available)
OUTPUT:
40840532623533 XXS OUT OF STOCK
40840532656301 XS OUT OF STOCK
40840532689069 S OUT OF STOCK
40840532721837 M OUT OF STOCK
40840532754605 L OUT OF STOCK
40840532787373 XL OUT OF STOCK
40840532820141 XXL OUT OF STOCK
40840532852909 3XL IN STOCK
40840532885677 4XL OUT OF STOCK
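If you only care about the specific variant from the original URL (?variant=40840532689069, size S), here's a small sketch of the final check, reusing json_obj from above:
wanted_id = '40840532689069'  # the ?variant=... value from the product URL
variant = next(v for v in json_obj['variants'] if str(v['id']) == wanted_id)
if variant['available']:
    print('Back in stock - time to send that email/message')
else:
    print('Still out of stock')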

Webscraping Issue w/ BeautifulSoup

I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt
final_list=[]
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_ ='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_ = 'mb-0_2CX').text
        rating = soup.find('div', class_ = 'mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass
reviews = pd.DataFrame(final_list, columns = ['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all reviews, I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!
Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div', ...):
    name = soup.find('h4', ...)
    policy = soup.find('div', ...)
    ...
Notice that you are calling find inside the loop for the soup object. This means that each time you try to find the value for name, it will search the whole document from the beginning and return the first match, in every iteration.
This is why you are getting the same data over and over.
To fix this, you need to call find inside the current review div that you are currently at. That is:
for div in soup.find('div', ...):
    name = div.find('h4', ...)
    policy = div.find('div', ...)
    ...
Issue 2: Missing data and error handling
In your code, any errors inside the loop are ignored. However, there are many errors that are actually happening while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information. Hence, the code will extract h4, then try to find small, but won't find any, returning None. Then you are calling .text on that None object, causing an exception. Hence, this review will not be added to the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality, you don't need all these classes. The first class, in this case, is sufficient to identify the name h4 blocks:
name = soup.find('h4', class_ = 'my-0_27D')
Example:
Here's an example to extract the author names from the review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_ = 'my-0_27D')
    if name:
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...
The page serves broken HTML code, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser')

Beautifulsoup can't find text

I'm trying to write a scraper in python using urllib and beautiful soup. I have a csv of URLs for news stories, and for ~80% of the pages the scraper works, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of different tags that should capture the text as well as 'lxml' and 'html.parser.'
Here is the code:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)

for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text() #getting the title
        titlestring = str(titlemess) #make it a string
        title = titlestring.replace("\n", "").replace("\r","")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text() #get the body with markup
        bodystring = str(bodymess) #make body a string
        body = bodystring.replace("\n", "").replace("\u3000","") #scrub markup
        bodies1.append(body) #add to list for export
        timemess = soup.find('span',{"class":"time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r","").replace("年", "-").replace("月","-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1 #counter
        print(testcount)
    except Exception as e:
        print(testcount, e)
And here are some of the results I get (those marked 'nonetype' are the ones where the title was successfully pulled but body/time is empty)
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
Any help would be much appreciated! Thanks.
EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.
The issue is that there is no class="article" on the pages that have a picture, and the same goes for "class":"time". Consequently, it seems that you'll have to detect whether there's a picture on the page, and if there is, search for the date and text as follows:
For the date, try:
timemess = soup.find(id="pubtime").get_text()
For the body text, it seems that the article is rather just the caption for the picture. Consequently, you could try the following:
bodymess = soup.find('img').findNext().get_text()
In brief, the soup.find('img') finds the image and findNext() goes to the next block which, coincidentally, contains the text.
Thus, in your code, I would do something as follows:
try:
    bodymess = soup.find(class_="article").get_text()
except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()
except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
As a general workflow for web scraping, I usually open the website itself in a browser first and inspect the elements I need using the browser's developer tools.
