Download a search result from Twitter using web scraping - Python

I am new to Python and even newer to web scraping, so my question might be very basic.
I am trying to use web scraping to download some results from Twitter searches. I have already worked out how the search URLs are constructed, so I am accessing those URLs directly.
I expect most of my searches to return no results, and in those cases I would like to extract that fact. I'm going to use an example of a search which returns no results. The URL would be:
https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query
That would return a page showing the message:
No results for ""John Doe" stackexchange trial"
I am trying to extract that text (the class names in my code below come from the HTML of that part of the page), but something in my code is not working.
The code I am trying is the following:
from bs4 import BeautifulSoup
import urllib.request
urlpage = "https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find("main", class_="css-1dbjc4n r-16y2uox r-1wbh5a2")
text_element = results.find("span", class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
text = text_element.text
print(text)
I believe the problem is in how I define results: that first find() is not locating what I want.
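(As a diagnostic sketch, not part of the original post: printing that first lookup shows whether the element is in the HTML urllib receives at all; if it is None, the content is rendered by JavaScript after the page loads.)
import urllib.request
from bs4 import BeautifulSoup

# Fetch the raw HTML exactly as urllib sees it; no JavaScript runs here
urlpage = "https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query"
soup = BeautifulSoup(urllib.request.urlopen(urlpage), 'html.parser')

# If this prints None, the <main> element is injected client-side and
# will never appear in the urllib response.
print(soup.find("main", class_="css-1dbjc4n r-16y2uox r-1wbh5a2"))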
I arrived at that version of the code by analogy with the following code, which does work:
urlpage="https://stackexchange.com/"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find(id='content')
print(results.prettify())
title = results.find('h3', class_='title')
print(title.text)
Thank you very much in advance for all your help!
Edit: apparently BeautifulSoup doesn't work for this (I'm not sure why; I think it has to do with the way Twitter loads its elements with JavaScript), so I had to use Selenium instead.
Here is a code that does the job:
from selenium import webdriver
import time

url = r'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
# Give Twitter's JavaScript time to render the results before querying
time.sleep(10)
element = driver.find_element_by_xpath("//span[contains(text(),'No results')]")
print(element.text)
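A slightly more robust variant (a sketch, using an explicit wait instead of a fixed sleep, with the same XPath):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = r'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
# Wait up to 20 seconds for the "No results" message to appear,
# instead of always sleeping for a fixed 10 seconds
element = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, "//span[contains(text(),'No results')]"))
)
print(element.text)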

Related

BeautifulSoup returns nothing even though the element exists

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.flaconi.de/haare/maria-nila/head-and-hair-heal/maria-nila-head-and-hair-heal-haarshampoo.html#sku=80021856-100')
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.find('div', class_='average-rating')
It returns nothing, but I am sure the content is on the website.
That value is stored in a script tag. You can regex it out of response.text, though I would unescape the HTML entities first to make the regex more readable:
import requests, re, html

r = requests.get('https://www.flaconi.de/haare/maria-nila/head-and-hair-heal/maria-nila-head-and-hair-heal-haarshampoo.html#sku=80021856-100')
# Unescape HTML entities first so the "ratingValue" key is easy to match
avg_rating = round(float(re.search(r'"ratingValue":(.*?),', html.unescape(r.text)).group(1)), 1)
print(avg_rating)
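A less brittle alternative, assuming the rating is embedded as schema.org JSON-LD (an assumption about the page; the regex above suggests the data sits in a script tag either way), is to parse the JSON instead of regexing it:
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.flaconi.de/haare/maria-nila/head-and-hair-heal/maria-nila-head-and-hair-heal-haarshampoo.html#sku=80021856-100')
soup = BeautifulSoup(r.text, 'html.parser')
# Look for schema.org JSON-LD blocks; this structure is an assumption
# and may not match the live page.
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string or '{}')
    if isinstance(data, dict):
        rating = data.get('aggregateRating', {}).get('ratingValue')
        if rating is not None:
            print(round(float(rating), 1))
            break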
There is no need to use BS here. Selenium can find the element no problem.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.flaconi.de/haare/maria-nila/head-and-hair-heal/maria-nila-head-and-hair-heal-haarshampoo.html#sku=80021856-100')
print(driver.find_element_by_class_name('average-rating').text)
Output
4.8

BeautifulSoup not extracting all pages

I'm trying to practice some web scraping for a school project, but I can't figure out why my script isn't pulling all the listings for a particular region. I would appreciate any help! I've been trying to figure it out for hours!
(For simplicity, I'm just sharing one small sub-section of a page I'm trying to scrape. I'm hoping that once I figure out what's wrong here, I can apply it to other regions.)
(You might need to create an account and log in to see prices before scraping.)
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://condos.ca')

def get_page(page):
    url = f'https://condos.ca/toronto/condos-for-sale?size_range=300%2C999999999&property_type=Condo%20Apt%2CComm%20Element%20Condo%2CLeasehold%20Condo&mode=Sold&end_date_unix=exact%2C2011&sublocality_id=22&page={page}'
    driver.get(url)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    return soup
prices = []
location = []

for page in range(5):
    soup = get_page(page)
    for tag in soup.find_all('div', class_='styles___AskingPrice-sc-54qk44-4 styles___ClosePrice-sc-54qk44-5 dHPUdq hwkkXU'):
        prices.append(tag.get_text())
    for tag in soup.find_all('address', class_='styles___Address-sc-54qk44-13 gTwVlm'):
        location.append(tag.get_text())
For some reason, I'm only getting an output with 48 records, when it should be around 146.
Thanks!
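One thing worth ruling out (an editorial sketch, not part of the original question: condos.ca renders its listing cards with JavaScript, so page_source may be captured before every card exists) is to wait for the cards before parsing. Also note that range(5) requests pages 0 through 4; if the site's paging starts at 1, two of those requests may return the same listings.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def get_page(page):
    # Same URL as in the question
    url = f'https://condos.ca/toronto/condos-for-sale?size_range=300%2C999999999&property_type=Condo%20Apt%2CComm%20Element%20Condo%2CLeasehold%20Condo&mode=Sold&end_date_unix=exact%2C2011&sublocality_id=22&page={page}'
    driver.get(url)
    # Wait until at least one address card has rendered before grabbing the source
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.TAG_NAME, 'address'))
    )
    return BeautifulSoup(driver.page_source, 'lxml')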

Want to get the text from the li tag using selenium

I want the text from the li tags that hold the product specifications, but when I search for them using driver.find_element_by_css_selector it raises an error saying the element cannot be found, so I am not able to get the text.
from selenium import webdriver

chrome_path = r'C:/Users/91940/AppData/Local/Programs/Python/Python39/Scripts/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_path)
driver.implicitly_wait(10)
driver.get("https://www.lazada.sg/products/samsung-galaxy-watch3-bt-45mm-titanium-i1156462257-s4537770883.html?search=1&freeshipping=1")
speci = driver.find_element_by_css_selector('data-spm-anchor-id="a2o42.pdp_revamp.product_detail.i17.5fa031ceGZk42Z"')
How can I get the text from the li tags? When I run the code above, it gives the error "No such element: unable to locate the element".
There are anti-scraping measures in place. If those do not affect you, then you can use CSS classes to target the li elements to loop over, and the title/value classes within each specification:
specs = [
    (i.find_element_by_css_selector('.key-title').text,
     i.find_element_by_css_selector('.key-value').text)
    for i in driver.find_elements_by_css_selector('.key-li')
]
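For example, to print each pair afterwards:
for title, value in specs:
    print(f'{title}: {value}')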
You can also regex the required info out of a script tag and just use requests and json to parse out the spec (there is other info, including reviews, in data):
import re, json, requests

r = requests.get("https://www.lazada.sg/products/samsung-galaxy-watch3-bt-45mm-titanium-i1156462257-s4537770883.html?search=1&freeshipping=1",
                 headers={'User-Agent': 'Mozilla/5.0'})
html = r.text
#html = driver.page_source
data = json.loads(re.search(r'var __moduleData__ = (.*);', html).group(1))
print(data['data']['root']['fields']['specifications'])
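Note that this is tied to the page still embedding var __moduleData__ in a script tag; if the markup changes, re.search returns None and .group(1) raises an AttributeError, so it is worth checking the match before using it.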

Scraping using Inspect element

I am trying to get some information from Instagram by scraping it. I tried this code on Twitter and it was working fine, but it shows no result on Instagram. Both pieces of code are below.
Twitter code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://twitter.com/realmadrid"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"ProfileHeaderCard"}))
Result: the profile card prints as expected.
Instagram Code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Result: None
If you look at the source, you will see the content is dynamically loaded, so there is no div._bugdy in what is returned by your request. Depending on what you want, you may be able to pull it from the script JSON:
import requests
import re
import json
from bs4 import BeautifulSoup
from pprint import pprint as pp

r = requests.get("https://www.instagram.com/barackobama/")
soup = BeautifulSoup(r.content, "html.parser")
# The profile data is embedded in a script that assigns window._sharedData
js = soup.find("script", text=re.compile("window._sharedData")).text
_json = json.loads(js[js.find("{"):js.rfind("}") + 1])
pp(_json)
That gives you everything you see in the <script type="text/javascript">window._sharedData = ... block in the source returned.
If you want to get the followers, then you will need to use something like Selenium. The site is pretty much all dynamically loaded content; to get the followers, you need to click a link which is only visible when you are logged in. This will get you closer to what you want:
from selenium import webdriver
import time

login = "https://www.instagram.com"
dr = webdriver.Chrome()
dr.get(login)
# Open the login form, fill in the credentials and submit
dr.find_element_by_xpath("//a[@class='_k6cv7']").click()
dr.find_element_by_xpath("//input[@name='username']").send_keys("youruname")
dr.find_element_by_xpath("//input[@name='password']").send_keys("yourpass")
dr.find_element_by_css_selector("button._aj7mu._taytv._ki5uo._o0442").click()
time.sleep(5)
dr.get("https://www.instagram.com/barackobama")
dr.find_element_by_css_selector('a[href="/barackobama/followers/"]').click()
time.sleep(3)
# Iterate over the list items in the followers popup
for li in dr.find_element_by_css_selector("div._n3cp9._qjr85").find_elements_by_xpath("//ul/li"):
    print(li.text)
That pulls some text from the li tags that appear in the popup after you click the link; you can pull whatever you want from the unordered list.
First of all there seems to be a typo in the address on line 3.
from bs4 import BeautifulSoup
from urllib2 import urlopen
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Secondly, since you are working with dynamically loaded content, your script might not see all the content you see when browsing the page in your browser.
To solve that there are webdrivers such as Selenium WebDriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/) which emulate a browser and can wait for JavaScript to generate/display data before looking it up.

Selenium: how to get the entire html as a string?

I am using Selenium with Python. See the following code:
from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://finance.yahoo.com/q?s=APP")
Now, I want to do one simple thing: get the HTML of that page as a string from the driver. Then I can use BeautifulSoup to parse it. Does anyone know how?
Actually, I don't know how to access any information from this driver, e.g., how to get the stock price of Apple in this case.
I am totally new to Selenium. A good tutorial for it would be highly appreciated.
Thank you!
Have a look at the following code.
from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://finance.yahoo.com/q?s=APP")
page_html = driver.page_source
In page_html you will have the HTML of the opened page.
You're looking for page_source.
To continue along with your example:
soup = BeautifulSoup(driver.page_source)
As another commenter noted, however, you could use a library like requests to the same effect:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://finance.yahoo.com/q?s=APP')
soup = BeautifulSoup(r.content, 'html.parser')
