I'd like to scrape Google search result URLs with Python.
Here's my code:
import requests
from bs4 import BeautifulSoup
def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:])
    return result

search('computer')
Then I get a result. The first URL in the list is the Wikipedia one, which comes back as:
'https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQFggTMAA&usg=AOvVaw2nvT-2sO4iJenW_fkyCS3i',
'?q=computer&num=100&ie=UTF-8&prmd=ivnsbp&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQsAQIHg'
I want to get the clean URL, which is 'https://en.wikipedia.org/wiki/Computer' in this case, and the same for all the other search results.
How can I modify my code?
Edited: As you can see in the image below, I want to get the real URL (marked in yellow), not the long, messy URL above.
How about appending
.split('&')[0]
to your code, so that it becomes:
import requests
from bs4 import BeautifulSoup
def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:].split('&')[0])
    return result

search('computer')
[EDIT]
Taking https://en.wikipedia.org/wiki/Computer as an example:
Through the Chrome developer tools, the URL looks clean.
Since it belongs to <h3 class="r">, your code should work fine and return the clean URL.
Instead, if you replace
result.append(i.find('a', href = True) ['href'][7:])
with
print(i)
then in my terminal it returns the following for the above link:
/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ
You can see that /url?q= has been prepended, and &sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ has been appended.
By looking at other links as well, I observed that the prepended part always looks like /url?q=, and the appended part always begins with a &.
Therefore it's my belief that my original answer should work:
result.append(i.find('a', href = True) ['href'][7:].split('&')[0])
[7:] removes the prepended string, and split('&')[0] removes the appended string.
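If you'd rather not depend on the exact prefix length, an alternative (my own sketch, not part of the original answer) is to parse the /url?q= redirect with urllib.parse instead of slicing:

from urllib.parse import urlparse, parse_qs

def clean_url(href):
    # href looks like '/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&...'
    query = parse_qs(urlparse(href).query)
    # 'q' holds the real destination; fall back to the raw href if it is missing
    return query.get('q', [href])[0]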
I found a solution.
This modification to the search function works:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword), headers = headers).text
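For completeness, the whole search function with the header added (and the split('&')[0] cleanup from the answer above) might look roughly like this; the User-Agent string is only an example:

import requests
from bs4 import BeautifulSoup

def search(keyword):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword),
                        headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        # strip the '/url?q=' prefix and the '&sa=...' tracking suffix
        result.append(i.find('a', href=True)['href'][7:].split('&')[0])
    return result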
I'm trying to make a web scraping / searching project with BeautifulSoup, but when I run this code all I'm getting is None. I'm sure I've checked the page's HTML source on Google for the part of the web page I want to scrape. Here's the code:
from bs4 import BeautifulSoup
import requests
topic_search = input('Enter your search queries: ')
search = topic_search.replace(' ', '+')
url = 'https://www.google.com/search?q=' + search + '&start=' + str(0)
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=header).text
soup = BeautifulSoup(response, features='lxml')
result = soup.find('div', id='result-stats')
print(result)
But the only result getting printed is "None". I don't know what to do. Also, even if I use soup.find_all instead of soup.find, I still get "[]" as the result. What should I do?
First, check whether it correctly gets the page source. Add
print(response)
immediately after
response = requests.get(url, headers=header).text
Run that: does it print the intended page source? If not, fix that first. If that works, then
soup.find('div', id='result-stats')
returning None means there is no element fulfilling those criteria, and you need to change the arguments to describe an element that actually exists. Note that an id, if used, should be unique, so if you want to access an element with a given id, providing the id alone should be enough.
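As a concrete way to run that check, a small debugging sketch (my own addition, reusing the url and header variables from the question) could be:

response = requests.get(url, headers=header)
# a status code of 200 means the request itself succeeded
print(response.status_code)
# see whether the element you are after is even present in the raw HTML
print('result-stats' in response.text)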
I have tried everything. The response is perfect and I do get what I am supposed to be getting, I just don't understand why I receive an empty array when I'm searching for a div with a specific class (that definitely exists) on the web page. I have tried looking everywhere, but nothing seems to work.
Here's my code:
import requests
import lxml
from bs4 import BeautifulSoup
baseurl = 'https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB'
response = requests.get(baseurl)
soup = BeautifulSoup(response.content, 'lxml')
productlist = soup.find_all("div", class_="nsv-product ns_b_a")
print(productlist)
I am essentially trying to build a script that emails me when the items from an e-shop are marked as available (på lager) instead of unavailable (ikke på lager).
You might need to use Selenium on this one.
The div is, AFAIK, rendered by JS.
requests only downloads the static HTML, so BeautifulSoup never sees JS-rendered content.
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.keys import Keys
options = webdriver.FirefoxOptions()
options.headless = True
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(),options=options)
driver.get('https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB')
k = driver.find_elements_by_xpath("//div[@class='nsv-product ns_b_a']")
Your code below that snippet should contain everything you need, e.g. processing, saving into your database, etc.
Note: that snippet is a bit rough (e.g. you may want to use Chrome instead of Firefox), but it is only an example, so tweak it to your own needs.
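For instance, one hedged way to see what those matched elements contain (just an illustration; the attribute you actually need may differ) is:

for el in k:
    # .text gives the rendered text of each product card
    print(el.text)

driver.quit()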
You need to inspect the page source code (on Windows: Ctrl + U) and search for the section window.netset.model.productListDataModel, which is enclosed in a <script> tag.
What you need to do is parse that enclosed JSON string,
<script>window.netset.model.productListDataModel = {....}
which will give you the desired product listing (8 per page).
Here is the code
import re, json
import requests
import lxml
from bs4 import BeautifulSoup
baseurl = 'https://www.atea.dk/eshop/products/?filters=S_Apple%20MacBook%20Pro%2016%20GB'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
response = requests.get(baseurl, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# print response
print(response)
# regex to find the script enclosed json string
product_list_raw_str = re.findall(r'productListDataModel\s+=\s+(\{.*?\});\n', response.text)[0].strip()
# parse json string
products_json = json.loads(product_list_raw_str)
# find the product list
product_list = products_json['response']['productlistrows']['productrows']
# check product list count, 8 per page
print(len(product_list))
# iterate the product list
for product in product_list:
    print(product)
It will output -
<Response [200]>
{'buyable': True, 'ispackage': False, 'artcolumndata': 'MK1E3DK/A', 'quantityDisabled': False, 'rowid': '5418526',.........
..... 'showAddToMyList': False, 'showAddToCompareList': True, 'showbuybutton': True}
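Since the end goal is an availability alert, one small follow-up sketch (my own addition; it assumes the 'buyable' flag seen in the output actually reflects stock status, which is worth verifying against the page) could be:

# keep only products whose 'buyable' flag is set
available = [p for p in product_list if p.get('buyable')]
for p in available:
    # 'artcolumndata' appears to hold the article number, e.g. 'MK1E3DK/A'
    print(p.get('artcolumndata'), 'looks to be in stock')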
Trying to get a specific portion of text from this web page... trying to use code I found from a similar post:
# Import required modules
from lxml import html
import requests
# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')
# Parsing the page
tree = html.fromstring(page.content)
# Get element using XPath
share = tree.xpath(
    '//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)
Output is just empty brackets []
You are getting empty results because the div element you are trying to query is commented out in the requested page's source. Note that when you use the requests.get method, you get the page's HTML source code, not the rendered HTML code generated by the browser from your interaction with the page and that you can inspect with the browser's developer tools.
So I would say: check again whether this is really the element you see rendered on the page, and see if you can identify what kind of interaction causes it to render. Then you can use a tool to reproduce this interaction so that you can get the rendered HTML code within your Python environment; I would suggest helium for doing so. If this is not the right element, you can simply update the specified XPath to target an element that is actually available in the source code and scrape it successfully.
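For reference, a minimal helium sketch (my own addition, assuming helium and a matching Chrome driver are installed; treat it as a starting point rather than a drop-in fix):

from helium import start_chrome, kill_browser

# start a headless Chrome session and let the page render its JavaScript
driver = start_chrome('https://www.baseball-reference.com/players/k/kershcl01.shtml', headless=True)
rendered_html = driver.page_source
# check whether the leaderboard is present once the page has rendered
print('leaderboard_cyyoung' in rendered_html)
kill_browser()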
As stated, this is a rendered/dynamic part of the site. It is there in the comments, so you'll need to pull the comments out of the html and then parse them. The other issue is that inside the comments there is no <tbody> tag, so your XPath won't find anything; you'd need to remove that. I'm not sure what you want to pull out though (is it the link, is it the text?). I altered your code to show you how to use it with lxml, but honestly I'm not a fan; I'd prefer to just use BeautifulSoup. BeautifulSoup doesn't integrate with XPath, however, so I used a css selector instead.
Your code altered:
import requests
from lxml import html
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)

        # Parsing the page
        tree = html.fromstring(htmlStr)

        # Get element using XPath
        share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
        print(share)
How I would do it:
import requests
from bs4 import BeautifulSoup, Comment
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break
Output:
[4.58 Career Shares]
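If what you actually want is the link target or the text rather than the tag object, a small hedged follow-up (assuming the selector matched at least one element) would be:

if share:
    # .text gives the anchor's text, ['href'] its link target
    print(share[0].text)
    print(share[0]['href'])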
In general, I'm trying to get at least some tags from this site, and it always gives None. I have no idea how to fix this.
There is a Tickets button; after you press it, an additional panel slides in from the side, and that panel is what I want to parse, but I can't work out how. As I understand it, the panel is not loaded until after the click, and I don't understand what to do next. P.S. I've only just started learning this.
# coding: utf-8-sig
import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

def get_html(url):
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('body', class_='panel-open')
    print(table)

def main():
    parse(get_html('http://toto-info.co/'))

if __name__ == '__main__':
    main()
That would be because the body element of the web page http://toto-info.co/ does not contain the class attribute "panel-open".
You can see what the body element contains by changing the line in your code:
table = soup.find('body', class_='panel-open')
to
table = soup.find('body')
This will now print the body element and all the elements it contains.
As you will see, the body element contains very little except script elements. If you want the scripts to render, you will have to use other technologies; as a starting point, I suggest a Google search for something like "Web-scraping JavaScript page with Python".
An example that does select something by class, if you are interested, is:
table = soup.find('div', class_='standalone')
That selects the following element from the page:
<div class="standalone" data-app="" id="app"></div>
but that is about all of the markup on this page that is displayed without JavaScript.
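If you do go down the browser-automation route, a minimal Selenium sketch (my own addition, assuming Chrome and chromedriver are installed; what you find in the rendered body will still depend on the site) could be:

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get('http://toto-info.co/')

# parse the rendered DOM instead of the raw server response
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('body'))
driver.quit()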
I'm trying to scrape the likes and retweets from the results of a Twitter search.
After running the Python below, I get an empty list, []. I'm not using the Twitter API because it doesn't return tweets by hashtag from this far back.
The code I'm using is:
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
I can successfully save the HTML to a file using the code below, but when I search the saved text it is missing large amounts of information, such as the class names I am looking for.
So (part of) the problem is apparently in accurately accessing the source code.
filename = 'newfile2.txt'
with open(filename, 'w') as handle:
    handle.writelines(str(data))
This screenshot shows the span that I'm trying to scrape.
I've looked at this question, and others like it, but I'm not quite getting there.
How can I use BeautifulSoup to get deeply nested div values?
It seems that your GET request returns valid HTML but with no tweet elements in the #timeline element. However, adding a user agent to the request headers seems to remedy this.
from bs4 import BeautifulSoup
import requests
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
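If the spans do come back after adding the User-Agent, a hedged follow-up to turn them into plain numbers (assuming the counts are bare digits, which may not hold for abbreviated values like 1.2K) might be:

counts = [int(s.get_text(strip=True)) for s in all_likes if s.get_text(strip=True).isdigit()]
print(counts)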