How to search on Google with Selenium in Python?

I'm really new to web scraping. Is there anyone that could tell me how to search on google.com with Selenium in Python?

In order to search for something on Google, try this:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.google.com/')
# the search box is the input named "q"; a trailing newline submits the query
textbox = driver.find_element_by_xpath('//input[@name="q"]')
textbox.send_keys('who invented python\n')

Selenium probably isn't the best fit here; other libraries and tools tend to work better for this. BeautifulSoup (for parsing fetched HTML) is the first one that comes to mind.
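If you only need the results page rather than a driven browser, one lighter-weight approach (a sketch; Google may block or CAPTCHA automated requests, as noted below) is to build the search URL directly and fetch it with an HTTP client:

```python
from urllib.parse import urlencode

def google_search_url(query):
    """Build a Google search URL for a query string."""
    return 'https://www.google.com/search?' + urlencode({'q': query})

print(google_search_url('who invented python'))
# https://www.google.com/search?q=who+invented+python
```

The returned URL can then be fetched with requests or urllib and parsed with BeautifulSoup, with the caveat that Google actively defends against this.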

Don't use Selenium; other tools are more suitable for this. Be aware that Google does not look favorably on the scraping of its search results:
Google does not take legal action against scraping, likely for
self-protective reasons. However, Google is using a range of defensive
methods that makes scraping their results a challenging task.
Source: https://en.wikipedia.org/wiki/Search_engine_scraping

Related

Collecting CSV/EXML file from a website that uses Javascript

As a beginner I've been heavily warned to avoid resource-heavy browsers, such as Selenium, for web scraping.
Then I looked at this site: Intcomex Webstore
My idea was to make an alert program to tell me the price and if the item was low in quantity.
I can't for the life of me figure out how one would even attempt to get any of this information, whether through the CSV/EXML files or directly.
I'd possibly use requests, but it only returns the javascript function as a link: href="javascript:PriceListExportCSV('/en-XUS/Products/Csv','query');"
In Developer Tools after I've clicked the CSV link I see a GET request to http://store.intcomex.com/en-XUS/Products/Csv
However, if I request that URL with requests, I get status_code = 404.
Any help to point me in the right direction is greatly appreciated.
After taking the advice of many helpful commenters, I've come to the conclusion that I indeed need to use a browser such as Selenium.
While it may not be the ideal solution, it appears to be the only viable one at the moment.
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://store.intcomex.com/en-XUS/Products/ByCategory/cpt.allone?r=True')
# call the site's own CSV-export function (execute_script takes raw JavaScript,
# without the "javascript:" prefix used in the href)
browser.execute_script("PriceListExportCSV('/en-XUS/Products/Csv','query');")
I'll have to figure it out from here...
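One possible next step (a sketch, assuming the export endpoint honours the logged-in session cookies, which would also explain the 404 from a bare requests call): copy Selenium's cookies into requests and fetch the CSV URL directly.

```python
def cookies_to_dict(selenium_cookies):
    """Convert the list returned by browser.get_cookies() into the
    name -> value dict that requests understands."""
    return {c['name']: c['value'] for c in selenium_cookies}

# Hypothetical usage, after Selenium has established a session:
# import requests
# resp = requests.get('http://store.intcomex.com/en-XUS/Products/Csv',
#                     cookies=cookies_to_dict(browser.get_cookies()))

print(cookies_to_dict([{'name': 'sid', 'value': 'abc123', 'domain': 'example'}]))
```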

Scraping a dynamic website via Python Scripts: how to get the values?

I am trying to scrape information from a website. So far, I've been able to access the webpage, log in with a username and password, and then print that landing page's page source into a separate .html/.txt file as needed.
Here's where the problems arise: on that "landing page" there's a table whose data I want to scrape. If I manually right-click any integer in that table and select "Inspect", I find the integer with no problem. However, when looking at the page source as a whole, I don't see the integers, just variable/parameter names. This leads me to believe it is a dynamic website.
How can I scrape the data?
I've been to hell and back trying to scrape this website, and so far, here's how the available technology has worked for me:
Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work if I log in manually.
Selenium's Chromium package has been failing on me repeatedly (on my Windows 7 laptop) and I have even posted a question about the matter here. For now I'll assume it's just a lost cause, but I'm willing to graciously accept anyone's benevolent help.
Spynner's description looked promising, but that setup has frustrated me for quite some time- and the lack of a clear introduction only compounds its cumbersome nature to a novice like myself.
I prefer to code in Python, as it is the language I am most comfortable with. I have a pending company request to have the company install Visual Studio on my computer (to try doing this in C#), but I'm not holding my breath...
If my code can be of any use, so far, here's how I'm driving the page with PhantomJS via Selenium:
# Headless browsing using PhantomJS and Selenium
# (the PhantomJS binary is installed in the current directory)
from selenium import webdriver
import time

browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550)  # PhantomJS needs a realistic window size to find elements

def login_entry(username, password):
    login_email = browser.find_element_by_id('UserName')
    login_email.send_keys(username)
    login_password = browser.find_element_by_id('Password')
    login_password.send_keys(password)
    submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
    submit_elem.click()

browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')
time.sleep(10)  # crude wait for the page to finish loading

with open('phantomjs_test_source_output.html', 'w') as test_output:
    test_output.write(browser.page_source)

browser.quit()
p.s.- if anyone thinks I should be tagging javascript to this question, let me know. I personally don't know javascript but I'm sensing that it might be part of the problem/solution.
Try something like this. With dynamic pages, you sometimes need to wait for the data to load before it appears in the DOM:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until at least one element matching the locator tuple is present
WebDriverWait(my_driver, my_timeout).until(
    EC.presence_of_all_elements_located((By.ID, 'my_expected_element')))
http://selenium-python.readthedocs.io/waits.html
https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html
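For intuition, WebDriverWait is essentially a poll loop: it calls a condition repeatedly until the condition returns something truthy or a timeout expires. A minimal pure-Python sketch of the same idea:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value, or raise
    TimeoutError after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within {} seconds'.format(timeout))

# Hypothetical usage with a Selenium driver:
# wait_until(lambda: 'expected text' in driver.page_source, timeout=15)
```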

Using AutoIT with Selenium

Thank you for answering my previous question, but as one problem is solved another is found, apparently.
Interacting with the flash game itself is now the problem. I have tried researching how to do it in Selenium, but it can't be done. I've seen FlashSelenium, Sikuli, and AutoIT.
I can only find documentation for FlashSelenium in Java. It's easier for me to use AutoIT than Sikuli, as Sikuli would require learning Jython to write the kind of script I want; I'm not against learning it, I'm just trying to finish this as fast as possible. As for AutoIT, the only problem is that I don't understand how to use it with Selenium.
from selenium import webdriver
import autoit
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://na58.evony.com/s.html?loginid=747970653D74727947616D65&adv=index")
driver.maximize_window()
assert "Evony - Free forever " in driver.title
So far I have this, and it's doing what it's supposed to do, which is create a new account via that driver.get call. But when I reach the page it is all Flash and I cannot interact with anything on it, so I have to use AutoIT; I just don't know how to get it to "pick up" where Selenium left off. I want it to interact with a button on the webpage, and from a previous post on Stack Overflow I gather I can use (x, y) coordinates to specify the location, but unfortunately that post didn't explain beyond that. Any and all information would be great, thanks.
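One way to hand off from Selenium to AutoIt (a sketch; it assumes the pyautoit bindings on Windows, and the toolbar height is a guess you'd need to measure) is to read the element's page coordinates from Selenium, translate them into absolute screen coordinates, and pass those to AutoIt's mouse click. The coordinate math is just offsets:

```python
def element_screen_coords(window_x, window_y, toolbar_height, elem_x, elem_y):
    """Translate an element's page coordinates into absolute screen
    coordinates, given the browser window position and the height of the
    window's title bar / toolbar chrome."""
    return (window_x + elem_x, window_y + toolbar_height + elem_y)

# Hypothetical usage:
# pos = driver.get_window_position()                 # {'x': ..., 'y': ...}
# loc = driver.find_element_by_id('game').location   # {'x': ..., 'y': ...}
# x, y = element_screen_coords(pos['x'], pos['y'], 80, loc['x'], loc['y'])
# autoit.mouse_click("left", x, y)

print(element_screen_coords(100, 50, 80, 20, 30))  # (120, 160)
```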
Yes, you can use any number of scraping libraries (Scrapy and BeautifulSoup are both easy to use and very powerful). Personally, though, I like Selenium and its Python bindings because they're the most flexible. Your final script would look something like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://xx.yy.zz")
# Click the "New comer, click here to play!" button
# (partial link text matching is case-sensitive)
elem = driver.find_element_by_partial_link_text("click here to play")
elem.send_keys(Keys.RETURN)
Can you post what the source of the page looks like (maybe using a Pastebin)?
Edit: updated to show you how to click the "Click here to play" button.

How to get CSS Background Colour with Python?

Basically exactly as the question says, I'm trying to get the background colour out from a website.
At the moment I'm using BeautifulSoup to get the HTML, but it's proving a difficult way of getting at the CSS. Any help would be great!
This is not something you can reliably solve with BeautifulSoup. You need a real browser.
The simplest option would be to use the Selenium browser automation tool:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('url')
element = driver.find_element_by_id('myid')
print(element.value_of_css_property('background-color'))
value_of_css_property() documentation.
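Note that value_of_css_property typically returns colors as "rgb(...)" or "rgba(...)" strings rather than hex. A small helper (a sketch) can normalize that:

```python
import re

def css_color_to_hex(css_value):
    """Convert an 'rgb(...)' or 'rgba(...)' string, as returned by
    value_of_css_property, into a '#rrggbb' hex string (alpha is dropped)."""
    nums = re.findall(r'[\d.]+', css_value)
    r, g, b = (int(float(n)) for n in nums[:3])
    return '#{:02x}{:02x}{:02x}'.format(r, g, b)

print(css_color_to_hex('rgba(255, 255, 255, 1)'))  # #ffffff
print(css_color_to_hex('rgb(0, 128, 0)'))          # #008000
```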

How to scrape a web-site filling out forms and 'clicking' on links with R?

I would like to scrape the HTML source of JavaScript-driven pages that I can't access without first selecting an option in a drop-down list and then 'clicking' on links. Although this example isn't actually JavaScript-based, a simple one would be:
Scrape the main Wikipedia pages in all languages available in the drop-down list at the bottom of this URL: http://www.wikipedia.org/
To do so, I need to select one language, English for example, and then 'click' the 'Main Page' link on the left of the new URL (http://en.wikipedia.org/wiki/Special:Search?search=&go=Go).
After this step, I would scrape the HTML source code of the Wikipedia main page in English.
Is there any way to do this using R? I have already tried RCurl and XML packages, but it does not work well with the javascript page.
If it is not possible with R, could anyone tell me how to do this with python?
It's possible to do this using Python with the Selenium package. There are some useful examples here. I found it helpful to install Firebug so that I could identify elements on the page. There is also a Selenium Firefox plugin with an interactive window that can help.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://website.aspx")
elem = driver.find_element_by_id("ctl00_ctl00")
elem.send_keys('15')
elem.send_keys(Keys.RETURN)
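For the Wikipedia example in the question specifically, a browser may not even be necessary: each language lives on its own subdomain, and the root URL redirects to that language's main page, so the target URLs can be built directly from the language codes:

```python
def wikipedia_root(lang):
    """Root URL for a Wikipedia language edition; requesting it redirects
    to that language's main page."""
    return 'https://{}.wikipedia.org/'.format(lang)

print(wikipedia_root('en'))  # https://en.wikipedia.org/
print(wikipedia_root('de'))  # https://de.wikipedia.org/
```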
Take a look at the RCurl and XML packages for posting form information to the website and then processing the data afterwards. RCurl is pretty cool, but you might have an issue with the HTML parsing, because if it isn't standards-compliant, the XML package may not want to play nice.
If you are interested in learning Python, however, Celenius' example above coupled with BeautifulSoup would be what you need.
