I'm trying to learn Python, so I decided to write a script that translates a word using Google Translate. So far I have written this:
import sys
from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
data = {'sl':'en','tl':'it','text':'word'}
request = urllib2.Request('http://www.translate.google.com', urllib.urlencode(data))
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
#print feeddata
soup = BeautifulSoup(feeddata)
print soup.find('span', id="result_box")
print request.get_method()
And now I'm stuck. I can't see any bugs in it, but it still doesn't work (by that I mean that the script runs, but it won't translate the word).
Does anyone know how to fix it?
(Sorry for my poor English)
I made this script if you want to check it:
https://github.com/mouuff/Google-Translate-API
: )
Google Translate is meant to be used with a GET request, not a POST request. However, urllib2 will automatically submit a POST if you add any data to your request.
The solution is to construct the URL with a query string, so that you end up submitting a GET.
You'll need to alter the request = urllib2.Request('http://www.translate.google.com', urllib.urlencode(data)) line of your code.
Here goes:
querystring = urllib.urlencode(data)
request = urllib2.Request('http://www.translate.google.com' + '?' + querystring)
And you will get the following output:
<span id="result_box" class="short_text">
<span title="word" onmouseover="this.style.backgroundColor='#ebeff9'" onmouseout="this.style.backgroundColor='#fff'">
parola
</span>
</span>
By the way, you're kinda breaking Google's terms of service; look into them if you're doing more than hacking a little script for training.
Using requests
I strongly advise you to stay away from urllib if possible, and use the excellent requests library, which will allow you to efficiently use HTTP with Python.
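For example, the translation request above could look roughly like this with requests. This is only a sketch: it assumes the bs4 package is installed and that Google still serves the result_box span shown above, which may no longer be the case.
import requests
from bs4 import BeautifulSoup

data = {'sl': 'en', 'tl': 'it', 'text': 'word'}
headers = {'User-Agent': 'Mozilla/5.0'}

# requests sends a GET by default and builds the query string from params
response = requests.get('http://www.translate.google.com',
                        params=data, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
result = soup.find('span', id='result_box')
if result is not None:
    print(result.get_text())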
Yes, Google's documentation for this is not so easy to navigate. Here's what you do:
In the Google Cloud Platform Console:
1.1 Go to the Projects page and select or create a new project
1.2 Enable billing for your project
1.3 Enable the Cloud Translation API
1.4 Create a new API key in your project; make sure to restrict usage by IP or the other means available there.
On the machine where you want to run the client:
pip install --upgrade google-api-python-client
Then you can use the following code to send translation requests and receive responses:
import json
from apiclient.discovery import build
query='this is a test to translate english to spanish'
target_language = 'es'
service = build('translate','v2',developerKey='INSERT_YOUR_APP_API_KEY_HERE')
collection = service.translations()
request = collection.list(q=query, target=target_language)
response = request.execute()
response_json = json.dumps(response)
ascii_translation = ((response['translations'][0])['translatedText']).encode('utf-8').decode('ascii', 'ignore')
utf_translation = ((response['translations'][0])['translatedText']).encode('utf-8')
print response
print ascii_translation
print utf_translation
I'm trying to automate logging in to Costco.com to check some member-only prices.
I used dev tool and the Network tab to identify the request that handles the Logon, from which I inferred the POST URL and the parameters.
Code looks like:
import requests
s = requests.session()
payload = {'logonId': 'email#email.com',
           'logonPassword': 'mypassword'}
#get this data from Google-ing "my user agent"
user_agent = {"User-Agent" : "myusergent"}
url = 'https://www.costco.com/Logon'
response = s.post(url, headers=user_agent, data=payload)
print(response.status_code)
When I run this, it just runs and runs and never returns anything. I waited 5 minutes and it was still running.
What am I doing wrong?
Maybe you should try making a GET request first to get some cookies before making the POST request. And if the POST request doesn't work, add a timeout so the script stops and you know it isn't working:
r = requests.get(url, verify=False, timeout=10)
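Putting both suggestions together, a rough sketch might look like this. It reuses the payload and URLs from the question and is untested against Costco's current site.
import requests

payload = {'logonId': 'email#email.com',
           'logonPassword': 'mypassword'}
headers = {'User-Agent': 'myusergent'}

with requests.Session() as s:
    s.headers.update(headers)
    # GET first so the session picks up whatever cookies the site sets
    s.get('https://www.costco.com/LogonForm', timeout=10)
    # Then POST the credentials; the timeout stops the endless hang
    response = s.post('https://www.costco.com/Logon', data=payload, timeout=10)
    print(response.status_code)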
This one is tough. Usually, in order to set the proper cookies, a GET request to the URL is required first. We can go directly to https://www.costco.com/LogonForm as long as we change the user agent from the default python-requests one. This is accomplished as follows:
import requests

agent = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/85.0.4183.102 Safari/537.36"
)

with requests.Session() as s:
    headers = {'user-agent': agent}
    s.headers.update(headers)
    logon = s.get('https://www.costco.com/LogonForm')
    # Saved the cookies in a variable, explanation below
    cks = s.cookies
The logon GET request is successful, i.e. status code 200! Taking a look at cks:
print(sorted([c.name for c in cks]))
['C_LOC',
'CriteoSessionUserId',
'JSESSIONID',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'_abck',
'ak_bmsc',
'akaas_AS01',
'bm_sz',
'client-zip-short']
Then, using the Network tab in Google Chrome's dev tools and clicking login yields the following form data for the POST used to log in (place this below cks):
data = {'logonId': username,
        'logonPassword': password,
        'reLogonURL': 'LogonForm',
        'isPharmacy': 'false',
        'fromCheckout': '',
        'authToken': '-1002,5M9R2fZEDWOZ1d8MBwy40LOFIV0=',
        'URL': 'Lw=='}
login = s.post('https://www.costco.com/Logon', data=data, allow_redirects=True)
However, simply trying this makes the request just sit there and redirect endlessly.
Using Burp Suite, I stepped through the POST and captured the request a browser makes. That post carries many more cookies than were obtained in the initial GET request.
Quite a few more, in fact:
# cookies taken from the curl request captured in Burp, converted to a python request
sorted(cookies.keys())
['$JSESSIONID',
'AKA_A2',
'AMCVS_97B21CFE5329614E0A490D45%40AdobeOrg',
'AMCV_97B21CFE5329614E0A490D45%40AdobeOrg',
'C_LOC',
'CriteoSessionUserId',
'OptanonConsent',
'RT',
'WAREHOUSEDELIVERY_WHS',
'WC_ACTIVEPOINTER',
'WC_AUTHENTICATION_-1002',
'WC_GENERIC_ACTIVITYDATA',
'WC_PERSISTENT',
'WC_SESSION_ESTABLISHED',
'WC_USERACTIVITY_-1002',
'WRIgnore',
'WRUIDCD20200731',
'__CT_Data',
'_abck',
'_cs_c',
'_cs_cvars',
'_cs_id',
'_cs_s',
'_fbp',
'ajs_anonymous_id_2',
'ak_bmsc',
'akaas_AS01',
'at_check',
'bm_sz',
'client-zip-short',
'invCheckPostalCode',
'invCheckStateCode',
'mbox',
'rememberedLogonId',
's_cc',
's_sq',
'sto__count',
'sto__session']
Most of these look to be static; however, because there are so many, it's hard to tell which is which and what each is supposed to be. It's here that I get stuck myself, and I am actually really curious how this would be accomplished. In some of the cookie data I can also see some sort of IBM Commerce information, so I am linking Prevent Encryption (Krypto) Of Url Paramaters in IBM Commerce Server 6, as it is the only other SO question that pertains even remotely to this.
Essentially, though, the steps would be to determine the proper cookies to pass for this post (and then the proper cookies and info for the redirect!). I believe some of these are being set by some JS or something, since they are not in the GET response from the site. Sorry I can't be more help here.
If you absolutely need to log in, try using selenium, as it simulates a browser. Otherwise, if you just want to check whether an item is in stock, this guide uses requests and doesn't need a login: https://aryaboudaie.com/python/technical/educational/2020/07/05/using-python-to-buy-a-gift.html
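For completeness, a bare-bones selenium sketch might look like the following. It assumes ChromeDriver is installed and that the field ids logonId and logonPassword from the form data above are still what the page uses; verify both in dev tools first.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get('https://www.costco.com/LogonForm')

# Field ids taken from the form data above; they are assumptions, check them in dev tools
driver.find_element(By.ID, 'logonId').send_keys('email#email.com')
driver.find_element(By.ID, 'logonPassword').send_keys('mypassword')

# The submit button selector is a guess as well; adjust it to the real markup
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"], button[type="submit"]').click()

print(driver.current_url)
driver.quit()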
I've been learning a lot of python lately to work on some projects at work.
Currently I need to do some web scraping with Google search results. I found several sites that demonstrated how to use the AJAX Google API to search; however, after attempting to use it, it appears to no longer be supported. Any suggestions?
I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.
You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query>, which will return the top 10 search results.
Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree either via a CSS selector (.r a) or via an XPath selector (//h3[@class="r"]/a).
In some cases the resulting URL will redirect to Google. Usually it contains a query parameter q which contains the actual target URL.
Example code using lxml and requests:
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])
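If you prefer the XPath selector mentioned above, the same loop might look like this. Note that the r class is whatever Google's markup currently uses, so check the HTML you actually receive before relying on it.
from urllib.parse import urlparse, parse_qs
from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

# XPath equivalent of the CSS selector ".r a"
for result in page.xpath('//h3[@class="r"]/a'):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
        print(url[0])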
A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.
Here is another service that can be used for scraping SERPs (https://zenserp.com). It does not require a client library and is cheaper.
Here is a python code sample:
import requests
headers = {
    'apikey': '',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
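The response body is JSON; the exact structure of the results is documented on zenserp's site, so as a minimal check you might just dump what comes back:
# Inspect the raw JSON to see how zenserp structures the results
print(response.status_code)
print(response.json())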
You have two options: building it yourself or using a SERP API.
A SERP API will return the Google search results as a formatted JSON response.
I would recommend a SERP API as it is easier to use, and you don't have to worry about getting blocked by Google.
1. SERP API
I have good experience with the ScraperBox SERP API.
You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your scraperbox API token.
import urllib.parse
import urllib.request
import ssl
import json
ssl._create_default_https_context = ssl._create_unverified_context
# Urlencode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")
# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"
# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)
# Print the first result title
print(response["organic_results"][0]["title"])
2. Build your own Python scraper
I recently wrote an in-depth blog post on how to scrape search results with Python.
Here is a quick summary.
First you should get the HTML contents of the Google search result page.
import urllib.request
url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'
# Perform the request
request = urllib.request.Request(url)
# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()
# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
Then you can use BeautifulSoup to extract the search results.
For example, the following code will get all titles.
from bs4 import BeautifulSoup

# The code to get the html contents here.
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")
    # Check if we have found a result
    if (len(results) >= 1):
        # Print the title
        h3 = results[0]
        print(h3.get_text())
You can extend this code to also extract the search result urls and descriptions.
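For instance, grabbing the link alongside each title might look like this. The assumption that an a tag wraps each result is based on Google's markup at the time, so verify it against the HTML you actually get.
from bs4 import BeautifulSoup

# html is the page source fetched in the earlier urllib.request snippet
soup = BeautifulSoup(html, 'html.parser')

for div in soup.select("#search div.g"):
    title = div.select_one("h3")
    link = div.select_one("a")  # assumed to be the link wrapping the result
    if title is not None and link is not None:
        print(title.get_text())
        print(link.get("href"))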
You can also use a third party service like Serp API - I wrote and run this tool - that is a paid Google search engine results API. It solves the issues of being blocked, and you don't have to rent proxies and do the result parsing yourself.
It's easy to integrate with Python:
from lib.google_search_results import GoogleSearchResults
params = {
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "hl": "en",
    "gl": "us",
    "google_domain": "google.com",
    "api_key": "demo",
}
query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
GitHub: https://github.com/serpapi/google-search-results-python
The current answers will work, but Google will ban you for scraping.
My current solution uses requests_ip_rotator:
import requests
from requests_ip_rotator import ApiGateway
import os

keywords = ['test']

def parse(keyword, session):
    url = f"https://www.google.com/search?q={keyword}"
    response = session.get(url)
    print(response)

if __name__ == '__main__':
    AWS_ACCESS_KEY_ID = ''
    AWS_SECRET_ACCESS_KEY = ''

    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()

    session = requests.Session()
    session.mount("https://www.google.com", gateway)

    for keyword in keywords:
        parse(keyword, session)

    gateway.shutdown()
You can create the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console.
This solution allows you to make 1 million requests (the Amazon free-tier limit).
I am trying to log into Facebook by sending a POST request and then get the HTML source of my profile page.
I have tried many ways, but my script always returns the same login page.
Hopefully someone can give me some hints/suggestions.
import http.cookiejar
import urllib.error
import urllib.parse
import urllib.request

post_data = {
    'email': 'xxx',
    'pass': 'xxx',
    'legacy_return': '1',
    'trynum': '1',
    'timezone': '240',
    'lgndim': 'xxx',
    'lgnrnd': 'xxx',
    'lgnjs': 'xxx'
}

try:
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    login_data = urllib.parse.urlencode(post_data)
    encode_data = login_data.encode('UTF-8')
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64)')]
    opener.open('https://www.facebook.com/login.php?', encode_data)
    resp = opener.open('https://www.facebook.com/login.php?')
    print(resp.read().decode('utf-8'))
    print(resp.geturl())
except urllib.error.HTTPError as err:
    print(err.code)
A quick look at the Facebook login shows that Facebook POSTs more variables than you have in your code. I know that FB has been trying to crack down extensively on scraping and my guess is that they are using on-page javascript and other techniques to prevent you from doing what you want to do.
I tried using the Firefox plugin "Tamper Data" to intercept a POST call to /login. I copied every single variable (including ones you don't use, like 'lsd' and 'qsstamp'), but simulating the request in Python still doesn't work.
In the end, the simplest answer is to use the Facebook APIs. The Graph API docs are found here.
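As a rough sketch, a Graph API call with requests might look like this. The access token is something you generate in the Facebook developer console, and the /me endpoint with a fields parameter is the standard profile lookup.
import requests

ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'  # generate this in the Facebook developer console

# Ask the Graph API for the profile behind the token instead of scraping the login page
response = requests.get(
    'https://graph.facebook.com/me',
    params={'fields': 'id,name', 'access_token': ACCESS_TOKEN},
)
print(response.json())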
Using urlopen for URL queries as well seems obvious. What I tried is:
import urllib2
query='http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
f = urllib2.urlopen(query)
s = f.read()
f.close()
However, for this specific URL query it fails with HTTP error 403 (Forbidden).
When entering this query in my browser, it works.
Also when using http://www.httpquery.com/ to submit the query, it works.
Do you have suggestions how to use Python right to grab the correct response?
Looks like it requires cookies... (which you can handle with urllib2), but an easier way, if you're doing this, is to use requests:
import requests
session = requests.session()
r = session.get('http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627')
This is generally a much easier and less-stressful method of retrieving URLs in Python.
requests will automatically store and re-use cookies for you. Creating a session is slightly overkill here, but is useful for when you need to submit data to login pages etc..., or re-use cookies across a site... etc...
Using urllib2, it's something like:
import urllib2, cookielib
cookies = cookielib.CookieJar()
opener = urllib2.build_opener( urllib2.HTTPCookieProcessor(cookies) )
data = opener.open('url').read()
It appears that the urllib2 default user agent is banned by the host. You can simply supply your own user agent string:
import urllib2
url = 'http://www.onvista.de/aktien/snapshot.html?ID_OSI=86627'
request = urllib2.Request(url, headers={"User-Agent" : "MyUserAgent"})
contents = urllib2.urlopen(request).read()
print contents
Python noob here. I'm trying to extract a link, specifically the link to 'all reviews' on an Amazon product page. I get an unexpected result.
import urllib2
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
response = urllib2.urlopen(req)
page = response.read()
start = page.find("all reviews")
link_start = page.find("href=", start) + 6
link_end = page.find('"', link_start)
print page[link_start:link_end]
The program should output:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=1
Instead, it outputs:
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8
I get the same result you do, but that appears to be simply because the page Amazon serves to your Python script is different from what it serves to my browser. I wrote the downloaded page to disk and loaded it in a text editor, and sure enough, the link ends with ADT8" without all the /ref=dp_top stuff.
In order to help convince Amazon to serve you the same page as a browser, your script is probably going to have to act a lot more like a browser (by accepting and sending cookies, for example). The mechanize module can help with this.
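A rough mechanize sketch, assuming the package is installed, might look like this. It only fetches the page; the link-finding code from the question would then run on page as before.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # mechanize obeys robots.txt by default
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1')]

response = br.open('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8')
page = response.read()
print(len(page))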
Ah, okay. If you do the usual trick of faking a user agent, for example:
req = urllib2.Request('http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/dp/B000A0ADT8/ref=sr_1_1?s=hpc&ie=UTF8&qid=1342922857&sr=1-1&keywords=truth')
ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20110506 Firefox/4.0.1'
req.add_header('User-Agent', ua)
response = urllib2.urlopen(req)
then you should get something like
localhost-2:coding $ python plink.py
http://www.amazon.com/Ole-Henriksen-Truth-Collagen-Booster/product-reviews/B000A0ADT8/ref=dp_top_cm_cr_acr_txt/190-6179299-9485047?ie=UTF8&showViewpoints=1
which might be closer to what you want.
[Disclaimer: be sure to verify that Amazon's TOS rules permit whatever you're going to do before you do it..]