scraping data from a web page with python requests - python

I am trying to scrape a domain search page (where you can enter a keyword and get some results) and
I found this API URL in the network tab: https://api.leandomainsearch.com/search?query=computer&count=all (for keyword: computer), but I am getting this error
{'error': True, 'message': 'Invalid API Credentials'}
here is the code
import requests

# NOTE: this bare request fails with {'error': True, 'message': 'Invalid API Credentials'}
# because the API also requires Authorization and Referer headers.
# Fixed typo in the query value: "cmputer" -> "computer" (the keyword being searched).
r = requests.get("https://api.leandomainsearch.com/search?query=computer&count=all")
print(r.json())

The site requires that you set the Authorization and Referer HTTP headers.
For example:
import re
import json
import requests

kw = 'computer'
url = 'https://leandomainsearch.com/search/'
api_url = 'https://api.leandomainsearch.com/search'

# The search page embeds its API key in the HTML as "apiKey":"...";
# pull it out with a regex so we can authenticate against the JSON API.
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url, params={'q': kw}).text)[1]

# The API rejects requests that lack both of these headers.
headers = {
    'Authorization': 'Key ' + api_key,
    'Referer': 'https://leandomainsearch.com/search/?q={}'.format(kw),
}
data = requests.get(api_url, params={'query': kw, 'count': 'all'}, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# Fixed: the loop body lost its indentation in the paste, which is a SyntaxError.
for d in data['domains']:
    print(d['name'])

print()
print('Total:', data['_meta']['total_records'])
Prints:
...
blackopscomputer.com
allegiancecomputer.com
northpolecomputer.com
monumentalcomputer.com
fissioncomputer.com
hedgehogcomputer.com
blackwellcomputer.com
reflectionscomputer.com
towerscomputer.com
offgridcomputer.com
redefinecomputer.com
quantumleapcomputer.com
Total: 1727

Related

Trying to retrieve data from the Anbima API

I'm trying to automate a process in which I have to download some Brazilian fund quotes from Anbima (the Brazilian regulator). I have been able to work through the first steps to retrieve the access token, but I don't know how to use the token in order to make requests. Here is the tutorial website: https://developers.anbima.com.br/en/como-acessar-nossas-apis/.
I have tried a lot of things, but all I get from the request is 'Could not find a required APP in the request, identified by HEADER client_id.'
If someone could shed some light — thank you in advance.
import requests
import base64
import json

# Build the HTTP Basic credential: base64("clientId:clientSecret").
ClientID = '2Xy1ey11****'
ClientSecret = 'faStF1Hc****'
codeString = ClientID + ":" + ClientSecret
base64CodeString = base64.b64encode(codeString.encode('ascii')).decode('ascii')

# Step 1: exchange the client credentials for an OAuth access token.
# (Removed the unauthenticated GET to /feed/fundos/v1/fundos that was here:
# its result was discarded, so it only wasted a round trip.)
url = "https://api.anbima.com.br/oauth/access-token"
headers = {
    'content-type': 'application/json',
    'authorization': f'Basic {base64CodeString}',
}
body = {
    "grant_type": "client_credentials"
}
r = requests.post(url=url, data=json.dumps(body), headers=headers, allow_redirects=True)
jsonDict = r.json()

##################
# Step 2: call the sandbox feed API using the token obtained above.
urlFundos = "https://api-sandbox.anbima.com.br/feed/precos-indices/v1/titulos-publicos/mercado-secundario-TPF"
token = jsonDict['access_token']
headers2 = {
    'content-type': 'application/json',
    'authorization': f'Bearer {token}',
}
r2 = requests.get(url=urlFundos, headers=headers2)
# Fixed: bare expressions do nothing in a script; print them instead.
print(r2.status_code)
print(r2.text)
I was having the same problem, but today I was able to make some progress. I believe you need to adjust some parameters in the header.
Here is the piece of code I developed.
from bs4 import BeautifulSoup
import requests

# Endpoint configuration: the sandbox returns dummy data; production
# requires approved access.
PRODUCTION_URL = 'https://api.anbima.com.br'
SANDBOX_URL = 'https://api-sandbox.anbima.com.br'
API_URL = '/feed/fundos/v1/fundos/'
CODIGO_FUNDO = '594733'
PRODUCTION = False

# Select the base URL, then append the feed path and the fund code.
URL = (PRODUCTION_URL if PRODUCTION else SANDBOX_URL) + API_URL + CODIGO_FUNDO

# Anbima expects the credentials in these two custom headers.
HEADER = {'access_token': 'your token',
          'client_id' : 'your client ID'}

# Fetch the raw body and pretty-print it for inspection.
body = requests.get(URL, headers=HEADER).content
soup = BeautifulSoup(body, 'html.parser')
print(soup.prettify())
The sandbox API will return a dummy JSON. To access the production API you will need to request access (I'm trying to do this just now).
import base64
import requests

url = 'https://api.anbima.com.br/oauth/access-token'
http = 'https://api-sandbox.anbima.com.br/feed/precos-indices/v1/titulos-publicos/pu-intradiario'

client_id = "oLRa*******"
client_secret = "6U2nefG*****"

# Basic-auth credential is base64("client_id:client_secret").
client_credentials = f"{client_id}:{client_secret}".encode('ascii')
senhabytes = base64.b64encode(client_credentials)
senha = base64.b64decode(senhabytes)  # round-trip only to eyeball the value
print(senhabytes, senha)

body = {
    "grant_type": "client_credentials"
}
headers = {
    'content-type': 'application/json',
    # Fixed: use the freshly computed credential instead of a stale
    # copy-pasted base64 string that goes out of date with the secrets above.
    'Authorization': 'Basic ' + senhabytes.decode('ascii'),
}
request = requests.post(url, headers=headers, json=body, allow_redirects=True)
informacoes = request.json()
token = informacoes['access_token']

headers2 = {
    "content-type": "application/json",
    "client_id": f"{client_id}",
    "access_token": f"{token}"
}
titulos = requests.get(http, headers=headers2)
# Fixed: was `fundos.json()` — `fundos` is undefined here (NameError).
titulos = titulos.json()
I used your code as a model, then I made some changes. I printed the encoded client_id:client_secret and then copied and pasted it into the headers.
I also changed the `data` parameter to `json`.

Pagination SendinBlue Api Call

I've been trying to get data from the SendinBlue API. The problem is that the API has a limit of 100 records per call, and my Python loop is not working properly. This is what I have so far; the call works fine.
import requests
import pandas as pd
from pandas import json_normalize
import json

results = []
pagination = 0

url = "https://api.sendinblue.com/v3/smtp/statistics/events"
querystring = {"limit": "100", "offset": pagination, "days": "15"}
headers = {
    "Accept": "application/json",
    "api-key": "XXXXXXX",
}

# API response
response = requests.request("GET", url, headers=headers, params=querystring)

# JSON text -> dict
data = json.loads(response.text)

# dict -> DataFrame, one row per entry under 'events'
base = pd.json_normalize(data, record_path='events')
The data structure is like this:
{'events': [
{'email': 'chusperu@gmail.com',
'date': '2020-10-18T17:18:58.000-05:00',
'subject': 'Diego, ¡Gracias por registrarte! 😉',
'messageId': '<202010181429.12179607081@smtp-relay.mailin.fr>',
'event': 'opened',
'tag': '',
'from': 'ventas01@grupodymperu.com'},
{'email': 'cynthiaapurimac@gmail.com',
'date': '2020-10-18T17:52:56.000-05:00',
'subject': 'Alvarado, ¡Gracias por registrarte! 😉',
'messageId': '<202010182252.53640747487@smtp-relay.mailin.fr>',
'event': 'requests',
'tag': '',
'from': 'ventas01@grupodymperu.com'},
....
The loop I have tried is this, but it only paginates the first 200 records. What am I doing wrong?
# Collect the first page, then keep fetching 100-record pages until the API
# returns a page with no events or a non-200 status.
# Fixed: the original used `while ... else: break`, which is a SyntaxError
# (`break` is not allowed in a loop's else clause) and had no stop condition
# for empty pages.
for event in data['events']:
    results.append(event)

while response.status_code == 200:
    pagination += 100
    querystring['offset'] = pagination
    response = requests.request("GET", url, headers=headers, params=querystring)
    data = json.loads(response.text)
    if not data.get('events'):
        break
    for event in data['events']:
        results.append(event)

print(results)
Finally got it.
import requests
import pandas as pd
from pandas import json_normalize
import json

# Excel = "C:/Users/User/PycharmProjects/Primero/DataSendin.xlsx"
pagination = 0

url = "https://api.sendinblue.com/v3/smtp/statistics/events"
querystring = {"limit": "100", "offset": f"{pagination}", "days": "3"}
headers = {
    "Accept": "application/json",
    "api-key": "Your API key",
}

response = requests.request("GET", url, headers=headers, params=querystring)

# API response
try:
    # JSON text -> dict; every page is kept in `pages`
    pages = []
    page = json.loads(response.text)
    pages.append(page)
    if not page:
        print("no hay data")
    else:
        # Advance the offset 100 records at a time until an empty page
        # or a non-200 status arrives.
        while response.status_code == 200:
            pagination += 100
            querystring['offset'] = pagination
            response = requests.request("GET", url, headers=headers, params=querystring)
            page = json.loads(response.text)
            pages.append(page)
            if not page:
                break
except ValueError:
    "no data"

# dict pages -> single DataFrame (drop empty pages first)
results = pages
final = list(filter(None, results))
base = pd.json_normalize(final, record_path='events')
base

Looping through an array of API values for API GET request in Python

I have an array of ice cream flavors I want to iterate over for an API GET request. How do I loop through an array such as [vanilla, chocolate, strawberry] using the standard API request below?
import requests

# Single hard-coded flavor request against the example endpoint.
url = "https://fakeurl.com/values/icecreamflavor/chocolate?"
payload = {}
headers = {
    'Authorization': 'Bearer (STRING)',
    '(STRING)': '(STRING)',
}

response = requests.request("GET", url, headers=headers, data=payload)

# Keep the raw body as UTF-8 bytes.
my_list = response.text.encode('utf8')
You could try string formatting on your URL: loop through your array of ice-cream flavors, substitute each flavor into the URL, and perform the API GET request on the resulting URL.
import requests

iceCreamFlavors = ["vanilla", "chocolate", "strawberry"]

# URL template: {flavor} is filled in per request.
url = "https://fakeurl.com/values/icecreamflavor/{flavor}?"
payload = {}
headers = {
    'Authorization': 'Bearer (STRING)',
    '(STRING)': '(STRING)',
}

# One GET per flavor; collect each raw response body as UTF-8 bytes.
my_list = [
    requests.request("GET", url.format(flavor=flavor), headers=headers, data=payload).text.encode('utf8')
    for flavor in iceCreamFlavors
]

Differences between Python's urllib, urllib2 and requests libraries

I have 2 scripts which submit a POST request with Ajax parameters.
One script uses the requests library (the script that works).
The other one uses urllib & urllib2.
At the moment I have no idea why the urllib script does not work.
Can anybody help?
Script using Request:
import requests
s = requests.Session()
url = "https://www.shirtinator.de/?cT=search/motives&sq=junggesellenabschied"
data1 = {
'xajax': 'searchBrowse',
'xajaxr': '1455134430801',
'xajaxargs[]': ['1', 'true', 'true', 'motives', '100'],
}
r = s.post(url, data=data1, headers={'X-Requested-With': 'XMLHttpRequest'}, verify=False)
#r = requests.post(url, data=data1, headers={'X-Requested-With': 'XMLHttpRequest'}, verify=False)
#r = requests.post(url, verify=False)
result = r.text
print result
print result.count("motiveImageBox")
Script using urllib:
import urllib2
import urllib
#
url = "https://www.shirtinator.de/?cT=search/motives&sq=junggesellenabschied"
data = ({
'xajax': 'searchBrowse',
'xajaxr': '1455134430801',
'xajaxargs[]': ['1', 'true', 'true', 'motives', '100'],
})
encode_data = urllib.urlencode(data)
print encode_data
req = urllib2.Request(url,encode_data)
response = urllib2.urlopen(req)
d = response.read()
print d
print d.count("motiveImageBox")

Python: how to send html-code in a POST request

I am trying to send HTML code from a Python script to a Joomla site.
# HTML fragment to store in the Joomla article.
description = "<h1>Header</h1><p>text</p>"

# Form fields for the POST body.
# NOTE(review): the key 'id = ' looks like it was meant to be just 'id' — confirm.
values = {
    'description': description.encode(self.encoding),
    'id = ': 5,
}

# Encode the form and convert to bytes in the configured encoding.
binData = urlencode(values).encode(self.encoding)

headers = {
    'User-Agent': self.userAgent,
    'X-Requested-With': 'XMLHttpRequest',
}

req = urllib2.Request(self.addr, binData, headers)
response = urllib2.urlopen(req)
rawreply = response.read()
At the Joomla server I got the same string, but without the HTML:
$desc = JRequest::getVar('description', '', 'POST');
What's wrong?
You should use requests
pip install requests
then
import requests
description = "<h1>Header</h1><p>text</p>"
values = dict(description='description',id=5)
response = requests.post(self.addr,data=values)
if response.ok:
print response.json()
or, if Joomla didn't return JSON:
print response.content
JRequest::getVar and JRequest::getString filter out HTML code, but this can be turned off:
$desc = JRequest::getVar('description', '', 'POST', 'string', JREQUEST_ALLOWHTML);

Categories

Resources