I'm trying to get data from eToro. This link works in my browser https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo but it returns 403 Forbidden via requests.get(), even when I add a user agent, headers, and even cookies.
import requests
url = "https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo"
headers = {
'Host': 'www.etoro.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Referer': 'https://www.etoro.com/people/viveredidividend/chart',
'Cookie': 'XXX',
'TE': 'Trailers'
}
requests.get(url, headers=headers)
>>> <Response [403]>
How can I solve this without Selenium?
This error occurs when the request is not authenticated. When you log in on the website, the browser stores the authentication and reuses it on later requests, which is why the same URL works fine in the browser.
To solve this, you first need to authenticate from within your Python code.
To authenticate:
import requests

# HTTP Basic authentication; this only helps if the endpoint
# actually accepts Basic auth credentials
response = requests.get(url, auth=(username, password))
The 403 error tells you that the request you are making is being blocked. The website is protected by Cloudflare, which prevents it from being scraped. You can confirm this by executing print(response.text) in your code: you'll see Access denied | www.etoro.com used Cloudflare to restrict access in the title tag of the returned Cloudflare HTML.
Under the hood, every request first passes through Cloudflare's servers, which verify whether it is coming from a real browser. Only if the request passes this verification is it forwarded to the website's server, which then returns the valid response. Otherwise, Cloudflare blocks the request.
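As a quick check, a minimal sketch reusing the url and headers from the question:
import requests

response = requests.get(url, headers=headers)
# Cloudflare's block page comes back as a 403 and mentions cloudflare
if response.status_code == 403 and 'cloudflare' in response.text.lower():
    print('Blocked by Cloudflare, not by the site itself')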
It's difficult to bypass Cloudflare. Nevertheless, you can try your luck with the code given below.
Code
import urllib.request
url = 'https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo'
headers = {
'authority': 'www.etoro.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/json, text/plain, */*',
'accounttype': 'Real',
'applicationidentifier': 'ReToro',
'sec-ch-ua-mobile': '?0',
'applicationversion': '331.0.2',
'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.etoro.com/discover/markets/cryptocurrencies',
'accept-language': 'en-US,en;q=0.9',
'cookie': '__cfruid=e7f40231e2946a1a645f6fa0eb19af969527087e-1624781498; _gcl_au=1.1.279416294.1624782732; _gid=GA1.2.518227313.1624782732; _scid=64860a19-28e4-4e83-9f65-252b26c70796; _fbp=fb.1.1624782732733.795190273; __adal_ca=so%3Ddirect%26me%3Dnone%26ca%3Ddirect%26co%3D%28not%2520set%29%26ke%3D%28not%2520set%29; __adal_cw=1624782733150; _sctr=1|1624732200000; _gaexp=GAX1.2.eSuc0QBTRhKbpaD4vT_-oA.18880.x331; _hjTLDTest=1; _hjid=bb69919f-e61b-4a94-a03b-db7b1f4ec4e4; hp_preferences=%7B%22locale%22%3A%22en-gb%22%7D; funnelFromId=38; eToroLocale=en-gb; G_ENABLED_IDPS=google; marketing_visitor_regulation_id=10; marketing_visitor_country=96; __cflb=0KaS4BfEHptJdJv5nwPFxhdSsqV6GxaSK8BuVNBmVkuj6hYxsLDisSwNTSmCwpbFxkL3LDuPyToV1fUsaeNLoSNtWLVGmBErMgEeYAyzW4uVUEoJHMzTirQMGVAqNKRnL; __cf_bm=6ef9d6f250ee71d99f439672839b52ac168f7c89-1624785170-1800-ASu4E7yXfb+ci0NsW8VuCgeJiCE72Jm9uD7KkGJdy1XyNwmPvvg388mcSP+hTCYUJvtdLyY2Vl/ekoQMAkXDATn0gyFR0LbMLl0b7sCd1Fz/Uwb3TlvfpswY1pv2NvCdqJBy5sYzSznxEsZkLznM+IGjMbvSzQffBIg6k3LDbNGPjWwv7jWq/EbDd++xriLziA==; _uetsid=2ba841e0d72211eb9b5cc3bdcf56041f; _uetvid=2babee20d72211eb97efddb582c3c625; _ga=GA1.2.1277719802.1624782732; _gat_UA-2056847-65=1; __adal_ses=*; __adal_id=47f4f887-c22b-4ce0-8298-37d6a0630bdd.1624782733.2.1624785174.1624782818.770dd6b7-1517-45c9-9554-fc8d210f1d7a; _gat=1; TS01047baf=01d53e5818a8d6dc983e2c3d0e6ada224b4742910600ba921ea33920c60ab80b88c8c57ec50101b4aeeb020479ccfac6c3c567431f; outbrain_cid_fetch=true; _ga_B0NS054E7V=GS1.1.1624785164.2.1.1624785189.35; TMIS2=9a74f8b353780f2fbe59d8dc1d9cd901437be0b823f8ee60d0ab36264e2503993c5e999eaf455068baf761d067e3a4cf92d9327aaa1db627113c6c3ae3b39cd5e8ea5ce755fb8858d673749c5c919fe250d6297ac50c5b7f738927b62732627c5171a8d3a86cdc883c43ce0e24df35f8fe9b6f60a5c9148f0a762e765c11d99d; mp_dbbd7bd9566da85f012f7ca5d8c6c944_mixpanel=%7B%22distinct_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24device_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%7D',
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
print(response.decode('utf-8'))
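If that still returns 403 (the __cf_bm cookie above expires quickly), you can also try the third-party cloudscraper package, which attempts to pass Cloudflare's browser checks for you. A minimal sketch, assuming pip install cloudscraper:
import cloudscraper

# create_scraper() returns a requests.Session-like object that tries
# to solve Cloudflare's anti-bot challenge automatically
scraper = cloudscraper.create_scraper()
response = scraper.get('https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo')
print(response.status_code)
print(response.text)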
Related
I want to access a site with Python requests, but I get a 403 error even though I copied the browser's headers and used them. Here is my code. Can anybody solve this problem?
import requests
Url = 'https://bama.ir/'
session = requests.Session()
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
'Accept-Language': 'en-US,en;q=0.5',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Connection': 'keep-alive',
}
session.headers = headers
r = session.get(Url, headers=headers)
It seems like python-requests is getting detected.
You might try using the answer provided here.
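A frequent cause of this is TLS fingerprinting: the server recognizes the TLS handshake of python-requests even when the headers match a browser's. One workaround (an assumption about the cause, not a confirmed diagnosis, and not necessarily what the linked answer describes) is to impersonate a real browser's fingerprint with the third-party curl_cffi package:
from curl_cffi import requests as curl_requests

# impersonate='chrome' mimics Chrome's TLS/HTTP2 fingerprint, which
# often gets past fingerprint-based bot detection
r = curl_requests.get('https://bama.ir/', impersonate='chrome')
print(r.status_code)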
I'm trying to log in to this website (https://phishtank.org/login.php) to use it with Python, but the site uses cookies. I tried this:
import requests
cookies = {
'PHPSESSID': '3hdp8jeu933e8t4hvh240i8rp840p06j',
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
'Referer': 'https://phishtank.org/login_required.php',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Cache-Control': 'max-age=0',
'TE': 'trailers',
}
response = requests.get('https://phishtank.org/add_web_phish.php', headers=headers, cookies=cookies)
print(response.text)
It works, but after a few minutes the cookie just expires. What can I do to avoid this limitation? Maybe something that requests new cookies for me and uses them.
Use requests sessions instead. A session persists cookies for you:
import requests

session = requests.Session()
session.headers.update(headers)  # the headers dict from your code
session.cookies.update(cookies)  # initial cookies; the session keeps them updated
session.get(url)  # later requests through this session reuse its cookies
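A session only persists cookies it receives, though; it cannot stop the server from expiring them. To obtain fresh cookies automatically, log in through the session itself instead of pasting a PHPSESSID. A sketch, assuming the login form posts username and password fields to login.php (the field names are assumptions; check the real ones in your browser's devtools):
import requests

session = requests.Session()
# hypothetical form field names; inspect the actual login form to confirm
session.post('https://phishtank.org/login.php',
             data={'username': 'my_user', 'password': 'my_pass'})
# the session now holds whatever cookies the server set during login
response = session.get('https://phishtank.org/add_web_phish.php')
print(response.status_code)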
I am still a beginner at web scraping. I am trying to extract data from an API, but the problem is that it requires a Bearer token, and this token changes after 5 to 6 hours, so I have to go back to the web page and copy the token again. Is there any way to extract the data without having to open the web page and copy the token every time?
I also found this info on the network request; someone told me that I could use the refresh_token to get access, but I don't know how to do that.
Cache-Control: no-cache,
Connection: keep-alive,
Content-Length: 177,
Content-Type: application/json;charset=UTF-8,
Cookie: dhh_token=; refresh_token=; _hurrier_session=81556f54bf555a952d1a7f780766b028,
dnt: 1
import requests
import json
import pandas as pd
from time import sleep

def make_request():
    headers = {
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'sec-ch-ua': '^\\^',
        'Accept': 'application/json',
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJMdXRiZlZRUVZhWlpmNTNJbGxhaXFDY3BCVTNyaGtqZiIsInN1YiI6MzEzMTcwLCJleHAiOjE2MjQzMjU2NDcsInJvbCI6ImRpc3BhdGNoZXIiLCJyb2xlcyI6WyJodXJyaWVyLmRpc3BhdGNoZXIiLCJjb2QuY29kX21hbmFnZXIiXSwibmFtIjoiRXNsYW0gWmVmdGF3eSIsImVtYSI6ImV6ZWZ0YXd5QHRhbGFiYXQuY29tIiwidXNlcm5hbWUiOiJlemVmdGF3eUB0YWxhYmF0LmNvbSIsImNvdW50cmllcyI6WyJrdyIsImJoIiwicWEiLCJhZSIsImVnIiwib20iLCJqbyIsInEyIiwiazMiXX0.XYykBij-jaiIS_2tdqKFIfYGfw0uS0rKmcOTSHor8Nk',
        'sec-ch-ua-mobile': '?0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Content-Type': 'application/json;charset=UTF-8',
        'Origin': 'url',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'url',
        'Accept-Language': 'en-US,en;q=0.9,ar-EG;q=0.8,ar;q=0.7',
        'dnt': '1',
    }
    data = {
        'status': 'picked'
    }
    response = requests.post('url/api', headers=headers, json=data)
    print(response.text)
    return json.loads(response.text)

def extract_data(row):
    data_row = {
        'order_id': row['order']['code'],
        'deedline': row['order']['deadline'].split('.')[0],
        'picked_at': row['picked_at'].split('.')[0],
        'picked_by': row['picked_by'],
        'processed_at': row['processed_at'],
        'type': row['type']
    }
    return data_row

def periodique_extract(delay):
    extract_count = 0
    while True:
        extract_count += 1
        data = make_request()
        if extract_count == 1:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a')
        else:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a', header=False)
        print('extracting data {} times'.format(extract_count))
        sleep(delay)

periodique_extract(60)
# note: the website tracks live operations, so I extract data every minute
Sometimes these tokens are set by JavaScript execution and automatically added to API requests. That means you need to open the page in something that actually runs the JavaScript in order to get the token, i.e. actually open the page in a browser.
One solution could be to use something like Selenium or Puppeteer to open the page whenever the token expires, get a new token, and feed it to your script. This depends on the specifics of the page, and without a link the correct solution is difficult to say. But if opening the page in your browser, copying the token, and then running your script works, then this approach is very likely to work as well.
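As a rough sketch of that approach, assuming the app keeps the bearer token in localStorage (the login URL and the 'token' key are assumptions; find the real ones in your browser's devtools):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # hypothetical login page
# ... log in here and wait until the web app has fetched its token ...
# many single-page apps keep the bearer token in localStorage;
# the key name 'token' is an assumption
token = driver.execute_script("return window.localStorage.getItem('token');")
driver.quit()

headers = {'Authorization': 'Bearer {}'.format(token)}  # reuse in make_request()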
I am using Locust to try to perform a load test. I can't get a good (200) response from the API we are using. It continually gives me:
{
"message": "Invalid API key",
"status": 400
}
However, using the same information in Postman generates a proper response. The POST is a cross-site request, so it's not going to the host defined for Locust. I have replaced any sensitive info with Redacted. So what am I doing wrong? Any help appreciated.
Code Example:
targetURL = 'https://Redacted, name=https:Redacted'
searchBody1 = {"params": "facets=%5B%22Property%20Type%22%2C%22amenities.Property%20Amenities%22%2C%22amenities.Suitability%22%2C%22amenities.Area%20Activities%22%2C%22Bedrooms%22%2C%22Total%20Beds%22%2C%22Bathrooms%22%5D&hitsPerPage=0"}
searchHeader1 = {'Host': 'Redacted',
'Connection': 'keep-alive',
'Content-Length': '223',
'accept': 'application/json',
'Origin': 'Redacted',
'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G930U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.93 Mobile Safari/537.36',
'content-type': 'application/x-www-form-urlencoded',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'cors',
'Referer': 'Redacted',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9'}
response = self.client.post(url=targetURL, json=searchBody1, headers=searchHeader1, catch_response=True)
You're setting the url parameter to literally be 'https://Redacted, name=https:Redacted'. Python will not expand that to mean two different parameters, as you seem to have assumed.
You should specify name as an independent parameter in the call to post(), like this:
self.client.post(url='https://Redacted', name='your_short_name', ...)
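Put into a runnable shape, the corrected call inside a Locust user class might look like this (the url and name values are placeholders; searchBody1 and searchHeader1 are the dicts defined in the question):
from locust import HttpUser, task

class SearchUser(HttpUser):
    @task
    def search(self):
        # url and name are separate keyword arguments
        with self.client.post(
            url='https://Redacted',   # the real endpoint
            name='search_request',    # hypothetical short label for the stats page
            json=searchBody1,         # dict from the question
            headers=searchHeader1,    # dict from the question
            catch_response=True,
        ) as response:
            if response.status_code != 200:
                response.failure('got {}'.format(response.status_code))
            else:
                response.success()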
Hello, I am trying to log in to https://www.neighborwho.com using Requests for Python, but the website's response keeps telling me that it cannot find any user with my username, when in fact I can log in manually with a normal browser. I know I could use a headless browser, or maybe lxml or MechanicalSoup, etc., but I am learning Python and Requests right now, so I want to see if it can be done with Requests.
Here is my code:
import requests
url = 'https://www.neighborwho.com/api/v5/session'
payload = {'user[email]': 'my_username',
'user[password]': 'my_password'}
headers = {'referer': 'https://www.neighborwho.com/app/login',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
'x-requested-with': 'XMLHttpRequest',
'origin': 'https://www.neighborwho.com',
'accept': 'application/json, text/javascript, */*; q=0.01',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'content-length': '451',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin'
}
s = requests.Session()
resp = s.post(url, data=payload, headers=headers)
print(resp.status_code)
print(resp.text)
Here is the output I am getting:
401
{"session":{"errors":"We do not see an account that matches that
email/password combination. For security reasons we may occasionally reset
passwords. If you have an account that matches the email address
\"my_username\" and need to reset your password, please use the link
below."},"meta":{"status":401}}