I am trying to get the HTML page back from sending a POST request:
import httplib
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

headers = {
    'Host': 'digitalvita.pitt.edu',
    'Connection': 'keep-alive',
    'Content-Length': '325',
    'Origin': 'https://digitalvita.pitt.edu',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
    'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Accept': 'text/javascript, text/html, application/xml, text/xml, */*',
    'Referer': 'https://digitalvita.pitt.edu/index.php',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Cookie': 'PHPSESSID=lvetilatpgs9okgrntk1nvn595'
}

data = {
    'action': 'search',
    'xdata': '<search id="1"><context type="all" /><results><ordering>familyName</ordering><pagesize>100000</pagesize><page>1</page></results><terms><name>d</name><school>All</school></terms></search>',
    'request': 'search'
}

data = urllib.urlencode(data)
print data

req = urllib2.Request('https://digitalvita.pitt.edu/dispatcher.php', data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
print soup
Can anyone tell me how to make it work?
Do not specify a Content-Length header; urllib2 calculates it for you. As it is, your header specifies the wrong length:
>>> data = urllib.urlencode(data)
>>> len(data)
319
Without that header the rest of the posted code works fine for me.
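A minimal sketch of the fix, reusing the headers and data from the question but letting urllib2 fill in the length:

# Drop the hand-written Content-Length; urllib2 derives the correct
# value from the encoded body when sending the request.
headers.pop('Content-Length', None)

body = urllib.urlencode(data)
req = urllib2.Request('https://digitalvita.pitt.edu/dispatcher.php', body, headers)
the_page = urllib2.urlopen(req).read()
print BeautifulSoup(the_page)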
I'm using Selenium with Python, and I'm trying to scrape this page: https://www.vexforum.com/u?period=all. I want to get the data for all 40,000 or so users on this forum, but it only loads 50 initially; you can keep scrolling on the page to load all of the forum's members. Is there any way to request the entire page initially, with all 40k members? Thanks for any help you can provide!
You should use requests (if the site's robots.txt allows it):
import requests

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Cookie': '_ga=GA1.2.439277064.1611329580; _gat=1; _gid=GA1.2.1557861689.1611329580',
    'Referer': 'https://www.vexforum.com/u?period=all',
    'Host': 'www.vexforum.com',
    'Accept-Language': 'it-it',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'X-CSRF-Token': 'undefined',
    'Discourse-Present': 'true',
    'X-Requested-With': 'XMLHttpRequest',
}

count = 2
while True:
    params = {
        'order': 'likes_received',
        'page': str(count),
        'period': 'all'
    }
    # Let requests build the query string; hardcoding page=2 in the URL
    # would fetch the same page forever.
    r = requests.get('https://www.vexforum.com/directory_items', headers=headers, params=params)
    if r.status_code != 200:
        break  # stop on errors instead of swallowing them with a bare except
    print(r.json())
    print('\n\n\n')
    print('___________________________________________________')
    print('\n\n\n')
    count += 1
Now you only have to parse the JSON response and grab the information you want.
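For example, a hedged parsing sketch (the directory_items and user field names assume the usual Discourse directory JSON shape; verify them against a real response):

def extract_users(payload):
    # Assumed Discourse shape:
    # {"directory_items": [{"likes_received": ..., "user": {"username": ...}}]}
    users = []
    for item in payload.get('directory_items', []):
        user = item.get('user', {})
        users.append((user.get('username'), item.get('likes_received')))
    return users

# Inside the request loop above:
# users = extract_users(r.json())
# if not users:
#     break  # an empty page suggests the last page was reached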
In the following code, I am trying to send a POST request to a Microsoft online account, starting with a page that requires posting an email address. This is my attempt so far:
import requests
from bs4 import BeautifulSoup

url = 'https://moe-register.emis.gov.eg/account/login?ReturnUrl=%2Fhome%2FRegistrationForm'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,ar;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': '__RequestVerificationToken=vdS3aPPg5qQ2bH9ADTppeKIVJfclPsMI6dqB6_Ru11-2XJPpLfs7jBlejK3n0PZuYl-CwuM2hmeCsXzjZ4bVfj2HGLs2KOfBUphZHwO9cOQ1; .AspNet.MOEEXAMREGFORM=ekeG7UWLA6OSbT8ZoOBYpC_qYMrBQMi3YOwrPGsZZ_3XCuCsU1BP4uc5QGGE2gMnFgmiDIbkIk_8h9WtTi-P89V7ME6t_mBls6T3uR2jlllCh0Ob-a-a56NaVNIArqBLovUnLGMWioPYazJ9DVHKZY7nR_SvKVKg2kPkn6KffkpzzHOUQAatzQ2FcStZBYNEGcfHF6F9ZkP3VdKKJJM-3hWC8y62kJ-YWD0sKAgAulbKlqcgL1ml6kFoctt2u66eIWNm3ENnMbryh8565aIk3N3UrSd5lBoO-3Qh8jdqPCCq38w3cURRzCd1Z1rhqYb3V2qYs1ULRT1_SyRXFQLrJs5Y9fsMNkuZVeDp_CKfyzM',
    'Host': 'moe-register.emis.gov.eg',
    'Origin': 'https://moe-register.emis.gov.eg',
    'Referer': 'https://moe-register.emis.gov.eg/account/login?ReturnUrl=%2Fhome%2FRegistrationForm',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

with requests.Session() as s:
    # r = s.post(url)
    # soup = BeautifulSoup(r.content, 'lxml')
    data = {'EmailAddress': '476731809#matrouh1.moe.edu.eg'}
    r_post = s.post(url, data=data, headers=headers, verify=False)
    soup = BeautifulSoup(r_post.content, 'lxml')
    print(soup)
What I get back is the same page asking for the email again; I expected to get the page that asks for the sign-in password.
An example of the email that needs to be posted is 476731809#matrouh1.moe.edu.eg.
I have also tried code like this before, but I got the sign-in page again (although the credentials are correct).
Can you please try this code:
import requests
from bs4 import BeautifulSoup

url = 'https://login.microsoftonline.com/common/login'
s = requests.Session()
res = s.get('https://login.microsoftonline.com')
cookies = dict(res.cookies)
res = s.post(url,
             auth=('476731809#matrouh1.moe.edu.eg', 'Std#050202'),
             verify=False,
             cookies=cookies)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)
I checked out the page and the following seems to work:
import requests

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://moe-register.emis.gov.eg',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://moe-register.emis.gov.eg/account/login',
    'Accept-Language': 'en-US,en;q=0.9,gl;q=0.8,fil;q=0.7,hi;q=0.6',
}

data = {
    'EmailAddress': '476731809#matrouh1.moe.edu.eg'
}

response = requests.post('https://moe-register.emis.gov.eg/account/authenticate', headers=headers, data=data, verify=False)
Your POST endpoint seems to be wrong: you need to go from /login to /authenticate to proceed with the request. (I am on a Mac, so my User-Agent may be different from yours or from what is required; you can change that in the headers variable.)
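To sanity-check the result, you could inspect the status code and the returned page, for example (assuming a successful post returns the registration form rather than the login page again):

from bs4 import BeautifulSoup

print(response.status_code)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.title)  # should differ from the login page's title if it worked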
import requests as req

url = 'https://servidor.aternos.me/'  # the page being requested
html = req.get(url)
texto = html.text
print(texto)
I can't get all of the HTML with Python Requests; it only gets a little part of the HTML file.
You need to add headers to your request to obtain a response like in your browser. Try:
import requests as req

headers = {
    'Host': 'servidor.aternos.me',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'axcaccess=e75a91ffeda624d3a1e24c1d9fb31734',
    'Upgrade-Insecure-Requests': '1'
}

url = 'https://servidor.aternos.me/'
html = req.get(url, headers=headers, timeout=10.)
print(html.status_code)
texto = html.text
print(texto)
I am using Requests to parse some data on a server. However, I keep getting a 503 response. The request headers have cookies in them, but my method does not seem to be handling them properly.
I am also a bit confused as to what I should be doing with cookies, and when, full stop. The website is http://epgservices.sky.com/nevermiss/ and my code is below.
Headers and params look correct when viewed in Google Dev Tools, except that the cookies are missing when I use Requests. Any ideas?
import json
import requests
from urllib3.util import Retry
from requests.adapters import HTTPAdapter
from requests import Session, exceptions
import re
import traceback
from cookielib import LWPCookieJar

class sky_ondemand:
    session = requests.Session()
    jar = session.cookies
    url = 'http://epgservices.sky.com'
    movie_path = ''.join(movie_path)  # movie_path is defined elsewhere in the original code
    headers = {
        'Host': 'epgservices.sky.com',
        'Connection': 'keep-alive',
        'Accept': 'application/json, text/javascript, */*',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
        'Referer': 'http://epgservices.sky.com/never-miss/index.htm',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
    }
    params = {
        'queryType': 'movieQuery',
        'query': '',
        'exactMatch': 'false',
        'genre': '',
        'startsWith': 'all',
        'sortBy': 'requested',
        'pageNum': '1',
        'pageSize': '10',
        'src': 'movieLetterButton'
    }
    r = session.get(url, params=params, headers=headers, cookies=jar)
    data = r.content
    print(data)
Sorted this, if anyone is interested... it was nothing to do with the cookies; the URL should have been 'http://epgservices.sky.com/tvlistings-proxy/NeverMissProxy/neverMissMovieSearchRequest.json?'
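For reference, a minimal sketch against the corrected endpoint, reusing the params and headers defined in the question (and assuming the .json path really returns JSON):

search_url = 'http://epgservices.sky.com/tvlistings-proxy/NeverMissProxy/neverMissMovieSearchRequest.json'
r = session.get(search_url, params=params, headers=headers)
print(r.json())  # assumption: the endpoint returns a JSON document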
I'm trying to login into a website using a post request like this:
import requests

cookies = {
    '_SID': 'c1i73k2mg3sj0ugi5ql16c3sp7',
    'isCookieAllowed': 'true',
}

headers = {
    'Host': 'service.premiumsim.de',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://service.premiumsim.de/',
    'Content-Type': 'application/x-www-form-urlencoded',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

data = [
    ('_SID', 'c1i73k2mg3sj0ugi5ql16c3sp7'),
    ('UserLoginType[alias]', 'username'),
    ('UserLoginType[password]', 'password'),
    ('UserLoginType[logindata]', ''),
    ('UserLoginType[_token]', '1af70f3d0e5b9e6c39e1475b6d84e9d125d076de'),
]

requests.post('https://service.premiumsim.de/public/login_check', headers=headers, cookies=cookies, data=data)
The problem is 'UserLoginType[_token]' in the form data. The above code works, but I don't have that token and have no clue how to generate it, so when I do my request without the _token, the request fails.
Google did not turn up any helpful information about UserLoginType.
Does anyone know how to generate it (e.g. with another request first) to be able to log in?
Edit:
Thanks to t.m.adam's suggestion, I used bs4 to get the token:
import requests
from bs4 import BeautifulSoup as bs

tokenRequest = requests.get('https://service.premiumsim.de')
html_bytes = tokenRequest.text
soup = bs(html_bytes, 'lxml')
token = soup.find('input', {'id': 'UserLoginType__token'})['value']
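Putting it together, a hedged sketch of the full flow: one Session so the cookies and the scraped token belong together, with the form field names taken from the original request (whether _SID must also be sent as a form field depends on the site):

import requests
from bs4 import BeautifulSoup as bs

# One session keeps the _SID cookie and the CSRF token consistent.
with requests.Session() as s:
    page = s.get('https://service.premiumsim.de')
    soup = bs(page.text, 'lxml')
    token = soup.find('input', {'id': 'UserLoginType__token'})['value']

    data = {
        'UserLoginType[alias]': 'username',      # real credentials go here
        'UserLoginType[password]': 'password',
        'UserLoginType[logindata]': '',
        'UserLoginType[_token]': token,          # the freshly scraped token
    }
    r = s.post('https://service.premiumsim.de/public/login_check', data=data)
    print(r.status_code)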