Python 301 POST

Python 301 POST - python

So basically I'm trying to make a request to this website - https://panel.talonro.com/login/ which is supposed to be 301 redirect.
I send data as I should but in the end there is no Location header in my request and status code is 200 instead of 301.
I can't figure out what I am doing wrong. Please help
def do_request():
req = requests.get('https://panel.talonro.com/login/').text
soup = BeautifulSoup(req, 'html.parser')
csrf = soup.find('input', {'name':'csrfKey'}).get('value')
ref = soup.find('input', {'name':'ref'}).get('value')
post_data = {
'auth':'mylogin',
'password':'mypassword',
'login__standard_submitted':'1',
'csrfKey':csrf,
'ref':ref,
'submit':'Go'
}
post = requests.post(url = 'https://forum.talonro.com/login/', data = post_data, headers = {'referer':'https://panel.talonro.com/login/'})

Right now push_data is in do_request(), so you cannot access it outside of that function.
Instead, try this where you return that info and then pass it in:
import requests
from bs4 import BeautifulSoup
def do_request():
req = requests.get('https://panel.talonro.com/login/').text
soup = BeautifulSoup(req, 'html.parser')
csrf = soup.find('input', {'name':'csrfKey'}).get('value')
ref = soup.find('input', {'name':'ref'}).get('value')
post_data = {
'auth':'mylogin',
'password':'mypassword',
'login__standard_submitted':'1',
'csrfKey':csrf,
'ref':ref,
'submit':'Go'
}
return post_data
post = requests.post(url = 'https://forum.talonro.com/login/', data = do_request(), headers = {'referer':'https://panel.talonro.com/login/'})

Related

Python - Scrapping Woocommerce does not bring text from price

i am working in a price update control between the web from my work and the Tango database (our management/administration system).
Because of that, i have to scrap prices from our web site iwth Python. But
i am having troubles while scraping woocommerce price text. I tried to scrape with requests html and with BeautifulSoup libraries but both brings (direct from source) the "bdi" price text as $0.00:
For example: https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg
Script de requests_html:
from requests_html import HTMLSession
import csv
import time
link = 'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg'
s = HTMLSession()
r = s.get(link)
#print(r.text)
title = r.html.find('h1', first=True).full_text
price = r.html.find('span.woocommerce-Price-amount.amount bdi')[0].full_text
print(price)
price = r.html.find('span.woocommerce-Price-amount.amount bdi')[1].full_text
print(price)
Result:
$0.00
$0.00
Script de BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?attribute_pa_medida-1=3&attribute_pa_espesor=2-85&attribute_pa_unidad=kg")
soup = BeautifulSoup(page.text, "html.parser")
print(soup)
Result:
<span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>0.00</bdi>
PS: i noticed that when the full web site is download it brings all the data and prices (not $0.00), so i do not know why are the libraries failling.
<div class="woocommerce-variation-price"><span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">$</span>325.54</bdi></span> <small class="woocommerce-price-suffix">( IVA incluido )</small></span></div>
Thanks you very much!

You can do it with Selenium. But i show you how to do it with json and bs4.
First we need product id:
def get_id(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
data_product_id = soup.find('form', class_='variations_form').get('data-product_id')
return data_product_id
Then with this ID, we can get price:
def get_price(product_id, payload):
url = "https://hierroscasanova.com.ar/?wc-ajax=get_variation"
payload = f"{payload}&product_id={product_id}"
headers = {
'accept': '*/*',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
response = requests.request("POST", url, headers=headers, data=payload)
json_data = json.loads(response.text)
return json_data['display_price']
Now remains to prepare the parameters for the link, and we can check:
attribute_pa_medida = '1=3'
attribute_pa_espesor = '2-85'
attribute_pa_unidad = 'kg'
attributes = f'attribute_pa_medida-{attribute_pa_medida}&attribute_pa_espesor={attribute_pa_espesor}&attribute_pa_unidad={attribute_pa_unidad}'
url = f'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?{attributes}'
print(get_price(get_id(url), attributes))
UPD full code:
import requests
import json
from bs4 import BeautifulSoup
def get_id(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
data_product_id = soup.find('form', class_='variations_form').get('data-product_id')
return data_product_id
def get_price(product_id, payload):
url = "https://hierroscasanova.com.ar/?wc-ajax=get_variation"
payload = f"{payload}&product_id={product_id}"
headers = {
'accept': '*/*',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
response = requests.request("POST", url, headers=headers, data=payload)
json_data = json.loads(response.text)
return json_data['display_price']
attribute_pa_medida = '1=3'
attribute_pa_espesor = '2-85'
attribute_pa_unidad = 'kg'
attributes = f'attribute_pa_medida-{attribute_pa_medida}&attribute_pa_espesor={attribute_pa_espesor}&attribute_pa_unidad={attribute_pa_unidad}'
url = f'https://hierroscasanova.com.ar/producto/cano-estructural-redondo/?{attributes}'
print(get_price(get_id(url), attributes))

how to extract useful info from output with regular expresson or beautifulsoup

This is a snippet of the code that I have been working on for a few days. It tries to get a list of sub-domains for a domain using dnsdumpster.com. However, I prints a lot of data that I don't even need.
with requests.Session() as s:
url = 'https://dnsdumpster.com'
response = s.get(url, headers=headers, proxies=proxies)
response.encoding = 'utf-8' # Optional: requests infers this internally
soup1 = BeautifulSoup(response.text, 'html.parser')
input = soup1.find_all('input')
csrfmiddlewaretoken_raw = str(input[0])
csrfmiddlewaretoken = csrfmiddlewaretoken_raw[55:119]
data = {
'csrfmiddlewaretoken' : csrfmiddlewaretoken,
'targetip' : domain
}
send_data = s.post(url, data=data, proxies=proxies, headers=headers)
print(send_data.status_code)
soup2 = BeautifulSoup(send_data.text, 'html.parser')
td = soup2.find_all('td', {"class": "col-md-4"})
for i in range(len(td)):
item = str(td[i])
subdomain = item[0:100]
print(subdomain)
And this is the output.
<td class="col-md-4">ns.example.co.eu.<br/>
<a class="external nounderline" data-target="#myModal" d
<td class="col-md-4">0 ex-am-ple.mail.protection.outlook.com.<br/>
<a class="external nounderlin
<td class="col-md-4">blog.example.co.eu<br/>
<a class="external nounderline" data-target="#myModal"
<td class="col-md-4">dari.kardan.edu.af<br/>
I want the sub-domain names without the HTML tags and irrelevant data?
And as you can see, the sub-domain names are not uniform.
Can anyone help me with a regular expression or is there any way I can get the information that I want using BeautifulSoup?

First i try to catch the subdomain then in the last few line i clean the subdomain.
with requests.Session() as s:
url = 'https://dnsdumpster.com'
response = s.get(url, headers=headers, proxies=proxies)
response.encoding = 'utf-8' # Optional: requests infers this internally
soup1 = BeautifulSoup(response.text, 'html.parser')
input = soup1.find_all('input')
csrfmiddlewaretoken_raw = str(input[0])
csrfmiddlewaretoken = csrfmiddlewaretoken_raw[55:119]
data = {
'csrfmiddlewaretoken' : csrfmiddlewaretoken,
'targetip' : domain
}
send_data = s.post(url, data=data, proxies=proxies, headers=headers)
print(send_data.status_code)
soup2 = BeautifulSoup(send_data.text, 'html.parser')
td = soup2.find_all('td', {"class": "col-md-4"})
mydomain = []
for i in range(len(td)):
subdomain = td[i].text.strip()
mydomain.append(subdomain)
# filter_mydomain
filtered_subdomain = []
for sub in mydomain:
if sub.endswith('.'):
res = sub[:len(sub)-1]
filtered_subdomain.append(res)
else:
filtered_subdomain.append(sub)
print(filtered_subdomain)

How do I log into a website using python?

I'm trying to log into a website called Roblox with python. Right now what I have isn't working. When I run the code the output is:
[]
(True, 200)
above I am expecting the first line to say "Hello, awdfsrgsrgrsgsrgsg22!"
code:
import requests
from lxml import html
payload = {
"username": "awdfsrgsrgrsgsrgsg22",
"password": "newyork2000",
"csrfmiddlewaretoken": "WYa7Qp4lu9N6"
}
session_requests = requests.session()
login_url = "https://www.roblox.com/Login"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
result = session_requests.post(
login_url,
data = payload,
headers = dict(referer=login_url)
)
url = 'https://www.roblox.com/home'
result = session_requests.get(
url,
headers = dict(referer = url)
)
tree = html.fromstring(result.content)
welcomeMessage = tree.xpath('//*[#id="HomeContainer"]/div[1]/div/h1/a')
print(welcomeMessage) #expecting "Hello, awdfsrgsrgrsgsrgsg22!"
print(result.ok, result.status_code)

requests failes to keep logged in session

I am trying to scrape some emails from mdpi.com, emails available only to logged in users. But it fails when I am trying to do so. I am getting
when logged out:
Code itself:
import requests
from bs4 import BeautifulSoup
import traceback
login_data = {'form[email]': 'xxxxxxx#gmail.com', 'form[password]': 'xxxxxxxxx', 'remember': 1,}
base_url = 'http://www.mdpi.com'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'}
session = requests.Session()
session.headers = headers
# log_in
s = session.post('https://susy.mdpi.com/user/login', data=login_data)
print(s.text)
print(session.cookies)
def make_soup(url):
try:
r = session.get(url)
soup = BeautifulSoup(r.content, 'lxml')
return soup
except:
traceback.print_exc()
return None
example_link = 'http://www.mdpi.com/search?journal=medsci&year_from=1996&year_to=2017&page_count=200&sort=relevance&view=default'
def article_finder(soup):
one_page_articles_divs = soup.find_all('div', class_='article-content')
for article_div in one_page_articles_divs:
a_link = article_div.find('a', class_='title-link')
link = base_url + a_link.get('href')
print(link)
article_soup = make_soup(link)
grab_author_info(article_soup)
def grab_author_info(article_soup):
# title of the article
article_title = article_soup.find('h1', class_="title").text
print(article_title)
# affiliation
affiliations_div = article_soup.find('div', class_='art-affiliations')
affiliation_dict = {}
aff_indexes = affiliations_div.find_all('div', class_='affiliation-item')
aff_values = affiliations_div.find_all('div', class_='affiliation-name')
for i, index in enumerate(aff_indexes): # 0, 1
affiliation_dict[int(index.text)] = aff_values[i].text
# authors names
authors_div = article_soup.find('div', class_='art-authors')
authors_spans = authors_div.find_all('span', class_='inlineblock')
for span in authors_spans:
name_and_email = span.find_all('a') # name and email
name = name_and_email[0].text
# email
email = name_and_email[1].get('href')[7:]
# affiliation_index
affiliation_index = span.find('sup').text
indexes = set()
if len(affiliation_index) > 2:
for i in affiliation_index.strip():
try:
ind = int(i)
indexes.add(ind)
except ValueError:
pass
print(name)
for index in indexes:
print('affiliation =>', affiliation_dict[index])
print('email: {}'.format(email))
if __name__ == '__main__':
article_finder(make_soup(example_link))
What should I do in order to get what I want?

Ah that is easy, you haven't managed to log in correctly. If you look at the response from your initial call you will see that you are returned the login page HTML instead of the my profile page. The reason for this is that you are not submitted the hidden token on the form.
The solution request the login page, and then use either lxml or BeautifulSoup to parse the hidden input 'form[_token]'. Get that value and then add it to your login_data payload.
Then submit your login request and you'll be in.

Python (Post) submit a form

I am teaching myself with submitting a form on the web
but somehow post is not working.
The url is https://courselist.wm.edu/courselist/
and the code so far is:
from bs4 import BeautifulSoup
import requests
import urllib
import re
url = 'http://courselist.wm.edu/courselist'
with requests.Session() as session:
response = session.get(url)
soup = BeautifulSoup(response.content)
data = {
'term_code' : '201530',
'term_subj' : 'AFST',
'attr' : '0',
'levl' : '0',
'status' : '0'
}
r = session.post(url, data=data)
#response = session.post(url, data=data)
print r.content
#soup = BeautifulSoup(response.content)
#for row in soup.select('table'):
# print [td.text for td in row.find_all('td')]

You cannot submit a form with Beautifulsoup. For this you should use Mechanize. See here an example of how to use it for form submitting.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python 301 POST - python

Related

Python - Scrapping Woocommerce does not bring text from price

how to extract useful info from output with regular expresson or beautifulsoup

How do I log into a website using python?

requests failes to keep logged in session

Python (Post) submit a form

Categories

Resources