I've tried two completely different methods, but I still can't get the data that is only present after logging in.
I tried one approach using requests, but the XPath returns an empty result:
import requests
from lxml import html

USERNAME = "xxx"
PASSWORD = "xxx"

LOGIN_URL = "http://www.reginaandrew.com/customer/account/loginPost/referer/aHR0cDovL3d3dy5yZWdpbmFhbmRyZXcuY29tLz9fX19TSUQ9VQ,,/"
URL = "http://www.reginaandrew.com/gold-leaf-glass-top-table"

def main():
    session_requests = requests.session()

    # Get the login CSRF token (the form_key hidden input)
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    formKey = tree.xpath("//*[@id='login-form']//input[@name='form_key']/@value")
    FormKeyTxt = "".join(formKey)
    #print(FormKeyTxt.replace("['","").replace("']",""))

    # Create payload
    payload = {
        "login[username]": USERNAME,
        "login[password]": PASSWORD,
        "form_key": FormKeyTxt,
        "persistent_remember_me": "checked"
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload)

    # Scrape url (a GET request takes no form data)
    result = session_requests.get(URL)
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//span[contains(@class, 'in-stock')]/text()")
    print(bucket_names)
    print(result)
    print(result.status_code)

if __name__ == '__main__':
    main()
I've tried another one using MechanicalSoup, but it still returns nothing:
import argparse
import mechanicalsoup

parser = argparse.ArgumentParser(description='Log in to the site.')
parser.add_argument("username")
parser.add_argument("password")
args = parser.parse_args()

browser = mechanicalsoup.Browser()
login_page = browser.get("http://www.reginaandrew.com/gold-leaf-glass-top-table")
login_form = login_page.soup.select("#login-form")[0]

# Fill the credentials into the form's input fields
login_form.find("input", {"name": "login[username]"})["value"] = args.username
login_form.find("input", {"name": "login[password]"})["value"] = args.password

page2 = browser.submit(login_form, login_page.url)

messages = page2.soup.find(class_='in-stock1')
if messages:
    print(messages.text)
print(page2.soup.title.text)
I understand the top solution better, so I'd like to do it using that, but is there anything I'm missing? (I'm sure I'm missing a lot.)
This should do it:

import requests
import re

url = "http://www.reginaandrew.com/"

r = requests.session()
rs = r.get(url)

# Cut the login form out of the page, then pull its action URL and form_key
cut = re.search(r'<form.+?id="login-form".+?</form>', rs.text, re.S | re.I).group()
action = re.search(r'action="(.+?)"', cut).group(1)
form_key = re.search(r'name="form_key".+?value="(.+?)"', cut).group(1)

payload = {
    "login[username]": "fugees",
    "login[password]": "nugees",
    "form_key": form_key,
    "persistent_remember_me": "on"
}

rs = r.post(action, data=payload, headers={'Referer': url})
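To confirm the login actually took effect, a quick follow-up check might look like this (a sketch: the product URL comes from the question, and the "in-stock" marker is the asker's assumption about the page markup, not something I've verified):

# Fetch the members-only product page with the now-authenticated session
# and look for the stock indicator the question was after.
rs = r.get("http://www.reginaandrew.com/gold-leaf-glass-top-table")
print(rs.status_code)
print('in-stock' in rs.text)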
I'm trying to log into a website called Roblox with Python. Right now what I have isn't working. When I run the code, the output is:
[]
(True, 200)
Above, I was expecting the first line to say "Hello, awdfsrgsrgrsgsrgsg22!".
code:
import requests
from lxml import html

payload = {
    "username": "awdfsrgsrgrsgsrgsg22",
    "password": "newyork2000",
    "csrfmiddlewaretoken": "WYa7Qp4lu9N6"
}

session_requests = requests.session()
login_url = "https://www.roblox.com/Login"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)

result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)

url = 'https://www.roblox.com/home'
result = session_requests.get(
    url,
    headers=dict(referer=url)
)

tree = html.fromstring(result.content)
welcomeMessage = tree.xpath('//*[@id="HomeContainer"]/div[1]/div/h1/a/text()')
print(welcomeMessage)  # expecting "Hello, awdfsrgsrgrsgsrgsg22!"
print(result.ok, result.status_code)
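One likely culprit: the csrfmiddlewaretoken is hardcoded, and such tokens are usually issued per session, so a stale value gets rejected. A minimal sketch of reading it from the login page for the current session instead (the hidden-input markup is an assumption, not confirmed against Roblox's actual login page):

# Pull a fresh CSRF token out of the login page before posting.
# Assumes it is exposed as <input name="csrfmiddlewaretoken" value="...">.
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
token = tree.xpath('//input[@name="csrfmiddlewaretoken"]/@value')
if token:
    payload["csrfmiddlewaretoken"] = token[0]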
I've written a script in Python to scrape the result that appears after filling in two input boxes, zipcode and distance, with 66109 and 10000. When I fill in the inputs manually, the site displays results, but when I try the same thing with the script I get nothing, and the script throws no error either. What might be the issue here?
I've tried with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    session = requests.Session()
    response = session.post(link, data=payload, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

if __name__ == '__main__':
    get_clinics(url)
I'm only after the line "Within 10000 miles of 66109 there are 383 clinics." that is generated when the search is made.
I changed the URL (putting the search parameters in the query string) and the request method to GET, and it worked for me:
def get_clinics(link):
    session = requests.Session()
    response = session.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

url = 'https://www.sart.org/clinic-pages/find-a-clinic?zip=66109&strdistance=10000&SelectedState=Select+State+or+Region'
get_clinics(url)
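Equivalently, rather than hand-building the query string, you can keep the payload dict from the question and let requests encode it via its params argument (a sketch of the same GET request):

# Same GET request; requests URL-encodes the payload into the query string.
def get_clinics(link):
    session = requests.Session()
    response = session.get(link, params=payload, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    print(soup.select_one(".clinics__search-meta").text)

get_clinics('https://www.sart.org/clinic-pages/find-a-clinic/')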
Including cookies is the main concern here: the initial GET obtains the session cookies that the search POST needs. If you do it the right way, you can get a valid response following the approach you started with. Here is the working code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    with requests.Session() as s:
        res = s.get(link)
        req = s.post(link, data=payload, cookies=res.cookies.get_dict())
        soup = BeautifulSoup(req.text, "lxml")
        item = soup.select_one(".clinics__search-meta").get_text(strip=True)
        print(item)

if __name__ == '__main__':
    get_clinics(url)
I want to get the URLs of the most recent posts of an Instagram user (not me, and I don't have an IG account so I can't use the API). The URLs should be in the style of https://www.instagram.com/p/BpnlsmWgqon/
I've tried making a request with response = requests.get(profile_url) and then parsing the HTML with soup = BeautifulSoup(html, 'html.parser').
After these and some other functions, I end up with a big JSON blob containing data about the most recent pics, but not their URLs.
How can I get the URLs and extract just that?
Edit: this is what I've coded so far. It's a mess; I've been trying many approaches, but none has worked.
#from subprocess import call
#from instagram.client import InstagramAPI
import requests
import json
from bs4 import BeautifulSoup
#from InstagramAPI.InstagramAPI import InstagramAPI
from instagram.client import InstagramAPI
from config import login, password

userid = "6194091573"

#url = "https://www.instagram.com/mercadona.novedades/?__a=1"
#pic_url =
#call('instalooter user mercadona.novedades ./pics -n 2')
#r = requests.get("https://www.instagram.com/mercadona.novedades")
#print(r.text)

def request_pic_url(profile_url):
    response = requests.get(profile_url)
    return response.text

def extract_json(html):
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('body')
    script_tag = body.find('script')
    raw_string = script_tag.text.strip().replace('window._sharedData =', '').replace(';', '')
    return json.loads(raw_string)

def get_recent_pics(profile_url):
    results = []
    response = request_pic_url(profile_url)
    json_data = extract_json(response)
    metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']["edges"]
    for node in metrics:
        node = node.get('node')
        if node and isinstance(node, dict):
            results.append(node)
    return results

def api_thing():
    api = InstagramAPI(login, password)
    recent_media, next_ = api.user_recent_media(userid, 2)
    for media in recent_media:
        print(media.caption.text)

def main():
    userid = "6194091573"
    api_thing()

if __name__ == "__main__":
    main()

def get_large_pic(url):
    return url + "/media/?size=l"

def get_media_id(url):
    req = requests.get('https://api.instagram.com/oembed/?url={}'.format(url))
    media_id = req.json()['media_id']
    return media_id
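For reference, the JSON extracted above may already be enough: each timeline node typically carries a shortcode, and post URLs are simply https://www.instagram.com/p/<shortcode>/. A sketch of that final step on top of get_recent_pics() (the presence of a 'shortcode' field in each node is an assumption about the window._sharedData payload):

# Build post URLs from the shortcodes of the recent timeline nodes.
def get_post_urls(profile_url):
    nodes = get_recent_pics(profile_url)
    return ['https://www.instagram.com/p/{}/'.format(n['shortcode'])
            for n in nodes if 'shortcode' in n]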
I suggest you use the following library: https://github.com/LevPasha/Instagram-API-python
api = InstagramAPI("username", "password")
api.login()

def get_lastposts(us_id):
    api.getUserFeed(us_id)
    if 'items' in api.LastJson:
        info = api.LastJson['items']
        posts = []
        for media in info:
            if media['caption'] is not None:
                #print(media['caption']['media_id'])
                posts.append(media['caption']['media_id'])
        return posts

get_lastposts('user_id')
I am trying to scrape some emails from mdpi.com that are available only to logged-in users, but it fails when I try to do so: I get the same output as when I am logged out.
Code itself:
import requests
from bs4 import BeautifulSoup
import traceback

login_data = {'form[email]': 'xxxxxxx@gmail.com', 'form[password]': 'xxxxxxxxx', 'remember': 1}

base_url = 'http://www.mdpi.com'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0'}

session = requests.Session()
session.headers = headers

# log in
s = session.post('https://susy.mdpi.com/user/login', data=login_data)
print(s.text)
print(session.cookies)

def make_soup(url):
    try:
        r = session.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        return soup
    except:
        traceback.print_exc()
        return None

example_link = 'http://www.mdpi.com/search?journal=medsci&year_from=1996&year_to=2017&page_count=200&sort=relevance&view=default'

def article_finder(soup):
    one_page_articles_divs = soup.find_all('div', class_='article-content')
    for article_div in one_page_articles_divs:
        a_link = article_div.find('a', class_='title-link')
        link = base_url + a_link.get('href')
        print(link)
        article_soup = make_soup(link)
        grab_author_info(article_soup)

def grab_author_info(article_soup):
    # title of the article
    article_title = article_soup.find('h1', class_="title").text
    print(article_title)

    # affiliations
    affiliations_div = article_soup.find('div', class_='art-affiliations')
    affiliation_dict = {}
    aff_indexes = affiliations_div.find_all('div', class_='affiliation-item')
    aff_values = affiliations_div.find_all('div', class_='affiliation-name')
    for i, index in enumerate(aff_indexes):  # 0, 1
        affiliation_dict[int(index.text)] = aff_values[i].text

    # authors' names
    authors_div = article_soup.find('div', class_='art-authors')
    authors_spans = authors_div.find_all('span', class_='inlineblock')
    for span in authors_spans:
        name_and_email = span.find_all('a')  # name and email
        name = name_and_email[0].text
        # email (strip the leading 'mailto:')
        email = name_and_email[1].get('href')[7:]
        # affiliation index
        affiliation_index = span.find('sup').text
        indexes = set()
        if len(affiliation_index) > 2:
            for i in affiliation_index.strip():
                try:
                    ind = int(i)
                    indexes.add(ind)
                except ValueError:
                    pass
        print(name)
        for index in indexes:
            print('affiliation =>', affiliation_dict[index])
        print('email: {}'.format(email))

if __name__ == '__main__':
    article_finder(make_soup(example_link))
What should I do in order to get what I want?
Ah, that one is easy: you haven't managed to log in correctly. If you look at the response from your initial call, you will see that you are returned the login-page HTML instead of your profile page. The reason is that you have not submitted the hidden token on the form.
The solution: request the login page, then use either lxml or BeautifulSoup to parse out the hidden input 'form[_token]'. Take that value and add it to your login_data payload.
Then submit your login request and you'll be in.
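A minimal sketch of that fix, reusing the session and login_data from the question (the field name form[_token] is as described above; the exact login-page markup is assumed):

# Fetch the login page first and copy the hidden CSRF token into login_data.
login_url = 'https://susy.mdpi.com/user/login'
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.content, 'lxml')
token_input = soup.find('input', {'name': 'form[_token]'})
login_data['form[_token]'] = token_input['value']

# Now the login POST should authenticate the session.
s = session.post(login_url, data=login_data)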
I have written code for calling the AlchemyLanguage API of Bluemix in Python. I need the keywords and entities, but it is only showing the first keyword and first entity for the text file. Where am I going wrong?
import urllib
import urllib2

def call_alchemy_api(text, API_KEY):
    payload = {'outputMode': 'json', 'extract': 'entities,keywords', 'sentiment': '1', 'maxRetrieve': '1', 'url': 'https://www.ibm.com/us-en/'}
    payload['apikey'] = API_KEY
    payload['text'] = text
    data = urllib.urlencode(payload)
    url = 'https://gateway-a.watsonplatform.net/calls/text/TextGetCombinedData'
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    return response

if __name__ == "__main__":
    api_key = 'xxxxxxxxxxxxxxxxxxxxxmyapi'
    f = open('in0.txt', 'r')
    text = f.read()
    print text
    response = call_alchemy_api(text, api_key)
    print response.read()
Change the maxRetrieve keyword's value; it controls how many keywords and entities the call returns, and your payload sets it to '1'.
Example:
payload = {'outputMode':'json','extract':'entities,keywords','sentiment':'1','maxRetrieve':'3', 'url':'https://www.ibm.com/us-en/'}
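With maxRetrieve raised, the response should carry lists of results rather than a single hit. A sketch of printing them all (assuming the combined call returns top-level 'keywords' and 'entities' arrays, each entry with a 'text' field, per the API reference linked below):

import json

# Parse the combined-call response and list every keyword and entity.
response = call_alchemy_api(text, api_key)
data = json.loads(response.read())
for kw in data.get('keywords', []):
    print(kw['text'])
for ent in data.get('entities', []):
    print('{} ({})'.format(ent['text'], ent.get('type')))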
API Link:
http://www.ibm.com/watson/developercloud/alchemy-language/api/v1/