How do I read the text within a <pre> in Python?

I'm trying to make a script that detects whether or not an Instagram username is taken. I found that using the url
https://www.instagram.com/{username}/?__a=1 will be filled with info about the account if the name exists, but if the name doesn't exist, the page will just have {} inside a <pre> and nothing else.
I'm using Requests and BeautifulSoup to scrape the page. Here is a script I wrote to test this out:
import requests
from bs4 import BeautifulSoup
username = input("Enter the username you would like to check:")
account_url = "https://www.instagram.com/" + username + "/?__a=1"
r = requests.get(account_url)
print(r.text)
Displaying the text works, but even when I enter a username that doesn't exist, or a random jumble of letters, it always returns a bunch of HTML that I don't see in inspect element on the actual URL. How do I make it return just the text inside the <pre>? I just want to detect whether the page shows nothing so I can determine whether or not the username is taken.
Also, when you load the Instagram ?__a=1 URL with a non-existent username, inspect element says there was a 404 error, but checking the status of the response in Python always comes back as 200, which is success. I'm pretty inexperienced with Python because I haven't used it in a very long time, so some help would be greatly appreciated.

If you want a list of accounts which are not taken, you could use this:
import requests
not_taken = []
user_names = ["randomuser1", "randomuser2", "randomuser3", "etc..."]
for name in user_names:
    response = requests.get(f"https://www.instagram.com/{name}/?__a=1")
    if response.status_code == 404:
        not_taken.append(name)
Now you can use not_taken however you want, for example:
print(not_taken)
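To answer the title question as well: you can pull out the <pre> text and test it directly. Below is a sketch using only the standard library (with BeautifulSoup, BeautifulSoup(r.text, "html.parser").find("pre").get_text() does the same job); it assumes the page really shows a bare {} for a free username, as described above:

```python
import json
from html.parser import HTMLParser

class PreText(HTMLParser):
    """Collect the text inside <pre> tags."""
    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)

def pre_text(html):
    parser = PreText()
    parser.feed(html)
    return "".join(parser.chunks)

def looks_available(body):
    # an empty JSON object is what the page shows for a missing account
    try:
        return json.loads(body) == {}
    except json.JSONDecodeError:
        return False

html = "<html><body><pre>{}</pre></body></html>"  # hypothetical response body
print(pre_text(html))                   # {}
print(looks_available(pre_text(html)))  # True
```

Parsing the body as JSON is more robust than comparing raw strings, since a taken username returns a populated JSON object rather than {}.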

Related

Get redirected URL with Python to get access code

This question has been asked several times, and I didn't find any answer that works for me. I am using the requests library to get the redirect URL, but my code returns the original URL. If I click the link it takes a few seconds before I get the redirect URL, and I can then manually extract the code, but I need to get this information in Python.
Here is my code. I have tried response.history but it returns an empty list.
import requests
response = requests.get("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
print(response)
print('-------------------')
print(response.url)
I am trying to get the code by following this Microsoft documentation: https://learn.microsoft.com/en-us/graph/auth-v2-user.
Here are the links that I found on Stack Overflow that didn't solve my issue:
To get redirected URL with requests, How to get redirect url code with Python? (this is probably very close to my situation), how to get redirect url using python requests, and Python Requests library redirect new url.
I didn't have any luck getting the redirected URL back using requests, as mentioned in previous posts, but I was able to work around this using the webbrowser library and then reading the browser history with sqlite3, which gave me the result I was looking for.
I had to go through Postman and add the Postman URL to my app registration for using the Graph API, but if you simply want to get the redirected URL you can follow the same code and you should get it.
Let me know if there are better solutions.
Here is my code:
import time
import webbrowser
import sqlite3
import pandas as pd
import shutil
webbrowser.open("https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize?client_id={client_id}&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%2Fmyapp%2F&response_mode=query&scope=user.read%20chat.read&state=12345")
# source_file is where your browser history is saved; I was using Chrome, but it should be the same process if you are using a different browser
source_file = 'C:\\Users\\{your_user_id}\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\History'
# could not connect to the history file directly as it was locked, so I had to make a copy of it in a different location
destination_file = 'C:\\Users\\{user}\\Downloads\\History'
time.sleep(30) # there is some delay before the history file updates, so a 30 second wait gives it enough time to make sure your last URL gets logged
shutil.copy(source_file,destination_file) # copying the file.
con = sqlite3.connect('C:\\Users\\{user}\\Downloads\\History')#connecting to browser history
cursor = con.execute("SELECT * FROM urls")
names = [description[0] for description in cursor.description]
urls = cursor.fetchall()
con.close()
df_history = pd.DataFrame(urls,columns=names)
last_url = df_history.loc[len(df_history)-1,'url']
print(last_url)
>>https://oauth.pstmn.io/v1/browser-callback?code={code}&state=12345&session_state={session_state}#
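Once the final URL is captured, the authorization code can be pulled out of the query string with the standard library rather than by hand (the code value below is a made-up placeholder):

```python
from urllib.parse import urlparse, parse_qs

last_url = "https://oauth.pstmn.io/v1/browser-callback?code=M.R3_BAY.abc123&state=12345"
params = parse_qs(urlparse(last_url).query)  # {'code': [...], 'state': [...]}
code = params["code"][0]
print(code)  # M.R3_BAY.abc123
```

Note that parse_qs returns a list per key, since query parameters can repeat.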

Using Mechanicalsoup to navigate multiple pages / forms

I've had success using MechanicalSoup with single pages / single forms, but am having difficulty with a multistep problem. The pages I am attempting to navigate start here: https://webapps2.ncua.gov/CustomQuery/CUSelect.aspx
I get through the first page/form, but then I am not sure how to deal with the second page/form. The third page includes the result that I wish to scrape.
import requests
import urllib.parse
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://webapps2.ncua.gov/CustomQuery/CUSelect.aspx")
form = browser.select_form()
browser["operand0"] = "State"
browser["operator0"] = "Not Equal"
browser["value0"] = "XX"
response = browser.submit_selected()
form2 = browser.get_current_form()
submit = browser.get_current_page().find('input', id='BtnAllAcct')
form2.choose_submit(submit)
browser.submit_selected()
submit = browser.get_current_page().find('input', id='Btndata1')
form2.choose_submit(submit)
browser.submit_selected()
Any ideas? This is my second attempt, after first trying to interact with the API, but the two separate forms stumped me there as well.
I was solving a similar issue; following the advice in "MechanicalSoup follow a link without inside a button", I switched to Selenium.
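For context on why multi-page ASPX forms are fiddly: each page carries its state in hidden inputs (such as __VIEWSTATE and __EVENTVALIDATION) that must be re-submitted at every step, so whichever library you use has to pick them up from each intermediate page. A minimal standard-library sketch of collecting those hidden fields (the HTML snippet is a made-up example):

```python
from html.parser import HTMLParser

class HiddenFields(HTMLParser):
    """Collect name/value pairs of hidden <input> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

html = '<form><input type="hidden" name="__VIEWSTATE" value="abc123"/></form>'
p = HiddenFields()
p.feed(html)
print(p.fields)  # {'__VIEWSTATE': 'abc123'}
```

MechanicalSoup handles this for you as long as you call select_form() again on every new page before submitting; Selenium sidesteps it entirely by driving a real browser.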

Python webscraping page which requires login

I am trying to automate a web data gathering process using Python. In my case, I need to pull the information from the https://app.ixml.com.br/documentos/nfe page. However, before you go to this page, you need to log in at https://app.ixml.com.br/login. The code below should theoretically log into the site:
import re
from robobrowser import RoboBrowser
username = 'email'
password = 'password'
br = RoboBrowser()
br.open('https://app.ixml.com.br/login')
form = br.get_form()
form['email'] = username
form['senha'] = password
br.submit_form(form)
src = str(br.parsed())
However, by printing the src variable, I get the source code of the https://app.ixml.com.br/login page, i.e. from before logging in. If I add the following lines at the end of the previous code:
br.open('https://app.ixml.com.br/documentos/nfe')
src2 = str(br.parsed())
The src2 variable contains the source code of the page https://app.ixml.com.br/. I tried some variations, such as creating a new br object, but got the same result. How can I access the information at https://app.ixml.com.br/documentos/nfe?
If it is OK to have a web page open, you can try to solve this using Selenium. This package makes it possible to create a program that reacts just like a user would.
The following code should log you in:
from selenium import webdriver
from selenium.webdriver.common.by import By  # Selenium 4 style locators

browser = webdriver.Firefox()
browser.get("https://app.ixml.com.br/login")
browser.find_element(By.ID, "email").send_keys("abc@mail.com")
browser.find_element(By.ID, "senha").send_keys("abc")
browser.find_element(By.CSS_SELECTOR, "button").click()

Python beautiful soup web scraper doesn't return tag contents

I'm trying to scrape matches and their respective odds from a local bookie site, but every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all of the sites below for almost a month, but with no success. The problem seems to be with the exact div, class or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds and all the other components of the container; however, the program runs and just prints "Process finished with exit code 0" and nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents.
You can prove this to yourself by right-clicking on the page, choosing "view page source" and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You could also consider urllib.request from the standard library instead of requests:
from urllib.request import Request, urlopen
# build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# retrieve the document:
res = urlopen(req)
# parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
As Chris Curvey described, the problem is that requests can't execute the JavaScript on the page. If you print your content variable you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the last line simply extracts the text elements of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
which of the available information you want,
how to identify this information inside the match div's structure,
and in which data type you need this information.
To make it easy: run the program, open Chrome's developer tools with F12; in the top left corner you will see the icon for "select an element ...".
If you click on the icon and then click on the desired element in the browser, you will see the equivalent source in the area under the icon.
Analyse it carefully to get the info you need, for example:
The title of the football match is the first h3 tag in the match div and is a string.
The odds shown are span tags with the class event-odds and a number (float/double).
Search for the function you need in Google or in the reference for the package you use (BeautifulSoup4).
Let's try it quick and dirty, using the BeautifulSoup functions on the match variable so we don't pick up elements from the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")     # use on the match variable
if len(title_tags) > 0:              # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)          # show it
else:
    print("no h3 tags found")
    exit()
# (2) let's try to get some odds as numbers, in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:               # at least three found?
    odds = []                        # create a list
    for tag in odds_tags:            # loop over the odds_tags we found
        odd = tag.getText()          # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result. You have to clean and convert it:
        clean_odd = odd.strip()      # remove empty spaces
        odd = float(clean_odd)       # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)             # keep it for later use
else:
    print("something went wrong with the odds")
    exit()
input("Press enter to try it on the next match!")

requests.get doesn't work in python scraper

Hi, I am trying to make this basic scraper work, where it should go to a website, fill in "City" and "Area", search for restaurants and return the HTML page.
This is the code I'm using:
import requests
from collections import OrderedDict
from bs4 import BeautifulSoup

payload = OrderedDict([('cityId', 'NewYork'), ('area', 'Centralpark')])
req = requests.get("http://www.somewebsite.com", params=payload)
f = req.content
soup = BeautifulSoup(f, "html.parser")
And here is what the source HTML looks like:
When I check the resulting soup variable, it doesn't have the search results; instead it contains the data from the first page only, which has the form for entering the city and area values (i.e. www.somewebsite.com, when what I want is the results of www.somewebsite.com?cityId=NewYork&area=centralPark). So is there anything I have to pass with the params to explicitly press the search button, or is there any other way to make it work?
First check whether you can visit the URL in a web browser and get the correct result.
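For what it's worth, requests does append params to the URL; you can verify the query string it will produce with the standard library alone. If the site fills in the results with JavaScript or via a POST, fetching that GET URL won't return them:

```python
from urllib.parse import urlencode

# list of tuples preserves parameter order, like the OrderedDict above
payload = [("cityId", "NewYork"), ("area", "Centralpark")]
url = "http://www.somewebsite.com/?" + urlencode(payload)
print(url)  # http://www.somewebsite.com/?cityId=NewYork&area=Centralpark
```

If opening that URL in a browser shows the search results, requests should get them too; if not, the search is not driven by GET parameters.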
