using generalized text to search inside html source code

I'm working on a small project in which small config files are fed to my program from a folder.
Each file describes how to check one website for a given username.
For example, if I want to look up a user on Reddit, Twitch and other sites, I could use this tool as long as I had configs for the sites I wanted to check.
import requests

def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    print(isSuccess)
    print(link)
    html = requests.get(link)
    print(html)
    page_source = html.text
    # Save the fetched page so it can be inspected by hand
    with open("outputs.txt", "w+") as outFile:
        outFile.write(page_source)
    count = page_source.count(isSuccess)
    print(count)
    return count > 0

print(userCheck("https://reddit.com/u/<USERNAME>", "1r0nk3y", "<h1 class=\"_eYtD2XCVieq6emjKBH3m\"><USERNAME></h1>"))
In theory, my code should look through the HTML for the isSuccess string (with the username substituted in) and return True in this case. Yes, I know I could work with response codes, but some websites return a success code even for a nonexistent user.
Here's my output:
<h1 class=_eYtD2XCVieq6emjKBH3m>1r0nk3y</h1>
https://reddit.com/u/1r0nk3y
<Response [502]>
0
False
I believe the main difficulty here is that I don't want to use something like bs4 for this; it needs to be much simpler. But if a website needs a more elaborate keyword or phrase to confirm a successful lookup, I will need to match strings that contain quotes.
Any help would be appreciated, thank you!
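One way to keep the per-site configs simple is to separate the string matching from the HTTP request, so the matching can be checked on its own. The following is a minimal sketch, assuming each config supplies a marker template containing the same <USERNAME> placeholder as the link; fill_template and marker_found are hypothetical helper names, and the sample page is made up for illustration.

```python
PLACEHOLDER = "<USERNAME>"

def fill_template(template, user):
    """Substitute the username into a URL or marker template."""
    return template.replace(PLACEHOLDER, user)

def marker_found(page_source, marker_template, user):
    """Return True if the filled-in marker occurs in the page source."""
    return fill_template(marker_template, user) in page_source

# Markers may contain double quotes; a single-quoted Python string avoids escaping.
marker = '<h1 class="title"><USERNAME></h1>'
sample_page = '<html><body><h1 class="title">alice</h1></body></html>'  # made-up page
print(marker_found(sample_page, marker, "alice"))
print(marker_found(sample_page, marker, "bob"))
```

In the real function the page text would come from requests.get(link).text. A response such as the 502 shown in the output means the body is most likely an error page, so the marker would not be found there regardless of how the quotes are written.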

Related

web scraping failure of log in using python requests module

I am trying to write a Python script that scrapes some information from a specific website, for learning purposes.
I went through a lot of tutorials and posts. They were useful, but none showed me how to log in to the website and run searches with different keywords.
I tried different libraries, such as requests and urllib, but perhaps I haven't found the right way to use them.
The steps are as follows:
Set up the login information
Send the login information to the website and keep the response for later use
Set up the keywords
Set up the request headers
Set up the cookie jar
Using the login response, run the search
When I tried this, it worked only intermittently.
Here is the code:
import getpass

# marvin
# date:2018/2/7
# login stage preparation
def login_values():
    login="https://www.****.com/login"
    username = input("Please insert your username: ")
    password = getpass.getpass("Please type in your password: ")
    host="www.****.com"
    #store login secrets
    data = {
        "username": username,
        "password": password,
    }
    return login,host,data

The following gets the HTML from the website:
import requests
import random
import http.cookiejar
import socket

# Set up web scraping function to output the html text file
def webscrape(login_url,host_url,login_data,target_url):
    #static values preparation
    ##import header
    user_agents = [
    ***
    ]
    agent = random.choice(user_agents)
    headers={'User-agent':agent,
             'Accept':'*/*',
             'Accept-Language':'en-US,en;q=0.9;zh-cmn-Hans',
             'Host':host_url,
             'charset':'utf-8',
             }
    ##set up cookie jar
    cj = http.cookiejar.CookieJar()
    #
    # get the html file
    socket.setdefaulttimeout(20)
    s=requests.Session()
    req=s.post(login_url, data=login_data)
    res = s.get(target_url, cookies=cj,headers=headers)
    html=res.text
    return html
Here is the code to get each link from the HTML:
from bs4 import BeautifulSoup

#set up html parsing function for parsing all the list links
def getlist(keyword,loginurl,hosturl,valuesurl,html_lists):
    page=1
    pagenum=10# set up maximum page num
    links=[]
    soup=BeautifulSoup(html_lists,"lxml")
    try:
        for li in soup.find("div",class_="search_pager human_pager in-block").ul.find_all('li'):
            target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
            [links.append(link.find('a')['href']) for link in target_part]
            page+=1
            if page<=pagenum:
                try:
                    nexturl=soup.find('div',class_='search_pager human_pager in-block').ul.find('li',class_='pagination-next ng-scope ').a['href'] #next page
                except AttributeError:
                    print("{}'s links are all stored!".format(keyword))
                    return links
                else:
                    chs_html=webscrape(loginurl,hosturl,valuesurl,nexturl)
                    soup=BeautifulSoup(chs_html,"lxml")
    except AttributeError:
        target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
        [links.append(link.find('a')['href']) for link in target_part]
        print("There is only one page")
        return links
The test code is:
login,host,values=login_values()
keyword="****"
myurl="https://www.****.com/search/os2?key={}".format(keyword)
chs_html=webscrape(login,host,values,myurl)
chs_links=getlist(keyword,login,host,values,chs_html)
targethtml=webscrape(login,host,values,chs_links[1])
There are 22 links in total and one page holds 19 of them, so there should be more than one page. If "There is only one page" is printed, the run has failed.
Problems:
The login_values function is meant to protect my login information by wrapping everything in one final function, but the username and password can still easily be shown with a print() call.
This is the main problem! As mentioned above, this method works only intermittently. By "not working" I mean that the HTML returned is the login page instead of the search results. I want more control so that it works most of the time. I printed the user agent on every run to see whether it was related, and it is not. I also cleared the cookies, suspecting that storage was full, and that is not the cause either.
Sometimes I get a max-retries error or an OS error. I guess these come from the server I am trying to reach. Is there a way to set up a wait timer to prevent these errors from happening?
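On the wait-timer question, one common pattern is to wrap the failing call in a small retry helper that sleeps between attempts. The following is a minimal sketch using only the standard library; call_with_retries and the flaky demo function are hypothetical names introduced for illustration, and catching OSError is an assumption that happens to cover the exceptions raised by requests, since they derive from it.

```python
import time

def call_with_retries(func, attempts=3, wait_seconds=2.0, exceptions=(OSError,)):
    """Call func(); on a listed exception, wait and try again.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(wait_seconds * attempt)  # wait longer after each failure

# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary failure")
    return "ok"

print(call_with_retries(flaky, attempts=5, wait_seconds=0))
```

With the functions from the question, a call might look like call_with_retries(lambda: webscrape(login, host, values, myurl)). This only spaces out the requests; it does not by itself explain why the login page is sometimes returned.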

Checking ALL links within links from a source HTML, Python

My code is meant to take a link passed on the command line, fetch the HTML of the page at that link, search the HTML for links, and then repeat these steps for each link found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The maximum number of visits is 100.
If a website has an error, a None value is returned.
I am using Python 3.
For example:
s = readwebpage(url) # Gets the HTML for the URL passed as its argument; if the link has an error, s is None.
The HTML for that website contains links ending in p2.html, p3.html, p4.html, and p5.html. My code reads all of these, but it does not visit them individually to search for more links. If it did, it would find a link ending in p10.html and report that this link has errors. It doesn't do that at the moment, and it's giving me a hard time.
My code:
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0
while (url_list and AmountVisited<maxhits):
    url = url_list.pop()
    s = readwebpage(url)
    print("testing url: http",url) #Print the url being tested, this code is here only for testing..
    AmountVisited = AmountVisited + 1
    if s == None:
        print("* bad reference to http", url)
    else:
        urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
        while urls_list: #... http or https
            insert = urls_list.pop()
            while(insert in checkedURLs and urls_list):
                insert = urls_list.pop()
            url_list.append(insert)
            checkedURLs = insert
Please help :)

Here is the code you wanted. However, please, stop using regexes for parsing HTML. BeautifulSoup is the way to go for that.
import re
from urllib.request import urlopen
from urllib.error import URLError

def readwebpage(url):
    print("testing", url)
    try:
        return urlopen(url).read().decode("utf-8", errors="replace")
    except (URLError, ValueError):
        return None  # a bad link is reported as None, as in the question

url = 'http://xrisk.esy.es' #put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10
href_re = re.compile('(?<=href=").*?(?=")')
while yet_to_visit and AmountVisited < maxhits:
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html is None:
        print("* bad reference to", current)
    else:
        links = href_re.findall(html) #Creates a list of all href="..." values in the HTML
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
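For the link extraction itself, a parser avoids the edge cases that regexes run into. The following is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup, so nothing needs installing; LinkCollector and extract_links are hypothetical names, and the sample HTML and base URL are made up. It also resolves relative links such as p2.html against the page they were found on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

sample = '<a href="p2.html">two</a> <a href=\'http://example.com/p3.html\'>three</a>'
print(extract_links(sample, "http://example.com/index.html"))
```

In the loop above, extract_links(html, current) could stand in for the regex call; the filter on 'http' would then keep every resolved link.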

Not Python but since you mentioned you aren't tied strictly to regex, I think you might find some use in using wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: set the recursion depth to 10, so wget follows links at most 10 levels deep. This may need to change depending on your maximum number of requests.
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.

I suspect your regex is part of your problem. Right now, http is outside your capture group, and [\s:] matches "either some whitespace (i.e. \s) or a colon".
I'd change the regex to: urls_list = re.findall(r'href="(.*)"', s), that is, "match anything in quotes after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? means zero or one s).
EDIT: And here is a regex that actually works, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'. Note that with two groups re.findall returns (quote, url) tuples, so take the second element of each.
(Also, while it's not technically necessary in your case because re caches compiled patterns, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
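The difference between the greedy and non-greedy patterns is easiest to see on a line that contains more than one link. A small sketch, using a made-up HTML string:

```python
import re

# Made-up sample: two double-quoted links and one single-quoted link on one line.
html = ('<a href="http://a.example/x">A</a> '
        '<a href="https://b.example/y">B</a> '
        "<a href='http://c.example/z'>C</a>")

# Greedy: .* runs on to the last double quote in the string.
greedy = re.findall(r'href="(.*)"', html)

# Non-greedy, with the closing quote required to match the opening one.
lazy = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')
urls = [m.group(2) for m in lazy.finditer(html)]

print(greedy)
print(urls)
```

The greedy pattern returns a single oversized match that swallows the markup between the first and second links, while the non-greedy one returns the three URLs separately.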

Scraping urbandictionary with Python

I'm currently working on an arcbot and I'm trying to add a command, "!urbandictionary", that scrapes the meaning of a term, specifically the first definition given by urbandictionary. If there's another solution, e.g. another dictionary site with a better API, that's good too. Here's my code:
if Command.lower() == '!urban':
    dictionary = Argument[1] #this is the term which the user provides, e.g. "scrape"
    dictionaryscrape = urllib2.urlopen('http://www.urbandictionary.com/define.php?term='+dictionary).read() #plain html of the site
    scraped = getBetweenHTML(dictionaryscrape, '<div class="meaning">','</div>') #Here's my problem, i'm not sure if it scrapes the first meaning or not..
    messages.main(scraped,xSock,BotID) #Sends the meaning of the provided word (Argument[0])
How do I correctly scrape a meaning of a word in urbandictionary?

Just get the text from the first div with the meaning class:
import requests
from bs4 import BeautifulSoup
word = "scrape"
r = requests.get("http://www.urbandictionary.com/define.php?term={}".format(word))
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
print(soup.find("div",attrs={"class":"meaning"}).text)
Output:
Gassing and breaking your car repeatedly really fast so that the front and rear bumpers "scrape" the pavement; while going hyphy

There is apparently an unofficial API:
http://api.urbandictionary.com/v0/define?term={word}
Source: https://github.com/zdict/zdict/wiki/Urban-dictionary-API-documentation
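If the API route is used, the bot only needs to build the URL and pick the first entry out of the decoded JSON. The following is a rough sketch, not a confirmed description of the API: the response shape ({"list": [{"definition": ...}]}) is an assumption, the sample body is hand-written, and build_define_url and first_definition are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

API_URL = "http://api.urbandictionary.com/v0/define"

def build_define_url(word):
    """Build the lookup URL, URL-encoding the term."""
    return API_URL + "?" + urlencode({"term": word})

def first_definition(payload):
    """Return the first definition from a decoded response, or None.

    Assumes the shape {"list": [{"definition": "..."}, ...]}.
    """
    entries = payload.get("list") or []
    return entries[0].get("definition") if entries else None

# Hand-written sample standing in for a real response body (assumed shape).
sample_body = '{"list": [{"definition": "first meaning"}, {"definition": "second meaning"}]}'
print(build_define_url("scrape"))
print(first_definition(json.loads(sample_body)))
```

Returning None for an empty list lets the bot reply with a "no definition found" message instead of raising an error.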

How can I get the current URL or the URL clicked on and save it as a string in python?

How can I get the current URL and save it as a string in python?
I have some code that uses encodedURL = urllib.quote_plus to change the URL in a for loop going through a list. I cannot save encodedURL as a new variable because it's in a for loop and will always return the last item in the list.
My end goal is that I want to get the URL of a hyperlink that the user clicks on, so I can display certain content on that specific URL.
Apologies if I have left out important information. There is too much code and too many modules to post it all here. If you need anything else please let me know.
EDIT: To add more description:
I have a page which shows a list of user comments about websites. Each website name is hyperlinked to the actual website, and next to it there is a "list all comments about this website" link. My goal is that when the user clicks that link, another page opens showing every comment about that website. The problem is that I cannot work out which website they are referring to when they click "list all comments about this website".
Don't know if it helps but this is what I am using:
z=[ ]
for x in S:
    y = list(x)
    z.append(y)
for coms in z:
    url = urllib.quote_plus(coms[2])
    coms[2] = "'Commented on:' <a href='%s'> %s</a> (<a href = 'conversation?page=%s'> all </a>) " % (coms[2],coms[2], url)
    coms[3] += "<br><br>"
deCodedURL = urllib.unquote_plus(url)
text2 = interface.list_comments_page(db, **THIS IS THE PROBLEM**)
page_comments = {
    'comments_page':'<p>%s</p>' % text2,
}
if environ['PATH_INFO'] == '/conversation':
    headers = [('content-type' , 'text/html')]
    start_response("200 OK", headers)
    return templating.generate_page(page_comments)

So your problem is you need to parse the URL for the query string, and urllib has some helpers for that:
>>> i
'conversation?page=http://www.google.com/'
>>> urllib.splitvalue(urllib.splitquery(i)[1])
('page', 'http://www.google.com/')
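On Python 3 the same lookup can be done with the documented urllib.parse functions. The following is a minimal sketch, assuming the link carries the target site in a page parameter as in the question; page_from_link and page_from_environ are hypothetical helper names, and the environ dict is a hand-made stand-in for a WSGI request.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

def page_from_link(link):
    """Return the decoded 'page' query parameter of a link, or None."""
    values = parse_qs(urlsplit(link).query).get("page")
    return values[0] if values else None

def page_from_environ(environ):
    """Same lookup for a WSGI request, where the query string is its own key."""
    values = parse_qs(environ.get("QUERY_STRING", "")).get("page")
    return values[0] if values else None

link = "conversation?page=" + quote_plus("http://www.google.com/")
print(link)
print(page_from_link(link))

# Hand-made stand-in for the environ of a request to that link.
fake_environ = {"PATH_INFO": "/conversation",
                "QUERY_STRING": "page=http%3A%2F%2Fwww.google.com%2F"}
print(page_from_environ(fake_environ))
```

Inside the '/conversation' branch of the handler, the value from page_from_environ(environ) would be the website whose comments should be listed.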

How to parse web elements into notepad using Python?

Can anyone help me with extracting data from a site using Python? Here are the details:
I have a folder whose name is a set of numbers (the ID of an item), and I have to use that ID to open a page and then scrape information from the page into a text file. The page URL looks like this: http://www.somesite.com/pic.mhtml?id=[ID]. I need to extract the picture link from it (the picture link always has ID.jpg at the end of the file name), write it to the text file, and then rename that text file to the name of the picture. The picture is always in title tags. Thanks in advance.

What you need is a data scraper - http://www.crummy.com/software/BeautifulSoup/ will help you pull data off of websites. You can then load that data into a variable, write it to a file, or do anything you normally do with data.

You could try parsing the HTML source for images.
Try something like this:
import re
import urllib.request

class Parser(object):
    # Matches url="..." or src="..." pointing at an image under the given path
    __rx = r'(url|src)="(http://www\.page\.com/path/\?ID=\d*\.(jpeg|jpg|gif|png))"'

    def __crawl(self, url):
        images = []
        code = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
        for line in code.split('\n'):
            imagesearch = re.search(self.__rx, line)
            if imagesearch:
                # group(2) is the full image URL, extension included
                images.append(imagesearch.group(2))
        return images
It's untested; you may want to check the regex.
