re.findall prints the full text instead of the matching line of text - python

I got the following code:
import urllib
import re
html = urllib.urlopen("http://jshawl.com/python-playground/").read()
lines = [html]
for line in lines:
    if re.findall("jesseshawl", line):
        print line
When I run this code, it prints the full website. How can I display only the line where it found "jesseshawl"? It should return something like:
jesseshawl#gmail.com
And is there a way to not return all html tags when I run this?
My output:
<html>
<head></head>
<body>
<h1>Some images to download:</h1>
<img src='python.gif'/><br />
<img src='terminal.png' />
<hr />
<h1>Email addresses to extract:</h1>
jesseshawl#gmail.com<br />
sudojesse#gmail.com<br />
<hr />
<h1>Login Form:</h1>
Login here:<br />
User: user<br />
Pass: pass
<form method="POST" action="login.php">
User: <input type="text" name="username" /><br />
Pass: <input type="password" name="password" /><br />
<input type="submit" />
</form>
<h1>Memorable Quotes</h1>
<ul>
<li></li>
</ul>
</body>
</html>

You are reading the whole page as a single string, so it prints everything. You have to read it line by line. There is no need for findall; you can use the in operator.
Code:
import urllib
import re
html = urllib.urlopen("http://jshawl.com/python-playground/").readlines()
for line in html:
    if "jesseshawl" in line:
        print line
Output:
jesseshawl#gmail.com<br />
And if you don't want the tags, you could remove them using re.sub
Code2:
import urllib
import re
html = urllib.urlopen("http://jshawl.com/python-playground/").readlines()
for line in html:
    if "jesseshawl" in line:
        print re.sub("<[^>]*?>", "", line)
Output2:
jesseshawl#gmail.com
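The two steps above (keep only the matching lines, then strip the tags) can be combined into one small function. This is a minimal sketch in Python 3 rather than the Python 2 used in the thread. SAMPLE_HTML is a shortened stand-in for the downloaded page and matching_lines is a name made up for this illustration. A regex is enough for this simple markup, but it is not a general HTML parser.

```python
import re

# Shortened stand-in for the page content; the real page would be downloaded.
SAMPLE_HTML = """<h1>Email addresses to extract:</h1>
jesseshawl#gmail.com<br />
sudojesse#gmail.com<br />
<hr />"""

# Matches anything that looks like a tag, e.g. <br /> or <h1>.
TAG_RE = re.compile(r"<[^>]*>")


def matching_lines(html, needle):
    """Return the lines containing needle, with tags and outer whitespace removed."""
    return [
        TAG_RE.sub("", line).strip()
        for line in html.splitlines()
        if needle in line
    ]


print(matching_lines(SAMPLE_HTML, "jesseshawl"))
```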

Bucket POST must contain a field named 'AWSAccessKeyId' with SigV4

I am trying to do a simple generate_presigned_post using python and boto3.
import boto3
from botocore.config import Config
def s3_upload_creds():
    REGION = 'eu-west-2'
    s3 = boto3.client('s3',
                      aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key,
                      region_name=REGION,
                      config=Config(signature_version='s3v4'))
    return s3.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        ExpiresIn=3600
    )
upload_fields = s3_upload_creds()
url = upload_fields['url']
upload_fields = upload_fields['fields']
"""<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<form action="{url}" method="post" enctype="multipart/form-data">
<input type="text" id="x-amz-algorithm" name="x-amz-algorithm" value="AWS4-HMAC-SHA256" /><br />
<input type="text" id="x-amz-credential" name="x-amz-credential" value="{creds}" /><br />
<input type="text" id="x-amz-date" name="x-amz-date" value="{date}" /><br />
<input type="text" id="x-amz-policy" name="policy" value="{policy}" /><br />
<input type="text" id="signature" name="signature" value="{signature}" />
<input type="text" id="key" name="key" value="{key}" />
File:
<input type="file" name="file" /> <br />
<input type="submit" name="submit" value="Upload to Amazon S3" />
</form>
</html>""".format(
url=url,
creds=upload_fields['x-amz-credential'],
date=upload_fields['x-amz-date'],
policy=upload_fields['policy'],
signature=upload_fields['x-amz-signature'],
key=upload_fields['key']
)
However I receive this error.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidArgument</Code>
<Message>Bucket POST must contain a field named 'AWSAccessKeyId'. If it is specified, please check the order of the fields.</Message>
<ArgumentName>AWSAccessKeyId</ArgumentName>
<ArgumentValue></ArgumentValue>
<RequestId>DFEDBZFEES4PFQV3</RequestId>
<HostId>k7wj2Ehd/DjtpVk+OtG0qGFRECtTYkQv64hEwLRFkqKR4Qfhj0nbOHKS5DNqWo/TTGR3BbC6k=</HostId>
</Error>
Browsing solutions online, I found that the upload will not work if the file is specified before the other fields. I have verified that this is not the case in my browser or in Postman.
I tried adding the AWSAccessKeyId field, but then I get this error.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>InvalidRequest</Code>
<Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message>
<RequestId>AYGPAQPQFDWFE3Q6N</RequestId>
<HostId>aYTEn/SyElUsyFPRFEDWtZ8SYKuQWB7bvaxvmq0tvXkPFDWlYAixsO6ue+tEbVstaWGRF4KY=</HostId>
</Error>
I do not understand why aws needs AWSAccessKeyId since I can see it at the start of x-amz-credential:
AK-----------------R/date/region/s3/aws4_requests
I have looked at these:
Link
Link
Thanks, Anon Coward, for your help.
The issue came from the HTML form I used. One of the fields was given the wrong name attribute, which caused an error whose message is misleading. The sample above should change name="signature" to name="x-amz-signature".
Note that the field names returned from generate_presigned_post are:
key
x-amz-algorithm
x-amz-credential
x-amz-date
policy
x-amz-signature
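One way to avoid mistyping a field name is to generate the form inputs from the returned dict instead of writing each one by hand. The sketch below assumes a dict shaped like the return value used in the question ('url' plus 'fields'). render_upload_form is a hypothetical helper, and the values in example are placeholders, not real credentials.

```python
from html import escape


def render_upload_form(presigned):
    """Build an upload form from a dict shaped like {'url': ..., 'fields': {...}}."""
    # One hidden input per returned field, so every name matches exactly.
    inputs = "\n".join(
        '  <input type="hidden" name="{}" value="{}" />'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in presigned["fields"].items()
    )
    # The file input comes after the signed fields.
    return (
        '<form action="{url}" method="post" enctype="multipart/form-data">\n'
        "{inputs}\n"
        '  <input type="file" name="file" />\n'
        '  <input type="submit" value="Upload to Amazon S3" />\n'
        "</form>"
    ).format(url=escape(presigned["url"], quote=True), inputs=inputs)


# Placeholder data only; a real dict would come from generate_presigned_post.
example = {
    "url": "https://example-bucket.s3.amazonaws.com/",
    "fields": {
        "key": "uploads/demo.txt",
        "x-amz-algorithm": "AWS4-HMAC-SHA256",
        "x-amz-credential": "PLACEHOLDER",
        "x-amz-date": "PLACEHOLDER",
        "policy": "PLACEHOLDER",
        "x-amz-signature": "PLACEHOLDER",
    },
}

form_html = render_upload_form(example)
print(form_html)
```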

Scrapy - How to get XPATH node as text

I want to get a node as a string from an XHTML document.
For example:
<html>
<body>
<div>
<input type="text" value="myText"></input>
<input type="text" value="myText" disabled="true"></input>
</div>
</body>
</html>
I want to get the XHTML text of the two inputs as strings in an array, like this:
["<input type="text" value="myText"></input>"
"<input type="text" value="myText" disabled="true"></input>"]
Is this possible with Scrapy's XPath selector?
Not completely what you're asking for as it only shows the opening tag and for some reason it removes ="true" from the last input, but it might be enough depending on what you're using it for.
>>> response.xpath('//div/input').extract()
[u'<input type="text" value="myText">', u'<input type="text" value="myText" disabled>']
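If the document is well-formed XHTML, it can also be parsed as XML, which keeps the closing tags and attribute values. The sketch below does not use Scrapy. It swaps in the standard library's xml.etree.ElementTree, written for Python 3, and it only works when the markup parses as XML.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
<body>
<div>
<input type="text" value="myText"></input>
<input type="text" value="myText" disabled="true"></input>
</div>
</body>
</html>"""

root = ET.fromstring(XHTML)

nodes = []
for element in root.findall(".//div/input"):
    # tostring() also emits the text that follows an element, so drop it.
    element.tail = None
    # short_empty_elements=False writes <input ...></input> instead of <input ... />.
    nodes.append(
        ET.tostring(element, encoding="unicode", short_empty_elements=False)
    )

print(nodes)
```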

Python BeautifulSoup returning wrong list of inputs from find_all()

I have Python 2.7.3 and bs.version is 4.4.1
For some reason this code
from bs4 import BeautifulSoup # parsing
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname">
<input type="text" name="email" >
<input type="button" name="Submit" value="submit">
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
returns a wrong list of inputs:
input:<input name="fname" type="text">
<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input></input>
input:<input name="email" type="text">
<input name="Submit" type="button" value="submit">
</input></input>
input:<input name="Submit" type="button" value="submit">
</input>
It's supposed to return
input: <input name="fname" type="text">
input: <input type="text" name="email">
input: <input type="button" name="Submit" value="submit">
What happened?
To me, this looks like an artifact of the html parser. Using 'lxml' for the parser instead of 'html.parser' seems to make it work. The downside of this is that you (or your users) then need to install lxml -- The upside is that lxml is a better/faster parser ;-).
As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that input tags are self-closing. If you explicitly close your inputs, it works:
<input type="text" name="fname" ></input>
<input type="text" name="email" ></input>
<input type="button" name="Submit" value="submit" ></input>
I would be curious to see if we could modify the source code to handle this case ... Doing a little experiment to monkey-patch bs4 indicates that we can do this:
from bs4 import BeautifulSoup
from bs4.builder import _htmlparser
# Monkey-patch the Beautiful soup HTML parser to close input tags automatically.
BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser
class FixedParser(BeautifulSoupHTMLParser):
    def handle_starttag(self, name, attrs):
        # Old-style class... No super :-(
        result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
        if name.lower() == 'input':
            self.handle_endtag(name)
        return result
_htmlparser.BeautifulSoupHTMLParser = FixedParser
html = """
<html>
<head id="Head1"><title>Title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname" >
<input type="text" name="email" >
<input type="button" name="Submit" value="submit" >
</form>
</body>
</html>
"""
html_proc = BeautifulSoup(html, 'html.parser')
for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print "input:" + str(input)
Obviously, this isn't a true fix (I wouldn't submit this as a patch to the BS4 folks), but it does demonstrate the problem. Since there is no end-tag, the handle_endtag method is never getting called. If we call it ourselves, things tend to work out (as long as the html doesn't also have a closing input tag ...).
I'm not really sure whose responsibility this bug should be, but I suppose that you could start by submitting it to bs4 -- They might then forward you on to report a bug on the python tracker, I'm not sure...
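The claim that handle_endtag is never called for an unclosed input can be checked against the standard library parser that the 'html.parser' option builds on. This is a small sketch in Python 3, not Beautiful Soup itself. EventRecorder and record are names invented for the illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record which handler methods the stdlib parser calls."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(markup):
    """Parse markup and return the list of (event, tag) pairs seen."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# No closing tag: the parser reports a start event for input and nothing else.
unclosed = record('<form><input type="text" name="fname"></form>')
# Explicit closing tag: an end event for input is reported as well.
closed = record('<form><input type="text" name="fname"></input></form>')

print(unclosed)
print(closed)
```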
Don't use a nested loop for this, and use lxml. Change your code to this:
inp = []
html_proc = BeautifulSoup(html, 'lxml')
for form in html_proc.find_all('form'):
    inp.extend(form.find_all('input'))
for item in inp:
    print "input:" + str(item)

The Python code will not style

I can't get the CSS file in the folder to style the page. I have the file in the correct folder. Is it something to do with my indenting? I have no idea what is going on. The Python code compiles and works, but the styling isn't applied and the page renders plain.
import webapp2
class MainHandler(webapp2.RequestHandler):
    def get(self):
        #web page sections
        form_head='''
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<link href="css/style.css" rel="stylesheet" type="text/css" />
<link href='http://fonts.googleapis.com/css?family=Roboto' rel='stylesheet' type='text/css'>
<title>Gamers R' Us Subscribing</title>
</head>
<body>'''
        form_body='''
<div class="maincontainer">
<h1>Welcome to Gamers R' Us!</h1>
<div id="bgimg">
<p>Gamers R' Us is a blog that talks and reviews video games daily. This form services as a way for users to receive emails based apon your preferred gaming system and preferred genre of game.</p>
<div id="formbox">
<h2>Subscribe Today!</h2>
<form method="GET">
<label>Full Name: </label><input type="text" name="name" placeholder=" John Doe"/><br>
<label>Email: </label><input type="text" name="email" placeholder=" me#domain.com"/><br>
<input type="checkbox" name="subscribe" value="yes" checked>Subscribe for gaming updates and more!<br>
<label>Select the gaming system you prefer:</label><br>
<select name="system" class="selectbox">
<option value="ps4">Playstation 4</option>
<option value="xbone">Xbox One</option>
<option value="wiiu">Wii U</option>
<option value="pc">PC Gaming</option>
</select><br>
<input type="radio" name="genre" value="FPS">First Person Shooter.<br>
<input type="radio" name="genre" value="MOBA">Multiplayer Online Battle Arena.<br>
<input type="radio" name="genre" value="RPG">Role-Playing Game.<br>
<input type="radio" name="genre" value="RTS">Real Time Strategy.<br>
<input type="radio" name="genre" value="Other">Other Genre.<br>
<input type="submit" class="subbtn" value="Done" />
</form>
</div>
</div>
</div>'''
        form_foot='''
</body>
</html>'''
        #if GET is requested it should display on next screen.
        #else should load page.
        if self.request.GET:
            name= self.request.GET['name']
            email= self.request.GET['email']
            system= self.request.GET['system']
            subscribe= self.request.GET['subscribe']
            genre= self.request.GET['genre']
            #displays form information submitted by user.
            self.response.write(form_head + "<div class='maincontainer'>" +
                '<h1>Thanks for Subbing!</h1>' +
                '<div id="infobox">' +
                '<h2></h2>' +
                "Name: "+name+"<br />" +
                "Email: "+email+"<br />" +
                "Preferred System: "+system+"<br /> " +
                "Preferred Genre: "+genre+
                '</div>' +
                '</div>' +
                form_foot)
        #Will display error. ** PLACE HOLDER **
        else:
            self.response.write(form_head + form_body + form_foot)
# Do not touch this.
app = webapp2.WSGIApplication([('/', MainHandler)], debug=True)
I wager that if you use Firebug or another browser-based debugger, you will find that the browser failed to load the css/style.css resource. Looking at your code, you will see that even though you put style.css in a folder called css, you have configured only one route on the server: the / URL.
Once you provide a route for /css/style.css and a corresponding handler, the stylesheet will load and the HTML will be styled.
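The idea of a second route for the stylesheet can be shown without webapp2. The sketch below is a plain WSGI application in Python 3, not the webapp2 API. make_app and css_root are names chosen for the illustration, and the css folder location is an assumption. It answers / with the page and /css/<name>.css with the file contents, and returns 404 for anything else.

```python
import os


def make_app(css_root):
    """Return a WSGI app that serves '/' and '/css/<name>.css' from css_root."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")

        if path == "/":
            body = b'<link href="css/style.css" rel="stylesheet" type="text/css" />'
            start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
            return [body]

        if path.startswith("/css/"):
            # Keep only the file name so requests cannot escape css_root.
            name = os.path.basename(path)
            full_path = os.path.join(css_root, name)
            if name.endswith(".css") and os.path.isfile(full_path):
                with open(full_path, "rb") as handle:
                    data = handle.read()
                start_response("200 OK", [("Content-Type", "text/css")])
                return [data]

        # Any URL without a route ends up here, which is what happened
        # to css/style.css when only '/' was configured.
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]

    return app
```

To try it locally, the app returned by make_app can be passed to wsgiref.simple_server.make_server from the standard library.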

How do I recover the user's server address, put it in a variable, and execute it on the command line

scriptInfo.py
import os, sys, platform, webbrowser
def main():
    template = open('scriptHtml.phtml').read()
scriptHtml.phtml
<html>
<head>
</head>
<body>
<h2><center> welcome </center></h2>
<br/><br/><br/>
...
variables
..
<form name="sendData" method="get" action="http://localhost:8000/cgi/scriptGet.py">
Name: <input type="text" name="n"><br/><br/>
First Name: <input type="text" name="fn"/><br/><br/>
Mail: <input type="text" name="ma"/><br/><br/>
Address: <input type="text" name="add"/> <br/><br/>
<input type="submit" value="OK"/>
</form>
Instead of action="http://localhost:8000/cgi/scriptGet.py", there must be a variable that contains the code to recover the server address, but I don't know how to do it.
With HTML forms you can leave out the server and give a path to the script relative to the current page.
For example:
<form name="sendData" method="get" action="cgi/scriptGet.py">
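How a browser resolves such a relative action can be previewed with urljoin from the Python 3 standard library. The page URLs below are made-up examples. A relative path is resolved against the folder of the current page, while a path with a leading slash is resolved against the server root.

```python
from urllib.parse import urljoin

# Hypothetical page addresses, used only to show the resolution rules.
local_page = "http://localhost:8000/index.html"
nested_page = "http://example.com/app/page.html"

# Relative to the folder that contains the page.
print(urljoin(local_page, "cgi/scriptGet.py"))
print(urljoin(nested_page, "cgi/scriptGet.py"))

# A leading slash starts from the server root instead.
print(urljoin(nested_page, "/cgi/scriptGet.py"))
```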
