How to convert text from shell to html? - python

Probably it's a really easy and stupid question, but I'm new to Python, so bear with me.
What I need to do: execute a command in the shell and publish its output as a telegra.ph page.
Problem: the telegra.ph API ignores \n characters, so all the output text ends up on one line.
I used the Python Telegraph API wrapper: https://github.com/python273/telegraph
I understand that I need to convert my text to HTML-like markup and deal with <p> tags; I've tried some scripts, but my program gave me this error:
telegraph.exceptions.NotAllowedTag: span tag is not allowed
So I removed all the span tags and got the same result as if I had posted the text without converting it.
Then I tried to use replace("\n", "<p>") but got stuck on the closing tags...
Code:
import subprocess
from telegraph import Telegraph

telegraph = Telegraph()
telegraph.create_account(short_name='1111')

tmp = subprocess.run("arp", capture_output=True, text=True, shell=True).stdout
print('\n\n\n' + tmp + '\n\n\n\n')  # debug line

response = telegraph.create_page(
    'Random',
    html_content='<p>' + tmp + '</p>'
)
print('https://telegra.ph/{}'.format(response['path']))

The closest HTML equivalent to \n is the "hard break" <br/> tag.
It does not require a closing tag, because it contains nothing, and it directly signifies a line break.
Assuming it is supported by telegra.ph, you could simply:
tmp = tmp.replace('\n', '<br/>')
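As a sketch, slotted into the question's script (assuming telegra.ph accepts <br/> inside a paragraph; note that str.replace returns a new string, so the result has to be assigned back):

# Replace newlines with hard breaks, then wrap everything in one paragraph.
tmp = tmp.replace('\n', '<br/>')
response = telegraph.create_page(
    'Random',
    html_content='<p>' + tmp + '</p>'
)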

Add this line to convert all intermediate newlines to individual <p>-sections:
tmp = "</p><p>".join(tmp.split("\n"))
tmp.split("\n") splits the string into an array of lines.
"</p><p>".join(...) glues everything together again, closing the previous <p>-section and starting a new one.
This way, the example works for me and line breaks are correctly displayed on the page.
EDIT: As the other answer suggests, of course you can also use <br/> tags. It depends on what you want to achieve!
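For reference, a minimal end-to-end sketch of this approach, reusing the account name and command from the question:

import subprocess
from telegraph import Telegraph

telegraph = Telegraph()
telegraph.create_account(short_name='1111')

tmp = subprocess.run("arp", capture_output=True, text=True, shell=True).stdout

# Turn every newline into a paragraph boundary, then wrap the whole
# output so the first and last lines get <p> tags as well.
html = '<p>' + '</p><p>'.join(tmp.split('\n')) + '</p>'

response = telegraph.create_page('Random', html_content=html)
print('https://telegra.ph/{}'.format(response['path']))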

It is not clear to me why the telegraph module replaces newlines with spaces. In this case it seems reasonable to disable this functionality.
import subprocess
import re

import telegraph
from telegraph import Telegraph

# Replace the wrapper's whitespace-collapsing regex so newlines survive.
telegraph.utils.RE_WHITESPACE = re.compile(r'([ ]{10})', re.UNICODE)

telegraph = Telegraph()
telegraph.create_account(short_name='1111')

tmp = subprocess.run("/usr/sbin/arp",
                     capture_output=True,
                     text=True,
                     shell=True).stdout

response = telegraph.create_page(
    'Random',
    html_content='<pre>' + tmp + '</pre>'
)
print('https://telegra.ph/{}'.format(response['path']))
This would produce output that comes close to the actual formatted arp output.

Related

Extract Email from Bulk Text - Error

I would like to extract all the email addresses included in an HTML page. I wrote this very simple code (I'm a super basic Python writer, I'm just trying to learn):
#coding=utf-8
import urllib
import re

html = urllib.urlopen('http://giacomobonvini.com').read()
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(html)
emails = ""
for x in results:
    emails += str(x) + "\n"
print emails
The problem is that, even if the code works, the emails are printed in this way:
"giacomo.bonvini@gmail.com < / span"
"giacomo.bonvini@gmail.com < br"
I would like not to have "< / span" and "< br".
Do you have any idea?
Thanks
Giacomo
r'(\b[\w.]+@+[\w.]+.+[\w.]\b)'
The problem is likely the .+ combination, which matches anything. Maybe you meant to match a single dot instead? If so, use for example [.]
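For illustration only, a tightened pattern along those lines (a sketch, not a fully RFC-compliant email regex), reusing the html variable from the question's code:

import re

# \b word boundaries, a single "@", and a literal dot ([.]) before the
# top-level domain instead of ".+", which matches anything.
r = re.compile(r'\b[\w.]+@[\w.]+[.][a-zA-Z]{2,}\b')
results = r.findall(html)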

python lxml not showing all content

I am trying to scrape a specific section of a web page, and eventually calculate word frequency. But I am finding it difficult to get the entire text. As far as I understand from looking at the HTML code, my script omits the parts of that section that are on a separate line but have no <br> tag.
My code:
import urllib
from lxml import html as LH
import lxml
import requests

scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
scripthtml = urllib.urlopen(scripturl).read()
scripthtml = requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)
script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
print script
print type(script)
This is the output:
["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r
New York's classic rock \r q104.", '3.', '
\r \r Good morning.', " \r I'm jim kerr.",
' \r \r Coming up \r
When I iterate over the result, I only get the phrases that follow the \r and are followed by a comma or quotation mark:
for res in script:
    print res
The output is:
q104.
3.
Good morning.
I'm jim kerr.
I am not confined to lxml, but because I am rather new, I am less familiar with other methods.
An lxml element has both a text and a tail attribute. You are searching for text, but if there is an HTML element embedded in the element (a br, for example), your search for text will only go as deep as the first text node the parser gets from the element.
Try:
script = tree.xpath('//div[@class="scrolling-script-container"]')
print " ".join([script[0].text or "", script[0].tail or ""])
This was bothering me, so I wrote out a solution:
import requests
import lxml
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)
script = root.xpath('//div[@class="scrolling-script-container"]')

text_list = []
for elem in script:
    print(elem.attrib)
    if hasattr(elem, 'text'):
        text_list.append(elem.text)
    if hasattr(elem, 'tail'):
        text_list.append(elem.tail)

for elem in text_list:
    # only gets the first block of text before
    # it encounters a br tag
    print(elem)

for elem in script:
    # prints everything
    for sib in elem.iter():
        print(sib.attrib)
        if hasattr(sib, 'text'):
            print(sib.text)
        if hasattr(sib, 'tail'):
            print(sib.tail)
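For what it's worth, lxml can also collapse every text node inside an element (including the tails that follow each <br>) in a single call; a minimal sketch along the same lines, assuming the same URL and class name:

import requests
from lxml import html as LH

url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
tree = LH.fromstring(requests.get(url).content)
div = tree.xpath('//div[@class="scrolling-script-container"]')[0]

# text_content() concatenates every text node inside the element,
# so nothing after a <br> is lost.
print(div.text_content())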

Remove newline in python with urllib

I am using Python 3.x. While using urllib.request to download the webpage, I am getting a lot of \n in between. I am trying to remove them using the methods given in other threads on the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code in Eclipse. Here is my code:
import urllib.request

# Downloading entire web document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

# Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for getting a lot of \n in the raw_html variable.
Your download_page() function corrupts the html (the str() call); that is why you see \n (the two characters \ and n) in the output. Don't use .replace() or other similar workarounds; fix the download_page() function instead:
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., by getting it from the Content-Type http header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass a charset in the Content-Type header, then there are complex rules to figure out the character encoding of an html5 document; e.g., it may be specified inside the html document: <meta charset="utf-8"> (you would need an html parser to get it).
If you read the html correctly then you shouldn't see literal characters \n in the page.
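Putting those pieces together, a minimal sketch (falling back to utf-8 when the header carries no charset):

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    raw = response.read()  # bytes
    # Charset from the Content-Type header, defaulting to utf-8.
    encoding = response.headers.get_content_charset('utf-8')

html_text = raw.decode(encoding)
print(html_text[:200])  # real newlines now, no literal "\n" pairs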
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements that don't target a specific exception (or class of exceptions) are generally bad: if it fails, you have no idea why.
It seems like they are literal \n characters, so I suggest you do it like this:
raw_html2 = raw_html.replace('\\n', '')

unnecessary exclamation marks (!) in HTML code

I am emailing the content of a text file "gerrit.txt" (http://pastie.org/8289257) in Outlook using the code below. However, after the email is sent, when I look at the source code (http://pastie.org/8289379) of the email in Outlook, I see unnecessary exclamation marks (!) in the code, which is messing up the output. Can anyone provide input on why this happens and how to avoid it?
from email.mime.text import MIMEText
from smtplib import SMTP

def email(body, subject):
    msg = MIMEText("%s" % body, 'html')
    msg['Content-Type'] = "text/html; charset=UTF8"
    msg['Subject'] = subject
    s = SMTP('localhost', 25)
    s.sendmail('userid@company.com', ['userid2@company.com'], msg=msg.as_string())

def main():
    # open gerrit.txt and read the content into body
    with open('gerrit.txt', 'r') as f:
        body = f.read()
    subject = "test email"
    email(body, subject)
    print "Done"

if __name__ == '__main__':
    main()
Some info available here: http://bugs.python.org/issue6327
Note that mailservers have a 990-character limit on each line
contained within an email message. If an email message is sent that
contains lines longer than 990-characters, those lines will be
subdivided by additional line ending characters, which can cause
corruption in the email message, particularly for HTML content. To
prevent this from occurring, add your own line-ending characters at
appropriate locations within the email message to ensure that no lines
are longer than 990 characters.
I think you must split your HTML into multiple lines. You can use the textwrap.wrap method.
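A rough sketch of that idea, reusing the body variable from the question (note that textwrap collapses runs of whitespace and, by default, may break very long unbroken tokens, so treat it as a starting point):

import textwrap
from email.mime.text import MIMEText

# Re-wrap the HTML body so no single line exceeds the ~990-character
# SMTP line limit that trips up Outlook.
wrapped_body = "\n".join(textwrap.wrap(body, width=900))
msg = MIMEText(wrapped_body, 'html')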
Adding a '\n' into my HTML string, some random 20 characters before where the "!" was appearing, solved my problem.
I also faced the same issue. It's because Outlook doesn't support lines longer than 990 characters; it starts causing the issues below:
Nested tables
Color change of column headings
Unwanted ! marks being added
Here is a solution for the same.
If you are adding a single line, you can write it as line[:40] + "\r\n" + line[40:].
If you are forming a table, then you can do the same in a loop, like
"<td>" + line[j][:40] + "\r\n" + line[j][40:] + "</td>"
In my case the html is being constructed outside of the python script and is passed in as an argument. I added line breaks after each html tag within the python script which resolved my issue:
import re
result_html = re.sub(">", ">\n", html_body)

How do I print a line following a line containing certain text in a saved file in Python?

I have written a Python program to find the carrier of a cell phone given the number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to lookup) and saves this as carrier.html. In the source, the carrier is in the line after the [div class="carrier_result"] tag. (switch in < and > for [ and ], as stackoverflow thought I was trying to format using the html and would not display it.)
My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC
What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.
Sample code:
import urllib2, BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()
You should be using an HTML parser such as BeautifulSoup or lxml instead.
To get the next line, you can use:
htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        nextline = htmlsource.next()
        print nextline
A "better" way is to split on </div>, then get the things you want, as sometimes the stuff you want can occur all in one line. So using next() if give wrong result.eg
data=open("carrier.html").read().split("</div>")
for item in data:
if '<div class="carrier_result">' in item:
print item.split('<div class="carrier_result">')[-1].strip()
By the way, if it's possible, try to use Python's own web modules, like urllib or urllib2, instead of calling external wget.
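A minimal sketch of that suggestion, using the lookup URL from the question (Python 2, to match the rest of the thread):

import urllib2

url = ('http://www.whitepages.com/carrier_lookup'
       '?carrier=other&number_0=1112223333&response=1')

# Fetch the page with urllib2 instead of shelling out to wget.
html = urllib2.urlopen(url).read()
with open('carrier.html', 'w') as f:
    f.write(html)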
