Decode HTML Entity on Python - python

I have a file that contains lines like this:
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4
For each of these lines there is a file on disk, but saved under the decoded name:
StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4
I need to read each name from the list, derive the decoded name that matches the file on disk, and rename the file accordingly. For this I need to decode the HTML entities in the file name, so I did something like this:
import os
from html.parser import HTMLParser
fpListDwn = open('listDwn', 'r')
for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(HTMLParser().unescape(fileName))
But this has no effect when I run it. Part of the output:
meysampg@freedom:~/Downloads/Practical Machine Learning$ python3 changeName.py
StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4
StatsLearning_Lect1_2b_111213_v2_%5BLvaTokhYnDw%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4a_110613_%5BWjyuiK5taS8%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4b_110613_%5BUvxHOkYQl8g%5D_%5Btag22%5D.mp4
StatsLearning_Lect3_4c_110613_%5BVusKAosxxyk%5D_%5Btag22%5D.mp4
How can I fix this?

I guess you should use urllib.parse instead of html.parser
>>> f="StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4"
>>> import urllib.parse as parse
>>> f
'StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4'
>>> parse.unquote(f)
'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
So your script should look like:
import os
import urllib.parse as parse
fpListDwn = open('listDwn', 'r')
for lineNumberOnList, fileName in enumerate(fpListDwn):
    print(parse.unquote(fileName))
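Building on the script above, here is a minimal sketch of matching the listed names against files on disk. The helper name `match_listed_files` and its `directory` parameter are hypothetical, not from the original post. It only reports whether each decoded name exists and leaves the actual renaming to the caller.

```python
import os
from urllib.parse import unquote


def match_listed_files(lines, directory="."):
    """Return (encoded, decoded, exists) tuples for each non-empty line.

    `lines` is any iterable of percent-encoded file names, such as an
    open file object. The name and signature are illustrative only.
    """
    results = []
    for line in lines:
        encoded = line.strip()  # drop the trailing newline read from the file
        if not encoded:
            continue
        decoded = unquote(encoded)  # e.g. %5B -> [ and %5D -> ]
        exists = os.path.exists(os.path.join(directory, decoded))
        results.append((encoded, decoded, exists))
    return results
```

Passing the open `listDwn` file object as `lines` should yield one tuple per listed name; a call to `os.rename` can then be made for the entries where `exists` is true, in whichever direction is needed.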

This is actually "percent encoding", not HTML encoding; see this question:
How to percent-encode URL parameters in Python?
Basically you want to use urllib.parse.unquote instead:
from urllib.parse import unquote
unquote('StatsLearning_Lect1_2a_111213_v2_%5B2wLfFB_6SKI%5D_%5Btag22%5D.mp4')
Out[192]: 'StatsLearning_Lect1_2a_111213_v2_[2wLfFB_6SKI]_[tag22].mp4'
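To make the distinction concrete, here is a small sketch comparing the two decoders from the standard library. The shortened sample strings are made up for illustration.

```python
import html
from urllib.parse import unquote

name = "StatsLearning_%5Btag22%5D.mp4"  # shortened, illustrative sample

# html.unescape only rewrites HTML entities such as &amp; or &lt;,
# so a percent-encoded string passes through untouched.
unchanged = html.unescape(name)

# unquote decodes %XX escapes, which is what the file names contain.
decoded = unquote(name)

# For comparison, this is the kind of input html.unescape is meant for.
entity = html.unescape("a &amp; b")
```

This is why the original script printed the names unchanged: there were no HTML entities in them to decode.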

Related

Getting code from a .txt on a website and pasting it in a tempfile PYTHON

I was trying to make a script that gets a .txt from a website and pastes the code into an executable Python temp file, but it's not working. Here is the code:
from urllib.request import urlopen as urlopen
import os
import subprocess
import os
import tempfile
filename = urlopen("https://randomsiteeeee.000webhostapp.com/script.txt")
temp = open(filename)
temp.close()
# Clean up the temporary file yourself
os.remove(filename)
temp = tempfile.TemporaryFile()
temp.close()
If you know a fix for this, please let me know. The error is:
File "test.py", line 9, in <module>
temp = open(filename)
TypeError: expected str, bytes or os.PathLike object, not HTTPResponse
I tried everything, such as making a request to the URL and pasting the result, but that didn't work either; neither did the code I posted here.
As I said, I expected it to get the code from the .txt on the website and turn it into a temporary executable Python script.
You are missing a read:
from urllib.request import urlopen as urlopen
import os
import subprocess
import os
import tempfile
filename = urlopen("https://randomsiteeeee.000webhostapp.com/script.txt").read() # <-- here
temp = open(filename)
temp.close()
# Clean up the temporary file yourself
os.remove(filename)
temp = tempfile.TemporaryFile()
temp.close()
But if script.txt contains the script itself rather than a file name, you need to create a temporary file and write the content to it:
from urllib.request import urlopen
import os
import subprocess
import tempfile
content = urlopen("https://randomsiteeeee.000webhostapp.com/script.txt").read()
# NamedTemporaryFile is used here so that fp.name is a real path on every platform
with tempfile.NamedTemporaryFile() as fp:
    name = fp.name
    fp.write(content)
If you want to execute the code you fetch from the url, you may also use exec or eval instead of writing a new script file.
eval and exec are EVIL, they should only be used if you 100% trust the input and there is no other way!
EDIT: How do I use exec?
Using exec, you could do something like this (I use requests instead of urllib here; if you prefer urllib, that works too):
import requests
exec(requests.get("https://randomsiteeeee.000webhostapp.com/script.txt").text)
You're trying to open a file whose name is "the content of a website".
filename = "path/to/my/output/file.txt"
httpresponse = urlopen("https://randomsiteeeee.000webhostapp.com/script.txt").read()
temp = open(filename, "wb")  # read() returns bytes, so open for binary writing
temp.write(httpresponse)
temp.close()
This is probably closer to what you intend.
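Putting the pieces from these answers together, here is a minimal sketch of writing downloaded bytes to a temporary .py file and running it in a separate interpreter. The function name `run_script_bytes` is hypothetical, and the download step is left out so the sketch works offline. As noted above, only run code you fully trust.

```python
import os
import subprocess
import sys
import tempfile


def run_script_bytes(content):
    """Write `content` (bytes) to a temporary .py file, run it, return its stdout."""
    # delete=False so the file can be closed before the child process opens it,
    # which matters on Windows; it is removed manually afterwards.
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as fp:
        fp.write(content)
        path = fp.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            check=False,
        )
        return result.stdout
    finally:
        os.remove(path)  # clean up the temporary file ourselves
```

The bytes returned by `urlopen(...).read()` could be passed in directly as `content`.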

How to open a LaTeX file in Python that starts with a comment

Code:
import os
import re
import time
import csv
from TexSoup import TexSoup
path = os.getcwd()
texFile = path + '\\Paper16.tex'
print(texFile)
soup = TexSoup(open(texFile, 'r'))
This produces no output when I call print(soup), and I believe it is because the first line starts with %.
I think this is a bug in TexSoup.
Namely, if you remove the first line, or comment out the second line instead, TexSoup is able to parse the file and print(soup) gives some output.
In addition, if you terminate the first line by adding braces:
%{\documentstyle[aps,epsf,rotate,epsfig,preprint]{revtex}}
TexSoup is also able to parse the file.
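Since removing the first line reportedly helps, one possible workaround is to drop leading comment lines from the source text before parsing. This is only a sketch: the helper `strip_leading_comments` is hypothetical, and whether it is needed may depend on the TexSoup version.

```python
def strip_leading_comments(text):
    """Drop comment-only lines (starting with %) from the top of a LaTeX source."""
    lines = text.splitlines(keepends=True)
    index = 0
    # Skip lines from the top for as long as they are comments.
    while index < len(lines) and lines[index].lstrip().startswith("%"):
        index += 1
    # Later comments are kept; only the leading run is removed.
    return "".join(lines[index:])
```

The cleaned string, rather than the open file object, would then be passed to TexSoup.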

urllib: Get name of file from direct download link

Python 3; this probably needs urllib.
I need to know how to send a request to a direct download link, and get the name of the file it attempts to save.
(As an example, a KSP mod from CurseForge: https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download)
Of course, the file ID (2355387) will change. It could be from any project, but always on CurseForge (if that makes a difference to how it's downloaded).
That example link results in the file:
How can I return that file name in Python?
Edit: I should note that I want to avoid saving the file, reading the name, then deleting it if possible. That seems like the worst way to do this.
Using urllib.request, when you request a response from a url, the response contains a reference to the url you are downloading.
>>> from urllib.request import urlopen
>>> url = 'https://kerbal.curseforge.com/projects/mechjeb/files/2355387/download'
>>> response = urlopen(url)
>>> response.url
'https://addons-origin.cursecdn.com/files/2355/387/MechJeb2-2.6.0.0.zip'
You can use os.path.basename to get the filename:
>>> from os.path import basename
>>> basename(response.url)
'MechJeb2-2.6.0.0.zip'
from urllib import request
url = 'file download link'
filename = request.urlopen(request.Request(url)).info().get_filename()
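The two approaches above can be combined: prefer the name from the Content-Disposition header, and fall back to the last path segment of the final URL. This is a sketch only; the helper name `pick_filename` is hypothetical and it takes the headers and URL as arguments so it can be tried without a network connection.

```python
from email.message import Message
from os.path import basename
from urllib.parse import unquote, urlsplit


def pick_filename(headers, final_url):
    """Return the file name from Content-Disposition, else from the URL path.

    `headers` is the object returned by response.info(); `final_url` is
    response.url after any redirects.
    """
    name = headers.get_filename()  # None when no filename is advertised
    if name:
        return name
    # urlsplit drops the query string; unquote decodes %XX escapes in the path.
    return basename(unquote(urlsplit(final_url).path))
```

With a real response, this would be called as `pick_filename(response.info(), response.url)`.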

How to open html file that contains Unicode characters?

I have an HTML file called test.html that contains one word: בדיקה.
I open test.html and print its content using this block of code:
file = open("test.html", "r")
print file.read()
but it prints ??????. Why does this happen, and how can I fix it?
BTW, when I open a plain text file it works fine.
Edit: I tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.
I encountered this problem today as well. I am using Windows, and the default system language is Chinese, so others in a similar setup may hit the same Unicode error. Simply add encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
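As a small Python 3 illustration of why the explicit encoding matters, the sketch below writes the word from the question to a temporary file and reads it back both as text and as raw bytes. The temporary path is an assumption made so the example does not touch a real test.html.

```python
import os
import tempfile

word = "בדיקה"
path = os.path.join(tempfile.mkdtemp(), "test.html")

# Write and read with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(word)
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# On disk, each Hebrew letter takes two bytes in UTF-8.
with open(path, "rb") as f:
    raw = f.read()
```

If the file is decoded with a different default encoding, those bytes no longer map back to the original letters, which is how garbled output or a decoding error arises.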
You can make use of the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document= BeautifulSoup(f.read()).get_text()
print(document)
If you want to delete all the blank lines in between and get all the words as a single string (also dropping special characters and numbers), then also include:
import re
import nltk
from nltk.tokenize import word_tokenize
# `stop` is assumed to be a collection of stop words defined elsewhere
st = ""
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)
You can read an HTML page using urllib.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
CODE:
import codecs
path="D:\\Users\\html\\abc.html"
file=codecs.open(path,"rb")
file1=file.read()
file1=str(file1)
You can simply use this
import requests
requests.get(url)
You can use urllib in Python 3 in the same way as
https://stackoverflow.com/a/27243244/4815313 with a few changes.
#python3
import urllib.request  # a plain "import urllib" does not load the request submodule
page = urllib.request.urlopen("/path/").read()
print(page)

Python: saving image from web to disk

Can I save images to disk using python? An example of an image would be:
Easiest is to use urllib.urlretrieve.
Python 2:
import urllib
urllib.urlretrieve('http://chart.apis.google.com/...', 'outfile.png')
Python 3:
import urllib.request
urllib.request.urlretrieve('http://chart.apis.google.com/...', 'outfile.png')
If your goal is to download a png to disk, you can do so with urllib:
import urllib
urladdy = "http://chart.apis.google.com/chart?chxl=1:|0|10|100|1%2C000|10%2C000|100%2C000|1%2C000%2C000|2:||Excretion+in+Nanograms+per+gram+creatinine+milliliter+(logarithmic+scale)|&chxp=1,0|2,0&chxr=0,0,12.1|1,0,3&chxs=0,676767,13.5,0,lt,676767|1,676767,13.5,0,l,676767&chxtc=0,-1000&chxt=y,x,x&chbh=a,1,0&chs=640x465&cht=bvs&chco=A2C180&chds=0,12.1&chd=t:0,0,0,0,0,0,0,0,0,1,0,0,3,2,4,6,6,9,3,6,5,11,9,10,6,2,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0&chdl=n=87&chtt=William+MD+-+Buprenorphine+Graph"
filename = r"c:\tmp\toto\file.png"
urllib.urlretrieve(urladdy, filename)
In python 3, you will need to use urllib.request.urlretrieve instead of urllib.urlretrieve.
The Google chart API produces PNG files. Just retrieve them with urllib.urlopen(url).read() or something along these lines and save them to a file the usual way.
Full example:
>>> import urllib
>>> url = 'http://chart.apis.google.com/chart?chxl=1:|0|10|100|1%2C000|10%2C000|100%2C000|1%2C000%2C000|2:||Excretion+in+Nanograms+per+gram+creatinine+milliliter+(logarithmic+scale)|&chxp=1,0|2,0&chxr=0,0,12.1|1,0,3&chxs=0,676767,13.5,0,lt,676767|1,676767,13.5,0,l,676767&chxtc=0,-1000&chxt=y,x,x&chbh=a,1,0&chs=640x465&cht=bvs&chco=A2C180&chds=0,12.1&chd=t:0,0,0,0,0,0,0,0,0,1,0,0,3,2,4,6,6,9,3,6,5,11,9,10,6,2,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0&chdl=n=87&chtt=William+MD+-+Buprenorphine+Graph'
>>> image = urllib.urlopen(url).read()
>>> outfile = open('chart01.png','wb')
>>> outfile.write(image)
>>> outfile.close()
As noted in other examples, `urllib.urlretrieve(url, outfilename)` is even more straightforward, but playing with urllib and urllib2 will surely be instructive for you.
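For Python 3, here is a minimal sketch of the same read-and-write approach that streams the response to disk instead of holding it all in memory. The function name `save_url` is hypothetical.

```python
import shutil
from urllib.request import urlopen


def save_url(url, out_path):
    """Download `url` and write the raw bytes to `out_path`."""
    # Binary mode is required, since image data is not text.
    with urlopen(url) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return out_path
```

Calling `save_url(url, 'chart01.png')` with the chart URL above would mirror the interactive session.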
