How to get a specific code from a website using re in Python

I'm trying to solve the Python Challenge.
http://www.pythonchallenge.com/pc/def/ocr.html
OK, I know I could just copy and paste the code from the page source into a txt file and work from that, but I want to fetch it from the net to improve my skills (and I have already solved it that way). I have tried
re.findall(r"<!--(.*?)-->", html)
but it doesn't match anything.
My full code is here:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes
I also tried it like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes
Now it finds some text, but it still can't capture that whole mess :(

I would use an HTML parser instead. You can find comments in HTML with BeautifulSoup.
Working code:
import requests
from bs4 import BeautifulSoup, Comment
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
response = requests.get(link)
soup = BeautifulSoup(response.content, "html.parser")
code = soup.find_all(text=lambda text: isinstance(text, Comment))[-1]
print(code.strip())
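The lambda passed to find_all acts as a filter over every text node: only nodes that are Comment instances survive, and [-1] takes the last comment on the page. An equivalent sketch that makes the filtering explicit:
import requests
from bs4 import BeautifulSoup, Comment

link = "http://www.pythonchallenge.com/pc/def/ocr.html"
soup = BeautifulSoup(requests.get(link).content, "html.parser")
# Collect every text node, then keep only the ones that are comments.
comments = [node for node in soup.find_all(text=True) if isinstance(node, Comment)]
print(comments[-1].strip())  # the challenge data sits in the last comment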

Not sure what you mean by "that mess". You should include all the details of the challenge in this post instead of linking users to the pythonchallenge page.
Either way, if you set the regex to single-line mode (the /s flag, re.S in Python), the dot character . will match newlines (\n) as well. This obviates the \n(.*)\n construction in your regex, which may solve your problem.
Here is the modified python 2.7 code:
#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
print codes[1]
Note the re.S, (.*?), and codes[1] modifications.
re.S is Python's flag for single-line mode (/s)
(.*?) makes the * quantifier non-greedy
codes[1] prints the contents of the second HTML comment (findall(..) finds two comments and returns a list of both matches).

You can also solve it with:
codes = re.findall(r"<!--(.*?)-->", str(x.content), re.S)
The re.S flag makes . match whitespace and line breaks as well.

Related

Getting a specific file from requested iframe

I want to get the video file link for the anime I'm watching on the site.
import requests
from bs4 import BeautifulSoup
import re
page = requests.get("http://naruto-tube.org/shippuuden-sub-219")
soup = BeautifulSoup(page.content, "html.parser")
inner_content = requests.get(soup.find("iframe")["src"])
print(inner_content.text)
The output is the source code of the file hoster's website (ani-stream). My problem now is: how do I get just the "file: xxxxxxx" line to be printed, and not the whole source code?
You can use Beautiful Soup to parse the iframe's source code and find the script elements, but from there you're on your own. The file: "xxxxx" line is in JavaScript code, so you'll have to find the function call (to playerInstance.setup() in this case), decide which of the two such "file:" lines is the one you want, and strip away the unwanted JS syntax around the URL.
Regular expressions will help with that, and you're probably better off just looking for the lines in the iframe's HTML. You already have re imported, so I just replaced your last line with:
lines = re.findall("file: .*$", inner_content.text, re.MULTILINE)
print( '\n'.join(lines) )
...to get a list of lines with "file:" in them. You can (and should) use a fancier RE that finds just the one with "http://" and allows only whitespace before "file:" on the line. (Python, Java, and my text editor all have different ideas about what goes in an RE, so I have to go to the docs every time I write one. You can do that too; it's your problem, after all.)
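For instance, a tighter pattern could require an http URL and allow only whitespace before "file:"; a sketch (the exact quoting on the page may differ, so treat the pattern as a starting point):
# Match lines like:    file: "http://.../video.mp4"
lines = re.findall(r'^\s*file:\s*"(http://[^"]+)"', inner_content.text, re.MULTILINE)
print('\n'.join(lines))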
The requests.get() function doesn't seem to work to get the bytes. Try Vishnu Kiran's urlretrieve approach--maybe that will work. Using the URL in a browser window does seem to get the right video, though, so there may be a user agent and/or cookie setting that you'll have to spoof.
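If spoofing does turn out to be necessary, requests can send a browser-like User-Agent header along with the request; a minimal sketch (the header value is just an example, and "url from the Iframe" is a placeholder):
import requests

# A browser-like User-Agent; some hosts may also require cookies from a prior visit.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
video = requests.get("url from the Iframe", headers=headers)
with open("mp4.mp4", "wb") as f:
    f.write(video.content)  # write the raw video bytes to disk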
If the iframe's source is not on the primary domain of the website (naruto-tube.org), its contents cannot be accessed just by scraping the parent page.
You will either have to use a different website or take the URL from the iframe and call it with a library like requests.
Note that you must also pass along any parameters the URL requires in order to actually get a result. Like so:
import urllib
urllib.urlretrieve ("url from the Iframe", "mp4.mp4")
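If the URL takes query parameters, they can be appended before downloading; a sketch with a hypothetical parameter name:
import urllib

params = urllib.urlencode({"id": "219"})  # hypothetical parameter name and value
urllib.urlretrieve("url from the Iframe" + "?" + params, "mp4.mp4")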

Handling &#xD; and &#xA; in Python

Problem Background:
I have an XML file that I'm importing into BeautifulSoup and parsing through. One node has the following:
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
Notice that the value has &#xD; and &#xA; within the text. I understand those are the XML representations of carriage return and line feed.
When I import into BeautifulSoup, the value gets converted into the following:
<DIAttribute name="ObjectDesc" value="Line1
Line2
Line3"/>
You'll notice that the &#xD;&#xA; gets converted to a newline.
My use case requires that the value remains as the original. Any idea how to get that to stay? Or convert it back?
Source Code:
python: (2.7.11)
from bs4 import BeautifulSoup #version 4.4.0
s = BeautifulSoup(open('test.xml'),'lxml-xml',from_encoding="ansi")
print s.DIAttribute
#XML file looks like
'''
<?xml version="1.0" encoding="UTF-8" ?>
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
'''
Notepad++ says the encoding of the source XML file is ANSI.
Things I've Tried:
I've scoured the documentation without any success.
Variations for line 3:
print s.DIAttribute.prettify('ascii')
print s.DIAttribute.prettify('windows-1252')
print s.DIAttribute.prettify('ansi')
print s.DIAttribute.prettify('utf-8')
print s.DIAttribute['value'].replace('\r','&#xD;').replace('\n','&#xA;') # This works, but it feels like a band-aid, and other problems will likely remain.
Any ideas anyone? I appreciate any comments/suggestions.
Just for the record, first the libraries that DO NOT handle the &#xD; entity properly: BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES), lxml.html.soupparser.unescape, xml.sax.saxutils.unescape.
And this is what works (in Python 2.x):
import sys
import HTMLParser
## accept file name as argument, or read stdin if nothing passed
data = len(sys.argv) > 1 and open(sys.argv[1]).read() or sys.stdin.read()
parser = HTMLParser.HTMLParser()
print parser.unescape(data)
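For what it's worth, in Python 3 the same unescaping is exposed as html.unescape (an addition to the answer above, not part of the original); a minimal sketch:
import html

# Numeric character references are decoded to the actual control characters.
print(repr(html.unescape("Line1&#xD;&#xA;Line2")))  # 'Line1\r\nLine2'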

How do I use PYTHONIOENCODING environment variable to get past a unicode interpretation issue

I am trying to run a very short script in Python
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read().strip()
dhtml = str(html, "utf-8").strip()
soup = BeautifulSoup(dhtml.strip(), "html.parser")
I asked a similar question earlier, and this question was created based on a comment by J Sebastian on his answer to "Python program is running in IDLE but not in command line".
Is there a way to set PYTHONIOENCODING earlier, in either GitHub's Atom or Sublime Text 2, to automatically encode soup.prettify() output as UTF-8?
I am going to run this program on a server (of course, the current portion is merely a quick test)
s = soup.prettify().encode('utf8') makes the output UTF-8 explicitly.
Setting PYTHONIOENCODING=utf8 in the shell and then using print(soup.prettify()) should apply that encoding implicitly.
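One way to do that without touching the editor configuration is to launch the script through a small wrapper that sets the variable; a minimal sketch, assuming the scraper is saved as myscript.py (a hypothetical name):
import os
import subprocess
import sys

# Run the script in a child interpreter with PYTHONIOENCODING set, so its
# stdout is encoded as UTF-8 regardless of the console's default encoding.
env = dict(os.environ, PYTHONIOENCODING="utf8")
subprocess.check_call([sys.executable, "myscript.py"], env=env)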

Need help figuring out an indentation error in Python code

I get an indentation error when trying to run the code below. I am trying to recursively print out the URLs of a set of HTML pages.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin
# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])
def crawl(self,pages,depth=2):
for i in range(depth):
newpages=set()
for page in pages:
try:
c=urllib2.urlopen(page)
except:
print "Could not open %s" % page
continue
soup=BeautifulSoup(c.read())
self.addtoindex(page,soup)
links=soup('a')
for link in links:
if ('href' in dict(link.attrs)):
url=urljoin(page,link['href'])
if url.find("'")!=-1: continue
url=url.split('#')[0] # remove location portion
if url[0:4]=='http' and not self.isindexed(url):
newpages.add(url)
linkText=self.gettextonly(link)
self.addlinkref(page,url,linkText)
self.dbcommit()
pages=newpages
Well, your code is totally unindented, so Python will cry when you try to run it.
Remember that in Python whitespace is significant. Indenting with 4 spaces rather than tabs avoids a lot of "invisible" indentation errors.
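For example, with consistent 4-space indentation the structure of the fetch loop becomes unambiguous; a sketch based on the posted code:
for page in pages:
    try:
        c = urllib2.urlopen(page)
    except Exception:
        print "Could not open %s" % page
        continue
    soup = BeautifulSoup(c.read())  # parse the fetched page
    # ...the link-extraction block continues at this indentation level...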
I've down-voted because the code was pasted unformatted/unindented, which means either the poster doesn't understand Python (and hasn't read a basic tutorial) or pasted the code without re-indenting it, which makes it impossible for anyone to answer.

Writing the result of a regexed document back to the document in python

I'm trying to create a python script to do a number of regular expression substitutions on a LaTeX document immediately before typesetting it, but I seem to be having some problem making the substitutions take effect. My script is as follows:
# -*- coding: utf-8 -*-
import os, re, sys
tex = sys.argv[-1]
tex_file = open(tex, "r+")
tex_file_data = tex_file.read()
# DO SOME REGEXES
tex_file_data = re.sub(r"\b_(.*?)_\b", r"\emph{\1}", tex_file_data)
tex_file.write(tex_file_data)
# PROCESS THE DOCUMENT
os.system("xelatex --shell-escape " + tex_file.name)
Each time I attempt to process a document with this script, however, I get the usual ! Missing $ inserted. error. According to the regular expression, those underscores should have been replaced with suitable syntax. However, if I replace the final line with print(tex_file_data), the console displays the document with the changes in effect. As far as I can tell, the problem is that the edited document is not being saved correctly, but I am not sure what I am doing wrong.
How might I fix this problem so that the script can be used to process documents?
EDIT: At @Yuushi's suggestion, I've edited the script as follows:
# -*- coding: utf-8 -*-
import os, re, sys
with open(sys.argv[-1], "r+") as tex_file:
    tex_file_data = tex_file.read()
    tex_file_data = re.sub(r"\_(.*)\_", r"\\emph{\1}", tex_file_data)
    tex_file.write(tex_file_data)
os.system("xelatex --shell-escape " + tex_file.name)
However, I am still getting the ! Missing $ inserted. error, which suggests that the original document is still being sent to the LaTeX compiler rather than the regexed one.
You likely have two problems. Firstly, after a read, the stream is set to the end position, so you'll need to reset it to the start with a tex_file.seek(0) before you call write. Secondly, you never close the file, and the writes are probably buffered, hence you need a tex_file.close() at the end. Better still would be to use a with statement:
with open(sys.argv[-1], 'r+') as tex_file:
    tex_file_data = tex_file.read()
    tex_file_data = re.sub(r"\_(.*)\_", r"\\emph{\1}", tex_file_data)
    tex_file.seek(0)
    tex_file.write(tex_file_data)
os.system("xelatex --shell-escape " + tex_file.name)
