Splitting text in Python - python

I'm writing some script which capture data from web site and save them into DB. Some of datas are merged and I need to split them. I have sth like this
Endokrynologia (bez st.),Położnictwo i ginekologia (II st.)
So i need to get:
Endokrynologia (bez st.)
Położnictwo i ginekologia (II st.)
So i wrote some code in python:
#!/usr/bin/env python
# -*- encoding: utf-8
import MySQLdb as mdb
from lxml import html, etree
import urllib
import sys
import re
Nr = 17268
Link = "http://rpwdl.csioz.gov.pl/rpz/druk/wyswietlKsiegaServletPub?idKsiega="
sock = urllib.urlopen(Link+str(Nr))
htmlSource = sock.read()
sock.close()
root = etree.HTML(htmlSource)
result = etree.tostring(root, pretty_print=True, method="html")
Spec = etree.XPath("string(//html/body/div/table[2]/tr[18]/td[2]/text())")
Specjalizacja = Spec(root)
if re.search(r'(,)\b', Specjalizacja):
text = Specjalizacja.split()
print text[0]
print text[1]
and i get:
Endokrynologia
(bez
what i'm doing wrong ?

you would try to replace
text = Specjalizacja.split()
with
text = Specjalizacja.split(',')
Don't know whether that would fix your problem.

Related

Getting wrong characters in pt-br from xml in python

I'm trying to send data from a XML feed to MySQL database, but I'm getting wrong pt-br characters in python and mysql.
import MySQLdb
import urllib2
import sys
import codecs
## default enconding
reload(sys)
sys.setdefaultencoding('utf-8')
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
file = urllib2.urlopen('feed.xml')
data = file.read()
file.close()
data = xmltodict.parse(data)
db = MySQLdb.connect(host=MYSQL_HOST, # your host, usually localhost
user=MYSQL_USER, # your username
passwd=MYSQL_PASSWD, # your password
db=MYSQL_DB) # name of the data base
cur = db.cursor()
product_name = str(data.items()[0][1].items()[2][1].items()[3][1][i].items()[1][1])
But when I print product_name in Python or insert it into mysql, I get this:
'Probi\xc3\xb3tica (120caps)'
this should be:
'Probiótica'
How can I fix this?
'Probi\xc3\xb3tica' is the utf-8 encoded version of 'Probiótica'.
Is your terminal (or whatever you are using to run this) set up to handle utf-8 output?
Try print 'Probi\xc3\xb3tica'.decode('utf-8') to see what happens.
I get Probiótica.

How to detect change on webpage without reload

I found ho to detect by using perl.
How to detect a changed webpage?
But unfortunatelly I don't know perl.
Is there a way in python?
Can you give a detailed example if you do not complicate?
Do you mean an python script, which reads a webpage and shows you if it is different from the last visit? A very simple version would be this (works for python2 and python3):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import requests
from hashlib import sha1
recent_hash_filename = ".recent_hash"
def test(url):
print("looking up %s" % url)
if not os.path.exists(recent_hash_filename):
open(recent_hash_filename, 'a').close()
hash_fetched = sha1()
hash_read = ""
r = requests.get(url)
hash_fetched.update(r.text.encode("utf8"))
with open(recent_hash_filename) as f:
hash_read = f.read()
print(hash_fetched.hexdigest())
print(hash_read)
if hash_fetched.hexdigest() == hash_read:
print("same")
else:
print("different")
with open(recent_hash_filename, "w") as f:
f.write(hash_fetched.hexdigest())
if __name__ == '__main__':
if len(sys.argv) > 1:
url = sys.argv[1]
else:
url = "https://www.heise.de"
test(url)
print("done")
If you have any questions just let me know

How to open html file that contains Unicode characters?

I have html file called test.html it has one word בדיקה.
I open the test.html and print it's content using this block of code:
file = open("test.html", "r")
print file.read()
but it prints ??????, why this happened and how could I fix it?
BTW. when I open text file it works good.
Edit: I'd tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
import codecs
f=codecs.open("test.html", 'r')
print f.read()
Try something like this.
I encountered this problem today as well. I am using Windows and the system language by default is Chinese. Hence, someone may encounter this Unicode error similarly. Simply add encoding = 'utf-8':
with open("test.html", "r", encoding='utf-8') as f:
text= f.read()
you can make use of the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document= BeautifulSoup(f.read()).get_text()
print(document)
If you want to delete all the blank lines in between and get all the words as a string (also avoid special characters, numbers) then also include:
import nltk
from nltk.tokenize import word_tokenize
docwords=word_tokenize(document)
for line in docwords:
line = (line.rstrip())
if line:
if re.match("^[A-Za-z]*$",line):
if (line not in stop and len(line)>1):
st=st+" "+line
print st
*define st as a string initially, like st=""
You can read HTML page using 'urllib'.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
CODE:
import codecs
path="D:\\Users\\html\\abc.html"
file=codecs.open(path,"rb")
file1=file.read()
file1=str(file1)
You can simply use this
import requests
requests.get(url)
you can use 'urllib' in python3 same as
https://stackoverflow.com/a/27243244/4815313 with few changes.
#python3
import urllib
page = urllib.request.urlopen("/path/").read()
print(page)

How to display a pdf that has been downloaded in python

I have grabbed a pdf from the web using for example
import requests
pdf = requests.get("http://www.scala-lang.org/docu/files/ScalaByExample.pdf")
I would like to modify this code to display it
from gi.repository import Poppler, Gtk
def draw(widget, surface):
page.render(surface)
document = Poppler.Document.new_from_file("file:///home/me/some.pdf", None)
page = document.get_page(0)
window = Gtk.Window(title="Hello World")
window.connect("delete-event", Gtk.main_quit)
window.connect("draw", draw)
window.set_app_paintable(True)
window.show_all()
Gtk.main()
How do I modify the document = line to use the variable pdf that contains the pdf?
(I don't mind using popplerqt4 or anything else if that makes it easier.)
It all depends on the OS your using. These might usually help:
import os
os.system('my_pdf.pdf')
or
os.startfile('path_to_pdf.pdf')
or
import webbrowser
webbrowser.open(r'file:///my_pdf.pdf')
How about using a temporary file?
import tempfile
import urllib
import urlparse
import requests
from gi.repository import Poppler, Gtk
pdf = requests.get("http://www.scala-lang.org/docu/files/ScalaByExample.pdf")
with tempfile.NamedTemporaryFile() as pdf_contents:
pdf_contents.file.write(pdf)
file_url = urlparse.urljoin(
'file:', urllib.pathname2url(pdf_contents.name))
document = Poppler.Document.new_from_file(file_url, None)
Try this and tell me if it works:
document = Poppler.Document.new_from_data(str(pdf.content),len(repr(pdf.content)),None)
If you want to open pdf using acrobat reader then below code should work
import subprocess
process = subprocess.Popen(['<here path to acrobat.exe>', '/A', 'page=1', '<here path to pdf>'], shell=False, stdout=subprocess.PIPE)
process.wait()
Since there is a library named pyPdf, you should be able to load PDF file using that.
If you have any further questions, send me messege.
August 2015 : On a fresh intallation in Windows 7, the problem is still the same :
Poppler.Document.new_from_data(data, len(data), None)
returns : Type error: must be strings not bytes.
Poppler.Document.new_from_data(str(data), len(data), None)
returns : PDF document is damaged (4).
I have been unable to use this function.
I tried to use a NamedTemporayFile instead of a file on disk, but for un unknown reason, it returns an unknown error.
So I am using a temporary file. Not the prettiest way, but it works.
Here is the test code for Python 3.4, if anyone has an idea :
from gi.repository import Poppler
import tempfile, urllib
from urllib.parse import urlparse
from urllib.request import urljoin
testfile = "d:/Mes Documents/en cours/PdfBooklet3/tempfiles/preview.pdf"
document = Poppler.Document.new_from_file("file:///" + testfile, None) # Works fine
page = document.get_page(0)
print(page) # OK
f1 = open(testfile, "rb")
data1 = f1.read()
f1.close()
data2 = "".join(map(chr, data1)) # converts bytes to string
print(len(data1))
document = Poppler.Document.new_from_data(data2, len(data2), None)
page = document.get_page(0) # returns None
print(page)
pdftempfile = tempfile.NamedTemporaryFile()
pdftempfile.write(data1)
file_url = urllib.parse.urljoin('file:', urllib.request.pathname2url(pdftempfile.name))
print( file_url)
pdftempfile.seek(0)
document = Poppler.Document.new_from_file(file_url, None) # unknown error

With regards to urllib AttributeError: 'module' object has no attribute 'urlopen'

import re
import string
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
filehandle = urllib.urlopen('http://www.google.com/') #open webpage
s = filehandle.read() #read
print s #display
#what i plan to do with it once i get the first part working
#results = re.findall('[<td style="font-weight:bold;" nowrap>$][0-9][0-9][0-9][.][0-9][0-9][</td></tr></tfoot></table>]',s)
#earnings = '$ '
#for money in results:
#earnings = earnings + money[1]+money[2]+money[3]+'.'+money[5]+money[6]
#print earnings
#raw_input()
this is the code that i have so far. now i have looked at all the other forums that give solutions such as the name of the script, which is parse_Money.py, and i have tried doing it with urllib.request.urlopen AND i have tried running it on python 2.5, 2.6, and 2.7. If anybody has any suggestions it would be really welcome, thanks everyone!!
--Matt
---EDIT---
I also tried this code and it worked, so im thinking its some kind of syntax error, so if anybody with a sharp eye can point it out, i would be very appreciative.
import shutil
import os
import os.path
import time
import datetime
import math
import urllib
from array import array
import random
b = 3
#find URL
URL = raw_input('Type the URL you would like to read from[Example: http://www.google.com/] :')
while b == 3:
#get file name
file1 = raw_input('Enter a file name for the downloaded code:')
filepath = file1 + '.txt'
if os.path.isfile(filepath):
print 'File already exists'
b = 3
else:
print 'Filename accepted'
b = 4
file_path = filepath
#open file
FileWrite = open(file_path, 'a')
#acces URL
filehandle = urllib.urlopen(URL)
#display souce code
for lines in filehandle.readlines():
FileWrite.write(lines)
print lines
print 'The above has been saved in both a text and html file'
#close files
filehandle.close()
FileWrite.close()
it appears that the urlopen method is available in the urllib.request module and not in the urllib module as you're expecting.
rule of thumb - if you're getting an AttributeError, that field/operation is not present in the particular module.
EDIT - Thanks to AndiDog for pointing out - this is a solution valid for Py 3.x, and not applicable to Py2.x!
The urlopen function is actually in the urllib2 module. Try import urllib2 and use urllib2.urlopen
I see that you are using Python2 or at least intend to use Python2.
urlopen helper function is available in both urllib and urllib2 in Python2.
What you need to do this, execute this script against the correct version of your python
C:\Python26\python.exe yourscript.py

Categories

Resources