Why does it say "'PDFDocument' is not defined" in pdfminer? - python

I am a complete beginner with Python. I literally started last weekend. I am using Python 3.
I am trying to read text from a PDF file. I first tried PyPDF2, following the instructions in Automate the Boring Stuff, but the result had no spaces between words and was therefore unusable. I then installed pdfminer3k by typing "pip install pdfminer3k" on the command line.
I then entered the following lines into the interpreter:
import pdfminer, os
base_path = ("C://Users//ross_")
my_file = os.path.join(base_path + "/" + "sample2.pdf")
log_file = os.path.join(base_path + "/" + "pdf_log.txt")
password = ""
extracted_text = ""
fp = open(my_file, "rb")
parser = PDFParser(fp)
document = PDFDocument(parser, password)
But the last line gave me this error message:
Traceback (most recent call last):
File "", line 1, in
document = PDFDocument(parser, password)
NameError: name 'PDFDocument' is not defined
Does anyone have an idea why I get that error message? I thought PDFDocument would have been defined in the pdfminer module. More generally, how do I figure out stuff like this? Isn't there a resource somewhere that explains how to use modules like pdfminer? Many thanks and apologies for my total ignorance.
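The NameError comes from how Python imports work rather than from pdfminer itself: "import pdfminer" binds only the name pdfminer, so PDFParser and PDFDocument are never defined as bare names in the interpreter. Below is a minimal sketch of the same effect, using the standard library's os.path.join as a stand-in. For pdfminer the equivalent fix would be a "from ... import ..." line naming the submodule that defines each class; for pdfminer3k that is assumed to be pdfminer.pdfparser, which should be checked against the installed package (for example with help() or dir()).

```python
import os  # binds only the name "os"; nothing inside it becomes a bare name

try:
    # Same mistake as calling PDFParser(fp) after "import pdfminer":
    # the bare name "join" was never imported.
    join("C:/Users", "sample2.pdf")
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Option 1: qualify the name with the module it lives in.
qualified = os.path.join("C:/Users", "sample2.pdf")

# Option 2: import the name itself from the submodule that defines it.
from os.path import join
direct = join("C:/Users", "sample2.pdf")
```

Either option works; the second is what tutorials for pdfminer usually rely on, which is why their code can call the classes without a module prefix.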

Related

Python script that executes another python script

I'm pretty new to Python. I started a project but got stuck because my script won't run correctly. One script is supposed to run a second script, but instead of converting the files it either does nothing or raises a syntax error. The first script writes new lines into the second one so that the file name to be converted is always the newest file. It looks something like this:
import glob
import os.path
folder_path = r'C:\User\Desktop\Folder\Audio'
file_type = r'\*mp4'
files = glob.glob(folder_path + file_type)
max_file = max(files, key=os.path.getctime)
mp3_file = max_file.replace('.mp4', '')
with open ("file.py", 'w') as f:
    f.write("")
with open ("file.py", 'w') as f:
    f.write('from moviepy.editor import *\n' "mp4_file = '{}'\n"
            "mp3_file = '{}.mp3'\n" 'videoclip = VideoFileClip(mp4_file)\n' 'audioclip = videoclip.audio\n'
            'audioclip.write_audiofile(mp3_file)\n' 'audioclip.close()\n' 'videoclip.close()\n'.format(max_file, mp3_file))
exec(open("file.py").read())
Right now it gives this error:
Traceback (most recent call last):
File "C:\Users\Desktop\Folder\Audio\File Manager.py", line 19, in <module>
exec(open("file.py").read())
File "<string>", line 2
mp4_file = 'C:\User\Desktop\Folder\Audio\test.mp4'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I don't necessarily plan on using that exact line of code to execute my Python file, since there are many alternatives, but if I was on the right track, I might as well. The file that's supposed to be executed contains generic file-converting code:
from moviepy.editor import *
mp4_file = 'C:\User\Desktop\Folder\Audio\test.mp4'
mp3_file = 'C:\User\Desktop\Folder\Audio\test.mp3'
videoclip = VideoFileClip(mp4_file)
audioclip = videoclip.audio
audioclip.write_audiofile(mp3_file)
audioclip.close()
videoclip.close()
Other solutions mostly gave me a blank, inactive shell. If the answer is that this is impossible, I'll take that as a valid answer, but please explain why.
Corrections
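You mix single quotes ' and double quotes " in the strings you write to the file. That is legal, but consistent quoting would make the code easier to read.
The error comes from the generated file, not from the write itself: the path ends up inside an ordinary (non-raw) string literal, 'C:\User\...', and in such a literal \U starts an eight-hex-digit Unicode escape. \User is not a valid escape, hence the "truncated \UXXXXXXXX escape" message (the caret ^ marks the offending literal). Writing the path as a raw string (r'...') or with its backslashes escaped avoids this.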
Suggestions
Don't write to a file and then immediately read it back. Operating systems differ in how they handle such back-to-back access, which can cause strange problems (though that is not your issue here).
Just create a function, extractMp3FromVideoFile, that takes two arguments, max_file and mp3_file.
Instead of writing to a file and adding disk I/O, put the code in a variable and exec it.
Solution
import glob
import os.path
folder_path = r'C:\User\Desktop\Folder\Audio'
file_type = r'\*mp4'
files = glob.glob(folder_path + file_type)
max_file = max(files, key=os.path.getctime)
mp3_file = max_file.replace('.mp4', '') + '.mp3'
# {!r} inserts repr() of each path, so the backslashes are escaped in the generated source
code = "from moviepy.editor import *\nmp4_file = {!r}\nmp3_file = {!r}\nvideoclip = VideoFileClip(mp4_file)\naudioclip = videoclip.audio\naudioclip.write_audiofile(mp3_file)\naudioclip.close()\nvideoclip.close()\n".format(max_file, mp3_file)
exec(code)
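The 'unicodeescape' error can be reproduced without moviepy, which makes the cause easier to see. The sketch below uses only built-ins and the example path from the question; the variable names are illustrative. It pastes the Windows path into generated source twice, once between plain quotes and once through repr().

```python
path = r'C:\User\Desktop\Folder\Audio\test.mp4'  # example path from the question

# Naive template: the path is pasted between quotes as-is, so the generated
# source contains \U, which Python reads as the start of a Unicode escape.
naive_code = "mp4_file = '{}'\n".format(path)
try:
    compile(naive_code, "<generated>", "exec")
    naive_error = None
except SyntaxError as exc:
    naive_error = str(exc)

# {!r} inserts repr(path): a quoted literal with the backslashes escaped.
safe_code = "mp4_file = {!r}\n".format(path)
namespace = {}
exec(safe_code, namespace)  # namespace["mp4_file"] now holds the original path
```

The same reasoning applies whether the generated source is written to file.py or passed straight to exec; a plain function that takes the two paths as arguments avoids generating source at all.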

Why GeoIP showing error in log analysis?

I have this GeoIP lookup code, and when I run it, I get an error saying there is no module named GeoIP. This is my code:
import sys
import GeoIP
gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
with open ('Desktop/trail.txt', 'r') as f:
    for line_string in f.readlines():
        line = line_string.rstrip()
        arr = line.split()
        try:
            country_code = gi.country_code_by_addr(arr[0])
            country_name = "\"" + gi.country_name_by_addr(arr[0]) + "\""
            arr.append(country_code)
            arr.append(country_name)
        except:
            arr.append("None")
        print ",".join(arr)
This is the error :
line 4, in
gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
GeoIP.error: [Errno 2] No such file or directory: '/usr/local/var/GeoIP/GeoIP.dat'
You are missing the GeoIP.dat database file, which your code uses to look up the information.
Download it from the official release page.
Scroll down to GeoLite Country and choose Download.
Move the downloaded file, GeoIP.dat.gz, to /usr/local/var/GeoIP/.
Extract it by right-clicking and selecting Extract Here, or with the following command: gunzip GeoIP.dat.gz
A file named GeoIP.dat will appear; leave it in this directory.
The database file is now at /usr/local/var/GeoIP/GeoIP.dat.
Run the script again and let me know if the problem persists.
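To make this failure easier to diagnose in future, the script could check for the database file before calling GeoIP.new. This is a minimal sketch using only the standard library; the helper name check_geoip_database is hypothetical, and the default path is simply the one shown in the error message.

```python
import os

DEFAULT_DB = '/usr/local/var/GeoIP/GeoIP.dat'  # path taken from the error message

def check_geoip_database(path=DEFAULT_DB):
    """Return None if the database file exists, else a readable explanation."""
    if os.path.isfile(path):
        return None
    return "GeoIP database not found at {0}; download and extract GeoIP.dat there".format(path)

problem = check_geoip_database()
if problem:
    print(problem)
```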

print a pdf file in python

I want to print a pdf file in python. My code is as below:
def printing_reports():
    import os
    fp = open("/path-to-file/path.txt",'r')
    for line in fp:
        os.system('lp -d HPLaserJ %s' % (str(line)))
I am on Fedora 20. path.txt is a file that contains the path to the PDF file, like '/home/user/a.pdf'.
When I run the code, it says "No such file or directory".
Thanks
Try this code; it may help. It strips the trailing newline from each line before passing the path to lp:
import os
def printing_reports():
    fp = open("/path-to-file/path.txt",'r')
    for line in fp:
        os.system('lp -d HPLaserJ {0}'.format(line.strip()))
printing_reports()
Make sure the file named on each line exists.
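A slightly more defensive variant of the same idea is sketched below. It passes the command to subprocess as an argument list, so spaces in a path need no shell quoting, and it skips paths that do not exist. The helper names are hypothetical, and stripping surrounding quotes is an assumption based on the question showing the path in quotes.

```python
import os
import subprocess

def build_lp_command(printer, pdf_path):
    """Build the lp invocation as an argument list (no shell involved)."""
    return ['lp', '-d', printer, pdf_path]

def read_pdf_paths(lines):
    """Strip whitespace, newlines and surrounding quotes; skip blank lines."""
    paths = []
    for line in lines:
        path = line.strip().strip('\'"')
        if path:
            paths.append(path)
    return paths

def print_reports(list_file, printer='HPLaserJ'):
    """Send every existing file listed in list_file to the given printer."""
    with open(list_file) as fp:
        for path in read_pdf_paths(fp):
            if not os.path.isfile(path):
                print('skipping missing file: {0}'.format(path))
                continue
            subprocess.call(build_lp_command(printer, path))
```

Calling print_reports("/path-to-file/path.txt") would then replace the printing_reports() call above.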
Old question, but as I needed an answer on how to print a PDF file from Python, I found this approach more thorough:
import cups
conn = cups.Connection()
printers = conn.getPrinters()
printer_name = list(printers.keys())[0]  # first configured printer; list() is needed on Python 3
ppd_options = {}
cups_job_id = conn.printFile(printer_name, '/path/to/a.pdf', "Title printjob", ppd_options)
It uses the pycups module, which needs CUPS >= 1.7 installed on your system (according to their GitHub page).
The ppd_options dictionary can simply be left empty. (PPD: PostScript Printer Description)

Changing lyrics in an MP3 file via eyeD3

I am trying to create a program in Python which automatically retrieves lyrics for a particular folder of MP3s. (I get the lyrics from azlyrics.com.)
So far, I have succeeded in doing everything except actually embedding the lyrics into the "lyrics" tag.
You answered a question about reading the lyrics from its tag over here.
I was wondering if you could help me with setting the lyrics. Here's my code.
import urllib2 # For downloading webpage
import time # For pausing
import eyed3 # For MP3s
import re # For replacing characters
import os # For reading folders
path = raw_input('Please enter folder of music') # TODO Must make GUI PATH SELECTION
files = os.listdir(path)
for x in files:
    # Must make the program stop for a while to minimize server load
    time.sleep(3)
    # Opening MP3
    mp3 = eyed3.load(path + '/' + x)
    # Setting Values
    artist = mp3.tag.artist.lower()
    raw_song = str(mp3.tag.title).lower()
    song = re.sub('[^0-9a-zA-Z]+', '', raw_song) #Stripping songs of anything other than alpha-numeric characters
    # Generating A-Z Lyrics URL
    url = "http://www.azlyrics.com/lyrics/" + artist + "/" + song + ".html"
    # Getting Source and extracting lyrics
    text = urllib2.urlopen(url).read()
    where_start = text.find('<!-- start of lyrics -->')
    start = where_start + 26
    where_end = text.find('<!-- end of lyrics -->')
    end = where_end - 2
    lyrics = unicode(text[start:end].replace('<br />', ''), "UTF8")
    # Setting Lyrics to the ID3 "lyrics" tag
    mp3.tag.lyrics = lyrics ### RUNNING INTO PROBLEMS HERE
    mp3.tag.save()
I get the following error when the second-to-last line is executed:
Traceback (most recent call last):
File "<pyshell#62>", line 31, in <module>
mp3.tag.lyrics = lyrics
AttributeError: can't set attribute
I would also like you to know that I am a 15-year-old who has been learning Python for about a year now. I searched everywhere and tried everything, but I guess I need some help now.
Thanks in advance for all your help!
I don't pretend to understand why this is the way it is, but check out how lyrics are set in the handy example file:
from eyed3.id3 import Tag
t = Tag()
t.lyrics.set(u"""la la la""")
I believe this has to do with lyrics being placed into frames, but others may have to chime in with corrections on that. Note that this will fail unless you pass it unicode.
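One plausible explanation for "can't set attribute" is that lyrics is exposed as a read-only property that returns an accessor object, so it has to be changed through that object's set() method rather than by assignment. The sketch below is a hypothetical stand-in written with plain Python classes, not eyeD3's actual implementation; it only shows how that pattern produces the same AttributeError.

```python
class LyricsAccessor(object):
    """Hypothetical stand-in for the object that manages lyrics frames."""
    def __init__(self):
        self.text = u""

    def set(self, text):
        self.text = text


class Tag(object):
    """Hypothetical tag whose lyrics attribute is a read-only property."""
    def __init__(self):
        self._lyrics = LyricsAccessor()

    @property
    def lyrics(self):  # getter only: no setter is defined
        return self._lyrics


t = Tag()
try:
    t.lyrics = u"la la la"  # plain assignment is rejected
    assignment_error = None
except AttributeError as exc:
    assignment_error = exc

t.lyrics.set(u"la la la")  # going through the accessor works
```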
chcp 65001
eyeD3 --encoding utf8 --add-lyrics "001-001.txt" 001-001.mp3

Undefined entity error while using ElementTree

I have a set of XML files that I need to read and format into a single CSV file. In order to read from the XML files, I have used the solution mentioned here.
My code looks like this:
from os import listdir
import xml.etree.cElementTree as et
files = listdir(".../blogs/")
for i in range(len(files)):
    # fname = ".../blogs/" + files[i]
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    tree=et.fromstring(contents)
    for el in tree.findall('post'):
        post = el.text
    f.close()
This gives me the error cElementTree.ParseError: undefined entity: at the line tree=et.fromstring(contents). Oddly enough, when I run each of the commands on command line Python (without the for-loop though), it runs perfectly.
In case you want to know the XML structure, it is like this:
<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>
So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?
Update: After reading this link I checked files[0] and found that '&' symbol occurs a few times. I think that might be causing the problem. I used a random file to read when I ran the same commands on command line.
As I mentioned in the update, there were some symbols that I suspected might be causing a problem.
The reason the error didn't come up when I ran the same lines on the command line is because I would randomly pick a file that didn't have any such characters.
Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).
from os import listdir
files = listdir(".../blogs/")
for i in range(len(files)):
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    seek1 = contents.find('<post>')
    seek2 = contents.find('</post>', seek1+1)
    while(seek1!=-1):
        # skip the 6-character '<post>' tag and stop just before '</post>'
        post = contents[seek1+6:seek2]
        seek1 = contents.find('<post>', seek1+1)
        seek2 = contents.find('</post>', seek1+1)
    f.close()
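An alternative to a hand-written parser is to neutralise the problematic ampersands and keep using ElementTree. The sketch below assumes the only defects in the files are stray "&" characters and undefined named entities such as &nbsp;; the helper name extract_posts and the sample string are illustrative. Undefined entities are kept as literal text rather than decoded.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start one of XML's five predefined entities
# or a numeric character reference.
STRAY_AMPERSAND = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)')

def extract_posts(xml_text):
    """Escape stray ampersands, parse, and return the text of each <post>."""
    cleaned = STRAY_AMPERSAND.sub('&amp;', xml_text)
    root = ET.fromstring(cleaned)
    return [el.text for el in root.findall('post')]

sample = "<Blog><date> some date </date><post> fish &nbsp;&amp; chips & peas </post></Blog>"
posts = extract_posts(sample)
```

Parsing the sample string directly with ET.fromstring fails on &nbsp;, while the cleaned version parses and each post's text can then be written to the CSV.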
