Divide PDF pages in two - Python

I need a program to divide each PDF page in two (left, right). So I wrote this code, but for some reason it doesn't catch the image for the title. When I tried it with other books it didn't work either.
import os
#Info that i collect to know the numbers of pages and the pdf file name
number = int(input("Number of pages: " ))
file = input("Name of the file: " )
file = str(file) + ".pdf"
text = open("for_the_latex.txt","w")
#Putting the first part of the latex document
a = "\documentclass{article}" + "\n"
b = "\\usepackage{pdfpages}" + "\n"
c = "\\begin{document}"
text.write(a)
text.write(b)
text.write(c)
#This is the core of the program
#It basically write in a text document to include the pdf for each page
for i in range(1,number +1):
    a = "\includepdf[pages=" + str( i) + ",trim=0 0 400 0]{" + file + "}" + "\n"
    text.write(a)
#Writing the finish part
quatro = "\end{document}"
text.write(quatro)
text.close()
#renaming to .tex
os.rename("for_the_latex.txt", "divided.tex")
#activating the latex
os.system("pdflatex divided.tex")
Where is the error?
I want to divide the PDF in two.

Consider the following minimal document (called example-document.pdf) that contains 6 pages, each exactly split in half by colour and number:
\documentclass{article}
\usepackage[paper=a4paper,landscape]{geometry}
\usepackage{pdfpages}
\begin{document}
% http://mirrors.ctan.org/macros/latex/contrib/mwe/example-image-a4-numbered.pdf
\includepdf[pages={1-2},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={3-4},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={5-6},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={7-8},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={9-10},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={11-12},nup=2]{example-image-a4-numbered.pdf}
\end{document}
The idea is to split these back into a 12-page document. Here's the code for LaTeX:
\documentclass{article}
\usepackage[paper=a4paper]{geometry}
\usepackage{pdfpages,pgffor}
\newlength{\pagedim}% To store page dimensions, if necessary
\begin{document}
\foreach \docpage in {1,...,6} {
\settowidth{\pagedim}{\includegraphics[page=\docpage]{example-document.pdf}}% Establish page width
\includepdf[pages={\docpage},trim=0 0 .5\pagedim{} 0,clip]{example-document.pdf}% Left half
\includepdf[pages={\docpage},trim=.5\pagedim{} 0 0 0,clip]{example-document.pdf}% Right half
}
\end{document}
It's not necessary to read in every page and establish its width (stored in \pagedim), but I wasn't sure whether your pages may have differing sizes.

As mentioned in the comment, I'm not quite sure I understand your problem correctly. I can run your program, and its output includes only the left part of each page of the initial document, so I modified the code a bit to include the right part as well.
import os
#Info that i collect to know the numbers of pages and the pdf file name
number = int(input("Number of pages: "))
file = input("Name of the file: ")
file = str(file) + ".pdf"
text = open("divided.tex","w")
#Putting the first part of the latex document
header ='''
\\documentclass{article}
\\usepackage{pdfpages}
\\begin{document}
'''
#This is the core of the program
#It basically write in a text document to include the pdf for each page
middle=''
for i in range(1,number +1):
    middle += "\\includepdf[pages={},trim=0 0 400 0]{{{}}}\n".format(i, file)
    middle += "\\includepdf[pages={},trim=400 0 0 0]{{{}}}\n".format(i, file)
#Writing the finish part
quatro = "\\end{document}"
text.write(header)
text.write(middle)
text.write(quatro)
text.close()
#activating the latex
os.system("pdflatex divided.tex")
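The same idea can be sketched as a small function that only builds the LaTeX source, which makes it easy to inspect before running pdflatex. This is a minimal sketch, not a definitive implementation: the function name build_split_tex and the half_width parameter are hypothetical, the value 400 is carried over from the code above and has to match half of the real page width, and the clip option is taken from the LaTeX answer above.

```python
def build_split_tex(pdf_name, page_count, half_width="400"):
    """Return LaTeX source that includes each page twice: left half, then right half.

    half_width is an assumption: it must equal half the source page width
    (in the units pdfpages expects, big points by default).
    """
    lines = [
        r"\documentclass{article}",
        r"\usepackage{pdfpages}",
        r"\begin{document}",
    ]
    for page in range(1, page_count + 1):
        # trim takes four lengths in the order: left bottom right top
        lines.append(
            r"\includepdf[pages=%d,trim=0 0 %s 0,clip]{%s}" % (page, half_width, pdf_name)
        )  # cut away the right side, keep the left half
        lines.append(
            r"\includepdf[pages=%d,trim=%s 0 0 0,clip]{%s}" % (page, half_width, pdf_name)
        )  # cut away the left side, keep the right half
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"
```

To use it, write the returned string to divided.tex and run pdflatex on that file as in the code above. Raw strings (r"...") avoid having to double every backslash.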


Create a link to a specific word count position such as bookmark in docx

How this project works:
Searches external docx / OCR data for a keyword
Builds a context of 100 words surrounding the keyword
Builds a docx to store the passage with a hyperlink posted under each completed search
What is missing:
A way to link the passage back to its source in the external Word document, so that a hyperlink alone is enough to reach it. The problem is that the OCR docx files I read have no headings to which a run could be bookmarked, and I could not create them for long OCR output. Going into each docx file one by one, reading what is at times gibberish, is not manageable.
So Word needs to be able to store the link target in the new file, where the passage is printed. The hyperlink code below works, but I need something more than what I have here to find the passage's location in its source. Or does MS Word not support something as specific as jumping to the indexed word position of a passage? Can I build a macro and call it from Python to make a link that jumps to that position using the index?
Hyperlinking/bookmark code post ref:
def add_hyperlink(paragraph, text, url):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)
    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )
    # Create a w:r element and a new w:rPr element
    new_run = docx.oxml.shared.OxmlElement('w:r')
    rPr = docx.oxml.shared.OxmlElement('w:rPr')
    # Join all the xml elements together add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)
    # Create a new Run object and add the hyperlink into it
    r = paragraph.add_run()
    r._r.append(hyperlink)
    # A workaround for the lack of a hyperlink style (doesn't go purple after using the link)
    # Delete this if using a template that has the hyperlink style in it
    r.font.color.theme_color = MSO_THEME_COLOR_INDEX.HYPERLINK
    r.font.underline = True
    return hyperlink
def extract_surround_words(text, keyword, n):
    '''
    text : input text
    keyword : the search keyword we are looking
    n : number of words around the keyword
    '''
    # extracting all the words from text
    words = re.findall(r'\w+', text)
    passage = []
    passageText = ''
    saveIndex = []
    passagePos = []
    indexVal = ''
    document = Document()
    document.add_heading("The keyword searched is: " + searchKeyword + ", WORD COUNT: " + str(len(text)) + "\n", 0)
    # iterate through all the words
    for index, word in enumerate(words):
        # check if search keyword matches
        if word == keyword and len(words) > 0:
            saveIndex.append(str(index-n))
            # fetch left side words and right
            passage = words[index - n: index] #start text run
            passage.append(keyword)
            passage += words[index + 1: index + n + 1] #end of run
            passagePos = "\nWORD COUNT POSITION: " + str(saveIndex.pop() + "\n")
            bookmark = add_bookmark(index, passagePos)
            print(str(passagePos))
            for wd in passage:
                passageText += ' ' + wd
            parag = document.add_paragraph(passageText)
            add_hyperlink(parag, passagePos, os.path.join(path, file))
            passage.append("\n\n")
    document.save(os.path.join(output_path, out_file_doc))
    return passageText
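The context-building step on its own can be sketched without any docx handling, which makes the word-index position easier to check. This is a minimal sketch under stated assumptions: the function name context_windows is hypothetical, and it clamps the start of the window at 0, because a negative slice start such as words[index - n: index] near the beginning of the text would otherwise count from the end of the list.

```python
import re


def context_windows(text, keyword, n):
    """Return (start_index, passage) pairs, one per occurrence of keyword.

    start_index is the word position where the passage begins, counted in
    the list of words found by the same \\w+ pattern used above.
    """
    words = re.findall(r"\w+", text)
    results = []
    for index, word in enumerate(words):
        if word == keyword:
            # Clamp at 0 so a keyword near the start still gets its left context
            start = max(0, index - n)
            passage = " ".join(words[start:index + n + 1])
            results.append((start, passage))
    return results
```

Each returned start index is the number that the question's code prints as the word count position. This does not solve the linking problem, it only isolates the value a link would need to carry.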

trying to create small programs to do simple calculations on data read in from text files

As the title says, I am trying to create small programs to do simple calculations on data read in from text files. But I don't know how to turn the elements from the text file into integers. Any help would be greatly appreciated.
def main():
    f = input('enter the file name')
    # this line open the file and reads the content f + '.txt' is required
    getinfo = open(f +'.txt','r')
    content = getinfo.read()
    num = []
    print('here are the number in your file', num)
    getinfo.close()
main()
If your .txt file is in this format,
1
2
3
4
Then you can use the split function on content like this:
f = input('enter the file name')
# this line open the file and reads the content f + '.txt' is required
getinfo = open(f +'.txt','r')
content = getinfo.read()
num = content.split("\n") # Splits the content by every new line
print('here are the number in your file', num)
getinfo.close()
If you need everything in num to be of type int, then you can convert each element in a for loop like this:
f = input('enter the file name')
# this line open the file and reads the content f + '.txt' is required
getinfo = open(f +'.txt','r')
content = getinfo.read()
num = content.split("\n") # Splits the content by every new line
for i in range(len(num)):
    num[i] = int(num[i])
print('here are the number in your file', num)
getinfo.close()
One thing you need to be careful of, however, is to make sure that your text file contains nothing but numbers; otherwise Python will try to convert something like "c" to an integer, which raises an error.
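A related pitfall is a trailing newline at the end of the file: splitting on "\n" then leaves an empty string as the last element, and int('') raises a ValueError as well. A minimal sketch that skips blank lines, assuming one integer per line (the function name parse_ints is hypothetical):

```python
def parse_ints(content):
    """Convert text with one integer per line into a list of ints."""
    numbers = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            # Skip empty lines, e.g. the one left by a trailing newline
            continue
        numbers.append(int(line))
    return numbers
```

It would be called as num = parse_ints(content) in place of the split and the for loop. Lines that contain anything other than an integer still raise a ValueError.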

File read and write adds extra last number

I wrote a quick and sloppy python script for my dad in order to read in text files from a given folder and replace the top lines with a specific format. My apologies for any mix of pluses (+) and commas (,). The purpose was to replace something like this:
Sounding: BASF CPT-1
Depth: 1.05 meter(s)
with something like this:
Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] = 1.05
I thought I had gotten it all resolved until my dad mentioned that all the text files had the last number repeated in a new line. Here are some examples of output:
output sample links - not enough reputation to post more than 2 links, sorry
Here is my code:
TIME AMPLITUDE
(ms)
#imports
import glob, inspect, os, re
from sys import argv
#work
is_correct = False
succeeded = 0
failed = 0
while not is_correct:
    print "Please type the folder name: "
    folder_name = raw_input()
    full_path = os.path.dirname(os.path.abspath(__file__)) + "\\" + folder_name + "\\"
    print "---------Looking in the following folder: " + full_path
    print "Is this correct? (Y/N)"
    confirm_answer = raw_input()
    if confirm_answer == 'Y':
        is_correct = True
    else:
        is_correct = False
files_list = glob.glob(full_path + "\*.txt")
print "Files found: ", files_list
for file_name in files_list:
    new_header = "Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] ="
    current_file = open(file_name, "r+")
    print "---------Looking at: " + current_file.name
    file_data = current_file.read()
    current_file.close()
    match = re.search("Depth:\W(.+)\Wmeter", file_data)
    if match:
        new_header = new_header + str(match.groups(1)[0]) + "\n"
        print "Depth captured: ", match.groups()
        print "New header to be added: ", new_header
    else:
        print "Match failed!"
    match_replace = re.search("(Sounding.+\s+Depth:.+\s+TIME\s+AMPLITUDE\s+.+\s+) \d", file_data)
    if match_replace:
        print "Replacing text ..."
        text_to_replace = match_replace.group(1)
        print "SANITY CHECK - Text found: ", text_to_replace
        new_data = file_data.replace(text_to_replace, new_header)
        current_file = open(file_name, "r+")
        current_file.write(new_data)
        current_file.close()
        succeeded = succeeded + 1
    else:
        print "Text not found!"
        failed = failed + 1
    # this was added after I noticed the mysterious repeated number (quick fix)
    # why do I need this?
    lines = file(file_name, 'r').readlines()
    del lines[-1]
    file(file_name, 'w').writelines(lines)
print "--------------------------------"
print "RESULTS"
print "--------------------------------"
print "Succeeded: " , succeeded
print "Failed: ", failed
#template -- new_data = file_data.replace("Sounding: BASF CPT-1\nDepth: 29.92 meter(s)\nTIME AMPLITUDE \n(ms)\n\n")
What am I doing wrong exactly? I am not sure why the extra number is being added at the end (as you can see on the "modified text file - broken" link above). I'm sure it is something simple, but I am not seeing it. If you want to replicate the broken output, you just need to comment out these lines:
lines = file(file_name, 'r').readlines()
del lines[-1]
file(file_name, 'w').writelines(lines)
The problem is that, when you go to write your new data to the file, you are opening the file in mode r+, which means "open the file for reading and writing, and start at the beginning". Your code then writes data into the file starting at the beginning. However, your new data is shorter than the data already in the file, and since the file isn't getting truncated, that extra bit of data is left over at the end of the file.
Quick solution: in your if match_replace: section, change this line:
current_file = open(file_name, "r+")
to this:
current_file = open(file_name, "w")
This will open the file in write mode, and will truncate the file before you write to it. I just tested it, and it works fine.
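The difference between the two modes can be reproduced on a throwaway file. This is a minimal sketch in Python 3 (the question's code is Python 2, but the modes behave the same way); the helper names overwrite and demo are hypothetical.

```python
import os
import tempfile


def overwrite(path, new_text, mode):
    """Write new_text to path using the given mode, then return the file's content."""
    with open(path, mode) as handle:
        handle.write(new_text)
    with open(path) as handle:
        return handle.read()


def demo():
    """Return the file content after writing 'abc' over '0123456789' in each mode."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as handle:
            handle.write("0123456789")
        # "r+" starts at the beginning and does not truncate: old bytes remain
        with_r_plus = overwrite(path, "abc", "r+")
        with open(path, "w") as handle:
            handle.write("0123456789")
        # "w" truncates the file before writing
        with_w = overwrite(path, "abc", "w")
        return with_r_plus, with_w
    finally:
        os.remove(path)
```

Calling demo() returns ("abc3456789", "abc"): with r+ the tail of the old, longer content survives, which is the same effect as the leftover number at the end of the text files.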

Python overwriting file instead of appending [duplicate]

I'm creating a personal TV show and movie database and I use Python to get the information of the TV shows and movies. I have a file to get information for movies in a folder, which works fine.
I also have a Python file that gets the information of a TV show (which is a folder, e.g. Game of Thrones) and it also gets all the episode files from inside the folder and gets the information for those (it's formatted like this: e.g. Game of Thrones;3;9)
All this information is stored into 2 text files which MySQL can read: tvshows.txt and episodes.txt.
Python easily gets the information of the TV show in the first part of the program.
The second part of the program is to get each episode in the TV show folder and store the information in a file (episodes.txt):
def seTv(show):
    pat = '/home/ryan/python/tv/'
    pat = pat + show
    epList = os.listdir(pat)
    fileP = "/home/ryan/python/tvtext/episodes.txt"
    f = open(fileP, "w")
    print epList
    hdrs = ['Title', 'Plot', 'imdbRating', 'Season', 'Episode', 'seriesID']
    def searchTvSe(ep):
        ep = str(ep)
        print ep
        seq = ep.split(";")
        print seq
        tit = seq[0]
        seq[0] = seq[0].replace(" ", "+")
        url = "http://www.omdbapi.com/?t=%s&Season=%s&Episode=%s&plot=full&r=json" % (seq[0], seq[1], seq[2])
        respo = u.urlopen(url)
        respo = json.loads(str(respo.read()))
        if not os.path.exists("/var/www/html/images/"+tit):
            os.makedirs("/var/www/html/images/"+tit)
        imgNa = "/var/www/html/images/" + tit + "/" + respo["Title"] + ".jpg";
        for each in hdrs:
            #print respo[each] # ==== This checks to see if it is working, it is =====
            f.write(respo[each] + "\t")
        urllib.urlretrieve(respo["Poster"], imgNa)
    for co, tt in enumerate(epList):
        f.write("\N \t" + str(co) + "\t")
        searchTvSe(tt)
        f.write("\n")
    f.close()
fullTv()
The second part only works once. I have 3 folders inside the tv folder (Game of Thrones, Breaking Bad, The Walking Dead), and inside each of those folders is one episode from the series (Game of Thrones;3;4, Breaking Bad;1;1, The Walking Dead;3;4).
This was working fine before I added 'seriesID' and changed the files (before I had a text file for each folder, which was needed as I had a table for each TV Show).
In episodes.txt, the information for Game of Thrones is the only one that appears. I deleted the Game of Thrones folder and it appears that the final one to be searched is the only one that has been added. It seems to be overwriting it?
Thanks.
Change this line:
f = open(fileP, "w")
To this:
f = open(fileP, "a")
You need to open the file with 'a' instead of 'w':
with open('file.txt', 'a') as myfile:
    myfile.write("Hello world!")
You can find more details in the documentation at https://docs.python.org/2/library/functions.html#open.
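Because seTv opens episodes.txt again on every call, mode "w" wipes what the previous show wrote, while "a" keeps it. A minimal sketch of the appending behaviour, assuming one tab-separated line per episode (the function name append_line is hypothetical):

```python
def append_line(path, fields):
    """Append one tab-separated record to the file at path, keeping existing lines."""
    # "a" creates the file if it is missing and always writes at the end
    with open(path, "a") as handle:
        handle.write("\t".join(fields) + "\n")
```

Calling it once per show leaves one line per call in the file. One consequence of the design: since "a" never clears the file, rerunning the whole script adds the same episodes again unless the file is emptied once at the start.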

loops and replacing in python

I seem to have hit a wall with my script. I'm trying to make it grab the text of a commentary from a website and put in some basic XML tags. It grabs everything on a page and that needs to be fixed, but that's a secondary concern right now. I've gotten the script to split the text into chapters, but I can't figure out how to further divide it into the verses. I'm trying to replace every occurrence of "Verse" in a chapter with </verse><verse name = "n">, with "n" being the verse number. I've tried a few things, including for loops and ElementTree, but it either doesn't work or makes every verse name the same.
I tried putting in the following code, but it never seemed to complete when I try it:
x = "Verse"
for x in para:
    para = para.replace (x, '</verse><verse name = " ' +str(n+1) + ' " >' )
    n = n + 1
The code below seems to be the most...functional that I've managed to make it. Any advice on how I should fix this or what else I might try?
from lxml import html
import requests
name = open("new.txt", "a")
name.write("""<?xml version="1.0"?>""")
name.write("<data>")
n = 0
for i in range(0, 17):
    url_base = "http://www.studylight.org/commentaries/acc/view.cgi?bk=45&ch="
    url_norm = url_base + str(i)
    page = requests.get(url_norm)
    tree = html.fromstring(page.text)
    para = tree.xpath('/html/body/div[2]//table//text()')
    name.write("<chapter name =\"" + str(i) + "\" >")
    para = str(para)
    para = para.replace("&", " ")
    para = para.replace ("Verse", '</verse><verse name = " ' +str(n+1) + ' " >' )
    name.write(str(para))
    name.write("</chapter>")
name.write("</data>")
name.close()
print "done"
You shouldn't be manipulating the text directly; when working with an XHTML document, use XSLT.
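If plain string replacement is kept instead of XSLT, the reason every verse ends up with the same name is that str.replace substitutes all occurrences in a single call, with one fixed replacement string. A minimal sketch that numbers each occurrence separately, using re.sub with a callback and a counter; this is a swapped-in technique, not the XSLT approach suggested above, and the function name number_verses is hypothetical.

```python
import itertools
import re


def number_verses(chapter_text, marker="Verse"):
    """Replace each occurrence of marker with a closing/opening verse tag pair,
    numbered 1, 2, 3, ... in order of appearance."""
    counter = itertools.count(1)

    def replacement(match):
        # Called once per match, so each occurrence gets the next number
        return '</verse><verse name="{}">'.format(next(counter))

    return re.sub(re.escape(marker), replacement, chapter_text)
```

The counter restarts at 1 on every call, so calling the function once per chapter numbers the verses within that chapter. The output still begins with a stray closing tag before the first verse and lacks one after the last, which would need handling separately to get well-formed XML.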
