I seem to have hit a wall with my script. I'm trying to make it grab the text of a commentary from a website and put in some basic XML tags. It grabs everything on a page and that needs to be fixed, but that's a secondary concern right now. I've gotten the script to split the text into chapters, but I can't figure out how to further divide it into the verses. I'm trying to replace every occurrence of "Verse" in a chapter with </verse><verse name = "n">, with "n" being the verse number. I've tried a few things, including for loops and ElementTree, but it either doesn't work or makes every verse name the same.
I tried putting in the following code, but it never seems to complete when I run it:
x = "Verse"
for x in para:
para = para.replace (x, '</verse><verse name = " ' +str(n+1) + ' " >' )
n = n + 1
The code below is the most functional version I've managed so far. Any advice on how to fix this, or what else I might try?
from lxml import html
import requests

name = open("new.txt", "a")
name.write("""<?xml version="1.0"?>""")
name.write("<data>")
n = 0
for i in range(0, 17):
    url_base = "http://www.studylight.org/commentaries/acc/view.cgi?bk=45&ch="
    url_norm = url_base + str(i)
    page = requests.get(url_norm)
    tree = html.fromstring(page.text)
    para = tree.xpath('/html/body/div[2]//table//text()')
    name.write("<chapter name =\"" + str(i) + "\" >")
    para = str(para)
    para = para.replace("&", " ")
    para = para.replace("Verse", '</verse><verse name = " ' + str(n+1) + ' " >')
    name.write(str(para))
    name.write("</chapter>")
name.write("</data>")
name.close()
print "done"
You shouldn't be manipulating the text directly; when transforming an XHTML document, use XSLT.
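Short of a full XSLT pipeline, the incrementing-number part of the question can also be handled with re.sub and a replacement function, so each match gets its own number. A minimal sketch (using made-up sample text, not the scraped page):

import re
import itertools

sample = "Verse In the beginning... Verse And the earth... Verse And God said..."

counter = itertools.count(1)  # yields 1, 2, 3, ...

def number_verse(match):
    # re.sub calls this once per match, so each "Verse" gets the next number
    return '</verse><verse name="{}">'.format(next(counter))

tagged = re.sub(r'\bVerse\b', number_verse, sample)
print(tagged)

Restarting the counter inside the chapter loop is what keeps every verse from ending up with the same name.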
I've been working on a python script that will scrape certain webpages.
The beginning of the script looks like this:
# -*- coding: UTF-8 -*-
import urllib2
import re

database = ''
contents = open('contents.html', 'r')
for line in contents:
    entry = ''
    f = re.search('(?<=a href=")(.+?)(?=\.htm)', line)
    if f:
        entry = f.group(0)
    page = urllib2.urlopen('https://indo-european.info/pokorny-etymological-dictionary/' + entry + '.htm').read()
    m = re.search('English meaning( )+\s+(.+?)</font>', page)
    if m:
        title = m.group(2)
    else:
        title = 'N/A'
This accesses each page and grabs a title from it. Then I have a number of blocks of code that test whether certain text is present in each page; here is an example of one:
abg = re.findall('\babg\b', page)
if len(abg) == 0:
    abg = 'N'
else:
    abg = 'Y'
Then, finally, still in the for loop, I add this information to the variable database:
database += '\n' + str('<F>') + str(entry) + '<TITLE="' + str(title) + '"><FQ="N"><SQ="N"><ABG="' + str(abg) + '"></F>'
Note that I have used str() for each variable because I was getting a "can't concatenate strings and lists" error for some reason.
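For what it's worth, one plausible source of that error (an inference, not stated in the original): re.findall returns a list, and concatenating a list onto a string raises exactly that TypeError, which str() papers over by embedding the list's repr. A tiny illustration:

import re

parts = re.findall(r'\babg\b', 'abg xyz abg')  # -> ['abg', 'abg']
# 'prefix: ' + parts               # TypeError: cannot concatenate 'str' and 'list' objects
result = 'prefix: ' + str(parts)   # works, but embeds "['abg', 'abg']" literally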
Once the for loop is completed, I write the database variable to a file:
f = open('database.txt', 'wb')
f.write(database)
f.close()
When I run this in the command line, it times out or never completes running. Any ideas as to what might be causing the issue?
EDIT: I fixed it. It seems the program was slowed down because I was accumulating every iteration's result in the database variable, and concatenating onto an ever-growing string gets slower as the string grows. All I had to do to fix the issue was move the write into the for loop.
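For reference, a minimal sketch of that fix (the record format is simplified, and entries is a hypothetical stand-in for the scraped data): stream each record to disk as it is built instead of growing one huge string.

entries = ['abad', 'abel', 'abher']  # hypothetical stand-in for the scraped entries

f = open('database.txt', 'w')
for entry in entries:
    record = '<F>' + entry + '<FQ="N"><SQ="N"></F>'
    f.write('\n' + record)  # written immediately; no ever-growing in-memory string
f.close()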
I need a program to divide each PDF page in two (left and right halves). So I made this code, but for some reason it doesn't capture the image for the title. It didn't work when I tried it with other books either.
import os

# Info collected to know the number of pages and the PDF file name
number = int(input("Number of pages: "))
file = input("Name of the file: ")
file = str(file) + ".pdf"
text = open("for_the_latex.txt", "w")

# Write the first part of the LaTeX document
a = "\\documentclass{article}" + "\n"
b = "\\usepackage{pdfpages}" + "\n"
c = "\\begin{document}"
text.write(a)
text.write(b)
text.write(c)

# This is the core of the program:
# it writes an \includepdf line into the text document for each page
for i in range(1, number + 1):
    a = "\\includepdf[pages=" + str(i) + ",trim=0 0 400 0]{" + file + "}" + "\n"
    text.write(a)

# Write the closing part
quatro = "\\end{document}"
text.write(quatro)
text.close()

# Rename to .tex
os.rename("for_the_latex.txt", "divided.tex")

# Run LaTeX
os.system("pdflatex divided.tex")
Where is the error?
I want to divide the PDF in two.
Consider the following minimal document (called example-document.pdf) that contains 6 pages, each exactly split in half by colour and number:
\documentclass{article}
\usepackage[paper=a4paper,landscape]{geometry}
\usepackage{pdfpages}
\begin{document}
% http://mirrors.ctan.org/macros/latex/contrib/mwe/example-image-a4-numbered.pdf
\includepdf[pages={1-2},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={3-4},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={5-6},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={7-8},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={9-10},nup=2]{example-image-a4-numbered.pdf}
\includepdf[pages={11-12},nup=2]{example-image-a4-numbered.pdf}
\end{document}
The idea is to split these back into a 12-page document. Here's the code for LaTeX:
\documentclass{article}
\usepackage[paper=a4paper]{geometry}
\usepackage{pdfpages,pgffor}
\newlength{\pagedim}% To store page dimensions, if necessary
\begin{document}
\foreach \docpage in {1,...,6} {
  \settowidth{\pagedim}{\includegraphics[page=\docpage]{example-document.pdf}}% Establish page width
  \includepdf[pages={\docpage},trim=0 0 .5\pagedim{} 0,clip]{example-document.pdf}% Left half
  \includepdf[pages={\docpage},trim=.5\pagedim{} 0 0 0,clip]{example-document.pdf}% Right half
}
\end{document}
It's not necessary to read in every page and establish its width (stored in \pagedim), but I wasn't sure whether your pages may have differing sizes.
As mentioned in the comment, I'm not quite sure I understand your problem correctly. Your program runs for me, but it includes only the left half of each page of the initial document, so I modified the code a bit.
import os

# Info collected to know the number of pages and the PDF file name
number = int(input("Number of pages: "))
file = input("Name of the file: ")
file = str(file) + ".pdf"
text = open("divided.tex", "w")

# The first part of the LaTeX document
header = '''
\\documentclass{article}
\\usepackage{pdfpages}
\\begin{document}
'''

# This is the core of the program:
# for each page, include the left half, then the right half
middle = ''
for i in range(1, number + 1):
    middle += "\\includepdf[pages={},trim=0 0 400 0]{{{}}}\n".format(i, file)
    middle += "\\includepdf[pages={},trim=400 0 0 0]{{{}}}\n".format(i, file)

# The closing part
quatro = "\\end{document}"

text.write(header)
text.write(middle)
text.write(quatro)
text.close()

# Run LaTeX
os.system("pdflatex divided.tex")
I'm working on a web parser for a webpage containing mathematical constants. I need to replace some characters to get a specific format, but I don't know why: if I print the result, it seems to be fine, but when I open the output file, the replacements made by replace() don't seem to have taken effect.
Here's the code:
#!/usr/bin/env python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.ebyte.it/library/educards/constants/ConstantsOfPhysicsAndMath.html"
soup = BeautifulSoup(urlopen(url).read(), "html5lib")
f = open("ebyteParse-output.txt", "w")
table = soup.find("table", attrs={"class": "grid9"})
rows = table.findAll("tr")
for tr in rows:
    # If it's a category of constants, write it as a comment
    if tr.has_attr("bgcolor"):
        f.write("\n\n# " + tr.find(text=True) + "\n")
        continue
    cols = tr.findAll("td")
    if len(cols) >= 2:
        if cols[0]["class"][0] == "box" or (cols[0]["class"][0] == "boxi" and cols[1]["class"][0] == "boxa"):
            constant = str(cols[0].find(text=True)).replace(" ", "-")
            value = str(cols[1].find(text=True))
            value = value.replace(" ", "").replace("...", "").replace("[", "").replace("]", "")
            print(constant + "\t" + value)
            f.write(constant + "\t" + value)
            f.write("\n")
f.close()
(Screenshots omitted: the print output shows the cleaned format, while the output file does not reflect the replacements.)
Thank you,
Salva
The file I was looking at was cached, so no changes were visible. Thanks for answering.
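For future readers, a related pitfall worth ruling out (not the cause here, which was caching): if the file object is never closed or flushed, a viewer can show stale or truncated contents. A with block guarantees the flush:

with open("ebyteParse-output.txt", "w") as f:
    f.write("speed-of-light\t2.99792458e8\n")  # flushed and closed automatically on exit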
Hey, I've tried for a while and I can't figure out how to identify the name using the soup.find function. The item I'm looking for is identified by "name": — how do I find it when it's buried in something like this? (The text continues above and below.)
,"100002078216989":{"watermark":1488952059387,"action":1488954831234},"100002219436413":{"watermark":1488717577383,"action":1488717619845},"100003348640283":{"watermark":1489154862229,"action":1489158262774},"100004986371453":{"watermark":1489154862229,"action":1489154866065}}],[]],["MDynaTemplate","registerTemplates",[],[{"URLg3i":["MMessageSourceTextTemplate","\u003Cspan
class=\"source mfss
fcg\">[[text]]\u003C/span>"],"DHGslp":["MMessageSourceTextWithLinkTemplate","\u003Cspan
class=\"mfss fcg\">\u003Ca
href=\"[[\u0025UNESCAPED]][[download_href]]\">[[text]]\u003C/a>\u003C/span>"],"vSvEYy":["MReadReceiptTextTemplate","\u003Cspan
class=\"mfss
fcg\">[[text]]\u003C/span>"]}],[]],["MShortProfiles","set",[],["Value",{"id":"Value","name":"Value","firstName":"Value","vanity":"Value","thumbSrc":null
Here is my solution:
def get_name(self, file):
    s = BeautifulSoup(open(file), "lxml")
    for item in s.find("p"):
        print("The base item: \n" + item + "\n")
        item = item.split("name\":\"")
        print("1st split: \n" + item[-1] + "\n")
        item = item[-1].split("\",\"")
        print("2nd split: \n" + item[0] + "\n")
Output:
The base item:
"100002078216989":{"watermark":1488952059387,"action":1488954831234},"100002219436413":{"watermark":1488717577383,"action":1488717619845},"100003348640283":{"watermark":1489154862229,"action":1489158262774},"100004986371453":{"watermark":1489154862229,"action":1489154866065}}],[]],["MDynaTemplate","registerTemplates",[],[{"URLg3i":["MMessageSourceTextTemplate","\u003Cspan class=\"source mfss fcg\">[[text]]\u003C/span>"],"DHGslp":["MMessageSourceTextWithLinkTemplate","\u003Cspan class=\"mfss fcg\">\u003Ca href=\"[[\u0025UNESCAPED]][[download_href]]\">[[text]]\u003C/a>\u003C/span>"],"vSvEYy":["MReadReceiptTextTemplate","\u003Cspan class=\"mfss fcg\">[[text]]\u003C/span>"]}],[]],["MShortProfiles","set",[],["Value",{"id":"Value","name":"Value","firstName":"Value","vanity":"Value","thumbSrc":null
1st split:
Value","firstName":"Value","vanity":"Value","thumbSrc":null
2nd split:
Value
In fact, your HTML file is not well formed, so splitting the raw text like this is the best approach I could find. It should, however, suit your need.
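An alternative sketch (not the answerer's code): since the blob is nearly JSON, a regular expression can pull the value after "name" directly, assuming the text has already been loaded into a string named blob:

import re

blob = '{"id":"Value","name":"Value","firstName":"Value"}'  # hypothetical sample

m = re.search(r'"name":"(.*?)"', blob)
if m:
    print(m.group(1))  # -> Value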
I'm trying to scrape temperatures from a weather site using the following:
import urllib2
from BeautifulSoup import BeautifulSoup

f = open('airport_temp.tsv', 'w')
f.write("Location" + "\t" + "High Temp (F)" + "\t" + "Low Temp (F)" + "\t" + "Mean Humidity" + "\n")

# eventually parse from http://www.wunderground.com/history/airport/\w{4}/2012/\d{2}/1/DailyHistory.html
for x in range(10):
    locationstamp = "Location " + str(x)
    print "Getting data for " + locationstamp
    url = 'http://www.wunderground.com/history/airport/KAPA/2013/3/1/DailyHistory.html'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    location = soup.findAll('h1').text
    locsent = location.split()
    loc = str(locsent[3,6])
    hightemp = soup.findAll('nobr')[6].text
    htemp = hightemp.split()
    ht = str(htemp[1])
    lowtemp = soup.findAll('nobr')[10].text
    ltemp = lowtemp.split()
    lt = str(ltemp[1])
    avghum = soup.findAll('td')[23].text
    f.write(loc + "\t|" + ht + "\t|" + lt + "\t|" + avghum + "\n")
f.close()
Unfortunately, I get an error saying:
Getting data for Location 0
Traceback (most recent call last):
File "airportweather.py", line 18, in <module>
location = soup.findAll('H1').text
AttributeError: 'list' object has no attribute 'text'
I've looked through BS and Python documentation, but am still pretty green, so I couldn't figure it out. Please help this newbie!
The .findAll() method returns a list of matches. If you want a single result, use the .find() method instead. Alternatively, pick out a specific element the way the rest of the code does, or loop over the results:
location = soup.find('h1').text
or
locations = [el.text for el in soup.findAll('h1')]
or
location = soup.findAll('h1')[2].text
This is quite simple: findAll returns a list, so if you are sure there is only one element of interest, then soup.findAll('H1')[0].text should work.
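One further note beyond both answers (an observation about the question's code, not something either answerer wrote): the later line loc = str(locsent[3,6]) will also fail, because indexing a list with a tuple raises a TypeError. A slice joined back into a string is presumably what was intended:

loc = ' '.join(locsent[3:6])  # slice syntax, not a tuple index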