I am using the python-docx package to read a Word .docx file into Python. Everything seems to work okay, except that for some reason I am unable to access the front page/cover page of the document. Oddly, the docx package treats the second page of the Word document as the first page and doesn't seem to know that there is a front page at all!
There are some topics on Stack Overflow which cover this issue, for example: Cannot access first page header (python-docx)
However, my front page doesn't have sections, only text. The one idea I had was to try to identify the specific style of the text on the front page. Notably, it is a 'Heading' style, so to locate it I use code like this:
title_size = max_size = 0
max_size_text = title = None
for p in doc.paragraphs:
    style = p.style
    if style.name == 'Heading':
        title_size = style.font.size.pt
        title = p.text
        print(f" text='{title}'")
        break
    size = style.font.size
    if size is not None:
        if size.pt > max_size:
            max_size = size.pt
            max_size_text = p.text
if title is not None:
    print(f"Title: size={title_size} text='{title}'")
else:
    print(f"max size title: size={max_size} text='{max_size_text}'")
Unfortunately, this method also fails: the code only starts picking up styles from the second page onwards :(
If anyone could help me out I'd greatly appreciate it, thanks!
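One possible explanation worth checking (this is a guess, not something confirmed for your document): cover pages inserted from Word's gallery usually put their title inside a text box or content control, and doc.paragraphs only yields paragraphs that are direct children of the document body, so anything nested that way never shows up. A minimal sketch that walks the underlying XML instead, with report.docx as a hypothetical file name:

from docx import Document
from docx.oxml.ns import qn

doc = Document("report.docx")  # hypothetical file name

# Collect every <w:t> run in the body, including runs nested inside text boxes
# and content controls, which doc.paragraphs skips over.
all_runs = [t.text for t in doc.element.body.iter(qn('w:t')) if t.text]
print(all_runs[:10])  # the cover-page title should appear among the first runs

If the title shows up in that list but not in doc.paragraphs, the cover-page text is sitting in one of those nested containers rather than in an ordinary body paragraph.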
Related
I am trying to get the Last Updated date for a document from Confluence using the API, but have not been able to get it. Can someone point me in the right direction? One suggested solution was to use the requests library along with Beautiful Soup and parse the HTML, but I am looking to do this via an API, and so far I have not had much success.
I am using this:
https://atlassian-python-api.readthedocs.io/confluence.html
and this:
https://github.com/atlassian-api/atlassian-python-api/blob/master/atlassian/confluence.py
I saw the following in the first link I provided:
#Compare content and check is already updated or not
confluence.is_page_content_is_already_updated(page_id, body)
But what I want is to grab the date a document was last updated. This date is shown in our Confluence docs under the label "Last Updated", and it appears at the top of every document.
I see that this has not been answered, so here's my take after some research:
from atlassian import Confluence

# assumes an authenticated client, e.g.:
# confluence = Confluence(url='https://<your-domain>', username='<user>', password='<api-token>')

goOn = True
pages = list()
startAt = 0
while goOn:
    batch = confluence.get_all_pages_from_space('<ConfluenceSpaceName>', start=startAt, limit=100, status=None, expand='title,history.lastUpdated')
    pages.extend(batch)
    startAt = len(pages)
    if len(batch) == 0:
        goOn = False

for p in pages:
    print('Page {}; updated: {}'.format(p['title'], p['history']['lastUpdated']['when']))
Thanks, Kirill
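If you already know the page ID, one call along the same lines should be enough. A small sketch, assuming get_page_by_id with expand is available in your version of atlassian-python-api (it is in current releases):

page = confluence.get_page_by_id(page_id, expand='history.lastUpdated')
print(page['history']['lastUpdated']['when'])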
I am scraping a YouTube page and found some open program code online. The code runs and returns correct results. However, as I work through the code line by line, I find that I cannot locate the attribute in the source code. I searched for it in the page source and in the inspect-element view, and I even copied and pasted the raw code into Word. Nowhere could I find it.
How did this happen?
Code below:
soup = BeautifulSoup(result.text, "lxml")
# cannot find yt-lockup-meta-info anywhere......
view_element = soup.find_all("ul", class_="yt-lockup-meta-info")
totalview = 0
for objects in view_element:
    view_list = obj.findChildren()
    for element in view_list:
        if element.string.endswith("views"):
            videoviews = element.text.replace("views", "").replace(",", "")
            totalview = totalview + int(videoviews)
            print(videoviews)
print("----------------------")
print("Total_Views" + str(totalview))
The attribute I searched for is "yt-lockup-meta-info".
The page source is here.
The original page.
I see a few problems, which I think might be cleared up if I saw the full code. However, there are some things that need to be fixed within this block.
For example, this line should read:
for obj in view_element:
instead of:
for objects in view_element:
Inside the loop you reference "obj", but the loop variable is named "objects", so "obj" is never defined while traversing "view_element".
Also, there is no need to search for the word "views" when there is a class you can search directly.
Here is how I would address this problem. Hope this helps.
import requests
from bs4 import BeautifulSoup

# Go to website and convert page source to Soup
response = requests.get('https://www.youtube.com/results?search_query=web+scraping+youtube')
soup = BeautifulSoup(response.text, 'lxml')

videos = soup.find_all('ytd-video-renderer')  # Find all videos
total_view_count = 0
for video in videos:
    video_meta = video.find('div', {'id': 'metadata'})  # The text under the video title
    view_count_text = video_meta.find_all('span', {'class': 'ytd-video-meta-block'})[0].text.replace('views', '').strip()  # The view counter

    # Converts view count to integer
    if 'K' in view_count_text:
        video_view_count = int(float(view_count_text.split('K')[0]) * 1000)
    elif 'M' in view_count_text:
        video_view_count = int(float(view_count_text.split('M')[0]) * 1000000)
    elif 'B' in view_count_text:
        video_view_count = int(float(view_count_text.split('B')[0]) * 1000000000)
    else:
        video_view_count = int(view_count_text)

    print(video_view_count)
    total_view_count += video_view_count

print(total_view_count)
https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html
This is the page I am trying to parse. It's from a government site, which in my experience are not known for keeping their certificates up to date, so your browser is going to warn you that it is not safe. All I want is this part: http://imgur.com/a/BL14W.
Edit: Sorry for the lack of information. I started asking this question, then got called away at work. It's no excuse, but when I came back it was time to go home, so I just kinda hit submit.
I have already tried doing it more "manually" but apparently not all of the documents came out exactly the same. Here is what I tried:
def table_parser(page):
    file = open(page)
    table = []
    num = 0
    for line in file:
        if 'Grade' in line:
            num += 1
        if num > 0:
            num += 1
        if 3 <= num < 21:
            line = line.rstrip()
            if line != '':
                split_line = line.split(' ')
                split_line = [x for x in split_line if x != '']
                strip_line = split_line[:16]
                table.append(strip_line)
    WG = []
    WL = []
    WS = []
    for l in table:
        WG.append(l[1:6])
        WL.append(l[6:11])
        WS.append(l[11:16])
    file.close()
    # Return 3 lists for the 3 charts I want
    return WG, WL, WS
This is what I used, and it got about half of the 65k files I started with mostly right. I passed the returned lists into CSV writers to store them till I can get them all cleaned up. I know there is probably a better way, but I came up with this before I could wrap my head around BeautifulSoup. I don't necessarily want the code to do this, just pointers on where to start. I tried to find documentation on BeautifulSoup, but I couldn't figure out where to start for what I need.
Your question is a little vague so I'll try my best to help you.
1. Install Beautiful Soup 4
To get a block of text from a webpage, you will need to use the external library BeautifulSoup 4 (BS4). Once it is downloaded and installed, import it with from bs4 import BeautifulSoup and import urllib.request.
2. Download Webpage
Downloading a webpage is simple: just use site_download = urllib.request.urlopen(url). In your case, replace url with the URL you provided here. Then we need to read what we've downloaded using site_read = site_download.read().decode('utf-8'), followed by soup = BeautifulSoup(site_read, "html.parser").
3. Get Block of Text
You can get text in many different ways, so I'll show you a few examples.
To get the text of the first instance of a <p> (paragraph) tag:
text = soup.find("p").getText()
To get the text of all instances of the <p> tag:
texts = [p.getText() for p in soup.findAll("p")]
To get text from a specific class:
text = soup.find(attrs={"class": "class_name_here"}).getText()
4. Further Info
More information on how to get different types of tags and other things you can do with BS4 can be found HERE.
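Putting those steps together for the wage-schedule page, a rough sketch might look like the following. It is guesswork in two places: the certificate workaround assumes the site's HTTPS certificate is still broken as described above, and whether the rates live in a real HTML <table> or in preformatted text is something to confirm against the actual markup.

import ssl
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html'

# skip certificate verification, since the site's certificate triggers browser warnings
context = ssl._create_unverified_context()
site_read = urllib.request.urlopen(url, context=context).read().decode('utf-8', errors='replace')
soup = BeautifulSoup(site_read, 'html.parser')

# if the rates sit in a real HTML table, pull each row's cells
rows = []
for tr in soup.findAll('tr'):
    cells = [td.getText(strip=True) for td in tr.findAll(['td', 'th'])]
    if cells:
        rows.append(cells)

# otherwise the page is likely preformatted text: fall back to splitting lines,
# which can then feed the same WG/WL/WS slicing used in table_parser above
if not rows:
    rows = [line.split() for line in soup.getText().splitlines() if line.strip()]

for row in rows[:5]:
    print(row)

Either way you end up with rows as lists of cell strings, which is a sturdier starting point than splitting raw lines on spaces.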
I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf.
I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it does not extract the hyperlinks.
For example, I have text that says Check this link out, with a link attached to it. I am able to extract the words Check this link out, but what I really need is the hyperlink itself, not the words.
How do I go about doing this? Ideally, I would prefer to do it in Python, but I'm open to doing it in any other language as well.
I have looked at itextsharp, but haven't used it. I'm running on Ubuntu, and would appreciate any help.
A slightly modified version of Ashwin's answer:
import PyPDF2

PDFFile = open("file.pdf", 'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    print("Current Page: {}".format(page))
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if key in pageObject.keys():
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if uri in u[ank].keys():
                print(u[ank][uri])
This is an old question, but it seems a lot of people look at it (including me while trying to answer this question), so I am sharing the answer I came up with. As a side note, it helps a lot to learn how to use the Python debugger (pdb) so you can inspect these objects on-the-fly.
It is possible to get the hyperlinks using PDFMiner. The complication is that (as with so much about PDFs) there is really no relationship between the link annotations and the text of the link, except that they are both located in the same region of the page.
Here is the code I used to get the links on a PDFPage:
annotationList = []
if page.annots:
    for annotation in page.annots.resolve():
        annotationDict = annotation.resolve()
        if str(annotationDict["Subtype"]) != "/Link":
            # Skip over any annotations that are not links
            continue
        position = annotationDict["Rect"]
        uriDict = annotationDict["A"].resolve()
        # This has always been true so far.
        assert str(uriDict["S"]) == "/URI"
        # Some of my URI's have spaces.
        uri = uriDict["URI"].replace(" ", "%20")
        annotationList.append((position, uri))
Then I defined a function like:
def getOverlappingLink(annotationList, element):
    for (x0, y0, x1, y1), url in annotationList:
        if x0 > element.x1 or element.x0 > x1:
            continue
        if y0 > element.y1 or element.y0 > y1:
            continue
        return url
    else:
        return None
which I used to search the annotationList I previously found on the page to see if any hyperlink occupies the same region as an LTTextBoxHorizontal that I was inspecting on the page.
In my case, since PDFMiner was consolidating too much text together in the text box, I walked through the _objs attribute of each text box and looked through all of the LTTextLineHorizontal instances to see if they overlapped any of the annotation positions.
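For concreteness, a rough sketch of that last step (assumptions: layout is the LTPage produced by the usual pdfminer page-interpretation setup, and annotationList and getOverlappingLink are the objects defined above):

from pdfminer.layout import LTTextBoxHorizontal, LTTextLineHorizontal

# layout is the LTPage for the current page; annotationList comes from the first snippet
for element in layout:
    if not isinstance(element, LTTextBoxHorizontal):
        continue
    for line in element._objs:  # the individual text lines inside the box
        if not isinstance(line, LTTextLineHorizontal):
            continue
        url = getOverlappingLink(annotationList, line)
        if url is not None:
            print(line.get_text().strip(), '->', url)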
I think you could do that using pyPdf if you want to extract the links from a PDF. I am not sure where I got this from, but it resides in my code as part of something else. Hope this helps:
import pyPdf

PDFFile = open('File Location', 'rb')
PDF = pyPdf.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'

for page in range(pages):
    pageSliced = PDF.getPage(page)
    pageObject = pageSliced.getObject()
    if pageObject.has_key(key):
        ann = pageObject[key]
        for a in ann:
            u = a.getObject()
            if u[ank].has_key(uri):
                print u[ank][uri]
This, I hope, should give you the links in your PDF.
P.S.: I haven't tested this extensively.
import pikepdf

pdf_file = pikepdf.Pdf.open("pdf.pdf")
urls = []
for page in pdf_file.pages:
    for annot in page.get("/Annots") or []:  # skip pages that have no annotations
        url = annot.get("/A").get("/URI")
        if url is not None:
            urls.append(url)
            urls.append(" ; ")
print(urls)
You will get a semicolon-separated list of the links in the given PDF.
The hyperlink will actually be an annotation, so you need to process the annotation rather than 'extract the text'. I suspect that you are going to need to use a library such as itextsharp, or MuPDF, or Ghostscript if you are really desperate (and comfortable programming in PostScript).
I'd have thought it relatively easy to process the annotations looking for type Link, though.
Here's a version that creates a list of URLs in the simplest way I could find:
import PyPDF2

pdf = PyPDF2.PdfFileReader('filename.pdf')
urls = []
for page in range(pdf.numPages):
    pdfPage = pdf.getPage(page)
    try:
        for item in pdfPage['/Annots']:
            urls.append(item['/A']['/URI'])
    except KeyError:
        pass
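For what it's worth, PyPDF2 has since been merged back into pypdf, and the camelCase calls above (PdfFileReader, getPage, numPages) are deprecated there. A sketch of the same idea against current pypdf, again assuming a file named filename.pdf:

from pypdf import PdfReader

reader = PdfReader('filename.pdf')
urls = []
for page in reader.pages:
    if '/Annots' not in page:
        continue
    for annot in page['/Annots']:  # item access resolves indirect references
        obj = annot.get_object()
        action = obj.get('/A')
        if action is not None and '/URI' in action.get_object():
            urls.append(action.get_object()['/URI'])
print(urls)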
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the URLs also look like this, meaning that there are multiple, separate statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv

input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while end_link != '</a> ':
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if page[end_link:].startswith('</a> '):  # assumed stop condition; the original if had no condition
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links), I thought I could use that fact (having it mark the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch page content here; it's just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with the requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. having found the parent of one child, find all of its children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')

for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran an SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made some million HTTP requests to a web service (and they were OK with that), each fetching about 1 kB. This would have taken a long time and would have been quite inconvenient, requiring some error handling (since some of these requests would always time out) and being non-atomic due to paging. Instead, I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.