Python Tika cannot parse pdf from url

Python Tika cannot parse pdf from url - python

python for parsing the online pdf for future usage. My code are below.
from tika import parser
import requests
import io
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
pdfFile = parser.from_file(open_pdf_file)
print(pdfFile)
However, it shows
AttributeError: '_io.BytesIO' object has no attribute 'decode'
I have taken an example from How can i read a PDF file from inline raw_bytes (not from file)?
In the example, it is using PyPDF2. But I need to use Tika as Tika has a better result than PyPDF2.
Thank you for helping

In order to use tika you will need to have JAVA 8 installed. The code that you'll need to retrieve and print contents of a pdf is as follows:
from tika import parser
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
pdfFile = parser.from_file(url)
print(pdfFile["content"])

Related

How to download Flickr images using photos url (does not contain .jpg, .png, etc.) using Python

I want to download image from Flickr using following type of links using Python:
https://www.flickr.com/photos/66176388#N00/2172469872/
https://www.flickr.com/photos/clairity/798067744/
This data is obtained from xml file given at https://snap.stanford.edu/data/web-flickr.html
Is there any Python script or way to download images automatically.
Thanks.

I try to find answer from other sources and compiled the answer as follows:
import re
from urllib import request
def download(url, save_name):
html = request.urlopen(url).read()
html=html.decode('utf-8')
img_url = re.findall(r'https:[^" \\:]*_b\.jpg', html)[0]
print(img_url)
with open(save_name, "wb") as fp:
fp.write(request.urlopen(img_url).read())
download('https://www.flickr.com/photos/clairity/798067744/sizes/l/', 'image.jpg')

Downloading a xml file using an API URL in python

I am trying to download data which is returned in an xml file from an api with the following url
URL='http://oasis.caiso.com/oasisapi/SingleZip?queryname=PRC_FUEL&fuel_region_id=ALL&startdatetime=20130919T07:00-0000&enddatetime=20130928T07:00-0000&version=1'
When I use the url in my web browser the xml file downloaded looks like this
<?xml version="1.0" encoding="UTF-8"?>
<OASISReport xmlns="http://www.caiso.com/soa/OASISReport_v1.xsd">
<MessageHeader>
<TimeDate>2018-04-06T15:17:51-00:00</TimeDate>
<Source>OASIS</Source>
<Version>v20131201</Version>
</MessageHeader>
<MessagePayload>
<RTO>
<name>CAISO</name>
<REPORT_ITEM>
<REPORT_HEADER>
<SYSTEM>OASIS</SYSTEM>
<TZ>PPT</TZ>
<REPORT>PRC_FUEL</REPORT>
<UOM>US$</UOM>
<INTERVAL>ENDING</INTERVAL>
<SEC_PER_INTERVAL>3600</SEC_PER_INTERVAL>
</REPORT_HEADER>
<REPORT_DATA>
<DATA_ITEM>FUEL_PRC</DATA_ITEM>
<RESOURCE_NAME>CISO</RESOURCE_NAME>
<OPR_DATE>2013-09-19</OPR_DATE>
<INTERVAL_NUM>24</INTERVAL_NUM>
However, when I download using a python script it is something very different.
Python script:
r=requests.get(URL)
r.encoding="UTF-8"
with open ('data.xml','wb') as file:
file.write(r.content)
Downloaded file:
PKEL520130919_20130928_PRC_FUEL_N_20180406_08_44_40_v1.xmlíÝïOÇ¹àïþ+Pt¤ó)ÝåQ*SdJ§_,âlMÿþÇK{Í£yV{í;ï¶vRâ¹XÞÌ<÷=»ûþ×?lü<sy}õ§Ï&O·>Û_½»þöòê»?}öõ¿|þì³?<Ù?}q~rþzþÓõûÛ»ÿÇÕÍ>ûþöö§/67ùå§ï..o®¾»þqóæúbsáï}ûóäé¿n¾ýìî+¼ßÜ\|7ÿëüâÛùû»_¿¹üqþòâv~0Ý<û|kûóÝ7/¶·¿ØÞú|këýÍ?þ'ûç×ÿ|ÿn~ðákïoþþ«'ûûíO~ðóÝWMîþcóão=Ùßü÷æï¿>»øõëoï~ãõÓ»ÿ¼ºøq~pøâäütóÃÿ¾ûGg§¯ß¼=ysôêÓ¯þzôâåÑëû?Ìÿßÿß~u·¤¿½¹ûcÿýÿÏÁÙëÃ·ùúèËýÍßãÉþ×§¯¾>ÿ¯ýÍûÿñdÿä«7G¯ÿöâË£¯^|u¼¿ùÇoÜýß½~ûÇoÍvï]þã·|üòþ¿ÿúå7/î~uÿ_¿æþóöîOµ¿ùé÷îÿîóÓ¯_½ýêÅ«£Ãïî5pöúþËÝÇfo=ÿ|ò|óßü´·_}ýê`ºýi!~cá¯¿yq÷';~õæ¯4Ýz³µûÅïúÇïýÿów/|;«ÿø{|ïÝ«åÅ_l?Ý¹ûÃýö¿?Áýµj¶Ym§í1÷ubDÔ&ÏT©ÝgPFÕ[tµÚó¨~BK®Ul
#-ô¯Ò¢«Õ&PÛª=¶èjµéÔv£j-ºZm6µ½¨Úc®VÛÚ³¨Úc®V{lÃ·NjÏ£jM»Üû/0]îVéLuÿp¦DO®ºm§Iôxðèª«Ùp<DÏ®ºm:óÁ$z#xtÕÕlC8 L¢'GW]Í6Â$zDxtÕÕlC8"L¢gGW]=Ä³-tH(ºmÏ¶Ð)¡´êj¶!<Û¦¡SBiÕÕlCx¶MC§Òª«Ù0ÿN ¥UW³¥#>þòaÂÛiÞ{v|4óÞU°u\&¨uÁ%¨uÁ%¨uÁ%¨uÁ%¨uÁ%¨uÁ%¨uÁeì×:à2Ø:à2Ø:à2Ø:à2Ø:à2Ø:à2Ø:à2Ø:à2Ø:à2Ø:à2áFplFplkÁÃ«vlÑìÅ¸46æ½ßòóÃ£É®¥üð$¨Ñ~VX5:¸dÕèU£¬½bÕèÊ«f÷ó\6úè²Ñ# ¹lô¸e³û.%¹ltpé²Ñ1¹ËF2\6ºä²Ñ3®7ºltÖe£«Û,}Oàqµ1ï}ø2t]sº¬A·5]5yêªÉPWMÞºjòÔUw ¬¼eÑäz&·óX4¹Ç¢ÉÝ<M®æ±hr3Ey,ÜËcÑèZ«&·ò\5¹çªÉ<WM®ä±jt#ÏUy®Üz\mÌu_/}/dI?= lòÔeÏ®<pÕäÉ«&Y]5yïªÉÑ«&§®»jr÷ÂU£{>0\*Ùä#Ì&×ea6¹
³ÉÓVM¾u³É×N`69HÙäÔÒe£#rMîcÀlrù§À6æ½_cápjrnÉ¢É&ço,¿±hrúÆ¢Éá&go,½±htòæªÉÁ«&çn®»¹jrêæªÉ¡«&gn®¹¹jrâæªÉ«Fçm®·¹jrÚæªÉê\µhÌB¼YÍë.|ÇOÎO+^÷ÿK±Þ½òÛf¬÷_`å Ú´ß/
I am assuming it's an encoding issue, but I am struggling with the solution.
Thanks in advance for your help!

This should help. The url you mentioned gives a zip. You can download that and extract it to get your XML.
EX:
import requests
import zipfile
import StringIO
URL='http://oasis.caiso.com/oasisapi/SingleZip?queryname=PRC_FUEL&fuel_region_id=ALL&startdatetime=20130919T07:00-0000&enddatetime=20130928T07:00-0000&version=1'
r = requests.get(URL, stream=True)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
z.extractall()

You could also use urllib
import urllib
urllib.urlretrieve(
"http://oasis.caiso.com/oasisapi/SingleZip?queryname=PRC_FUEL&fuel_region_id=ALL&startdatetime=20130919T07:00-0000&enddatetime=20130928T07:00-0000&version=1",
"oasis.zip")

unable to download pdf from this particular url using python

I have tried downloading a .pdf file using the following code but I can't open the downloaded file, it shows pdf error. I also tried doing the same with urllib2, requests none of them helped. Please help in resolving this.
import urllib
import os
pdf_link = "https://www.indeed.com/resumes/account/login?dest=%2Fr%2F23c59475ad19d393/pdf"
pdf_file = "sample.pdf"
response = urllib.urlopen(pdf_link)
file = open(pdf_file, 'wb')
file.write(response.read())
file.close()

Downloading a song through python-requests

I was trying to make a script to download songs from internet. I was first trying to download the song by using "requests" library. But I was unable to play the song. Then, I did the same using "urllib2" library and I was able to play the song this time.
Can't we use "requests" library to download songs? If yes, how?
Code by using requests:
import requests
doc = requests.get("http://gaana99.com/fileDownload/Songs/0/28768.mp3")
f = open("movie.mp3","wb")
f.write(doc.text)
f.close()
Code by using urllib2:
import urllib2
mp3file = urllib2.urlopen("http://gaana99.com/fileDownload/Songs/0/28768.mp3")
output = open('test.mp3','wb')
output.write(mp3file.read())
output.close()

Use doc.content to save binary data:
import requests
doc = requests.get('http://gaana99.com/fileDownload/Songs/0/28768.mp3')
with open('movie.mp3', 'wb') as f:
f.write(doc.content)
Explanation
A MP3 file is only binary data, you cannot retrieve its textual part. When you deal with plain text, doc.text is ideal, but for any other binary format, you have to access bytes with doc.content.
You can check the used encoding, when you get a plain text response, doc.encoding is set, else it is empty:
>>> doc = requests.get('http://gaana99.com/fileDownload/Songs/0/28768.mp3')
>>> doc.encoding
# nothing
>>> doc = requests.get('http://www.example.org')
>>> doc.encoding
ISO-8859-1

A similar way from here:
import urllib.request
urllib.request.urlretrieve('http://gaana99.com/fileDownload/Songs/0/28768.mp3', 'movie.mp3')

HTML to PDF conversion in app engine python

My website has a lot of dynamically generated HTML content and I would like to give my website users a way to save the data in PDF format. Any ideas on how it can be done? I tried xhtml2pdf library but I couldn't get it to work. I even tried reportlibrary but we have to enter the PDF details manually. Is there any library which converts HTML content to PDF and works on app engine?

You need to copy all dependencies into your GAE project folder:
html5lib
reportlab
six
xhtml2pdf
Then you can use it like this:
from xhtml2pdf import pisa
from cStringIO import StringIO
content = StringIO('html goes here')
output = StringIO()
pisa.log.setLevel('DEBUG')
pdf = pisa.CreatePDF(content, output, encoding='utf-8')
pdf_data = pdf.dest.getvalue()
Some useful info that I googled just for you:
http://www.prahladyeri.com/2013/11/how-to-generate-pdf-in-python-for-google-app-engine/
https://github.com/danimajo/pineapple_pdf

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Tika cannot parse pdf from url - python

Related

How to download Flickr images using photos url (does not contain .jpg, .png, etc.) using Python

Downloading a xml file using an API URL in python

unable to download pdf from this particular url using python

Downloading a song through python-requests

HTML to PDF conversion in app engine python

Categories

Resources