Extract file from gzip folder - python

I am trying to extract the XML file from the gzip archive that is downloaded by clicking the "SEC Investment Adviser Report" button on the SEC website. Below is my (minimal) code. I keep getting "embedded null character" or "embedded null byte", depending on whether I pass .text or .content from my request to gzip.open(). Can anyone help me get this file loaded so I can access the XML?
import requests
import gzip
file = gzip.open(requests.get(r'https://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=39545&FeedType=IA_FIRM_SEC').text,'rt')

gzip.open takes a filename or a file object, not compressed data. To decompress bytes that are already in memory, use gzip.decompress.
The archive from your question looks malformed. Specifically, it has HTML appended after the gzip data for some reason.
The following works by only using the content before the beginning of the HTML:
import requests
import gzip
request = requests.get(r'https://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=39545&FeedType=IA_FIRM_SEC')
xml = gzip.decompress(request.content[:request.content.find(b"\r\n\r\n<!DOCTYPE html>") - 1])
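An alternative that does not depend on searching for the HTML marker is sketched below. It swaps in a different technique from the answer above: zlib's decompressor object stops at the end of the gzip stream and exposes whatever follows as unused_data. The function name and the sample payload are made up for illustration, and the sketch only handles a single gzip member at the start of the data.

```python
import gzip
import zlib


def decompress_first_gzip_member(payload: bytes) -> tuple[bytes, bytes]:
    """Decompress the gzip stream at the start of `payload`.

    Returns (decompressed_data, trailing_bytes). Anything that follows the
    end of the gzip stream, such as appended HTML, ends up in trailing_bytes.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    data = decompressor.decompress(payload)
    data += decompressor.flush()
    return data, decompressor.unused_data


# Synthetic stand-in for the download: gzip data followed by an HTML page.
sample = gzip.compress(b"<Firms></Firms>") + b"\r\n\r\n<!DOCTYPE html><html></html>"
xml_bytes, leftover = decompress_first_gzip_member(sample)
```

With a real download you would pass the response's .content instead of sample; a non-empty leftover then tells you that something was appended to the archive.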

Related

Search for a word in webpage and save to TXT in Python

I am trying to load links from a .txt file, search each page for a specific word, and, if the word exists on that page, save the link to another .txt file. But I am getting this error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links already start with https://
The code:
import requests
list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)
for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you call res = requests.get(list_of_pages), you pass the open file object list_of_pages to requests.get, which expects a URL string (e.g. http://localhost:8080/static/image01.jpg). The error message shows the file object's text representation, <_io.TextIOWrapper name='import.txt' ...>, where a URL should be. The URL you want is the loop variable page_link, with its trailing newline stripped, so pass that to requests.get instead.
If you want the contents of import.txt itself, you don't need an HTTP request at all: drop the requests.get() call and read list_of_pages like a normal, local file.
Or, going the other way, if import.txt is hosted on a web server, don't open it with open(); pass the URL of that file to requests.get as a string.
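A minimal sketch of the loop with each line treated as a URL is shown below. The function names are hypothetical, and the fetching step is passed in as a parameter so the filtering logic can be tried without network access; fetch_with_requests assumes the requests package is installed.

```python
def pages_containing(lines, word, fetch_text):
    """Return the URLs from `lines` whose page text contains `word`."""
    matches = []
    for line in lines:
        url = line.strip()  # drop the trailing newline read from the file
        if not url:
            continue  # skip blank lines
        if word in fetch_text(url):
            matches.append(url)
    return matches


def fetch_with_requests(url):
    import requests  # third-party; imported lazily so pages_containing has no hard dependency
    return requests.get(url, timeout=10).text


def main():
    # File names and the search word are taken from the question.
    with open("import.txt") as source, open("output.txt", "w") as save:
        for url in pages_containing(source, "Word", fetch_with_requests):
            save.write(url + "\n")
```

Calling main() reads import.txt line by line and writes the matching links to output.txt. Opening the output file in "w" mode empties it first, which replaces the separate truncate(0) call.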

Download excel file using python

I have a web link that downloads an Excel file directly. It opens a page saying "your file is downloading" and then starts downloading the file.
Is there any way I can automate this using the requests module?
I am able to do it with Selenium, but I want it to run in the background, so I was wondering whether I can use the requests module instead.
I have tried requests.get, but it only returns the page text, i.e. "your file is downloading", and I am not able to get the file itself.
This Python 3 code downloads a file from the web into memory:
import requests
from io import BytesIO
url = 'your.link/path'
def get_file_data(url):
    response = requests.get(url)
    f = BytesIO()
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)
    f.seek(0)
    return f
data = get_file_data(url)
You can use the following code to read the Excel file:
import pandas as pd
xlsx = pd.read_excel(data, skiprows=0)
print(xlsx)
It sounds like you don't actually have a direct URL to the file, and instead need to engage with some javascript. Perhaps there is an underlying network call that you can find by inspecting the page traffic in your browser that shows a direct URL for downloading the file. With that you can actually just read the excel file URL directly with pandas:
import pandas as pd
url = "https://example.com/some_file.xlsx"
df = pd.read_excel(url)
print(df)
This is nice and tidy, but if you really want to use requests (or avoid pandas) you can download the raw file content as shown in this answer and then use the pyexcel_xlsx package's get_xlsx function to read it without any pandas involvement.
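To tell whether a given URL returned the spreadsheet or just the "your file is downloading" page, one option is to look at the first bytes of the response body. The sketch below is an illustration only: the function name is made up, and the check relies on the well-known file signatures for ZIP-based .xlsx files and legacy .xls files.

```python
XLSX_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP archives (so is any other ZIP file)
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls (OLE2 container)


def describe_download(body: bytes) -> str:
    """Make a rough guess at what a downloaded body contains."""
    if body.startswith(XLSX_MAGIC):
        return "xlsx"
    if body.startswith(XLS_MAGIC):
        return "xls"
    if body.lstrip()[:1] == b"<":
        return "html"  # most likely an intermediate page, not the file
    return "unknown"
```

If describe_download(response.content) reports "html", the URL is probably not the direct file link, and the real download request still has to be found in the browser's network traffic.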

POST XML file with requests

I'm getting:
<error>You have an error in your XML syntax...
when I run this python script I just wrote (I'm a newbie)
import requests
xml = """xxx.xml"""
headers = {'Content-Type':'text/xml'}
r = requests.post('https://example.com/serverxml.asp', data=xml)
print (r.content);
Here is the content of the xxx.xml
<xml>
<API>4.0</API>
<action>login</action>
<password>xxxx</password>
<license_number>xxxxx</license_number>
<username>xxx#xyz.com</username>
<training>1</training>
</xml>
I know that the xml is valid because I use the same xml for a perl script and the contents are being printed back.
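Any help will be greatly appreciated, as I am very new to Python.
You want to send the XML data from a file with requests.post, but this function will not open a file for you. A string passed as data is sent as the request body as-is, so the server received the literal text "xxx.xml" instead of the file's contents. Open the file yourself and pass the file object (or its contents) to requests.post. Note also that your code builds a headers dictionary but never passes it to requests.post.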
Try this:
import requests
# Set the name of the XML file.
xml_file = "xxx.xml"
headers = {'Content-Type': 'text/xml'}
# Open the XML file in binary mode so its bytes are sent unchanged.
with open(xml_file, 'rb') as xml:
    # Give the object representing the XML file to requests.post.
    r = requests.post('https://example.com/serverxml.asp', data=xml, headers=headers)
print(r.content)
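A mistake like this can be caught before anything is sent by checking locally that the payload parses as XML. The sketch below uses only the standard library; the helper names are hypothetical, and passing this check says nothing about whether the server accepts the document's structure.

```python
import xml.etree.ElementTree as ET


def is_well_formed(payload):
    """Return True if `payload` (str or bytes) parses as XML."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


def load_xml_payload(path):
    """Read `path` as bytes and refuse to return it unless it parses as XML."""
    with open(path, "rb") as handle:
        payload = handle.read()
    if not is_well_formed(payload):
        raise ValueError("%s does not contain well-formed XML" % path)
    return payload
```

Here is_well_formed("xxx.xml") returns False, which is the situation in the question: the file name, not the file's contents, was about to be posted.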

Image scraped as HTML page with urlretrieve

I'm trying to scrape this image using urllib.urlretrieve.
>>> import urllib
>>> urllib.urlretrieve('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg',
path) # path was previously defined
This code successfully saves the file in the given path. However, when I try to open the file, I get:
Could not load image 'imagename.jpg':
Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)
When I run file imagename.jpg in my bash terminal, I get imagename.jpg: HTML document, ASCII text.
So how do I scrape this image as a JPEG file?
It's because the owner of the server hosting that image is deliberately blocking access from Python's urllib. That's why it's working with requests. You can also do it with pure Python, but you'll have to give it an HTTP User-Agent header that makes it look like something other than urllib. For example:
import urllib2
req = urllib2.Request('http://i9.mangareader.net/one-piece/3/one-piece-1668214.jpg')
req.add_header('User-Agent', 'Feneric Was Here')
resp = urllib2.urlopen(req)
imgdata = resp.read()
with open(path, 'wb') as outfile:
    outfile.write(imgdata)
So it's a little more involved to get around, but still not too bad.
Note that the site owner probably did this because some people had gotten abusive. Please don't be one of them! With great power comes great responsibility, and all that.
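The code above is Python 2 (urllib2). A rough Python 3 equivalent using urllib.request might look like the sketch below. The function names and the default User-Agent string are placeholders chosen for illustration, and the JPEG check simply looks for the FF D8 bytes that JPEG files start with.

```python
import urllib.request


def build_image_request(url, user_agent="Mozilla/5.0 (compatible; example-fetcher)"):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def looks_like_jpeg(data):
    # JPEG files begin with the bytes FF D8; an HTML page begins with "<" (0x3c).
    return data[:2] == b"\xff\xd8"


def download_image(url, path):
    """Fetch `url` and save it to `path`, refusing to save non-JPEG data."""
    with urllib.request.urlopen(build_image_request(url)) as resp:
        data = resp.read()
    if not looks_like_jpeg(data):
        raise ValueError("server did not return JPEG data")
    with open(path, "wb") as outfile:
        outfile.write(data)
```

The check in download_image would have flagged the file from the question right away, since it starts with 0x3c 0x21 ("<!") rather than FF D8.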

Automatically Downloading Files from an ASPX Website in Python

This is my first time posting on stack overflow, and I look forward to getting more involved with the community. I need to download, rename, and save many Excel files from an ASPX website, but I cannot access these Excel files directly via a URL (i.e., the URL does not end with "excelfilename.csv"). What I can do is go to a URL which initiates the download of the file. An example of the URL is below.
https://www.websitename.com/something/ASPXthing.aspx?ReportName=ExcelFileName&Date=SomeDate&reportformat=csv
The inputs that I want to vary via loops are "ExcelFileName" and "SomeDate". I know one can fetch these files with urllib when the Excel files can be accessed directly via a URL, but how can I do it with a URL like this one?
Thanks in advance for helping out!
Using the requests library, you can fetch each file and write it to disk in chunks:
import requests
report_names = ["Filename1", "Filename2"]
dates = ['2016-02-22', '2016-02-23']  # as strings
for report_name in report_names:
    for date in dates:
        response = requests.get('https://www.websitename.com/something/ASPXthing.aspx?ReportName=%s&Date=%s&reportformat=csv' % (report_name, date), stream=True)
        if not response.ok:
            # Something went wrong; skip this report rather than save an error page.
            continue
        # Only create the output file once the request has succeeded.
        with open('%s_%s_fetched.csv' % (report_name.split('.')[0], date), 'wb') as handle:
            for block in response.iter_content(1024):
                handle.write(block)
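If report names or dates can contain spaces or other special characters, it may be safer to let the standard library build the query string. The sketch below only constructs the URL and output file name for each combination; the helper name and the file-name pattern are assumptions, and the base URL is the placeholder from the question.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder address taken from the question.
BASE_URL = "https://www.websitename.com/something/ASPXthing.aspx"


def report_jobs(report_names, dates):
    """Yield (url, filename) for every report/date combination."""
    for report_name, date in product(report_names, dates):
        # urlencode escapes characters that are not safe in a query string.
        query = urlencode({"ReportName": report_name, "Date": date, "reportformat": "csv"})
        yield "%s?%s" % (BASE_URL, query), "%s_%s.csv" % (report_name, date)


jobs = list(report_jobs(["Filename1", "Filename2"], ["2016-02-22", "2016-02-23"]))
```

Each (url, filename) pair in jobs can then be handed to a download loop like the one above.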
