Downloading CSV files from a specific site using Python

Goal: I want to automate the download of various .csv files from https://wyniki.tge.pl/en/wyniki/archiwum/2/?date_to=2018-03-21&date_from=2018-02-19&data_scope=contract&market=rtee&data_period=3 using Python (this is not the main issue though).
Specifics: in particular, I am trying to download the csv file for the "Settlement price" and "BASE Year".
Problem: when I look at the source code for this web page, I see references to the "Upload" button, but I don't see any reference to the csv file (to be fair, I am not very good at reading source code). Since I am using Python (urllib), I need to know the URL of the csv file, but I don't know how to get it.
This is not a question about Python per se, but about how to find the URL of a .csv file that can be downloaded from a web page. Hence, no code is provided.

If you inspect the source code of that webpage, you will see that the form used to obtain the csv file has three main inputs:
file_type
fields
contracts
So, to obtain the csv file for the "Settlement price" and "BASE Year", you would simply do a POST request to that same URL, passing these as the payload:
file_type=2&fields=4&contracts=4
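
A minimal sketch of that POST request using the requests library (the output filename is illustrative; the payload values are the ones listed above):

import requests

url = ('https://wyniki.tge.pl/en/wyniki/archiwum/2/'
       '?date_to=2018-03-21&date_from=2018-02-19'
       '&data_scope=contract&market=rtee&data_period=3')

# Form inputs described above; other contracts or fields would use
# different values.
payload = {'file_type': 2, 'fields': 4, 'contracts': 4}

response = requests.post(url, data=payload)
response.raise_for_status()

with open('settlement_price_base_year.csv', 'wb') as f:  # illustrative name
    f.write(response.content)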

I would recommend using wget with Python. wget is a tool for downloading any file. Once you have downloaded the file with wget, you can manipulate the csv using another library.
I found this wget library for Python:
https://pypi.python.org/pypi/wget
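
A minimal sketch using that library (the URL is a placeholder):

import wget

# wget.download fetches the file and returns the local filename.
url = 'https://example.com/data.csv'  # placeholder URL
filename = wget.download(url)
print(filename)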

Related

Getting file from URL that triggers a download in Python

I have a URL in a web analytics reporting platform that triggers a download/export of the report you're looking at. The downloaded file itself is a CSV, and the link that triggers the download uses several attached parameters to define things like the fields in the report. What I am looking to do is download the CSV that the link triggers.
I'm using Python 3.6, and I've been told that the server I'll be deploying on does not support Selenium or any webkits like PhantomJS. Has anyone successfully accomplished this?
If the file is a CSV file, you might want to consider downloading its content directly using the requests module, something like this:
import requests

session = requests.Session()
information = session.get(url)  # url is the link of the page here
Then you can decode the information and read the contents as you wish using the csv module, something like this:

import csv

decoded_information = information.content.decode('utf-8')
data = decoded_information.splitlines()
data = csv.DictReader(data)
You can use a for loop to access each row in the data as you wish using the column headings as dictionary keys like so:
for row in data:
    itemdate = row['Date']
    ...
Or you can save the decoded contents by writing them to a file with something like this:
decoded_information = information.content.decode('utf-8')
file = open("filename.csv", "w")
file.write(decoded_information)
file.close()  # note the parentheses; file.close without them does nothing
A couple of links with documentation on the csv module are provided here just in case you haven't used it before:
https://docs.python.org/2/library/csv.html
http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Hope this helps!

Python downloading PDF with urllib2 creates corrupt document

Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:
import urllib2

def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.
I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.
When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused, as it seems to me that all that has changed is the formatting of the URLs; but something else must have changed to cause this? Any help would be hugely appreciated.
The problem is that the new URL points to a webpage, not the original PDF. If you print the value of "expdf", you'll get a bunch of HTML, not the binary data you're expecting.
I was able to get your original function working with a small tweak: I used the requests library to download the file instead of urllib2. requests appears to pull the file with the loader referenced in the HTML you're getting from your current implementation. Try this:
import requests

def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
requests isn't part of the standard library, so if you don't already have it you'll need to pip install requests (on Python 2 or 3).

Downloaded webpage looks different than the original webpage

I am thinking of downloading cplusplus.com's C library reference using Python. I want to download it completely and then convert it into a linked document like the Python documentation. This is my initial attempt at downloading the front page.
#! python3
import urllib.request

filehandle = urllib.request.urlopen('http://www.cplusplus.com/reference/clibrary/')
with open('test.html', 'w+b') as f:
    for line in filehandle:
        f.write(line)
filehandle.close()
The front page downloads correctly, but it looks quite different from the original webpage. By different I mean that the nice formatting of the original webpage is gone after running the script.
What's the reason for this?
This occurs because your scraped version doesn't include the Cascading Style Sheets (CSS) linked to by the page. It also won't include any images or JavaScript the page links to. If you want to obtain the linked files, you'll have to parse the source code you scrape for them.
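
As a minimal standard-library sketch of that parsing step, the following collects the stylesheet URLs the page links to (downloading each file is left out):

from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class StylesheetCollector(HTMLParser):
    # Collects href values from <link rel="stylesheet"> tags.
    def __init__(self):
        super().__init__()
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.stylesheets.append(attrs.get('href'))

base = 'http://www.cplusplus.com/reference/clibrary/'
html = urllib.request.urlopen(base).read().decode('utf-8', errors='replace')

collector = StylesheetCollector()
collector.feed(html)
for href in collector.stylesheets:
    print(urljoin(base, href))  # absolute URLs of the linked CSS files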

Downloading public files in Google Drive (Python)

Suppose that someone gives me a link that enables me to download a public file in Google Drive.
I want to write a program that can read the link and then download it as a text file.
For example, https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit is one of files in my Google Drive.
Everyone can access this file.
But how can I write a Python program that downloads the text file given the above link?
Could someone share some sample code for this?
It seems that the Google Drive SDK could be useful(?), but is there any way to do it without using the SDK?
First you need to write a program that slices the file id out of the link you have.
For example, in the link that you gave:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit
the id is 1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU
Save it in some variable, say file_id.
To get the download link, substitute that id into:
https://docs.google.com/uc?export=download&id=<file_id>
This link will download the file.
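
A short sketch of that slicing step (the output filename is illustrative; as noted below, this link may not work for every document):

import urllib.request

share_url = ('https://docs.google.com/document/d/'
             '1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit')

# The id sits between '/d/' and the next '/'.
file_id = share_url.split('/d/')[1].split('/')[0]
download_url = 'https://docs.google.com/uc?export=download&id=' + file_id

urllib.request.urlretrieve(download_url, 'downloaded_file')  # illustrative name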
If the above answer doesn't work for you, use the following links.
To save as a .txt file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=txt
To save as a .docx file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=docx
Generally, the trick is to replace edit with export?format=txt at the end of the URL. Hope it helps!
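
The same trick in code, assuming an /edit link like the one above:

import urllib.request

share_url = ('https://docs.google.com/document/d/'
             '1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit')

# Swap the /edit suffix for an export in the desired format.
txt_url = share_url.replace('/edit', '/export?format=txt')
urllib.request.urlretrieve(txt_url, 'document.txt')  # illustrative filename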

Convert .pages to .doc or .pdf in Python

How does one convert a .pages file to a .doc or .pdf file using Python? My use case is basically:
User uploads a .pages file to my service
My service converts the .pages to a .pdf
The .pdf is rendered in browser using a browser-based .pdf viewer
I've never done it, but it appears the .pages file already contains a pdf version if you unzip the file: http://blog.cleverly.com/
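
A sketch along those lines, assuming the .pages bundle is a zip archive containing a PDF preview (the member path varies by Pages version, so this searches for any .pdf entry):

import zipfile

with zipfile.ZipFile('document.pages') as bundle:  # illustrative filename
    # Look for a PDF inside the bundle, e.g. QuickLook/Preview.pdf.
    pdf_members = [name for name in bundle.namelist()
                   if name.lower().endswith('.pdf')]
    if pdf_members:
        with open('document.pdf', 'wb') as out:
            out.write(bundle.read(pdf_members[0]))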
A complete native solution in Python will be difficult.
A more appropriate approach would be to look at how you can automate Pages to export the file as PDF or MS Word.
For that, there seems to be an available solution:
pyobjc
There is an example that automates Pages using pyobjc: http://www.mugginsoft.com/kosmictask/help/automation-python
