I have a URL in a web analytics reporting platform that triggers a download/export of the report you're viewing. The downloaded file is a CSV, and the link that triggers the download carries several query parameters that define things like the fields in the report. What I'm looking to do is download that CSV programmatically.
I'm using Python 3.6, and I've been told that the server I'll be deploying on does not support Selenium or any WebKit-based headless browsers such as PhantomJS. Has anyone successfully accomplished this?
If the file is a CSV file, you might want to consider downloading its content directly with the requests module, something like this:
import requests

session = requests.Session()
information = session.get(url)  # url is the download link, including its query parameters
Then you can decode the response and read the contents as you wish with the csv module (which needs to be imported), something like this:
import csv

decoded_information = information.content.decode('utf-8')
data = decoded_information.splitlines()
data = csv.DictReader(data)
You can then use a for loop to access each row in the data, using the column headings as dictionary keys, like so:
for row in data:
    itemdate = row['Date']
    ...
Or you can save the decoded contents by writing them to a file with something like this:
decoded_information = information.content.decode('utf-8')
with open("filename.csv", "w") as file:
    file.write(decoded_information)
A couple of links with documentation on the csv module are provided here, just in case you haven't used it before:
https://docs.python.org/3/library/csv.html
http://www.pythonforbeginners.com/systems-programming/using-the-csv-module-in-python/
Hope this helps!
Related
I need to upload a few CSV files somewhere on the internet, so that I can use them in Jupyter later with read_csv.
What would be some easy ways to do this?
The CSV contains a database. I want to upload it somewhere and use it in Jupyter using read_csv so that other people can run the code when I send them my file.
Since the CSV contains a database, I would not suggest uploading it to GitHub as Steven K mentioned in the previous answer. It would be a better option to upload it to either Google Drive or Dropbox, as also rightly suggested there.
To read the file from Google Drive, you could try the following:
Upload the file to Google Drive, click on "Get Shareable Link", and
ensure that anybody with the link can access it.
Copy the link and note the file ID embedded in it.
Ex: If this is the URL https://drive.google.com/file/d/108ARMaD-pUJRmT9wbXfavr2wM0Op78mX/view?usp=sharing then 108ARMaD-pUJRmT9wbXfavr2wM0Op78mX is the file ID.
Simply use the file ID in the following sample code
import pandas as pd
gdrive_file_id = '108ARMaD-pUJRmT9wbXfavr2wM0Op78mX'
data = pd.read_csv(f'https://docs.google.com/uc?id={gdrive_file_id}&export=download', encoding='ISO-8859-1')
Here you are opening up the CSV to anybody with access to the link. A better and more controlled approach would be to share the access with known people and use a library like PyDrive which is a wrapper around Google API's official Python client.
NOTE: Since your question does not mention the version of Python that you are using, I've assumed Python 3.6+ and used an f-string in the read_csv call. If you use any version of Python before 3.6, you will have to use the str.format method to substitute the value of the variable into the string.
You could use any cloud storage provider like Dropbox or Google Drive. Alternatively, you could use Github.
To do this in your notebook, import pandas and read_csv like you normally would for a local file.
import pandas as pd
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c = pd.read_csv(url)
Goal: I want to automate the download of various .csv files from https://wyniki.tge.pl/en/wyniki/archiwum/2/?date_to=2018-03-21&date_from=2018-02-19&data_scope=contract&market=rtee&data_period=3 using Python (this is not the main issue, though)
Specifics: in particular, I am trying to download the csv file for the "Settlement price" and "BASE Year"
Problem: when I look at the source code for this web page, I see references to the "Upload" button, but I don't see references to the csv file (to be fair, I am not very good at reading source code). Since I am using Python (urllib), I need to know the URL of the csv file, but I don't know how to find it.
This is not a question of Python per se, but about how to find the URL of some .csv that can be downloaded from a web page. Hence, no code is provided.
If you inspect the source code from that webpage in particular, you will see that the form to obtain the csv file has 3 main inputs:
file_type
fields
contracts
So, to obtain the csv file for the "Settlement price" and "BASE Year", you would simply do a POST request to that same URL, passing these as the payload:
file_type=2&fields=4&contracts=4
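A minimal sketch of that request with the standard library (the field values are the ones listed above; the exact endpoint and any additional form fields are assumptions you should confirm in your browser's developer tools):

```python
from urllib.parse import urlencode

# Form inputs described above; these values are assumed to select
# "Settlement price" / "BASE Year".
payload = {'file_type': 2, 'fields': 4, 'contracts': 4}

# Encodes to the same query string shown above
print(urlencode(payload))  # file_type=2&fields=4&contracts=4

# With the requests library installed, the POST itself would look like:
# import requests
# resp = requests.post('https://wyniki.tge.pl/en/wyniki/archiwum/2/', data=payload)
# with open('settlement_base_year.csv', 'wb') as f:
#     f.write(resp.content)
```

Writing the response's raw bytes (resp.content) to disk preserves the CSV exactly as the server sends it.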
I would recommend the wget command with Python. wget is a tool for downloading files; once you have downloaded the file with wget, you can manipulate the csv file using other libraries.
I found this wget library for python.
https://pypi.python.org/pypi/wget
Regards, Eduardo Estevez.
I would like to produce some custom output with Python using data from Tableau files. I don't have access to the Tableau server to run the TabPy library.
Is there any other way to do it?
Thank you in advance
You may find the following link useful.
https://community.tableau.com/thread/152463
One of the posts in the thread mentioned the following which is worth exploring:
If you're looking to generate a TWBX dynamically, you should rename
your .twbx file to .zip, extract the contents and you can do whatever
you want with those in Python to dynamically create or adjust a
workbook file. The structure / definition of the workbook file is just
XML so no special code needed to read and parse that.
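Following that suggestion, a .twbx is just a zip archive containing a .twb workbook definition, which is XML, so the standard library is enough to get at it. A minimal sketch (the function name is illustrative):

```python
import zipfile
import xml.etree.ElementTree as ET

def extract_workbook_xml(twbx_path):
    """Pull the .twb workbook definition out of a packaged .twbx archive."""
    with zipfile.ZipFile(twbx_path) as archive:
        # A .twbx is a zip; the workbook inside has a .twb extension
        twb_names = [n for n in archive.namelist() if n.endswith('.twb')]
        if not twb_names:
            raise ValueError('no .twb workbook found in archive')
        xml_text = archive.read(twb_names[0]).decode('utf-8')
    return ET.fromstring(xml_text)  # root element of the workbook XML
```

From the returned root element you can walk datasources, worksheets, and so on with the usual ElementTree API.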
Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:
import urllib2

def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.
I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.
When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused as to me it seems like all has changed is the formatting of the URLs, but actually something else must have changed to result in this? Any help would be hugely appreciated.
The problem is that the new URL points to a web page, not the original PDF. If you print the value of expdf, you'll get a bunch of HTML, not the binary data you're expecting.
I was able to get your original function working with a small tweak: I used the requests library to download the file instead of urllib2. requests appears to pull the actual file via the loader referenced in the HTML you're getting from your current implementation. Try this:
import requests

def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
Note that requests is a third-party library, not part of the standard library: if you don't already have it, you'll need to pip install requests (this works on both Python 2.7 and Python 3).
I am thinking of downloading cplusplus.com's C library by using Python. I want to download it completely and then convert it into a linked document such as Python documentation. This is my initial attempt at downloading the front page.
#! python3
import urllib.request

filehandle = urllib.request.urlopen('http://www.cplusplus.com/reference/clibrary/')
with open('test.html', 'w+b') as f:
    for line in filehandle:
        f.write(line)
filehandle.close()
The front page is being downloaded correctly, but it looks quite different from the original web page: the nice formatting of the original is gone after running the script.
What's the reason for this?
This occurs because your scraped version doesn't include the Cascading Style Sheets (CSS) that the page links to. It also won't include any images or JavaScript linked from the page. If you want to obtain those linked files as well, you'll have to parse the source code you scrape and download each of them too.
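For instance, the standard library's html.parser is enough to collect the linked asset URLs, which you could then fetch and save alongside the page (the class name here is illustrative):

```python
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Collect URLs of stylesheets, scripts and images referenced by a page."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.assets.append(attrs.get('href'))       # CSS files
        elif tag in ('script', 'img') and 'src' in attrs:
            self.assets.append(attrs['src'])            # JS and images

parser = AssetCollector()
parser.feed('<html><head><link rel="stylesheet" href="main.css">'
            '<script src="app.js"></script></head>'
            '<body><img src="logo.png"></body></html>')
print(parser.assets)  # ['main.css', 'app.js', 'logo.png']
```

Relative URLs like these would still need to be resolved against the page's base URL (urllib.parse.urljoin) before downloading.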