Downloading public files in Google Drive (Python) - python

Suppose that someone gives me a link that enables me to download a public file in Google Drive.
I want to write a program that can read the link and then download it as a text file.
For example, https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit is one of the files in my Google Drive.
Everyone can access this file.
But how can I write a Python program that downloads the text file given the above link?
Could someone share some sample code?
It seems that a Google Drive SDK could be useful, but is there any way to do it without using an SDK?

First, you need to extract the file ID from the link to the file that you have uploaded.
For example, in the link that you gave:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit
the ID is 1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU
Save it in a variable, say file_id.
Now build the download link by putting that ID at the end:
https://docs.google.com/uc?export=download&id=file_id
Requesting this link downloads the file.
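The steps in this answer can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function names extract_file_id and build_download_url are hypothetical, and the pattern assumes the ID always follows a /d/ path segment, as it does in the example link.

```python
import re

# Matches the ID segment in links such as https://docs.google.com/document/d/<ID>/edit
_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_file_id(link):
    """Return the file ID found after '/d/' in a Google Docs/Drive link."""
    match = _ID_PATTERN.search(link)
    if match is None:
        raise ValueError("no file ID found in link: " + link)
    return match.group(1)


def build_download_url(link):
    """Build the uc?export=download URL from this answer for the given link."""
    return "https://docs.google.com/uc?export=download&id=" + extract_file_id(link)


if __name__ == "__main__":
    link = "https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit"
    print(build_download_url(link))
```

The resulting URL can then be passed to any HTTP client, for example urllib.request.urlopen from the standard library.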

If the above answer doesn't work for you, use the following links.
To save as a .txt file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=txt
To save as a .docx file:
https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/export?format=docx
Generally, the trick is to replace edit at the end of the link with export?format=txt. Hope it helps.
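A rough Python sketch of this trick, using only the standard library, might look like the following. The helper names to_export_url and download_document are hypothetical, and the code assumes the link ends in an /edit segment and that the document is publicly shared.

```python
import urllib.request


def to_export_url(edit_link, fmt="txt"):
    """Swap the trailing 'edit' segment of a document link for 'export?format=<fmt>'."""
    base, sep, tail = edit_link.rpartition("/")
    if not sep or not tail.startswith("edit"):
        raise ValueError("expected a link ending in /edit: " + edit_link)
    return base + "/export?format=" + fmt


def download_document(edit_link, destination, fmt="txt"):
    """Fetch the exported document and write the raw bytes to destination."""
    with urllib.request.urlopen(to_export_url(edit_link, fmt)) as response:
        data = response.read()
    with open(destination, "wb") as out_file:
        out_file.write(data)


# Example call (needs network access and a publicly shared document):
# download_document(
#     "https://docs.google.com/document/d/1yJVXtabsP7KrJXSu3XyOh-F2cFoP8Lftr14PtXCLEVU/edit",
#     "document.txt",
# )
```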

Related

How to create CSV FILE URL Link in HTML Report (Python) when csv file is residing in ADLS

I have a CSV file in Azure Data Lake Storage (ADLS). I need to create an HTML report from an Azure Databricks notebook (Python) that includes a link to this CSV file, which the user can click to download it.
I am not sure whether this is achievable; I am posting this question to reach people who can help with the logic/flow.
For example, I am trying to include the piece of code below in my HTML, but it is not working:
<a href='abfss://testingZone@testingZone.dfs.core.windows.net/Test/Input/TestData.csv'>CSVFile </a>
I reproduced this by using your code in the displayHTML() function in Databricks after mounting the storage.
When I click on the link, it shows a blank page titled Untitled.
It might be because of the restrictions of ADLS Gen2.
To download the file via a link, first copy the file to DBFS, then create the web link in Databricks.
Clicking that link downloads the CSV file.
Code:
# Copy the CSV from ADLS to DBFS
dbutils.fs.cp("abfss://con1@rakeshgen3.dfs.core.windows.net/Sample1.csv", "dbfs:/FileStore/tables/ok.csv")
# Create a download link for the copied CSV
displayHTML("""<a href='/files/tables/ok.csv'>CSVFile </a>""")

How to load large xml dataset file in python?

Hi, I am working on a data analysis project in Python where I have an XML file of around 2.8 GB, which is too large to open normally. I downloaded EmEditor, which let me open the file. The problem is that when I try to load the file in Python on Google Colaboratory like this:
import xml.etree.ElementTree as ET
tree = ET.parse('dataset.xml')  # dataset.xml is the name of my file
root = tree.getroot()
I get the error No such file or directory: 'dataset.xml'. The dataset.xml file is on my desktop and can be opened with EmEditor, which suggests that it can be edited and loaded via EmEditor, but I am not sure. I would appreciate help with loading the data in Python on Google Colab.
Google Colab runs remotely on a computer from Google, and can't access files that are on your desktop.
To open the file in Python, you'll first need to transfer the file to your Colab instance. There are multiple ways to do this, and you can find them here: https://colab.research.google.com/notebooks/io.ipynb
The easiest is probably this:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
Keep in mind, though, that every time you start a new Colab session you'll need to re-upload the file. This is because Google wants to use the computer for someone else when you are not using it, and therefore wipes all the data on it.

downloading csv files from a specific site using python

Goal: I want to automate the download of various .csv files from https://wyniki.tge.pl/en/wyniki/archiwum/2/?date_to=2018-03-21&date_from=2018-02-19&data_scope=contract&market=rtee&data_period=3 using Python (this is not the main issue, though).
Specifics: in particular, I am trying to download the CSV file for the "Settlement price" and "BASE Year".
Problem: when I view the source code of this web page, I see references to the "Upload" button, but I don't see references to the CSV file (to be fair, I am not very good at reading source code). As I am using Python (urllib), I need to know the URL of the CSV file, but I don't know how to get it.
This is not a question about Python per se, but about how to find the URL of a .csv that can be downloaded from a web page. Hence, no code is provided.
If you inspect the source code from that webpage in particular, you will see that the form to obtain the csv file has 3 main inputs:
file_type
fields
contracts
So, to obtain the csv file for the "Settlement price" and "BASE Year", you would simply do a POST request to that same URL, passing these as the payload:
file_type=2&fields=4&contracts=4
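Since the question mentions urllib, the POST request described in this answer could be sketched as below. The function names build_csv_request and download_csv are hypothetical, and the sketch assumes the site still accepts these three form fields exactly as described here.

```python
import urllib.parse
import urllib.request

PAGE_URL = ("https://wyniki.tge.pl/en/wyniki/archiwum/2/"
            "?date_to=2018-03-21&date_from=2018-02-19"
            "&data_scope=contract&market=rtee&data_period=3")


def build_csv_request(url=PAGE_URL):
    """Prepare a POST request carrying the three form fields as the payload."""
    payload = urllib.parse.urlencode(
        {"file_type": 2, "fields": 4, "contracts": 4}
    ).encode("ascii")
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(url, data=payload)


def download_csv(destination, url=PAGE_URL):
    """Send the request and save the response body to destination."""
    with urllib.request.urlopen(build_csv_request(url)) as response:
        body = response.read()
    with open(destination, "wb") as out_file:
        out_file.write(body)


# Example call (needs network access):
# download_csv("settlement_price_base_year.csv")
```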
I would recommend using the wget command with Python. wget is a command for downloading files. Once you have downloaded the file with wget, you can manipulate the CSV file using another library.
I found this wget library for Python:
https://pypi.python.org/pypi/wget
Regards.
Eduardo Estevez.

Python 3.4 - Downloading newly uploaded text files from pastebin.com

I want to download text files from pastebin.com.
Once I start the program it should look for text files that are being uploaded and "download" them once they're uploaded.
I know how to "download" them but not how to tell Python to click on one of the public files on http://pastebin.com/archive and then click on the "raw"-button to open a new tab that contains the "raw" content.
I googled a lot but literally nothing came up that would help me.
Thanks
Well, a program doesn't know how to "click" anything :). In order to retrieve information from a page, you simply need to send a GET request to the correct URL. In your case, that would be http://pastebin.com/raw/4ffLHviP, or the same address with the code of whichever paste you want to download. You can collect the codes manually, or e.g. by applying text parsers (regex, BeautifulSoup...) to the archive page.
Note that there is an API for scraping Pastebin (see http://pastebin.com/scraping). If you want to extract a substantial amount of content from the site, using it is strongly recommended. It is more "polite", may offer better service, and will keep you from being blacklisted.
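As a minimal sketch of the GET approach, assuming paste links have the form http://pastebin.com/<code> and that the raw endpoint behaves as described above, the URL rewrite and download could look like this. The names to_raw_url and fetch_paste are hypothetical.

```python
import urllib.parse
import urllib.request


def to_raw_url(paste_url):
    """Turn http://pastebin.com/<code> into http://pastebin.com/raw/<code>."""
    parts = urllib.parse.urlsplit(paste_url)
    code = parts.path.strip("/")
    if not code or "/" in code:
        raise ValueError("expected a link of the form http://pastebin.com/<code>")
    return urllib.parse.urlunsplit(
        (parts.scheme, parts.netloc, "/raw/" + code, "", "")
    )


def fetch_paste(paste_url):
    """Send a GET request to the raw URL and return the paste as text."""
    with urllib.request.urlopen(to_raw_url(paste_url)) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (needs network access):
# text = fetch_paste("http://pastebin.com/4ffLHviP")
```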
To choose a file, you simply do the following:
Visit the link of the file, e.g. http://pastebin.com/B8A6L7Zt
The raw content is already on that page, namely inside <textarea id='paste_code'>...</textarea>. So you just extract this content, using a regex for example.
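The regex extraction mentioned in this answer might be sketched as follows. It assumes the page really does wrap the paste in a textarea with id paste_code, which may no longer hold for the live site; the function name extract_paste_text and the sample HTML are made up for illustration.

```python
import html
import re

# Captures everything between the opening and closing paste_code textarea tags.
_TEXTAREA = re.compile(
    r"<textarea[^>]*\bid=['\"]paste_code['\"][^>]*>(.*?)</textarea>",
    re.DOTALL | re.IGNORECASE,
)


def extract_paste_text(page_html):
    """Return the unescaped text of the paste_code textarea, or None if absent."""
    match = _TEXTAREA.search(page_html)
    if match is None:
        return None
    # HTML entities such as &lt; are converted back to plain characters.
    return html.unescape(match.group(1))


if __name__ == "__main__":
    # Hypothetical page fragment used only for illustration.
    sample = "<html><textarea id='paste_code'>print(1 &lt; 2)</textarea></html>"
    print(extract_paste_text(sample))
```

For anything beyond a quick script, an HTML parser is usually more robust than a regex against markup changes.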

Getting the download link for a public Google Docs file

Reading the Google Docs API I find this:
Downloading
Files cannot be downloaded in a format other than
the one in which they were originally uploaded. The download URL for
files looks something like this:
https://doc-04-20-docs.googleusercontent.com/docs/secure/m7an0emtau/WJm12345/YzI2Y2ExYWVm?h=16655626&e=download&gd=true
Given a public Google Documents file URL, say,
https://docs.google.com/open?id=0B1-vl-dPgKm_NTNhZjZkMWMtZjQxOS00MGE1LTg2MjItNGVjYzdmZjYxNmQ5
How can I turn it into a download link?
Hi nightcracker, try this:
https://docs.google.com/uc?export=download&id=DOCIDGOESHERE
I've only tried it with one PDF and it worked OK, so maybe experimenting with that will help.
All the best,
Dave
The method suggested by dkcwd didn't work for a published document.
I found the following method on this page. Given a published document URL, such as
https://docs.google.com/document/pub?id=[ID]
The download link is
https://docs.google.com/document/export?format=[FORMAT]&id=[ID]
where [FORMAT] can be one of these values: pdf, doc, docx, oo, rtf, txt.
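The conversion from a published URL to an export URL can be sketched in Python as below. The function name published_to_export_url is hypothetical, and the list of formats is taken directly from this answer rather than from any official documentation.

```python
import urllib.parse

# Format values listed in the answer above.
ALLOWED_FORMATS = ("pdf", "doc", "docx", "oo", "rtf", "txt")


def published_to_export_url(published_url, fmt="pdf"):
    """Read the id query parameter and build the matching export URL."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError("unsupported format: " + fmt)
    query = urllib.parse.urlsplit(published_url).query
    ids = urllib.parse.parse_qs(query).get("id")
    if not ids:
        raise ValueError("no id parameter in URL: " + published_url)
    params = urllib.parse.urlencode({"format": fmt, "id": ids[0]})
    return "https://docs.google.com/document/export?" + params
```

The same query-parameter lookup also works for links of the form https://docs.google.com/open?id=..., since the ID is carried in the id parameter there as well.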
