Batch download data from OpenDAP using Python - python

I am trying to download multiple .nc files from OpenDAP. When I download the files manually (without a script) the files work as expected. To try speed the process up, I have a script that batch downloads data. However, when I download data using xarray the files are 10x larger and the files seem to be corrupted.
My script looks like this:
import pandas as pd
import xarray as xr
import os
import numpy as np
dates = pd.date_range(start='2016-01-01',end='2016-01-05',freq='D')
my_url = "http://www.ifremer.fr/opendap/cerdap1/ghrsst/l4/saf/odyssea-nrt/data/"
print(" ")
print("Downloading data from OPeNDAP - sit back, relax, this will take a while...")
print("...")
print("...")
# Create a list of url's
data_url = []
cnt = 0
for i in np.arange(1,5):
ii = i+1
data_url.append(my_url + str(dates[cnt].year)+"/"+ str('%03d'%+ii)+"/"\
+str(dates[cnt+1].year)+str('%02d'%dates[cnt+1].month)+str('%02d'%dates[cnt+1].day)\
+"-IFR-L4_GHRSST-SSTfnd-ODYSSEA-SAF_002-v2.0-fv1.0.nc?time[0:1:0],lat[0:1:1749],lon[0:1:2249],analysed_sst[0:1:0][0:1:1749][0:1:2249],analysis_error[0:1:0][0:1:1749][0:1:2249],mask[0:1:0][0:1:1749][0:1:2249],sea_ice_fraction[0:1:0][0:1:1749][0:1:2249]")
cnt = cnt+1
url_list = data_url
# Download data from the url's
count = 0
for data in url_list:
print('Downloading file:', str(count))
ds = xr.open_dataset(data,autoclose=True)
fname = 'SAFodyssea_sst'+str(dates[count+1].year)+str('%02d'%dates[count+1].month)+str('%02d'%dates[count+1].day)+'.nc'
ds.to_netcdf(fname)
count = count +1
del ds, fname
print('DONE !!!')
I have xarray version 0.10.8. I have tried running this using python 2.7 and python 3.5.6 as well as on windows 10 and Ubuntu 16.04 and I get the same result.
Your help is much appreciated.

Each of these files as an associated URL for the netCDF file, e.g.,
http://www.ifremer.fr/opendap/cerdap1/ghrsst/l4/saf/odyssea-nrt/data/2018/001/20180101-IFR-L4_GHRSST-SSTfnd-ODYSSEA-SAF_002-v2.0-fv1.0.nc
One simple way to solve this problem would be to use a library such as requests to download each file, e.g., as described here:
How to download large file in python with requests.py?

Related

File Not Found Error while Downloading Image files

I am using Windows 8.1, so I have been web scraping a lot recently and have been very successful in finding out some errors as well, but now I am stuck in downloading the files as they will not download and giving me a
FileNotFoundError.
I have removed all the unknown characters from the name files but still, get this error. any help.
I have also made the names lowercase just in case. The error happens when I download the 22nd item, other items download fine before the 22nd one .
My Code and also the Excel file For reference:
import time
import pandas as pd
import requests
Final1 = pd.read_excel("Sneakers.xlsx")
Final1.index+=1
a = Final1.index.tolist()
Images = Final1["Images"].tolist()
Name = Final1["Name"].str.lower().tolist()
Brand = Final1["Brand"].str.lower().tolist()
s = requests.Session()
for i,n,b,l in zip(a,Name,Brand,Images):
r = s.get(l).content
with open("Images//" + f"{i}-{n}-{b}.jpg","wb") as f:
f.write(r)
Excel File (Google Drive) : Excel File
It seems like you don't have Images folder in your path.
It's better way to use os.path.join() function for joining path in python.
Try Below:
import os
import time
import pandas as pd
import requests
Final1 = pd.read_excel("Sneakers.xlsx")
Final1.index+=1
a = Final1.index.tolist()
Images = Final1["Images"].tolist()
Name = Final1["Name"].str.lower().tolist()
Brand = Final1["Brand"].str.lower().tolist()
# Added
if not os.path.exists("Images"):
os.mkdir("Images")
s = requests.Session()
for i,n,b,l in zip(a,Name,Brand,Images):
r = s.get(l).content
# with open("Images//" + f"{i}-{n}-{b}.jpg","wb") as f:
with open(os.path.join("Images", f"{i}-{n}-{b}.jpg"),"wb") as f:
f.write(r)

Download CSV Data by looping over a Pandas Data frame which consists of 47 URL

I am trying to develop a Python Script for my Data Engineering Project and I want to loop over 47 URLS stored in a dataframe, which downloads a CSV File and stores in my local machine. Below is the example of top 5 URLS:
test_url = "https://data.cdc.gov/api/views/pj7m-y5uh/rows.csv?accessType=DOWNLOAD"
req = requests.get(test_url)
url_content = req.content
csv_file = open('cdc6.csv', 'wb')
csv_file.write(url_content)
csv_file.close()
I have this for a single file, but instead of the opening a CSV File and writing the Data in it, I want to directly download all the files and save it in local machine.
You want to iterate and then download the file to a folder. Iteration is easy by using the .items() method in pandas dataframes and passing it into a loop. See the documentation here.
Then, you want to download each item. Urllib has a .urlretrieve(url, filename) function for downloading a hosted file to a local file, which is elaborated on in the urllib documentation here.
Your code may look like:
for index, url in url_df.items():
urllib.urlretrieve(url, "cdcData" + index + ".csv")
or if you want to preserve the original names:
for index, url in url_df.items():
name = url.split("/")[-1]
urllib.urlretrieve(url, name)

Download multiple xls files using Python

I was wondering if somebody here could help me out creating a script? I have never done something like this before so I have no idea what I’m doing. But I have been reading about it for a couple days now and I’m still not understanding it so I appreciating all help I can get. I’m even willing to pay for your service!
Here is an example of my problem. I have for the moment a CSV file named “Stars” saved on my windows desktop containing around 50.000 different links that directly starts downloading a xls file when pressed. Each row contains one of these links. I would want with your help create some kind of script for this that will make some kind of loop thru each row and visit this different links so it can download these 50.000 different files.
Thank you all for taking time to read this
/ Sarah
Say your CSV file looks like:
http://www.ietf.org/rfc/rfc959.txt
http://www.ietf.org/rfc/rfc1579.txt
http://www.ietf.org/rfc/rfc2577.txt
replace path to csvfile and targetdir in python code:
import os
import urllib2
csvfile = '/tmp/links.csv'
targetdir = '/tmp/so'
with open(csvfile) as links:
for link in links:
filename = link.split('/')[-1].strip()
filepath = os.path.join(targetdir, filename)
print 'Downloading %s \n\t .. to %s' % (link.strip(), filepath)
with open(filepath, 'w') as data:
xlsfile = urllib2.urlopen(link)
data.writelines(xlsfile)
Example of usage:
$ python download_all.py
Downloading http://www.ietf.org/rfc/rfc959.txt
.. to /tmp/so/rfc959.txt
Downloading http://www.ietf.org/rfc/rfc1579.txt
.. to /tmp/so/rfc1579.txt
Downloading http://www.ietf.org/rfc/rfc2577.txt
.. to /tmp/so/rfc2577.txt
$ dir -1 /tmp/so
rfc1579.txt
rfc2577.txt
rfc959.txt
Good Luck.
Another Solution:
Without more information, the best answer I can give you on this question would be to use Selenium to download the file and the csv module to parse your csv with the links.
Example:
import csv
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', 'PATH\TO\DOWNLOAD\DIRECTORY')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', "application/csv")
driver = webdriver.Firefox(firefox_profile=profile)
input_csv_location = "PATH\TO\CSV.csv"
with open(csv_location, 'r') as input_csv:
reader = csv.reader(input_csv)
for line in reader:
driver.get(line[0])
This assumes there is no header on the csv and that the urls are sitting in spot numero uno.

How to download a CSV file from the World Bank's dataset

I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will get the zip downloaded, open it and get you a csv object with whatever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
sio = StringIO.StringIO()
sio.write(remoteCSV.read())
# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
z = ZipFile(sio, 'r')
# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
print z.namelist()
# A list with the names of all the files in the zip you just downloaded
# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
# Opens the 2nd file in the zip
csvr = csv.reader(f)
for row in csvr:
print row
For more information see ZipFile Docs and StringIO Docs
import os
import urllib
import zipfile
from StringIO import StringIO
package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
csv = os.path.join(pwd, filename)
with open(csv, 'w') as fp:
fp.write(zip.read(filename))
print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is python based and uses python 3.0. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank
Just a suggestion than a solution. You can use pd.read_csv to read any csv file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')

Downloading contents of several html pages using python

I'm new to Python and was trying to figure out how to code a script that will download the contents of HTML pages. I was thinking of doing something like:
Y = 0
X = "example.com/example/" + Y
While Y != 500:
(code to download file), Y++
if Y == 500:
break
so the (Y) is the file name and I need to download files from example.com/example/1 all the way till file number 500, regardless of the file type.
Read this official docs page:
This module provides a high-level interface for fetching data across the World Wide Web.
In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames.
Some restrictions apply — it can only open URLs for reading, and no seek operations are available.
So you have code like this:
import urllib
content = urllib.urlopen("http://www.google.com").read()
#urllib.request.urlopen(...).read() in python 3
The following code should meet your need. It will download 500 web contents and save them to disk.
import urllib2
def grab_html(url):
response = urllib2.urlopen(url)
mimetype = response.info().getheader('Content-Type')
return response.read(), mimetype
for i in range(500):
filename = str(i) # Use digit as filename
url = "http://example.com/example/{0}".format(filename)
contents, _ = grab_html(url)
with open(filename, "w") as fp:
fp.write(contents)
Notes:
If you need parallel fetching, here is a great example https://docs.python.org/3/library/concurrent.futures.html

Categories

Resources