Efficient netCDF analysis when looping through data - python

This is a follow-up question related to this question.
Thanks to previous help I have successfully imported a netCDF file (or files with MFDataset) and am able to compare the different times to one another to create another cumulative dataset. Here is a piece of the current code.
from numpy import *
import netCDF4
import os

f = netCDF4.MFDataset('air.2m.1979.nc')
atemp = f.variables['air']
ntimes, ny, nx = atemp.shape
cold_days = zeros((ntimes, ny, nx), dtype=int)

for i in range(ntimes):
    for b in range(ny):
        for c in range(nx):
            if i == 1:
                if atemp[i, b, c] < 0:
                    cold_days[i, b, c] = 1
                else:
                    cold_days[i, b, c] = 0
            else:
                if atemp[i, b, c] < 0:
                    cold_days[i, b, c] = cold_days[i-1, b, c] + 1
                else:
                    cold_days[i, b, c] = 0
This seems like a brute-force way to get the job done, and though it works it takes a very long time. I'm not sure whether it takes so long because I'm dealing with 365 matrices of 349x277 (35,285,645 values in total) or whether my old-school brute-force approach is simply slow compared to built-in Python/NumPy methods.
Below is an example of what I believe the code is doing. It looks at each time step and increments cold days if temp < 0; if temp >= 0 then cold days resets to 0. In the image below you will see that the cell at row 2, column 1 increments at each time that passes, but the cell at row 2, column 2 increments at time 1 and resets to zero at time 2.
Is there a more efficient way to rip through this netCDF dataset to perform this type of operation?

Seems like this is a minor modification -- just write the data out at each time step. Something close to this should work:
from pylab import *
import netCDF4

# open NetCDF input files
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
f.variables.keys()
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny, nx), dtype=int)

# create output NetCDF file
nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc', 'w', clobber=True)
nco.createDimension('x', nx)
nco.createDimension('y', ny)
nco.createDimension('time', ntimes)
cold_days_v = nco.createVariable('cold_days', 'i4', ('time', 'y', 'x'))
cold_days_v.units = 'days'
cold_days_v.long_name = 'total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'

timeo = nco.createVariable('time', 'f8', ('time',))
lono = nco.createVariable('lon', 'f4', ('y', 'x'))
lato = nco.createVariable('lat', 'f4', ('y', 'x'))
xo = nco.createVariable('x', 'f4', ('x',))
yo = nco.createVariable('y', 'f4', ('y',))
lco = nco.createVariable('Lambert_Conformal', 'i4')

# copy all the variable attributes from the original file
for var in ['time', 'lon', 'lat', 'x', 'y', 'Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))

# copy variable data for time, lon, lat, x and y
timeo[:] = f.variables['time'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]

for i in range(ntimes):
    cold_days += atemp[i, :, :].data - 273.15 < 0
    # write the cold_days data
    cold_days_v[i, :, :] = cold_days

# copy global attributes from the original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))

nco.Conventions = 'CF-1.6'
nco.close()
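Note that the `cold_days += ...` line above accumulates the total number of cold days; if you also want the reset-to-zero behaviour from the question's loop (consecutive cold days), a minimal vectorised sketch of that per-timestep update could look like this (`update_cold_days` is a hypothetical helper; names otherwise follow the snippets above):
import numpy as np

def update_cold_days(cold_days, temp_k):
    """Per-timestep update: increment the running count where temp < 0 degC,
    reset it to zero elsewhere (consecutive cold days, as in the question)."""
    cold = (temp_k - 273.15) < 0
    return np.where(cold, cold_days + 1, 0)

# inside the time loop above, something like:
#     cold_days = update_cold_days(cold_days, atemp[i, :, :])
#     cold_days_v[i, :, :] = cold_days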

Related

How to tell if a Python program is working without having to wait for it to finish?

Here is my code:
import xlwings as xw
import datetime as dt
import numpy as np
import pandas as pd
import threading
import time

# connect to workbook
wb = xw.Book(r'C:\Users\Ryan\AppData\Local\Programs\Python\Python37-32\constituents.xlsx')
sht = wb.sheets['constituents']

# store data in np array, pass to Pandas
a = sht.range('A2:C1760').options(np.array).value
df = pd.DataFrame(a)
df = df.rename(index=str, columns={0: "tickers", 1: "start_dates", 2: "end_dates"})

# initialize variables
start_quarter = 0
start_year = 0
fiscal_dates = []
s1 = pd.date_range(start='1/1/1964', end='12/31/2018', freq='B')
df2 = pd.DataFrame(data=np.ndarray(shape=(len(s1), 500), dtype=str), index=s1)

# create list of fiscal quarters
def fiscal_quarters(start_year):
    year_count = start_year - 1
    quarter_count = 1
    for n in range(2019 - start_year):
        year_count += 1
        for i in range(1, 5):
            fiscal_dates.append(str(quarter_count) + 'Q' + str(year_count)[-2:])
            quarter_count += 1
        quarter_count = 1

# iterate over list of tickers to create self-named spreadsheets
def populate_worksheets():
    for n in range(len(fiscal_dates)):
        wb.sheets.add(name=fiscal_dates[n])

# populate df2 with appropriate tickers
def populate_tickers():
    count = 0
    for n in range(len(s1)):
        for i in range(len(df['tickers'])):
            if df.loc[str(i), 'start_dates'] <= s1[n] and df.loc[str(i), 'end_dates'] > s1[n]:
                count += 1
                df2.loc[str(s1[n]), str(count)] = df.loc[str(i), 'tickers']
        count = 0

# run populate_tickers function with status updates
def pt_thread():
    t = threading.Thread(target=populate_tickers)
    c = 0
    t.start()
    while t.is_alive():
        time.sleep(5)
        c += 5
        print("Working... " + str(c) + 's')
First, I run fiscal_quarters(1964) in the Python shell, then pt_thread(), which appears to be particularly resource intensive. It has been running for over half an hour at this point on my (admittedly slow) laptop. However, without waiting for it to finish, is there any way to see whether it's working as intended, or at all? It's still printing "Working..." to the shell, which I suppose is a good sign, but I'd like to start troubleshooting if something's wrong rather than waiting an indefinite amount of time before giving up on it.
For reference, the s1 series contains ~17,500 items and the df['tickers'] column contains ~2,000 items, so there should be somewhere in the neighborhood of 35,000,000 iterations with 4 operations each. Is this a lot, or should a modern PC work through it rather quickly, meaning my program is probably just not working?
If you are running loops that take a long time and want to see what is going on, you can use tqdm. It shows iterations per second and the estimated time remaining. Here is a quick example:
from tqdm import tqdm

def sim(sims):
    x = 0
    pb = tqdm(total=sims, initial=x)
    while x < sims:
        x += 1
        pb.update(1)
    pb.close()

sim(5000000)
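For the question's code specifically, wrapping the outer loop of populate_tickers() in tqdm (instead of running a separate monitoring thread) would give a live progress bar, iteration rate, and ETA. A minimal, self-contained sketch (the sleep just stands in for the per-row work):
from tqdm import tqdm
import time

# wrap any iterable in tqdm to get a progress bar, iterations/second and an ETA;
# in the question's code this would be: for n in tqdm(range(len(s1))): ...
for n in tqdm(range(200)):
    time.sleep(0.01)  # stand-in for the per-date work done in populate_tickers()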

Combining multiple functions through loops and parameters

UPDATE: My question has been fully answered and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected how quickly my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can increase its speed; it takes about 30 seconds, and I know this portion of the code is slowing it down. I have shown my real code in the second block below. The speed also depends strongly on the range I set: with a short range it is quite fast.
I have this sample code that shows the calculation I use for forecasting and extracting values. I use the for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the forecasting average for a forecast over a given number of months.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are very similar except for one number, and reading the CSV files so many times slows the program down.
Is there a way I can combine these functions, perhaps by adding another parameter, to make them run that way? My biggest concern was that it would be hard to return separate numbers and categorize them. In other words, I would ideally like only one function for all 12 monthly accuracy predictions; the way I can see to do that would be to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which goes to the file before the current file and compares the predicted value for the date associated with the current file), then store all the values of twomonthaccuracy, and so on, so I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt

def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal

def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding='latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal

onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24, 36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal += twomonthaccuracy(basefilenumber)
    onetotallist.append(onemonthaccuracy(basefilenumber))
    twototallist.append(twomonthaccuracy(basefilenumber))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12

x = [1, 2]
y = [onetotalpermonth, twototalpermonth]
z = [1, 2]
w = [onetotallist, twototallist]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x, y)
plt.show()
This is the real block of code I am using in my program; perhaps something I'm unaware of is slowing it down?
# other parts of code
# StartRange = yearvalue + Value
# EndRange = endValue + endyearvalue
# Range = EndRange - StartRange
# Department
# more code....

def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1 - (int(basefilevalue)/int(nmonthvalue)) + 1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)

N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange), int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
        total_by_month_list[n].append(nmonthaccuracy(basefilenumber, n))

onetotal = total[1] / Range
twototal = total[2] / Range
threetotal = total[3] / Range
fourtotal = total[4] / Range
fivetotal = total[5] / Range  # ... all the way to 12
onetotallist = total_by_month_list[1]
twototallist = total_by_month_list[2]
threetotallist = total_by_month_list[3]
fourtotallist = total_by_month_list[4]
fivetotallist = total_by_month_list[5]  # ... all the way to 12
# a lot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding='latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)

N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20, 30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2 here, but you could make it 12 or another value) and then creates a total_by_month list that can hold N results, one per month, initialized to all zeroes so that each monthly total starts at zero.
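One further point, since the question mentions that repeatedly reading the CSV files is slow: each pass through the inner loop above calls nmonthaccuracy() twice, and each call parses two CSV files. A rough sketch of a file cache (read_file is a hypothetical helper, not part of the original code; it assumes the same '<number>.csv' naming) that parses each file at most once:
import pandas as pd

_csv_cache = {}

def read_file(filenumber):
    """Return the DataFrame for '<filenumber>.csv', parsing it only on first use."""
    if filenumber not in _csv_cache:
        _csv_cache[filenumber] = pd.read_csv(str(filenumber) + '.csv', encoding='latin-1')
    return _csv_cache[filenumber]
nmonthaccuracy() could then call read_file(basefilenumber) and read_file(basefilenumber - n) instead of pd.read_csv, and storing the result of nmonthaccuracy(basefilenumber, n) in a local variable avoids computing it twice per iteration (once for total[n] and once for total_by_month_list[n]).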

loop for loading files and assigning variables

I am brand new to python and am working on a school project. I was able to write the code to make my project work. However, I realize the way I did it is terribly inefficient and that there is probably a better way to store and work with the data in a 3D numpy array. I want to learn a better way to write code like this.
I have tried searching but I don't understand enough python to know what to search for or find the answers.
Question 1: How would I rewrite the code below to use a for loop to do the repetitive parts?
Question 2: Should I have used a 3D numpy array instead of assigning data to array_2001, array_2002, etc.? If so how would I do the array math using it?
from pyhdf.SD import SD, SDC
import numpy as np
import matplotlib.pyplot as plt
import glob
#Number of images
max_samples = 18
#Values to filter with
exclude_below = 0
exclude_above = 100
min_threshold = 40
max_threshold = 100
#path to directory containing hdf files
file_path = '/Data/2018_2001_georef_MODIS/'
#Get a list of all the .hdf files in the directory
MODIS_files = glob.glob(file_path + '*.hdf')
#Get 2001 data
file = SD(MODIS_files[0], SDC.READ)
datasets_dic = file.datasets()
sds_obj = file.select('NDSI_Snow_Cover') # select sds
Array_2001 = np.array(sds_obj.get()) # get sds data
# Get 2002 data
file = SD(MODIS_files[1], SDC.READ)
sds_obj = file.select('NDSI_Snow_Cover') # select sds
Array_2002 = np.array(sds_obj.get()) # get sds data
#Repeat over and over for the rest of the files
#Get 2018 data
file = SD(MODIS_files[17], SDC.READ)
sds_obj = file.select('NDSI_Snow_Cover') # select sds
Array_2018 = np.array(sds_obj.get()) # get sds data
#print ('Array_2018 :', Array_2018)
#Create a boolean mask where 'true' means there is snow on
# on the pixel within the thresholds
snow_mask_2001 = (Array_2001 >= min_threshold) & (Array_2001 <= max_threshold)
snow_mask_2002 = (Array_2002 >= min_threshold) & (Array_2002 <= max_threshold)
snow_mask_2003 = (Array_2003 >= min_threshold) & (Array_2003 <= max_threshold)
#Repeat over and over for the rest of the files
snow_mask_2018 = (Array_2018 >= min_threshold) & (Array_2018 <= max_threshold)
non_snow_2001 = (Array_2001 > exclude_above) | (Array_2001 < exclude_below)
non_snow_2002 = (Array_2002 > exclude_above) | (Array_2002 < exclude_below)
non_snow_2003 = (Array_2003 > exclude_above) | (Array_2003 < exclude_below)
#Repeat over and over for the rest of the files
non_snow_2018 = (Array_2018 > exclude_above) | (Array_2018 < exclude_below)
Next, I converted the boolean 'true, false' arrays to ones and zeros.
(code not shown for brevity).
#Sum the number of snow days per pixel
snow_days = snow_mask_2001 + snow_mask_2002 + snow_mask_ ... snow_mask_2018
#sum the number of days with a 'non-snow' reading per pixel
no_reading_days = non_snow_2001 + non_snow_2002 + non_snow_2003 + … non_snow_2017 + non_snow_2018
There's more to my code after this point, but it's non-repetitive.
Any advice is appreciated.
You can just loop through the files you stored in your list and append everything to one big numpy array. On that numpy array you can do your masking...
# path to directory containing hdf files
file_path = '/Data/2018_2001_georef_MODIS/'
# Get a list of all the .hdf files in the directory
MODIS_files = glob.glob(file_path + '*.hdf')
# ?? should be the size of your array.
arr = np.empty((0, ??), float)
for file_name in MODIS_files:
    file = SD(file_name, SDC.READ)
    datasets_dic = file.datasets()
    sds_obj = file.select('NDSI_Snow_Cover')  # select sds
    arr = np.append(arr, np.array(sds_obj.get()), axis=0)
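To answer question 2 as well: if all of the yearly grids have the same shape, they can be stacked into a single 3D array and masked in one step. A sketch under that assumption, reusing the names and thresholds from the question:
from pyhdf.SD import SD, SDC
import numpy as np
import glob

file_path = '/Data/2018_2001_georef_MODIS/'
MODIS_files = glob.glob(file_path + '*.hdf')

# thresholds from the question
exclude_below, exclude_above = 0, 100
min_threshold, max_threshold = 40, 100

# read every year once and stack into shape (n_years, rows, cols)
arrays = []
for file_name in MODIS_files:
    file = SD(file_name, SDC.READ)
    sds_obj = file.select('NDSI_Snow_Cover')
    arrays.append(np.array(sds_obj.get()))
stack = np.stack(arrays)

# vectorised masks over all years at once
snow_mask = (stack >= min_threshold) & (stack <= max_threshold)
non_snow = (stack > exclude_above) | (stack < exclude_below)

# per-pixel counts across the time axis
snow_days = snow_mask.sum(axis=0)
no_reading_days = non_snow.sum(axis=0)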

Student - np.random.choice: How to isolate and tally hit frequency within a np.random.choice range

Currently learning Python and very new to NumPy and pandas.
I have pieced together a random generator with a range. It uses NumPy, and I am unable to isolate each individual result so that I can count the iterations that fall within a sub-range of my random range.
Goal: count the iterations of "Random >= 1000" and then add 1 to the appropriate cell that corresponds to the tally of iterations. A very basic example:
#Random generator begins... these are first four random generations
Randomiteration0 = 175994 (Random >= 1000)
Randomiteration1 = 1199 (Random >= 1000)
Randomiteration2 = 873399 (Random >= 1000)
Randomiteration3 = 322 (Random < 1000)
#used to +1 to the fourth row of column A in CSV
finalIterationTally = 4
#total times random < 1000 throughout entire session. Placed in cell B1
hits = 1
#Rinse and repeat to custom set generations quantity...
(The logic would then be to +1 to A4 in the spreadsheet. If the iteration tally would have been 7, then +1 to the A7, etc. So essentially, I am measuring the distance and frequency of that distance between each "Hit")
My current code includes a CSV export portion. I do not need to export each individual random result any longer. I only need to export the frequency of each iteration distance between each hit. This is where I am stumped.
Cheers
import pandas as pd
import numpy as np

# set random generation quantity
generations = int(input("How many generations?\n###:"))

# random range and generator
choices = range(1, 100000)
samples = np.random.choice(choices, size=generations)

# create new column in excel
my_break = 1000000
if generations > my_break:
    n_empty = my_break - generations % my_break
    samples = np.append(samples, [np.nan] * n_empty).reshape((-1, my_break)).T

# export results to CSV
(pd.DataFrame(samples)
 .to_csv('eval_test.csv', index=False, header=False))

# left uncommented if wanting to test 10 generations or so
print(samples)
I believe you are mixing up iterations and generations. It sounds like you want 4 iterations for N generations, but your bottom piece of code does not express the "4" anywhere. Pulling all your variables out to the top of your script can help you organize better. Pandas is great for parsing complicated CSVs, but for this case you don't really need it; you probably don't even need NumPy.
import numpy as np

THRESHOLD = 1000
CHOICES = 10000
ITERATIONS = 4
GENERATIONS = 100

choices = range(1, CHOICES)
output = np.zeros(ITERATIONS + 1)
for _ in range(GENERATIONS):
    samples = np.random.choice(choices, size=ITERATIONS)
    count = sum([1 for x in samples if x > THRESHOLD])
    output[count] += 1

output = map(str, map(int, output.tolist()))
with open('eval_test.csv', 'w') as f:
    f.write(",".join(output) + '\n')
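If what you ultimately want is the distance between consecutive hits and how often each distance occurs (as described in the question), another vectorised sketch (variable names here are illustrative, not from the original post) is to generate one long run, find the hit positions, and take their differences:
import numpy as np

rng = np.random.default_rng()
samples = rng.integers(1, 100000, size=1_000_000)  # one long run of generations

hit_positions = np.flatnonzero(samples < 1000)      # indices where a "hit" occurred
gaps = np.diff(hit_positions)                       # distance between consecutive hits
gap_counts = np.bincount(gaps)                      # gap_counts[k] = number of gaps of length k

np.savetxt('gap_counts.csv', gap_counts, fmt='%d', delimiter=',')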

Reducing numpy array for drawing chart

I want to draw a chart in my Python application, but the source numpy array is too large for this (about 1,000,000+ elements). I want to take the mean value of neighboring elements. The first idea was to do it C++-style:
step = 19000  # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
cur = 0
res = []
while cur < len(index):
    next = cur
    while next < len(index) and index[next] == index[cur]:
        next += 1
    res.append(np.mean(value[cur:next]))
    cur = next
but this solution works very slowly. I tried doing it like this:
step = 19000  # every 19 seconds (for example) make a new point with the mean value
dt = <ordered array with time stamps>
value = <some random data that we want to draw>
index = dt - dt % step
data = np.arange(index[0], index[-1] + 1, step)
res = [value[index == i].mean() for i in data]
This solution is slower than the first one. What is the best solution for this problem?
np.histogram can provide sums over arbitrary bins. If you have a time series, e.g.:
import numpy as np
data = np.random.rand(1000) # Random numbers between 0 and 1
t = np.cumsum(np.random.rand(1000)) # Random time series, from about 1 to 500
then you can calculate the binned sums across 5 second intervals using np.histogram:
t_bins = np.arange(0., 500., 5.) # Or whatever range you want
sums = np.histogram(t, t_bins, weights=data)[0]
If you want the mean rather than the sum, remove the weights and divide by the bin tallies:
means = sums / np.histogram(t, t_bins)[0]
This method is similar to the one in this answer.
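Putting the pieces together, a small self-contained example of the binned-mean approach (the np.divide/where guard just avoids warnings for bins that receive no samples; otherwise it matches the snippets above):
import numpy as np

data = np.random.rand(1000)              # values, as above
t = np.cumsum(np.random.rand(1000))      # monotonically increasing time stamps
t_bins = np.arange(0., 500., 5.)         # 5-second bins

sums = np.histogram(t, t_bins, weights=data)[0]
counts = np.histogram(t, t_bins)[0]
# mean per bin; empty bins become NaN instead of raising a divide warning
means = np.divide(sums, counts, out=np.full_like(sums, np.nan), where=counts > 0)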
