I have a variable in a NetCDF file that has a default value if the variable is null. How do you remove this value or change it to 0 when the variable is missing a value?
It sounds like the problem is that when the variable is written into the NetCDF file, some default fill value is inserted wherever data is missing. I am assuming you need to remove these default values after the file has been written, while you are working with the data.
So (depending on how you are accessing the variable) I would pull the variable out of the NetCDF file and assign it to a Python variable. This is the first method that comes to mind:
Use a for loop to step through and replace that default value with 0.
variable = NetCDF_variable  # Assume the default value is 1e10
cleaned_list = []
for i in variable:
    if i == 1e10:
        cleaned_list.append(0)  # 0 or whatever you want to fill here
    else:
        cleaned_list.append(i)
If the default value is a float, you may want to look into numpy.isclose if the above code isn't working. You might also be interested in masking your data in case any computations you do would be thrown off by inserting a 0.
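For example, a minimal sketch of both ideas, assuming a small toy array:

import numpy as np

arr = np.array([1.0, 1e10, 2.5])
arr[np.isclose(arr, 1e10)] = 0          # tolerant float comparison instead of ==

masked = np.ma.masked_values(arr, 0.0)  # or mask the values outright
print(masked.mean())                    # computations then skip the masked entries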
EDIT: User N1B4 provided a much cleaner and more efficient way of doing the exact same thing as above.
variable[variable == 1e10] = 0
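For completeness, a minimal end-to-end sketch of that vectorized replacement, assuming the netCDF4 library and hypothetical file/variable names ('data.nc' and 'my_var'):

import netCDF4
import numpy as np

ds = netCDF4.Dataset('data.nc', 'r+')   # hypothetical file name
data = ds.variables['my_var'][:]        # hypothetical variable name
data = np.asarray(data)                 # plain array (drops any fill-value mask)
data[data == 1e10] = 0                  # the vectorized replacement shown above
ds.variables['my_var'][:] = data        # slice-assign to write it back
ds.close()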
I am using Python. I would like to create a new column which is the log transformation of column 'lights1992'.
I am using the following code:
log_lights1992 = np.log(lights1992)
I obtain the following error:
I have tried two things: 1) adding 1 to each value, and 2) transforming the column 'lights1992' to numeric:
city_join['lights1992'] = pd.to_numeric(city_join['lights1992'])
city_join["lights1992"] = city_join["lights1992"] + 1
However, neither of those two solutions has worked. Variable 'lights1992' is of float64 type. Do you know what the problem can be?
Edit:
The variable 'lights1992' comes from running zonal_stats on a raster 'junk1992'; maybe this affects it:
zs1 = zonal_stats(city_join, junk1992, stats=['mean'], nodata=np.nan)
city_join['lights1992'] = [x['mean'] for x in zs1]
the traceback states:
'DatasetReader' object has no attribute 'log'.
Did you re-assign numpy to something else at some point? I can't find much about 'DatasetReader'; is that a custom class?
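For what it's worth, a hedged sketch of how the name np could end up pointing at a DatasetReader (rasterio is only a guess here, since the question involves rasters):

import numpy as np
import rasterio                     # assumption: rasterio supplied the raster

lights1992 = [1.0, 2.0]             # toy data
np = rasterio.open('junk1992.tif')  # oops: np now names a DatasetReader, not numpy
np.log(lights1992)                  # AttributeError: 'DatasetReader' object has no attribute 'log'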
EDIT:
I think you would need to pass the whole column, because your edit doesn't show a variable named 'lights1992'.
so instead of:
np.log(lights1992)
can you try passing in the DataFrame's column to log?
np.log(city_join['lights1992'])
2ND EDIT:
Since you've reported back that it works I'll dive into the why a little bit.
In your original statement you called the log function and gave it an argument, then you assigned the result to a variable name:
log_lights1992 = np.log(lights1992)
The problem here is that when you give Python text without any quotes, it thinks you are giving it a variable name (see how log_lights1992 sits on the left of the equals sign? You wanted to assign the result of the operation on the right-hand side to the variable name log_lights1992), but in this case I don't think lights1992 had any value!
So there were two ways to make it work, either what I said earlier:
Instead of giving it a variable name, you give .log the column of the city_join dataframe (that's what city_join["lights1992"] is) directly.
Or
You assign the value of that column to a variable name first, then you pass it in to .log, like this:
lights1992 = city_join["lights1992"]
log_lights1992 = np.log(lights1992)
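As a side note, since you were adding 1 before taking the log (presumably to dodge log(0)), numpy has a helper that does both steps at once; a small sketch with toy data:

import numpy as np
import pandas as pd

city_join = pd.DataFrame({'lights1992': [0.0, 4.2, 10.0]})       # toy data
city_join['log_lights1992'] = np.log1p(city_join['lights1992'])  # log(x + 1) in one step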
Hope that clears it up for you!
I'm working with computer simulations and I use a lot of variables that change from one simulation to another. I have to run short but numerous simulations (like 1000+), so keeping track of these is important.
Up until now, I was simply adding a new column with the data inside. So my data would look something like:
DataX, DataY, DataZ, variable1, variable2, variable3, ....
So I was basically making 1 column per variable.
Every time I would need to get new variables I would add them as a new column.
Not efficient, but at least everything was within the same file, which was quite handy tbh.
My internship at my lab is about to end and my tutor asked me to clear up the code and make it so that anyone could keep using it.
The thing is, each of these variables also has 2 further sub-variables.
So I made a new function that gathers all those variables and makes a neat little dataframe, which looks like this:
Parameter    Value  Lambda  Mod
temperature  10     1       0
VarE         1.5    5       0.5
etc.
To make it easily accessible I also turned Parameter into the index, so I can use df_param.loc['VarE','Value'] for instance.
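A small sketch of that setup, using the toy values from the table above:

import pandas as pd

df_param = pd.DataFrame({
    'Parameter': ['temperature', 'VarE'],
    'Value': [10, 1.5],
    'Lambda': [1, 5],
    'Mod': [0, 0.5],
}).set_index('Parameter')

print(df_param.loc['VarE', 'Value'])  # -> 1.5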
However, because of that, they're not all within the same file, which isn't handy.
Since they'll have to use 1000+ data files when plotting, and will have to filter everything, having the parameters separated from the data can lead to mistakes (which isn't possible atm, since everything is within the same file).
If I convert 'Parameter' back into a column, I can easily do that:
Index DataX, DataY, DataZ, Parameter, Value, Lambda, Mod
The issue I have (mostly from a practical standpoint) is that since Parameter isn't the index anymore, I can't do df_param.loc['VarE','Value'] anymore. I would need to know at exactly what index 'VarE' sits and do df_param.loc[index,'Value'], and with well over 15 parameters to pick from, it's a bit sketchy.
Basically, is there a way to have 2 indexes? Like:
One index for DataX, DataY, DataZ (let's call it 'dt'), and one for Value, Lambda and Mod, which would be 'Parameter' (see the sketch below).
So two df within one basically.
Thank you in advance
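One hedged sketch of such a two-level setup is pandas' MultiIndex via concat with keys; the frames and values below are toy stand-ins for the two tables described above:

import pandas as pd

df_data = pd.DataFrame({'DataX': [1, 2], 'DataY': [3, 4], 'DataZ': [5, 6]})
df_param = pd.DataFrame(
    {'Value': [10, 1.5], 'Lambda': [1, 5], 'Mod': [0, 0.5]},
    index=pd.Index(['temperature', 'VarE'], name='Parameter'))

# concat with keys prepends an index level: 'dt' rows and 'Parameter' rows in one frame
combined = pd.concat([df_data, df_param], keys=['dt', 'Parameter'])
print(combined.loc[('Parameter', 'VarE'), 'Value'])  # -> 1.5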
The question I'm asking is: why do we use temporary variables after taking an input? For example, in the code down below, we've requested a 'num' from the user. Then we've copied it into a 'temp'. Why don't we simply continue with 'num'? I can't see any point in copying it into a different variable. Why doesn't the code work if we don't make this swap? Thanks.
It is because in the while loop in the last row you change the value of temp. If you used num instead of temp, you would change its value, and then in the if/else statement you couldn't compare the sum with the original input number.
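Since the original code isn't shown, here is a hedged sketch of the pattern this answer describes (a digit-sum style check; the asker's exact loop may differ):

num = int(input('Enter a number: '))
temp = num               # work on a copy so num keeps the original input
total = 0
while temp > 0:
    total += temp % 10   # peel off the last digit
    temp //= 10          # the loop destroys temp's value digit by digit
if total == num:         # num still holds the input, so the comparison works
    print('match')
else:
    print('no match')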
I am working on a dataset which contains missing values in certain columns. I am trying to use XGBRegressor from the Scikit-Learn wrapper interface for XGBoost. It provides a parameter called 'missing', in which you can enter a float value, or otherwise it takes Python's NaN as the default. So I need help with how to use this parameter to handle missing values in the columns of my dataset. It would be helpful if someone could provide a simple example as well.
The 'missing' parameter works like this: whatever value you provide for it is treated as a missing value. For example, if you provide 0.5 as the missing value, then wherever XGBoost finds 0.5 in your data it treats it as missing. The default is NaN. What XGBoost then does is, based on the data, define one of the branches at each split as the default path: at a given node the data can go in two directions, left or right, and one of the two is made the default based on the data. So whenever a missing value comes in as input (say you defined 0.5 as missing, then whenever 0.5 appears in the data), it takes the default path. Initially I thought it imputes the missing value, but it does not; it just defines one of the paths as the default, and whenever any missing value comes along it takes that default path. This is defined in the paper XGBoost: A Scalable Tree Boosting System.
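A minimal sketch of that in code, with toy data and 0.5 declared as the missing marker (those cells are then routed down each node's learned default path):

import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, 0.5],
              [2.0, 3.0],
              [0.5, 4.0]])         # toy features where 0.5 marks a missing cell
y = np.array([1.0, 2.0, 3.0])

model = XGBRegressor(missing=0.5)  # treat 0.5 as missing instead of NaN
model.fit(X, y)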
It's my understanding you got it mixed up.
The missing parameter only tells the model which value to treat as missing (aka NaN) - the default is np.nan.
If you want to replace the actual missing values with some different value, let's say X, you have to do it on your data before applying the model.
If you have a dataframe df you can:
df.fillna(X)
If you have a np.array array you can:
np.nan_to_num(array)
but note that, by default, the above will replace the np.nan with zeros.
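A quick hedged sketch of both, with a stand-in value X (newer numpy also lets you override the zero default via the nan= argument):

import numpy as np
import pandas as pd

X = -1                               # stand-in replacement value

df = pd.DataFrame({'a': [1.0, np.nan]})
df = df.fillna(X)                    # fillna returns a copy; reassign it

array = np.array([1.0, np.nan])
array = np.nan_to_num(array, nan=X)  # nan= available in numpy >= 1.17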
hope that helps,
I would like to replace all the -999.0 values in the following netCDF file with 0.0. The file is here:
https://www.dropbox.com/s/ohh7tntzm2pj1uw/all_SSP3_BaU_NoCC.nc?dl=0
here's my function:
def modify_nc_value(path_inp, name_var, val_to_modify, new_value):
    """
    Replace value in netCDF variable from val_to_modify to new_value

    Args:
        path_inp: Path to the netCDF file
        name_var: Name of variable for which to modify value; if -1, then change all variables
        val_to_modify: Value to replace
        new_value: Replacement value

    Returns:
    """
    hndl_inp = netCDF4.Dataset(path_inp, 'r+')
    if name_var == -1:
        for var, varin in hndl_inp.variables.iteritems():
            hndl_inp[var][:][hndl_inp[var][:] == val_to_modify] = new_value
    else:
        hndl_inp[name_var][:][hndl_inp[name_var][:] == val_to_modify] = new_value
    hndl_inp.close()  # Write variable back to disk
However, the function does not seem to work. None of the -999.0 values are replaced. How to fix this?
You opened the file in append mode (r+), which will only allow new values to be added. There really is no read/write mode because data sets grow huge quickly. You have two options:
Read the dataset, open a new file (mode is w, clobber is True), write to it with modifications, copy it back over the original dataset.
If the dataset is large, meaning > 5 TB or so, you really want to modify your code to look for -999.0 values at read time instead.
There may be a way to do it in place; I've never tried these approaches myself, but you may have better luck.
Good luck! If you make it work, add it to these answers.
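For reference, one hedged sketch that has worked for some netCDF4 users is to modify a numpy copy and then slice-assign it back through the variable (this keeps the r+ handle and the names from the question; treat it as an assumption to verify on your own file):

import netCDF4

hndl_inp = netCDF4.Dataset(path_inp, 'r+')
for name in hndl_inp.variables:
    data = hndl_inp[name][:]      # numpy copy of the variable's values
    data[data == -999.0] = 0.0    # modify the copy
    hndl_inp[name][:] = data      # slice-assignment writes the copy back
hndl_inp.close()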
If you are happy to consider doing this task from the command line, it is simple to achieve with Climate Data Operators (CDO).
You can define a mask that has 0 where the original field has -999, and then multiply this mask with the original field:
cdo nec,-999.0 file.nc mask.nc
cdo mul file.nc mask.nc new_file.nc
or piped together
cdo mul file.nc -nec,-999.0 file.nc new_file.nc
However, I'm assuming that -999.0 is actually the "missing value" for the field identified in the metadata. In that case you can also achieve this using
cdo setmissval,0 file.nc new_file.nc
The file contents of the variable are identical both ways, but the difference is that the latter command will also change the metadata to identify zero as the "missing value". This means that 0 is correctly identified as missing in software such as ncview. However, note that any occurrence of zero is now treated as missing, including zeros that were already there in the original file!