Replace values in netCDF file using Python

I would like to replace all the -999.0 values in the following netCDF file with 0.0. The file is here:
https://www.dropbox.com/s/ohh7tntzm2pj1uw/all_SSP3_BaU_NoCC.nc?dl=0
Here's my function:
def modify_nc_value(path_inp, name_var, val_to_modify, new_value):
    """
    Replace value in netCDF variable from val_to_modify to new_value

    Args:
        path_inp: Path to the input netCDF file
        name_var: Name of variable for which to modify value; if -1, change all variables
        val_to_modify: Value to be replaced
        new_value: Replacement value

    Returns:
        Nothing; the file is modified in place
    """
    hndl_inp = netCDF4.Dataset(path_inp, 'r+')

    if name_var == -1:
        for var, varin in hndl_inp.variables.iteritems():
            hndl_inp[var][:][hndl_inp[var][:] == val_to_modify] = new_value
    else:
        hndl_inp[name_var][:][hndl_inp[name_var][:] == val_to_modify] = new_value

    hndl_inp.close()  # Write variable back to disk
However, the function does not seem to work: none of the -999.0 values are replaced. How can I fix this?

You opened the file in append mode (r+), which will only let you add new values. There really is no full read/write mode, because data sets grow huge quickly. You have two options:
Read the dataset, open a new file (mode w, clobber True), write it out with your modifications, then copy it back over the original dataset (a sketch of this follows below).
If the dataset is large, meaning > 5 TB or so, you really want to modify the code that consumes the data to treat -999.0 as missing instead.
There may be a way to modify values in place; I've never tried it myself, but you may have better luck.
Good luck! If you make it work, add it to these answers.
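A minimal sketch of the first option with the netCDF4 module, assuming every floating-point variable should be scanned; the helper name and output path are made up for illustration:

import netCDF4
import numpy as np

def copy_with_replacement(path_in, path_out, val_to_replace=-999.0, new_value=0.0):
    # Hypothetical helper: copy a netCDF file, replacing one value everywhere
    with netCDF4.Dataset(path_in, 'r') as src, \
         netCDF4.Dataset(path_out, 'w', clobber=True) as dst:
        dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
        for name, dim in src.dimensions.items():
            dst.createDimension(name, None if dim.isunlimited() else len(dim))
        for name, var in src.variables.items():
            fill = getattr(var, '_FillValue', None)
            out = dst.createVariable(name, var.dtype, var.dimensions, fill_value=fill)
            out.setncatts({a: var.getncattr(a) for a in var.ncattrs() if a != '_FillValue'})
            data = var[:]  # an in-memory numpy array, so item assignment sticks
            if np.issubdtype(data.dtype, np.floating):
                data[data == val_to_replace] = new_value
            out[:] = data

copy_with_replacement('all_SSP3_BaU_NoCC.nc', 'all_SSP3_BaU_NoCC_fixed.nc')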

If you are happy to consider doing this task from the command line, it is simple to achieve with climate data operators (CDO)
You can define a mask that is 0 where the original field equals -999 and 1 elsewhere, and then multiply this mask with the original field.
cdo nec,-999.0 file.nc mask.nc
cdo mul file.nc mask.nc new_file.nc
or piped together
cdo mul file.nc -nec,-999.0 file.nc new_file.nc
However, I'm assuming that -999.0 is actually the "missing value" for the field identified in the metadata. In that case you can also achieve this using
cdo setmissval,0 file.nc new_file.nc
The file contents of the variable are identical either way, but the latter command also changes the metadata so that zero is identified as the "missing value". This means that 0 is correctly treated as missing in software such as ncview. However, note that any occurrence of zero is now missing, including those that were already present in the original file!


Generate a unique ID every time a script is called with Python

What is the best way to create a Python script which, every time it is called, ALWAYS generates a new UNIQUE ID (auto-incrementing)?
You run the script and it tells you 1; then you close the script, open it again, and it tells you 2.
The purpose is to create a script that will be used across the project, and this ID will be used to track the latest changes and so on.
P.S. I'm not asking for a function that does it within a single run.
import uuid
uniqueid = uuid.uuid1()
Since you didn't provide any code, I will also not provide any code.
Solution 1: Unique ID
1) TIME: create a function that gives you a timestamp
2) ID: create a function that generates a long string of random numbers and letters
This is of course 'risky' because there is a chance you will generate an already existing ID, but statistically it is so unlikely that it is practically impossible.
Save it in a file or somewhere.
Solution 2: offset - incremental
1) Have a file with a 0 in it.
2) Open the file, read the line, convert it to an integer, increment it by 1, and write it back to the file.
Note:
Your title is misleading. One moment you talk about a UNIQUE ID, the next moment you are talking about an offset. A unique ID and counting how many times a Python script has run are quite different ideas.
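A minimal sketch of Solution 1 using only the standard library; the function name is made up:

import secrets
import time

def make_unique_id():
    # Timestamp plus a random suffix; a collision is statistically
    # negligible, though not strictly impossible.
    timestamp = time.strftime('%Y%m%d%H%M%S')
    suffix = secrets.token_hex(8)  # 16 random hex characters
    return '{}-{}'.format(timestamp, suffix)

print(make_unique_id())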
I assume you have a script that generates some result every time it is executed. You then need a value that (1) distinguishes one result from another and (2) shows which result came last. Is that right? If so, we have many options here. In the simplest case (a script always running on the same machine) I would suggest two options.
Save a count to a file
In this case, you would have a file and would read the number from it:
try:
    with open('count.txt') as count_file:
        content = count_file.read()
        count = int(content)
except Exception:
    count = 0
After doing whatever your script does, you would write to the file the value you've read, but incremented:
with open('count.txt', 'w') as count_file:
    count_file.write(str(count + 1))
Save the timestamp
A simpler option, however, is not to increment a value but get a timestamp. You could use time.time(), that returns the number of seconds since Unix epoch:
>>> import time
>>> time.time()
1547479233.9383247
You will always know which result came later than the others. Personally, however, I would rather format the current time, it is easier to read and reason about:
>>> from datetime import datetime
>>> datetime.now().strftime('%Y%m%d%H%M%S')
'20190114132407'
Those are basic ideas, you may need to pay attention to corner cases and possible failures (especially with the file-based solution). That said, I guess those are quite viable first steps.
A technical note
What you want here is for a program to remember a piece of information between two or more executions, and we have a technical term for that: the information should be persistent. Since you asked for an auto-incrementing feature, you wanted a persistent count. I suspect, however, you do not need that if you use the timestamp option. It is up to you to decide what to do here.
I had the same situation. I ended up creating a CSV file so that I could map variable names.
import sys
import pandas as pd

def itemId_generator(itemIdLocation):
    # Import the current ItemId from the csv file
    df = pd.read_csv(itemIdLocation)
    # Value which is the current ItemId in the csv file
    ItemId = df.loc[0, 'ItemId']
    # Cap the maximum ItemId and wrap back to 1
    if ItemId >= 10000:
        df.loc[0, 'ItemId'] = 1
    elif ItemId < 10000:
        # Update the column value/data
        df.loc[0, 'ItemId'] = df.loc[0, 'ItemId'] + 1
    else:
        print("Invalid value returned")
        sys.exit()
    # Write the new ItemId back into the file
    df.to_csv(itemIdLocation, index=False)
    # .item() converts a numpy integer to a plain int
    return str(ItemId.item())
If there is any chance of the file being accessed concurrently, it is best to lock the file. Keep trying if the file is locked.
http://tilde.town/~cristo/file-locking-in-python.html
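A minimal sketch of the locking idea for the plain-text counter variant, assuming a POSIX system (fcntl); the file name is made up:

import fcntl

def read_and_increment(path='count.txt'):
    # Hold an exclusive lock while reading and updating the counter;
    # concurrent callers block here until the lock is released.
    with open(path, 'a+') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        f.seek(0)
        raw = f.read().strip()
        count = int(raw) if raw else 0
        f.seek(0)
        f.truncate()
        f.write(str(count + 1))
        return count + 1  # the lock is released when the file is closed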
Old answer:
You could store it as an environment variable on the system. If it is not set, initialise it to 1; otherwise increment it by 1.

NetCDF variables that have a fill value / missing value

I have a variable in a NetCDF file that has a default value if the variable is null. How do you remove this value or change it to 0 when the variable is missing a value?
It sounds like the problem is that when the variable is populated into the NetCDF file, it is set to insert some default value for values that are missing. Now, I am assuming that you need to remove these default values after the file has been written and you are working with the data.
So (depending on how you are accessing the variable) I would pull the variable out of the NetCDF file and assign it to a python variable. This is the first method that comes to mind.
Use a for loop to step through and replace that default value with 0
variable = NetCDF_variable  # Assume the default value is 1e10
cleaned_list = []
for i in variable:
    if i == 1e10:
        cleaned_list.append(0)  # 0 or whatever you want to fill in here
    else:
        cleaned_list.append(i)
If the default value is a float, you may want to look into numpy.isclose if the above code isn't working. You might also be interested in masking your data in case any computations you do would be thrown off by inserting a 0.
EDIT: User N1B4 provided a much cleaner and efficient way of doing the exact same thing as above.
variable[variable == 1e10] = 0
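A short sketch of the float-safe comparison and the masking alternative mentioned above, assuming the same 1e10 placeholder:

import numpy as np

data = np.asarray(variable[:])        # pull the netCDF variable into memory
data[np.isclose(data, 1e10)] = 0      # float-safe replacement of the default value

masked = np.ma.masked_values(variable[:], 1e10)  # or mask instead of overwriting with 0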

Big data File: Read and Create structured file

I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem; I have tried writing scripts in both Python and C++ to load in the file, create long strings, and write to a file line-by-line. It seems, however, that neither language is capable of handling the task at hand. Does anyone have any suggestions as to how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop. You can run a stand-alone MapReduce program. The mapper will output the first column as the key and the second column as the value. All outputs with the same key will go to one reducer, so you end up with a key and a list of values for that key. You can run through the values list and output (key, valueString), which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as I suggested (a streaming sketch follows below). However, I've not tried to scale 20 GB of data on a stand-alone Hadoop system. You may try. Hope this helps.
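A hedged sketch of that mapper/reducer pair using Hadoop Streaming (plain Python reading stdin and writing stdout); the two files and their names are assumptions:

# mapper.py -- emit "key<TAB>value" for every input pair
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print('{}\t{}'.format(parts[0], parts[1]))

# reducer.py -- Streaming sorts by key, so all values for a key arrive together
import sys
from itertools import groupby

pairs = (line.rstrip('\n').split('\t', 1) for line in sys.stdin)
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    print('{}: {}'.format(key, ', '.join(v for _, v in group)))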
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list would also work.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
Open an output file for each key.
While iterating over the lines of the source file, append each value to the output file for its key.
Join the output files (a sketch of this follows below).
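A minimal Python sketch of the per-key bucket-file approach, assuming the keys are small integers so the number of open files stays manageable; the file names are made up:

writers = {}
with open('input.txt') as src:
    for line in src:
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key not in writers:
            writers[key] = open('bucket_{}.txt'.format(key), 'w')
        writers[key].write(value + '\n')

for f in writers.values():
    f.close()

with open('output.txt', 'w') as out:
    for key in sorted(writers, key=int):
        with open('bucket_{}.txt'.format(key)) as bucket:
            out.write('{}: {}\n'.format(key, ', '.join(bucket.read().split())))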
An interesting thought found also on Stack Overflow
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that have the same "key" (or "left") value, in sorted order, and print that information to an outfile (if the database itself is not a good output format for you).
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
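A condensed sketch of the sqlite3 route, letting GROUP BY do the grouping rather than the MIN/MAX scan described above; the file names are made up and the value order within each output line is not guaranteed:

import sqlite3

conn = sqlite3.connect('pairs.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs (key INTEGER, value INTEGER)')

# Stream the large input file into the table without holding it all in memory
with open('input.txt') as src:
    rows = (tuple(map(int, line.split())) for line in src if line.strip())
    conn.executemany('INSERT INTO pairs VALUES (?, ?)', rows)
conn.commit()

with open('output.txt', 'w') as out:
    query = 'SELECT key, GROUP_CONCAT(value, ", ") FROM pairs GROUP BY key ORDER BY key'
    for key, values in conn.execute(query):
        out.write('{}: {}\n'.format(key, values))
conn.close()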

How to read NetCDF variable float data into a Numpy array with the same precision and scale as the original NetCDF float values?

I have a NetCDF file which contains a variable with float values with precision/scale == 7/2, i.e. there are possible values from -99999.99 to 99999.99.
When I take a slice of the values from the NetCDF variable and look at them in my debugger, I see that the values in my array now have more precision/scale than what I see in the original NetCDF. For example, when I look at the values in the ToosUI/ncdump viewer they display as '-99999.99' or '12.45', but when I look at the values in the slice array they look like '-99999.9921875' (a greater scale length). So if I'm using '-99999.99' as the expected value to indicate a missing data point, then I won't get a match with what gets pulled into the slice array, since those values have a greater scale length and the additional digits in the scale are not just zeros for padding.
For example I see this if I do a ncdump on a point within the NetCDF dataset:
Variable: precipitation(0:0:1, 40:40:1, 150:150:1)
float precipitation(time=1348, lat=180, lon=360);
:units = "mm/month";
:long_name = "precipitation totals";
data:
{
{
{-99999.99}
}
}
However if I get a slice of the data from the variable like so:
value = precipitationVariable[0:1:1, 40:41:1, 150:151:1]
then I see it like this in my debugger (Eclipse/PyDev):
value == ndarray: [[[-99999.9921875]]]
So it seems as if the NetCDF dataset values that I read into a Numpy array are not being read with the same precision/scale of the original values in the NetCDF file. Or perhaps the values within the NetCDF are actually the same as what I'm seeing when I read them, but what's shown to me via ncdump is being truncated due to some format settings in the ncdump program itself.
Can anyone advise as to what's happening here? Thanks in advance for your help.
BTW I'm developing this code using Python 2.7.3 on a Windows XP machine and using the Python module for the NetCDF4 API provided here: https://code.google.com/p/netcdf4-python/
There is no simple way of doing what you want, because the values are stored in single precision (float32) and -99999.99 cannot be represented exactly at that precision, so they will always carry trailing digits after the .99.
However, netCDF already provides a mechanism for missing data (see the best practices guide). How was the netCDF file written in the first place? The missing_value is a special variable attribute that should be set to the value used to flag missing data. In the C and Fortran interfaces, when the file is created all variable values are initialised to the fill value, and the missing_value attribute can be set when the variable is defined. See more about the fill values in the C and Fortran interfaces. This is the recommended approach. The Python netCDF4 module plays well with these missing values, and such variables are read as masked arrays in numpy.
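A small sketch of that mechanism with the netCDF4 module; the file and variable names are invented for illustration:

import netCDF4
import numpy as np

# Write a variable with a declared fill/missing value...
with netCDF4.Dataset('example.nc', 'w') as ds:
    ds.createDimension('x', 5)
    var = ds.createVariable('precip', 'f4', ('x',), fill_value=-99999.99)
    var.missing_value = np.float32(-99999.99)
    var[:] = [1.0, 2.0, -99999.99, 4.0, 5.0]

# ...and read it back: the missing entry comes back masked
with netCDF4.Dataset('example.nc') as ds:
    data = ds.variables['precip'][:]   # a numpy masked array
    print(data.mask)                   # only the third element is masked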
If you must work with the file you currently have, then I'd suggest creating a mask to cover values around your missing value:
import numpy as np
value = precipitationVariable[:]
mask = (value < -99999.98) & (value > -100000.00)
value = np.ma.MaskedArray(value, mask=mask)

openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I say yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that look like "numbers stored as text" to convert to numbers. They are text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?
If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
@@ -134,8 +134,10 @@
         data_type = element.get('t', 'n')
         if data_type == Cell.TYPE_STRING:
             value = string_table.get(int(value))
-
-        ws.cell(coordinate).value = value
+            ws.cell(coordinate).set_value_explicit(value=value,
+                                                   data_type=Cell.TYPE_STRING)
+        else:
+            ws.cell(coordinate).value = value

         # to avoid memory exhaustion, clear the item after use
         element.clear()
Cell.value is a property; assigning to it calls Cell._set_value, which then calls Cell.bind_value, whose docstring says: "Given a value, infer type and display options". Since the types of the values are already recorded in the XML file, those should be used (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.
