I have a NetCDF (.nc) file in which the time variable is a bit weird. It does not use a Gregorian calendar but a simple 365-day-a-year calendar (i.e. leap years are not included). This means the unit is also a bit off, but nothing too worrisome.
xarray.DataArray 'days' (time: 6570)>
array([730817., 730818., 730819., ..., 737384., 737385., 737386.])
Dimensions without coordinates: time
Attributes:
units: days_since_Jan11900
long_name: calendar_days
730817 represents 01-01-2001 and 737386 represents 31-12-2018
I want to obtain a certain time period of the data set for multiple years, just as you can do with cdo -selmonth, -selday etc. But of course, with no real date axis, I cannot use those otherwise brilliant options. My idea was to slice the time range I need with NumPy-style indexing, but I do not know how and cannot seem to find adequate answers on SO.
In my specific case, I need to slice a range going from May 30th (150th day of the year) to Aug 18th (229th day of the year) every year. I know the first slice should be something like:
ds = ds.loc[dict(time = slice(149,229))]
But, that will only give me the range for 2001, and not the following years.
I cannot do it with cdo -ntime as it does not recognize the time unit.
How do I make sure that I get the range for the following 17 years too, thereby skipping the 285 days in between the ranges I need?
I fixed it through Python. It can probably be done in a smarter way, but I manually picked the ranges I needed, with help from @dl.meteo, using np.r_:
ds = ds.sel(time=np.r_[149:229,514:594,879:959,1244:1324,1609:1689,1974:2054,2339:2419,2704:2784,3069:3149,3434:3514,3799:3879,4164:4244,4529:4609,4894:4974,5259:5339,5624:5704,5989:6069,6354:6434])
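Since the 365-day calendar makes every year exactly 365 time steps long, the same index ranges can also be generated instead of typed out by hand. A minimal sketch with NumPy, equivalent to the hand-written np.r_ call above:

```python
import numpy as np

# Day-of-year window 149:229 (May 30 - Aug 18), repeated for 18 years;
# each year is exactly 365 time steps in a 365-day calendar.
idx = np.concatenate([np.arange(149, 229) + 365 * year for year in range(18)])

# ds = ds.sel(time=idx)  # same selection as the explicit np.r_ call
```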
From your answer it seems you know the timeslices, so you could also extract them with cdo using
cdo seltimestep,149/229 in.nc out.nc
etc
but if you want to do it (semi-)automatically with cdo, that should also be possible, since cdo supports a 365-day calendar. I think you need to set the calendar to this type and then probably reset the time units and the reference time. Without an example file I can't test this, but I think something like this could work:
Step 1: set the calendar type to 365_day and then set the reference date to your first date:
cdo setcalendar,365_day infile.nc out1.nc
cdo setreftime,2000-01-01,00:00:00 out1.nc out2.nc
You then need to see what the first date in the file is; you can pipe it to less:
cdo showdate out2.nc | less
Step 2: you can then shift the time axis to the correct date using cdo shifttime.
e.g. if showdate gives the first day as 2302-04-03, then you can simply do
cdo shifttime,-302years -shifttime,-3months -shifttime,-2days out2.nc out3.nc
to correct the dates...
Then you should be able to use all the cdo functionality on the file to do the manipulation as you wish.
Goal: extract dates from medical records (stored in pandas Series, dates are in all possible formats)
For numerical dates I used:
str.extractall(r'((?:\b\d{1,2}[/]){1,2}(?:(?:\d{2}\b)|\b\d{4}\b))')
Problem:
Input text1:
"(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; Independent"
Output1: 5/11/85 (as wished) but also: 16/22
Input text2:
[text...] (7/11/77) CBC: 4.9/36/308 Pertinent [...]:
Output2: 7/11/77 (as wished) but also 9/36
The second case especially is hard, because transforming 9/36 into a date returns September 2036, so it can't be filtered out that way.
Adding [^-] to the pattern makes it even worse.
The dates are everywhere in the text, like:
[...] has also taken diet pills (last episode in Feb 1993) but [...]
Feb 1993 etc. wasn't a problem.
You should specify what "all formats" means. In your example you just show 1 format. Could "JAN-02-2016" "01/02/2016" "02/01/2016" all be present? European and US time formats? etc?
In your example, however, it looks like dates are always at the start of the line and surrounded by parentheses, which makes it fairly straightforward:
^\((\d+/\d+)\)|^\((\d+/\d+/\d+)\)
The main rule when you are working with regexes is: know your data. You must compose as accurate a regex as you can.
Then I would suggest you parse such crude dates into actual, full-fledged date objects. This serves two main goals: first, you filter out false regex matches; second, you can then handle your dates in a much more convenient way using the date object's methods rather than comparing raw text strings. For example, you can access a date's day, month or year, compare it with a desired value, and filter out dates based on that comparison.
For parsing dates I would recommend one of the sophisticated date-parsing libraries, such as dateutil or dateparser, which handle a lot of tricky details for you, for free.
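A minimal sketch combining both steps: the anchored, parenthesized pattern plus parsing into a real date object. It uses the stdlib's `datetime.strptime` rather than dateutil for self-containedness, and assumes US-style month/day order:

```python
import re
from datetime import datetime

record = "(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; Independent"

# Only a slash-date wrapped in parentheses matches, so 16/22 is ignored.
m = re.search(r'\((\d{1,2}/\d{1,2}/\d{2,4})\)', record)
if m:
    parsed = datetime.strptime(m.group(1), '%m/%d/%y')  # -> 1985-05-11
```

Once parsed, `parsed.year`, `parsed.month` and `parsed.day` are available for the comparison-based filtering described above.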
I am having trouble understanding the difference between a PeriodIndex and a DatetimeIndex, and when to use which. In particular, it always seemed more natural to me to use Periods as opposed to Timestamps, but recently I discovered that Timestamps seem to provide the same indexing capability, can be used with the TimeGrouper, and also work better with Matplotlib's date functionality. So I am wondering: is there ever a reason to use Periods (a PeriodIndex)?
Periods can be used to check if a specific event occurs within a certain period. Basically, a Period represents an interval while a Timestamp represents a point in time.
# For example, this will return True since the period is 1Day. This test cannot be done with a Timestamp.
p = pd.Period('2017-06-13')
test = pd.Timestamp('2017-06-13 22:11')
p.start_time < test < p.end_time
I believe the simplest reason to choose Periods or Timestamps is whether the attributes of a Period or of a Timestamp are needed for your code.
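To illustrate the interval-versus-instant distinction with a small sketch (the monthly frequency here is just an example choice):

```python
import pandas as pd

# A Period covers an interval; a Timestamp marks a single instant.
month = pd.Period('2017-06', freq='M')        # all of June 2017
instant = pd.Timestamp('2017-06-13 22:11')

# Interval membership is natural to express with a Period:
inside = month.start_time <= instant <= month.end_time  # True
```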
This question already has answers here:
Convert weird Python date format to readable date
I'm importing data from an Excel spreadsheet into Python. My dates are coming through in a bizarre format with which I am not familiar and which I cannot parse.
in excel: (7/31/2015)
42216
after I import it:
u'/Date(1438318800000-0500)/'
Two questions:
what format is this and how might I parse it into something more intuitive and easier to read?
is there a robust, swiss-army-knife-esque way to convert dates without specifying input format?
Timezones necessarily make this more complex, so let's ignore them...
As @SteJ remarked, what you get is (close to) the time in seconds since 1 January 1970; there is a Wikipedia article on Unix time describing how that's normally used. Oddly, the string you get seems to have a timezone (-0500, EST in North America) attached. That makes no sense if it's proper Unix time (which is always in UTC), but we'll pass on that...
Assuming you can get it reduced to a number (sans timezone), the conversion into something sensible in Python is really straightforward (note the reduction in precision; your original number is the number of milliseconds since the epoch, rather than the standard number of seconds since the epoch):
from datetime import datetime
time_stamp = 1438318800
time_stamp_dt = datetime.fromtimestamp(time_stamp)
You can then get time_stamp_dt into any format you think best using strftime, e.g., time_stamp_dt.strftime('%m/%d/%Y'), which pretty much gives you what you started with.
Now, assuming that the format of the string you provided is fairly regular, we can extract the relevant time quite simply like this:
s = '/Date(1438318800000-0500)/'
time_stamp = int(s[6:16])
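Putting the two steps together in one sketch (the offset in the string is ignored and the result rendered in UTC, since the discussion above sets timezones aside):

```python
from datetime import datetime, timezone

s = '/Date(1438318800000-0500)/'

# The 13 digits of milliseconds sit between "(" and the offset.
ms = int(s[6:19])
dt = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

print(dt.strftime('%m/%d/%Y'))  # 07/31/2015
```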
In pandas, I've converted some of my dataset from US/Eastern and some from America/Chicago:
data_f1 = data_f[:'2007-04-26 16:59:00']
data_f1.index = data_f1.index.tz_localize('US/Eastern', infer_dst=True).tz_convert('Europe/London')
data_f2 = data_f['2007-04-26 17:00:00':]
data_f2.index = data_f2.index.tz_localize('America/Chicago', infer_dst=True).tz_convert('Europe/London')
data = data_f1.append(data_f2)
I have two questions about this.
(1) Does tz_convert() account for the DST changes between NY (or Chicago) time and London? Is there any documentation to support this? I couldn't find it anywhere.
(2) The output looks like this:
I'm not sure what the "+01:00" at the end of the time stamp is, but I think it has something to do with DST transitions? What is the "+" relative to, exactly? I'm not sure what it means or why it's necessary: if I convert from US/Eastern 14:00 to Europe/London 19:00, shouldn't it simply be 19:00, not 19:00+01:00? Why is that added?
When I output to Excel, I have to manually chop off everything after the "+". Is there any option to simply not output it to begin with (unless it's actually important)?
Thanks for your help in advance!
UPDATE:
The closest thing I've found to stripping the +'s is here: Convert pandas timezone-aware DateTimeIndex to naive timestamp, but in certain timezone but it seems like this may take a long time with a lot of data. Is there not a more efficient way?
The way I used to solve this was to output to a .csv, read it back in (which makes it time zone naive but keeps the time zone it was in), then strip the +'s.
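For reference, the `tz_localize(None)` approach from that linked question is a single vectorized call on the whole index, so it should be far faster than a CSV round-trip even for large data. A sketch using the example times from the question:

```python
import pandas as pd

idx = pd.date_range('2007-04-26 14:00', periods=3, freq='h',
                    tz='US/Eastern').tz_convert('Europe/London')

# Drop the "+01:00" suffix in one vectorized step, keeping London wall time:
naive = idx.tz_localize(None)   # first entry: 2007-04-26 19:00:00
```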
I am new to Python. I am looking for ways to extract/tag date- and time-specific information from text, e.g.:
1. I will meet you tomorrow
2. I had sent it two weeks back
3. Waiting for you last half an hour
I found timex from nltk_contrib; however, I found a couple of problems with it:
https://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/timex.py
b. Not sure of the Date data type passed to ground(tagged_text, base_date)
c. It deals only with dates, i.e. granularity at the day level. It can't find expressions like "next one hour" etc.
Thank you for your help
b) The data type that you need to pass to ground(tagged_text, base_date) is an instance of the datetime.date class which you'd initialize using something like:
from datetime import date
base_date = date.today()
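The grounding step itself is then plain `datetime` arithmetic; for instance, a phrase like "two weeks back" from the examples above resolves relative to `base_date`. This is a stdlib sketch of the idea, not the timex API itself:

```python
from datetime import date, timedelta

base_date = date.today()

# "I had sent it two weeks back" grounds to:
two_weeks_back = base_date - timedelta(weeks=2)
```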