Change all NaT values in Pandas dataframe to Timedelta 00:00:00 - python

I have a dataframe in Pandas, and one column "timeOff" has some NaT values.
All I want to do is change all the NaT values to a Timedelta of '00:00:00'.
This is my current output:
[screenshot: output with NaT values]
I have tried to run this line of code:
replaceNaT = pd.to_timedelta('00:00:00')
print(replaceNaT)
startEndEventsDataframe['timeOff'] = np.where(pd.isnull(startEndEventsDataframe['timeOff']) == True, replaceNaT, startEndEventsDataframe['timeOff'])
But this destroys all the values in my dataframe column, as seen below:
[screenshot: column values after running the code above]
I would like all the values that are not NaT to remain unchanged, and all values that are NaT to become Timedelta values of '00:00:00'.
Thanks for the help.

So, as it turns out, I figured it out on my own, but I thought I would post the solution for anybody who might need it in the future.
I got rid of the "replaceNaT" and simply wrote "0" in where NaT was found. I guess Timedeltas are stored internally as integers (counts of the lowest resolution of time they measure) and are only rendered as durations when they are displayed?
Anyways, here is the code change that worked for me:
startEndEventsDataframe['timeOff'] = np.where(pd.isnull(startEndEventsDataframe['timeOff']) == True, 0, startEndEventsDataframe['timeOff'])
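For reference, here is a minimal, self-contained sketch of that fix on a throwaway frame (the sample values are made up, and the fillna alternative at the end is my own suggestion rather than something from the question):

import numpy as np
import pandas as pd

# small frame with a timedelta column containing NaT, mirroring the question
startEndEventsDataframe = pd.DataFrame(
    {'timeOff': pd.to_timedelta(['01:30:00', None, '00:45:00'])})

# the fix from above: where the column is null, write 0; other values are left untouched
startEndEventsDataframe['timeOff'] = np.where(
    startEndEventsDataframe['timeOff'].isnull(), 0,
    startEndEventsDataframe['timeOff'])
print(startEndEventsDataframe)

# an arguably simpler alternative that keeps the timedelta dtype:
# startEndEventsDataframe['timeOff'] = startEndEventsDataframe['timeOff'].fillna(pd.Timedelta(0))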

Related

How can I calculate the number of days between two dates with different format in Python?

I have a pandas dataframe with a column of orderdates formatted like this: 2019-12-26.
However, when I take the max of this column it gives 2019-12-12, while it is actually 2019-12-26. It makes sense because my date format is Dutch and the max() function uses the 'American' (correct me if I'm wrong) format.
This means that my calculations aren't correct.
How can I change the way the function calculates? Or, if that's not possible, how can I change the format of my date column so the calculations are correct?
[In] df['orderdate'] = df['orderdate'].astype('datetime64[ns]')
print(df["orderdate"].max())
[Out] 2019-12-12 00:00:00
Thank you!
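For what it's worth, a minimal sketch that rules out any day/month guessing by parsing with an explicit format string (the sample values are made up):

import pandas as pd

df = pd.DataFrame({'orderdate': ['2019-12-12', '2019-12-26']})

# parse with an explicit format so no locale-dependent guessing takes place
df['orderdate'] = pd.to_datetime(df['orderdate'], format='%Y-%m-%d')
print(df['orderdate'].max())   # 2019-12-26 00:00:00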

Formatting date data in NumPy array

I would be really grateful for advice. I had an exercise as written below:
The first column (index 0) contains year values as four digit numbers
in the format YYYY (2016, since all trips in our data set are from
2016). Use assignment to change these values to the YY format (16) in
the test_array ndarray.
I used a code to solve it:
test_array[:,0] = test_array[:,0]%100
But I'm sure there has to be a more universal and smarter way to get the same result with datetime or something else, and I can't find it. I tried different variations of this code, but I don't get what's wrong:
dt.datetime.strptime(str(test_array[:,0]), "%Y")
test_array[:,0] = dt.datetime.strftime("%y")
Could you help me with this, please?
Thank you
Converting the year from YYYY format to YY format requires an intermediate datetime value, on which operations such as strftime can be carried out in the following manner:
import datetime
df.iloc[:, 0] = df.iloc[:, 0].apply(lambda x: datetime.datetime(int(x), 1, 1).strftime('%y'))
Here, to obtain the datetime values, we need 3 args: year, month and day. We have the year, and the rest are assumed to be 1 by default.
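Applied to the NumPy test_array from the question, the same idea could look roughly like this (the sample rows are made up for illustration):

import datetime as dt
import numpy as np

# hypothetical two-column array whose first column holds YYYY years
test_array = np.array([[2016.0, 5.0], [2016.0, 7.0]])

# go through a datetime object per year and format it back as two digits
test_array[:, 0] = [int(dt.datetime(int(year), 1, 1).strftime('%y'))
                    for year in test_array[:, 0]]
print(test_array[:, 0])   # [16. 16.]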

Change date to day + 1 in a pandas dataframe where time = 00:00:00

As you can see in the image of my dataframe, I have time points where midnight is a day behind what it should be, which affects my time series graphs.
I tried df.replace() where I passed in lists a and b:
df.replace(to_replace=a,value=b,inplace=True)
This just replaced all values in a with the same single value from b, instead of with the corresponding values in the list.
I also tried passing in a dictionary but received:
ValueError: "Replacement not allowed with overlapping keys and values"
Is there any way I can change either the dates in either the date column or the date_time column to day+1 for instances where time is 00:00:00 ?
Maybe using pandas map() method with strftime format?
Maybe you can do something along these lines:
df.loc[df['time'] == datetime.time(0, 0), 'date'] += datetime.timedelta(days=1)
It selects the rows where the time is 00:00. Only on those rows do you increase the date column by one day.
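A small self-contained sketch of that line in action (the example frame is made up):

import datetime
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'time': [datetime.time(0, 0), datetime.time(12, 30)],
})

# shift the date forward one day only on the midnight rows
df.loc[df['time'] == datetime.time(0, 0), 'date'] += datetime.timedelta(days=1)
print(df)   # the first row's date becomes 2020-01-02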

Count occurrences of certain values in dask.dataframe

I have a dataframe like this:
df.head()
day time resource_record
0 27 00:00:00 AAAA
1 27 00:00:00 A
2 27 00:00:00 AAAA
3 27 00:00:01 A
4 27 00:00:02 A
and want to find out how many occurrences of certain resource_records exist.
My first try was using the Series returned by value_counts(), which seems great, but does not allow me to exclude some labels afterwards, because there is no drop() implemented in dask.Series.
So I tried just to not print the undesired labels:
for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        continue
    print('\t{0}\t{1}'.format(row[1], row[0]))
This works fine, but what if I ever want to work further with this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:
records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()
I looked for a method which would allow me to just count individual values, but count() does count all values that are not NaN.
Then I found str.contains(), but I don't know how to handle the undocumented Scalar type I get returned with this code:
print(df.resource_record.str.contains('A').sum())
Output:
dd.Scalar<series-..., dtype=int64>
But even after looking at Scalar's code in dask/dataframe/core.py I didn't find a way of getting its value.
How would you efficiently count the occurrences of a certain set of values in your dataframe?
In most cases pandas syntax will work with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until the compute, you are merely constructing the graph which defines the action.
I believe the simplest solution to your question is this:
df[df.resource_record!='AAAA'].resource_record.value_counts().compute()
Where the expression in the selector square brackets could be some mapping or function.
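A minimal end-to-end sketch of that, using a made-up frame shaped like the one in the question:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'day': [27, 27, 27, 27, 27],
    'time': ['00:00:00', '00:00:00', '00:00:00', '00:00:01', '00:00:02'],
    'resource_record': ['AAAA', 'A', 'AAAA', 'A', 'A'],
})
df = dd.from_pandas(pdf, npartitions=1)

# filter out the unwanted label first, then count what is left
counts = df[df.resource_record != 'AAAA'].resource_record.value_counts().compute()
print(counts)   # A    3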
One quite nice method I found is this:
counts = df.resource_record.mask(df.resource_record.isin(['AAAA'])).dropna().value_counts()
First we mask all entries we'd like to get removed, which replaces the value with NaN. Then we drop all rows with NaN and last count the occurrences of unique values.
This requires df to have no pre-existing NaN values; otherwise those rows would be removed as well.
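For completeness, run against the same made-up frame as in the sketch above, this route gives the same counts:

counts = df.resource_record.mask(df.resource_record.isin(['AAAA'])).dropna().value_counts().compute()
print(counts)   # A    3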
I expect something like
df.resource_record.drop(df.resource_record.isin(['AAAA']))
would be faster, because I believe drop would run through the dataset once, while mask + dropna runs through the dataset twice. But drop is only implemented for axis=1, and here we need axis=0.

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the earliest date for which data is available in the shorter series, and remove data in the 2 columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NAs in one of the columns for more recent dates).
You can use idxmax on the inverted series s = df['osr'][::-1] and then take a subset of df:
print(df)
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print(s)
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
#1990-08-20 00:00:00
print(df[df.index > maxnull])
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good, or that you don't care about dropping rows in the middle; it depends on how sequential you need to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
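As a minimal sketch of the dropna route on a frame shaped like the fragment above:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'osr': [np.nan, np.nan, 352.00, 353.25, 351.75],
     'go': [239.75, 251.50, 265.00, 274.25, 290.25]},
    index=pd.to_datetime(['1990-08-17', '1990-08-20', '1990-08-21',
                          '1990-08-22', '1990-08-23']))

# drop every row where either series is missing a value
print(df.dropna())   # keeps only 1990-08-21 through 1990-08-23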
Original answer
You haven't given much by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series by just using
shorter_col = df['time_series_1'] if len(df['time_series_1']) < len(df['time_series_2']) else df['time_series_2']
Now we want the last date in that
remove_date = max(shorter_col)
Now we want to remove data before that date
mask = (df['time_series_1'] > remove_date) | (df['time_series_2'] > remove_date)
df = df[mask]
