I am attempting to create a downward velocity model for offshore drilling from two variables: Depth, which increases by 1 foot per row, and DateTime, which is recorded at each foot of depth but at irregular time intervals:
Dept DateTime
1141 5/24/2017 04:31
1142 5/24/2017 04:32
1143 5/24/2017 04:40
1144 5/24/2017 04:42
1145 5/25/2017 04:58
I am trying to get an output where a Velocity column is computed for each row as the change in Dept divided by the DateTime gap.
If you are happy to use a 3rd party library, this is straightforward with Pandas:
import pandas as pd
# read file into dataframe
df = pd.read_csv('file.csv')
# convert series to datetime
df['DateTime'] = pd.to_datetime(df['DateTime'])
# perform calculation: depth change (ft) / time gap (minutes)
df['Velocity'] = df['Dept'].diff() / (df['DateTime'].diff().dt.total_seconds() / 60)
# export to csv
df.to_csv('file_out.csv', index=False)
print(df)
# Dept DateTime Velocity
# 0 1141 2017-05-24 04:31:00 NaN
# 1 1142 2017-05-24 04:32:00 1.000000
# 2 1143 2017-05-24 04:40:00 0.125000
# 3 1144 2017-05-24 04:42:00 0.500000
# 4 1145 2017-05-25 04:58:00 0.000687
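If you would rather avoid the dependency, here is a minimal standard-library sketch of the same calculation, assuming the two-column file shown above (columns Dept and DateTime, dates formatted like 5/24/2017 04:31):
import csv
from datetime import datetime

rows = []
with open('file.csv', newline='') as f:
    for row in csv.DictReader(f):
        row['Dept'] = int(row['Dept'])
        row['DateTime'] = datetime.strptime(row['DateTime'], '%m/%d/%Y %H:%M')
        rows.append(row)

# velocity in feet per minute, relative to the previous row
for prev, curr in zip(rows, rows[1:]):
    minutes = (curr['DateTime'] - prev['DateTime']).total_seconds() / 60
    curr['Velocity'] = (curr['Dept'] - prev['Dept']) / minutes if minutes else None

for row in rows:
    print(row['Dept'], row['DateTime'], row.get('Velocity'))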
I'm trying to convert a DataFrame with the following contents:
            Date  Change
1802  2017-09-14  -1.14%
462   2021-05-16     NaN
935   2020-01-29   0.04%
713   2020-09-07   2.39%
1471  2018-08-11     NaN

[1460 rows × 2 columns]
Into this:
<TimeSeries (DataArray) (Month: 144, component: 1, sample: 1)>
array([[[112.]],
       [[118.]],
       [[132.]],
       [[129.]],
       [[121.]],
       [[135.]],
       [[148.]],
       [[148.]],
       [[136.]],
       ...
Coordinates:
  * Month      (Month)     datetime64[ns] 2019-01-01 ... 2021-12-01
  * component  (component) object 'Change'
Attributes:
    static_covariates: None
    hierarchy: None
The goal is to run a neural network model on multiple time series. Any help or advice is greatly appreciated!
The solution required removing the '%' sign from the column values and then converting the column to a float:
ftse_change['Change'] = ftse_change['Change'].str.rstrip('%').astype('float') / 100.0
did the trick
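To go from the cleaned frame to a monthly TimeSeries like the one shown in the question, one route is sketched below. This assumes the target object comes from the darts library (the repr above looks like a darts TimeSeries) and that ftse_change has the Date and Change columns from the question; treat it as an outline rather than a drop-in answer:
import pandas as pd
from darts import TimeSeries

# parse the dates and aggregate the observations to one value per month start
ftse_change['Date'] = pd.to_datetime(ftse_change['Date'])
monthly = (ftse_change.set_index('Date')['Change']
           .resample('MS')
           .mean()
           .to_frame())

# wrap the monthly frame as a darts TimeSeries with a single 'Change' component
series = TimeSeries.from_dataframe(monthly, value_cols='Change', freq='MS')
print(series)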
I have a large CSV file that has a date column. I want to calculate the time differences between consecutive rows using pandas. How can I calculate the time differences in seconds and write them to a new column? I already checked similar questions, but their date format was different. These are the top five rows of my data:
2017-02-01T00:00:01
2017-02-01T00:00:01
2017-02-01T00:00:06
2017-02-01T00:00:07
2017-02-01T00:00:10
I tried
import pandas as pd
df=pd.read_csv('Output1.csv')
df['Time_diff'] = df['BaseDateTime'].diff()
print(df)
but got this error
TypeError Traceback (most recent call last)
<ipython-input-7-0dc1df27a3d2> in <module>
1 import pandas as pd
2 df=pd.read_csv('Output1.csv')
----> 3 df['Time_diff'] = df['BaseDateTime'].diff()
4 print(df)
D:\anaconda\lib\site-packages\pandas\core\series.py in diff(self, periods)
2356 dtype: float64
2357 """
-> 2358 result = algorithms.diff(self.array, periods)
2359 return self._constructor(result, index=self.index).__finalize__(self)
2360
D:\anaconda\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis, stacklevel)
1924 out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]
1925 else:
-> 1926 out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
1927
1928 if is_timedelta:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
Try this example:
import pandas as pd
import io
s = io.StringIO('''
dates,nums
2017-02-01T00:00:01,1
2017-02-01T00:00:01,2
2017-02-01T00:00:06,3
2017-02-01T00:00:07,4
2017-02-01T00:00:10,5
''')
df = pd.read_csv(s)
Currently the frame looks like this:
(nums means nothing; it is just there as a secondary column of "something".)
dates nums
0 2017-02-01T00:00:01 1
1 2017-02-01T00:00:01 2
2 2017-02-01T00:00:06 3
3 2017-02-01T00:00:07 4
4 2017-02-01T00:00:10 5
Carrying on:
# format as datetime
df['dates'] = pd.to_datetime(df['dates'])
# shift the dates up and into a new column
df['dates_shift'] = df['dates'].shift(-1)
# work out the diff in seconds
df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)
# remove the temp column
del df['dates_shift']
# see what you've got
print(df)
dates nums time_diff
0 2017-02-01 00:00:01 1 0.0
1 2017-02-01 00:00:01 2 5.0
2 2017-02-01 00:00:06 3 1.0
3 2017-02-01 00:00:07 4 3.0
4 2017-02-01 00:00:10 5 NaN
To get absolute values, change this line above:
df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)
To:
df['time_diff'] = (df['dates_shift'] - df['dates']).abs() / pd.Timedelta(seconds=1)
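Alternatively, and closer to the code in the question, the TypeError goes away once the column is parsed as datetimes; then diff() works directly and total_seconds() gives the gap in seconds (measured to the previous row, so the first value is NaN rather than the last):
df = pd.read_csv('Output1.csv', parse_dates=['BaseDateTime'])
df['Time_diff'] = df['BaseDateTime'].diff().dt.total_seconds()
print(df)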
I've been trying to draw a stacked bar chart using plotnine. This graphic represents the end-of-month inventory within the same "Category". The "SubCategory" is what should get stacked.
I've built a pandas dataframe from a query to a database. The query retrieves the sum(inventory) for each "subcategory" within a "category" in a date range.
This is the format of the DataFrame:
SubCategory1 SubCategory2 SubCategory3 .... Dates
0 1450.0 130.5 430.2 .... 2019/Jan
1 1233.2 1000.0 13.6 .... 2019/Feb
2 1150.8 567.2 200.3 .... 2019/Mar
Dates should be on the X axis, and Y should be the sum of "SubCategory1" + "SubCategory2" + "SubCategory3", with each SubCategory distinguishable by color.
I tried this because I thought it made sense but had no luck:
g = ggplot(df)
for key in subcategories:
g = g + geom_bar(aes(x='Dates', y=key), stat='identity', position='stack')
where subcategories is a dictionary with the SubCategory names.
Maybe the format of the dataframe is not ideal. Or I don't know how to properly use it with plotnine/ggplot.
Thanks for the help.
You need the data in tidy (long) format, so the subcategory becomes a single column that can be mapped to the fill aesthetic:
from io import StringIO
import pandas as pd
from plotnine import *
from mizani.breaks import date_breaks
io = StringIO("""
SubCategory1 SubCategory2 SubCategory3 Dates
1450.0 130.5 430.2 2019/Jan
1233.2 1000.0 13.6 2019/Feb
1150.8 567.2 200.3 2019/Mar
""")
data = pd.read_csv(io, sep=r'\s+', parse_dates=[3])
# Make the data tidy
df = pd.melt(data, id_vars=['Dates'], var_name='categories')
"""
Dates categories value
0 2019-01-01 SubCategory1 1450.0
1 2019-02-01 SubCategory1 1233.2
2 2019-03-01 SubCategory1 1150.8
3 2019-01-01 SubCategory2 130.5
4 2019-02-01 SubCategory2 1000.0
5 2019-03-01 SubCategory2 567.2
6 2019-01-01 SubCategory3 430.2
7 2019-02-01 SubCategory3 13.6
8 2019-03-01 SubCategory3 200.3
"""
(ggplot(df, aes('Dates', 'value', fill='categories'))
+ geom_col()
+ scale_x_datetime(breaks=date_breaks('1 month'))
)
Do you really need to use plotnine? You can do it with just:
df.plot.bar(x='Dates', stacked=True)
Output: a stacked bar chart with one bar per date, split by SubCategory.
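For a runnable version of that one-liner, a small sketch using the sample data from the question (matplotlib is assumed to be installed, since pandas plotting draws with it):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'SubCategory1': [1450.0, 1233.2, 1150.8],
    'SubCategory2': [130.5, 1000.0, 567.2],
    'SubCategory3': [430.2, 13.6, 200.3],
    'Dates': ['2019/Jan', '2019/Feb', '2019/Mar'],
})

# one stacked bar per date, one colour per SubCategory column
df.plot.bar(x='Dates', stacked=True)
plt.show()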
I want to split monthly data to weekly and fill each week row with the same monthly value for which each week refers to.
These variables are the ones that I need to work with.
"starting date" non-null datetime64[ns]
"ending date" non-null datetime64[ns]
import pandas as pd
df = pd.read_excel("file")
import pandas as pd
import math, datetime
d1 = datetime.date(yyyy, mm, dd)
d2 = datetime.date(yyyy, mm, dd)
h = []
while d1 <= d2:
print(d1)
d1 = d1 + datetime.timedelta(days=7)
h.append(d1)
df = pd.Series(h)
print(df)
I have tried the code above, but I think it is completely useless.
This is what I have in my dataset:
price starting date ending date model
1000 2013-01-01 2013-01-14 blue
598 2013-01-01 2013-01-14 blue
156 2013-01-15 2013-01-28 red
This is what I would like to get:
weekly date price model
2013-01-01 1000 blue
2013-01-01 598 blue
2013-01-08 1000 blue
2013-01-08 598 blue
2013-01-15 156 red
2013-01-22 156 red
Something like below:
Convert the date columns with pd.to_datetime():
df[['starting date','ending date']] = df[['starting date','ending date']].apply(pd.to_datetime)
Create a dictionary from the start time column (this answer uses a simplified data column with values 20 and 21 in place of the question's price):
d=dict(zip(df['starting date'],df.data))
#{Timestamp('2013-01-01 00:00:00'): 20, Timestamp('2013-01-15 00:00:00'): 21}
Using pd.date_range(), create a dataframe with weekly intervals starting from the first start time:
df_new = pd.DataFrame(pd.date_range(df['starting date'].iloc[0],df['ending date'].iloc[-1],freq='W-TUE'),columns=['StartDate'])
Same for end time:
df_new['EndDate']=pd.date_range(df['starting date'].iloc[0],df['ending date'].iloc[-1],freq='W-MON')
Map the data column based on start time and ffill() till the next start time arrives:
df_new['data']=df_new.StartDate.map(d).ffill()
print(df_new)
StartDate EndDate data
0 2013-01-01 2013-01-07 20.0
1 2013-01-08 2013-01-14 20.0
2 2013-01-15 2013-01-21 21.0
3 2013-01-22 2013-01-28 21.0
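If the goal is exactly the output shown in the question (one row per week, carrying price and model along), here is a sketch of another way, assuming pandas 0.25 or later for DataFrame.explode and building the sample frame inline:
import pandas as pd

df = pd.DataFrame({
    'price': [1000, 598, 156],
    'starting date': pd.to_datetime(['2013-01-01', '2013-01-01', '2013-01-15']),
    'ending date': pd.to_datetime(['2013-01-14', '2013-01-14', '2013-01-28']),
    'model': ['blue', 'blue', 'red'],
})

# list the week-start dates covered by each row, then explode to one row per week
df['weekly date'] = df.apply(
    lambda r: list(pd.date_range(r['starting date'], r['ending date'], freq='7D')), axis=1)
out = (df.explode('weekly date')[['weekly date', 'price', 'model']]
         .sort_values(['weekly date', 'price'], ascending=[True, False])
         .reset_index(drop=True))
print(out)
#   weekly date  price model
# 0  2013-01-01   1000  blue
# 1  2013-01-01    598  blue
# 2  2013-01-08   1000  blue
# 3  2013-01-08    598  blue
# 4  2013-01-15    156   red
# 5  2013-01-22    156   red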
I'm going to make an assumption that the starting date and the ending date never overlap in your dataset. I'm also going to assume that your example is correct even though it contradicts your words: it is not monthly data, but rather roughly twice-monthly data. This code should work with any frequency.
# creates some sample data
df = pd.DataFrame(data={'starting date':pd.to_datetime(['2019-01-01','2019-01-15','2019-02-01','2019-02-15']),
'data':[1,2,3,4]})
# Hold the start and end dates of the new df
d1 = pd.Timestamp(2019, 1, 1)
d2 = pd.Timestamp(2019, 2, 28)
# Create a new DF to hold results
new_df = pd.DataFrame({'date': pd.date_range(start=d1, end=d2, freq='W')})
# Merge based on the closest start date.
result = pd.merge_asof(new_df,df,left_on='date',right_on='starting date')
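One caveat worth noting: freq='W' snaps the grid to week-ending Sundays (2019-01-06, 2019-01-13, ...). If the weekly grid should instead start exactly on the first starting date, as in the question's expected output, a 7-day step is one way to do that:
# same idea, but the grid starts exactly on d1 and steps in 7-day increments
new_df = pd.DataFrame({'date': pd.date_range(start=d1, end=d2, freq='7D')})
result = pd.merge_asof(new_df, df, left_on='date', right_on='starting date')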
I know there are similar questions that have already been answered. However, I can't seem to troubleshoot why none of the solutions are working for me.
My sample dataset:
TimeStamp 340 341 342
10:27:00 1.953036 2.110234 1.981548
10:28:00 1.973408 2.046361 1.806923
10:29:00 0.000000 0.000000 0.014881
10:30:00 2.567976 3.169928 3.479591
I want to find the mean of the data every two minutes for each column. While df.groupby promises a neat solution, it makes my TimeStamp column disappear for some reason. Any help is greatly appreciated.
Expected output:
TimeStamp 340 341 342
10:27:30 1.963222 2.078298 1.894235
10:29:30 1.283988 1.584964 1.747236
Attempted code:
import pandas as pd
import numpy as np
path = '/Users/username/Desktop/Model/'
file1 = 'filename.csv'
df = pd.read_csv(path + file1, skipinitialspace = True)
df['TimeStamp'] = pd.to_timedelta(df['TimeStamp'])
df['TimeStamp'] = df['TimeStamp'].dt.floor('min')
df.set_index('TimeStamp')
rowF = len(df['TimeStamp'])
# Average every two min
newdf = df.groupby(np.arange(len(df.index))//2).mean()
print(newdf)
Set the time as index:
df.set_index(pd.to_timedelta(df.TimeStamp), inplace=True)
And then use resample and aggregate every two minutes:
df.resample("2min").mean().reset_index()
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
#2 10:31:00 NaN NaN NaN
Drop the last observation with iloc:
df.resample("2min").mean().reset_index().iloc[:-1]
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
If you prefer to shift the TimeStamp by 30 seconds:
(df.resample("2min").mean().reset_index()
.assign(TimeStamp = lambda x: x.TimeStamp + pd.Timedelta('30 seconds'))
.iloc[:-1])
# TimeStamp 340 341 342
#0 10:27:30 1.963222 2.078298 1.894235
#1 10:29:30 1.283988 1.584964 1.747236
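As a side note on why the TimeStamp column seemed to disappear in the attempted code: df.set_index('TimeStamp') returns a new frame rather than modifying df in place, so the later groupby(...).mean() still sees TimeStamp as an ordinary column and, being non-numeric, it gets dropped from the means. If you do want to stay with the pairwise groupby instead of resample, here is a sketch that keeps the timestamps, assuming the column names from the sample and an even number of rows:
import numpy as np

# group rows in consecutive pairs
grouped = df.groupby(np.arange(len(df)) // 2)

# average the value columns and re-attach the first TimeStamp of each pair
newdf = grouped[['340', '341', '342']].mean()
newdf.insert(0, 'TimeStamp', grouped['TimeStamp'].first().values)
print(newdf)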