I am working on some glacier borehole temperature data consisting of ~1,000 rows by 700 columns. The vertical index is depth (i.e. as you move down the array, depth increases) and the column headers are datetime values (i.e. as you move right along the array, you move forward in time).
I am looking for a way to average the temperature columns together according to a chosen date sampling rate. For example, the early datetimes have a spacing of 10 minutes, but the later datetimes have a spacing of six hours.
It would be good to be able to pass the sampling rate in as an input and get data averaged at that rate, so that I can see which rate works best.
It would also be good if, when I choose say 3-hour sampling, columns that are already spaced more than 3 hours apart are simply left unchanged (i.e. datetime spacings of 10 minutes are averaged, but datetime spacings of 6 hours are left unaffected).
All of this needs to come out as either a pandas DataFrame with dates as column headers and depth as the index, or as a numpy array plus a separate list of datetimes.
I'm fairly new to Python, and this is my first question on stackoverflow!! Thanks :)
(I know the following is not totally correct use of Pandas, but it works for the figure slider I've produced!)
import numpy as np
import pandas as pd
#example array
T = np.array([ [-2, -2, -2, -2.1, -2.3, -2.6],
[-2.2, -2.3, -3, -3.1, -3.3, -3.3],
[-4, -4, -4.5, -4.4, -4.6, -4.5]])
#example headers at 4 hour and then 12 hour spacing
headers = (pd.date_range(start='2018-04-24 00:00:00', end='2018-04-24 08:00:00', periods=3).tolist() +
           pd.date_range(start='2018-04-24 12:00:00', end='2018-04-25 12:00:00', periods=3).tolist())
#pandas dataframe in same setup as much larger one I'm using
T_df = pd.DataFrame(T, columns = headers)
One trick you can use is to convert your time series to a numeric series, and then use the groupby method.
For instance, imagine you have
df = pd.DataFrame([['10:00:00', 1.],['10:10:00', 2.],['10:20:00', 3.],['10:30:00', 4.]],columns=['Time', 'Value'])
df.Time = pd.to_datetime(df.Time, format='%X')
You can convert your time series by:
df['DeltaT'] = (df.Time - df.Time.iloc[0]).dt.total_seconds().astype(int)  # seconds elapsed since the first timestamp
Then use the groupby method. You can, for instance, create a new column that floors each time into the interval you want:
myInterval = 1200.  # bin width in seconds (20 minutes)
df['group'] = (df['DeltaT']/myInterval).astype(int)  # integer bin number for each row
So you can use groupby followed by mean() (or a function you define)
df.groupby('group').mean()
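Applied to the column-oriented frame from the question, a rough sketch of the same idea (assuming the headers form a flat DatetimeIndex, and using a 6-hour interval purely for illustration) could look like this:
sampling = pd.Timedelta(hours=6)  # chosen sampling rate; pass in whatever interval you want to test
cols = pd.DatetimeIndex(T_df.columns)
bins = (cols - cols[0]) // sampling  # integer bin number for each column
T_res = T_df.T.groupby(bins).mean().T  # average the columns that share a bin
T_res.columns = [cols[bins == b][0] for b in T_res.columns]  # label each bin with its first datetime
Columns whose spacing already exceeds the chosen interval end up alone in their bin, so they pass through unchanged, which matches the behaviour asked for in the question.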
Hope this helps!
I want to fill in missing values by group via forward fill and backward fill, but only for values within 1 year. For example, with the below dataframe, I need "desired" from "df", where both forward fill and backward fill are applied to values within 1 year within each pair (group).
import pandas as pd
import numpy as np
pair = ['A1','A1','A1','A2','B1','B1','B1','B1','B2','C2','C2','C2','C2','C2']
ym = ['2001-01-01','2001-04-01','2002-03-01',
'2000-01-01',
'2003-04-01','2003-05-01','2005-06-01','2007-03-01',
'2004-05-01',
'2001-07-01','2001-09-01','2002-01-01','2003-06-01','2004-07-01']
value = [np.nan,7,np.nan,
np.nan,
3,4,np.nan,9,
10,
2,np.nan,np.nan,np.nan,np.nan]
df = pd.DataFrame(list(zip(pair,ym,value)), columns=['pair','ym','value'])
df
desired_value = [7,7,7,
np.nan,
3,4,np.nan,9,
10,
2,2,2,np.nan,np.nan]
desired = pd.DataFrame(list(zip(pair,ym,desired_value)), columns=['pair','ym','value'])
desired
I'm not sure what the right approach is. I've tried the limit option, ffill(limit=12), but I realized this counts up to 12 observations regardless of the date, which is not what I want. Another idea was to use resample to create empty monthly observations and then use ffill(limit=12), but the data I have is quite large and this doesn't seem like an efficient way. Any suggestion or help will be greatly appreciated. Thank you!
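A rough, untested sketch of that resample idea (reindex each pair to monthly frequency, fill with a 12-month limit, then drop the inserted months again), continuing from the df built above and assuming 'ym' parses as month-start dates, might look like:
df['ym'] = pd.to_datetime(df['ym'])

def fill_within_year(g):
    g = g.set_index('ym').resample('MS').asfreq()  # insert the missing months
    filled = g['value'].ffill(limit=12)            # forward fill up to 12 months
    g['value'] = filled.combine_first(g['value'].bfill(limit=12))  # then backward fill up to 12 months
    return g.dropna(subset=['pair'])               # keep only the original rows

result = df.groupby('pair', group_keys=False).apply(fill_within_year).reset_index()
Whether this is efficient enough for the full dataset is exactly the open question, but it at least ties the fill limit to calendar months rather than to observation counts.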
Just getting into data visualization with pandas. At the moment I'm trying to visualize a DataFrame with matplotlib that looks like this:
Initiative_160608 Initiative_160570 Initiative_160056
Beschluss_BR 2009-05-15 2009-05-15 2006-04-07
Vorlage_BT 2009-05-22 2009-05-22 2006-04-26
Beratung_BT 2009-05-28 2009-05-28 2006-05-11
ABeschluss_BT 2009-06-17 2009-06-17 2006-05-17
Beschlussempf 2009-06-17 2009-06-17 2006-05-26
As you can see, I have a number of columns, each with five dates (every date represents one event in a chain of five events). Now to the problem:
My plan is to visualize the data with a stacked horizontal bar chart, using the timedeltas between the five events (how many days passed between the first and last event, including the dates in between). Every column should represent one bar in the chart. The chart is not about the absolute time that has passed, but about the duration of the five events in relation to the overall duration of one column, which means that all bars should have the same overall length.
So far I haven't found anything similar or managed a solution myself. I would be extremely thankful for any kind of help on how to proceed with the shown data.
I'm not exactly sure if this is what you are looking for, but if each column is supposed to be a bar, and you want the time deltas within each column, then you need the difference in days between each row, and I am guessing the first row should have a difference of 0 days (since it is the starting point).
Also for stacked barplots, the index is used to create the categories, but in your case, you want the columns as categories, and each bar to be composed of the different index values. This means you need to transpose your df eventually.
This solution is pretty ugly, but hopefully it helps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
"Initiative_160608": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
"Initiative_160570": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
"Initiative_160056": ['2006-04-07', '2006-04-26', '2006-05-11', '2006-05-17', '2006-05-26']})
df.index = ['Beschluss_BR', 'Vorlage_BT', 'Beratung_BT', 'ABeschluss_BT', 'Beschlussempf']
# convert everything to dates
df = df.apply(lambda x: pd.to_datetime(x, format="%Y-%m-%d"))
def get_days(x):
    # difference in days between consecutive events; the first event gets 0
    diff_list = []
    for i in range(len(x)):
        if i == 0:
            diff_list.append(x.iloc[i] - x.iloc[i])
        else:
            diff_list.append(x.iloc[i] - x.iloc[i-1])
    return diff_list
# get the difference in days, then convert back to numbers
df_diff = df.apply(lambda x: get_days(x), axis = 0)
df_diff = df_diff.apply(lambda x: x.dt.days)
# transpose the matrix so that each initiative becomes a stacked bar
df_diff = df_diff.transpose()
# replace 0 values with 0.2 so that the bars are visible
df_diff = df_diff.replace(0, 0.2)
df_diff.plot.bar(stacked = True)
plt.show()
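Since the question asks for the phases in relation to each column's overall duration (so that all bars end up the same length), a small, hedged extension of the above would be to normalize each row before plotting:
# divide each initiative's day counts by its own total so the stacked segments sum to 1
df_rel = df_diff.div(df_diff.sum(axis=1), axis=0)
df_rel.plot.bar(stacked = True)
plt.show()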
I have a dataset of locations of stores with dates of events (the date all stock was sold from that store) and quantities of the sold items, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time series dataframe that has weekly summaries (or daily summaries, or summaries based on a custom date_range object) of these quantities A and B.
I can generate a week number and aggregate sales based on it, like so...
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want, as it starts with the earliest datetime, correct to the nanosecond. You could replace it with datetimes that have 00:00:00 as the time by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect but passing a multiple like '2D' results in the 'ugly' index, that is, starting at the earliest correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is stick to singles like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
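As a small, hedged example on the frame from the question (where sell_date is both the index and a column), selecting just the numeric quantity columns before aggregating avoids trying to sum the datetime column:
weekly = df[['quantityA', 'quantityB']].resample('W').agg(['sum', 'count'])
# label='left' (and closed='left') can be passed to resample if a 'week starting'
# style label is preferred over the default week-ending label
print(weekly.head())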
I have been running an experiment that outputs data with two columns:
seconds since start of experiment (float)
a measurement (float)
I would now like to load this into Pandas to resample and plot the measurements. I've done this before, but in those cases my timestamps were since epoch or in datetime (YYYY-MM-DD HH:MM:SS) format. If I load my first column as integers I'm unable to do
data.resample('5Min', how='mean')
It also does not seem possible if I convert my first column to timedelta(seconds=...). My question is: is it possible to resample this data without resorting to epoch conversion?
You can use groupby with time // period to do this:
import pandas as pd
import numpy as np
t = np.random.rand(10000)*3600
t.sort()
v = np.random.rand(10000)
df = pd.DataFrame({"time":t, "value":v})
period = 5*60  # bin width in seconds (5 minutes)
s = df.groupby(df.time // period).value.mean()  # mean value within each bin
s.index *= period  # convert bin numbers back to seconds (bin start)
I have the same structure of sensor data. The first column is seconds since the start of the experiment and the rest of the columns are values.
Here is the data structure:
time x y z
0 0.015948 0.403931 0.449005 -0.796860
1 0.036006 0.403915 0.448029 -0.795395
2 0.055885 0.404907 0.446548 -0.795853
Here is what worked for me:
Convert the time to a timedelta:
df.time = pd.to_timedelta(df.time, unit="s")
Set the time as the index:
df.set_index("time", inplace=True)
Resample to the frequency you want:
df.resample("40ms").mean()
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I've come up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start from 00:30 of a day but at any time of day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values that are necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, then put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
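If the minimum-count condition from the question is still needed, one possible (untested) way to combine it with pd.Grouper is:
d = s.groupby(pd.Grouper(freq='1D')).agg(lambda x: x.sum() if x.count() >= 40 else np.nan)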