I have a pandas dataframe that includes time intervals that overlap at some points (figure 1). I need a dataframe with a time series that runs from the first start_time to the last end_time (figure 2).
I have to sum up the VIS values where the time intervals overlap.
I couldn't figure it out. How can I do it?
This problem is easily solved with the Python package staircase, which is built on pandas and numpy for working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or DatetimeIndex, or Series, etc.) called times.
import staircase as sc

# Build one step function that is the sum of all interval contributions:
# each row adds its VIS value between start_time and end_time.
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
# Evaluate the summed step function at the desired times.
result = stepfunction(times, include_index=True)
That's it: result is a pandas Series indexed by times with the values you want. You can convert it to a dataframe in the format you want using the reset_index method on the Series.
You can generate your times data like this
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of as a step function. For example, the first row corresponds to a step function which starts with a value of zero, increases to a value of 10 at 2002-02-03 04:15:00, then returns to zero at 2002-02-04 04:45:00. When you sum the step functions for all rows you get a single step function whose value at any point is the sum of the VIS values active at that point. This is what is assigned to the stepfunction variable above. The stepfunction variable is callable and returns the values of the step function at the points specified, which is what happens in the last line of the example, where result is assigned.
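For intuition, here is a rough plain-pandas sketch of the same summation idea (this is not how staircase works internally, just the concept; it assumes df has the columns start_time, end_time and VIS, and that times is defined as above):

import pandas as pd

# Each interval contributes +VIS at its start and -VIS at its end;
# the cumulative sum of these events is the summed step function.
events = pd.concat([
    pd.Series(df["VIS"].values, index=df["start_time"]),
    pd.Series(-df["VIS"].values, index=df["end_time"]),
]).sort_index()
step = events.groupby(level=0).sum().cumsum()

# Sample the step function at the desired times.
result = step.reindex(step.index.union(times)).ffill().reindex(times).fillna(0)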
Note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
import numpy as np
import pandas as pd

df['start_time'] = pd.to_datetime(df['start_time'])  # in case it's not datetime already
df.set_index('start_time', inplace=True)
# Build the full 15-minute grid from the earliest start to the latest end.
new_dates = pd.date_range(start=df.index.min(), end=df['end_time'].max(), freq='15Min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there are, they'd need to be handled some other way.
Resample is another possibility, but without data, it's tough to say what would work.
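For example, a minimal resample sketch (assuming start_time is already the index and VIS is numeric; note this sums VIS by the bin each interval starts in, which may or may not match the intended overlap logic):

# Sum VIS values into 15-minute bins keyed by start_time.
binned = df['VIS'].resample('15Min').sum()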
So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it calculates the mean across every city in the dataset, which adds noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The dataframe after groupby is smaller than the initial dataframe, which is why your code runs into an error.
There are two ways to solve this problem. The first one uses transform:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby result and merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there is nothing to group by.
Instead, perform the groupby on all the columns you need, and make sure the final output is a Series aligned with the original index; transform gives you that:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df
Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
# Drop location detail columns, then aggregate provinces into country totals.
raw_data.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
plot_data = raw_data.groupby('Country/Region').sum()
plot_data is then a simple DataFrame (one row per country, one column per date).
What I would like to do now is to subtract the values in each column from the values in the column for the prior day, i.e. I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns with NaN values, i.e. pandas tries to align the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant" way. I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, since it doesn't work in the single-column example, it doesn't work with multiple columns either (pandas matches the column headers and returns a bunch of zeros/NaN values).
Maybe there is a way to tell pandas to ignore the headers during an operation on a DataFrame? I know that I can convert my DataFrame to a NumPy array and do a bunch of operations there. However, since this is a simple/small table, I thought I would try to keep using the DataFrame type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day, and now I can subtract them from the actual numbers on each day.
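For what it's worth, I believe DataFrame.diff along the columns is a one-line equivalent of the same operation:

# Day-over-day new cases: each column minus the previous day's column.
new_cases = plot_data.diff(axis=1)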
I'm dealing with time series data using python's pandas DataFrame.
Given that this time series has values in the range of -10 to 10, I want to find out how many times it passes 3.
In the simplest case, you can compare the previous and current values against 3: if they fall on opposite sides, the series crossed it.
Is there a function in pandas to help with this?
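Here is a rough sketch of the check I have in mind, with made-up data:

import pandas as pd

# Hypothetical series standing in for the real data.
s = pd.Series([-5, 1, 4, 2, 6, 7, 2, -1])

# A crossing occurs when consecutive values fall on opposite sides of 3.
above = s > 3
crossings = (above != above.shift()).iloc[1:].sum()
print(crossings)  # 4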
If you just want to count the non-missing values in each row,
use DataFrame.count(axis='columns'):
import pandas as pd

path = './test.csv'
dataframe = pd.read_csv(path, encoding='utf8')
# Count the non-missing values in each row.
print(dataframe.count(axis='columns'))
I have used groupby in pandas; however, the label for the groups is simply an arbitrary value. I would like this label to be the index of the original dataframe (which is datetime) so that I can create a new dataframe that I can plot against datetime.
# Group by X together with a run id that increments whenever X changes.
grouped_data = df.groupby(
    ['X', df.X.ne(df.X.shift()).cumsum().rename('grp')])
grouped_data2 = grouped_data['Y'].agg(np.trapz).loc[2.0:4.0]
The column X has values changing from 1 to 4, and the second line of code is intended to integrate column Y within the groups where X is either 2 or 3. These are repeating units, so I don't want all the 2s and all the 3s integrated together: I want the period of time where it goes 22222333333 as one group, and then np.trapz applied again to the next group where it goes 2222233333. That way I should end up with a new dataframe whose index corresponds to the start of these time periods and whose values are the integrals over these periods.
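To illustrate the run-id trick in my code above on made-up data:

import pandas as pd

x = pd.Series([2, 2, 2, 3, 3, 2, 2, 3])
grp = x.ne(x.shift()).cumsum()
# grp -> 1 1 1 2 2 3 3 4: each consecutive run of identical values gets its own id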
If I understand correctly, you've already set your index to datetime values? If yes, try pd.Grouper:
df.groupby(pd.Grouper(freq={appropriate offset alias}))  # uses the DatetimeIndex by default; pass key={column name} if the dates are in a column instead
Without a sample data-set, I can't really provide a complete solution, but this should solve your indexing issue:)
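For instance, a tiny made-up example of Grouper on a DatetimeIndex:

import pandas as pd

df = pd.DataFrame({'Y': range(6)},
                  index=pd.date_range('2021-01-01', periods=6, freq='20D'))
monthly = df.groupby(pd.Grouper(freq='M')).sum()  # one row per calendar month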
See the pandas Grouper tutorial and the list of offset aliases in the documentation.
EDIT 2016-01-24: This behavior was from a bug in xarray (at the time known as 'xray'). See answer by skc below.
I have an xarray.DataArray comprising daily data spanning multiple years. I want to compute the time tendency of that data for each month in the timeseries. I can get the numerator, i.e. the change in the quantity over each month, using resample. Supposing arr is my xarray.DataArray object, with the time coordinate named 'time':
data_first = arr.resample('1M', 'time', how='first')
data_last = arr.resample('1M', 'time', how='last')
Then data_last - data_first gives me the change in that variable over that month.
However, this doesn't work on the time=arr.time object itself: both 'first' and 'last' kwarg values yield the same value, which is the last day of that month. Also, I can't use the groupby methods, because doing so with time.month groups all the Januaries together, all the Februaries together, etc., when I want the first and last time value within each individual month in the timeseries.
Is there a simple way to do this in xarray? I suspect yes, but I'm new to the package and am failing miserably.
Since 'time' is a coordinate in the DataArray you provided, for the moment it is not possible¹ to perform resample directly upon it. A possible workaround is to create a new DataArray with the time coordinate values as a variable (still linked to the same coordinate 'time').
If arr is the DataArray you are starting from I would suggest something like this:
import xray  # the package has since been renamed xarray

# Wrap the time coordinate values in their own DataArray so they can be resampled.
time = xray.DataArray(arr.time.values, coords=[arr.time.values], dims=['time'])
time_first = time.resample('1M', 'time', how='first')
time_last = time.resample('1M', 'time', how='last')
time_diff = time_last - time_first
¹ This is not the intended behavior -- see Stephan's comment above.
Update: Pull request 648 has fixed this issue, so there should no longer be a need to use a workaround.
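For reference, with the current xarray resample API the same computation looks something like this (a sketch, assuming arr is the DataArray from the question):

# Post-0.12 resample syntax: group by month on the 'time' dimension.
data_first = arr.resample(time='1M').first()
data_last = arr.resample(time='1M').last()
monthly_change = data_last - data_first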