Find how often data passes zero in time series data - python

I'm working with time series data in a pandas DataFrame in Python.
The series takes values in the range -10 to 10, and I want to find out how many times it crosses a given level, for example 3.
In the simplest case, I could compare the previous and the current value against 3 and check whether the side they fall on has changed.
Is there a function in pandas to help with this?

If you just want to count the non-missing values in each row,
use DataFrame.count(axis='columns'). Note that this counts entries, not zeros or threshold crossings:
import pandas as pd
path = ('./test.csv')
dataframe = pd.read_csv(path,encoding='utf8')
print(dataframe.count(axis='columns'))
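Counting how often the series crosses a level such as 3 can be sketched by marking each sample as above or not above the level and counting how often that flag changes between neighbours. This is only an illustration: the function name count_crossings and the sample values are made up, and a value exactly equal to the threshold is treated as "not above".

```python
import numpy as np
import pandas as pd

def count_crossings(values: pd.Series, threshold: float) -> int:
    """Count how many times consecutive samples fall on different sides of threshold."""
    # True where the sample is strictly above the threshold
    above = values.to_numpy() > threshold
    # A crossing is any position where the flag differs from the previous one
    return int(np.count_nonzero(above[1:] != above[:-1]))

# Made-up sample data for illustration
series = pd.Series([0, 5, 2, 4, 4, -1, 3.5])
crossings = count_crossings(series, 3)
print(crossings)
```

Missing values compare as "not above" here, so it may be worth dropping them with dropna() first.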

Related

Python - generate a timestamp table in pandas given a date period

This is kind of a mixture between these two questions:
Pandas is a Timestamp within a Period (because it adds a time period in pandas)
Generate a random date between two other dates (but I need multiple dates (at least 1 million which I specify with a variable LIMIT))
How can I generate a given number of random dates, each with a random time, within a given date period?
Performance is rather important for me, hence I chose to go with pandas, any performance boosts are appreciated even if that means using another library.
My approach so far would be the following:
tstamp = pd.to_datetime(['2010-01-01', '2020-12-31'])  # use one date format for both entries
# ???
But I don't know how to randomize between dates. I was thinking of using randint for a random unix epoch time and then converting that, but it would slow it down A LOT.
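You can try this; it is very fast. Note that np.arange steps through the range one day at a time here, so it yields random dates without a time component:
import numpy as np
start = np.datetime64('2017-01-01')
end = np.datetime64('2018-01-01')
limit = 1000000
delta = np.arange(start, end)  # every day from start up to, but not including, end
indices = np.random.choice(len(delta), limit)  # random positions, with replacement
delta[indices]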
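If a random time of day is needed as well, one possible sketch is to draw random integer epoch seconds in a single vectorized call and convert them all at once, which avoids a Python-level loop. The helper name random_timestamps and its seed parameter are assumptions made for this example, not part of the answer above.

```python
import numpy as np
import pandas as pd

def random_timestamps(start, end, n, seed=None):
    """Return n random timestamps with one-second resolution in [start, end)."""
    rng = np.random.default_rng(seed)
    # Convert both bounds to integer seconds since the Unix epoch
    start_s = np.datetime64(start, "s").astype("int64")
    end_s = np.datetime64(end, "s").astype("int64")
    # One vectorized draw for all n values (the upper bound is exclusive)
    seconds = rng.integers(start_s, end_s, size=n)
    return pd.to_datetime(seconds, unit="s")

stamps = random_timestamps("2010-01-01", "2020-12-31", 1_000_000, seed=0)
print(stamps[:3])
```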
All I had to do was to add str(fake.date_time_between(start_date='-10y', end_date='now')) into my Pandas DataFrame append logic. I'm not even sure that the str() there is necessary.
P.S. you initialize it like this:
from faker import Faker
# initialize Faker
fake = Faker()

Pandas overlapped time intervals to time series

I have a pandas dataframe containing time intervals that overlap at some points (figure 1). I need a dataframe holding a time series that runs from the first start_time to the last end_time (figure 2).
I have to sum up VIS values at overlapped time intervals.
I couldn't figure it out. How can I do it?
This problem is easily solved with the python package staircase, which is built on pandas and numpy for the purposes of working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or datetime index, or series etc) called times.
import staircase as sc
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
result = stepfunction(times, include_index=True)
That's it, result is a pandas Series indexed by times, and has the values you want. You can convert it to a dataframe in the format you want using reset_index method on the Series.
You can generate your times data like this:
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of as a step function. For example, the first row corresponds to a step function which starts with a value of zero, then at 2002-02-03 04:15:00 increases to a value of 10, then at 2002-02-04 04:45:00 returns to zero. When you sum the step functions for all rows, you get one step function whose value at any point is the sum of all VIS values at that point. This is what has been assigned to the stepfunction variable above. The stepfunction variable is callable, and returns the values of the step function at the points specified. This is what happens in the last line of the example, where the result variable is assigned.
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
import numpy as np
import pandas as pd
df['start_time'] = pd.to_datetime(df['start_time'])  # in case it's not datetime already
df.set_index('start_time', inplace=True)
new_dates = pd.date_range(start=df.index.min(), end=df['end_time'].max(), freq='15min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there are, they would need to be handled in some other way.
Resample is another possibility, but without data, it's tough to say what would work.
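For reference, summing VIS over the intervals that cover each time point can also be sketched in plain pandas and NumPy with a broadcast comparison. The two rows of data below are made up for illustration (the original figures are not available), and each interval is assumed to include its start and exclude its end.

```python
import numpy as np
import pandas as pd

# Made-up intervals for illustration only
df = pd.DataFrame({
    "start_time": pd.to_datetime(["2002-02-03 04:15", "2002-02-03 04:30"]),
    "end_time": pd.to_datetime(["2002-02-03 04:45", "2002-02-03 05:00"]),
    "VIS": [10, 5],
})

times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="15min")

# Column vector of time points, compared against every interval at once
t = times.to_numpy()[:, None]
active = (df["start_time"].to_numpy() <= t) & (t < df["end_time"].to_numpy())

# For each time point, add up the VIS values of the intervals that cover it
result = pd.Series((active * df["VIS"].to_numpy()).sum(axis=1), index=times)
print(result)
```

The comparison builds a (time points x intervals) boolean array, so this is only practical while that product fits comfortably in memory.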

Looking to insert value into column of pandas dataframe based off calculation from two other rows in the Dataframe

What is the best way to work out the conversion rate and store it in a Conversion Rate column of a pandas DataFrame?
Currently my dataframe looks like this:
Sessions  Conversions  Conversion Rate
1000      50           Default Value
I want to loop through the dataframe calculating conversion rate by doing the following code:
e = 0
for i in dataset.itertuples():
    dataset['Conversion Rate'].loc[e] = dataset['ga:goalCompletions'].loc[e] / dataset['ga:sessions'].loc[e]
    e += 1
But I get the warning: A value is trying to be set on a copy of a slice from a DataFrame
So I'm assuming it's not the best way to do it.
I would love some help, as I've been racking my brain over this for a couple of hours now, even though it's probably a super simple thing to fix...
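One way to avoid both the loop and the warning is to divide the two columns directly and assign the result as a whole column. The sketch below reuses the column names from the loop above; the rows are made-up sample values.

```python
import pandas as pd

# Made-up sample rows; column names follow the loop in the question
dataset = pd.DataFrame({
    "ga:sessions": [1000, 200, 0],
    "ga:goalCompletions": [50, 10, 0],
})

# Element-wise division over whole columns, assigned in a single step
dataset["Conversion Rate"] = dataset["ga:goalCompletions"] / dataset["ga:sessions"]
print(dataset)
```

The warning in the question comes from the chained form dataset['Conversion Rate'].loc[e] = ..., which may write to a temporary copy. Assigning the whole column at once sidesteps that. Rows with zero sessions and zero conversions end up as NaN.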

Generating a list of values from a pandas DataFrame column for a range of values in another column

For a list of daily maximum temperature values from 5 to 27 degrees celsius, I want to calculate the corresponding maximum ozone concentration, from the following pandas DataFrame:
I can do this by using the following code, by changing the 5 then 6, 7 etc.
df_c=df_b[df_b['Tmax']==5]
df_c.O3max.max()
Then I have to copy and paste the output values into an Excel spreadsheet. I'm sure there must be a much more pythonic way of doing this, such as by using a list comprehension. Ideally I would like to generate a list of values from the column O3max. Please give me some suggestions.
Use pd.Series.map with another pd.Series (this assumes each Tmax value appears only once in df_b, because the lookup index must be unique):
pd.Series(list_of_temps).map(df_b.set_index('Tmax')['O3max'])
You can get a dataframe
result_df = pd.DataFrame(dict(temps=list_of_temps))
result_df['O3max'] = result_df.temps.map(df_b.set_index('Tmax')['O3max'])
I had another play around and think the following piece of code seems to do the job:
df_c=df_b.groupby(['Tmax'])['O3max'].max()
I would appreciate any thoughts on whether this is correct
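The groupby approach can be checked on a small example. The sketch below uses made-up values and the column names from the question, then reindexes the result onto the full 5 to 27 range so that temperatures with no observations show up as NaN.

```python
import pandas as pd

# Made-up sample data for illustration
df_b = pd.DataFrame({
    "Tmax": [5, 5, 6, 8],
    "O3max": [30.0, 42.0, 35.0, 50.0],
})

# Maximum O3max for each distinct Tmax value
max_by_temp = df_b.groupby("Tmax")["O3max"].max()

# Align to every temperature from 5 to 27; missing temperatures become NaN
table = max_by_temp.reindex(range(5, 28))
values = table.tolist()
print(table.head())
```

If the goal is a spreadsheet, table.to_csv("o3max_by_tmax.csv") (the file name is just an example) would replace the copy-and-paste step.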

Resampling pandas timeseries without computing a new offset

I'm reading in timeseries data that contains only the available times. This leads to a Series with no missing values, but an unequally spaced index. I'd like to convert this to a Series with an equally spaced index with missing values. Since I don't know a priori what the spacing will be, I'm currently using a function like
min_dt = np.diff(series.index.values).min()
new_spacing = pandas.DateOffset(days=min_dt.days, seconds=min_dt.seconds,
                                microseconds=min_dt.microseconds)
series = series.asfreq(new_spacing)
to compute what the spacing should be (note that this uses pandas 0.7.3; the 0.8 beta code looks slightly different, since I have to use series.index.to_pydatetime() for correct behavior with NumPy 1.6).
Is there an easier way to do this operation using the pandas library?
If you want NaN's in the places where there is no data, you can just use Minute() located in datetools (as of pandas 0.7.x)
from pandas.core.datetools import day, Minute
tseries.asfreq(Minute())
That should provide an evenly spaced time series with 1 minute differences with NaNs as the series values where there is no data.
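In current pandas versions the datetools module is no longer available, so the snippet above will not import. A rough modern sketch of the same idea, assuming every gap in the index is a multiple of the smallest gap, is to take the smallest difference between consecutive timestamps and reindex onto a regular grid. The sample series is made up.

```python
import pandas as pd

# Made-up irregular series: the 00:02 sample is missing
series = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:03"]
    ),
)

# Smallest gap between consecutive timestamps, as a Timedelta
min_dt = series.index.to_series().diff().min()

# Evenly spaced index at that spacing; reindexing fills the gaps with NaN
regular_index = pd.date_range(series.index.min(), series.index.max(), freq=min_dt)
regular = series.reindex(regular_index)
print(regular)
```

If some gaps are not multiples of the smallest one, those samples will not land on the grid and would be dropped by the reindex, so a resample may be a better fit in that case.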
