I have a time series that look like this one:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO('''
id,start,end,value
1,2021-01-01,2021-03-31,1000
1,2021-01-01,2021-06-30,2000
1,2021-01-01,2021-12-31,5000
2,2021-01-01,2021-02-01,100
2,2021-02-02,2021-05-04,200
2,2021-01-01,2021-08-24,1000
'''))
It's possible to have irregular steps and they may overlap, but the value is a summation. And I need to transform it in the smallest possible time intervals by id without overlapping, discounting the overlapped values. The output to that time series would look like this:
output = pd.read_csv(StringIO('''
id,start,end,value
1,2021-01-01,2021-03-31,1000
1,2021-04-01,2021-06-30,1000
1,2021-07-01,2021-12-31,3000
2,2021-01-01,2021-02-01,100
2,2021-02-02,2021-05-04,200
2,2021-05-05,2021-08-24,700
'''))
I tried to adapt this Overlap in date range grouped dataframe to my problem, but without success.
Thanks in advance!
Related
I am a pandas newbie and I want to make a graph from a CSV I have. On this csv, there's some date written to it, and I want to make a graph of how frequent those date appears.
This is how it looks :
2022-01-12
2022-01-12
2022-01-12
2022-01-13
2022-01-13
2022-01-14
Here, we can see that I have three records on the 12th of january, 2 records the 13th and only one records the 14th. So we should see a decrease on the graph.
So, I tried converting my csv like this :
date,records
2022-01-12,3
2022-01-13,2
2022-01-14,1
And then make a graph with the date as the x axis and the records amount as the y axis.
But is there a way panda (or matplotlib I never understand which one to use) can make a graph based on the frequency of appearance, so that I don't have to convert the csv before ?
There is a function of PANDAS which allows you to count the number of values.
First off, you'd need to read your csv file into a dataframe. Do this by using:
import pandas as pd
df = pd.read_csv("~csv file name~")
Using the unique() function in the pandas library, you can display all of the unique values. The syntax should look like:
uniqueVals = df("~column name~").unique()
That should return a list of all the unique values. Then what you'll do is use the function value_counts() with whatever value you are trying to count in square brackets after the normal brackets. The syntax should look something like this:
totalOfVals = []
for date in uniqueVals:
numDate = df[date].valuecounts("~Whatever date you're looking for~")
totalOfVals.append(numDate)
Then you can use the two arrays you have for the unique dates and the amount of dates there are to then use matplotlib to create a graph.
You'll want to use the syntax:
import matplotlib.pyplot as mpl
mpl.plot(uniqueVals, totalOfVals, color = "~whatever colour you want the line to be~", marker = "~whatever you want the marker to look like~")
mpl.xlabel('Date')
mpl.ylabel('Number of occurrences')
mpl.title('Number of occurrences of dates')
mpl.grid(True)
mpl.show()
And that should display a graph with all the dates and number of occurrences with a grid behind it. Of course if you don't want the grid just either set mpl.grid to False or just get rid of it.
I have a 1 minute interval intraday stock data which looks like this:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
I am trying to resample it to '5m' data like this:
n = n.resample('5T').agg(dict(zip(n.columns, ['first', 'max', 'min', 'last', 'last', 'sum'])))
But it tries to resample the datetime information which is not in my data. The market data is only available till 03:30 PM, but when I look at the resampled dataframe I find its tried to resample for entire 24 hrs.
How do I stop the resampling till 03:30PM and move on to the succeeding date?
Right now the dataframe has mostly NaN values due to this. Any suggestions will be welcome.
I am not sure what you are trying to achieve with that agg() function. Assuming 'first' refers to the first quantile and 'last' to the last quantile and you want to calculate some statistics per column, I suggest you do the following:
Get your data:
import yfinance as yf
import pandas as pd
n = yf.download('^nsei', period= '5d', interval= '1m')
Resample your data:
Note: your result is the same as when you resample with n.resample('5T').first() but this means every value in the dataframe
equals the first value from the 5 minute interval consisting of 5
values. A more logical resampling method is to use the mean() or
sum() function as shown below.
If this is data on stock prices it makes more sense to use mean():
resampled_df = n.resample('5T').mean()
To remove resampled hours that are outside of the working stock hours you have 2 options.
Option 1: drop na values:
filtered_df = resampled_df.dropna()
Note: this will not work if you use sum() since the result won't contain missing values but zeros.
Option 2 filter based on start and end hour
Get minimum and maximum time of day where data is available as datetime.time object:
start = n.index.min().time() # 09:15 as datetime.time object
end = n.index.max().time() # 15:29 as datetime.time object
Filter dataframe based on start and end times:
filtered_df = resampled_df.between_time(start, end)
Get the statistics:
statistics = filtered_df.describe()
statistics
Note that describe() will not contain the sum, so in order to add it you could do:
statistics = pd.concat([statistics, filtered_df.agg(['sum'])])
statistics
Output:
The agg() is to apply individual method of operation for each column, I used this so that I can get to see the 'candlestick' formation as it is called in stock technical analysis.
I was able to fix the issue, by dropping the NaN values.
I have a pandas dataframe that includes time intervals that overlapping at some points (figure 1). I need a data frame that has a time series that starts beginning from the first start_time to the end of the last end_time (figure 2).
I have to sum up VIS values at overlapped time intervals.
I couldn't figure it out. How can I do it?
This problem is easily solved with the python package staircase, which is built on pandas and numpy for the purposes of working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or datetime index, or series etc) called times.
import staircase as sc
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
result = stepfunction(times, include_index=True)
That's it, result is a pandas Series indexed by times, and has the values you want. You can convert it to a dataframe in the format you want using reset_index method on the Series.
You can generate your times data like this
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of a step function. For example the first row corresponds to a step function which starts with a value of zero, then at 2002-02-03 04:15:00 increases to a value of 10, then at 2002-02-04 04:45:00 returns to zero. When you sum all the step functions up for each row you have one step function whose value is the sum of all VIS values at any point. This is what has been assigned to the stepfunction variable above. The stepfunction variable is callable, and returns values of the step function at the points specified. This is what is happening in the last line of the example where the result variable is being assigned.
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
df['start_time'] = pd.to_datetime(df['start_time']) # in case it's not datetime already
df.set_index('start_time', inplace=True)
new_dates = pd.date_range(start=min(df.index), end=max(df.end_time), freq='15Min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there is, that'd need to be handled in some other way.
Resample is another possibility, but without data, it's tough to say what would work.
I'm dealing with time series data using python's pandas DataFrame.
Given that this time series has a value in the range of -10 to 10, we want to find out how many times it passes by 3.
In the simplest case, you can check if the values in the previous and current columns are small or large based on 3 to see if there are any changes.
Is there a function in pandas to help with this?
If you just want to find how many times 0 come out
use pd.count(axis=columns)
import pandas as pd
path = ('./test.csv')
dataframe = pd.read_csv(path,encoding='utf8')
print(dataframe.count(axis='columns'))
I have a scenario of the index being datetime objects and the data I want to plot are sales counts. Most of the time, there are multiple sales done throughout the day and each day can have different amount of sales. I would like to create a plot that shows a date range that nicely formats the xticklabels, depending on how many days I'd like to show in the plot. Kind of like this. I've tried different variants of code but have thus far been unsuccessful. Could someone look at my script below and please help me?
import pandas as pd
import matplotlib.pyplot as plt
index1 = ['2017-07-01','2017-07-01','2017-07-02','2017-07-02','2017-07-03','2017-07-03','2017-07-03']
index2 = pd.to_datetime(index1,format='%Y-%m-%d')
df = pd.DataFrame([[123456],[123789],[123654],[654321],[654987],[789456],789123]],columns=['Count'],index=index1)
df.plot(kind='box')
plt.show()
Use T, transpose and reshape your dataframe.
df.T.plot(kind='box', figsize=(10,7))
Output:
Okay to keep those dates as separate records and boxplot. Let's do this:
df.reset_index().set_index('index',append=True).unstack()['Count'].plot(kind='box',figsize=(10,7))
This is better.
df.set_index(np.arange(len(df)),append=True).unstack(0)['Count']\
.plot(kind='box',figsize=(10,7))
Output: