I'm missing something really obvious or simply doing this wrong. I have two dataframes of similar structure and I'm trying to plot a time-series of the cumulative sum of one column from both. The dataframes are indexed by date:
df1
            value
2020-01-01   2435
2020-01-02  12847
...
2020-10-01  34751
The plot should be grouped by month and be a cumulative sum of the whole time range. I've tried:
line1 = df1.groupby(pd.Grouper(freq='1M')).value.cumsum()
line2 = df2.groupby(pd.Grouper(freq='1M')).value.cumsum()
and then plot, but it resets after each month. How can I change this?
I am guessing you want to take the cumulative sum over the whole range first, then group by month and take the mean (or some other summary) to represent the cumulative value for each month, and plot:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': np.random.randint(100, 200, 366)},
                   index=pd.date_range(start='1/1/2018', end='1/1/2019'))
df1.cumsum().groupby(pd.Grouper(freq='1M')).mean().plot()
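If what you want is the cumulative total as of each month's end rather than the monthly mean of the running sum, last() works the same way (a minimal variation on the code above, same assumptions):
# take the running total's last value in each month instead of its mean
df1.cumsum().groupby(pd.Grouper(freq='1M')).last().plot()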
I want to create a graph where each line represents a distinct label, as in my example picture. The data looks something like this, where the x-axis is the datetime and the y-axis is the count:
datetime, count, label
1656140642, 12, A
1656140643, 20, B
1656140645, 11, A
1656140676, 1, B
Because I have a lot of data, I want to aggregate it by 1 hour or even 1 day chunks.
I'm able to generate the above picture with
# df is the dataframe here, the result of pandas.read_csv
df.set_index("datetime").groupby("label")["count"].plot()
and I can get a time-range average with
df.set_index("datetime").groupby(pd.Grouper(freq='2min')).mean().plot()
but I'm unable to get both rules applied. Can someone point me in the right direction?
You can use the .pivot function to create a convenient structure where datetime is the index and the different labels are the columns, with count as the values.
df.set_index('datetime').pivot(columns='label', values='count')
output:
label A B
datetime
1656140642 12.0 NaN
1656140643 NaN 20.0
1656140645 11.0 NaN
1656140676 NaN 1.0
Now that you have your data in this format, you can perform a simple aggregation over the index (with groupby / resample / whatever suits you) and it will be applied to each column separately. Plotting the result then draws a different line for each column, as in the sketch below.
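For example, a minimal sketch putting both steps together (assuming the unix-second timestamps and column names from the sample data, and 1-hour chunks):
import pandas as pd

# parse the unix-second timestamps into real datetimes
df['datetime'] = pd.to_datetime(df['datetime'], unit='s')

# one column per label, count as values
pivoted = df.set_index('datetime').pivot(columns='label', values='count')

# aggregate into 1-hour chunks, then plot one line per column/label
pivoted.resample('1H').mean().plot()
Note that .pivot raises an error if the same (datetime, label) pair appears twice; if your data can contain duplicates, pivot_table with an explicit aggfunc avoids that.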
Through the following code, I get one year of history data for both the ETH and BTC prices. I know how to get the correlation of the two close columns over the full 12 months, but how do I get a trailing 30-day correlation coefficient for each day of the year and plot it?
def get_price(pair):
    df = binance.fetch_ohlcv(pair, timeframe="1d", limit=365)
    df = pd.DataFrame(df).rename(columns={0: "date", 1: "open", 2: "high", 3: "low", 4: "close", 5: "vol"})
    df.set_index("date", inplace=True)
    # use df.index here, not btc.index
    df.index = pd.to_datetime(df.index, unit="ms") + pd.Timedelta(hours=8)
    return df
eth=get_price("ETH/USDT")
btc=get_price("BTC/USDT")
btc["close"].corr(eth["close"])
I tried the following code, but I am not sure if it is correct:
btc["corre"]=btc["close"].rolling(30).corr(eth["close"].rolling(30))
You can group by month, deriving the month from your index, then subset the groupby to the two columns you want to correlate (here 'Val1' and 'Val2' stand in for your actual column names):
btc.groupby(btc.index.month)[['Val1','Val2']].corr()
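For the trailing 30-day part of the question specifically, a minimal sketch: Rolling.corr takes the other series directly, so pass the plain close series rather than a second rolling object.
# trailing 30-day correlation between the two close series, one value per day
btc["corre"] = btc["close"].rolling(30).corr(eth["close"])
btc["corre"].plot()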
Consider the following DataFrame df:
Date Kind
2018-09-01 13:15:32 Red
2018-09-02 16:13:26 Blue
2018-09-04 22:10:09 Blue
2018-09-04 09:55:30 Red
... ...
In which you have one column with datetime64[ns] dtype and another of object dtype that can take only a finite number of values (in this case, 2).
You have to plot a date histogram in which you have:
On the x-axis, the dates (per-day histogram showing month and day);
On the y-axis, the number of items belonging to that date, showing in a stacked bar the difference between Blue and Red.
How is it possible to achieve this using Matplotlib?
I was thinking to do a set_index and resample as follows:
df.set_index('Date', inplace=True)
df.resample('1d').count()
But I'm losing the information on the number of items per Kind. I also want to keep any missing day as zero.
Any help is much appreciated.
Use groupby, count and unstack to adjust the dataframe:
df2 = df.groupby(['Date', 'Kind'])['Kind'].count().unstack('Kind').fillna(0)
Next, re-sample the dataframe and sum the count for each day. This will also add any missing days that are not in the dataframe (as specified). Then adjust the index to only keep the date part.
df2 = df2.resample('D').sum()
df2.index = df2.index.date
Now plot the dataframe with stacked=True:
df2.plot(kind='bar', stacked=True)
Alternatively, the plt.bar() function can be used for the final plotting:
import matplotlib.pyplot as plt

cols = df['Kind'].unique()  # all distinct values of the original column
ind = range(len(df2))
p1 = plt.bar(ind, df2[cols[0]])
p2 = plt.bar(ind, df2[cols[1]], bottom=df2[cols[0]])
Here it is necessary to set the bottom argument of each part to be the sum of all the parts that came before.
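A sketch generalizing this to any number of Kind values, keeping a running bottom so each segment stacks on everything drawn before it (assumes df2 from above):
import numpy as np
import matplotlib.pyplot as plt

bottom = np.zeros(len(df2))
for col in df2.columns:
    plt.bar(range(len(df2)), df2[col], bottom=bottom, label=col)
    bottom += df2[col].to_numpy()  # accumulate the height drawn so far
plt.xticks(range(len(df2)), df2.index, rotation=90)
plt.legend()
plt.show()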
I am trying to develop a program to convert daily data into monthly or yearly data and so on.
I have a DataFrame with datetime index and price change %:
% Percentage
Date
2015-06-02 0.78
2015-06-10 0.32
2015-06-11 0.34
2015-06-12 -0.06
2015-06-15 -0.41
...
I had success grouping by some frequency. Then I tested:
df.groupby('Date').sum()
df.groupby('Date').cumsum()
Simple sums would work fine for ordinary values, but the problem is that percentages can't just be summed; they have to be compounded: (1+x0) * (1+x1) * ... - 1. Then I tried:
def myfunc(values):
    p = 0
    for val in values:
        p = (1 + p) * (1 + val) - 1
    return p

df.groupby('Date').apply(myfunc)
I can't understand how apply() works. It seems to apply my function to every row individually and not to the grouped items.
Your apply is applying to all rows individually because you're grouping by the date column. Your date column looks to have unique values for each row, so each group has only one row in it. You need to use a Grouper to group by month, then use cumprod and get the last value for each group:
# make sure Date is a datetime
df["Date"] = pd.to_datetime(df["Date"])
# add one to percentages
df["% Percentage"] += 1
# use cumprod on each month group, take the last value, and subtract 1
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].apply(lambda g: g.cumprod().iloc[-1] - 1)
Note, though, that this applies the percentage growth as if the steps between your rows were evenly spaced, but it looks like sometimes the gap is 8 days and sometimes 1 day. You may need to do some clean-up depending on the result you want.
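As a side note, since only each month's final compounded value is needed, prod() gives the same result without the explicit cumprod/iloc step (a minimal equivalent, assuming the +1 shift above has already been applied):
# product of (1 + pct) per month, minus 1, equals the compounded return
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].prod() - 1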
I have dataframes of 1 minute bars going back years (the datetime is the index). I need to get a set of bars covering an irregular (non-consecutive) long list of dates.
For daily bars, I could do something like this:
datelist = ['20140101','20140205']
dfFiltered = df[df.index.isin(datelist)]
However if I try that on 1 minute bar data, it only gives me the bars with time 00:00:00, e.g. in this case it gives me two bars for 20140101 00:00:00 and 20140205 00:00:00.
My actual source df will look something like:
df1m = pd.DataFrame(index=pd.date_range('20100101', '20140730', freq='1min'),
data={'open':3, 'high':4, 'low':1, 'close':2}
).between_time('00:00:00', '07:00:00')
Is there any better way to get all the bars for each day in the list than looping over the list? Thanks in advance.
One way is to add a date column based on the index
df1m['date'] = pd.to_datetime(df1m.index.date)
Then use that column when filtering
datelist = ['20140101','20140205']
df1m[df1m['date'].isin(datelist)]
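Alternatively, a sketch that avoids the extra column by normalizing the index on the fly (normalize() truncates each timestamp to midnight, so it can be compared against whole dates):
datelist = pd.to_datetime(['20140101', '20140205'])
df1m[df1m.index.normalize().isin(datelist)]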