How would I grab the very last value of a time series?
I have a df with time-series info for many countries that tracks several variables and does some simple averaging, etc.
I just want to grab the most recent value/values for each country and graph it with plotly. I have tried using .last(), but I'm not really sure where to fit it into the loop.
I need to grab both the last value for one chart, and the last n values for another chart.
# Daily Change
import plotly.graph_objs as go
import plotly.offline as pyo

country = "X"

# Plot rolling average new cases
data = [go.Scatter(x=df_join.loc[f'{country}']['Date'],
                   y=df_join.loc[f'{country}']['Pct Change'],
                   mode='lines',
                   name='Pct Change')]
layout = go.Layout(title=f'{country}: Pct Change')
fig = go.Figure(data=data, layout=layout)
pyo.plot(fig)
IIUC you need to filter your dataframe beforehand:
import pandas as pd

dates = pd.date_range(pd.Timestamp('today'), pd.Timestamp('today') + pd.DateOffset(days=5))
df = pd.DataFrame({'Date': dates, 'ID': ['A', 'A', 'A', 'B', 'B', 'B']})
df2 = df.loc[df.groupby(['ID'])['Date'].idxmax()]  # row holding the max Date per ID
print(df2)
Date ID
2 2020-05-16 12:26:06.772939 A
5 2020-05-19 12:26:06.772939 B
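If you also need the last n values per group (for the second chart), the same toy frame works with groupby().tail() on the sorted data — a minimal sketch, with n=2 as an arbitrary choice:
df_lastn = df.sort_values('Date').groupby('ID').tail(2)  # last 2 rows per ID
print(df_lastn)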
I'm using the omicron dataset from Kaggle, and I wanted to make a seaborn lineplot of omicron cases in Czechia over time.
I did this, but the x axis label is unreadable, since every single day is put on there. Could you help me sort the dataframe, so that I could visualize only the summed cases for each month of every year?
Here's my code so far:
data = "..input/omicron-covid19-variant-daily-cases/covid-variants.csv"
data = data[data.variant.str.contains("Omicron")] # a mask with only Omicron cases
data = data[data.location.str.contains("Czech")] # mask only with cases from Czech republic
plt.figure(figsize=(10, 10))
plt.title("Omicron in Czech Republic")
plt.ylabel("Number of cases")
sns.lineplot(x=data.date, y=data.num_sequences_total)
I tried something with the groupby() method, but I only generated a series with two index levels, both named "date", and I don't know what to do next.
test = data
test.date = pd.to_datetime(data.date)
test = data.groupby([data.date.dt.year, data.date.dt.month]).num_sequences_total.sum()
test.head()
Please help me figure this out, I'm stuck 🥲
I always use this to group by year and month.
Example:
GB = DF.groupby([DF.index.year, DF.index.month]).sum()
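Applied to the omicron frame from the question (the column names date and num_sequences_total come from there), here is a minimal sketch of the same idea using dt.to_period, which keeps a single year-month index instead of a two-level one:
data["date"] = pd.to_datetime(data["date"])
monthly = data.groupby(data["date"].dt.to_period("M"))["num_sequences_total"].sum()
monthly.plot(kind="bar")  # one readable bar per year-month
plt.show()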
I have a dataset of S&P500 historical prices with the date, the price and other data that i don't need now to solve my problem.
Date Price
0 1981.01 6.19
1 1981.02 6.17
2 1981.03 6.24
3 1981.04 6.25
. . .
and so on till 2020
The date is a float with the year, a dot and the month.
I tried to plot all historical prices with matplotlib.pyplot as plt.
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result. I used df["Price"].tail(100) so you can better see the difference between the first and the second graph (you'll see it in a second).
But then I tried to change the index from the default one (0, 1, 2, etc.) to the df["Date"] column, in order to see the date on the x axis.
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
This is the result, and it's quite disappointing.
I have the Date where it should be, on the x axis, but the problem is that the graph is different from the one before, which is the right one.
If you need the dataset to try out the problem, you can find it here.
It is called U.S. Stock Markets 1871-Present and CAPE Ratio.
Hope you've understood everything.
Thanks in advance
UPDATE
I found something that could cause the problem. If you look closely at the dates, you can see that month #10 is written (in the original dataset) as a float like this: for example, October 1884 becomes 1884.1. The problem occurs when you use pd.to_datetime() to convert the float Date series to a datetime: 1884.1 (October 1884) becomes 1884-01-01, the first month of the year, and that distorts the final plot.
SOLUTION
Finally, I solved my problem!
Yes, the error was the one I explained in the UPDATE paragraph, so I decided to append a "0" (as a string) wherever the length of the Date string is 6, in order to change, for example, 1884.1 ==> 1884.10
df["len"] = df["Date"].apply(len)
df["Date"] = df["Date"].where(df["len"] == 7, df["Date"] + "0")
Then I drop the len column I've just created.
df.drop(columns="len", inplace=True)
At the end I converted "Date" to a datetime with pd.to_datetime:
df["Date"] = pd.to_datetime(df["Date"], format='%Y.%m')
df = df.set_index("Date")
And then I plot
df["Price"].tail(100).plot()
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
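For what it's worth, the padding step can also be written in one line with Series.str.ljust, assuming Date is already a string as above:
df["Date"] = pd.to_datetime(df["Date"].str.ljust(7, "0"), format='%Y.%m')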
The easiest way would be to transform the date into an actual datetime index. This way matplotlib will automatically pick it up and plot it accordingly. For example, given your date format, you could do:
df["Date"] = pd.to_datetime(df["Date"].astype(str), format='%Y.%m')
df = df.set_index("Date")
plt.plot(df["Price"].tail(100))
Currently, the first plot you showed is actually plotting the Price column against the index, which seems to be a regular range index from 0 to 1800-something. Your data starts in 1981, so each observation is evenly spaced on the x axis (at an interval of 1, the jump from one index value to the next). That's why the chart looks reasonable, even though the x-axis values don't.
Now when you set the Date (as a float) to be the index, note that you're not evenly covering the interval between, for example, 1981 and 1982: you have evenly spaced values from 1981.1 to 1981.12, but nothing between 1981.12 and 1982. That's why the second chart looks distorted, even though matplotlib is plotting exactly what it was given. Setting the index to a DatetimeIndex as described above should remove this issue, as matplotlib will know how to evenly space the dates along the x-axis.
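To see the uneven spacing concretely, here is a quick sketch with made-up float dates:
import numpy as np
float_dates = np.array([1981.10, 1981.11, 1981.12, 1982.01])  # Oct, Nov, Dec, Jan as floats
print(np.diff(float_dates))  # ~[0.01, 0.01, 0.89]: the Dec-to-Jan step is 89x wider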
I think your problem is that your Date is of float type, and taking it as the x axis does exactly what you would expect for an array like [2012.01, 2012.02, ..., 2012.12, 2013.01, ...]. You might convert the Date column to a DatetimeIndex first and then use the built-in pandas plot method:
df["Price"].tail(100).plot()
It is not a good idea to treat df['Date'] as a float. It should be converted to pandas datetime64[ns], which can be done with the pd.to_datetime method.
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('ie_data.csv')
df = df[['Date', 'Price']]
df.dropna(inplace=True)

# converting to pandas datetime format
# pad a single-digit month fragment so 1884.1 becomes 188410 (October), not 18841 (January)
df['Date'] = df['Date'].astype(str).map(lambda x: x.split('.')[0] + x.split('.')[1].ljust(2, '0'))
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m')
df.set_index(['Date'], inplace=True)
#plotting
df.plot() #full data plot
df.tail(100).plot() #plotting just the tail
plt.title("S&P500 Composite Historical Data")
plt.xlabel("Date")
plt.ylabel("Price")
plt.show()
Output: [two plots: the full series and the last 100 rows]
Consider the following DataFrame df:
Date Kind
2018-09-01 13:15:32 Red
2018-09-02 16:13:26 Blue
2018-09-04 22:10:09 Blue
2018-09-04 09:55:30 Red
... ...
In which you have a column with a datetime64[ns] dtype and another of np.object dtype which can assume only a finite number of values (in this case, 2).
You have to plot a date histogram in which you have:
On the x-axis, the dates (per-day histogram showing month and day);
On the y-axis, the number of items belonging to that date, showing in a stacked bar the difference between Blue and Red.
How is it possible to achieve this using Matplotlib?
I was thinking to do a set_index and resample as follows:
df.set_index('Date', inplace=True)
df.resample('1d').count()
But I'm losing the information on the number of items per Kind. I also want to keep any missing day as zero.
Any help is very appreciated.
Use groupby, count and unstack to adjust the dataframe:
df2 = df.groupby(['Date', 'Kind'])['Kind'].count().unstack('Kind').fillna(0)
Next, re-sample the dataframe and sum the count for each day. This will also add any missing days that are not in the dataframe (as specified). Then adjust the index to only keep the date part.
df2 = df2.resample('D').sum()
df2.index = df2.index.date
Now plot the dataframe with stacked=True:
df2.plot(kind='bar', stacked=True)
Alternatively, the plt.bar() function can be used for the final plotting:
cols = df['Kind'].unique() # Find all original values in the column
ind = range(len(df2))
p1 = plt.bar(ind, df2[cols[0]])
p2 = plt.bar(ind, df2[cols[1]], bottom=df2[cols[0]])
Here it is necessary to set the bottom argument of each part to be the sum of all the parts that came before.
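For more than two Kind values, the same bottom logic generalizes with a running total — a sketch reusing df2 and ind from above:
import numpy as np
bottom = np.zeros(len(df2))
for col in df2.columns:  # stack each Kind on top of the previous ones
    plt.bar(ind, df2[col], bottom=bottom, label=col)
    bottom += df2[col].to_numpy()
plt.legend()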
Hi there! My dataset is as follows:
username switch_state time
abcd sw-off 07:53:15 +05:00
abcd sw-on 07:53:15 +05:00
Now, using this, I need to find how many times in a given day the switch state is manipulated, i.e. switched on or off. My test code is given below:
switch_off = df.loc[df['switch_state'] == 'sw-off']  # only the off switches
groupy_result = switch_off.groupby(['time', 'username']).count()['switch_state'].unstack()  # count per time and username on a given day
The result of this groupby clause is given as:
print(groupy_result)
username abcd
time
05:08:35 3
07:53:15 3
07:58:40 1
Now, as you can see, the count is attached to the time index. I need to separate them so that I can plot the data using a Seaborn scatter plot. I need x and y values, which in my case will be x=time, y=count.
Kindly help me out: how can I plot this column?
You can try the following to get the data as a DataFrame itself
df = df.loc[df['switch_state']=='sw-off']
df['count'] = df.groupby(['username','time'])['username'].transform('count')
These two lines of code will give you an updated dataframe df with a new column called count.
df = df.drop_duplicates(subset=['username', 'time'], keep='first')
The above line will remove the duplicate rows. Then you can plot df['time'] and df['count'].
plt.scatter(df['time'], df['count'])
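An equivalent route that skips the transform/drop-duplicates pair is groupby().size() — a sketch assuming the original frame and the column names from the question:
import seaborn as sns
switch_off = df.loc[df['switch_state'] == 'sw-off']
counts = switch_off.groupby(['time', 'username']).size().reset_index(name='count')
sns.scatterplot(x='time', y='count', data=counts)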
I am trying to plot two stock prices on an index plot. This kind of plot is very common, as it starts both stocks, which have different prices, at the same place.
See below for a chart of IBM vs. TSLA
import numpy as np
import pandas as pd
from pandas_datareader import data as wb

def get_historical_closes(ticker, start_date, end_date):
# get the data for the tickers. This will be a panel
p = wb.DataReader(ticker, "yahoo", start_date, end_date)
# convert the panel to a DataFrame and select only Adj Close
# while making all index levels columns
d = p.to_frame()['Adj Close'].reset_index()
# rename the columns
d.rename(columns={'minor': 'Ticker', 'Adj Close': 'Close'}, inplace=True)
# pivot each ticker to a column
pivoted = d.pivot(index='Date', columns='Ticker')
# and drop the one level on the columns
pivoted.columns = pivoted.columns.droplevel(0)
return pivoted
tickers = ['IBM','TSLA']
start = '2015-12-31'
end ='2016-12-22'
df_ret = get_historical_closes(tickers, start, end).pct_change().fillna(0)  # fill the leading NaN with 0
df_ret = np.cumprod(1 + df_ret)
df_ret.plot()
As you can see, both start at 1.00.
What I would like to do is to have the convergence at 1.00 be at some arbitrary point in the date index. For example, I would like to see the same chart, except that the lines converge at 1 on July 31, 2016. Thus, offsetting the index convergence at a given point.
Does anyone have any idea how to accomplish this?
I was trying to make it more difficult than it actually needs to be. See below:
df_day = df_ret[df_ret.index == '2016-03-31']  # the row to rebase on
df_plot = pd.DataFrame(index=df_ret.index, columns=df_ret.columns)
for col in df_ret.columns:  # for each ticker
    df_plot[col] = df_ret[col] / df_day[col].values
df_plot.plot()
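The loop can also be collapsed into a single vectorized division, since dividing a DataFrame by a Series aligns on column labels — a sketch using the same rebase date:
base = df_ret[df_ret.index == '2016-03-31'].iloc[0]  # the rebase row as a Series
df_plot = df_ret / base
df_plot.plot()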