I have the following dataset:
dataset.head(7)
Transaction_date Product Product Code Description
2019-01-01 A 123 A123
2019-01-02 B 267 B267
2019-01-09 B 267 B267
2019-02-11 C 139 C139
2019-02-11 A 125 C125
2019-02-12 C 139 C139
2019-02-12 A 123 A123
The dataset stores transaction information, with a transaction date for each record. In other words, data is not available for every day.
Ultimately, I want to create a time series plot, showing me the number of transactions per day.
So far, I have done a simple countplot:
ax = sns.countplot(x=dataset["Transaction_date"],data=dataset)
This plot shows me the dates where a transaction happened. But I would also like to see the dates where no transaction happened, preferably shown as 0.
I have tried the following, but retrieve an error message:
groupbydate = dataset.groupby("Transaction_date")
ax = sns.tsplot(x="Transaction_date",y="Product",data=groupbydate.fillna(0))
But I get the error
cannot label index with a null key
Due to restrictions, I can only use seaborn 0.8.1
I believe reindex should work for you:
# First convert the index to datetime
dataset.index = pd.DatetimeIndex(dataset.index)
# Then reindex! You can also select the min and max of the index for the limits
dataset = dataset.reindex(pd.date_range("2019-01-01", "2019-02-12"))  # missing dates are filled with NaN by default
You can drop the rows containing NaN values using pandas.DataFrame.dropna, and then plot the chart. For example:
dataset.dropna(thresh=2)
will drop all rows where there are at least two NaN values.
You may also want to fill the NaN values using pandas.DataFrame.fillna
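Putting the pieces together for the transactions-per-day goal, here is a minimal sketch (using the sample data from the question; days without transactions are filled with 0 rather than dropped):

```python
import pandas as pd

# Sample data shaped like the question's dataset
dataset = pd.DataFrame({
    "Transaction_date": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-09",
         "2019-02-11", "2019-02-11", "2019-02-12", "2019-02-12"]
    ),
    "Product": list("ABBCACA"),
})

# Count transactions per day, then reindex over the full date range
counts = dataset["Transaction_date"].value_counts().sort_index()
full_range = pd.date_range(counts.index.min(), counts.index.max())
counts = counts.reindex(full_range, fill_value=0)
```

`counts.plot()` (or a seaborn bar plot of `counts`) then shows the time series with explicit zeros on the empty days.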
I have a data frame like this (it's just the head):
Timestamp Function_code Node_id Delta
0 2000-01-01 10:39:51.790683 Tx_PDO_2 54 551.0
1 2000-01-01 10:39:51.791650 Tx_PDO_2 54 601.0
2 2000-01-01 10:39:51.792564 Tx_PDO_3 54 545.0
3 2000-01-01 10:39:51.793511 Tx_PDO_3 54 564.0
There are only two types of Function_code : Tx_PDO_2 and Tx_PDO_3
I plot two windows, each a graph with Timestamp on the x-axis and Delta on the y-axis: one for Tx_PDO_2 and the other for Tx_PDO_3:
delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta", )
Now, I want to know which window corresponds to which Function_code
I tried to use title=delta_rx_tx_df.groupby("Function_code").groups but it did not work.
There may be a better way, but for starters, you can assign the titles to the plots after they are created:
plots = delta_rx_tx_df.groupby("Function_code").plot(x="Timestamp", y="Delta")
plots.reset_index()\
.apply(lambda x: x[0].set_title(x['Function_code']), axis=1)
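A possible alternative (a sketch under the same column names) is to iterate over the groups and pass each group's name directly as the plot title, avoiding the post-hoc assignment:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd

delta_rx_tx_df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2000-01-01 10:39:51.790683",
                                 "2000-01-01 10:39:51.791650",
                                 "2000-01-01 10:39:51.792564",
                                 "2000-01-01 10:39:51.793511"]),
    "Function_code": ["Tx_PDO_2", "Tx_PDO_2", "Tx_PDO_3", "Tx_PDO_3"],
    "Delta": [551.0, 601.0, 545.0, 564.0],
})

axes = {}
for name, group in delta_rx_tx_df.groupby("Function_code"):
    # each group gets its own figure, titled with its Function_code
    axes[name] = group.plot(x="Timestamp", y="Delta", title=name)
```

This creates one window per Function_code, each already titled, so there is no need to index into the Series of axes afterwards.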
I am trying to use PyCaret for time series, according to this tutorial.
My analysis did not work. When I created a new column
data['MA12'] = data['variable'].rolling(12).mean()
I got this new MA12 column with NA values only.
As a result, I decided to replicate the code from the tutorial using the AirPassengers dataset, but got the same issue.
When I print data, I get
Month Passengers MA12
0 1949-01-01 112 NaN
1 1949-02-01 118 NaN
2 1949-03-01 132 NaN
3 1949-04-01 129 NaN
4 1949-05-01 121 NaN
I would greatly appreciate any tips on what is going on here.
My only guess is that I use the default version of PyCaret and maybe need to install the full one. I tried that too - same result.
Since you want the mean of the previous 12 reads, the first 11 rows will be NaN: you need at least 12 rows before you get a 12-period moving average. You can see this in the link you provided - the MA chart doesn't start right away.
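A small sketch of that behavior; `min_periods` is an option if you would rather get values from the first row onward instead of NaN:

```python
import pandas as pd

s = pd.Series(range(1, 25))   # 24 data points

ma = s.rolling(12).mean()     # needs 12 reads before a value appears
# ma[0:11] are NaN; ma[11] is the mean of the first 12 values

ma_partial = s.rolling(12, min_periods=1).mean()
# no NaN: early windows average however many rows are available
```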
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates, with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it, the code adds a duplicate date, which then causes the reindex inside asfreq to raise the ValueError. However, I am not sure how to proceed. Is there an alternative, or does the assignment of today's date need changing?
Pandas has the asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
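For the follow-up "cannot reindex from a duplicate axis" error: asfreq requires a unique index, so if today's date is already present, the set_index('Date') step produces a duplicate. One possible fix (a sketch, assuming the Date/Portfoliovalue layout above) is to drop duplicate dates, keeping the latest row, before calling asfreq:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Date": ["2021-05-01", "2021-05-05", "2021-05-05"],  # duplicated date
    "Portfoliovalue": [50000.0, 52000.0, 52304.0],
})

df2["Date"] = pd.to_datetime(df2["Date"])
df2 = df2.drop_duplicates(subset="Date", keep="last")   # keep newest value per day
df2 = df2.set_index("Date").asfreq("D").reset_index()
```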
Pandas has the reindex method: given a list of indices, it keeps only the indices from that list, adding rows for new ones.
In your case, you can create all the dates you want (with date_range, for example) and pass them to reindex. You may need a set_index and reset_index around it, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with the full list of dates (given by date_range from the minimal to the maximal date in 'Date', with daily frequency) as the new index. This results in NaNs in the places that had no former value.
I have a pandas DataFrame with a DateTime index and two columns called 'text' and 'labels'. For rows whose labels value is 2 and whose DateTime index lies within a given range, I want to assign labels = 50.
I tried using,
df[df['labels']==2]['2017-02-01 05:03:25+00:00':'2017-02-01 05:05:55+00:00']['labels']=50
I am able to view the DataFrame filtered by DateTime index (rows) and columns, but not able to assign to it.
Also tried
df.loc[df['2017-03-13 00:00:00':'2017-03-23 00:00:00'], df['labels']==2]=50
but it threw an error
df looks like
created text labels
2017-02-01 05:03:25+00:00 break john cena eyelash grow 4
2017-02-01 05:05:55+00:00 eyelash tooooo much sweeti definit 2
2017-02-01 05:14:57+00:00 come eyelash 2
created is the DateTime index and 'text' and 'labels' are the columns of the DataFrame
df[df['labels']==2]['2017-02-01 05:03:25+00:00':'2017-02-01 05:05:55+00:00']['labels']
filters the DataFrame, but setting it equal to a value doesn't assign anything
On assigning the DataFrame for created between '2017-02-01 05:03:25+00:00':'2017-02-01 05:05:55+00:00' and labels =2 for labels=50 I expect the result to be like this
created text labels
2017-02-01 05:03:25+00:00 break john cena eyelash grow 4
2017-02-01 05:05:55+00:00 eyelash tooooo much sweeti definit 50
2017-02-01 05:14:57+00:00 come eyelash 2
Let us use get_level_values:
s=df.index.get_level_values(0)
m=(s>='2017-02-01 05:03:25+00:00') & (s<='2017-02-01 05:05:55+00:00')
df.loc[m&(df.labels==2),'labels']=50
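A runnable check of this approach on the sample frame (using >= so the range is inclusive, assigning to the 'labels' column, and keeping the timestamps timezone-aware as in the question so the string comparisons parse as UTC):

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ["break john cena eyelash grow",
              "eyelash tooooo much sweeti definit",
              "come eyelash"],
     "labels": [4, 2, 2]},
    index=pd.to_datetime(["2017-02-01 05:03:25+00:00",
                          "2017-02-01 05:05:55+00:00",
                          "2017-02-01 05:14:57+00:00"]),
)

s = df.index
m = (s >= "2017-02-01 05:03:25+00:00") & (s <= "2017-02-01 05:05:55+00:00")
# only rows inside the range AND with labels == 2 are updated
df.loc[m & (df["labels"] == 2), "labels"] = 50
```

Only the 05:05:55 row changes: the first row is in range but has labels = 4, and the third row has labels = 2 but is out of range.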
I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here: this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean, for example every 10 days: the mean between July 1 and July 10 becomes a point at July 5.
The long way would be: get the date range, separate it into N ranges with start and end dates, filter the data by date, calculate each mean, and put the results in another DataFrame.
Is there a short way to do that?
PS: Ignore the peaks
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And having taken the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
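The interval-mean idea from the question (fixed 10-day bins rather than a sliding window) maps directly onto resample. A sketch, assuming the date column parses with to_datetime (random sample data stands in for the question's numpy_data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2018-01-11", periods=100, freq="D"),
    "value": rng.normal(60, 15, size=100),
})

# One mean per 10-day bin; each bin is labeled with the start of its interval
binned = df.set_index("date")["value"].resample("10D").mean()
```

`binned.plot()` then draws one smoothed point per interval, which is exactly the "group by date intervals and take the mean" approach without the manual range splitting.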
Based on yatu's answer.
The problem with that answer is that rolling treats the window as a number of rows, not as a span of dates. With some transformations, rolling can read the Timestamp index and use time as the window [ pandas.rolling ]:
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
df['date'] = df.apply(lambda row: pd.Timestamp(row.date), axis=1 )
df = df.set_index(df.date).drop('date', axis=1)
df.sort_index(inplace=True)
df.rolling('10d').mean().plot( ylim=(30,100) , figsize=(16,5),grid='true')
Final results