I have a large data set with names of stores, dates and profits.
My data set is not the most organized, but I now have it in this df:
df
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
I proudly created a function to get each day into one df by itself, until I realized it would be very time-consuming to do one for each day.
def avg(n):
    return df.loc[df['Date'] == "May" + " " + str(n) + " " + str(2018)]
where n would be the date I want to get, so that function gets me just the dates I want.
What I really need is a way to pass all the dates I want as a list and append each day's rows to a DataFrame. I tried this, but it did not work out:
def avg(n):
    dlist = []
    for i in n:
        dlist = df.loc[df['Date'] == "May" + " " + str(i) + " " + str(2018)]
        dlist = pd.DataFrame(dlist)
        dlist.append(i)
    return dlist
df2=avg([21,23,24,25])
My goal was to get all the May dates in (21, 23, 24, 25), each into its own df.
But it was a total failure; I got this error:
cannot concatenate object of type "<class 'int'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
I am not sure if it's also possible to add a rolling average or mean column for each day of (21, 23, 24, 25), but that's where the analysis will conclude.
output desired
Store Date Profit Rolling Mean
ABC May 1 2018 234 250
XYZ May 1 2018 410 401
AZY May 1 2018 145 415
where the rolling mean is for the past 30 days. Above all, I would like to have each day in its own df so I can save it to a csv file at the end.
Rolling Mean:
The example data given in the question has dates in the format May 1 2018, which can't be used for rolling; rolling requires a datetime index.
Instead of string-splitting the original Date column, it should be converted to datetime using df.Date = pd.to_datetime(df.Date), which will give dates in the format 2018-05-01.
With a properly formatted datetime column, use df['Day'] = df.Date.dt.day and df['Month'] = df.Date.dt.month_name() to get a Day and Month column, if desired.
Given the original data:
Original Data:
Store Date Profit
ABC May 1 2018 234
XYZ May 1 2018 410
AZY May 1 2018 145
ABC May 2 2018 234
XYZ May 2 2018 410
AZY May 2 2018 145
Transformed Original Data:
df.Date = pd.to_datetime(df.Date)
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month_name()
Store Date Profit Day Month
ABC 2018-05-01 234 1 May
XYZ 2018-05-01 410 1 May
AZY 2018-05-01 145 1 May
ABC 2018-05-02 234 2 May
XYZ 2018-05-02 410 2 May
AZY 2018-05-02 145 2 May
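With Day and Month columns in place, the original goal of pulling several specific days no longer needs a loop; a boolean mask with isin does it in one step (a sketch against the transformed df above; days_wanted is just an illustrative name):
# all rows for May 21, 23, 24 and 25 in a single selection
days_wanted = [21, 23, 24, 25]
subset = df[(df.Month == 'May') & (df.Day.isin(days_wanted))]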
Rolling Example:
The example dataset is insufficient to produce a 30-day rolling average.
In order to have a 30-day rolling mean, there needs to be at least 30 days of data for each store (the first mean appears once a store has 30 observations; everything before that is NaN).
The following example will set up a dataframe consisting of every day in 2018, a random profit between 100 and 1000, and a random store chosen from ['ABC', 'XYZ', 'AZY'].
Extended Sample:
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = list(np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime))
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
Store Date Profit
ABC 2018-01-01 901
AZY 2018-01-02 540
AZY 2018-01-03 417
XYZ 2018-01-04 280
XYZ 2018-01-05 384
XYZ 2018-01-06 104
XYZ 2018-01-07 691
ABC 2018-01-08 376
XYZ 2018-01-09 942
XYZ 2018-01-10 297
df.set_index('Date', inplace=True)
df_rolling = df.groupby(['Store']).rolling(30).mean()
df_rolling.rename(columns={'Profit': '30-Day Rolling Mean'}, inplace=True)
df_rolling.reset_index(inplace=True)
df_rolling.head():
Note that the first 29 values for each store will be NaN; the first mean appears at a store's 30th observation.
Store Date 30-Day Rolling Mean
ABC 2018-01-01 NaN
ABC 2018-01-03 NaN
ABC 2018-01-07 NaN
ABC 2018-01-11 NaN
ABC 2018-01-13 NaN
df_rolling.tail():
Store Date 30-Day Rolling Mean
XYZ 2018-12-17 556.966667
XYZ 2018-12-18 535.633333
XYZ 2018-12-19 534.733333
XYZ 2018-12-24 551.066667
XYZ 2018-12-27 572.033333
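Those NaN warm-up rows can be dropped before plotting if they get in the way (assuming the df_rolling built above):
# keep only rows where the 30-day mean is defined
df_rolling = df_rolling.dropna(subset=['30-Day Rolling Mean'])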
Plot:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
g = sns.lineplot(x='Date', y='30-Day Rolling Mean', data=df_rolling, hue='Store')
for item in g.get_xticklabels():
item.set_rotation(60)
plt.show()
Alternatively: A dataframe for each store:
It's also possible to create a separate dataframe for each store and put it inside a dict
This alternative makes it easier to plot a more detailed graph with less code.
import pandas as pd
import random
import numpy as np
from datetime import datetime, timedelta
list_of_dates = list(np.arange(datetime(2018, 1, 1), datetime(2019, 1, 1), timedelta(days=1)).astype(datetime))
df = pd.DataFrame({'Store': [random.choice(['ABC', 'XYZ', 'AZY']) for _ in range(365)],
                   'Date': list_of_dates,
                   'Profit': [np.random.randint(100, 1001) for _ in range(365)]})
df_dict = dict()
for store in df.Store.unique():
    df_dict[store] = df[['Date', 'Profit']][df.Store == store]
    df_dict[store].set_index('Date', inplace=True)
    df_dict[store]['Profit: 30-Day Rolling Mean'] = df_dict[store]['Profit'].rolling(30).mean()
print(df_dict.keys())
>>> dict_keys(['ABC', 'XYZ', 'AZY'])
print(df_dict['ABC'].head())
Plot:
import matplotlib.pyplot as plt
_, axes = plt.subplots(1, 1, figsize=(13, 8), sharex=True)
for k, v in df_dict.items():
    axes.plot(v['Profit'], marker='.', linestyle='-', linewidth=0.5, label=k)
    axes.plot(v['Profit: 30-Day Rolling Mean'], marker='o', markersize=4, linestyle='-', linewidth=0.5, label=f'{k} Rolling')
axes.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel('Profit ($)')
plt.xlabel('Date')
plt.title('Recorded Profit vs. 30-Day Rolling Mean of Profit')
plt.show()
Get a dataframe for a specific month:
Recall, this is randomly generated data, so the stores don't have data for every day of the month.
may_df = dict()
for k, v in df_dict.items():
    v.reset_index(inplace=True)
    may_df[k] = v[v.Date.dt.month_name() == 'May']
    may_df[k].set_index('Date', inplace=True)
print(may_df['XYZ'])
Plot: May data only:
Save dataframes:
pandas.DataFrame.to_csv()
may_df is a dict of dataframes, so write each store's May data to its own file:
for store, store_df in may_df.items():
    store_df.reset_index().to_csv(f'may_{store}.csv', index=False)
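The question also asked for each day in its own df, saved to csv; a sketch of that with groupby on the day of the Date index (file names are just illustrative):
# one CSV per store per day of May
for store, store_df in may_df.items():
    for day, day_df in store_df.groupby(store_df.index.day):
        day_df.to_csv(f'may_{day:02d}_{store}.csv')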
A simple solution may be groupby().
Check out this example:
import pandas as pd
listt = [['a', 2, 3],
         ['b', 5, 7],
         ['a', 3, 9],
         ['a', 1, 3],
         ['b', 9, 4],
         ['a', 4, 7],
         ['c', 7, 2],
         ['a', 2, 5],
         ['c', 4, 7],
         ['b', 5, 5]]
my_df = pd.DataFrame(listt)
my_df.columns = ['Class', 'Day_1', 'Day_2']
my_df.groupby('Class')['Day_1'].mean()
Output:
Class
a 2.400000
b 6.333333
c 5.500000
Name: Day_1, dtype: float64
Note: similarly, you can group your data by Date and get the average of your Profit, as shown in the sketch below.
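Applied to the question's data, the same pattern gives the average Profit per Date (a sketch, assuming the df from the question):
# mean Profit for each Date across all stores
df.groupby('Date')['Profit'].mean()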
I have the following pandas dataframe, which consist of datetime timestamps and user ids:
id datetime
130 2018-05-17 19:46:18
133 2018-05-17 20:59:57
131 2018-05-17 21:54:01
142 2018-05-17 22:49:07
114 2018-05-17 23:02:34
136 2018-05-18 06:06:48
324 2018-05-18 12:21:38
180 2018-05-18 12:49:33
120 2018-05-18 14:03:58
120 2018-05-18 15:28:36
How can I plot the id on the y axis and the day or minutes on the x axis? I tried:
plt.plot(df3['datetime'], df3['id'], '|')
plt.xticks(rotation='vertical')
However, I have two problems: my dataframe is quite large and I have multiple ids, and I wasn't able to arrange each label on the y axis and plot it against its datetime value on the x axis. Any idea how to do something like this:
The whole objective of this plot is to visualize the logins over time for each specific user.
Something like this?
X axis: date, Y axis: id
from datetime import date
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
# set your data as df
# strip only YYYY-mm-dd part from original `datetime` column
df.datetime = df.datetime.apply(lambda x: str(x)[:10])
df.datetime = df.datetime.apply(lambda x: date(int(x[:4]), int(x[5:7]), int(x[8:10])))
# plot
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(df.datetime, df.id, '|')
plt.gcf().autofmt_xdate()
Output:
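As an aside, the two string-slicing lines can likely be replaced with pandas' own parser (a sketch over the same df):
# parse the full timestamp, then keep only the date part
df.datetime = pd.to_datetime(df.datetime).dt.date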
I know there are similar questions that have already been answered. However, I can't seem to troubleshoot why none of the solutions are working for me.
My sample dataset:
TimeStamp 340 341 342
10:27:00 1.953036 2.110234 1.981548
10:28:00 1.973408 2.046361 1.806923
10:29:00 0.000000 0.000000 0.014881
10:30:00 2.567976 3.169928 3.479591
I want to find the mean of the data every two minutes for each column. While df.groupby promises a neat solution, it makes my TimeStamp column disappear for some reason. Any help is greatly appreciated.
Expected output:
TimeStamp 340 341 342
10:27:30 1.963222 2.078298 1.894235
10:29:30 1.283988 1.584964 1.747236
Attempted code:
import pandas as pd
import numpy as np
path = '/Users/username/Desktop/Model/'
file1 = 'filename.csv'
df = pd.read_csv(path + file1, skipinitialspace = True)
df['TimeStamp'] = pd.to_timedelta(df['TimeStamp'])
df['TimeStamp'] = df['TimeStamp'].dt.floor('min')
df.set_index('TimeStamp')
rowF = len(df['TimeStamp'])
# Average every two min
newdf = df.groupby(np.arange(len(df.index))//2).mean()
print(newdf)
Set the time as index:
df.set_index(pd.to_timedelta(df.TimeStamp), inplace=True)
And then use resample and aggregate every two minutes:
df.resample("2min").mean().reset_index()
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
#2 10:31:00 NaN NaN NaN
Drop the last observation with iloc:
df.resample("2min").mean().reset_index().iloc[:-1]
# TimeStamp 340 341 342
#0 10:27:00 1.963222 2.078298 1.894235
#1 10:29:00 1.283988 1.584964 1.747236
If you prefer to shift the TimeStamp by 30 seconds:
(df.resample("2min").mean().reset_index()
.assign(TimeStamp = lambda x: x.TimeStamp + pd.Timedelta('30 seconds'))
.iloc[:-1])
# TimeStamp 340 341 342
#0 10:27:30 1.963222 2.078298 1.894235
#1 10:29:30 1.283988 1.584964 1.747236
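Incidentally, the likely reason TimeStamp disappeared in the attempted code: df.set_index('TimeStamp') returns a new frame rather than modifying df in place, so TimeStamp was still an ordinary (non-numeric) column when groupby(...).mean() ran, and mean() drops it. A minimal repair of that pairwise approach, reusing the imports from the question (a sketch; note it averages pairs of rows rather than true 2-minute windows):
# assign the result; set_index is not in-place by default
df = df.set_index('TimeStamp')
# pair up consecutive rows and average each pair
newdf = df.groupby(np.arange(len(df)) // 2).mean()
# label each pair with its first timestamp
newdf.index = df.index[::2]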
I understand this must be a very basic question, but oddly enough, the resources I've read online don't seem very clear on how to do the following:
How can I index specific columns in pandas?
For example, after importing data from a csv, I have a pandas Series object with individual dates, along with a corresponding dollar amount for each date.
Now, I'd like to group the dates by month (and add their respective dollar amounts for that given month). I plan to create an array where the indexing column is the month, and the next column is the sum of dollar amounts for that month. I would then take this array and create another pandas Series object out of it.
My problem is that I can't seem to call the specific columns from the current pandas series object I have.
Any help?
Edited to add:
from pandas import Series
from matplotlib import pyplot
import numpy as np
series = Series.from_csv('FCdata.csv', header=0, parse_dates = [0], index_col =0)
print(series)
pyplot.plot(series)
pyplot.show() # this successfully plots the x-axis (date) with the y-axis (dollar amount)
dates = series[0] # this is where I try to call the column, but with no luck
This is what my data looks like in a csv:
Dates Amount
1/1/2015 112
1/2/2015 65
1/3/2015 63
1/4/2015 125
1/5/2015 135
1/6/2015 56
1/7/2015 55
1/12/2015 84
1/27/2015 69
1/28/2015 133
1/29/2015 52
1/30/2015 91
2/2/2015 144
2/3/2015 114
2/4/2015 59
2/5/2015 95
2/6/2015 72
2/9/2015 73
2/10/2015 119
2/11/2015 133
2/12/2015 128
2/13/2015 141
2/17/2015 105
2/18/2015 107
2/19/2015 81
2/20/2015 52
2/23/2015 135
2/24/2015 65
2/25/2015 58
2/26/2015 144
2/27/2015 102
3/2/2015 95
3/3/2015 98
You are reading the CSV file into a Series. A Series is a one-dimensional object - there are no columns associated with it. You see the index of that Series (dates) and probably think that's another column but it's not.
You have two alternatives: convert it to a DataFrame (by calling reset_index() or to_frame()), or use it as a Series.
series.resample('M').sum()
Out:
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
Freq: M, Name: Amount, dtype: int64
Since you already have an index formatted as date, grouping by month with resample is very straightforward so I'd suggest keeping it as a Series.
However, you can always convert it to a DataFrame with:
df = series.to_frame('Value')
Now, you can use df['Value'] to select that single column. Resampling can be done both on the DataFrame and the Series:
df.resample('M').sum()
Out:
Value
Dates
2015-01-31 1040
2015-02-28 1927
2015-03-31 193
And you can access the index if you want to use that in plotting:
series.index # df.index would return the same
Out:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
'2015-01-05', '2015-01-06', '2015-01-07', '2015-01-12',
'2015-01-27', '2015-01-28', '2015-01-29', '2015-01-30',
'2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
'2015-02-06', '2015-02-09', '2015-02-10', '2015-02-11',
'2015-02-12', '2015-02-13', '2015-02-17', '2015-02-18',
'2015-02-19', '2015-02-20', '2015-02-23', '2015-02-24',
'2015-02-25', '2015-02-26', '2015-02-27', '2015-03-02',
'2015-03-03'],
dtype='datetime64[ns]', name='Dates', freq=None)
Note: For basic time-series charts, you can use pandas' plotting tools.
df.plot() charts the raw daily amounts, and df.resample('M').sum().plot() charts the monthly totals.
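For completeness, a minimal sketch of those two plotting calls (assuming the series from the question):
import matplotlib.pyplot as plt
series.plot()                      # raw daily amounts
series.resample('M').sum().plot()  # monthly totals
plt.show()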
I'm trying to find out whether it is possible to use data.asfreq(MonthEnd()) with data that was not created via date_range.
What I'm trying to achieve: I run a csv query with the following code:
import numpy as np
import pandas as pd
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True)
data.columns = ["period", "integ"]
data['period'] = pd.to_datetime(data['period'], infer_datetime_format=True)
Then I want to assign frequency to my 'period' column by doing this:
tdelta = data.period[1] - data.period[0]
data.period.freq = tdelta
And some print comands:
print(data)
print(data.period.freq)
print(data.dtypes)
Returns:
..........
270 1948-07-01 2033.2
271 1948-04-01 2021.9
272 1948-01-01 1989.5
273 1947-10-01 1960.7
274 1947-07-01 1930.3
275 1947-04-01 1932.3
276 1947-01-01 1934.5
[277 rows x 2 columns]
-92 days +00:00:00
period datetime64[ns]
integ float64
dtype: object
I can also parse the original 'DATE' column by making it 'index':
data = pd.read_csv("https://www.quandl.com/api/v3/datasets/FRED/GDPC1.csv?api_key=", parse_dates=True, index_col='DATE')
What I want to do is just convert the quarterly data into monthly rows. For example:
270 1948-07-01 2033.2
271 1948-06-01 NaN
272 1948-05-01 NaN
273 1948-04-01 2021.9
274 1948-03-01 NaN
275 1948-02-01 NaN
276 1948-01-01 1989.5
......and so on.......
I'm eventually trying to do this using ts.asfreq(MonthBegin()) and ts.asfreq(MonthBegin(), method='pad'), so far unsuccessfully. I get the following error:
NameError: name 'MonthBegin' is not defined
My question is: can I use asfreq if I don't use date_range to create the frame, somehow 'passing' my date column to the function? If not, is there any other easy way to convert quarterly to monthly frequency?
Use a TimeGrouper:
import pandas as pd
periods = ['1948-07-01', '1948-04-01', '1948-01-01', '1947-10-01',
           '1947-07-01', '1947-04-01', '1947-01-01']
integs = [2033.2, 2021.9, 1989.5, 1960.7, 1930.3, 1932.3, 1934.5]
df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period')
df = df.groupby(pd.TimeGrouper('MS')).sum().sort_index(ascending=False)  # in modern pandas, use pd.Grouper(freq='MS')
EDIT: You can also use resample instead of a TimeGrouper:
df.resample('MS').sum().sort_index(ascending=False)
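As for the NameError in the question: MonthBegin simply has to be imported before it can be used. With the quarterly dates as a sorted DatetimeIndex, the asfreq approach then works as intended (a sketch reusing periods and integs from above):
from pandas.tseries.offsets import MonthBegin

df = pd.DataFrame({'period': pd.to_datetime(periods), 'integ': integs})
df = df.set_index('period').sort_index()  # asfreq needs a sorted DatetimeIndex
monthly = df.asfreq(MonthBegin())  # quarterly -> monthly, NaN for the new months
monthly_padded = df.asfreq(MonthBegin(), method='pad')  # or carry each quarter's value forward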