I'm just getting started with Pandas and am trying to combine: Grouping my data by date, and counting the unique values in each group.
Here's what my data looks like:
User, Type
Datetime
2014-04-15 11:00:00, A, New
2014-04-15 12:00:00, B, Returning
2014-04-15 13:00:00, C, New
2014-04-20 14:00:00, D, New
2014-04-20 15:00:00, B, Returning
2014-04-20 16:00:00, B, Returning
2014-04-20 17:00:00, D, Returning
And here's what I would like to get to: Resample the datetime index to the day (which I can do), and also count the unique users for each day.
I'm not interested in the 'Type' column yet.
Day, Unique Users
2014-04-15, 3
2014-04-20, 2
I'm trying df.user.resample('D', how='count').unique but it doesn't seem to give me the right answer.
You don't need to do a resample to get the desired output in your question. I think you can get by with just a groupby on date:
print df.groupby(df.index.date)['User'].nunique()
2014-04-15 3
2014-04-20 2
dtype: int64
And then if you want to you could resample to fill in the time series gaps after you count the unique users:
cnt = df.groupby(df.index.date)['User'].nunique()
cnt.index = cnt.index.to_datetime()
print cnt.resample('D')
2014-04-15 3
2014-04-16 NaN
2014-04-17 NaN
2014-04-18 NaN
2014-04-19 NaN
2014-04-20 2
Freq: D, dtype: float64
I came across the same problem. Resample worked for me with nunique. The nice way with resample is that it makes it very simple to change the sample rate for example to hour or minutes and that the timestamp is kept as index.
df.user.resample('D').nunique()
I was running into the same problem. Karl D's answer works for some kind of reindexing -- on date, for example. but what if you want the index to be
Jan 2014
Feb 2014
March 2014
and then plot it as a timeseries?
Here's what I did:
df.user.resample('M',lambda x: x.nunique())
Related
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill the missing dates and have a corresponding NaN value in the Portfoliovalue column with NaN. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis”
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has asfreq function for datetimeIndex, this is basically just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has reindex method: given a list of indices, it remains only indices from list.
In your case, you can create all the dates you want, by date_range for example, and then give it to reindex. you might needed a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
On first we set 'Date' column as index. Then we use reindex, it full list of dates (given by date_range from minimal date to maximal date in 'Date' column, with daily frequency) as new index. It result nans in places without former value.
I am currently doing some exercises on a Pandas DataFrame indexed by date (DD/MM/YY). The current exercise requires me to groupby on Year to obtain average yearly values.
So what I tried to do was to create a new column containing only the years extracted from the DataFrame's index. The code I wrote is:
data["year"] = [t.year for t in data.index]
data.groupby("year").mean()
but for some reason, the new column "year" ends up replacing the previous full-date indexing (which does not even become a "standard" column, it plain disappears), which came a bit by surprise. How can this be?
Thanks in advance!
For a sample dataframe:
value
2016-01-22 1
2014-02-02 2
2014-08-27 3
2016-01-23 4
2014-03-18 5
If you would like to keep your logic, you just need to call the column you want to take the mean() of and use transform() and then assign it back to the value column:
data['year'] = [t.year for t in data.index]
data['value'] = data.groupby('year')['value'].transform('mean')
Yields:
value year
2016-01-22 2.500000 2016
2014-02-02 3.333333 2014
2014-08-27 3.333333 2014
2016-01-23 2.500000 2016
2014-03-18 3.333333 2014
I have a pandas DataFrame like this.
As you can see, the data corresponds to end of month data. The problem is that the end of month date is not the same for all the columns. ( The underlying reason is that the last trading day of the month does not always coincide with the end of the month. )
Currently, the end of 2016 January have two rows "2016-01-29" and "2016-01-31." It should be just one row. For example, the end of 2016 January should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to the end of monthly data, the data might not be nice enough and can conceivably include the middle of the month data for a random columns. In that case, I don't want to make any adjustment so that any prior data collection error would be caught.
What is the most efficient way to achieve this goal.
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine rows at the end of 2015-05, 2015-10, 2016-01. However, rows at 2015-07 and 2015-08 simply does not have data. So, in this case, I would like to leave 2015-07 and 2015-08 as NaN while I like to merge the end of month rows at 2015-05, 2015-10, 2016-01. Hopefully, this provides more insight to what I am trying to do.
You can use:
df = df.groupby(pd.TimeGrouper('M')).fillna(method='ffill')
df = df.resample(rule='M', how='last')
to create a new DateTimeIndex ending on the last day of the months and sample the last available data point for each months. fillna() ensures that, for columns with of missing data for the last available date, you use the prior available value.
Here is what I'm trying to do in Pandas:
load CSV file containing information about stocks for certain days
find the earliest and latest dates in the column date
create a new dataframe where all the days between the earliest and latest are filled (NaN or something like "missing" for all columns would be fine)
Currently it looks like this:
import pandas as pd
import dateutil
df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)
Unfortunately this throws the following error which I don't quite understand:
ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)
I've tried a dozen variations of this - without any luck. Any help would be much appreciated.
Edit:
After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.
The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?
So you have duplicates when considering symbol,date,action.
In [99]: df.head(10)
Out[99]:
symbol date change action
0 FDC 2001-08-15 00:00:00 15.069360 new
1 GPS 2001-08-15 00:00:00 19.653780 new
2 HON 2001-08-15 00:00:00 8.604316 new
3 LIZ 2001-08-15 00:00:00 6.711568 new
4 NKE 2001-08-15 00:00:00 22.686257 new
5 ODP 2001-08-15 00:00:00 5.686902 new
6 OSI 2001-08-15 00:00:00 5.893340 new
7 USB 2001-08-15 00:00:00 15.694478 new
8 NEE 2001-11-15 00:00:00 100.000000 new
9 GPS 2001-11-15 00:00:00 142.522231 increase
Create the new date index
In [102]: idx = pd.date_range(df.date.min(),df.date.max())
In [103]: idx
Out[103]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None
This will, group by symbol and action
Then reindex that set to the full dates (idx)
Select out the only remaining column (change)
As now the index is symbol/date
In [100]: df.groupby(['symbol','action']).apply(
lambda x: x.set_index('date').reindex(idx)
)['change'].reset_index(level=1).head()
Out[100]:
action change
symbol
ADM 2001-08-15 decrease NaN
2001-08-16 decrease NaN
2001-08-17 decrease NaN
2001-08-18 decrease NaN
2001-08-19 decrease NaN
In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action 977632 non-null values
change 490 non-null values
dtypes: float64(1), object(1)
You can then fill forward or whatever you need. FYI, not sure what you are going to do with this, but this is not a very common type of operation as you have mostly empty data.
I'm having a similar problem at the moment, I think you shouldn't use reindex but something like asfreq or resample.
with them you don't need to create an index, thy will.
I'm running into problems when taking lower-frequency time-series in pandas, such as monthly or quarterly data, and upsampling it to a weekly frequency. For example,
data = np.arange(3, dtype=np.float64)
s = Series(data, index=date_range('2012-01-01', periods=len(data), freq='M'))
s.resample('W-SUN')
results in a series filled with NaN everywhere. Basically the same thing happens if I do:
s.reindex(DatetimeIndex(start=s.index[0].replace(day=1), end=s.index[-1], freq='W-SUN'))
If s were indexed with a PeriodIndex instead I would get an error: ValueError: Frequency M cannot be resampled to <1 Week: kwds={'weekday': 6}, weekday=6>
I can understand why this might happen, as the weekly dates don't exactly align with the monthly dates, and weeks can overlap months. However, I would like to implement some simple rules to handle this anyway. In particular, (1) set the last week ending in the month to the monthly value, (2) set the first week ending in the month to the monthly value, or (3) set all the weeks ending in the month to the monthly value. What might be an approach to accomplish that? I can imagine wanting to extend this to bi-weekly data as well.
EDIT: An example of what I would ideally like the output of case (1) to be would be:
2012-01-01 NaN
2012-01-08 NaN
2012-01-15 NaN
2012-01-22 NaN
2012-01-29 0
2012-02-05 NaN
2012-02-12 NaN
2012-02-19 NaN
2012-02-26 1
2012-03-04 NaN
2012-03-11 NaN
2012-03-18 NaN
2012-03-25 2
I made a github issue regarding your question. Need to add the relevant feature to pandas.
Case 3 is achievable directly via fill_method:
In [25]: s
Out[25]:
2012-01-31 0
2012-02-29 1
2012-03-31 2
Freq: M
In [26]: s.resample('W', fill_method='ffill')
Out[26]:
2012-02-05 0
2012-02-12 0
2012-02-19 0
2012-02-26 0
2012-03-04 1
2012-03-11 1
2012-03-18 1
2012-03-25 1
2012-04-01 2
Freq: W-SUN
But for others you'll have to do some contorting right now that will hopefully be remedied by the github issue before the next release.
Also it looks like you want the upcoming 'span' resampling convention as well that will upsample from the start of the first period to the end of the last period. I'm not sure there is an easy way to anchor the start/end points for a DatetimeIndex but it should at least be there for PeriodIndex.