Here is what I'm trying to do in Pandas:
load a CSV file containing information about stocks for certain days
find the earliest and latest dates in the date column
create a new dataframe in which all days between the earliest and latest are present (NaN or something like "missing" for all columns would be fine)
Currently it looks like this:
import pandas as pd
import dateutil.parser
df = pd.read_csv("https://dl.dropboxusercontent.com/u/84641/temp/berkshire_new.csv")
df['date'] = df['date'].apply(dateutil.parser.parse)
new_date_range = pd.date_range(df['date'].min(), df['date'].max())
df = df.set_index('date')
df.reindex(new_date_range)
Unfortunately this throws the following error which I don't quite understand:
ValueError: Shape of passed values is (3, 4825), indices imply (3, 4384)
I've tried a dozen variations of this - without any luck. Any help would be much appreciated.
Edit:
After investigating this further, it looks like the problem is caused by duplicate indexes. The CSV does contain several entries for each date, which is probably causing the errors.
The question is still relevant though: How can I fill the gaps in between, although there are duplicate entries for each date?
So you have duplicate dates; the rows are only unique when considering symbol, date, and action together.
In [99]: df.head(10)
Out[99]:
symbol date change action
0 FDC 2001-08-15 00:00:00 15.069360 new
1 GPS 2001-08-15 00:00:00 19.653780 new
2 HON 2001-08-15 00:00:00 8.604316 new
3 LIZ 2001-08-15 00:00:00 6.711568 new
4 NKE 2001-08-15 00:00:00 22.686257 new
5 ODP 2001-08-15 00:00:00 5.686902 new
6 OSI 2001-08-15 00:00:00 5.893340 new
7 USB 2001-08-15 00:00:00 15.694478 new
8 NEE 2001-11-15 00:00:00 100.000000 new
9 GPS 2001-11-15 00:00:00 142.522231 increase
Create the new date index
In [102]: idx = pd.date_range(df.date.min(),df.date.max())
In [103]: idx
Out[103]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2001-08-15 00:00:00, ..., 2013-08-15 00:00:00]
Length: 4384, Freq: D, Timezone: None
This will group by symbol and action, then reindex each group to the full set of dates (idx), and select the only remaining column (change).
The resulting index is symbol/date.
In [100]: df.groupby(['symbol','action']).apply(
lambda x: x.set_index('date').reindex(idx)
)['change'].reset_index(level=1).head()
Out[100]:
action change
symbol
ADM 2001-08-15 decrease NaN
2001-08-16 decrease NaN
2001-08-17 decrease NaN
2001-08-18 decrease NaN
2001-08-19 decrease NaN
In [101]: df.groupby(['symbol','action']).apply(lambda x: x.set_index('date').reindex(idx))['change'].reset_index(level=1)
Out[101]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 977632 entries, (ADM, 2001-08-15 00:00:00) to (svm, 2013-08-15 00:00:00)
Data columns (total 2 columns):
action 977632 non-null values
change 490 non-null values
dtypes: float64(1), object(1)
You can then fill forward or do whatever else you need. FYI, I'm not sure what you are going to do with this, but it is not a very common type of operation, as you have mostly empty data.
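For example, a minimal sketch of that forward fill, reusing df and idx from the session above and grouping on the first two index levels (symbol/action) so values never bleed across groups; the variable names full and filled are placeholders, not from the original session:
# Reindex each (symbol, action) group to the full daily range,
# keeping only the 'change' column; the index becomes (symbol, action, date).
full = df.groupby(['symbol', 'action']).apply(
    lambda x: x.set_index('date').reindex(idx)
)['change']
# Forward-fill within each (symbol, action) group only.
filled = full.groupby(level=[0, 1]).ffill()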
I'm having a similar problem at the moment. I think you shouldn't use reindex but something like asfreq or resample.
With them you don't need to create an index yourself; they will create it for you.
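As a rough illustration on a frame whose dates are already unique and sorted (asfreq raises on duplicate dates, which is why the accepted answer groups first):
# asfreq builds the full daily index itself; no explicit date_range needed.
daily = df.set_index('date').asfreq('D')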
Related
Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". These dates are stored as strings such as Jan-01, Feb-01, Mar-01, etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading into this suggests it is due to a 64-bit limit on the possible datetime timestamps that can exist.)
What is a good way to work around this problem/process the date information so I can effectively plot the associated data vs these dates, over this ~10,000 year period?
Thanks
The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
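If the raw strings really follow the Jan-01 pattern described in the question, here is a minimal sketch for parsing them into cftime objects; the parse_climate_date helper, the month-abbreviation format, and the column name date are all assumptions, not from the original post:
import cftime

# Hypothetical parser: 'Jan-01' -> January of year 1.
MONTHS = {m: i + 1 for i, m in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])}

def parse_climate_date(s):
    month, year = s.split('-')
    return cftime.DatetimeGregorian(int(year), MONTHS[month], 1)

df['date'] = df['date'].map(parse_climate_date)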
I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
from datetime import datetime
import pandas as pd

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # cash is defined elsewhere in the programme
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
# Solution provided by Uts after asking on Stack Overflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
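Regarding the "cannot reindex from a duplicate axis" error from the update: asfreq needs a unique index, so one option (an assumption on my part, keeping the last entry per date) is to drop duplicate dates before setting the index:
df.Date = pd.to_datetime(df.Date)
# Keep only the most recent row per date so asfreq sees a unique index.
df = (df.drop_duplicates(subset='Date', keep='last')
        .set_index('Date')
        .asfreq('D')
        .reset_index())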
Pandas has a reindex method: given a list of index labels, it conforms the frame to that list, keeping the rows whose labels appear in it and inserting NaN rows for labels that don't yet exist.
In your case, you can create all the dates you want with date_range, for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex, passing it the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in places without a former value.
I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should include all dates in the year and their corresponding values). As we can see in the head and tail output, certain dates and their corresponding values are missing (30th Jan & 29th Dec). There would be many more such gaps in the data frame, sometimes with missing observations for more than one consecutive date.
Is there a way to detect the missing dates, insert them into the data frame, and fill the corresponding values with a rolling average over a one-week window (this would naturally increase the number of rows of the data frame)? Appreciate inputs.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If need all values by year use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
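If the observed values should stay untouched and only the inserted dates be filled, a variant worth trying (a single-pass fill; min_periods=1 and the trailing window are my assumptions):
df['date'] = pd.to_datetime(df['date'])
s = df.set_index('date')['value'].asfreq('d')
# Fill only the NaNs created by asfreq, using the trailing 7-day mean
# of whatever observed values fall in the window.
df = s.fillna(s.rolling('7D', min_periods=1).mean()).to_frame('value')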
I'm cleaning a dataframe column of dates, and I wrote a function that cleans some entries one way and all other entries another way.
I now have the cleaned data in two separate series. I've recorded the index position of each entry from the original column, and I know which indices are in each of the two clean series.
My trouble is assigning both series together to the dataframe column. I just can't do it.
My function f receives the dates column and returns a list of 4 items: the column indices for the first series [0], the clean entries of the first series [1], the column indices for the second series [2], and the clean entries of the second series [3].
So when I do f(column)[3] and f(column)[1] I do get cleaned pandas series.
#Function works:
>>> f(df['dates_column'])[0]
, 18812, 18813, 18814, 18815, 18816, 18817, 18818, 18819, 18820, 18821,
18822, 18823, 18824, 18825, 18826,
>>> f(df['dates_column'])[1].tail()
331849 2009-10-03
331850 2006-10-03
331851 2015-09-27
331852 1911-08-09
331853 2013-09-03
Name: dates_column, dtype: datetime64[ns]
>>> f(df['dates_column'])[3].tail()
331898 1996-12-11
331899 2004-06-01
331900 2010-03-12
331901 2016-01-06
331902 2010-03-12
Name: dates_column, dtype: datetime64[ns]
>>> f(df['dates_column'])[1].head()
0 1900-01-01
1 1900-01-01
2 1900-01-01
3 1900-01-01
4 1900-01-01
Name: dates_column, dtype: datetime64[ns]
>>> f(df['dates_column'])[3].head()
40036 2002-06-18
40037 2005-04-01
40038 2002-04-01
40039 2003-05-02
40040 2006-10-01
Name: dates_column, dtype: datetime64[ns]
#But cannot assign properly..
>>> df['dates_column'][f(df['dates_column'])[0]] = f(df['dates_column'])[1]
<input>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
>>> df['dates_column'][f(df['dates_column'])[2]] = f(df['dates_column'])[3]
<input>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
#And it gets all crazy in the head..
>>> df['dates_column'].head()
0 -2208988800000000000
1 -2208988800000000000
2 -2208988800000000000
3 -2208988800000000000
4 -2208988800000000000
Name: dates_column, dtype: object
#And in the tail
>>> df['dates_column'].tail()
331898 1996-12-11 00:00:00
331899 2004-06-01 00:00:00
331900 2010-03-12 00:00:00
331901 2016-01-06 00:00:00
331902 2010-03-12 00:00:00
Name: dates_column, dtype: object
How do I assign the values of both series to dates_column? I don't understand the change of format either.
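The SettingWithCopyWarning itself hints at the fix: write through .loc on the frame instead of chained indexing. A minimal sketch, assuming the recorded indices are labels of the frame's index and calling f only once:
# Unpack the four results of f in a single call.
idx1, clean1, idx2, clean2 = f(df['dates_column'])
# .loc writes into the original frame; .values sidesteps index alignment.
df.loc[idx1, 'dates_column'] = clean1.values
df.loc[idx2, 'dates_column'] = clean2.values
As for the format change: -2208988800000000000 is 1900-01-01 expressed as raw nanoseconds since the epoch, so running df['dates_column'] = pd.to_datetime(df['dates_column']) afterwards should restore a proper datetime64 column.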
I have a .json file extension (logs.json) that was sent to me with the following data in it (I am showing only some of it as there are over 2,000 entries):
["2012-03-01T00:05:55+00:00", "2012-03-01T00:06:23+00:00", "2012-03-01T00:06:52+00:00", "2012-03-01T00:11:23+00:00", "2012-03-01T00:12:47+00:00", "2012-03-01T00:12:54+00:00", "2012-03-01T00:16:14+00:00", "2012-03-01T00:17:31+00:00", "2012-03-01T00:21:23+00:00", "2012-03-01T00:21:26+00:00", "2012-03-01T00:22:25+00:00", "2012-03-01T00:28:24+00:00", "2012-03-01T00:31:21+00:00", "2012-03-01T00:32:20+00:00", "2012-03-01T00:33:32+00:00", "2012-03-01T00:35:21+00:00", "2012-03-01T00:38:14+00:00", "2012-03-01T00:39:24+00:00", "2012-03-01T00:43:12+00:00", "2012-03-01T00:46:13+00:00", "2012-03-01T00:46:31+00:00", "2012-03-01T00:48:03+00:00", "2012-03-01T00:49:34+00:00", "2012-03-01T00:49:54+00:00", "2012-03-01T00:55:19+00:00", "2012-03-01T00:56:27+00:00", "2012-03-01T00:56:32+00:00"]
Using Pandas, I did:
import pandas as pd
logs = pd.read_json('logs.json')
logs.head()
And I get the following:
0
0 2012-03-01T00:05:55+00:00
1 2012-03-01T00:06:23+00:00
2 2012-03-01T00:06:52+00:00
3 2012-03-01T00:11:23+00:00
4 2012-03-01T00:12:47+00:00
[5 rows x 1 columns]
Then, in order to assign the proper data type including the UTC zone, I do:
logs = pd.to_datetime(logs[0], utc=True)
logs.head()
And get:
0 2012-03-01 00:05:55
1 2012-03-01 00:06:23
2 2012-03-01 00:06:52
3 2012-03-01 00:11:23
4 2012-03-01 00:12:47
Name: 0, dtype: datetime64[ns]
Here are my questions:
Is the above code correct to get my data in the right format?
Where did my UTC zone go? And what if I want to create a column with the corresponding PST time and add it to this dataset in a data frame format?
I seem to recall that in order to obtain counts per day, week, or year, I need to add .day, .week, or .year somewhere (logs.day?), but I cannot figure it out, and I am guessing that it is because of the current shape of my data. How do I get counts by day, week, or year so that I can plot the data? And how would I go about plotting it?
Such simple questions that seem so hard for someone who is transitioning from R to using Python for Data Analysis! I hope you guys can help!
I think there may be a bug in the tz handling here; it's certainly possible that this should be converted by default (I was surprised that it wasn't; I suspect it's because the input is just a list).
In [21]: s = pd.read_json(js, convert_dates=[0], typ='series')  # more honestly this is a Series
In [22]: s.head()
Out[22]:
0 2012-03-01 00:05:55
1 2012-03-01 00:06:23
2 2012-03-01 00:06:52
3 2012-03-01 00:11:23
4 2012-03-01 00:12:47
dtype: datetime64[ns]
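For the PST part of the question, a minimal sketch using the modern .dt accessor (assuming, as in the output above, that the parsed values are naive timestamps representing UTC):
# Attach the UTC zone the parse dropped, then convert to Pacific time.
pst = s.dt.tz_localize('UTC').dt.tz_convert('US/Pacific')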
To get counts of year, month, etc. I would probably use a DatetimeIndex (at the moment date-like columns don't have year/month etc methods, though I think they (c|sh)ould):
In [23]: dti = pd.DatetimeIndex(s)
In [24]: s.groupby(dti.year).size()
Out[24]:
2012 27
dtype: int64
In [25]: s.groupby(dti.month).size()
Out[25]:
3 27
dtype: int64
Perhaps it makes more sense to view the data as a TimeSeries:
In [31]: ts = pd.Series(1, dti)
In [32]: ts.head()
Out[32]:
2012-03-01 00:05:55 1
2012-03-01 00:06:23 1
2012-03-01 00:06:52 1
2012-03-01 00:11:23 1
2012-03-01 00:12:47 1
dtype: int64
This way you can use resample:
In [33]: ts.resample('M', how='sum')
Out[33]:
2012-03-31 27
Freq: M, dtype: int64
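A note for readers on current pandas: the how= argument to resample was removed, so the equivalent spelling is:
# Same monthly count with the modern resample API.
monthly = ts.resample('M').sum()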