I am having trouble understanding what happens to a timestamp after you reindex a DataFrame using pd.date_range. Here is an example where I use pd.DataFrame.reindex to create a longer time series:
import pandas as pd
import numpy as np
idx_initial = pd.date_range('2004-03-01', '2004-05-05')
df = pd.DataFrame(index=idx_initial, data={'data': np.random.randint(0, 100, idx_initial.size)})
idx_new = pd.date_range('2004-01-01', '2004-05-05')
df = df.reindex(idx_new, fill_value=0)
which returns the expected result, where the newly added rows are filled with 0:
data
2004-01-01 0
2004-01-02 0
2004-01-03 0
2004-01-04 0
2004-01-05 0
Now, if I want to use apply to assign a new column using:
def year_attrib(row):
    if row.index.month > 2:
        result = row.index.year + 11
    else:
        result = row.index.year + 15
    return result
df['year_attrib'] = df.apply(lambda x: year_attrib(x), axis=1)
I am getting the error:
AttributeError: ("'Index' object has no attribute 'month'", 'occurred at index 2004-01-01 00:00:00')
If I inspect the row being passed to year_attrib with:
row = df.iloc[0]
row
Out[32]:
data 0
Name: 2004-01-01 00:00:00, dtype: int32
It looks like the timestamp is being passed to Name and I have no idea how to access it. When I look at row.index I get:
row.index
Out[34]: Index(['data'], dtype='object')
What is the cause of this behavior?
The problem is that when you use apply on a DataFrame with axis=1, each row of the DataFrame is passed to the function as a Series; see the pandas documentation.
So what actually happens in the year_attrib function is that row.index returns the index of that row Series, which is the columns of the DataFrame:
In [5]: df.columns
Out[5]: Index(['data'], dtype='object')
Thus an AttributeError is raised when you use row.index.month.
If you really want to use this function to get what you want, use row.name.month instead; the row's index label (here a Timestamp) is stored in its name attribute.
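For example, a minimal fix of the original function (a sketch; it relies only on row.name, as explained above):

def year_attrib(row):
    # row.name holds the row's index label (a Timestamp), not row.index
    if row.name.month > 2:
        return row.name.year + 11
    return row.name.year + 15

df['year_attrib'] = df.apply(year_attrib, axis=1)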
However, a vectorized approach is still recommended, for example:
In [10]: df.loc[df.index.month>2, 'year_attrib'] = df[df.index.month>2].index.year + 11
In [11]: df.loc[df.index.month<=2, 'year_attrib'] = df[df.index.month<=2].index.year + 15
In [12]: df
Out[12]:
data year_attrib
2004-03-01 93 2015
2004-03-02 48 2015
2004-03-03 88 2015
2004-03-04 44 2015
2004-03-05 11 2015
2004-03-06 4 2015
2004-03-07 70 2015
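The same logic can also be written as a single np.where call (a sketch, assuming the df built above):

import numpy as np

# year + 11 for months after February, year + 15 otherwise
df['year_attrib'] = np.where(df.index.month > 2, df.index.year + 11, df.index.year + 15)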
I am having some trouble with my Python work. My steps are:
1) append the list to an ordinary DataFrame
2) delete the column(s) whose value in the list is the minimum
My list is called each_c and my ordinary DataFrame is called df_col.
I want it to become like this:
Hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
Convert each_c to a Series, append it with DataFrame.append, and then get the index of the minimal value with Series.idxmin and pass it to drop; this removes only the first minimal column:
s = pd.Series(each_c)
df = df_col.append(s, ignore_index=True).drop(s.idxmin(), axis=1)
If you need to remove all minimal columns when there are multiple minima:
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10, 4)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises the error:
IndexError: Boolean index has wrong length
it means the columns are not the default range-based names 0,1,2,3. A possible solution is to set the index values of the Series with rename:
each_c = [-0.025,0.008,-0.308,-0.308]
df_col = pd.DataFrame(np.random.random((10,4)), columns=list('abcd'))
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
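Note that DataFrame.append was removed in pandas 2.0, so on recent versions the same idea needs pd.concat; a sketch under the same setup:

s = pd.Series(each_c, index=df_col.columns)
# concatenate the list as one extra row instead of using append
df = pd.concat([df_col, s.to_frame().T], ignore_index=True)
df = df.loc[:, s.ne(s.min())]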
I have a list (actually a column in a pandas DataFrame, if that matters) of Timestamps and I'm trying to convert every element of the list to ordinal format. So I run a for loop through the list (is there a faster way?) and use:
import datetime as dt
a = a.toordinal()
or
import datetime as dt
a = dt.datetime.toordinal(a)
However, the following happened (simplified):
In[1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In[2]: b = dt.datetime.toordinal(a)
In[3]:b
Out[3]: 737418
In[4]:a = b
In[5]:a
Out[5]: Timestamp('1970-01-01 00:00:00.000737418')
The result makes absolutely no sense to me. Obviously what I was trying to get is:
In[1]: a
Out[1]: Timestamp('2019-12-25 00:00:00')
In[2]: b = dt.datetime.toordinal(a)
In[3]:b
Out[3]: 737418
In[4]:a = b
In[5]:a
Out[5]: 737418
What went wrong?
[console output screenshot]
Your question is a bit misleading, and the screenshot shows what is going on.
Normally, when you write
a = b
in Python, it will bind the name a to the object bound to b. In this case, you will have
id(a) == id(b)
In your case, however, contrary to your question, you're actually doing the assignment
a[0] = b
This will call a method of a, assigning b to its 0 index. The object's class determines what happens in this case. Here, specifically, a is a pandas.Series, and it converts the object in order to conform to its dtype.
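A minimal sketch reproducing that coercion (this is the behavior of older pandas versions; newer versions may warn or upcast the Series to object instead):

import pandas as pd

a = pd.Series([pd.Timestamp('2019-12-25')])  # dtype: datetime64[ns]
a[0] = 737418                                # an int assigned into a datetime64 Series
print(a[0])                                  # Timestamp('1970-01-01 00:00:00.000737418')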
Please don't loop. It's not necessary.
#!/usr/bin/env python
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'dates': [datetime(1990, 4, 28),
                             datetime(2018, 4, 13),
                             datetime(2017, 11, 4)]})
print(df)
print(df['dates'].dt.day_name())  # .dt.weekday_name in pandas < 1.0
print(df['dates'].dt.weekday)
print(df['dates'].dt.month)
print(df['dates'].dt.year)
gives the dataframe:
dates
0 1990-04-28
1 2018-04-13
2 2017-11-04
And the printed values
0 Saturday
1 Friday
2 Saturday
Name: dates, dtype: object
0 5
1 4
2 5
Name: dates, dtype: int64
0 4
1 4
2 11
Name: dates, dtype: int64
0 1990
1 2018
2 2017
Name: dates, dtype: int64
For toordinal, you need to "loop" with apply:
print(df['dates'].apply(lambda x: x.toordinal()))
gives the following pandas series
0 726585
1 736797
2 736637
Name: dates, dtype: int64
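An equivalent without a lambda, since a pandas Timestamp inherits toordinal from datetime (a small sketch):

print(df['dates'].map(pd.Timestamp.toordinal))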
I have a dataframe
ID 2016-01 2016-02 ... 2017-01 2017-02 ... 2017-10 2017-11 2017-12
111 12 34 0 12 3 0 0
222 0 32 5 5 0 0 0
I need to sum every 12 columns (one year each) and get
ID 2016 2017
111 46 15
222 32 10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
But it sums across all of the columns at once
And when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?
First, set_index with all the non-date columns:
df = df.set_index('ID')
1. groupby on the column names split at '-', taking the first (year) part:
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. a lambda function for the split:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. convert the columns to datetimes and groupby the years:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by years:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
2016 2017
ID
111 46 15
222 32 10
The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in []. So I'm rewriting the code correctly for convenience:
new_df = df.groupby([[i//n for i in range(0,m)]], axis = 1).sum()
where n is the number of columns you want to group together and m is the total number of columns being grouped. You have to rename the columns after that.
If you don't mind losing the labels, you can try this:
new_df = df.groupby([i//n for i in range(0,m)], axis = 1).sum()
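For the example above, this becomes (a sketch, assuming ID has been set as the index so only the 24 monthly columns remain, i.e. n=12 and m=24):

new_df = df.groupby([i // 12 for i in range(24)], axis=1).sum()
new_df.columns = [2016, 2017]  # rename the group labels afterwards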
I want to subtract dates in 'A' from dates in 'B' and add a new column with the difference.
df
A B
one 2014-01-01 2014-02-28
two 2014-02-03 2014-03-01
I've tried the following, but get an error when I try to include this in a for loop...
import datetime
date1=df['A'][0]
date2=df['B'][0]
mdate1 = datetime.datetime.strptime(date1, "%Y-%m-%d").date()
rdate1 = datetime.datetime.strptime(date2, "%Y-%m-%d").date()
delta = (mdate1 - rdate1).days
print(delta)
What should I do?
To remove the 'days' text element, you can also make use of the .dt accessor for Series: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.html
So,
df[['A','B']] = df[['A','B']].apply(pd.to_datetime) #if conversion required
df['C'] = (df['B'] - df['A']).dt.days
which returns:
A B C
one 2014-01-01 2014-02-28 58
two 2014-02-03 2014-03-01 26
Assuming these are datetime columns (if they're not, apply to_datetime first), you can just subtract them:
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B'])
In [11]: df.dtypes # if already datetime64 you don't need to use to_datetime
Out[11]:
A datetime64[ns]
B datetime64[ns]
dtype: object
In [12]: df['A'] - df['B']
Out[12]:
one -58 days
two -26 days
dtype: timedelta64[ns]
In [13]: df['C'] = df['A'] - df['B']
In [14]: df
Out[14]:
A B C
one 2014-01-01 2014-02-28 -58 days
two 2014-02-03 2014-03-01 -26 days
Note: ensure you're using a recent version of pandas (e.g. 0.13.1); this may not work in older versions.
A list comprehension is a Pythonic (and reasonably fast) way to do this:
[int(i.days) for i in (df.B - df.A)]
i is each timedelta (e.g. '-58 days')
i.days returns that value as an integer (a long, e.g. -58L, in Python 2)
int(i.days) gives you the -58 you seek.
If your columns aren't in datetime format, the shorter syntax would be: df.A = pd.to_datetime(df.A)
How about this:
times['days_since'] = max(list(df.index.values))
times['days_since'] = times['days_since'] - times['months']
times
I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so the max-min range is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps, we just subtract them (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
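Applied to the original question, this would be something like (a sketch, assuming df has 'page' and 'date' columns as described):

pagegroup = df.groupby('page')
# timedelta between the latest and earliest access per page
daterange = pagegroup['date'].apply(lambda x: x.max() - x.min())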