I need to subtract the first value of a column in a pandas DataFrame from every element in that column.
In this code, pandas complains about self.inferred_type, which I guess is due to circular referencing:
df.Time = df.Time - df.Time[0]
And in this code, pandas complains about setting a value on a copy:
df.Time = df.Time - df.iat[0,0]
What is the correct way to do this computation in Pandas?
I think you can select the first item in column Time with iloc:
df.Time = df.Time - df.Time.iloc[0]
Sample:
start = pd.to_datetime('2015-02-24 10:00')
rng = pd.date_range(start, periods=5)
df = pd.DataFrame({'Time': rng, 'a': range(5)})
print (df)
Time a
0 2015-02-24 10:00:00 0
1 2015-02-25 10:00:00 1
2 2015-02-26 10:00:00 2
3 2015-02-27 10:00:00 3
4 2015-02-28 10:00:00 4
df.Time = df.Time - df.Time.iloc[0]
print (df)
Time a
0 0 days 0
1 1 days 1
2 2 days 2
3 3 days 3
4 4 days 4
Note:
Both of your original approaches also work for me.
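That said, df.Time[0] looks up the label 0 rather than the first position, so iloc[0] is the safer choice. A minimal sketch with a non-default index shows the difference:
import pandas as pd

s = pd.Series([10, 20, 30], index=[5, 6, 7])
print(s.iloc[0])  # 10 -- first position, regardless of index labels
# s[0] would raise KeyError here, because no label 0 exists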
Related
I have a file with a datetime column 'A' spanning several months. I have to restrict it to August 2017, group by the weekday, and aggregate by summing. Then set the Weekday column to numbers from one to seven, make Weekday the (row) index, and return the resulting DataFrame.
I tried the following:
z = pd.date_range("2017-08-01", "2017-08-31", freq='D').to_series()
y = z.dt.weekday
y
This gives
2017-08-01 1
2017-08-02 2
2017-08-03 3...
and so forth, but I am not able to rename or change the index of this single column in my Jupyter notebook. How can I go about solving this exercise? Please help.
IIUC you'd like to have something like
import numpy as np
import pandas as pd
df = pd.date_range("2017-08-01", "2017-08-31",freq='D')\
.to_frame(name="date")\
.reset_index(drop=True)
df["n"] = np.random.randn(len(df))
df["dow"] = df["date"].dt.weekday
df.head()
date n dow
0 2017-08-01 2.104356 1
1 2017-08-02 0.475884 2
2 2017-08-03 -0.849579 3
3 2017-08-04 -0.134266 4
4 2017-08-05 -1.322617 5
And then perform the following groupby
df.groupby("dow")["n"].sum().reset_index()
dow n
0 0 -1.579067
1 1 0.793178
2 2 -2.310629
3 3 -2.275956
4 4 -0.091897
5 5 -3.918192
6 6 -2.252314
**Update**
If you have problems with to_frame, you could define df as follows:
df = pd.DataFrame({"date": pd.date_range("2017-08-01",
"2017-08-31",
freq='D')})
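To finish the exercise as stated (restrict to August 2017, sum per weekday, number the weekdays 1 to 7 and use that as the row index), a sketch along these lines should work. Note that mapping pandas' Monday=0 ... Sunday=6 up by one is my assumption about the numbering the exercise wants:
# restricting to August is redundant for this synthetic df,
# but mirrors the real, multi-month data
aug = df[(df["date"] >= "2017-08-01") & (df["date"] <= "2017-08-31")]
out = aug.groupby("dow")["n"].sum().reset_index()
out["Weekday"] = out["dow"] + 1  # assumed numbering: Monday=1 ... Sunday=7
out = out.set_index("Weekday")[["n"]]
print(out)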
I'm trying to use df.loc from pandas to locate a partially variable value in my DataFrame.
My DataFrame combines dates and times, for example:
Time Col1 Col2
1-1-2019 01:00 5 7
1-1-2019 02:00 6 9
1-2-2019 01:00 8 3
With df.loc[df['Time'] == ['1-1-2019']] I want to locate the first two rows of this DataFrame, i.e. 1-1-2019 01:00 and 1-1-2019 02:00.
It gives me the error: Lengths must match. To be fair, that is a logical error for pandas to raise, because I am only supplying the day, not the time.
Is there a way to make the search value partially variable, so that pandas finds both 1-1-2019 01:00 and 1-1-2019 02:00?
The first idea is to remove the times, or more precisely set them to midnight, with Series.dt.floor or Series.dt.normalize:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.floor('d') == '2019-01-01']
#alternative
#df1 = df.loc[df['Time'].dt.normalize() == '2019-01-01']
print (df1)
Time Col1 Col2
0 2019-01-01 01:00:00 5 7
1 2019-01-01 02:00:00 6 9
Or compare dates by Series.dt.date:
from datetime import date
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.date == date(2019,1,1)]
#in some pandas versions comparing directly to a string also works
#df1 = df.loc[df['Time'].dt.date == '2019-01-01']
Or convert to YYYY-MM-DD strings with Series.dt.strftime and compare:
df['Time'] = pd.to_datetime(df['Time'])
df1 = df.loc[df['Time'].dt.strftime('%Y-%m-%d') == '2019-01-01']
The following will also subset the df on your condition, provided Time is still stored as strings (it will not work after converting the column to datetime):
df[df.Time.str.contains('1-1-2019')]
Time Col1 Col2
0 1-1-2019 01:00 5 7
1 1-1-2019 02:00 6 9
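Another option: pandas supports partial string indexing on a DatetimeIndex, so after parsing Time and making it the index you can select a whole day directly. A minimal sketch:
df['Time'] = pd.to_datetime(df['Time'])
# a partial date string selects every row within that day
df1 = df.set_index('Time').loc['2019-01-01']
print(df1)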
I want to write a transformation function accessing two columns from a DataFrame and pass it to transform().
Here is the DataFrame which I would like to modify:
print(df)
date increment
0 2012-06-01 0
1 2003-04-08 1
2 2009-04-22 3
3 2018-05-24 6
4 2006-09-25 2
5 2012-11-02 4
I would like to increment the year in column date by the number of years given in the column increment. The proposed code (which does not work) is:
df.transform(lambda df: date(df.date.year + df.increment, 1, 1))
Is there a way to access individual columns in the function (here a lambda function) passed to transform()?
You can use pandas.to_timedelta. Note that the 'Y' unit is an average-length year (365.2425 days), which is why the results below carry a time-of-day component; newer pandas versions reject the 'Y' and 'M' units for to_timedelta entirely:
# If necessary convert to date type first
# df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'] + pd.to_timedelta(df['increment'], unit='Y')
[out]
date increment
0 2012-06-01 00:00:00 0
1 2004-04-07 05:49:12 1
2 2012-04-21 17:27:36 3
3 2024-05-23 10:55:12 6
4 2008-09-24 11:38:24 2
5 2016-11-01 23:16:48 4
or alternatively:
df['date'] = pd.to_datetime({'year': df.date.dt.year.add(df.increment),
'month': df.date.dt.month,
'day': df.date.dt.day})
[out]
date increment
0 2012-06-01 0
1 2004-04-08 1
2 2012-04-22 3
3 2024-05-24 6
4 2008-09-25 2
5 2016-11-02 4
Your own solution could also be fixed by using the apply method with axis=1 instead of transform, so each row is passed to the lambda (note this sets every result to January 1st, as in your original code):
from datetime import date
df['date'] = df.apply(lambda row: date(row.date.year + row.increment, 1, 1), axis=1)
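For a calendar-exact shift that keeps the original month and day, here is a sketch using pd.DateOffset; it works element-wise, so it is likely slower on large frames:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
# DateOffset(years=n) shifts forward n calendar years, preserving
# month and day (Feb 29 falls back to Feb 28 in non-leap years)
df['date'] = [d + pd.DateOffset(years=int(n))
              for d, n in zip(df['date'], df['increment'])]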
I have a data frame like this, and I need to find the missing weeks and the count of consecutive missing weeks between them:
year Data Id
20180406 57170 A
20180413 55150 A
20180420 51109 A
20180427 57170 A
20180504 55150 A
20180525 51109 A
The output should be like this.
Id Start year end-year count
A 20180420 20180420 1
A 20180518 20180525 2
Use:
#convert to weekly periods ending on Thursday (each week runs Friday to Thursday, matching the data)
df['year'] = pd.to_datetime(df['year'], format='%Y%m%d').dt.to_period('W-Thu')
#insert the missing weeks with resample + asfreq
df1 = (df.set_index('year')
.groupby('Id')['Id']
.resample('W-Thu')
.asfreq()
.rename('val')
.reset_index())
print (df1)
Id year val
0 A 2018-04-06/2018-04-12 A
1 A 2018-04-13/2018-04-19 A
2 A 2018-04-20/2018-04-26 A
3 A 2018-04-27/2018-05-03 A
4 A 2018-05-04/2018-05-10 A
5 A 2018-05-11/2018-05-17 NaN
6 A 2018-05-18/2018-05-24 NaN
7 A 2018-05-25/2018-05-31 A
#convert periods back to datetimes using their start dates
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-between-representations
df1['year'] = df1['year'].dt.to_timestamp('D', how='s')
print (df1)
Id year val
0 A 2018-04-06 A
1 A 2018-04-13 A
2 A 2018-04-20 A
3 A 2018-04-27 A
4 A 2018-05-04 A
5 A 2018-05-11 NaN
6 A 2018-05-18 NaN
7 A 2018-05-25 A
m = df1['val'].notnull().rename('g')
#cumulative sum gives each run of consecutive NaNs a unique group id
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and count
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
Id first last size
0 A 2018-05-11 2018-05-18 2
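If you want the question's column labels and the YYYYMMDD format, a final formatting pass might look like this (the label names are copied from the expected output above):
df2['first'] = df2['first'].dt.strftime('%Y%m%d')
df2['last'] = df2['last'].dt.strftime('%Y%m%d')
df2.columns = ['Id', 'Start year', 'end-year', 'count']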
I have the following time series dataset of the number of sales per day as a pandas data frame.
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now I need to fill in the missing data points (i.e. missing dates) with a constant value (zero) so that the frame looks like the following. How can I do this efficiently (the data frame is ~50MB) using Pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
** marks the missing rows that were added to the data frame.
Any help will be appreciated.
You can first cast column date with to_datetime, then set_index and reindex over a date_range between the min and max of the index, then reset_index and, if necessary, change the format back with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
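Alternatively, assuming date has already been parsed with to_datetime, DataFrame.asfreq can upsample to daily frequency and fill the new rows in one step:
df = (df.set_index('date')
        .asfreq('D', fill_value=0)  # insert missing days with sales=0
        .reset_index())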