So I have a pandas DataFrame with a 'date' column. Our calendar is based on July 1st being the first day. I know I can do df['date'].dt.week, but that gives me the week from Jan 1. Is there a way to take my df and make a new column 'week', where 'week' is 0 for the first days in July until Sunday, then 1, and so on? Basically the same way that dt.week works, just shifted to Jul 1. I know that resample allows me to shift this way; I just can't seem to figure out how to get it all correct as a column.
Thanks
Update: Currently doing this... not exactly working.
def get_academic_year(x):
    # dates in January-June belong to the academic year that started the previous July
    if x.month < 7:
        year = x.year - 1
    else:
        year = x.year
    return year

def get_week(x):
    # difference of week-of-year numbers, wrapped into 0-51
    # (pd.Timestamp replaces pd.to_datetime(pd.datetime(...)), which no longer works
    # since pd.datetime was removed from pandas)
    return ((x['date'].week -
             pd.Timestamp(x['academic_year'], 7, 1).week) % 52)

df_x['academic_year'] = df_x['date'].apply(get_academic_year)
df_x['week'] = df_x.apply(get_week, axis=1)
My Dataset:
'{"date":{"0":1414368000000,"1":1414454400000,"2":1414540800000,"3":1414627200000,"4":1414713600000,"5":1414800000000,"6":1414886400000,"7":1425254400000,"8":1425340800000,"9":1425427200000,"10":1425513600000,"11":1425600000000,"12":1425686400000,"13":1425772800000,"14":1433116800000,"15":1433203200000,"16":1433289600000,"17":1433376000000,"18":1433462400000,"19":1433548800000,"20":1433635200000,"21":1444262400000,"22":1444348800000,"23":1444608000000,"24":1444694400000,"25":1444780800000,"26":1444867200000,"27":1444953600000,"28":1445040000000,"29":1445126400000,"30":1452643200000,"31":1452729600000,"32":1452816000000,"33":1452902400000,"34":1452988800000,"35":1460505600000,"36":1460937600000,"37":1461024000000,"38":1461110400000,"39":1461196800000,"40":1461283200000,"41":1461369600000,"42":1461456000000,"43":1465776000000,"44":1465862400000,"45":1465948800000,"46":1466035200000,"47":1466121600000,"48":1470873600000,"49":1470960000000,"50":1471219200000,"51":1471305600000,"52":1471392000000,"53":1486598400000,"54":1489968000000,"55":1490054400000,"56":1490140800000,"57":1490227200000,"58":1490313600000,"59":1492387200000,"60":1492473600000,"61":1492560000000,"62":1492646400000,"63":1492732800000,"64":1494201600000,"65":1494288000000,"66":1494374400000,"67":1494460800000,"68":1494547200000,"69":1502668800000,"70":1502755200000,"71":1502841600000,"72":1502928000000,"73":1503014400000,"74":1503100800000,"75":1503187200000,"76":1505174400000,"77":1505433600000,"78":1507507200000,"79":1507593600000,"80":1507680000000,"81":1507766400000,"82":1507852800000,"83":1507939200000,"84":1508025600000,"85":1508976000000,"86":1509062400000,"87":1509148800000,"88":1509235200000,"89":1509321600000,"90":1509408000000,"91":1512086400000,"92":1524268800000,"93":1524355200000,"94":1529884800000,"95":1529971200000,"96":1530057600000,"97":1530144000000,"98":1530230400000}}'
Update #2:
def get_academic_year(x):
    # dates in January-June belong to the academic year that started the previous July
    if x.month < 7:
        year = x.year - 1
    else:
        year = x.year
    return year

def get_week(x):
    # whole weeks elapsed since July 1 of the academic year, 1-based
    # (pd.Timestamp replaces the removed pd.datetime)
    return int((x['date'] - pd.Timestamp(x['academic_year'], 7, 1)).days / 7) + 1

rng = pd.date_range('7/1/2015', periods=365 * 3, freq='D')
df_x = pd.DataFrame()
df_x['date'] = rng
df_x['academic_year'] = df_x['date'].apply(get_academic_year)
df_x['week'] = df_x.apply(get_week, axis=1)
df_x
This might work for you.
df = pd.DataFrame({'A': ['2017-07-05', '2017-07-21', '2017-07-22',
                         '2017-08-01', '2017-08-15', '2017-08-30']})
df['A'] = pd.to_datetime(df['A'])
# Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
df['Week'] = df['A'].dt.isocalendar().week - pd.to_datetime('2017-07-01').week
# A Week
# 0 2017-07-05 1
# 1 2017-07-21 3
# 2 2017-07-22 3
# 3 2017-08-01 5
# 4 2017-08-15 7
# 5 2017-08-30 9
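One caveat with the week-number subtraction: it goes negative once dates cross into January. If that matters, here is a minimal sketch (my own variant, assuming timezone-naive dates) that counts whole weeks elapsed since the most recent July 1 instead:
df = pd.DataFrame({'date': pd.to_datetime(['2017-07-05', '2017-12-31', '2018-06-30'])})
# dates in Jan-Jun belong to the academic year that started the previous July
start_year = df['date'].dt.year.where(df['date'].dt.month >= 7, df['date'].dt.year - 1)
july_first = pd.to_datetime(start_year.astype(str) + '-07-01')
# whole weeks elapsed since July 1 (0 during the first week)
df['week'] = (df['date'] - july_first).dt.days // 7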
Related
I have a dataframe with a column of integers that represent birth years. Each row has 20xx or 19xx in it, but some rows have only the xx part.
What I want to do is add 19 in front of those numbers that have only two digits if the integer is bigger than 22 (starting from 0), and add 20 in front of those that are smaller than or equal to 22.
This is what I wrote;
for x in DF.loc[DF["Year"] >= 2022]:
    x + 1900
    if:
        x >= 22
    else:
        x + 2000
You can also change the code completely, I would just like you to maybe explain what exactly your code does.
Thanks for everybody who takes time to answer this.
Instead of iterating through the rows, use where to change the whole column:
y = df["Year"] # just to save typing
df["Year"] = y.where(y > 99, (y + 1900).where(y > 22, y + 2000))
or indexing (note that this chained assignment can raise SettingWithCopyWarning):
df["Year"][df["Year"].between(0, 22)] += 2000
df["Year"][df["Year"].between(23, 99)] += 1900
or loc:
df.loc[df["Year"].between(0, 22), "Year"] += 2000
df.loc[df["Year"].between(23, 99), "Year"] += 1900
You can do it in one line with the apply method.
Example:
df = pd.DataFrame({'date': [2002, 95, 1998, 3, 56, 1947]})
print(df)
date
0 2002
1 95
2 1998
3 3
4 56
5 1947
Then:
df['date'] = df.date.apply(lambda x: x + 1900 if (x < 100) & (x > 22) else (x + 2000 if (x < 100) & (x <= 22) else x))
print(df)
date
0 2002
1 1995
2 1998
3 2003
4 1956
5 1947
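(A side note: inside apply the values are plain scalars, so `and` would be the more idiomatic operator here; & happens to work because Python bools support bitwise operations.)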
It is basically what you did, an if inside a for:
new_list_of_years = []
for year in DF["Year"]:
    if year > 99:            # already a full four-digit year, leave it alone
        full_year = year
    elif year > 22:
        full_year = year + 1900
    else:
        full_year = year + 2000
    new_list_of_years.append(full_year)
DF['Year'] = new_list_of_years
Edit: You can do that with an if-else list comprehension as well:
DF['Year'] = [year + 1900 if 22 < year < 100 else year + 2000 if year <= 22 else year
              for year in DF["Year"]]
My dates come in the format 11122020 (ddmmyyyy) in a pandas column.
I use
datapdf["wholetime"] = pd.to_datetime(datapdf["wholetime"], format='%d%m%Y')
to convert to datetime and do processing on the time.
Recently my code failed for the date 3122020 with
ValueError: day is out of range for month
Python is interpreting it as 31 2 2020 instead of 3 12 2020, which causes the error. Does anyone have a solution for this?
One way would be to use str.zfill to ensure that date is in 8 digits:
s = pd.Series(["11122020", "3122020"])
pd.to_datetime(s.str.zfill(8), format="%d%m%Y")
Output:
0 2020-12-11
1 2020-12-03
dtype: datetime64[ns]
Note that this answer only addresses the missing leading 0 in the day. It won't be able to parse more ambiguous items such as 332020, where the month part also requires a leading 0.
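If the column holds integers rather than strings, casting first should work the same way (a small sketch, under the same assumption about the missing day digit):
s = pd.Series([11122020, 3122020])
# cast to string before padding, since zfill is a string method
pd.to_datetime(s.astype(str).str.zfill(8), format="%d%m%Y")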
A somewhat newbie approach using apply: I created a custom parser for the dates. If you have other formats mixed in, you can tweak the function w.r.t. your date formats:
import pandas as pd

data = {
    # assuming your dates are a mix of ddmmyyyy, dmmyyyy and dmyyyy
    'date': ['11122020', '3122020', '572020', '', '222019', '3112019']
}
df = pd.DataFrame(data)

def parser(elem):
    if len(elem) > 7:        # ddmmyyyy: already complete
        res = elem
    elif len(elem) > 6:      # dmmyyyy: pad the day
        d = '0' + elem[0]
        m = elem[1:3]
        y = elem[3:]
        res = d + m + y
    elif len(elem) > 5:      # dmyyyy: pad day and month
        d = '0' + elem[0]
        m = '0' + elem[1]
        y = elem[2:]
        res = d + m + y
    else:                    # empty or too short to parse
        res = ''
    return pd.to_datetime(res, format='%d%m%Y', errors='coerce')

df['date'] = df['date'].apply(parser)
df
output:
date
0 2020-12-11
1 2020-12-03
2 2020-07-05
3 NaT
4 2019-02-02
5 2019-11-03
I encountered a problem duplicating rows with a loop in Python. I have a dataset like this (a pandas DataFrame):
userId period Date
0 41851 4 1/4/2015
1 13575 1 1/4/2015
And I want to duplicate the first row 3 times; each time, the period column of the original row needs to decrease by 1, until the original's period is 1. Also, every time I duplicate it, I want to add 1 month to the date. So the result would be like this:
userId period Date
0 41851 1 1/4/2015
1 41851 1 2/4/2015
2 41851 1 3/4/2015
3 41851 1 4/4/2015
4 13575 1 1/4/2015
Does someone know how to do that? Thanks!
The idea is to repeat rows with Index.repeat and DataFrame.loc, then add months via GroupBy.cumcount with this solution, and last, if necessary, change the format of the datetimes with Series.dt.strftime:
import numpy as np
import pandas as pd

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

def year(dates):
    "Return an array of the years given an array of datetime64s"
    return dates.astype('M8[Y]').astype('i8') + 1970

def month(dates):
    "Return an array of the months given an array of datetime64s"
    return dates.astype('M8[M]').astype('i8') % 12 + 1

def day(dates):
    "Return an array of the days of the month given an array of datetime64s"
    return (dates - dates.astype('M8[M]')) / np.timedelta64(1, 'D') + 1

df['Date'] = pd.to_datetime(df['Date'])
df1 = df.loc[df.index.repeat(df['period'])]
g = df1.groupby(level=0).cumcount()
start = df1['Date'].values
df1['Date'] = combine64(year(start), months=month(start) + g,
                        days=day(start))
df1['period'] = 1
df1 = df1.reset_index(drop=True)
df1['Date'] = df1['Date'].dt.strftime('%m/%d/%Y')
print(df1)
userId period Date
0 41851 1 01/04/2015
1 41851 1 02/04/2015
2 41851 1 03/04/2015
3 41851 1 04/04/2015
4 13575 1 01/04/2015
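For a small frame, a slower but shorter sketch of the same idea (my own variant, using per-row pd.DateOffset arithmetic instead of combine64, with the column names from the question):
df['Date'] = pd.to_datetime(df['Date'])
df1 = df.loc[df.index.repeat(df['period'])]
# each copy's position within its group becomes the number of months to add
g = df1.groupby(level=0).cumcount()
df1['Date'] = [d + pd.DateOffset(months=m) for d, m in zip(df1['Date'], g)]
df1['period'] = 1
df1 = df1.reset_index(drop=True)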
I am following the suggestions here pandas create new column based on values from other columns but still getting an error. Basically, my Pandas dataframe has many columns and I want to group the dataframe based on a new categorical column whose value depends on two existing columns (AMP, Time).
df
df['Time'] = pd.to_datetime(df['Time'])
#making sure Time column read from the csv file is time object
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
def f(row):
    if (row['AMP'] > 100) & (row['Time'] > day_1):
        val = 'new_positives'
    elif (row['AMP'] > 100) & (day_2 <= row['Time'] <= day_1):
        val = 'rec_positives'
    elif (row['AMP'] > 100 & row['Time'] < day_2):
        val = 'old_positives'
    else:
        val = 'old_negatives'
    return val
df['GRP'] = df.apply(f, axis=1) #this gives the following error:
TypeError: ("Cannot compare type 'Timestamp' with type 'date'", 'occurred at index 0')
df[(df['AMP'] > 100) & (df['Time'] > day_1)] #this works fine
df[(df['AMP'] > 100) & (day_2 <= df['Time'] <= day_1)] #this works fine
df[(df['AMP'] > 100) & (df['Time'] < day_2)] #this works fine
#df = df.groupby('GRP')
I am able to select the proper sub-dataframes based on the conditions specified above, but when I apply the above function on each row, I get the error. What is the correct approach to group the dataframe based on the conditions listed?
EDIT:
Unforunately, I cannot provide a sample of my dataframe. However, here is simple dataframe that gives an error of the same type:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})

def f1(row):
    if row['a'] < 5 & row['b'] < 0.5:
        value = 'less'
    elif row['a'] < 5 & row['b'] > 0.5:
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)
TypeError: ("unsupported operand type(s) for &: 'int' and 'float'", 'occurred at index 0')
EDIT 2:
As suggested below, enclosing the comparison operator with parentheses did the trick for the cooked up example. This problem is solved.
However, I am still getting the same error in my real example. By the way, if I were to use the column 'AMP' with perhaps another column in my table, then everything works and I am able to create df['GRP'] by applying the function f to each row. This shows the problem is related to using df['Time']. But then why am I able to select df[(df['AMP'] > 100) & (df['Time'] > day_1)]? Why would this work in this context, but not when the condition appears in a function?
Based on your error message and example, there are two things to fix. One is to adjust parentheses for operator precedence in your final elif statement. The other is to avoid mixing datetime.date and Timestamp objects.
Fix 1: change this:
elif (row['AMP'] > 100 & row['Time'] < day_2):
to this:
elif (row['AMP'] > 100) & (row['Time'] < day_2):
These two lines are different because the bitwise & operator takes precedence over the < and > comparison operators, so Python attempts to evaluate 100 & row['Time']. A full list of Python operator precedence is here: https://docs.python.org/3/reference/expressions.html#operator-precedence
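For instance, 100 & 3 < 5 is evaluated as (100 & 3) < 5, i.e. 0 < 5; with a Timestamp in place of 3, the 100 & row['Time'] step fails before the comparison ever runs.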
Fix 2: Change these 3 lines:
import datetime as dt
day_1 = dt.date.today()
day_2 = dt.date.today() - dt.timedelta(days = 1)
to these 2 lines:
day_1 = pd.to_datetime('today')
day_2 = day_1 - pd.DateOffset(days=1)
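With both values as pandas Timestamps, comparisons such as row['Time'] > day_1 inside f no longer mix datetime.date with Timestamp, which is what triggered the TypeError.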
Some parentheses need to be added in the if-statements:
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': np.arange(10),
                     'b': np.random.rand(10)})

def f1(row):
    if (row['a'] < 5) & (row['b'] < 0.5):
        value = 'less'
    elif (row['a'] < 5) & (row['b'] > 0.5):
        value = 'more'
    else:
        value = 'same'
    return value

mydf['GRP'] = mydf.apply(f1, axis=1)
If you don't need to use a custom function, then you can use multiple masks (somewhat similar to this SO post)
For the Time column, I used the code below. It may be that you were trying to compare Time column values that did not have the required dtype (this is just my guess):
import datetime as dt

# ten consecutive days starting 10/14/2018 (the original used end=dt.date.today(),
# which matched the frame's ten rows only on the day the answer was written)
mydf['Time'] = pd.date_range(start='10/14/2018', periods=len(mydf))
day_1 = pd.to_datetime(dt.date.today())
day_2 = day_1 - pd.DateOffset(days=1)
Here is the raw data
mydf
a b Time
0 0 0.550149 2018-10-14
1 1 0.889209 2018-10-15
2 2 0.845740 2018-10-16
3 3 0.340310 2018-10-17
4 4 0.613575 2018-10-18
5 5 0.229802 2018-10-19
6 6 0.013724 2018-10-20
7 7 0.810413 2018-10-21
8 8 0.897373 2018-10-22
9 9 0.175050 2018-10-23
One approach involves using masks for columns
# Append new column
mydf['GRP'] = 'same'
# Use masks to change values in new column
mydf.loc[(mydf['a'] < 5) & (mydf['b'] < 0.5) & (mydf['Time'] < day_2), 'GRP'] = 'less'
mydf.loc[(mydf['a'] < 5) & (mydf['b'] > 0.5) & (mydf['Time'] > day_1), 'GRP'] = 'more'
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Another approach is to set a, b and Time as a multi-index and use index-based masks to set values
mydf.set_index(['a','b','Time'], inplace=True)
# Get Index level values
a = mydf.index.get_level_values('a')
b = mydf.index.get_level_values('b')
t = mydf.index.get_level_values('Time')
# Apply index-based masks
mydf['GRP'] = 'same'
mydf.loc[(a < 5) & (b < 0.5) & (t < day_2), 'GRP'] = 'less'
mydf.loc[(a < 5) & (b > 0.5) & (t > day_1), 'GRP'] = 'more'
mydf.reset_index(drop=False, inplace=True)
mydf
a b Time GRP
0 0 0.550149 2018-10-14 same
1 1 0.889209 2018-10-15 same
2 2 0.845740 2018-10-16 same
3 3 0.340310 2018-10-17 less
4 4 0.613575 2018-10-18 same
5 5 0.229802 2018-10-19 same
6 6 0.013724 2018-10-20 same
7 7 0.810413 2018-10-21 same
8 8 0.897373 2018-10-22 same
9 9 0.175050 2018-10-23 same
Source to filter by datetime and create a range of dates.
There is an excellent example here; it is very useful, and you can apply filters after groupby. It is a way to do it without using a mask.
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'
In [6]: grouped = df.groupby(get_letter_type, axis=1)
https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
I'm trying to add a Year column to my DataFrame based on the value that already exists within the Rk column. I've tried using the code below; however, it automatically sets all values to 0.
df['Year'] = np.where(df['Rk'] <= 540, '2017/2018', 0)
df['Year'] = np.where((df['Rk'] >= 541) & (df['Rk'] <= 1135), '2016/2017', 0)
df['Year'] = np.where((df['Rk'] >= 1136) & (df['Rk'] <= 1713), '2016/2017', 0)
Use cut with specified bins:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Rk': [10, 540, 541, 1135, 1136, 1713, 1714, 2000],
})
labs = ['2017/2018','2016/2017','2015/2016', '0']
df['Year'] = pd.cut(df['Rk'], bins=[-np.inf,540, 1135, 1713, np.inf], labels=labs)
print (df)
Rk Year
0 10 2017/2018
1 540 2017/2018
2 541 2016/2017
3 1135 2016/2017
4 1136 2015/2016
5 1713 2015/2016
6 1714 0
7 2000 0
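If you would rather keep the style of your original attempt, a hedged sketch with np.select (conditions and labels taken from the question) avoids each np.where call overwriting the previous result:
import numpy as np

conditions = [df['Rk'] <= 540,
              df['Rk'].between(541, 1135),
              df['Rk'].between(1136, 1713)]
labels = ['2017/2018', '2016/2017', '2015/2016']
# rows matching no condition fall through to the default instead of being reset
df['Year'] = np.select(conditions, labels, default='0')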