I am new to Python, so I'm sorry if this sounds silly. I have a date column in a DataFrame. I need to check whether each value in the date column is the end of its month. If it is, add one day and store the result in a new date column; if not, replace the day with the first of that month.
For example, if the date is 2000/3/31, the output date will be 2000/4/01,
and if the date is 2000/3/30, the output date will be 2000/3/1.
I could iterate over the column row by row, but I was wondering if there is a more Pythonic way to do it.
Let's say my date column is called "Date", the new column I want to create is "Date_new", and my dataframe is df. I am trying to code it like this, but it gives me an error:
if(df['Date'].dt.is_month_end == 'True'):
    df['Date_new'] = df['Date'] + timedelta(days = 1)
else:
    df['Date_new'] =df['Date'].replace(day=1)
I turned your if statement into a function and modified it slightly so that it works on a single row. I used the DataFrame .apply method with axis=1, so the function is called once per row (each row is passed in as a Series) rather than once per column.
import pandas as pd
import datetime

df = pd.DataFrame({'Date': [datetime.datetime(2022, 1, 31), datetime.datetime(2022, 1, 20)]})
print(df)

def my_func(row):
    # row is one row of df; row['Date'] is a pandas Timestamp
    if row['Date'].is_month_end:
        return row['Date'] + datetime.timedelta(days = 1)
    else:
        return row['Date'].replace(day=1)

df['Date_new'] = df.apply(my_func, axis=1)
print(df)
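For larger frames, a vectorized version avoids calling a Python function for every row. The following is only a sketch and not part of the answer above; it assumes the Date column already has a datetime64 dtype, and the two sample dates are the ones from the question.

```python
import pandas as pd

# Sample data using the two dates from the question
df = pd.DataFrame({"Date": pd.to_datetime(["2000-03-31", "2000-03-30"])})

# Boolean Series: True where the date is the last day of its month
is_end = df["Date"].dt.is_month_end

# Candidate results for both branches, computed for the whole column at once
next_day = df["Date"] + pd.Timedelta(days=1)
first_of_month = df["Date"].dt.to_period("M").dt.to_timestamp()

# Keep next_day where is_end is True, otherwise fall back to first_of_month
df["Date_new"] = next_day.where(is_end, first_of_month)
print(df)
```

Series.where keeps the values of next_day wherever the mask is True and takes the value from first_of_month everywhere else, so no explicit if/else is needed.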
I want to add a new column to a dataframe with a condition based on its datetime index.
I have already set the date values as the index, so I'm working with a time index.
I used the following code:
new_col = []
start_date = pd.to_datetime('2020-03-01 00:00:00')
end_date = pd.to_datetime('2020-03-07 00:00:00')
for idx in range(len(df)):
    if df.index[idx] >= start_date and df.index[idx] <= end_date:
        new_col.append(1)
    else:
        new_col.append(2)
df["newC"] = new_col
I still get an error saying that the length of df and the length of the new column are not equal; it reports that the new column is longer. I also tried the numpy where method, but it did not work either.
Is there a better way to fill a new column based on a time-period condition, for example in this case from '2020-03-01 00:00:00' until '2020-03-07 00:00:00'?
This should work:
df["newC"] = pd.Series(df.index, index=df.index).apply(lambda dt: 1 if start_date <= dt <= end_date else 2)
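Since the question mentions that numpy where did not work, here is a sketch of how it could be applied to a DatetimeIndex. The sample index and the "value" column are made up for illustration; start_date, end_date and newC are the names from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame with a daily DatetimeIndex
idx = pd.date_range("2020-02-28", periods=12, freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

start_date = pd.to_datetime("2020-03-01 00:00:00")
end_date = pd.to_datetime("2020-03-07 00:00:00")

# Element-wise comparison on the index gives a boolean array of len(df)
in_range = (df.index >= start_date) & (df.index <= end_date)

# 1 inside the period (both ends inclusive), 2 outside
df["newC"] = np.where(in_range, 1, 2)
print(df)
```

Because the mask is built from df.index in a single expression, the result always has exactly len(df) elements, which avoids the length mismatch that can occur when a list is filled by a loop.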
I have a dataframe with a column date of type datetime64[ns].
When I try to create a new column day in MM-DD format from the date column, only the first method below works. Why doesn't the second method work in pandas?
df['day'] = df['date'].dt.strftime('%m-%d')
df['day2'] = str(df['date'].dt.month) + '-' + str(df['date'].dt.day)
Result for one row:
day 01-04
day2 0 1\n1 1\n2 1\n3 1\n4 ...
Types of columns
day object
day2 object
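The problem is that calling str on df['date'].dt.month converts the whole Series into one string (its printed representation, index included), and that single string is then assigned to every row. To convert each element to a string instead, use Series.astype: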
df['day2'] = df['date'].dt.month.astype(str) + '-' + df['date'].dt.day.astype(str)
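Note that astype(str) does not zero-pad, so a date such as January 4th would come out as 1-4 rather than 01-04. A small sketch of one way to make the second method match strftime, using Series.str.zfill; the sample dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"date": pd.to_datetime(["2021-01-04", "2021-11-23"])})

# First method from the question
df["day"] = df["date"].dt.strftime("%m-%d")

# Second method, converted element-wise and padded to two digits
month = df["date"].dt.month.astype(str).str.zfill(2)
day = df["date"].dt.day.astype(str).str.zfill(2)
df["day2"] = month + "-" + day

print(df)
```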
I have a dataset with dates as the index, where each column is the name of an item and the values are counts. For each column, I'm trying to find the periods of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python. So far I have tried using for loops, but I could not get them to work.
for i in a.index:
    if a.loc[i,'name']==3==df.loc[i+1,'name']==df.loc[i+2,'name']:
        print(a.loc[i,"name"])
The error I get is: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and the desired output in your question. Please do so next time. As it stands, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume that might not be the case, so I will reindex it so that every one of the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
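# Create a missing day. Any date that is actually in the index will do; a hard-coded
# date such as '2019-08-01' only works if it falls inside the generated range.
df = df.drop(df.index[100])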
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether each value in the dataframe equals 0 and then take a rolling sum with a window of 4 (that is, more than 3). This way I can avoid for loops. The resulting dataframe keeps the rows where at least one of the items had a value of 0 for 4 consecutive rows; each kept row is the last day of such a window. If an item is 0 for more than 4 consecutive rows, the run shows up as several rows whose dates are one day apart.
# custom function as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe
# (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map)
# for each row, calculate the sum over that row and the three rows before it (window of 4, i.e. >3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan in the sum, the sum is np.nan, so it can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)
# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
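The same idea can also be written without a custom function by comparing the whole frame to 0 and checking where the rolling count of zeros reaches the window size. This is only a sketch on a small hand-made frame (the values and the names is_zero, run_ends and result are hypothetical), so the outcome is easy to verify by eye.

```python
import pandas as pd

# Small hand-made frame: item0 has a run of four zeros, item1 never has more than three
dates = pd.date_range("2021-01-01", periods=8, freq="D", name="Date")
df = pd.DataFrame(
    {
        "item0": [1, 0, 0, 0, 0, 2, 0, 0],
        "item1": [0, 0, 0, 1, 0, 0, 0, 3],
    },
    index=dates,
)

window = 4  # "more than 3 consecutive days"

# 1 where the count is zero, 0 otherwise
is_zero = df.eq(0).astype(int)

# True on the last day of every window that consists only of zeros
run_ends = is_zero.rolling(window=window).sum().eq(window)

# Keep only the rows and columns that contain at least one such window
result = run_ends.loc[run_ends.any(axis=1), run_ends.any(axis=0)]
print(result)
```

Here only item0 qualifies, and the single remaining row (2021-01-05) is the fourth consecutive zero day.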
I have the following dataframe
How can I aggregate the number of tickets (summing) for every month?
I tried:
df_res[df_res["type"]=="other"].groupby(["type","date"])["n_tickets"].sum()
The date column has dtype object.
You need to assign the filtered rows to a new DataFrame first, so that the Series created by Series.dt.month has the same length as the DataFrame being grouped:
#if necessary, convert to datetimes
df_res['date'] = pd.to_datetime(df_res['date'])
df = df_res[df_res["type"]=="pax"]
#type is the same in every row, so it can be omitted from the grouping
out = df.groupby(df["date"].dt.month)["n_tickets"].sum()
#if you need a level with the same value `pax`
#out = df.groupby(['type',df["date"].dt.month])["n_tickets"].sum()
If you want to group by pax versus everything else:
import numpy as np
types = np.where(df_res["type"]=="pax", 'pax', 'no pax')
df_res.groupby([types, df_res["date"].dt.month])["n_tickets"].sum()
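A self-contained sketch of the grouping, since the original dataframe is not shown. The rows below are invented for illustration; only the column names type, date and n_tickets come from the question. It groups by dt.to_period("M") instead of dt.month, which is a swapped-in variation: it keeps the same month of different years apart, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the column names from the question
df_res = pd.DataFrame(
    {
        "type": ["pax", "pax", "other", "pax", "other"],
        "date": ["2021-01-05", "2021-01-20", "2021-01-07", "2021-02-03", "2021-02-11"],
        "n_tickets": [3, 2, 5, 4, 1],
    }
)

# The date column starts out as object (strings), so convert it first
df_res["date"] = pd.to_datetime(df_res["date"])

# Label every row as 'pax' or 'no pax'
types = np.where(df_res["type"] == "pax", "pax", "no pax")

# Year-month period, so January 2021 and January 2022 would not be merged
month = df_res["date"].dt.to_period("M")

out = df_res.groupby([types, month])["n_tickets"].sum()
print(out)
```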
If I have a dataframe with a value for each date "interval" and another dataframe of consecutive dates, how can I set a value in the second dataframe based on the date intervals in the first dataframe?
# first dataframe (the "lookup", if you will)
df1 = pd.DataFrame(np.random.random((10, 1)))
df1['date'] = pd.date_range('2017-1-1', periods=10, freq='10D')
# second dataframe
df2 = pd.DataFrame(np.arange(0,100))
df2['date'] = pd.date_range('2016-12-29', periods=100, freq='D')
So if a df2 date is greater than or equal to a df1 date and less than the next date in df1, we would say something like:
df2['multiplier'] = df1[0], for the proper element that fits within the dates.
I'm also not sure how the upper boundary would be handled, i.e. if a df2 date is greater than the greatest date in df1, it would get the last value in df1.
This feels dirty, so with apologies to the arts of element-wise operations, here's my go at it.
# create an "end date" second column by shifting the date
df1['end_date'] = df1['date'].shift(-1) + pd.DateOffset(-1)
# create a simple list by nested iteration
multiplier = []
for elem, row in df2.iterrows():
    if row['date'] < min(df1['date']):
        # kinda don't care about this instance
        multiplier.append(0)
    elif row['date'] < max(df1['date']):
        tmp_mult = df1[(df1['date'] <= row['date']) & (row['date'] <= df1['end_date'])][0].values[0]
        multiplier.append(tmp_mult)
        # for l_elem, l_row in df1.iterrows():
        #     if l_row.date <= row['date'] <= l_row.end_date:
        #         multiplier.append(l_row[0])
    else:
        multiplier.append(df1.loc[df1.index.max(), 0])
# set the list as a new column in the dataframe
df2['multiplier'] = multiplier
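A possible way to avoid the loop is pd.merge_asof, which matches each row of one frame to the most recent key in the other. This is a sketch under a few assumptions that differ from the code above: the value columns are given explicit names ("multiplier" and "value" instead of the default 0) so they do not collide in the merge, df1 uses the values 1.0 to 10.0 instead of random numbers so the result can be checked, and both frames are sorted by date.

```python
import numpy as np
import pandas as pd

# Lookup frame: one multiplier per interval start (fixed values instead of random ones)
df1 = pd.DataFrame({"multiplier": np.arange(1.0, 11.0)})
df1["date"] = pd.date_range("2017-01-01", periods=10, freq="10D")

# Frame of consecutive dates
df2 = pd.DataFrame({"value": np.arange(0, 100)})
df2["date"] = pd.date_range("2016-12-29", periods=100, freq="D")

# For every df2 date, take the df1 row with the latest date <= that date
merged = pd.merge_asof(df2, df1, on="date", direction="backward")

# Dates before the first df1 date have no match; use 0 as in the loop version
merged["multiplier"] = merged["multiplier"].fillna(0)
print(merged.head(15))
```

With direction="backward", dates after the last df1 date automatically pick up the last multiplier, which covers the upper-boundary case raised in the question.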