I have a pandas DataFrame of daily stock returns, with a date column and a return-rate column.
I only want to keep the last day of each week, but the data has some missing days. What can I do?
import pandas as pd
df = pd.read_csv('Daily_return.csv')
df.Date = pd.to_datetime(df.Date)
count = 300
for last_day in ('2017-01-01' + 7 * n for n in range(count)):  # pseudocode, this doesn't work
Actually my brain stopped working at this point, with my limited imagination... Maybe the biggest problem is that this kind of "+ 7n" arithmetic is meaningless when some dates are missing.
I'll create a sample dataset with 40 dates and 40 sample returns, then randomly keep 90 percent of it to simulate the missing dates.
The key is to convert your date column to datetime if it isn't already, and to make sure the DataFrame is sorted by date.
Then you can group by ISO year/week and take the last value in each group. If you run this repeatedly, you'll see that the selected dates change whenever a dropped row happened to be the last day of its week.
Based on that:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = pd.date_range(start='2022-04-18', periods=40, freq='D')
df['return'] = np.random.uniform(size=40)
# Keep 90 percent of the records so we can see what happens when some days are missing
df = df.sample(frac=.9)
# In case your dates are actually strings
df['date'] = pd.to_datetime(df['date'])
# Make sure they are sorted from oldest to newest
df = df.sort_values(by='date')
df = df.groupby([df['date'].dt.isocalendar().year,
                 df['date'].dt.isocalendar().week], as_index=False).last()
print(df)
Output
date return
0 2022-04-24 0.299958
1 2022-05-01 0.248471
2 2022-05-08 0.506919
3 2022-05-15 0.541929
4 2022-05-22 0.588768
5 2022-05-27 0.504419
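For comparison, a hedged sketch of the same idea using pd.Grouper with a weekly frequency; copying the date column first keeps the actual last observed day rather than the week-end bin label:
import pandas as pd
import numpy as np
# Same hypothetical sample as above: 40 daily returns with ~10% randomly dropped
df = pd.DataFrame({
    'date': pd.date_range(start='2022-04-18', periods=40, freq='D'),
    'return': np.random.uniform(size=40),
}).sample(frac=.9).sort_values(by='date')
# Bin rows into calendar weeks and keep the last row of each bin;
# dropna removes weeks that lost every day to the sampling
weekly = (df.assign(last_date=df['date'])
            .groupby(pd.Grouper(key='date', freq='W'))[['last_date', 'return']]
            .last()
            .dropna())
print(weekly)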
I have a DataFrame with multiple names and multiple timestamps associated with them; this is the data of players who have played a game during a month.
The ids repeat, since the data covers many days of the month.
I need to know how many hours a person plays per day.
I have made a sample DataFrame to make this easier:
import pandas as pd

data = {'ids': ['Kelsier', 'Kelsier', 'Saze', 'Val', 'Kelsier', 'Val',
                'Val', 'Val', 'Saze', 'Saze', 'Saze', 'Val'],
        'ts': ['2022-12-21 18:29:59.141', '2022-12-21 19:02:59.141',
               '2022-12-21 10:12:23.545', '2022-12-19 11:15:20.612',
               '2022-12-22 01:29:59.141', '2022-12-22 05:26:48.151',
               '2022-12-22 05:28:09.543', '2022-12-22 05:30:14.522',
               '2022-12-23 15:14:19.231', '2022-12-24 10:14:39.601',
               '2022-12-24 11:44:34.173', '2022-12-24 13:12:23.566']}
df = pd.DataFrame(data)

df['ts'] = pd.to_datetime(df['ts'])
What should I do to get the data I want from this DataFrame?
I want an output like this:
Is this possible? If so, how?
This could be a solution. I'm not sure exactly how you want to calculate the days and hours played, but if you want the time between each player's last and first timestamp, you could use the following:
# Calculate timedeltas max - min
time_deltas = df.groupby('ids')['ts'].agg(lambda x: x.max() - x.min()).reset_index()
# Split each timedelta into whole days and the remaining hours
time_deltas['DaysPlayed'] = time_deltas['ts'].apply(lambda x: x.days)
time_deltas['HoursPlayed'] = time_deltas['ts'].apply(lambda x: round(x.seconds / 3600, 0))
time_deltas
ids ts DaysPlayed HoursPlayed
0 Kelsier 0 days 07:00:00 0 7.0
1 Saze 3 days 01:32:10.628000 3 2.0
2 Val 5 days 01:57:02.954000 5 2.0
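One caveat with the snippet above: x.seconds only holds the remainder after whole days, which is why Saze shows 3 days plus 2 hours rather than roughly 73 total hours. If you want the total elapsed hours instead, a small hedged variant (TotalHours is a hypothetical column name) would be:
# Total elapsed hours between each player's first and last timestamp
time_deltas['TotalHours'] = round(time_deltas['ts'].dt.total_seconds() / 3600, 1)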
Here is the code for sample simulated data. The actual data can have varying start and end dates.
import pandas as pd
import numpy as np
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample('B').apply(lambda x: x.iloc[-1])  # positional x[-1] is deprecated on a Series
From dfb, I want to select the rows that fall in months with data for all of their days.
In dfb, January 2010 and January 2020 have incomplete data, so I would like the data from February 2010 through December 2019.
For this particular dataset, I could do
df_out = dfb['2010-02':'2019-12']
But please help me with a better solution.
Edit -- There seems to be plenty of confusion in the question. I want to omit the rows of the first month if it does not begin on the first day of the month, and the rows of the last month if it does not end on the last day of the month. Hope that's clear.
When you say a "better" solution, I assume you mean making the range dynamic based on the input data.
OK, since you mention that your data is continuous after the start date, it is a safe assumption that the dates are sorted in increasing order. With this in mind, consider the code:
import pandas as pd
import numpy as np
from datetime import date, timedelta
dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
dfb = df.resample('B').apply(lambda x: x.iloc[-1])
# fd is the first index in your dataframe
fd = df.index[0]
# If the first month is incomplete (does not start on day 1),
# move the start to the first day of the next month
if fd.day != 1:
    if fd.month == 12:
        # rolling over December must also increment the year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd
# ld is the last index in your dataframe
ld = df.index[-1]
# the day after ld falls in a different month exactly when ld is a month end
next_day = ld + timedelta(days=1)
if next_day.month != ld.month:
    last_day_of_prev_month = ld  # last month is complete, keep all of it
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)
df_out = dfb[first_day_of_next_month:last_day_of_prev_month]
There is another way using dateutil.relativedelta, though that relies on the python-dateutil module (installed alongside pandas, since pandas depends on it). The above solution does it without any extra imports.
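For completeness, a hedged sketch of the relativedelta version, reusing fd, ld and the timedelta import from above:
from dateutil.relativedelta import relativedelta

fd, ld = df.index[0], df.index[-1]
# First day of the month after fd, unless fd is already a month start
start = fd if fd.day == 1 else (fd + relativedelta(months=1)).replace(day=1)
# ld is a month end exactly when the next day falls in a different month
is_month_end = (ld + timedelta(days=1)).month != ld.month
end = ld if is_month_end else ld.replace(day=1) - timedelta(days=1)
df_out = dfb[start:end]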
I assume that in the general case the table is chronologically ordered (if not, use .sort_index). The idea is to extract the year and month from the date and keep only the rows whose (year, month) differs from those of both the first and the last row.
dfb['year'] = dfb.index.year    # new column at position 1 (after 'A')
dfb['month'] = dfb.index.month  # new column at position 2
first_month = (dfb['year'] == dfb.iloc[0, 1]) & (dfb['month'] == dfb.iloc[0, 2])
last_month = (dfb['year'] == dfb.iloc[-1, 1]) & (dfb['month'] == dfb.iloc[-1, 2])
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)
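A variant of the same idea that avoids the helper columns entirely (a hedged sketch, assuming dfb has a DatetimeIndex): compare monthly periods directly.
# Drop every row falling in the same (year, month) period as the first or last row
periods = dfb.index.to_period('M')
dfb = dfb[(periods != periods[0]) & (periods != periods[-1])]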
I have this kind of DataFrame.
The data represent the value of a consumption index, generally recorded once a month (at the end, or at the beginning of the following month), but sometimes more often. The value can be reset to 0 if the counter fails and is replaced. Moreover, in some months no data is available at all.
I would like to select only one entry per month, and this entry has to be the nearest to the first day of the month AND earlier than the 15th day of the month (because a later reading could be a measurement for the end of the month). Another condition is that if the difference between two consecutive values is negative (the counter has been replaced), that value needs to be kept even if its date is not the nearest to the first day of the month.
For example, the output data would need to be:
The purpose is to calculate a single consumption figure per month.
One solution is to iterate over the DataFrame (as an array) and apply some if-condition statements. However, I wonder if there is a "simple" alternative to achieve that.
Thank you
You can anchor each reading to its month end with MonthEnd, then drop duplicates on that column, keeping the row nearest the month start.
from pandas.tseries.offsets import MonthEnd
df['New'] = df.index + MonthEnd(0)             # anchor each reading to its own month end
df['Diff'] = (df['New'] - df.index).dt.days    # days from the reading to that month end
# Within each month, the largest Diff is the reading closest to the month start
df = df.sort_values(by=['New', 'Diff'], ascending=[True, False])
df = df.drop_duplicates(subset='New', keep='first').drop(['New', 'Diff'], axis=1)
That should do the trick, but I was not able to test it, so please copy and paste the sample data into the question as text if this isn't doing the job.
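Since the above is untested, here is a quick hedged check with hypothetical readings shaped like the question's data; the survivors should be the rows nearest each month start:
import pandas as pd
from pandas.tseries.offsets import MonthEnd

df = pd.DataFrame(
    {'Value': [1254, 1265, 1277, 1301, 1345, 1541]},
    index=pd.to_datetime(['2019-10-05', '2019-10-29', '2019-10-30',
                          '2019-11-04', '2019-11-30', '2020-02-03']),
)
df['New'] = df.index + MonthEnd(0)
df['Diff'] = (df['New'] - df.index).dt.days
df = (df.sort_values(by=['New', 'Diff'], ascending=[True, False])
        .drop_duplicates(subset='New', keep='first')
        .drop(['New', 'Diff'], axis=1))
print(df)  # keeps 2019-10-05, 2019-11-04 and 2020-02-03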
Define the DataFrame with a datetime index, add helper columns, use the shift method to conditionally remove rows, and finally drop the helper columns:
from pandas.tseries.offsets import MonthEnd, MonthBegin
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame([
[1254],
[1265],
[1277],
[1301],
[1345],
[1541]
], columns=["Value"]
, index=[dt.strptime("05-10-19", '%d-%m-%y'),
dt.strptime("29-10-19", '%d-%m-%y'),
dt.strptime("30-10-19", '%d-%m-%y'),
dt.strptime("04-11-19", '%d-%m-%y'),
dt.strptime("30-11-19", '%d-%m-%y'),
dt.strptime("03-02-20", '%d-%m-%y')
]
)
# Distance (in days) from each date to the nearest month boundary
early_days = df.loc[df.index.day < 15]
early_day_diff = early_days.index - (early_days.index - MonthEnd(1))
late_days = df.loc[df.index.day >= 15]
late_day_diff = (late_days.index + MonthBegin(1)) - late_days.index
# Align by index; a bare Index.append would assign the values positionally and misalign rows
day_offset = pd.concat([pd.Series(early_day_diff, index=early_days.index),
                        pd.Series(late_day_diff, index=late_days.index)])
df["day_offset"] = (day_offset / np.timedelta64(1, 'D')).astype(int)
df["start_of_month"] = df.index.day < 15
df["month"] = df.index.month
# df["month_diff"] = df["month"].diff().fillna(0).astype(int)
# Keep a row only when its month differs from the previous or the next row's month
df = df[df["month"].shift().ne(df["month"].shift(-1))]
df = df.drop(columns=["day_offset", "start_of_month", "month"])
print(df)
Returns:
Value
2019-10-05 1254
2019-10-30 1277
2019-11-04 1301
2019-11-30 1345
2020-02-03 1541
I am reading a csv file of the number of employees in the US by year and month (in thousands). It starts out like this:
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863
...
I want my pandas DataFrame to have the datetime as the index for each month's value. I'm doing this so I can later sum values over specific time ranges. I want it to look something like this:
1961-01-01 45119.0
1961-02-01 44969.0
1961-03-01 45051.0
1961-04-01 44997.0
1961-05-01 45119.0
...
I did some research and thought that if I stacked the years and months together, I could combine them into a datetime. Here is what I have done:
import pandas as pd
import numpy as np
df = pd.read_csv("BLS_private.csv", header=5, index_col="Year")
df.columns = range(1, 13) # I transformed months into numbers 1-12 for easier datetime conversion
df = df.stack() # Months are no longer columns
print(df)
Here is my output:
Year
1961 1 45119.0
2 44969.0
3 45051.0
4 44997.0
5 45119.0
...
I do not know how to combine the year and the months in the stacked index. Does stacking help at all in my case? I am also not very familiar with pandas datetime, so any explanation of how I could use it would be very helpful. And if anyone has an alternative to making datetime the index, I welcome ideas.
After the stack, create the DatetimeIndex from the current (year, month) MultiIndex:
from datetime import datetime
dt_index = pd.to_datetime([datetime(year=year, month=month, day=1)
                           for year, month in df.index.values])
df.index = dt_index
df.head(3)
# 1961-01-01 45119
# 1961-02-01 44969
# 1961-03-01 45051
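A loop-free variant of the same assembly (a hedged sketch on the same stacked df): pd.to_datetime can assemble timestamps from year/month/day columns.
import pandas as pd

# Pull the two levels out of the stacked MultiIndex, add a constant day, and assemble
parts = pd.DataFrame({
    'year': df.index.get_level_values(0),
    'month': df.index.get_level_values(1),
    'day': 1,
})
df.index = pd.to_datetime(parts)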
import pandas as pd
df = pd.read_csv("BLS_private.csv", index_col="Year")
# One month-start stamp per value; note that 'inclusive' replaced 'closed' in pandas 1.4+
dates = pd.date_range(start=str(df.index[0]), end=str(df.index[-1] + 1),
                      inclusive='left', freq="MS")
df = df.stack()
df.index = dates
df.to_frame()
from io import StringIO
import pandas as pd

s = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1961,45119,44969,45051,44997,45119,45289,45400,45535,45591,45716,45931,46035
1962,46040,46309,46375,46679,46668,46644,46720,46775,46888,46927,46910,46901
1963,46912,47000,47077,47316,47328,47356,47461,47542,47661,47805,47771,47863"""
df = pd.read_csv(StringIO(s))
# set index and stack
stack = df.set_index('Year').stack().reset_index()
# create a new index by parsing strings like '1961-Jan'
stack.index = pd.to_datetime(stack['Year'].astype(str) + '-' + stack['level_1'])
# keep only the values column
final = stack[0].to_frame()
1961-01-01 45119
1961-02-01 44969
1961-03-01 45051
1961-04-01 44997
1961-05-01 45119
1961-06-01 45289
I'm a beginner in Python. I'm working on a dataset which contains data for several years. This is a sample of the dataset:
Here, Hour(LT) is the time and DN(LT) is the day number of the year.
I'm working in Python 3 with pandas (Anaconda). My final goal is to find the daily, weekly, monthly and yearly means, so I preferred to convert the data to a pandas DatetimeIndex (resampling would then be quite easy, I guess!).
Here is the code I've written so far.
import pandas as pd
import numpy as np
df = pd.read_csv('test_file.txt', sep=' ')  # 'delimiter' is just an alias for 'sep', so one is enough
# convert the Year, Month, Day int columns into datetime format
year_month = pd.to_datetime(10000 * df.Year + 100 * df.Month + df.Day, format='%Y%m%d')
#convert Year, Month, Day, Hour(LT) into DayTimeHour format
year_hour_convert = pd.DataFrame({
'Day': np.array(year_month, dtype=np.datetime64),
'Hour': np.array(df['Hour(LT)'], dtype=np.int64)
})
#merge into "year-month-day-hour" format
year_hour = pd.to_datetime(year_hour_convert.Day) + pd.to_timedelta(year_hour_convert.Hour, unit='h')
#Define a new column for Time Series
df['DateTime'] = year_hour
#Drop unnecessary columns
df = df.drop(['Year', 'Month', 'Day', 'Hour(LT)', 'DN(LT)'], axis=1)
#Set YYYYMMDD HHMMSS as index
df = df.set_index('DateTime')
#Choose the data for 9 a.m. to 3 p.m.
df = df.between_time('09:00:00', '15:00:00')
I have turned my dataset into this format, eventually dropping the 'Year', 'Month', 'Day', 'Hour(LT)' and 'DN(LT)' columns.
Now I want to filter the data by how much of it is available per day. For instance, if the number of rows for 2016-01-04 is above 4, I keep that day's data; otherwise, I drop the whole day.
How can I do it in Pandas?
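A minimal hedged sketch of one way to do this, assuming df has the DatetimeIndex built above and that "above 4" means more than four rows in a calendar day:
# Group rows by calendar day and keep only the days with more than 4 measurements
df_filtered = df.groupby(df.index.date).filter(lambda g: len(g) > 4)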