Can I take the average of a datetime function? [duplicate] - python

This question already has answers here:
computing the mean for python datetime
(5 answers)
Get the average date from multiple dates - pandas
(2 answers)
Closed 2 years ago.
I'm new to Python, so I hope you don't mind my simple questions...
I'm also new to datetime functions, and I'm working with time series data at the moment.
Below is a sample dataframe for my purpose.
My objective is to group my messages by the 'group' column and take the mean of the datetimes. This would help with my data visualisation.
df = pd.DataFrame([['2018-04-12 11:20:57', 'Hello everyone', 1],
                   ['2018-04-12 11:20:57', 'Hello everyone', 1],
                   ['2018-04-12 11:19:34', 'second msg', 1],
                   ['2018-04-13 11:00:57', 'Random', 1],
                   ['2018-04-13 11:49:34', '3rd msg', 2],
                   ['2018-04-13 11:29:57', 'Msg', 2]],
                  columns=['datetime', 'msg', 'group'])
The code below does not work:
df.groupby('group')['datetime'].mean()
DataError: No numeric types to aggregate
Wondering if there's any way to get around this? Thank you.

You can do it like this: convert the datetimes to integer nanoseconds, take the per-group mean, then convert back:
import numpy as np

# to int64 nanoseconds, average per group, then back to datetimes
df.datetime = pd.to_datetime(df.datetime).values.astype(np.int64)
df = pd.DataFrame(pd.to_datetime(df.groupby('group').mean().datetime))
Output will be:
group datetime
1 2018-04-12 17:15:36.249999872
2 2018-04-13 11:39:45.500000000

I've never had to do this myself, but I thought it would work out of the box. I might be missing a point, but here's a workaround (if I understood you correctly):
df["datetime"] = pd.to_datetime(df["datetime"])
out = [
{"group": g, "mean": df.loc[df["group"].eq(g)]["datetime"].mean()}
for g in df["group"].unique()
]
pd.DataFrame(out)
Output:
   group                          mean
0      1 2018-04-12 17:15:36.249999872
1      2 2018-04-13 11:39:45.500000000
EDIT
If anyone could explain why df["datetime"].mean() works but df.groupby("group")["datetime"].mean() doesn't, that would be interesting to hear because I'm confused.
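A plausible explanation, offered as a hedged editor's note since it depends on the pandas version: Series.mean() has long special-cased datetime64 data, while older GroupBy.mean() implementations went through a numeric-only aggregation path and therefore raised DataError on datetime columns. Recent pandas releases aggregate datetimes in a groupby directly; a minimal sketch, assuming a reasonably new pandas:

import pandas as pd

# On recent pandas versions this works without any int64 round-trip.
df["datetime"] = pd.to_datetime(df["datetime"])
print(df.groupby("group")["datetime"].mean())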


How to convert Python DataFrame column with Excel 5 digit date to yyyy-mm-dd [duplicate]

This question already has an answer here:
Pandas read_excel: parsing Excel datetime field correctly [duplicate]
(1 answer)
Closed 1 year ago.
I'm trying to convert an entire column containing a 5 digit date code (e.g. 43390, 43599) to a normal date format. This is just to make data analysis easier; it doesn't matter which way it's formatted. In a series, the DATE column looks like this:
1 43390
2 43599
3 43605
4 43329
5 43330
...
264832 43533
264833 43325
264834 43410
264835 43461
264836 43365
I don't understand the previous answers to this question, and when I tried code such as
date_col = df.iloc[:,0]
print((datetime.utcfromtimestamp(0) + timedelta(date_col)).strftime("%Y-%m-%d"))
I get this error
unsupported type for timedelta days component: Series
Thanks, sorry if this is a basic question.
You are assigning the entire column (a Series) to date_col. If you want the value of a single row, use a scalar lookup instead, e.g. date_col = df.iloc[0, 0]; that returns one value.
timedelta() takes an integer number of days, not a Series.
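A vectorized alternative, sketched under the assumption that the serial numbers use Excel's Windows epoch (day 0 = 1899-12-30, which accounts for Excel's leap-year quirk):

import pandas as pd

# Convert the whole column of Excel serial day numbers at once;
# e.g. 43390 becomes 2018-10-17.
df["DATE"] = pd.to_datetime(df["DATE"], unit="D", origin="1899-12-30")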

Converting Pandas Object to minutes and seconds [duplicate]

This question already has answers here:
Pandas - convert strings to time without date
(3 answers)
Closed 1 year ago.
I have a stop_time column with values like 05:38 (MM:SS), but it is showing up as an object dtype. Is there a way to turn this into a time?
I tried using perf_dfExtended['Stop_Time'] = pd.to_datetime(perf_dfExtended['Stop_Time'], format='%M:%S')
but then it adds a date to the output: 1900-01-01 00:05:38
I guess what you're looking for is pd.to_timedelta (https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html); to_datetime will of course always try to create a date.
Bear in mind, though, that pd.to_timedelta can raise a ValueError for your column, since it expects an hh:mm:ss format. Prepend '00:' to each value of your column (which I assume are strings), then convert the column to timedelta. It could be something like:
pd.to_timedelta(perf_dfExtended['Stop_Time'].apply(lambda x: f'00:{x}'))
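For example, on a small made-up column (a quick demo, not the asker's actual data) this yields timedeltas:

import pandas as pd

s = pd.Series(['05:38', '10:17'])
print(pd.to_timedelta(s.apply(lambda x: f'00:{x}')))
# 0   0 days 00:05:38
# 1   0 days 00:10:17
# dtype: timedelta64[ns]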
This may work for you:
perf_dfExtended['Stop_Time'] = \
pd.to_datetime(perf_dfExtended['Stop_Time'], format='%M:%S').dt.time
Output (with some additional examples)
0 00:05:38
1 00:10:17
2 00:23:45
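A trade-off worth noting between the two answers, offered as general guidance: .dt.time yields plain datetime.time objects, which display cleanly but support no arithmetic or aggregation, while the to_timedelta approach keeps the column as timedelta64, so sums, means, and comparisons still work.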

Why does pandas Dataframe bfill or ffill yield random results when used with groupby? [duplicate]

This question already has an answer here:
(pandas) Why does .bfill().ffill() act differently than ffill().bfill() on groups?
(1 answer)
Closed 1 year ago.
The following is the example data
id,c1,c2
1,1,g1
2,,g2
3,,g1
4,2,g2
5,,g1
6,,g2
7,3,g1
8,,g2
9,,g1
10,4,g2
11,,g1
12,,g2
df=pd.read_clipboard(sep=',')
I want to group by c2, then forward fill and backfill c1 within each group. I expect the following 3 approaches to yield the same results, because "Groupby preserves the order of rows within each group." However, Approach 1 differs from the other two and gives the wrong result: for the row where id=2, the filled value is 1. This is obviously wrong, since there is no c1=1 at all within the g2 group. Is this a pandas bug? I am using pandas 1.1.3.
Approach 1
df['fill_value']=df.groupby('c2').c1.ffill().bfill()
df
Approach 2
df['fill_value']=df.groupby('c2').c1.ffill()
df['fill_value']=df.groupby('c2').fill_value.bfill()
df
Approach 3
df=df.sort_values('c2')
df['fill_value']=df.groupby('c2').c1.ffill().bfill()
df.sort_values('id')
I only found this answer after writing the question. I will close my question but still publish it, so future generations have a better chance of finding it on Google:
(pandas) Why does .bfill().ffill() act differently than ffill().bfill() on groups?
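For completeness, a hedged sketch of why this happens and the usual fix: df.groupby('c2').c1.ffill() already returns a plain Series aligned to the original frame, so a .bfill() chained after it runs over the whole interleaved column, not per group, and can pull values across group boundaries. Keeping both fills inside the groupby avoids that:

import pandas as pd

# Both fills happen within each c2 group, so nothing leaks between groups.
df['fill_value'] = df.groupby('c2')['c1'].transform(lambda s: s.ffill().bfill())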

How to set frequency of data shown in pandas? [duplicate]

This question already has answers here:
Resampling Minute data
(2 answers)
Closed 2 years ago.
I have some dataset. Let's presume it is:
dataset = pd.read_csv('some_stock_name_here.csv', index_col=['Date'], parse_dates=['Date'])
The csv file has 2500 observations (Date and Close price), and I want to create a new csv file that includes the same time series but at a much lower frequency, for example keeping only every 40th row. How can I do this?
Also, I'm wondering whether I could manipulate that frequency within the notebook without creating a new csv file.
Thanks in advance.
You can slice your df using iloc, going over all rows and taking those at indexes divisible by X:
X = 40
df.iloc[::X]
Saving the dataframe is achieved by the following code:
df.to_csv(FILE_PATH_HERE)
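Since the index is a DatetimeIndex (parse_dates plus index_col in the read_csv above), a time-based alternative to positional slicing is resampling; a sketch, assuming daily rows and that keeping the last observation per window is acceptable:

# Bucket the DatetimeIndex into 40-day windows and keep the last row of each.
# The result is an ordinary DataFrame, so it can be used in the notebook
# directly without writing a new csv.
lower_freq = dataset.resample('40D').last()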

How to find an index with maximum number of rows in Pandas? [duplicate]

This question already has answers here:
The first three max value in a column in python
(1 answer)
Count and Sort with Pandas
(5 answers)
Closed 3 years ago.
I am doing an online course which has a problem like "Find the name of the state with the maximum number of counties". The problem dataframe is shown in the image below.
[image: Problem Dataframe]
Now, I have given the dataframe a two-level index (hierarchical indexing), after which it looks like the image below.
[image: Modified Dataframe]
I have used this code to get the modified dataframe:
def answer_five():
new_df = census_df[census_df['SUMLEV'] == 50]
new_df = new_df.set_index(['STNAME', 'CTYNAME'])
return new_df
answer_five()
What I want to do now is find the name of the state with the most counties, i.e. the index value with the maximum number of rows. How can I do that?
I know this can be done with something like the groupby() method, but I'm not familiar with it yet and don't want to use it. Can anyone help? I have searched for this but failed. Sorry if the problem is rudimentary. Thanks in advance.
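No answer was posted in this thread, so here is a hedged sketch in the spirit of the linked duplicates: count the rows per STNAME level of the MultiIndex and take the largest, which sidesteps groupby as the asker wanted.

# value_counts() tallies rows per state and sorts descending, so idxmax()
# returns the state with the most counties. Assumes new_df is the frame
# returned by answer_five().
new_df = answer_five()
print(new_df.index.get_level_values('STNAME').value_counts().idxmax())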
