I want to group by two columns: day of the week and a second column, but I don't know how to do this.
This is my query for one column:
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() is its replacement
grouped = df.groupby(df['time'].dt.day_name())['id'].count().rename('count')
Where should I add the second column, for example the "type" column in my dataframe?
df.groupby() accepts a list of keys, like this:
# .dt.day_name() replaces .dt.weekday_name, which was removed in pandas 1.0
df.groupby([df['time'].dt.day_name(), df['type']])
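A minimal sketch of the list-of-keys approach, carried through to the count from the question. The sample rows are invented for illustration, and naming the key series "weekday" is an assumption made only so the result gets a readable column name:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-01-04 09:00", "2021-01-04 10:00",
        "2021-01-05 11:00", "2021-01-05 12:00",
    ]),
    "type": ["a", "b", "a", "a"],
    "id": [1, 2, 3, 4],
})

# Derive the weekday name and give the key a readable name.
weekday = df["time"].dt.day_name().rename("weekday")

# Group by both keys, then count ids per (weekday, type) pair.
grouped = df.groupby([weekday, df["type"]])["id"].count().rename("count")

# Turn the two-level index back into ordinary columns.
counts = grouped.reset_index()
print(counts)
```

The result is indexed by both keys; reset_index() flattens it if plain columns are easier to work with.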
I have a dataframe with the following details: (first df in image)
I want to be able to add new rows to the df that calculate next_apt + days, with the new timestamp of when it was run. So I want it to look like this:
The other columns should be left as they are. Just add the next next_apt with the newer timestamp at which it was calculated, and append the rows to the same df.
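Use date_add and cast the result to a timestamp.
This should work:
# F is assumed to be the usual alias for pyspark.sql.functions
from pyspark.sql import functions as F
df1.withColumn("newDateWithTimestamp", F.date_add(F.col("next_apt"), F.col("days")).cast("timestamp")).show()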
I would like to ask how I can take the dataframe shown in (existing dataframe), group its values by date and time, and take the mean of the values. What I mean is that if column B has 2 values in the same minute, it should take the average of those values, and do the same for the rest of the columns. What I want to achieve is one value per minute, as shown in (preprocessed dataframe).
Thank you
If your dataframe is called df, you can do the following:
df.groupby(['DataTime']).mean()
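Grouping on DataTime directly only merges rows whose timestamps are exactly equal. If the column still carries seconds, one option is to floor it to the minute first. This is a sketch under the assumption that DataTime is a datetime column; the sample values and the column C are made up:

```python
import pandas as pd

# Hypothetical sample data: two readings fall in the same minute.
df = pd.DataFrame({
    "DataTime": pd.to_datetime([
        "2021-01-01 10:00:05",
        "2021-01-01 10:00:40",
        "2021-01-01 10:01:10",
    ]),
    "B": [1.0, 3.0, 5.0],
    "C": [10.0, 20.0, 30.0],
})

# Truncate each timestamp to its minute and use that as the group key.
minute = df["DataTime"].dt.floor("min")

# Average every value column within each minute.
per_minute = df.groupby(minute)[["B", "C"]].mean()
print(per_minute)
```

If DataTime is stored as strings, converting it with pd.to_datetime first would be needed before .dt.floor can be used.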
I have the dataframe below and I am trying to display how many rides there are per day.
But I can see that only "near_penn" is treated as a column, while "Date" is not.
c = df[['start day','near_penn','Date']]
c=c.loc[c['near_penn']==1]
pre_pandemic_df_new=pd.DataFrame()
pre_pandemic_df_new=c.groupby('Date').agg({'near_penn':'sum'})
print(pre_pandemic_df_new)
print(pre_pandemic_df_new.columns)
Why doesn't it consider "Date" as a column?
How can I make Date a column of "pre_pandemic_df_new"?
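You can use the to_datetime method to convert the values to datetimes. Note that this indexing only works once "Date" is a regular column rather than the index.
import pandas as pd
pre_pandemic_df_new["Date"]= pd.to_datetime(pre_pandemic_df_new["Date"])
Hope this works.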
Why doesn't it consider "Date" as a column?
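Because Date is the index of your DataFrame: groupby moves the grouping key into the index of the result.
How can I make Date a column of "pre_pandemic_df_new"?
You can try this:
# reset_index returns a new DataFrame, so assign the result back
pre_pandemic_df_new = pre_pandemic_df_new.reset_index(level=['Date'])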
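A small sketch of both ways to end up with Date as a column: reset_index() after grouping, or as_index=False in the groupby call itself. The sample rows are invented; only the column names come from the question:

```python
import pandas as pd

# Hypothetical sample data with the question's column names.
c = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "near_penn": [1, 1, 1],
})

# Default behaviour: the grouping key becomes the index.
by_index = c.groupby("Date").agg({"near_penn": "sum"})
print(by_index.columns.tolist())   # Date is not listed here

# Option 1: move the index back into a column afterwards.
as_column = by_index.reset_index()

# Option 2: keep the key as a column from the start.
direct = c.groupby("Date", as_index=False).agg({"near_penn": "sum"})
print(direct)
```

Both options give the same two-column frame; as_index=False just saves the extra step.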
df[['Date','near_penn']] = df[['Date_new','near_penn_new']]
Once you have created your dataframe, you can try this to add new columns to the end of the dataframe and test whether it works before you make adjustments.
OR
You can check the value in the first row corresponding to the first "date" row.
These are the first things that came to mind; I hope it helps.
Say I have a df and I group by two columns. I then want to take only the first two rows for my grouped object, i.e.
grouped_data = df.groupby(['company','person']).first()
How do I then select the first two rows for each of these? E.g. for company = asda there are 8 rows, i.e. 9 people under this company, but I only want the first two rows. How can I do this using the dataframe above? Note that I have used first because, after grouping, I want to retain the column-by-column information without aggregation.
If you want the first two rows for each company, you can do:
df.groupby('company').head(2)
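As an illustration, here is head(2) applied to a small made-up frame. The company and person column names follow the question, while the rows and the sales column are hypothetical:

```python
import pandas as pd

# Hypothetical sample data: three people per company.
df = pd.DataFrame({
    "company": ["asda", "asda", "asda", "tesco", "tesco", "tesco"],
    "person": ["ann", "bob", "cat", "dan", "eve", "fay"],
    "sales": [1, 2, 3, 4, 5, 6],
})

# head(2) keeps the first two rows of every group, in original row order,
# and leaves all columns untouched (no aggregation takes place).
first_two = df.groupby("company").head(2)
print(first_two)
```

Unlike first(), head(n) returns the original rows with their original index, so nothing is collapsed into one row per group.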
I have a dataframe that looks like this (df1):
I want to reshape the following dataframe (df2) to look like df1:
The number of years in df2 goes up to 2020.
So, essentially for each row in df2, a new row for each year should be created. Then, new columns should be created for each month. Finally, the value for % in each row should be copied to the column corresponding to the month in the "Month" column.
Any ideas?
Many thanks.
This is pivot:
(df2.assign(Year=df2.Month.str[:4],
            Month=df2.Month.str[5:])
    .pivot(index='Year', columns='Month', values='%')
)
More details about pivoting a dataframe here.
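A runnable sketch of the pivot above. It assumes the Month column holds strings shaped like "2019-01" (four-digit year, a separator, then the month), which is what the slicing in the answer implies; the values themselves are made up:

```python
import pandas as pd

# Hypothetical sample data; the "YYYY-MM" layout of Month is an assumption.
df2 = pd.DataFrame({
    "Month": ["2019-01", "2019-02", "2020-01", "2020-02"],
    "%": [1.5, 2.5, 3.5, 4.5],
})

wide = (
    # Split the combined string into separate Year and Month parts.
    df2.assign(Year=df2["Month"].str[:4], Month=df2["Month"].str[5:])
       # One row per year, one column per month, filled with the % values.
       .pivot(index="Year", columns="Month", values="%")
)
print(wide)
```

pivot raises an error if a (Year, Month) pair occurs more than once; in that case pivot_table with an aggregation function is the usual alternative.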