How to group by a variable in pandas - python

I am trying to group by a variable in pandas, but it does not seem to work.
The variable is just a list of several column headers, and it is much easier to write the variable each time for the purposes of analysis rather than list the columns for each groupby.
Trying to turn this:
df_grouped = (df.groupby(['Column1','Column2','Column3','Column4'])
              [compvars].sum()).reset_index()
Into this:
groupbyvars=['Column1','Column2','Column3','Column4']
df_grouped = (df.groupby([groupbyvars])
              [compvars].sum()).reset_index()

As groupbyvars is already a list, there is no need to wrap it in another pair of brackets. We can replace:
df_grouped = (df.groupby([groupbyvars])
              [compvars].sum()).reset_index()
with:
df_grouped = (df.groupby(groupbyvars)
              [compvars].sum()).reset_index()
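As a quick check, a minimal sketch with made-up data (the column names mirror the question; compvars is assumed to be a list of numeric columns):
import pandas as pd

df = pd.DataFrame({
    'Column1': ['a', 'a', 'b'],
    'Column2': ['x', 'x', 'y'],
    'Column3': [1, 1, 2],
    'Column4': [10, 10, 20],
    'sales':   [5, 7, 9],
    'units':   [1, 2, 3],
})

groupbyvars = ['Column1', 'Column2', 'Column3', 'Column4']
compvars = ['sales', 'units']  # hypothetical analysis columns

# Pass the list itself -- no extra brackets needed.
df_grouped = (df.groupby(groupbyvars)
              [compvars].sum()).reset_index()
print(df_grouped)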

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if(df_stations['code'][j] == df['start_station_code'][i]):
            df['start_station'][i] = df_stations['name'][j]
        if(df_stations['code'][j] == df['end_station_code'][i]):
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the pandas equivalent of a JOIN (the default is an inner join; pass how="left" for a LEFT JOIN):
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by #Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the extra code columns that the merges create. Using your dataframe names df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=['code_x', 'code_y'])
)
As you may notice, the merges give the dataframe duplicate 'code' columns, which get suffixed automatically; this is the built-in default of merge. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
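If you prefer not to rely on the automatic _x/_y suffixes, you can set them explicitly and pass how="left" so that rows whose codes have no match in df_stations are kept (a sketch, reusing the df and df_stations names from above):
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code", how="left")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code",
           how="left", suffixes=("_start", "_end"))
    .rename(columns={"name_start": "start_station", "name_end": "end_station"})
    .drop(columns=["code_start", "code_end"])
)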

Combining dummies and count for pandas dataframe

I have a pandas dataframe like this:
as a plain text:
{'id;sub_id;value;total_stuff related to id and sub_id':
['aaa;1;cat;10', 'aaa;1;cat;10', 'aaa;1;dog;10', 'aaa;2;cat;7',
'aaa;2;dog;7', 'aaa;3;cat;5', 'bbb;1;panda;20', 'bbb;1;cat;20',
'bbb;2;panda;12']}
The desired output I want is this.
Note that there are many different "values" possible, so I would need to automate the creation of dummies variables (nb_animals).
But these dummies variables must contain the number of occurences by id and sub_id.
The total_stuff is always the same value for a given id/sub_id combination.
I've tried using get_dummies(df, columns = ['value']), which gave me this table.
using get_dummies
as a plain text:
{'id;sub_id;value_cat;value_dog;value_panda;total_stuff related to id
and sub_id': ['aaa;1;2;1;0;10', 'aaa;1;2;1;0;10', 'aaa;1;2;1;0;10',
'aaa;2;1;1;0;7', 'aaa;2;1;1;0;7', 'aaa;3;1;0;0;5', 'bbb;1;1;0;1;20',
'bbb;1;1;0;1;20', 'bbb;2;0;0;1;12']}
I'd love to use some kind of df.groupby(['id','sub_id']).agg({'value_cat':'sum', 'value_dog':'sum', ... , 'total_stuff':'mean'}), but writing all of the possible animal values would be too tedious.
So how do I get a proper aggregated count/sum for the values, and an average for total_stuff (since total_stuff is unique per id/sub_id combination)?
Thanks
EDIT: Thanks chikich for the neat answer. The agg_dict is what I needed.
Use pd.get_dummies to transform categorical data
df = pd.get_dummies(df, prefix='nb', columns=['value'])
Then group by id and sub_id:
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
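Putting it together on the posted sample data (a sketch; the frame is rebuilt here from the semicolon-separated rows above):
import pandas as pd

rows = ['aaa;1;cat;10', 'aaa;1;cat;10', 'aaa;1;dog;10', 'aaa;2;cat;7',
        'aaa;2;dog;7', 'aaa;3;cat;5', 'bbb;1;panda;20', 'bbb;1;cat;20',
        'bbb;2;panda;12']
df = pd.DataFrame([r.split(';') for r in rows],
                  columns=['id', 'sub_id', 'value', 'total_stuff'])
df['total_stuff'] = df['total_stuff'].astype(int)

df = pd.get_dummies(df, prefix='nb', columns=['value'])
agg_dict = {key: 'sum' for key in df.columns if key.startswith('nb_')}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'sub_id']).agg(agg_dict).reset_index()
print(df)  # one row per id/sub_id with nb_cat, nb_dog, nb_panda counts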

Exclude values in DF column

I have a problem: I want to drop from my DF all rows whose value in column 'XX' ends with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the concerned values, but how do I apply it to my DF and drop those rows?
I tried a few things but nothing worked.
Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
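Two edge cases worth noting (a sketch, reusing the df and 'XX' names from the question): the .str accessor only works on string data, and missing values propagate into the mask unless you tell endswith how to treat them.
# If 'XX' is numeric, convert it to string before using the .str methods:
df = df[~df['XX'].astype(str).str.endswith('99')]

# If 'XX' is text but contains NaN, na=False treats missing values as
# "does not end with 99" instead of putting NaN into the boolean mask:
df = df[~df['XX'].str.endswith('99', na=False)]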

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered as headers). I want to make a dictionary out of this, with the first column being the key and the second column being the value. I already tried the to_dict() method, but it's not working as it's an empty dataframe.
Example:
df = |Land |Norway| to {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly about learning the best/different/efficient approach to this problem.
For now I have this as the solution:
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
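For the two-column case in the question there is also a more direct option (a sketch, assuming the same single-header-row layout): the header row itself is the key/value pair, so index the columns directly.
import pandas as pd

df = pd.DataFrame(columns=['Land', 'Norway'])
result = {df.columns[0]: df.columns[1]}
print(result)  # {'Land': 'Norway'}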

How to group a Series by values in pandas?

I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups Series by its index. How can I make it group by value instead?
grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])
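A quick usage sketch with made-up data, grouping by the values and counting:
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'b', 'a'])

grouped = s.groupby(s)
print(grouped.size())   # a: 3, b: 2, c: 1
print(grouped.groups)   # maps each value to the index labels where it occurs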
Three methods:
DataFrame: df.groupby(['column']).size()
Series: sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame(sel, columns=['column']).groupby(['column']).size()
For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]
You should convert it to a DataFrame, then add a column that is the date(). You can do groupby on the DataFrame with the date column.
df = pandas.DataFrame(s, columns=["datetime"])
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.
To add another suggestion, I often use the following as it uses simple logic:
pd.Series(index=s.values).groupby(level=0)
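For the date-grouping case in the question specifically, modern pandas can also derive the key straight from the values with the .dt accessor, avoiding both the lambda and the intermediate DataFrame (a sketch, assuming s holds datetime64 values):
import pandas as pd

s = pd.Series(pd.to_datetime([
    '2024-01-01 08:00', '2024-01-01 17:30', '2024-01-02 09:15',
]))

grouped = s.groupby(s.dt.date)
print(grouped.size())   # 2024-01-01: 2, 2024-01-02: 1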
