How can I use the pandas groupby function? - python

I have a dataframe 'df1' containing 1226 rows × 13 columns. I want to group it by the 'Region' column, but it is not working.

Try this out for grouping based on a column:
blockedGroup = df1.groupby('Region')
blocking_df = {}
for x in blockedGroup.groups:
    temp_df = blockedGroup.get_group(x)
    blocking_df.update({x: temp_df})  # key is the region value for this group
This collects the groups into a dict, where the keys are the unique values in 'Region' and the values are the corresponding DataFrames,
e.g. {"USA": DataFrame}

df.groupby does not modify the dataframe in place. Assign the result to a new variable instead:
df2 = df1.groupby('Region')
df2
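Note that df2 here is a DataFrameGroupBy object rather than a dataframe; you only get a regular dataframe (or Series) back once you aggregate. A minimal sketch, using a made-up numeric 'Sales' column for illustration:
import pandas as pd

# hypothetical data; 'Sales' is an assumed column name for illustration
df1 = pd.DataFrame({"Region": ["USA", "EU", "USA", "EU"],
                    "Sales": [10, 20, 30, 40]})

df2 = df1.groupby("Region")      # a DataFrameGroupBy object, not a dataframe
totals = df2["Sales"].sum()      # aggregating produces a regular result
print(totals)
# Region
# EU     60
# USA    40
# Name: Sales, dtype: int64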

Related

create multiple csv/excel files based on column value after operation with dataframe

My dataframe example (over 35k rows):
stop_id time
7909 2022-04-06T03:47:00+03:00
7909 2022-04-06T04:07:00+03:00
1009413 2022-04-06T04:10:00+03:00
1002246 2022-04-06T04:19:00+03:00
1009896 2022-04-06T04:20:00+03:00
I want to conduct some operations on this dataframe and then split it based on the stop_id value. So, assuming there are 50 unique stop_id values, I want to get 50 separate csv/excel files, each containing the data for one unique stop_id. How can I do this?
Use groupby:
# group by the 'stop_id' column
groups = df.groupby("stop_id")
Then iterate over the groups, writing each one to a file named after its stop_id using an f-string:
for name, group in groups:
    # write each group to its own file
    group.to_csv(f'{name}.csv')
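The question also mentions Excel output; a minimal variant of the same loop (assuming an Excel engine such as openpyxl is installed) would be:
# one Excel file per stop_id instead of CSV
for name, group in df.groupby("stop_id"):
    group.to_excel(f'{name}.xlsx', index=False)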
I used groupby and the first method here.
import pandas as pd

df = pd.DataFrame({"stop_id": [7909, 7909, 1009413, 1002246, 1009896],
                   "time": ["2022-04-06T03:47:00+03:00", "2022-04-06T04:10:00+03:00",
                            "2022-04-06T04:07:00+03:00", "2022-04-06T04:19:00+03:00",
                            "2022-04-06T04:20:00+03:00"]})

# keep only the first row for each stop_id
df = df.groupby("stop_id")
df = df.first().reset_index()
print(df)

# write one single-row file per stop_id
for idx, item in enumerate(df["stop_id"]):
    df_inner = pd.DataFrame({"stop_id": [item], "time": [df["time"].values[idx]]})
    df_inner.to_csv(f'{item}.csv', index=False)
stop_id time
0 7909 2022-04-06T03:47:00+03:00
1 1002246 2022-04-06T04:19:00+03:00
2 1009413 2022-04-06T04:07:00+03:00
3 1009896 2022-04-06T04:20:00+03:00

Drop rows in dataframe whose column has more than a certain number of distinct values

I have an example dataframe given below, and I am trying to drop the rows whose cluster_num value occurs only once.
df = pd.DataFrame([[1,2,3,4,5],[1,3,4,2,5],[1,3,7,9,10],[2,6,2,7,9],[2,2,4,7,0],[3,1,9,2,7],[4,9,5,1,2],[5,8,4,2,1],[5,0,7,1,2],[6,9,2,5,7]])
df.rename(columns = {0:"cluster_num",1:"value_1",2:"value_2",3:"value_3",4:"value_4"},inplace=True)
# Dropping rows whose cluster_num value occurs only once
count_dict = df['cluster_num'].value_counts().to_dict()
df['count'] = df['cluster_num'].apply(lambda x : count_dict[x])
df[df['count']>1]
In the above example, the rows where cluster_num equals 3, 4 and 6 would be dropped.
Is there a way of doing this without having to create a separate column? I need all 5 initial columns (cluster_num, value_1, value_2, value_3, value_4) in the output. My output dataframe according to the above code is:
I have tried to filter using groupby() with count(), but it did not work out.
groupby/filter
df.groupby('cluster_num').filter(lambda d: len(d) > 1)
duplicated
df[df.duplicated('cluster_num', keep=False)]
groupby/transform
Per @QuangHoang:
df[df.groupby('cluster_num')['cluster_num'].transform('size') >= 2]
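For reference, all three of these keep only the rows whose cluster_num occurs more than once in the sample dataframe above (clusters 1, 2 and 5):
kept = df.groupby('cluster_num').filter(lambda d: len(d) > 1)
print(kept['cluster_num'].tolist())
# [1, 1, 1, 2, 2, 5, 5]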

pandas df masking specific row by list

I have a pandas df which has 7000 rows * 7 columns, and I have a list (row_list) consisting of the values that I want to filter on from df.
What I want to do is pull out the rows of df that contain a corresponding value from the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names=['A'])
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)

boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
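To illustrate why the original code returns an empty dataframe: row_list ends up as a list of one-element lists rather than a flat list of values, so isin can never match the scalars in df.D. A minimal sketch of the fix, with made-up stand-in data for the two csv files:
import pandas as pd

# made-up stand-ins for the real files
df = pd.DataFrame({"D": ["a", "b", "c", "d"]})
df1 = pd.DataFrame({"A": ["b", "d"]})

# pass the column itself (a flat Series of scalars) to isin
filtered_df = df[df.D.isin(df1.A)]
print(filtered_df)
#    D
# 1  b
# 3  d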
Some other approaches you could try:
(1) generating separate dfs for each condition, concatenating, then deduplicating (slow)
(2) a custom function that annotates a bool column (default False, set True when the condition is fulfilled), then filtering on that column
(3) keeping a list of the indices of all rows containing your row_list values, then filtering with iloc based on that index list
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.

Select columns based on != condition

I have a dataframe and I have a list of some column names that correspond to the dataframe. How do I filter the dataframe so that it != the list of column names, i.e. I want the dataframe columns that are outside the specified list.
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns = ["col1", "col2"])
You can either drop those columns from the dataframe, or build a list of the columns that are not in your list:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
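A quick self-contained check of the list-comprehension approach, with made-up column names standing in for true_binary_cols:
import pandas as pd

# 'b' and 'd' stand in for the columns listed in true_binary_cols
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
true_binary_cols = ["b", "d"]

filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
print(df_filtered.columns.tolist())   # ['a', 'c']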

add selected columns from two pandas dfs

I have two pandas dataframes, a_df and b_df. a_df has columns ID, atext, and var1-var25, while b_df also has columns ID, atext, and var1-var25.
I want to add ONLY the corresponding var columns from a_df and b_df and leave ID and atext alone.
The code below adds ALL the corresponding columns. Is there a way to get it to add just the columns of interest?
absum_df=a_df.add(b_df)
What could I do to achieve this?
Use filter:
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
If you want to keep additional columns as-is, use concat after summing:
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
Alternatively, instead of subselecting columns from a_df, you could drop the columns that are already in absum_df; this keeps every column of a_df that is not in absum_df:
absum_df = pd.concat([a_df.drop(absum_df.columns, axis=1), absum_df], axis=1)
You can subset a dataframe to particular columns:
var_columns = ['var{}'.format(i) for i in range(1, 26)]
absum_df=a_df[var_columns].add(b_df[var_columns])
Note that this will result in a dataframe with only the var columns. If you want a dataframe with the non-var columns from a_df, and the var columns being the sum of a_df and b_df, you can do
absum_df = a_df.copy()
absum_df[var_columns] = a_df[var_columns].add(b_df[var_columns])
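A small end-to-end sketch of the filter/add approach, with made-up frames using only var1 and var2 to stand in for var1-var25:
import pandas as pd

# made-up sample data; only two var columns for brevity
a_df = pd.DataFrame({"ID": [1, 2], "atext": ["x", "y"],
                     "var1": [1, 2], "var2": [3, 4]})
b_df = pd.DataFrame({"ID": [1, 2], "atext": ["x", "y"],
                     "var1": [10, 20], "var2": [30, 40]})

# sum only the var columns, then re-attach ID and atext
absum_df = a_df.filter(like='var').add(b_df.filter(like='var'))
absum_df = pd.concat([a_df[['ID', 'atext']], absum_df], axis=1)
print(absum_df)
#    ID atext  var1  var2
# 0   1     x    11    33
# 1   2     y    22    44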
