python - dataframe - groupby - treatment of non-grouped column in case of difference

I have a dataframe containing an ID and I wish to group by that ID. I need to keep all the other columns (static data, strings) of the dataframe as well, so initially I included all the static-data columns in the group by. However, there can be differences in the static data between two or more rows that have the same ID (due to different sources). In that case I would still like to group on the ID and not create 'duplicates'. For the column with the difference I'm rather indifferent: the grouped row can just take the first value it encounters among the conflicting rows.
Hope this illustration clarifies:
[example image]
Any suggestions?

You can use groupby().agg() and specify what you want to do with each column of your dataframe in a dictionary. Based on the example of your intended outcome, that would be:
df.groupby('identifier').agg({'name': 'first', 'amount': 'sum'})
This takes the first value of the name column and sums the values in the amount column.
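Since the illustration isn't shown, here is a minimal runnable sketch of the same idea; the identifier, name, and amount column names and the sample values are assumptions based on the answer above:

import pandas as pd

# Hypothetical data: identifier 1 appears twice with conflicting names.
df = pd.DataFrame({
    'identifier': [1, 1, 2],
    'name': ['Alice', 'Alyce', 'Bob'],  # static data that differs between sources
    'amount': [10, 5, 7],
})

# Keep the first name encountered per identifier and sum the amounts.
out = df.groupby('identifier').agg({'name': 'first', 'amount': 'sum'})
print(out)
#              name  amount
# identifier
# 1           Alice      15
# 2             Bob       7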

Related

How to combine columns in a dataframe based on date & time and take the mean of the values

I would like to ask how I can join the dataframe as shown in (existing dataframe) to group values based on date & time and take the mean of the values. What I mean is that if col B has two values in the same minute, it takes the average of those values, and the same for the rest of the columns. What I want to achieve is to have one value per minute, as shown in (preprocessed dataframe).
Thank you
If your dataframe is called df, you can do the following:
df.groupby(['DataTime']).mean()
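The plain groupby above works when the timestamps are already minute-aligned; if they carry stray seconds, flooring them to the minute first does the trick. A minimal sketch, assuming a DataTime column and a numeric col B:

import pandas as pd

# Hypothetical readings; the first two rows fall within the same minute.
df = pd.DataFrame({
    'DataTime': pd.to_datetime(['2021-01-01 10:00:10',
                                '2021-01-01 10:00:40',
                                '2021-01-01 10:01:20']),
    'B': [2.0, 4.0, 6.0],
})

# Truncate each timestamp to its minute, then average the rows per minute.
per_minute = df.groupby(df['DataTime'].dt.floor('min')).mean(numeric_only=True)
print(per_minute)
#                        B
# DataTime
# 2021-01-01 10:00:00  3.0
# 2021-01-01 10:01:00  6.0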

How to make a new dataframe from an existing dataframe with unique row values of one column and corresponding row values from other columns?

I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has unique customer_code values and the corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code which also shows market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The df[['customer_code','market_code']] part gives us a DataFrame containing only the two columns of interest, and drop_duplicates('customer_code') eliminates all but the first occurrence of each duplicate value in the customer_code column (you could instead keep the last occurrence of each duplicate by passing the keep='last' argument).
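If you would rather keep the last occurrence, the same call takes the keep='last' argument (a small variation on the code above):

new_df = df[['customer_code', 'market_code']].drop_duplicates('customer_code', keep='last')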

Drop duplicates from a pandas dataframe based on all columns starting from the third one

I have a dataframe with 50+ columns, and the first 2 are unique IDs. For some reason, rows with different IDs can have exactly the same data from the third column onwards.
What I want to achieve is to delete the duplicates from the dataframe based on all columns starting from the third one. If there is more than one row with different IDs and the same data from the third column onwards, it doesn't matter which row we keep; it can be the last one or the first one, whichever is easier.
I am fairly new to pandas; what I tried is something like this:
df.drop_duplicates(subset=df.iloc[2:], keep="last")
df.drop_duplicates expects a list of column names as the subset argument, so try this:
df.drop_duplicates(subset=df.columns[2:], keep="last")
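A minimal runnable sketch of the fix (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'id_a': [1, 2, 3],
    'id_b': ['x', 'y', 'z'],
    'val1': [10, 10, 20],
    'val2': ['a', 'a', 'b'],
})

# Deduplicate on every column from the third one onwards (val1, val2, ...),
# keeping the last of each group of identical rows.
deduped = df.drop_duplicates(subset=df.columns[2:], keep='last')
print(deduped)
#    id_a id_b  val1 val2
# 1     2    y    10    a
# 2     3    z    20    b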

Using describe() method to exclude a column

I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using describe() and exclude.
describe() works on datatypes: you can include or exclude columns based on their dtype, not by column name. If your id column is the only one of its datatype, then
df.describe(exclude=[datatype])
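For example, if id happened to be the only integer column (an assumption for illustration), this would drop it from the summary:

df.describe(exclude=['int64'])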
or, if you just want to remove the column(s) from describe, try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
TaDa, it's done. For more info on describe, see the pandas documentation.
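One caveat: building the column list from a set does not preserve the original column order. If order matters, df.drop(columns=['id']).describe() gives the same result while keeping the order.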
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody responded with an example from the official docs, which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_you_dont_want = {'column_1', 'column_2', 'column_3'}
columns_you_want = set(your_bigger_data_frame.columns) - columns_you_dont_want
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_you_want)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
Here's an example that reads a .csv file and then keeps only the smaller portion of the DataFrame that holds what you need:
import pandas as pd

df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes aren't treated as escapes
df = df[['column_1', 'column_2', 'column_3']]  # ...plus any other columns you need
df.describe()
Note that describe()'s exclude argument takes dtypes, not column names, so output.describe(exclude=['id']) will not work; use output.drop(columns=['id']).describe() instead.

Unique Value Index from two fields

I'm new to pandas and python, and could definitely use some help.
I have the code below, which almost does what I want. It creates dummy variables for the unique values in a field and indexes them by the unique combinations of the values in two other fields.
What I would like is only one row for each unique combination of the fields used for the index. Right now I get multiple rows for, say, 'asset subs end dt' = 10/30/2008 and 'reseller csn' = 55008 if the dummy variable comes up 3 times. I would rather have one row for that combination of index field values, with a 3 in the dummy variable column.
Code:
df = data
df = df.set_index(['ASSET_SUBS_END_DT','RESELLER_CSN'])
Dummies = pd.get_dummies(df['EXPERTISE'])
something like:
df.groupby(level=[0, 1]).EXPERTISE.count()
When you do this groupby, everything with the same index is grouped together. Assuming your data in EXPERTISE is not null, you will get back the unique index values with the count for each. Try it out for yourself, play around with the results, and see how it can be combined with your existing DataFrame to get the final result you want.
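If you want the dummy columns themselves collapsed to one row per index combination (with the 3 from your example), summing the dummies by index level is one way. A sketch with made-up data standing in for your fields:

import pandas as pd

data = pd.DataFrame({
    'ASSET_SUBS_END_DT': ['10/30/2008', '10/30/2008', '10/30/2008'],
    'RESELLER_CSN': [55008, 55008, 55008],
    'EXPERTISE': ['renewals', 'renewals', 'renewals'],
})

df = data.set_index(['ASSET_SUBS_END_DT', 'RESELLER_CSN'])
Dummies = pd.get_dummies(df['EXPERTISE'])

# Summing the 0/1 dummy columns per index combination yields one row per
# combination, with the number of occurrences in each dummy column.
counts = Dummies.groupby(level=[0, 1]).sum()
print(counts)
#                                 renewals
# ASSET_SUBS_END_DT RESELLER_CSN
# 10/30/2008        55008                3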
