groupby based on conditions - python

I'm handling my data, and I write my code like this:
complete_data = complete_data.groupby(['STDR_YM_CD', 'TRDAR_CD' ]).sum().reset_index()
After executing the code, I got the dataframe shown in the picture below.
But I want to aggregate the values based on the first three characters of the SVC_INDUTY_CD column, as in the second picture below.
Here is a link to my data:
http://blogattach.naver.com/c356df6c7f2127fbd539596759bfc1bd1848b453f1/20170316_215_blogfile/khm2963_1489653338468_dtPz6k_csv/test2.csv?type=attachment
Thanks in advance.

I'm sure there's a better way but this is one way you could do this:
complete_data['first_three_temp'] = complete_data['SVC_INDUTY_CD'].str[:3]
complete_data = complete_data.groupby(['STDR_YM_CD', 'TRDAR_CD', 'first_three_temp' ], as_index=False).sum()
complete_data.drop('first_three_temp', axis=1, inplace=True)
This will add a temporary column containing only the first three characters of your SVC_INDUTY_CD column; you can then group on it and drop it afterwards. As I said, I'm sure there's a more efficient way, so I'm not sure whether you'll be limited by the size of your dataset.
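A variant that skips the temporary column is to pass the sliced Series directly to groupby. This is only a sketch assuming the same column names as above; SVC_INDUTY_GRP is an illustrative label for the grouping key, not a name from the original data:
import pandas as pd

# Build the grouping key on the fly; renaming it avoids a clash with the
# original SVC_INDUTY_CD column when the index is reset.
svc_group = complete_data['SVC_INDUTY_CD'].str[:3].rename('SVC_INDUTY_GRP')
complete_data = (
    complete_data
    .groupby(['STDR_YM_CD', 'TRDAR_CD', svc_group])
    .sum(numeric_only=True)
    .reset_index()
)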

Related

Iterate over a pandas DataFrame & Check Row Comparisons

I'm trying to iterate over a large DataFrame that has 32 fields, 1 million plus rows.
What I'm trying to do is iterate over each row and check whether any of the other rows have duplicate information in 30 of the fields while the remaining two fields have different information.
I'd then like to store the ID info of the rows that meet these conditions.
So far I've been trying to figure out how to check two rows with the code below. It seems to work when comparing single columns but throws an error when I try more than one column. Could anyone advise on how best to approach this?
for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])
As a general rule, you should always try not to iterate over the rows of a DataFrame.
It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicate rows, the code looks something like this:
df.duplicated(subset=['col1', 'col2', 'col3']) # etc.
Full example:
# Set up test df
import pandas as pd
from io import StringIO

sub_df = pd.read_csv(
    StringIO("""ID;col1;col2;col3
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""),
    sep=";"
)
# Note: each data row has one more field than the header, so pandas uses the
# first field (One/Two/Three) as the index and ID/col1/col2/col3 as columns.
Find which rows are duplicates in col1 and col2. Note that, by default, the first occurrence is not marked as a duplicate, but later occurrences are; this behaviour can be changed as described in the pandas documentation for duplicated().
mask = sub_df.duplicated(["col1", "col2"])
For the sample above, this mask is False, True, False: row Two repeats the col1/col2 combination of row One. Now, filter using the mask.
sub_df["ID"][sub_df.duplicated(["col1", "col2"])]
Of course, you can do the last two steps in one line.
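For the original use case (rows that match on 30 fields but differ in the remaining two), a rough sketch along the same lines could look like the following. The column names here are made up for illustration; substitute your real ones:
import pandas as pd

match_cols = [f"field_{i}" for i in range(30)]  # the 30 fields that must match (hypothetical names)
diff_cols = ["other_a", "other_b"]              # the two fields that may differ (hypothetical names)

# Keep only rows whose 30 match fields occur more than once, then keep the
# groups in which the remaining two fields are not all identical.
dups = df[df.duplicated(subset=match_cols, keep=False)]
ids = (
    dups.groupby(match_cols, sort=False)
        .filter(lambda g: len(g[diff_cols].drop_duplicates()) > 1)["ID"]
)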

EDA for loop on multiple columns of dataframe in Python

Just a random question. There's a dataframe, df, from the Boston housing dataset, and I'm trying to do EDA on a few of its columns, assigned to a variable feature_cols, which I could use afterwards to check for NAs. How would one go about this? I have the following (see picture), which is throwing an error:
This is what I was hoping to try to do after the above (see picture):
Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your pictures. The first is a KeyError: if you want to access a subset of a dataframe's columns, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
However, this will return a dataframe with three columns, and the for loop in your picture will not work on it. Rather than iterating over the dataframe, you can use the one-liner:
df.isna().sum()
This will print the name of every column of the dataframe along with the count of missing values in each column. Of course, if you want to check only a subset of columns, you can replace df by df[list_of_column_names].
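For example, restricted to the three columns from the question (assuming they exist in df):
df[['RM', 'ZN', 'B']].isna().sum()  # missing-value count per selected column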
To access multiple columns, you only need to store the column names in a list, for example
feature_cols = ['RM','ZN','B']
and then access them with
x = df[feature_cols]
Now to iterate on columns of df, you can use
for column in df[feature_cols]:
    print(df[column])  # or anything
As per your updated comment, if your end goal is to see null counts only, you can achieve this without looping, e.g.
df[feature_cols].info(verbose=True, null_counts=True)  # in recent pandas versions, use show_counts=True instead
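If you do want the explicit loop the question asks about, a minimal sketch, assuming df is the Boston housing DataFrame and that the RM, ZN and B columns exist, would be:
feature_cols = ['RM', 'ZN', 'B']
for column in feature_cols:
    # missing-value count for each selected column
    print(column, df[column].isna().sum())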

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc#gmail.com', 'pqr#hotmail.com', 'xyz#yahoo.com',
                  'abc#gmail.com', 'pqr#hotmail.com', 'xyz#gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My data frame looks something like this:
Image of data frame
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the order_num column, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: Expected output appearance
but I do not wish to group by the other columns. Is there any alternative to performing the list operation? I tried using transform, but it threw a size-mismatch error, and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to create a new column filled with lists, then remove duplicates with DataFrame.drop_duplicates by the order_num column, and finally sort if necessary:
df['product_code'] = df['order_num'].map(
    df.groupby('order_num')['product_code'].apply(list)
)
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
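On the sample frame above (which has no timestamp column, so the final sort step is skipped here), the map plus drop_duplicates steps would give roughly:
print(df.drop_duplicates('order_num'))
#    order_num            email     product_code
# 0        123    abc#gmail.com  [rdcf1, 34fgdf]
# 1        234  pqr#hotmail.com   [6fgxd, gvwt5]
# 2        356    xyz#yahoo.com   [2sdfs, 5ganb]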

Using describe() method to exclude a column

I am new to using python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe works on datatypes: you can include or exclude based on the datatype, not based on columns. If your id column is the only one of its datatype, then
df.describe(exclude=[datatype])
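For instance, assuming id is the only int64 column in the frame (just an illustrative assumption):
df.describe(exclude=['int64'])  # summarises every column except the int64 one(s)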
or if you just want to remove the column(s) in describe, then try this
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe.
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_you_dont_want = {'column_1', 'column_2', 'column_3', 'etc'}
columns_to_keep = set(your_bigger_data_frame.columns) - columns_you_dont_want
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few columns, just create a new DataFrame and then apply describe():
Here is an example that reads a .csv file and then keeps a smaller portion of that DataFrame holding only what you need:
df = pd.read_csv('.\docs\project\file.csv')
df = df[['column_1','column_2','column_3','etc']]
df.describe()
Use output.drop(columns=['id']).describe() (note that describe's exclude parameter filters by dtype, not by column name, so you cannot pass 'id' to it directly).

filter off NaNs in a large DataFrame with headers

I have a large number of time series, with blanks on certain dates for some of them. I read them with xlwings from an Excel sheet:
Y0 = xw.Range('SomeRangeinXLsheet').options(pd.DataFrame, index=True, header=3).value
I'm trying to create a filter to run regressions on those series, so I have to take out the void dates. If I do:
print(Y0.iloc[:,[i]]==Y0.iloc[:,[i]])
I get a proper series of true/false for my column number i, fine.
I'm then stuck: I can't find a way to filter the whole df with the true/false values for that column, or even just extract that clean series as a pd.Series.
I need them one by one, to adapt my independent variables' dates to those of each of these series separately.
Thank you for your help.
I believe you want to use df.dropna()
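A minimal sketch of how that applies here, assuming Y0 is the DataFrame read above, i is the positional index of the series you want, and X is a hypothetical DataFrame of independent variables indexed by the same dates:
# dropna() removes the void dates and leaves a clean pd.Series
y = Y0.iloc[:, i].dropna()

# align the (hypothetical) regressors to the surviving dates before regressing
X_i = X.loc[y.index]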
I am not sure if I understood your problem, but if you want to check for NULLs in a specific column and drop those rows, you can try this:
import pandas as pd
df = df[pd.notnull(df['column_name'])]
For deleting NaNs, df.dropna() should work, as suggested in the previous answer. If that is not working, you can try replacing the NaNs with a placeholder text and then deleting the rows that contain that placeholder:
import numpy as np

df['column_name'] = df['column_name'].replace(np.nan, 'delete-it', regex=True)
df = df[df['column_name'] != 'delete-it']
Hope this helps!
