I would like to set, for each row, the value of ART_IN_TICKET to the number of rows that have the same TICKET_ID as that row.
For example, for the first 5 rows of this dataframe, TICKET_ID is 35592159 and ART_IN_TICKET should be 5, since there are 5 rows with that same TICKET_ID.
A relatively simple solution is to get the count of rows for each TICKET_ID and then merge it back into this dataframe to get the final result in ART_IN_TICKET (there are other possible solutions as well). Assuming the above dataframe is in df:
count_df = df[['TICKET_ID', 'ART_IN_TICKET']].groupby("TICKET_ID").count().reset_index()  # count of rows per TICKET_ID
df = df.drop(columns=["ART_IN_TICKET"])  # remove ART_IN_TICKET before merging, so the counted column can take its place
final_df = df.merge(count_df, on="TICKET_ID")
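As a merge-free alternative sketch (assuming the same column names as above), groupby combined with transform('count') broadcasts the per-ticket row count back onto every row directly:
df["ART_IN_TICKET"] = df.groupby("TICKET_ID")["TICKET_ID"].transform("count")  # per-row count of its TICKET_ID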
I have a df where I want to check for duplicate rows in only two of the columns, and if those columns are the same as in the previous row, I'd like to isolate/print them. So for example, if rows 12 - 89 have the same value in column 2 and column 3 as the previous row(s), then I want to know this range of rows.
See image 1 for an example of the df, where 'pm10_ugm3' and 'pm25_ugm3' are duplicated but the other columns are not:
Many thanks
Try the DataFrame's duplicated method. It returns a boolean Series that you can use to slice/select those rows. Some variation of this will get you close:
dup_rows = df.duplicated(subset=['col1', 'col2'], keep='first')  # True for every repeat of an earlier row in these two columns
print(df[dup_rows])
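If the goal is specifically rows that repeat the previous row (rather than any earlier row), a shift-based comparison is a possible sketch, assuming the two column names mentioned in the question:
cols = ['pm10_ugm3', 'pm25_ugm3']
consecutive_dups = df[cols].eq(df[cols].shift()).all(axis=1)  # True where both columns match the row above
print(df[consecutive_dups])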
Is there any way to drop duplicate columns, but replace their values depending on a condition? For example,
in the table below, I would like to remove the duplicate/second A and B columns, but I want to replace the values of the primary A and B (1st and 2nd columns) that are 0 where the duplicate columns have 1.
Ex - In the 3rd row, where A and B have value 0, they should be replaced with 1 from their respective duplicate columns.
Input Data:
Output Data:
This is an example of the problem I'm working on; my real data has around 200 columns, so I'm hoping to find an optimal solution without hardcoding column names for removal.
Use DataFrame.any per duplicated column name if the columns contain only 1/0 values:
df = df.any(axis=1, level=0).astype(int)
Or, if you need the maximal value per duplicated column name:
df = df.max(axis=1, level=0)
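Note that the level argument of DataFrame.any / DataFrame.max has been removed in recent pandas versions. On those, a rough equivalent sketch is to transpose, group by the duplicated column names, and transpose back:
df = df.T.groupby(level=0).max().T  # max per duplicated column name on newer pandas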
I have a dataframe with two columns, Stock and DueDate, where I need to select the first row from repeated consecutive entries based on the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first mark which rows repeat based on the Stock column by creating a new column repeated_yes, and then subset only the first row of any repeating run.
I have used the lines of code below to create the new column "repeated_yes":
ss = df.Stock.ne(df.Stock.shift())  # True at the start of each consecutive run of Stock values
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1  # 1-based position of the row within its run
so the updated dataframe looks like this:
df_new
But I am stuck on subsetting only rows 3 and 8 in order to attain the result. If there is any other effective approach, it would be helpful.
Edited:
Forgot to include the actual full question:
If there are any other rows below the last row in the dataframe df, it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())  # True at the start of each consecutive run
ss1 = ss.cumsum().duplicated(keep=False)  # True for runs that contain more than one row
df = df[ss & ss1]  # keep only the first row of each such run
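A minimal self-contained run with made-up data (only the Stock column, as in the question) shows what the combined mask keeps:
import pandas as pd

df = pd.DataFrame({"Stock": ["A", "A", "B", "B", "B", "C"]})
ss = df.Stock.ne(df.Stock.shift())
print(df[ss & ss.cumsum().duplicated(keep=False)])  # rows 0 and 2: the first row of each run longer than one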
I have a Pandas DataFrame called df (378000, 82) and I would like to replace the entire row with NaN based on a specific condition. The condition is for any value in the column df.halon_gas that is >20, I want to replace that entire row with NaN. This is the way I want to filter my data so I don't lose the index values.
Thanks!
First of all, get the indexes of all rows where halon_gas is greater than 20:
import numpy as np

idx = df[df.halon_gas > 20].index
Then set the values of all columns in those rows to NaN (set_value has been removed from recent pandas, so use .loc instead):
df.loc[idx, :] = np.nan
This writes NaN into the rows where halon_gas exceeds 20 while keeping the index values intact.
If you're fine with the rows being gone then I suggest you do this:
df.reset_index(level=0, inplace=True)
df = df[df.halon_gas <= 20]
df.set_index("index", inplace=True)
What's happening here is the following:
The index gets reset, so you have an extra column with the index values from before the removal.
Only the rows where df.halon_gas <= 20 are kept.
The old index values are set to be the index for the DataFrame again.
I have a pandas dataframe of about 70000 rows, and 4500 of them are duplicates of an original. The columns are a mix of string columns and number columns. The column I'm interested in is the value column. I'd like to look through the entire dataframe to find rows that are completely identical, count the number of duplicated rows per row (inclusive of the original), and multiply the value in that row by the number of duplicates.
I'm not really sure how to go about this from the start, but I've tried using df[df.duplicated(keep = False)] to obtain a dataframe df1 of duplicated rows (inclusive of original rows). I appended a column of Trues to the end of df1. I tried to use .groupby with a combination of columns to sum up the number of Trues, but the result was unable to capture the true number of duplicates (I obtained about 3600 unique duplicated rows in this case).
Here's my actual code:
duplicate_bool = df.duplicated(keep = False)
df['duplicate_bool'] = duplicate_bool
df1= df[duplicate_bool]
f = {'duplicate_bool':'sum'}
df2= df1.groupby(['Date', 'Exporter', 'Buyer', \
'Commodity Description', 'Partner Code', \
'Quantity', 'Price per MT'], as_index = False).agg(f)
My idea here was to obtain a separate dataframe df2 with no duplicates, and I could multiply the entry in the value column by the number stored in the summed duplicate_bool column. Then I'd simply append df2 to my original dataframe after removing all the duplicates identified by .duplicated.
However, if I use groupby with all columns I get an empty dataframe. If I don't use all the columns, I don't get the true number of duplicates and I won't be able to append it in any way.
I think I'd like a better way to do this, since I'm confusing myself.
I think this question comes down to nothing more than figuring out how to get a count of the occurrences of each unique row. If a row occurs only once, this number is one; if it occurs more often, it will be > 1. You can then use this count to multiply, filter, etc.
This nice one-liner (taken from How to count duplicate rows in pandas dataframe?) creates an extra column with the number of occurrences of each row:
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'dup_count'})
To then calculate the true value of each row:
df['total_value'] = df['value'] * df['dup_count']
And to filter, we can use the dup_count column to keep only the rows that were never duplicated:
dff = df[df['dup_count'] == 1]
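As a rough end-to-end sketch with made-up column names (the real frame would group by all of its actual columns, as above):
import pandas as pd

# Toy data: the first two rows are identical, the third is unique
df = pd.DataFrame({"Buyer": ["x", "x", "y"], "value": [10, 10, 7]})
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: "dup_count"})
df["total_value"] = df["value"] * df["dup_count"]
print(df)  # "x" collapses to one row with dup_count 2 and total_value 20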