I have a dataframe with two columns, Stock and DueDate, where I need to select the first row from each run of repeated consecutive entries based on the Stock column.
df:
I am expecting output like the below.
Expected output:
My Approach
The approach I tried is to first mark which rows repeat consecutively in the Stock column by creating a new column, repeated_yes, and then subset the first row only where rows repeat more than once.
I have used the lines of code below to create the new column "repeated_yes":
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
So the updated dataframe looks like this:
df_new
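For illustration, here's what that code produces on a made-up Stock column (the original df isn't shown):

import pandas as pd

# Hypothetical data: three runs of Stock values (A x3, B x1, C x4)
df = pd.DataFrame({'Stock': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C'],
                   'DueDate': pd.date_range('2021-01-01', periods=8)})

ss = df.Stock.ne(df.Stock.shift())               # True at the first row of each consecutive run
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
print(df['repeated_yes'].tolist())               # [1, 2, 3, 1, 1, 2, 3, 4]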
But I am stuck on subsetting only rows 3 and 8 in order to attain the result. If there is any other effective approach, it would be helpful.
Edit:
I forgot to include the full question:
If there are any other rows below the last row in the dataframe df, it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
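A runnable sketch on made-up data (the real df isn't shown in the question), to show what the mask keeps:

import pandas as pd

df = pd.DataFrame({'Stock': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C'],
                   'DueDate': pd.date_range('2021-01-01', periods=8)})

ss = df.Stock.ne(df.Stock.shift())            # True at the first row of each consecutive run
ss1 = ss.cumsum().duplicated(keep=False)      # True for runs spanning more than one row
print(df[ss & ss1])
#   Stock    DueDate
# 0     A 2021-01-01
# 4     C 2021-01-05

The single-row run B is dropped entirely; only the first row of each multi-row run survives.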
Related
I have many rows of data and one of the columns is a flag. I have 3 identifiers that need to match between rows.
What I have:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag
What I need:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag, previous_flag
I need to find the flag from the row where partnumber matches, and where previousdatetime1 (current row*) == datetime1 (other row), and previousdatetime2 (current row) == datetime2 (other row).
*To note, the rows are not necessarily in order, so the previous row may come later in the dataframe.
I'm not quite sure where to start. I got this logic working in PBI using a LookUpValue and basically finding where partnumber = Value(partnumber), datetime1 = Value(datetime1), datetime2 = Value(datetime2). Thanks for the help!
Okay, so assuming you've read this in as a pandas dataframe df1:
(1) Make a copy of the dataframe:
df2=df1.copy()
(2) For sanity, drop some columns in df2
df2.drop(['previousdatetime1','previousdatetime2'],axis=1,inplace=True)
Now you have a df2 that has columns:
['partnumber','datetime1','datetime2','flag']
(3) Merge the two dataframes
newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','partnumber_previous','datetime1_previous','datetime2_previous','flag_previous']
(4) Drop the unnecessary columns
newdf.drop(['partnumber_previous', 'datetime1_previous', 'datetime2_previous'],axis=1,inplace=True)
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','flag_previous']
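Putting the steps together, here is a minimal runnable sketch on made-up data (the part numbers, datetimes, and flags are assumptions, not from the question):

import pandas as pd

# Hypothetical rows: the second row's previous* columns point back at the first row
df1 = pd.DataFrame({
    'partnumber':        ['P1', 'P1'],
    'datetime1':         ['2021-01-02 08:00', '2021-01-03 08:00'],
    'previousdatetime1': ['2021-01-01 08:00', '2021-01-02 08:00'],
    'datetime2':         ['2021-01-02 09:00', '2021-01-03 09:00'],
    'previousdatetime2': ['2021-01-01 09:00', '2021-01-02 09:00'],
    'flag':              [0, 1],
})

df2 = df1.drop(['previousdatetime1', 'previousdatetime2'], axis=1)

newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))

# errors='ignore' covers pandas versions that coalesce the identically
# named 'partnumber' join key into a single column
newdf = newdf.drop(['partnumber_previous', 'datetime1_previous', 'datetime2_previous'],
                   axis=1, errors='ignore')
print(newdf)  # flag_previous is NaN for the first row, 0 for the second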
I have a dataframe 'raw' that looks like this -
It has many rows with duplicate values in each column.
I want to make a new dataframe 'new_df' which has unique customer_code values and their corresponding market_code.
The new_df should look like this -
It sounds like you simply want to create a DataFrame with unique customer_code which also shows market_code. Here's a way to do it:
df = df[['customer_code','market_code']].drop_duplicates('customer_code')
Output:
customer_code market_code
0 Cus001 Mark001
1 Cus003 Mark003
3 Cus004 Mark003
4 Cus005 Mark004
The part reading df[['customer_code','market_code']] gives us a DataFrame containing only the two columns of interest, and the drop_duplicates('customer_code') part keeps only the first occurrence of each duplicated value in the customer_code column (you could instead keep the last occurrence of each duplicate by passing the keep='last' argument).
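For instance, a sketch on data shaped like the example above, keeping the last occurrence instead:

import pandas as pd

# Made-up data reconstructed from the output above: Cus003 appears twice
df = pd.DataFrame({'customer_code': ['Cus001', 'Cus003', 'Cus003', 'Cus004', 'Cus005'],
                   'market_code':   ['Mark001', 'Mark003', 'Mark003', 'Mark003', 'Mark004']})

new_df = df[['customer_code', 'market_code']].drop_duplicates('customer_code', keep='last')
print(new_df)  # row index 2 (the later Cus003 row) is kept instead of row index 1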
I have a dataframe that looks like this, where there is a new row per ID if one of the following columns has a value. I'm trying to combine on the ID and consolidate all of the remaining columns. I've tried every groupby/agg combination and can't get the right output. There are no conflicting column values: for instance, if ID "1" has an email value in row 0, the remaining rows for that ID will be empty in that column. So I just need it to sum/consolidate, not concatenate.
My current dataframe:
The output I'm looking to achieve:
# fill Nones in string columns with empty string
df[['email', 'status']] = df[['email', 'status']].fillna('')
# max() per group picks the single non-empty value in each column
df = df.groupby('id').agg('max')
If you still want the index as shown in the desired output:
df = df.reset_index(drop=False)
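A self-contained sketch on made-up data (the original frame isn't shown), assuming at most one non-empty value per column within each id, as the question states:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'email': ['a@x.com', None, None, 'b@y.com'],
                   'status': [None, 'active', 'new', None]})

df[['email', 'status']] = df[['email', 'status']].fillna('')
df = df.groupby('id').agg('max')   # '' sorts below any non-empty string
df = df.reset_index(drop=False)
print(df)
#    id    email  status
# 0   1  a@x.com  active
# 1   2  b@y.com     new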
I have two dataframes: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on which is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the numpy array with shape (720, 1). Then dfB.le(dfA.values) checks each column of dfB against that single column from dfA; this returns a boolean dataframe of the same shape as dfB. Finally, .any() checks along each column of that boolean dataframe for any True, so columns with no True are dropped.
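A small sketch (3 rows instead of 720, with made-up values):

import pandas as pd

dfA = pd.DataFrame({'a': [5, 5, 5]})
dfB = pd.DataFrame({'x': [6, 7, 8],    # no cell <= dfA -> dropped
                    'y': [9, 4, 9],    # one cell <= dfA -> kept
                    'z': [1, 1, 1]})   # every cell <= dfA -> kept

print(dfB.loc[:, dfB.le(dfA.values).any()].columns.tolist())  # ['y', 'z']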
how about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(axis=1, how='all')
You can replace the np.nan in the np.where call with whatever value you wish, including keeping the original values of dataframe B.
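On the same made-up data as the sketch above, this variant keeps columns y and z and drops x:

import numpy as np
import pandas as pd

A = pd.DataFrame({'a': [5, 5, 5]})
B = pd.DataFrame({'x': [6, 7, 8], 'y': [9, 4, 9], 'z': [1, 1, 1]})

masked = pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan),
                      columns=B.columns, index=A.index)
print(masked.dropna(axis=1, how='all'))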
I have a pandas dataframe of about 70000 rows, and 4500 of them are duplicates of an original. The columns are a mix of string columns and number columns. The column I'm interested in is the value column. I'd like to look through the entire dataframe to find rows that are completely identical, count the number of duplicated rows per row (inclusive of the original), and multiply the value in that row by the number of duplicates.
I'm not really sure how to go about this from the start, but I've tried using df[df.duplicated(keep = False)] to obtain a dataframe df1 of duplicated rows (inclusive of the original rows). I appended a column of Trues to the end of df1. I tried to use .groupby with a combination of columns to sum up the number of Trues, but the result was unable to capture the true number of duplicates (I obtained about 3600 unique duplicated rows in this case).
Here's my actual code:
duplicate_bool = df.duplicated(keep = False)
df['duplicate_bool'] = duplicate_bool
df1= df[duplicate_bool]
f = {'duplicate_bool':'sum'}
df2= df1.groupby(['Date', 'Exporter', 'Buyer', \
'Commodity Description', 'Partner Code', \
'Quantity', 'Price per MT'], as_index = False).agg(f)
My idea here was to obtain a separate dataframe df2 with no duplicates, and I could multiply the entry in the value column by the number stored in the summed duplicate_bool column. Then I'd simply append df2 to my original dataframe after removing all the duplicates identified by .duplicated.
However, if I use groupby with all columns I get an empty dataframe. If I don't use all the columns, I don't get the true number of duplicates and I won't be able to append it in any way.
I think I'd like a better way to do this since I'm confusing myself.
I think this question is nothing more than figuring out how to get a count of the occurrences of each unique row. If a row occurs only once, this count is one; if it occurs more often, it will be greater than one. You can then use this count to multiply, filter, etc.
This nice one-liner (taken from How to count duplicate rows in pandas dataframe?) creates an extra column with the number of occurrences of each row:
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'dup_count'})
To then calculate the true value of each row:
df['total_value'] = df['value'] * df['dup_count']
And to filter, we can use the dup_count column to remove all duplicated rows:
dff = df[df['dup_count'] == 1]
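Putting it together, a minimal sketch on made-up data (the column names other than value are assumptions; the last two rows are exact duplicates):

import pandas as pd

df = pd.DataFrame({'Exporter': ['E1', 'E2', 'E2'],
                   'Quantity': [10, 20, 20],
                   'value':    [100, 50, 50]})

df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'dup_count'})
df['total_value'] = df['value'] * df['dup_count']
print(df)
#   Exporter  Quantity  value  dup_count  total_value
# 0       E1        10    100          1          100
# 1       E2        20     50          2          100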