I have two data frames: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is less than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on which is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row; I was thinking of wrapping it in a for loop), but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the numpy array with shape (720, 1). Then dfB.le(dfA.values) checks each column of dfB against that single column from dfA (the question's condition is B <= A, hence le); this returns a boolean dataframe of the same shape as dfB. Finally, .any() checks each column of that boolean dataframe for any True.
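For instance, a minimal sketch on toy frames (much smaller shapes than the question's 720-row frames, with made-up column names):

import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames
dfA = pd.DataFrame({'ref': [5, 5, 5]})
dfB = pd.DataFrame({'x': [6, 7, 8],   # never <= dfA -> dropped
                    'y': [4, 9, 9],   # one match -> kept
                    'z': [1, 2, 3]})  # all match -> kept

# Keep only the columns of dfB with at least one value <= dfA, row by row
result = dfB.loc[:, dfB.le(dfA.values).any()]
print(result)  # columns 'y' and 'z' survive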
How about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(axis=1, how='all')
You can replace the np.nan in the np.where condition with whatever values you wish, including keeping the original values of dataframe B. Note the dropna(axis=1, how='all'), so that all-NaN columns (not rows) are dropped.
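For instance, a sketch (same toy frames as above) that keeps B's original values where the condition holds and then drops the all-NaN columns:

import numpy as np
import pandas as pd

dfA = pd.DataFrame({'ref': [5, 5, 5]})
dfB = pd.DataFrame({'x': [6, 7, 8], 'y': [4, 9, 9], 'z': [1, 2, 3]})

# Keep B's value where B <= A, else NaN; then drop columns that are all NaN
marked = pd.DataFrame(
    np.where(dfB.to_numpy() <= dfA.to_numpy(), dfB.to_numpy(), np.nan),
    columns=dfB.columns, index=dfB.index,
)
result = marked.dropna(axis=1, how='all')  # column 'x' is dropped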
Related
Assume we have one dataframe: df1, with shape (5000, 6).
Let's assume it has the following structure (I will only write the first three columns, since the others should just be copied):
A B C
'A-0000-ALEX,A-00030-PAUL' 1 '1112-PPAI 12.00: First Name\n4554-ALGF 09:00 Groceries\n'
and I want to create a dataframe where, every time a row has multiple values in a cell, extra rows are created that separate them, as follows:
A B C
'A-0000-ALEX' 1 '1112-PPAI 12.00: First Name'
'A-00030-PAUL' 1 '4554-ALGF 09:00 Groceries'
Thus, new rows should be created for column A; for column B, the value that already existed is just copied. For column C the values should also be split across the new rows, and the rest of the columns (D, E, F) should just be copied into the new rows that are created.
Edit: I just realized that it is not a 1-to-1 match between the values of A and C. Some rows have only one value in column A, and other rows have the same number of values in column A as they have in column C.
One solution here is to turn each of the relevant columns into lists of the wanted values instead of a single string, by splitting the string:
df['A'] = df['A'].apply(lambda x: x.split(","))
df['C'] = df['C'].apply(lambda x: x.split('\n')[:-1])
Note that [:-1] is used here because the string in column C ends with a '\n' as well, so split would otherwise leave a trailing empty string in the list. Then you can use DataFrame.explode to expand the lists into rows:
df = df.explode(['A','C'])
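Putting it together on the single example row (a sketch; exploding multiple columns at once requires pandas >= 1.3):

import pandas as pd

df = pd.DataFrame({
    'A': ['A-0000-ALEX,A-00030-PAUL'],
    'B': [1],
    'C': ['1112-PPAI 12.00: First Name\n4554-ALGF 09:00 Groceries\n'],
})

df['A'] = df['A'].apply(lambda x: x.split(','))
df['C'] = df['C'].apply(lambda x: x.split('\n')[:-1])

# Explode both list columns in lockstep; B (and D, E, F) repeat on each new row
df = df.explode(['A', 'C']).reset_index(drop=True)
print(df)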
I have a dataframe whose columns are mostly blanks (null/NaN set to 0) with sporadic number values.
I am trying to compare the last two non-zero values in a dataframe column.
Something like :
df['Column_c'] = df['column_a'].last_non_zero_value > df['column_a'].second_to_last_non_zero_value
This is what the columns look like in Excel.
You could drop all the rows with missing data using DataFrame.dropna(), then pull the remaining values out as an array, from which the last two elements are easy to compare.
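A minimal sketch along those lines, assuming zeros mark the blanks and using a made-up column name:

import pandas as pd

df = pd.DataFrame({'column_a': [0, 3, 0, 0, 7, 0, 5, 0]})

# Keep only the non-zero entries, then pick off the last two
nonzero = df.loc[df['column_a'] != 0, 'column_a'].to_numpy()
if len(nonzero) >= 2:
    print(nonzero[-1] > nonzero[-2])  # 5 > 7 -> False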
Is there any way to drop duplicate columns, but replace the values of the surviving columns depending upon conditions? In the table below, I would like to remove the duplicate/second A and B columns, but I want to replace the values of the primary A and B (1st and 2nd columns) wherever the value is 0 but is 1 in the duplicate columns.
Example: in the 3rd row, where A and B have the value 0, they should be replaced with the 1 from their respective duplicate columns.
Input Data:
Output Data:
This is an example of a problem I'm working on; my real data has around 200 columns, so I'm hoping to find an optimal solution without hardcoding column names for removal.
Use DataFrame.any per duplicated column name if the columns contain only 1/0 values:
df = df.any(axis=1, level=0).astype(int)
Or, if you need the maximal value per duplicated column name:
df = df.max(axis=1, level=0)
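A sketch on toy data with duplicated column names (the layout is assumed from the description; note that the level argument of any/max was removed in pandas 2.0, so on recent versions the grouped form below is needed instead):

import pandas as pd

df = pd.DataFrame([[1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 0, 1, 1]],
                  columns=['A', 'B', 'A', 'B'])

# Older pandas (< 2.0):
# merged = df.any(axis=1, level=0).astype(int)

# Recent pandas: group the transposed frame by column name
merged = df.T.groupby(level=0).max().T
print(merged)  # one A column and one B column, values merged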
I have a dataframe of two columns, Stock and DueDate, where I need to select the first row from repeated consecutive entries based on the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first mark which rows repeat based on the Stock column by creating a new column, repeated_yes, and then subset the first row only where rows repeat more than twice.
I used the line of code below to create the new column repeated_yes:
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new updated dataframe looks like this,
df_new
But I am stuck on subsetting only rows 3 and 8 in order to attain the result. If there is any other effective approach, that would be helpful.
Edited:
Forgot to include the actual full question:
If there are any other rows below the last row in the dataframe df, it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
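A sketch on made-up Stock values (DueDate omitted for brevity):

import pandas as pd

df = pd.DataFrame({'Stock': ['AAA', 'AAA', 'AAA', 'BBB', 'CCC', 'CCC']})

ss = df.Stock.ne(df.Stock.shift())        # True at the start of each run
ss1 = ss.cumsum().duplicated(keep=False)  # True for runs longer than one row
print(df[ss & ss1])                       # first rows of the 'AAA' and 'CCC' runs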
I have another basic question. I have a dataframe like so:
cols = a, b, c, d, e, which contain integers.
I want column e's value to equal 1 if columns b and c, or columns a, b and c, equal 1.
Although column d does not matter in this computation, it matters somewhere else, so I cannot drop it.
How would I do that in pandas?
Use .loc:
df.loc[(df['b'] == 1) & (df['c'] == 1), 'e'] = 1
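Since (b and c) or (a and b and c) simplifies to just (b and c), and chained == comparisons raise an error on a Series, the mask is built from explicit & terms. A quick sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [1, 1, 0], 'c': [1, 1, 1],
                   'd': [9, 9, 9], 'e': [0, 0, 0]})

# e = 1 wherever both b and c are 1
df.loc[(df['b'] == 1) & (df['c'] == 1), 'e'] = 1
print(df)  # e is 1 in the first two rows only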