Check each value in one column against another column's values - python

Is there any way in Excel or in DAX that I can check whether all the values of a single column exist in another column?
Example - I have a column called Column 1 with some values, like 4, 5, 2, 1. Now I want to check how many of those values exist in Column 2.
As an output, I expect the cell to go green if the value exists, else red.
I have looked in a lot of places, but the only useful results I have found are for looking up a single value, not all the values in a single column.
Does anyone know a way of doing this?

Since you mention Python, this is possible programmatically with the Pandas library:
import pandas as pd
import numpy as np

# define dataframe, or read in via df = pd.read_excel('file.xlsx')
df = pd.DataFrame({'col1': [4, 5, 2, 1] + [np.nan] * 4,
                   'col2': [6, 8, 3, 4, 1, 6, 3, 4]})

# define highlighting logic
def highlight_cols(x):
    res = []
    for i in x:
        if np.isnan(i):
            res.append('')
        elif i in set(df['col2']):
            res.append('background: green')
        else:
            res.append('background: red')
    return res

# apply highlighting logic to first column only
res = df.style.apply(highlight_cols, subset=pd.IndexSlice[:, ['col1']])
Result:
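If you only need the underlying membership test (for example, to count how many col1 values appear in col2) rather than the colouring itself, isin gives you the boolean mask directly; a small sketch using the df defined above:

# True where a col1 value also appears somewhere in col2
mask = df['col1'].isin(df['col2'])
print(mask.sum())  # number of col1 values found in col2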

Create an (optionally hidden) column adjacent to your search column (in my example, column C next to column B):
=IF(ISERROR(VLOOKUP(B1,$A$1:$A$4, 1, 0)), FALSE, TRUE)
This determines whether the value is contained in the first data list (it returns TRUE if it is).
Then just use simple conditional formatting.
This provides the result as expected:

You can do this easily without adding hidden columns, as below. This will update any time you change the numbers in column A.
1. Select column B.
2. Conditional Formatting -> New Rule -> Use a formula to determine which cells to format.
3. Insert the formula =OR(B2=$A$2,B2=$A$3,B2=$A$4,B2=$A$5) = TRUE and format the cells as you wish (here in green).
4. Repeat steps 1 and 2.
5. Insert the formula =OR(B2=$A$2,B2=$A$3,B2=$A$4,B2=$A$5) = FALSE and format the cells as you wish (here in red).
6. Select the column-name cell (to remove the formatting from the column heading).
7. Conditional Formatting -> Clear Rules -> Clear Rules from Selected Cells.

Related

Figuring out if an entire column in a Pandas dataframe contains the same value or not

I have a pandas dataframe that works just fine. I am trying to figure out how to tell if a column, with a label that I know is correct, does not contain all the same values.
The code below errors out for some reason when I try to check whether the column contains -1 in each cell:
# column = "TheColumnLabelThatIsCorrect"
# df = "my correct dataframe"
# I get an () takes 1 or 2 arguments but 3 is passed in error
if (not df.loc(column, estimate.eq(-1).all())):
I just learned about .eq() and .all() and hopefully I am using them correctly.
It's a syntax issue - see docs for .loc/indexing. Specifically, you want to be using [] instead of ()
You can do something like
if not df[column].eq(-1).all():
...
If you want to use .loc specifically, you'd do something similar:
if not df.loc[:, column].eq(-1).all():
...
Also, note you don't need to use .eq(), you can just do (df[column] == -1).all()) if you prefer.
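For a quick end-to-end check, here is a small self-contained example (the column name 'estimate' is just borrowed from the question's snippet):

import pandas as pd

df = pd.DataFrame({'estimate': [-1, -1, -1]})

if not df['estimate'].eq(-1).all():
    print('the column contains a value other than -1')
else:
    print('every cell is -1')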
You could drop duplicates, and if you get only one record back it means all the records are the same.
import pandas as pd
df = pd.DataFrame({'col': [1, 1, 1, 1]})
len(df['col'].drop_duplicates()) == 1
> True
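A closely related shortcut (not mentioned above, but standard pandas) is nunique, which counts the distinct values directly:

df['col'].nunique() == 1  # True when every value in the column is the same (NaNs are ignored by default)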
The question is not entirely clear, but let's try the following:
Check whether every cell contains -1:
df['estimate'].eq(-1).all()
Check whether any cell contains -1:
df['estimate'].eq(-1).any()
Select the rows where estimate equals -1 (keeping all columns):
df.loc[df['estimate'].eq(-1), :]
df['column'].value_counts() gives you all the unique values in a column along with their counts. As for checking whether all the values are a specific number, you can do that by collapsing the column to its unique values and checking that the length is 1:
len(set(df['column'])) == 1

How can I choose rows and columns if the index/header contains a certain integer in a Pandas dataframe?

I have input/output data where the index and header contain numbers that represent different types of industries. I want to create new columns and rows that sum the columns and rows belonging to a certain industry group. To give an example (please refer to the example I made manually below), I would want to create a new row/column with the index/header US_industry_135/CAN_industry_135, which would sum the rows/columns with industry number 1, 3, or 5.
The example below is a small set that I created manually, but I would like to know if there is a way to put the condition in the summation so that I sum rows/columns whose index/header contains numbers belonging to a specific set. I could extract the numbers from the header/index and make a separate row/column, but I was wondering if there is a way to check the index/headers directly without creating new columns. Thank you in advance for your help!
import pandas as pd
data = {'US1': [3, 2, 1, 4, 3, 2, 1, 4, 2, 3, 7, 9], 'US2': [8, 4, 9, 2, 1, 3, 4, 2, 5, 6, 18, 11],
        'US3': [2, 4, 2, 2, 3, 2, 4, 2, 3, 2, 7, 6], 'US4': [7, 4, 8, 2, 2, 3, 2, 4, 6, 8, 17, 15],
        'US5': [2, 4, 3, 2, 2, 4, 1, 3, 2, 4, 7, 11], 'CAN1': [3, 2, 1, 4, 6, 2, 3, 1, 4, 2, 10, 5],
        'CAN2': [8, 4, 9, 2, 5, 7, 3, 5, 7, 1, 22, 13], 'CAN3': [2, 4, 2, 2, 4, 5, 2, 3, 3, 2, 8, 10],
        'CAN4': [7, 4, 8, 2, 2, 3, 1, 3, 2, 4, 17, 10], 'CAN5': [2, 4, 3, 2, 6, 7, 5, 4, 0, 9, 11, 20],
        'US_IND_135': [7, 10, 6, 8, 8, 8, 6, 9, 7, 9, 21, 26], 'CAN_IND_135': [7, 10, 6, 8, 16, 14, 10, 8, 7, 13, 29, 35]}
df = pd.DataFrame(data, index=['US1', 'US2', 'US3', 'US4', 'US5', 'CAN1', 'CAN2', 'CAN3', 'CAN4', 'CAN5',
                               'US_IND_135', 'CAN_IND_135'])
df
Let's define the list of indices of interest:
idx = [1, 3, 5]
Do the summation over the specified columns:
df[['US' + str(i) for i in idx]].sum(axis = 1)
Alternatively, if you want to join the summation column to the dataframe, assign the result to a variable:
s1 = df[['US' + str(i) for i in idx]].sum(axis = 1)
s1.name = 'NEW_US_IND_' + ''.join("{0}".format(i) for i in idx)
And add new column:
df.join(s1)
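If you would rather not build the label list by hand, you can also match the headers (or the index labels) directly with filter and a regular expression; a sketch based on the 'US'/'CAN' naming in the example above (the pattern itself is an assumption):

# sum the US columns whose trailing digit is 1, 3 or 5
us_cols_135 = df.filter(regex=r'^US[135]$', axis=1).sum(axis=1)

# the same idea along the rows, matching on the index labels instead
us_rows_135 = df.filter(regex=r'^US[135]$', axis=0).sum(axis=0)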

If dataframe.tail(1) is X, do something

I am trying to check if the last cell in a pandas data-frame column contains a 1 or a 2 (these are the only options). If it is a 1, I would like to delete the whole row, if it is a 2 however I would like to keep it.
import pandas as pd
df1 = pd.DataFrame({'number':[1,2,1,2,1], 'name': ['bill','mary','john','sarah','tom']})
df2 = pd.DataFrame({'number':[1,2,1,2,1,2], 'name': ['bill','mary','john','sarah','tom','sam']})
In the above example I would want to delete the last row of df1 (so the final row is 'sarah'), however in df2 I would want to keep it exactly as it is.
So far, I have tried the following, but I am getting an error:
if df1['number'].tail(1) == 1:
    df = df.drop(-1)
DataFrame.drop removes rows based on labels (the actual values of the indices). While it is possible to do this with df1.drop(df1.index[-1]), that is problematic with a duplicated index. The last row can instead be selected with iloc, or a single value with .iat:
if df1['number'].iat[-1] == 1:
    df1 = df1.iloc[:-1, :]
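Wrapped in a small helper and run on the two frames from the question, this drops 'tom' from df1 (whose last number is 1) and leaves df2 untouched (the helper name is just for illustration):

import pandas as pd

def drop_last_if_one(df):
    # remove the final row only when its 'number' value is 1
    if df['number'].iat[-1] == 1:
        return df.iloc[:-1, :]
    return df

df1 = pd.DataFrame({'number': [1, 2, 1, 2, 1],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom']})
df2 = pd.DataFrame({'number': [1, 2, 1, 2, 1, 2],
                    'name': ['bill', 'mary', 'john', 'sarah', 'tom', 'sam']})

print(drop_last_if_one(df1))  # last row ('tom') removed, so it ends with 'sarah'
print(drop_last_if_one(df2))  # unchanged, 'sam' is kept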
You can check if the value of number in the last row is equal to one:
check = df1['number'].tail(1).values == 1
# Or check entire row with
# check = 1 in df1.tail(1).values
If that condition holds, you can select all rows, except the last one and assign back to df1:
if check:
    df1 = df1.iloc[:-1, :]
if df1.tail(1)['number'].item() == 1:
    df1.drop(len(df1) - 1, inplace=True)  # assumes the default RangeIndex
You can use the same tail function
df.drop(df.tail(n).index,inplace=True) # drop last n rows
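Combined with the condition from the question, that tail-based pattern could look like this (a sketch; it assumes the 'number' column from the question):

# drop the last row only when its 'number' value is 1
if df1['number'].tail(1).eq(1).all():
    df1.drop(df1.tail(1).index, inplace=True)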

Multiplying columns by another column in a dataframe

(Full disclosure that this is related to another question I asked, so bear with me if I should have appended it to what I wrote previously, even though the problem is different.)
I have a dataframe consisting of a column of weights and columns containing binary values of 0 and 1. I'd like to multiply every column within the dataframe by the weights column. However, I seem to be replacing every column within the dataframe with the weight column. I'm sure I'm missing something incredibly stupid/basic here--I'm rather new to pandas and python as a whole. What am I doing wrong?
celebfile = pd.read_csv(celebcsv)
celebframe = pd.DataFrame(celebfile)
behaviorfile = pd.read_csv(behaviorcsv)
behaviorframe = pd.DataFrame(behaviorfile)
celebbehavior = pd.merge(celebframe, behaviorframe, how='inner', on='RespID')
celebbehavior2 = celebbehavior.copy()

def multiplycolumns(column):
    for column in celebbehavior:
        return celebbehavior[column] * celebbehavior['WEIGHT']

celebbehavior2 = celebbehavior2.apply(lambda column: multiplycolumns(column), axis=0)
print(celebbehavior2.head())
You have a return statement inside the for loop, which means the loop executes only once. To multiply a dataframe by a column, you can use the mul method with the correct axis parameter:
celebbehavior.mul(celebbehavior['WEIGHT'], axis=0)
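To see what mul with axis=0 does, here is a tiny self-contained example (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1], 'WEIGHT': [0.5, 0.25, 2.0]})

# every column is multiplied element-wise by the WEIGHT column
# (note that WEIGHT itself also gets multiplied by WEIGHT here)
print(df.mul(df['WEIGHT'], axis=0))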
read_csv
read_csv already returns a pd.DataFrame, so it is not necessary to wrap it in pd.DataFrame again.
mul with axis=0
You can use apply but that is awkward. Use mul(axis=0)... This should be all you need.
df = pd.read_csv(celebcsv).merge(pd.read_csv(behaviorcsv), on='RespID')
df = df.mul(df.WEIGHT, 0)
?
You said it looks like you are just replacing every column with the weights column? Are your other columns all ones?
You can use the mul method to multiply the columns. However, just FYI, if you do want to use apply, bear in mind the following:
The apply function passes each Series in the dataframe to the function; this looping is inherent to apply. So the first thing to say is that the loop inside your function is redundant. You also have a return statement inside it, which is causing the behaviour you do not want.
Since each column is passed as the argument automatically, all you need to do is tell the function what to multiply it by: in this case, your weights Series.
Here is an implementation using apply. Of course, the undesirable part here is that the weights are also multiplied by themselves:
import pandas as pd

df = pd.DataFrame({'1': [1, 1, 0, 1],
                   '2': [0, 0, 1, 0],
                   'weights': [0.5, 0.25, 0.1, 0.05]})

def multiply_columns(column, weights):
    return column * weights

df.apply(lambda x: multiply_columns(x, df['weights']))
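In both the mul and apply versions above, the 'weights' column ends up multiplied by itself. If that matters, one way around it (a sketch, not taken from the answers above) is to restrict the operation to the other columns:

# multiply everything except 'weights', leaving the weights column intact
cols = df.columns.difference(['weights'])
df[cols] = df[cols].mul(df['weights'], axis=0)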

Iterate and input data into a column in a pandas dataframe

I have a pandas dataframe with a column that holds a small selection of strings. Let's call the column 'A'; all of the values in it are one of string_1, string_2, or string_3.
Now, I want to add another column and fill it with numeric values that correspond to the strings.
I created a dictionary
d = { 'string_1' : 1, 'string_2' : 2, 'string_3': 3}
I then initialized the new column:
df['B'] = pd.Series(index=df.index)
Now, I want to fill it with the integer values. I can call the values associated with the strings in the dictionary by:
for s in df['A']:
    n = d[s]
That works fine, but I've tried using just df['B'] = n to fill the new column inside the for loop, and that doesn't work; I haven't been able to figure out the right indexing with pandas.
If I understand you correctly you can just call map:
df['B'] = df['A'].map(d)
This will perform the lookup and fill the values you are looking for.
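A quick self-contained run with the dictionary from the question shows the mapping:

import pandas as pd

d = {'string_1': 1, 'string_2': 2, 'string_3': 3}
df = pd.DataFrame({'A': ['string_1', 'string_3', 'string_2']})

df['B'] = df['A'].map(d)
print(df)
#           A  B
# 0  string_1  1
# 1  string_3  3
# 2  string_2  2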
Rather than initializing an empty column and filling it, you can simply populate it with an apply:
df['B'] = df['A'].apply(d.get)
