How to find different values of different columns from the same row - python

I'm working with a dataframe structured like this one below:
A B C D
group_997 NaN Purple JMHKJALJ_06538
group_998 NaN Pink JMHKJALJ_04556
group_999 NaN Red JMHKJALJ_09211
group_995 NaN Green JMHKJALJ_16378
group_996 NaN Yellow JMHKJALJ_81324
Now I'd like to find the value in column A that corresponds to JMHKJALJ_16378 in column D (that is, group_995).
I tried to use:
df[df['A'].str.contains('JMHKJALJ_16378')]
But the output is not correct.
How can I do that?
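The attempt above searches column A for an ID that only ever appears in column D, which is why the output looks wrong. A minimal sketch of the usual fix, reconstructing the sample frame from the question: filter on D, then read A.
import pandas as pd

df = pd.DataFrame({
    'A': ['group_997', 'group_998', 'group_999', 'group_995', 'group_996'],
    'B': [float('nan')] * 5,
    'C': ['Purple', 'Pink', 'Red', 'Green', 'Yellow'],
    'D': ['JMHKJALJ_06538', 'JMHKJALJ_04556', 'JMHKJALJ_09211',
          'JMHKJALJ_16378', 'JMHKJALJ_81324'],
})

# Filter rows on column D (where the ID lives), then select column A.
result = df.loc[df['D'] == 'JMHKJALJ_16378', 'A']
print(result.iloc[0])  # group_995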

Related

Comparing 2 dataframes and setting the value in one dataframe if it does not exist [duplicate]

I cannot find a pandas function (which I had seen before) to substitute the NaN's in a dataframe with values from another dataframe (assuming a common index which can be specified). Any help?
If you have two DataFrames of the same shape, then:
df[df.isnull()] = d2
will do the trick.
Only locations where df.isnull() evaluates to True will be eligible for assignment.
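For instance, with two identically shaped frames (made-up data), only df's NaN cells are overwritten:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
d2 = pd.DataFrame({'x': [9.0, 2.0], 'y': [3.0, 9.0]})

df[df.isnull()] = d2  # assigns only where df.isnull() is True
print(df)  # x becomes [1.0, 2.0], y becomes [3.0, 4.0]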
In practice, the DataFrames aren't always the same size / shape, and transforming methods (especially .shift()) are useful.
Data coming in is invariably dirty, incomplete, or inconsistent. Par for the course. There's a pretty extensive pandas tutorial and associated cookbook for dealing with these situations.
As I just learned, there is a DataFrame.combine_first() method, which does precisely this, with the additional property that if your updating DataFrame d2 is bigger than your original df, the additional rows and columns are added as well.
df = df.combine_first(d2)
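A small sketch of that extra property (made-up frames; d2 has one more row and one more column than df):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan]})
d2 = pd.DataFrame({'x': [9.0, 2.0, 3.0], 'y': [7.0, 8.0, 9.0]})

# df's non-NaN values win; the NaN is filled from d2,
# and d2's extra row and extra column 'y' are appended.
print(df.combine_first(d2))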
This should be as simple as
df.fillna(d2)
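For example (made-up frames): fillna fills df's NaNs from the aligned cells of d2, but unlike combine_first it does not add d2's extra rows or columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan]})
d2 = pd.DataFrame({'x': [9.0, 2.0], 'y': [3.0, 4.0]})

print(df.fillna(d2))  # x becomes [1.0, 2.0]; column 'y' is not added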
A dedicated method for this is DataFrame.update:
Quoted from the documentation:
Modify in place using non-NA values from another DataFrame.
Aligns on indices. There is no return value.
It is important to note that this method modifies your data in place, so it overwrites the calling DataFrame rather than returning a new one.
Example:
print(df1)
A B C
aaa NaN 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN NaN NaN
print(df2)
A B C
index
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
eee NaN 1.0 NaN
# update df1 NaN where there are values in df2
df1.update(df2)
print(df1)
A B C
aaa 1.0 1.0 NaN
bbb NaN NaN 10.0
ccc 3.0 NaN 6.0
ddd NaN NaN NaN
eee NaN 1.0 NaN
Notice the updated values at the intersections (aaa, A) and (eee, B).
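Because update works in place, a minimal sketch if you need to keep the original frame untouched is to copy first and update the copy:
df1_updated = df1.copy()  # preserve the original df1
df1_updated.update(df2)   # fills df1_updated from df2's non-NA values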
DataFrame.combine_first() answers this question exactly.
However, sometimes you want to fill/replace/overwrite some of the non-missing (non-NaN) values of DataFrame A with values from DataFrame B. That question brought me to this page, and the solution is DataFrame.mask()
A = B.mask(condition, A)
When condition is true, the values from A will be used, otherwise B's values will be used.
For example, you could solve the OP's original question with mask such that when an element from A is non-NaN, use it, otherwise use the corresponding element from B.
But using DataFrame.mask() you could replace the values of A that fail to meet arbitrary criteria (less than zero? more than 100?) with values from B. So mask is more flexible, and overkill for this problem, but I thought it was worthy of mention (I needed it to solve my problem).
It's also important to note that B could be a NumPy array instead of a DataFrame: DataFrame.combine_first() requires that B be a DataFrame, but DataFrame.mask() only requires that B be array-like with dimensions matching A's.
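A minimal sketch of the NaN-filling case with mask, using small made-up frames A and B of the same shape:
import numpy as np
import pandas as pd

A = pd.DataFrame({'x': [1.0, np.nan]})
B = pd.DataFrame({'x': [9.0, 2.0]})

# where A is non-NaN keep A's value, otherwise fall back to B's
filled = B.mask(A.notna(), A)
print(filled)  # x becomes [1.0, 2.0]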
One important piece of information missing from the other answers is that both combine_first and fillna match on index, so you have to make the indices match across the DataFrames for these methods to work.
Oftentimes there's a need to match on some other column(s) to fill in missing values. In that case, you need to use set_index first to make the columns to be matched the index.
df1 = df1.set_index(cols_to_be_matched).fillna(df2.set_index(cols_to_be_matched)).reset_index()
or
df1 = df1.set_index(cols_to_be_matched).combine_first(df2.set_index(cols_to_be_matched)).reset_index()
Another option is to use merge:
df1 = (df1.merge(df2, on=cols_to_be_matched, how='left', suffixes=('', '\x00'))
          .sort_index(axis=1).bfill(axis=1)[df1.columns])
The idea here is to left-merge and then sort the columns (we use '\x00' as the suffix for df2's columns since it's the character with the lowest Unicode value) so that values for the same column end up next to each other. Then bfill horizontally to fill df1's missing values from df2's.
Example:
Suppose you had df1:
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b NaN 1
2 2 b NaN 2
3 2 b NaN 3
and df2
C1 C2 C3
0 1 b 2
1 2 b 3
and you want to fill in the missing values in df1 with values from df2 for each C1-C2 value pair. Then
cols_to_be_matched = ['C1', 'C2']
and all of the snippets above produce the following output (where the values are indeed filled as required):
C1 C2 C3 C4
0 1 a 1.0 0
1 1 b 2.0 1
2 2 b 3.0 2
3 2 b 3.0 3

Erase columns where duplicated rows exist, in groups. Pandas

I need to find columns which contain only duplicated rows within each Name group.
I cannot remove/drop a column for one group because for another group that same column could be useful.
So when a specific column contains only duplicates within a group, I need to make that column empty for the group (replace the values with np.nan, for example).
my df:
Name,B,C,D
Adam,20,dog,cat
Adam,20,cat,elephant
Katie,21,cat,cat
Katie,21,cat,dog
Brody,22,dog,dog
Brody,21,cat,dog
expected output:
#grouping by Name; two rows always share the same Name, no fewer, no more.
Name,B,C,D
Adam,np.nan,dog,cat
Adam,np.nan,cat,elephant
Katie,np.nan,np.nan,cat
Katie,np.nan,np.nan,dog
Brody,22,dog,np.nan
Brody,21,cat,np.nan
I know I should use the groupby() and duplicated() functions, but I don't know what this approach should look like.
output=df[df.duplicated(keep=False)].groupby('Name')
output=output.replace({True:'np.nan'},regex=True)
Use GroupBy.transform with a lambda function and DataFrame.mask for the replacement:
df = df.set_index('Name')
output = df.mask(df.groupby('Name').transform(lambda x: x.duplicated(keep=False))).reset_index()
print(output)
Name B C D
0 Adam NaN dog cat
1 Adam NaN cat elephant
2 Katie NaN NaN cat
3 Katie NaN NaN dog
4 Brody 22.0 dog NaN
5 Brody 21.0 cat NaN

Drop or replace values within duplicate rows in pandas dataframe

I have a data frame df where some rows are duplicates with respect to a subset of columns:
A B C
1 Blue Green
2 Red Green
3 Red Green
4 Blue Orange
5 Blue Orange
I would like to remove (or replace with a dummy string) values for duplicate rows with respect to B and C, without deleting the whole row, ideally producing:
A B C
1 Blue Green
2 Red Green
3 NaN NaN
4 Blue Orange
5 NaN NaN
As per this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated; however, I can't get it to work with duplicates in a subset of columns.
I've also played around with:
is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicated==True, 999) # 999 intended as a placeholder that I could find-and-replace later on
However, this replaces almost every row with 999 in each column, so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan seems to work for me.
Edited to include @ALollz's and @macaw_9227's correction.
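A quick check of that one-liner on the question's sample data (reconstructed here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue'],
                   'C': ['Green', 'Green', 'Green', 'Orange', 'Orange']})

# blank out B and C on rows whose B-C pair has been seen before
df.loc[df.duplicated(subset=['B', 'C']), ['B', 'C']] = np.nan
print(df)  # rows 2 and 4 end up with NaN in B and C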
Let me share how I used to confront these kinds of challenges when I was starting out. Obviously there are quicker ways (a one-liner), but for the sake of the answer let's do it on a more intuitive level (later, you'll see that you can do it in one line).
So here we go...
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": ['Blue','Red','Red','Blue','Blue'], "C": ['Green','Green','Green','Orange','Orange']})
which results in:
      B       C
0  Blue   Green
1   Red   Green
2   Red   Green
3  Blue  Orange
4  Blue  Orange
Step 1: identify the duplication:
For this, I'm simply adding another (helper) column and asking with True/False whether the B-C pair is duplicated.
df['IS_DUPLICATED']= df.duplicated(subset=['B','C'])
Step 2: Identify the indexes of the 'True' IS_DUPLICATED:
dup_index = df[df['IS_DUPLICATED']==True].index
result: Int64Index([2, 4], dtype='int64')
Step 3: mark them as NaN:
df.iloc[dup_index] = np.nan
Step 4: remove the IS_DUPLICATED column:
df.drop('IS_DUPLICATED', axis=1, inplace=True)
and the desired result:
      B       C
0  Blue   Green
1   Red   Green
2   NaN     NaN
3  Blue  Orange
4   NaN     NaN
I would use:
df[['B','C']] = df[['B','C']].mask(df.duplicated(['B','C']))
df
Out[141]:
A B C
0 1 Blue Green
1 2 Red Green
2 3 NaN NaN
3 4 Blue Orange
4 5 NaN NaN

Unexpected Result When Filtering PANDAS DF Row

I typically filter a pandas DataFrame using the following syntax:
FDF = DF[DF['Color'] == 'Blue']
I expect to see a result where FDF, which is my filtered DataFrame, returns just the rows where the Color column is set to Blue. Instead, I get something like this. The funny thing is, the program used to work as expected, but stopped working after I upgraded my operating system and re-installed Python and all of the libraries. Also, it does not do this on all of my DataFrames. Any ideas?
0 Color Shape Data
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 Blue NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Edit: I think the first 3 responses misunderstood the question. I am showing the result, not the original DF. My original DF looks like this:
Color Shape Data
0 Green square Y
1 Red triangle N
2 Red circle Y
3 Blue circle N
4 Green square N
5 Red triangle N
The result I am expecting is:
Color Shape Data
3 Blue circle N
('Color', 'Shape', 'Data') is NOT your column names but your first row of data; otherwise there wouldn't be an index 0 assigned to this row. Since DF has no column named 'Color', DF['Color'] == 'Blue' doesn't filter out anything, and therefore it returns all records.
If you imported your data from a csv or Excel sheet, I'd suggest you specify using the first row of your file as column names.
You are missing DF.loc, so you are getting unnecessary rows.
Make the first row the column header:
DF = DF.rename(columns=DF.iloc[0]).drop(DF.index[0])
Then use the below to get just the rows where Color is 'Blue':
FDF=DF.loc[DF['Color'] =='Blue']
You're trying to filter by a column name which does not exist. First make your first row the column header:
DF.columns = DF.iloc[0]
DF = DF.reindex(DF.index.drop(0))
Now filter using
FDF = DF[DF['Color'] == 'Blue']
Not sure I understand the negative ratings on this question. However, I was able to work around the issue by assigning a new index and renaming the columns.
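A minimal sketch of that workaround (made-up data, with the real header sitting in row 0 as in the broken output above):
import pandas as pd

# the first row holds what should have been the column names
DF = pd.DataFrame([['Color', 'Shape', 'Data'],
                   ['Green', 'square', 'Y'],
                   ['Blue', 'circle', 'N']])

DF.columns = DF.iloc[0]         # promote the first row to the header
DF = DF.drop(DF.index[0])       # drop that row from the data
DF = DF.reset_index(drop=True)  # assign a fresh 0-based index

FDF = DF[DF['Color'] == 'Blue']
print(FDF)  # just the row where Color is Blue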
