Compare two columns in pandas to make them match - python

So I have two dataframes, each consisting of 6 columns containing numbers. I need to compare one column from each dataframe to make sure they match, and fix any values in that column that don't match. The columns are already sorted and are the same length. So far I can find the differences in the columns:
df1.loc[(df1['col1'] != df2['col2'])]
then I get the index # where df1 doesn't match df2. Then I go to that same index # in df2 to find the value in col2 causing the mismatch, and use this to change the value in df1 to the correct one from df2:
df1.loc[index_number, 'col1'] = new_value
Is there a way I can automatically fix the mismatches without having to manually look up what the correct value should be in df2?

If df2 is the authoritative source, you don't need to check where df1 differs; just assign the whole column:
df1.loc[:, 'column_name'] = df2['column_name']
But if we must check first:
c = 'column_name'
df1.loc[df1[c] != df2[c], c] = df2[c]
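As a runnable sketch of the conditional fix (using a hypothetical shared column name col1, and assuming both frames share the same index), the key point is that the right-hand side aligns on the index, so only the mismatched positions are overwritten:

```python
import pandas as pd

# Two frames of the same length with a shared, sorted index.
df1 = pd.DataFrame({'col1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'col1': [1, 9, 3, 8]})

c = 'col1'
# Boolean mask of positions where the columns disagree.
mask = df1[c] != df2[c]
# Assignment aligns df2[c] on the index, so only masked rows change.
df1.loc[mask, c] = df2[c]

print(df1[c].tolist())  # [1, 9, 3, 8]
```

After this, the two columns are identical, and rows that already matched were never touched.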

I think you need to compare with eq, and then, if you need to fill in values where the columns don't match, use combine_first:
df1 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,6,5],
                    'E':[5,3,6],
                    'F':[1,4,3]})
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 6 3 4
2 3 6 9 5 6 3
df2 = pd.DataFrame({'A':[1,2,1],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 1 6 9 5 6 3
If you need to compare one column against the whole DataFrame:
print (df1.eq(df2.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 False False False False False False
print (df1.eq(df1.A, axis=0))
A B C D E F
0 True False False True False True
1 True False False False False False
2 True False False False False True
And if you need to fix only column D:
df1.D = df1.loc[df1.D.eq(df2.D), 'D'].combine_first(df2.D)
print (df1)
A B C D E F
0 1 4 7 1.0 5 1
1 2 5 8 3.0 3 4
2 3 6 9 5.0 6 3
But then it is easier to simply assign column D from df2 to df1:
df1.D = df2.D
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3
If the indexes are different, you can use values to convert the column to a numpy array, so the assignment ignores index alignment:
df1.D = df2.D.values
print (df1)
A B C D E F
0 1 4 7 1 5 1
1 2 5 8 3 3 4
2 3 6 9 5 6 3

Drop column with low variance in pandas

I'm trying to drop columns in my pandas dataframe with 0 variance.
I'm sure this has been answered somewhere but I had a lot of trouble finding a thread on it. I found this thread, however when I tried the solution for my dataframe, baseline, with the command
baseline_filtered=baseline.loc[:,baseline.std() > 0.0]
I got the error:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
So, can someone tell me why I'm getting this error or provide an alternative solution?
There are some non-numeric columns, which std skips by default (in recent pandas versions you may need baseline.std(numeric_only=True) to avoid a TypeError):
baseline = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,4],
    'C':[7,8,9,4,2,3],
    'D':[1,1,1,1,1,1],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
#no A, F columns
m = baseline.std() > 0.0
print (m)
B True
C True
D False
E True
dtype: bool
So a possible solution for keeping or dropping the string columns is to reindex the mask with Series.reindex (note that Series.reindex takes no axis=1 argument):
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=True)]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b
baseline_filtered = baseline.loc[:, m.reindex(baseline.columns, fill_value=False)]
print (baseline_filtered)
B C E
0 4 7 5
1 5 8 3
2 4 9 6
3 5 4 9
4 5 2 2
5 4 3 4
Another idea is to use DataFrame.nunique, which works with both string and numeric columns:
baseline_filtered=baseline.loc[:,baseline.nunique() > 1]
print (baseline_filtered)
A B C E F
0 a 4 7 5 a
1 b 5 8 3 a
2 c 4 9 6 a
3 d 5 4 9 b
4 e 5 2 2 b
5 f 4 3 4 b
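A minimal sketch of why the error happens and how the reindex fix resolves it (numeric_only=True is used so the snippet also runs on recent pandas; older versions dropped object columns from std silently):

```python
import pandas as pd

baseline = pd.DataFrame({
    'A': list('abcdef'),          # object column, excluded from std
    'B': [4, 5, 4, 5, 5, 4],
    'D': [1, 1, 1, 1, 1, 1],      # zero variance
})

# The mask's index only covers the numeric columns ['B', 'D'] ...
m = baseline.std(numeric_only=True) > 0.0
# ... so baseline.loc[:, m] would raise IndexingError: the mask does
# not align with all of baseline's columns. Reindex fixes that:
m_full = m.reindex(baseline.columns, fill_value=False)

print(list(baseline.loc[:, m_full].columns))  # ['B']
```

With fill_value=True instead, the string column 'A' would survive the filter as in the answer above.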

Create mask to identify final two rows in groups in Pandas dataframe

I have a Pandas dataframe that includes a grouping variable. An example can be produced using:
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
                   'data':[4,5,3,6,7,8,9,8,7,3]})
...which looks like:
grp data
0 a 4
1 a 5
2 b 3
3 b 6
4 b 7
5 c 8
6 d 9
7 d 8
8 d 7
9 d 3
I can retrieve the last two rows of each group using:
dfgrp = df.groupby('grp').tail(2)
However, I would like to produce a mask that identifies the last two rows (or 1 row if only 1 exists), ideally producing an output that looks like:
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 False
8 True
9 True
I thought this would be relatively straightforward, but I haven't been able to find the solution. Suggestions would be greatly appreciated.
If your index is unique, you can do this using isin:
import pandas as pd
df = pd.DataFrame({'grp':['a','a','b','b','b','c','d','d','d','d'],
                   'data':[4,5,3,6,7,8,9,8,7,3]})
df['mask'] = df.index.isin(df.groupby('grp').tail(2).index)
df
grp data mask
0 a 4 True
1 a 5 True
2 b 3 False
3 b 6 True
4 b 7 True
5 c 8 True
6 d 9 False
7 d 8 False
8 d 7 True
9 d 3 True
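An alternative sketch that avoids the index-uniqueness requirement: GroupBy.cumcount(ascending=False) numbers rows from the end of each group (the last row gets 0), so comparing against 2 flags the final two rows, and a one-row group is flagged automatically:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a','a','b','b','b','c','d','d','d','d'],
                   'data': [4, 5, 3, 6, 7, 8, 9, 8, 7, 3]})

# Position from the end within each group: last row -> 0, next -> 1, ...
mask = df.groupby('grp').cumcount(ascending=False) < 2

print(mask.tolist())
# [True, True, False, True, True, True, False, False, True, True]
```

This yields the same boolean Series as the isin approach, and works even if the index contains duplicates.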

How do I assign elements to the column of a pandas dataframe based on the properties of groups derived from that dataframe?

Suppose I import pandas and numpy as follows:
import pandas as pd
import numpy as np
and construct the following dataframe:
df = pd.DataFrame({'Alpha': ['A','A','A','B','B','B','B','C','C','C','C','C'],
                   'Beta': np.nan})
...which gives me this:
Alpha Beta
0 A NaN
1 A NaN
2 A NaN
3 B NaN
4 B NaN
5 B NaN
6 B NaN
7 C NaN
8 C NaN
9 C NaN
10 C NaN
11 C NaN
How do I use pandas to get the following dataframe?
df_u = pd.DataFrame({'Alpha':['A','A','A','B','B','B','B','C','C','C','C','C'],'Beta' : [1,2,3,1,2,2,3,1,2,2,2,3]})
i.e. this:
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Generally speaking what I'm trying to achieve can be described by the following logic:
Suppose we group df by Alpha.
For every group, for every row in the group...
if the index of the row equals the minimum index of rows in the group, then assign 1 to Beta for that row,
else if the index of the row equals the maximum index of the rows in the group, then assign 3 to Beta for that row,
else assign 2 to Beta for that row.
Let's use duplicated:
df.loc[~df.duplicated('Alpha', keep='last'), 'Beta'] = 3
df.loc[~df.duplicated('Alpha', keep='first'), 'Beta'] = 1
df['Beta'] = df['Beta'].fillna(2)
print(df)
Output:
Alpha Beta
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 2.0
5 B 2.0
6 B 3.0
7 C 1.0
8 C 2.0
9 C 2.0
10 C 2.0
11 C 3.0
Method 1: use np.select:
mask1=df['Alpha'].ne(df['Alpha'].shift())
mask3=df['Alpha'].ne(df['Alpha'].shift(-1))
mask2=~(mask1|mask3)
cond=[mask1,mask2,mask3]
values=[1,2,3]
df['Beta']=np.select(cond,values)
print(df)
Alpha Beta
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 B 2
6 B 3
7 C 1
8 C 2
9 C 2
10 C 2
11 C 3
Detail of cond list:
print(mask1)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
10 False
11 False
Name: Alpha, dtype: bool
print(mask2)
0 False
1 True
2 False
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 True
11 False
Name: Alpha, dtype: bool
print(mask3)
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
Name: Alpha, dtype: bool
Method 2: use groupby:
def assign_value(x):
    return pd.Series([1] + [2]*(len(x)-2) + [3])

new_df = df.groupby('Alpha').apply(assign_value).rename('Beta').reset_index('Alpha')
print(new_df)
Alpha Beta
0 A 1
1 A 2
2 A 3
0 B 1
1 B 2
2 B 2
3 B 3
0 C 1
1 C 2
2 C 2
3 C 2
4 C 3
Assuming that the "Alpha" column is sorted, you can do it like this:
df["Beta"] = 2
df.loc[~(df["Alpha"] == df["Alpha"].shift()), "Beta"] = 1
df.loc[~(df["Alpha"] == df["Alpha"].shift(-1)), "Beta"] = 3
df
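The shift-based idea used by both answers above can be condensed into one runnable sketch. np.select evaluates the conditions in order, so a hypothetical one-row group (first and last at once) would get 1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Alpha': ['A','A','A','B','B','B','B','C','C','C','C','C']})

# First row of each run: value differs from the previous row.
first = df['Alpha'].ne(df['Alpha'].shift())
# Last row of each run: value differs from the next row.
last = df['Alpha'].ne(df['Alpha'].shift(-1))

# 1 for first rows, 3 for last rows, 2 for everything in between.
df['Beta'] = np.select([first, last], [1, 3], default=2)

print(df['Beta'].tolist())  # [1, 2, 3, 1, 2, 2, 3, 1, 2, 2, 2, 3]
```

Unlike the fillna variant, this produces an integer column directly, with no intermediate NaN step.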

compare multiple specific columns of all rows

I want to compare particular columns across all rows; if the values in those columns are all identical, extract that value to a new column, otherwise 0.
Say the example dataframe is as follows:
A B C D E F
13348 judte 1 1 1 1
54871 kfzef 1 1 0 1
89983 hdter 4 4 4 4
7543 bgfd 3 4 4 4
The result should be as follows:
A B C D E F Result
13348 judte 1 1 1 1 1
54871 kfzef 1 1 0 1 0
89983 hdter 4 4 4 4 4
7543 bgfd 3 4 4 4 0
I would be pleased to hear some suggestions.
Use:
cols = ['C','D','E','F']
df['Result'] = np.where(df[cols].eq(df[cols[0]], axis=0).all(axis=1), df[cols[0]], 0)
print (df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
Detail:
First compare all the columns in the cols list with the first of them, df[cols[0]], using eq:
print (df[cols].eq(df[cols[0]], axis=0))
C D E F
0 True True True True
1 True True False True
2 True True True True
3 True False False False
Then check whether all values per row are True with all:
print (df[cols].eq(df[cols[0]], axis=0).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
And last, use numpy.where to assign the first column's value where True and 0 where False.
I think you need apply with nunique:
df['Result'] = df[['C','D','E','F']].apply(lambda x: x.iloc[0] if x.nunique()==1 else 0, axis=1)
Or using np.where:
df['Result'] = np.where(df[['C','D','E','F']].nunique(axis=1)==1, df['C'], 0)
print(df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
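Putting the eq/all approach together as one self-contained sketch (data reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [13348, 54871, 89983, 7543],
                   'B': ['judte', 'kfzef', 'hdter', 'bgfd'],
                   'C': [1, 1, 4, 3], 'D': [1, 1, 4, 4],
                   'E': [1, 0, 4, 4], 'F': [1, 1, 4, 4]})

cols = ['C', 'D', 'E', 'F']
# A row qualifies only when every column in cols equals the first one.
all_equal = df[cols].eq(df[cols[0]], axis=0).all(axis=1)
df['Result'] = np.where(all_equal, df[cols[0]], 0)

print(df['Result'].tolist())  # [1, 0, 4, 0]
```

Both the eq/all and nunique variants give the same Result column here; nunique is a touch slower on wide frames but reads more naturally.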

Convert Outline format in CSV to Two Columns

I have data in a CSV file of the following format (one column in a dataframe). This is essentially like an outline in a Word document, where the headers I've shown here as letters are the main headers, and the items shown as numbers are subheaders:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to load the data into a dataframe, and I'm trying to reformat it with for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten with C 3 later in the loop (resulting in two instances of C 3 when only one is desired, and losing A 3 altogether). What's the best way to do this?
Apologies for poor formatting, new to the site.
Use:
# if the CSV has no header row, use the names parameter
df = pd.read_csv(file, names=['col'])
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
df = df[df['a'] != df['col']]
print (df)
a col
1 A 1
2 A 2
3 A 3
5 B 1
6 B 2
8 C 1
9 C 2
10 C 3
11 C 4
Details:
Check which values are numeric:
print (df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Replace True positions with NaN using mask and forward-fill the missing values:
print (df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add the new column at the first position with DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print (df)
a col
0 A A
1 A 1
2 A 2
3 A 3
4 B B
5 B 1
6 B 2
7 C C
8 C 1
9 C 2
10 C 3
11 C 4
and last, remove the rows where both columns hold the same value by boolean indexing.
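The whole recipe as a self-contained sketch, feeding the sample data through io.StringIO in place of the real file (dtype=str keeps the numeric rows as strings so str.isnumeric works uniformly):

```python
import io
import pandas as pd

# Stand-in for the real CSV, reconstructed from the question.
raw = "A\n1\n2\n3\nB\n1\n2\nC\n1\n2\n3\n4\n"
df = pd.read_csv(io.StringIO(raw), names=['col'], dtype=str)

# Headers are the non-numeric rows; blank them out and forward-fill
# so every subitem carries its header in the new first column.
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())

# Drop the header rows themselves (where both columns hold the header).
out = df[df['a'] != df['col']].reset_index(drop=True)
print(out.values.tolist())
```

The result has one row per subitem, paired with its header, and each header's items survive even when the same number appears under several headers.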
