Remove one of the duplicate values in two columns of a dataframe - python

I am working on Google Colaboratory and I have two columns in a pandas dataframe where some of the rows have the same value:
A    B
Syd  Syd
Aus  Del
Mir  Ard
Dol  Dol
I want the value in column B to be deleted wherever it duplicates the value in column A, like below:
A    B
Syd
Aus  Del
Mir  Ard
Dol
I tried to use drop_duplicates() as in Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C, but it deletes the entire column B. Any suggestions for smarter ways to solve this problem?
Thanks in advance!

There is no need to use drop_duplicates; you can simply compare column A with column B, then mask the values in B where they are equal to A:
df['B'] = df['B'].mask(df['A'].eq(df['B']))
Alternatively, you can use boolean indexing with loc to set the duplicated values to NaN (this requires import numpy as np):
df.loc[df['A'].eq(df['B']), 'B'] = np.nan
     A    B
0  Syd  NaN
1  Aus  Del
2  Mir  Ard
3  Dol  NaN
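For reference, a minimal self-contained sketch of both approaches, using the column names from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Syd', 'Aus', 'Mir', 'Dol'],
                   'B': ['Syd', 'Del', 'Ard', 'Dol']})

# Approach 1: mask values in B that equal the value in A on the same row
df['B'] = df['B'].mask(df['A'].eq(df['B']))

# Approach 2 (equivalent): boolean indexing with loc
# df.loc[df['A'].eq(df['B']), 'B'] = np.nan

print(df)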

append or join value from one dataframe to every row in another dataframe in Pandas

I'm normally OK on the joining and appending front, but this one has got me stumped.
I've got one dataframe with only one row in it. I have another with multiple rows. I want to append the value from one of the columns of my first dataframe to every row of my second.
df1:
id  Value
1   word

df2:
id  data
1   a
2   b
3   c

Output I'm seeking:
df2:
id  data  Value
1   a     word
2   b     word
3   c     word
I figured that this was along the right lines, but it listed out NaN for all rows:
df2 = df2.append(df1[df1['Value'] == 1])
I guess I could just join on the id value and then copy the value to all rows, but I assumed there was a cleaner way to do this.
Thanks in advance for any help you can provide!
Just get the first element of the Value column of df1 and assign it to a Value column of df2; a scalar assignment is broadcast to every row:
df2['Value'] = df1.loc[0, 'Value']
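A minimal sketch with the example data from the question:

import pandas as pd

df1 = pd.DataFrame({'id': [1], 'Value': ['word']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'data': ['a', 'b', 'c']})

# Assigning a scalar to a column broadcasts it to every row
df2['Value'] = df1.loc[0, 'Value']
print(df2)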

Appending a list as a row in a Dataframe [duplicate]

I asked this question about adding rows WITH an index, but it is still not clear to me how/why this happens when there is no index:
import pandas as pd

columnsList = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columnsList)
L = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, L)))
df8.append(s, ignore_index=True)
df8.append(s, ignore_index=True)
I EXPECT A 2X4 DATAFRAME HERE.
Nevertheless, no values were added, and no error occurred.
print(df8.shape)
# >>> (0, 4)
Why is the series not being added, and why is no error raised?
If I try to add a row with loc, an index label is added:
df8.loc[df8.index.max() + 1, :] = [4, 5, 6, 7]
print(df8)
result:
     A  B  C  D
NaN  4  5  6  7
I guess neither loc nor iloc can be used to append rows without an index name (i.e. loc adds the index label NaN, and iloc cannot be used when the index position is beyond the number of rows in the dataframe).
DataFrame.append is not an in-place operation. From the docs:
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
You need to assign the result back.
df8 = df8.append([s] * 2, ignore_index=True)
df8
          A         B         C         D
0  value aa  value bb  value cc  value dd
1  value aa  value bb  value cc  value dd
The statement data.append(sub_data) does not work on its own, but data = data.append(sub_data) will.
Assigning the result back solved the issue for me.
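Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas versions, the equivalent uses pd.concat, which likewise returns a new object; a minimal sketch:

import pandas as pd

columnsList = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columnsList)
L = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, L)))

# concat the series (as a one-row frame) twice, assigning the result back
row = s.to_frame().T
df8 = pd.concat([df8, row, row], ignore_index=True)
print(df8.shape)  # (2, 4)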

categorize string from one column in another column - python

I have a dataframe with 3 columns. Column A contains titles of a lot of products, Column B contains all brand names, and Column C contains models/series of all products. Column A has over 2000 rows, Column B has about 50 rows, and Column C has about 200 rows. I want to create a new Column D that categorizes whether the title in Column A includes a brand, a model, or is generic.
Example on my dataframe and desired result in Column D
Column A       Column B  Column C    Column D
Running shoes  Nike      Airmax 2    Generic
Nike airmax 2  Adidas    All stars   Model/series
Airmax 2       Converse  Ultraboost  Model/series
Nike Shoes     Puma      Questar     Brand
If a row in Column A contains both a brand and a model, I want Column D to categorize the row as model/series. All rows in Column A that cannot be matched with a brand or a model/series should be categorized as Generic.
I began trying with this:
df['Column D'] = df.apply(lambda x: x.Column_b in x.Column_a, axis=1)
Here I got an error because Column B has far fewer rows than Column A.
Then I wondered whether looping is even the right way to do this, or whether I need a regex.
Any help on how to accomplish getting the desired Column D, would be highly appreciated.
Use Series.str.contains to create a boolean mask m1 whose True values correspond to the rows where Column A contains a value from Column B. In the same manner, create the boolean mask m2 for Column C. Then use np.select to pick the value for each row based on the conditions m1 and m2:
import numpy as np

m1 = df['Column A'].str.contains('|'.join(df['Column B']), case=False)
m2 = df['Column A'].str.contains('|'.join(df['Column C']), case=False)
df['Column D'] = np.select(
    [m1 & m2, m1, m2], ['Model/series', 'Brand', 'Model/series'], 'Generic')
# print(df)
        Column A  Column B    Column C      Column D
0  Running shoes      Nike    Airmax 2       Generic
1  Nike airmax 2    Adidas   All stars  Model/series
2       Airmax 2  Converse  Ultraboost  Model/series
3     Nike Shoes      Puma     Questar         Brand
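A self-contained sketch of this approach with the example data (for real data, consider applying re.escape to each term before joining, since brand or model names may contain regex metacharacters):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': ['Running shoes', 'Nike airmax 2', 'Airmax 2', 'Nike Shoes'],
    'Column B': ['Nike', 'Adidas', 'Converse', 'Puma'],
    'Column C': ['Airmax 2', 'All stars', 'Ultraboost', 'Questar'],
})

# Case-insensitive alternation patterns built from the lookup columns
m1 = df['Column A'].str.contains('|'.join(df['Column B']), case=False)
m2 = df['Column A'].str.contains('|'.join(df['Column C']), case=False)

# A row matching both a brand and a model counts as Model/series
df['Column D'] = np.select(
    [m1 & m2, m1, m2], ['Model/series', 'Brand', 'Model/series'], 'Generic')
print(df)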
Maybe something like:
df['D'] = ['Brand' if x in df['B'].values
           else 'Model/Series' if x in df['C'].values
           else 'Generic'
           for x in df['A']]
I'm not 100% sure whether your data can contain both a Column B and a Column C instance within a single Column A instance, but if so it's trivial to add another conditional branch inside the list comprehension to catch both.

Store nth row elements in a list - pandas dataframe

I am new to Python. Could you help with the following?
I have a dataframe as follows; a, d, f & g are the column names, and the dataframe can be named df1:
a   d   f    g
20  30  20   20
0   1   NaN  NaN
I need to put the second row of df1 into a list, without the NaNs. Ideally as follows:
x = [0, 1]
Select the second row using df.iloc[1], then remove the NaN values using .dropna, and finally convert the resulting series into a Python list using the .tolist method.
Use:
x = df.iloc[1].dropna().astype(int).tolist()
# x = [0, 1]
Check itertuples().
So you would have something like that:
for row in df1.itertuples():
    row[0]  # -> that's your row's index; you can do whatever you want with it, as well as with the whole row, which is now a tuple.
You can also use iloc and dropna() like that:
row_2 = df1.iloc[1].dropna().to_list()
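A minimal runnable sketch, assuming the dataframe from the question:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [20, 0], 'd': [30, 1],
                    'f': [20, np.nan], 'g': [20, np.nan]})

# Second row, NaNs dropped, cast back to int, converted to a list
x = df1.iloc[1].dropna().astype(int).tolist()
print(x)  # [0, 1]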

Is there a way to get the mean value of previous two columns in pandas?

I want to calculate the mean of the previous two columns and use it to fill the NaNs in my dataframe. There are only a few rows with missing values, all in the 2010-19 columns.
I tried using bfill and ffill, but they only take the single previous or next value to fill a NaN.
My example data set has 7 columns, as below:
X          1990-2000  2000-2010  2010-19  1990-2000  2000-2010  2010-19
Hyderabad  10         20         NaN      1          3          NaN
The output I want:
X          1990-2000  2000-2010  2010-19  1990-2000  2000-2010  2010-19
Hyderabad  10         20         15       1          3          2
To use fillna row-wise in this way, an easy solution is to provide a pandas Series as the argument to fillna; it will replace NaN values by matching on the index.
Since the column names contain duplicates, the code below uses column positions instead. Assuming a dataframe called df:
col_indices = [3, 6]
for i in col_indices:
    # mean of the two columns immediately to the left
    means = df.iloc[:, [i - 1, i - 2]].mean(axis=1)
    # assign back: fillna(inplace=True) on an iloc slice may not update df
    df.iloc[:, i] = df.iloc[:, i].fillna(means)
This will fill the NaN values with the mean of the two columns to the left of each column in col_indices.
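A self-contained sketch with the example row from the question (column positions taken from the layout above):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['Hyderabad', 10, 20, np.nan, 1, 3, np.nan]],
    columns=['X', '1990-2000', '2000-2010', '2010-19',
             '1990-2000', '2000-2010', '2010-19'])

col_indices = [3, 6]  # positions of the two '2010-19' columns
for i in col_indices:
    means = df.iloc[:, [i - 1, i - 2]].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(means)

print(df)  # the NaNs become 15.0 and 2.0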
