My dataset looks like below:
+--------+----------+-----------+--------------------+
| | FST_NAME | LAST_NAME | EMAIL_ADDR |
+--------+----------+-----------+--------------------+
| ROW_ID | | | |
| 1-123 | Will | Smith | will.smith#abc.com |
| 1-124 | Dan | Brown | dan.brown#xyz.com |
| 1-125 | Will | Smith | will.smith#abc.com |
| 1-126 | Dan | Brown | dan.brown#xyz.com |
| 1-127 | Tom | Cruise | tom.cruise#abc.com |
| 1-128 | Will | Smith | will.smith#abc.com |
+--------+----------+-----------+--------------------+
I am trying to count duplicate rows, keeping the first record and storing all the duplicated row indexes in a column.
I tried the code below. It gives me the count, but I am unable to collect the duplicated indexes.
df.groupby(df.columns.tolist(),as_index=False).size()
How can I get the duplicated row index?
Try:
df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index()
To get exactly what you want:
res=df.reset_index().groupby(df.columns.tolist())["index"].agg(list).reset_index().rename(columns={"index": "duplicated"})
res.index=res["duplicated"].str[0].tolist()
res["duplicated"]=res["duplicated"].str[1:]
Outputs (dummy data):
#original df:
a b
a1 x 4
a2 y 3
b6 z 2
c7 x 4
d x 4
x y 3
#transformed one:
a b duplicated
a1 x 4 [c7, d]
a2 y 3 [x]
b6 z 2 []
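A self-contained run of the same steps on the question's data (keeping the "#" obfuscation in the e-mail addresses as in the post):

```python
import pandas as pd

# Question's data, with ROW_ID as the index
df = pd.DataFrame(
    {"FST_NAME": ["Will", "Dan", "Will", "Dan", "Tom", "Will"],
     "LAST_NAME": ["Smith", "Brown", "Smith", "Brown", "Cruise", "Smith"],
     "EMAIL_ADDR": ["will.smith#abc.com", "dan.brown#xyz.com",
                    "will.smith#abc.com", "dan.brown#xyz.com",
                    "tom.cruise#abc.com", "will.smith#abc.com"]},
    index=["1-123", "1-124", "1-125", "1-126", "1-127", "1-128"])

# Group on all columns, collect each group's row labels into a list,
# then keep the first label as the index and the rest as "duplicated"
res = (df.reset_index()
         .groupby(df.columns.tolist())["index"]
         .agg(list)
         .reset_index()
         .rename(columns={"index": "duplicated"}))
res.index = res["duplicated"].str[0].tolist()
res["duplicated"] = res["duplicated"].str[1:]
print(res)
```

Each kept row now lists the labels of its duplicates, e.g. row 1-123 carries ['1-125', '1-128'].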
Not a very efficient way, but it can be used as a solution:
df2 = df.drop_duplicates()
This will result as df2 =
Name1 Name2
0 Will Smith
1 Dan Brown
4 Tom Cruise
Now,
lis = []
for i in df2.iterrows():
    lis.append(i[0])
This will make lis = [0, 1, 4]. All the indexes from 0 to len(df) that are not in lis are the indexes of the duplicate rows.
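That complement can be computed with a list comprehension; a runnable sketch on hypothetical two-column data:

```python
import pandas as pd

# Hypothetical data containing duplicate rows
df = pd.DataFrame({"Name1": ["Will", "Dan", "Will", "Dan", "Tom"],
                   "Name2": ["Smith", "Brown", "Smith", "Brown", "Cruise"]})

df2 = df.drop_duplicates()   # keeps the first occurrence of each row

lis = []
for i in df2.iterrows():
    lis.append(i[0])         # i[0] is the row's index label

# Indexes not in lis are the duplicated rows
dup_idx = [i for i in df.index if i not in lis]
print(lis, dup_idx)
```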
For df like:
FST_NAME L_NAME email
0 w s ws
1 d b db
2 w s ws
3 z z zz
Get the grouped indexes into lists:
import pandas as pd
df = pd.DataFrame({'FST_NAME': ['w', 'd', 'w', 'z'], 'L_NAME': ['s', 'b', 's', 'z'], 'email': ['ws', 'db', 'ws', 'zz']})
df = df.groupby(df.columns.tolist()).apply(lambda row: pd.Series({'duplicated': list(row.index)}))
Output:
duplicated
FST_NAME L_NAME email
d b db [1]
w s ws [0, 2]
z z zz [3]
Example data:
| alcoholism | diabites | handicapped | hypertensive | new col |
| ---------- | -------- | ----------- | ------------ | ----------------------- |
| 1 | 0 | 1 | 0 | alcoholism, handicapped |
| 0 | 1 | 0 | 1 | diabites, hypertensive |
| 0 | 1 | 0 | 0 | diabites |
If any of the above columns has value = 1, then I need the new column to contain the names of those columns only,
and if all are zero, return no condition.
I tried to do it with the code below:
problems = ['alcoholism', 'diabetes','hypertension','handicap']
m1 = df[problems].isin([1])
mask = m1 | (m1.loc[~m1.any(axis=1)])
df['sp_name'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
But it returns the data in brackets, like [handicapped, alcoholism].
The issue is that I can't do value counts, as the all-zero rows show as empty [] and will not be plotted.
I still don't understand your ultimate goal, or how this will be useful in plotting, but all you're really missing is using str.join to combine each list into the string you want. That said, the way you've gotten there involves unnecessary steps. First, multiply the DataFrame by its own column names:
df * df.columns
alcoholism diabetes handicapped hypertension
0 alcoholism handicapped
1 diabetes hypertension
2 diabetes
Then you can apply the same as you did:
(df * df.columns).apply(lambda row: [i for i in row if i], axis=1)
0 [alcoholism, handicapped]
1 [diabetes, hypertension]
2 [diabetes]
dtype: object
Then you just need to include a string join in the function you supply to apply. Here's a complete example:
import pandas as pd
df = pd.DataFrame({
'alcoholism': [1, 0, 0],
'diabetes': [0, 1, 1],
'handicapped': [1, 0, 0],
'hypertension': [0, 1, 0],
})
df['new_col'] = (
(df * df.columns)
.apply(lambda row: ', '.join([i for i in row if i]), axis=1)
)
print(df)
alcoholism diabetes handicapped hypertension new_col
0 1 0 1 0 alcoholism, handicapped
1 0 1 0 1 diabetes, hypertension
2 0 1 0 0 diabetes
df['new_col'] = df.iloc[:, :-1].dot(df.add_suffix(",").columns[:-1]).str[:-1]
I already found this solution helpful.
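As a sketch of why that one-liner works: the 0/1 frame is matrix-multiplied (dot) with the comma-suffixed column names, so each row concatenates the names of its 1-columns, and the trailing comma is stripped. Rebuilt on the earlier example data (no iloc slicing needed here, since this frame has no extra column yet):

```python
import pandas as pd

df = pd.DataFrame({
    'alcoholism':   [1, 0, 0],
    'diabetes':     [0, 1, 1],
    'handicapped':  [1, 0, 0],
    'hypertension': [0, 1, 0],
})

# Matrix-multiply the 0/1 frame by the comma-suffixed names,
# then strip the trailing comma from each row's string
df['new_col'] = df.dot(df.columns + ",").str[:-1]
print(df['new_col'].tolist())
```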
I am trying to drop values in one data frame based on values in another data frame. I would appreciate your expertise on this, please.
Data frame 1 – df1:
| A | C |
| -------- | -------------- |
| f | 10 |
| c | 15 |
| b | 20 |
| d | 30 |
| h | 35 |
| e | 40 |
-----------------------------
Data frame 2 – df2:
| A | B |
| -------- | -------------- |
| a | w |
| b | 1 |
| c | w |
| d | 1 |
| e | w |
| f | 0 |
| g | 1 |
| h | 1 |
-----------------------------
I want to modify df1 and drop (eliminate) rows whose value in column A has a corresponding value of 'w' in column B of df2.
The resulting data frame looks like below.
| A | C |
| -------- | -------------- |
| f | 10 |
| b | 20 |
| d | 30 |
| h | 35 |
-----------------------------
You can first create a list from df2 of the values of A that have an associated value of 'w' in B, and then use isin and ~ (which essentially means "not in"):
a = df2.loc[df2['B'].str.contains(r'w',case=False,na=False),'A'].tolist()
b = df1[~df1['A'].isin(a)]
And get back your desired outcome:
print(b)
A C
0 f 10
2 b 20
3 d 30
4 h 35
I find this link particularly helpful if you want to read more on Python's operators:
https://www.w3schools.com/python/python_operators.asp
What you're looking for is a merge with certain conditions:
# recreating your data
>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': list('fcbdhe'), 'B': [10, 15, 20, 30, 35, 40]})
>>> df2 = pd.DataFrame.from_dict({'A': list('abcdefgh'), 'B': list('w1w1w011')})
# merge but we further need to project that to fit the desired output
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')
A B_x B_y
0 f 10 0
1 b 20 1
2 d 30 1
3 h 35 1
# what you're looking for
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')[['A', 'B_x']].rename(columns={'B_x': 'C'})
A C
0 f 10
1 b 20
2 d 30
3 h 35
I assume that your first dataframe is df1 and your second dataframe is df2.
First, list all the values of column A of df2 whose value in column B is 'w':
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
# >>> output = ['a', 'c', 'e']
Then you can drop the rows of df1 whose A values lie in that list:
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:
        df1.drop(i, axis=0, inplace=True)
Full code:
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:  # checking if this value of df1's A column is in the list above
        df1.drop(i, axis=0, inplace=True)
df1
df1.merge(df2[df2['B']!='w'],how='inner').drop(["B"], axis=1)
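A runnable sketch of the one-liner above on the question's data (the inner merge on the shared A column keeps only rows whose B in df2 is not 'w', then B is dropped):

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('fcbdhe'), 'C': [10, 15, 20, 30, 35, 40]})
df2 = pd.DataFrame({'A': list('abcdefgh'), 'B': list('w1w1w011')})

# Inner merge on the common column A against the filtered df2, then drop B
res = df1.merge(df2[df2['B'] != 'w'], how='inner').drop(['B'], axis=1)
print(res)
```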
I have the following dataframe with multiple columns and rows:
A | B | C | D | E |....
2 | b | c | NaN | 1 |
3 | c | b | NaN | 0 |
4 | b | b | NaN | 1 |
.
.
.
Is there a way, using Python, to add Excel formulas (for some columns) to an output Excel file, in the manner shown in the example below?
For instance, I want the output to be something like this:
=SUM(A0:A2) | | | | =SUM(E0:E2)
A | B | C | D | E
0 2 | b | c | =IF(B0=C0, "Yes", "No") | 1
1 3 | c | b | =IF(B1=C1, "Yes", "No") | 0
2 4 | b | b | =IF(B2=C2, "Yes", "No") | 1
.
.
.
Final output,
9 | | | | 2
A | B | C | D | E
0 2 | b | c | No | 1
1 3 | c | b | No | 0
2 4 | b | b | Yes | 1
.
.
.
I want to add formulas to the final output Excel file so that if any column values change (in the output file), the other columns are updated in real time. For instance:
15 | | | | 3
A | B | C | D | E
0 2 | b | b | Yes | 1
1 9 | c | b | No | 1
2 4 | b | b | Yes | 1
.
.
.
If I change the value of, for instance, A1 from 3 to 9, then the sum of the column changes to 15; when I change the value of C0 from "c" to "b", its corresponding row value, D0, changes from "No" to "Yes"; same for column E.
I know you can use the xlsxwriter library to write the formulas, but I am not able to figure out how to add them in the manner stated in the example above.
Any help would be really appreciated, thanks in advance!
You're best off writing any formulas you wish to keep via xlsxwriter, not pandas.
You would use pandas if you only wanted to export the result; since you want to preserve the formulas, add them when you write the spreadsheet.
The code below will write out the dataframe and formula to an xlsx file called test.
import xlsxwriter
import pandas as pd
from numpy import nan
data = [[2, 'b', 'c', nan, 1], [3, 'c', 'b', nan, 0], [4, 'b', 'b', nan, 1]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E'])
## Send values to a list so we can iterate over to allow for row:column matching in formula ##
values = df.values.tolist()
## Create Workbook ##
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
## Iterate over the data we extracted from the DF, generating our cell formula for 'D' each iteration ##
for line in values:
    d = f'=IF(B{row + 1}=C{row + 1}, "Yes", "No")'
    a, b, c, _, e = line
    ## Write cells into spreadsheet ##
    worksheet.write(row, col, a)
    worksheet.write(row, col + 1, b)
    worksheet.write(row, col + 2, c)
    worksheet.write(row, col + 3, d)
    worksheet.write(row, col + 4, e)
    row += 1
## Write the total sums to the bottom row of the sheet utilising the row counter to specify our stop point ##
worksheet.write(row, 0, f'=SUM(A1:A{row})')
worksheet.write(row, 4, f'=SUM(E1:E{row})')
workbook.close()
A while ago I asked this question.
But that does not cover the case where two merged categories share a common category.
In that question I wanted to merge categories A and B into AB. What if I have categories A, B, C, and I want to merge A, B into AB, and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 2 |
| C | 1 | 1 |
+--------+---+---+
I think you can use crosstab over all unique values and then sum the rows selected by category from the index:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print (df)
Y D E
X
C 1 1
A or B 3 3
B or C 3 3
EDIT: If you want a general solution, it is not easy, because it is necessary to repeat groups with renaming, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y D E
X
A or B 3 3
B or C 3 3
C 1 1
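For reference, the first snippet run end-to-end on the question's data (a sketch, with X and Y built from the table above):

```python
import pandas as pd

# X/Y pairs from the question's table
df = pd.DataFrame({'X': list('ABBBACCB'), 'Y': list('DDEDEDEE')})

# Cross-tabulate all unique values, then sum rows by category group
ct = pd.crosstab(df.X, df.Y)
ct.loc['A or B'] = ct.loc[['A', 'B']].sum()
ct.loc['B or C'] = ct.loc[['B', 'C']].sum()
ct = ct.drop(['A', 'B'])
print(ct)
```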
I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10 |
----------------
| A; B; C |
| A; B; D |
| A; D |
| A; D; E |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
---------------------
| A | B | C | | |
| A | B | | D | |
| A | | | D | |
| A | | | D | E |
| | B | C | D | E |
Is there any way to do this in Excel or Python, or some other way?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ; and the lookup value. If not, we can remove the space with SUBSTITUTE:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way, though):
First split the column values, explode them, and get the dummies. Next, group the dummy values together across the given 5 (or N) columns:
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
dummy_col_list = df.columns.tolist()
df['New'] = df.index
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
Q10_A Q10_B Q10_C Q10_D Q10_E
0 1 1 1 0 0
1 1 1 0 1 0
2 1 0 0 1 0
3 1 0 0 1 1
4 0 1 1 1 1
Now, if you want, you can rename the columns and replace 1 with the column name:
import numpy as np

colName = new_df.columns.tolist()
newColList = []
for i in colName:
    newColName = i.split('_', 1)[1]
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
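The steps above can be condensed into a self-contained sketch (using groupby(level=0) in place of the helper 'New' column, which collapses the exploded rows by their original index labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Q10': ['A; B; C', 'A; B; D', 'A; D', 'A; D; E', 'B; C; D; E']})

# Split, explode, one-hot encode, then collapse back to one row per response
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
new_df = df.groupby(level=0).sum()

# Strip the "Q10_" prefix and replace each 1 with its column name
new_df.columns = [c.split('_', 1)[1] for c in new_df.columns]
for col in new_df.columns:
    new_df[col] = np.where(new_df[col] == 1, col, '')
print(new_df)
```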
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
It's not the most efficient way, but it works ;)