I am trying to drop rows in one data frame based on values in another data frame. I would appreciate your expertise on this, please.
Data frame 1 – df1:
| A | C |
| -------- | -------------- |
| f | 10 |
| c | 15 |
| b | 20 |
| d | 30 |
| h | 35 |
| e | 40 |
-----------------------------
Data frame 2 – df2:
| A | B |
| -------- | -------------- |
| a | w |
| b | 1 |
| c | w |
| d | 1 |
| e | w |
| f | 0 |
| g | 1 |
| h | 1 |
-----------------------------
I want to modify df1 and drop (eliminate) rows in column A if the corresponding value in column B of df2 is 'w'.
The resulting data frame looks like below.
| A | C |
| -------- | -------------- |
| f | 10 |
| b | 20 |
| d | 30 |
| h | 35 |
-----------------------------
You can first create a list from df2 of the values of A whose associated value in B is 'w', and then filter df1 with isin and ~ (which essentially means "not in"):
a = df2.loc[df2['B'].str.contains('w', case=False, na=False), 'A'].tolist()
b = df1[~df1['A'].isin(a)]
And get back your desired outcome:
print(b)
A C
0 f 10
2 b 20
3 d 30
4 h 35
I find this link particularly helpful if you want to read more on Python's operators:
https://www.w3schools.com/python/python_operators.asp
What you're looking for is a merge with certain conditions:
# recreating your data
>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': list('fcbdhe'), 'B': [10, 15, 20, 30, 35, 40]})
>>> df2 = pd.DataFrame.from_dict({'A': list('abcdefgh'), 'B': list('w1w1w011')})
# merge but we further need to project that to fit the desired output
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')
A B_x B_y
0 f 10 0
1 b 20 1
2 d 30 1
3 h 35 1
# what you're looking for
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')[['A', 'B_x']].rename(columns={'B_x': 'C'})
A C
0 f 10
1 b 20
2 d 30
3 h 35
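If you'd rather avoid the suffixed columns altogether, a left anti-join via merge's indicator flag is another option (a sketch on the same recreated data, keeping the question's original column name C):

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('fcbdhe'), 'C': [10, 15, 20, 30, 35, 40]})
df2 = pd.DataFrame({'A': list('abcdefgh'), 'B': list('w1w1w011')})

# the rows of df2 we want df1 NOT to match
bad = df2.loc[df2['B'] == 'w', ['A']]

# left anti-join: keep df1 rows with no counterpart in bad
result = (df1.merge(bad, on='A', how='left', indicator=True)
             .query("_merge == 'left_only'")
             .drop(columns='_merge'))
print(result)
```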
I assume that your first dataframe is df1 and your second is df2.
First, list all the values of column A of df2 whose value in column B is 'w':
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
# output: ['a', 'c', 'e']
Then drop the rows of df1 whose value in column A appears in that list. Note that drop removes rows by index label, so iterate over the labels (snapshotted first, since df1 is mutated inside the loop):
for i, val in list(df1['A'].items()):
    if val in df2_A:
        df1.drop(i, axis=0, inplace=True)
Full code:
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
for i, val in list(df1['A'].items()):
    if val in df2_A:  # checking if this row's A value is in the list above
        df1.drop(i, axis=0, inplace=True)
df1
Or, as a one-liner, merge on the non-'w' rows of df2 and drop the helper column:
df1.merge(df2[df2['B'] != 'w'], how='inner').drop(['B'], axis=1)
I have the following dataframe with multiple cols and rows,
A | B | C | D | E |....
2 | b | c | NaN | 1 |
3 | c | b | NaN | 0 |
4 | b | b | NaN | 1 |
.
.
.
Is there a way, using Python, to add Excel formulas to some columns of an output Excel file, in the manner shown in the example below?
For instance, I want to be able to have the output something like this,
=SUM(A0:A2) | | | | =SUM(E0:E2)
A | B | C | D | E
0 2 | b | c | =IF(B0=C0, "Yes", "No") | 1
1 3 | c | b | =IF(B1=C1, "Yes", "No") | 0
2 4 | b | b | =IF(B2=C2, "Yes", "No") | 1
.
.
.
Final output,
9 | | | | 2
A | B | C | D | E
0 2 | b | c | No | 1
1 3 | c | b | No | 0
2 4 | b | b | Yes | 1
.
.
.
I want to add formulas to the final output Excel file so that if any values change in that file, the dependent columns update in real time. For instance,
15 | | | | 3
A | B | C | D | E
0 2 | b | b | Yes | 1
1 9 | c | b | No | 1
2 4 | b | b | Yes | 1
.
.
.
If I change the value of, say, A1 from 3 to 9, the sum of the column changes to 15; when I change the value of C0 from "c" to "b", its corresponding row value D0 changes from "No" to "Yes". The same applies to column E.
I know you can use xlsxwriter library to write the formulas but I am not able to figure out as to how I can add the formulas in the manner I have stated in the example above.
Any help would be really appreciated, thanks in advance!
You're best off writing any formulas you wish to keep via xlsxwriter, not pandas.
You would use pandas if you only wanted to export the computed result; since you want to preserve the formulas, add them when you write the spreadsheet.
The code below writes the dataframe and formulas to an xlsx file called test.xlsx.
import xlsxwriter
import pandas as pd
from numpy import nan
data = [[2, 'b', 'c', nan, 1], [3, 'c', 'b', nan, 0], [4, 'b', 'b', nan, 1]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E'])
## Send values to a list so we can iterate over to allow for row:column matching in formula ##
values = df.values.tolist()
## Create Workbook ##
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
## Iterate over the data we extracted from the DF, generating our cell formula for 'D' each iteration ##
for line in values:
    d = f'=IF(B{row + 1}=C{row + 1}, "Yes", "No")'
    a, b, c, _, e = line
    ## Write cells into spreadsheet ##
    worksheet.write(row, col, a)
    worksheet.write(row, col + 1, b)
    worksheet.write(row, col + 2, c)
    worksheet.write(row, col + 3, d)
    worksheet.write(row, col + 4, e)
    row += 1
## Write the total sums to the bottom row of the sheet utilising the row counter to specify our stop point ##
worksheet.write(row, 0, f'=SUM(A1:A{row})')
worksheet.write(row, 4, f'=SUM(E1:E{row})')
workbook.close()
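Since the formulas are just strings, it can help to generate and inspect them separately before handing them to xlsxwriter; a small sketch of that generation step only (pure pandas, no file written):

```python
import pandas as pd
from numpy import nan

data = [[2, 'b', 'c', nan, 1], [3, 'c', 'b', nan, 0], [4, 'b', 'b', nan, 1]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E'])

# DataFrame row i lands on 1-based sheet row i + 1
d_formulas = [f'=IF(B{i + 1}=C{i + 1}, "Yes", "No")' for i in range(len(df))]
sum_a = f'=SUM(A1:A{len(df)})'

print(d_formulas)
print(sum_a)
```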
A while ago I asked this question
But that does not cover the case where two merged groups share a category.
There I wanted to merge the categories A and B into AB. What if I have categories A, B, C and want to merge A, B into AB and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 2 |
| C | 1 | 1 |
+--------+---+---+
I think you can build the crosstab over all unique values first, and then sum rows by selecting the relevant categories from the index:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print (df)
Y D E
X
C 1 1
A or B 3 3
B or C 3 3
EDIT: A general solution is not easy, because you need to repeat the overlapping groups under a new name, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y D E
X
A or B 3 3
B or C 3 3
C 1 1
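One way to generalize the repeat-and-rename step is to describe each output row as a list of source categories and rebuild the repeated frame in a loop; a sketch on the question's data, where the groups dict is my own illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'X': list('ABBBACCB'), 'Y': list('DDEDEDEE')})

# each output row -> the source categories it aggregates (overlaps allowed)
groups = {'A or B': ['A', 'B'], 'B or C': ['B', 'C'], 'C': ['C']}

# repeat the matching rows under each group name, then cross-tabulate
parts = [df[df['X'].isin(cats)].assign(X=name) for name, cats in groups.items()]
repeated = pd.concat(parts, ignore_index=True)
result = pd.crosstab(repeated['X'], repeated['Y'])
print(result)
```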
I have a dataframe that looks like this:
Col1 | Col2 | Col1 | Col3 | Col1 | Col4
a | d | | h | a | p
b | e | b | i | b | l
| l | a | l | | a
l | r | l | a | l | x
a | i | a | w | | i
| c | | i | r | c
d | o | d | e | d | o
Col1 is repeated multiple times in the dataframe. In each Col1, there is missing information. I need to create a new column that has all of the information from each Col1 occurrence.
How can I create a column with the complete information and then delete the previous duplicate columns?
Some information may be missing from multiple columns. This script is also meant to be used in the future when there could be one, three, five, or any number of duplicated Col1 columns.
The desired output looks like this:
Col2 | Col3 | Col4 | Col5
d | h | p | a
e | i | l | b
l | l | a | a
r | a | x | l
i | w | i | a
c | i | c | r
o | e | o | d
I have been looking over this question but it is not clear to me how I could keep the desired Col1 with complete values. I could delete multiple columns of the same name but I need to first create a column with complete information.
First, replace the empty values in your columns with NaN as below:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
Then you can use groupby followed by first():
df.groupby(level = 0, axis = 1).first()
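Note that the axis=1 form of groupby is deprecated in recent pandas (2.x); on newer versions the same idea can be written by transposing first. A sketch on small made-up data:

```python
import pandas as pd
import numpy as np

# two 'Col1' columns with complementary gaps (illustrative data)
df = pd.DataFrame([['a', np.nan, 'd'],
                   [np.nan, 'b', 'e'],
                   ['c', np.nan, 'f']],
                  columns=['Col1', 'Col1', 'Col2'])

# transpose, take the first non-null per duplicated name, transpose back
merged = df.T.groupby(level=0).first().T
print(merged)
```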
Maybe something like this is what you are looking for.
col_list = list(set(df.columns))
dicts = {}
for col in col_list:
    val = list(filter(None, set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col] = val
max_len = max(len(v) for v in dicts.values())
pd.DataFrame({k: pd.Series(v[:max_len]) for k, v in dicts.items()})
output
Col3 Col4 Col1 Col2
0 h i d d
1 w l b r
2 i c r i
3 l x l l
4 a p a o
5 e o NaN c
6 NaN a NaN e
If I have two dataframes like:
df1:
| a | b |
0 | 0 | 0 |
1 | 0 | 1 |
2 | 1 | 1 |
df2:
| c | d |
0 | 0 | 1 |
1 | 1 | 1 |
2 | 2 | 1 |
how could I select the rows of df2 where df1[df2['c']]['b'] != 0? In other words, the rows of df2 whose value in column c is the index used to check that the value in df1's column b is not equal to 0.
One other way to look at it: I select all the rows from df2 where column c is a foreign key into df1, and I don't want the referenced value in column b to be 0.
I think this should do the trick. Let me know if you need something else.
import pandas as pd
df1['index1'] = df1.index
df = pd.merge(df1, df2, how='left', left_on=['index1'], right_on=['c'])
df = df[df.b != 0]
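If every value in df2['c'] is guaranteed to be a valid position in df1, a more direct alternative (my own sketch, not part of the answer above) is a positional lookup instead of a merge:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 1], 'b': [0, 1, 1]})
df2 = pd.DataFrame({'c': [0, 1, 2], 'd': [1, 1, 1]})

# look up df1['b'] at the positions listed in df2['c'],
# then keep the df2 rows where that value is non-zero
mask = df1['b'].to_numpy()[df2['c'].to_numpy()] != 0
result = df2[mask]
print(result)
```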
I’d like to convert a dataframe, df, similar to this one:
PIDM | COURSE | GRADE
1 | MAT1 | B
1 | PHY2 | C
2 | MAT1 | A
2 | MAT2 | B
2 | PHE2 | A
to the following format:
PIDM | MAT1 | PHY2 | MAT2 | PHE2
1 | B | C | NaN | NaN
2 | A | NaN | B | A
I was assuming I could do something like:
df2 = df.pivot(index='PIDM', columns='COURSE', values='GRADE')
but I receive an error stating that I have duplicate indices. Thank you for your help.
You can use pivot_table with aggregate function join:
df2 = df.pivot_table(index='PIDM', columns='COURSE', values = 'GRADE', aggfunc=', '.join)
print (df2)
COURSE MAT1 MAT2 PHE2 PHY2
PIDM
1 B None None C
2 A B A None
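As an aside, you can find the rows that trigger the duplicate-index error in pivot by checking for repeated (PIDM, COURSE) pairs; a sketch on the question's sample data (which happens to have no repeats, so the result here is empty, but on real data these would be the offending rows):

```python
import pandas as pd

df = pd.DataFrame({'PIDM': [1, 1, 2, 2, 2],
                   'COURSE': ['MAT1', 'PHY2', 'MAT1', 'MAT2', 'PHE2'],
                   'GRADE': list('BCABA')})

# rows sharing a (PIDM, COURSE) pair with another row break df.pivot
dupes = df[df.duplicated(subset=['PIDM', 'COURSE'], keep=False)]
print(dupes)
```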