I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10 |
|-----|
| A; B; C |
| A; B; D |
| A; D |
| A; D; E |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
|---|---|---|---|---|
| A | B | C |   |   |
| A | B |   | D |   |
| A |   |   | D |   |
| A |   |   | D | E |
|   | B | C | D | E |
Is there any way to do this in Excel, Python, or some other tool?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ";" and the lookup value. If not, we can remove the spaces with SUBSTITUTE:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way, though):
First split the column values, explode them, and get the dummies. Then group the dummy values back together across the given 5 (or N) columns:
import pandas as pd

df['Q10'] = df['Q10'].str.split('; ')     # split each response into a list
df = df.explode('Q10')                    # one row per selected option
df = pd.get_dummies(df, columns=['Q10'])  # one-hot encode the options
dummy_col_list = df.columns.tolist()
df['New'] = df.index                      # the original row id survives the explode
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
Q10_A Q10_B Q10_C Q10_D Q10_E
0 1 1 1 0 0
1 1 1 0 1 0
2 1 0 0 1 0
3 1 0 0 1 1
4 0 1 1 1 1
Now, if you want, you can rename the columns and replace each 1 with the column name:
import numpy as np

colName = new_df.columns.tolist()
newColList = []
for i in colName:
    newColName = i.split('_', 1)[1]  # strip the 'Q10_' prefix
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
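For completeness, the multi-step split/explode/get_dummies pipeline above can be collapsed into pandas' built-in `str.get_dummies`, which splits on the separator and one-hot encodes in a single call (a sketch, recreating the sample `Q10` column):

```python
import pandas as pd

df = pd.DataFrame({'Q10': ['A; B; C', 'A; B; D', 'A; D', 'A; D; E', 'B; C; D; E']})

# str.get_dummies splits on the separator and one-hot encodes in one step
dummies = df['Q10'].str.get_dummies(sep='; ')
# Replace each 1 with the column's own name and each 0 with an empty string
out = dummies.apply(lambda s: s.map({1: s.name, 0: ''}))
print(out)
```

This avoids the explode/groupby round trip entirely, since no rows are ever duplicated.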
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
A B C D E
0 A B C
1 A B D
2 A D
3 A D E
4 B C D E
It's not the most efficient way, but it works ;)
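The five near-identical `np.where` lines can be collapsed into a loop. Splitting on the delimiter also avoids the substring pitfall of `str.contains` (an option named 'AB' would otherwise also match 'A'). A sketch on the sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Q10': ['A; B; C', 'A; B; D', 'A; D', 'A; D; E', 'B; C; D; E']})

items = df['Q10'].str.split('; ')  # split once, reuse for every option
for opt in ['A', 'B', 'C', 'D', 'E']:
    # exact membership test against the split list, not a substring search
    df[opt] = np.where(items.apply(lambda lst: opt in lst), opt, '')
df = df.drop(columns='Q10')
print(df)
```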
I have a dataframe like this:
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 |   |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 |   |
If the value of B is present among the quoted values in A, then C should be 1; if it is not present in A, it should be 0. The expected output is:
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 | 1 |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 | 0 |
I want to do this for a dataframe with many such rows. How do I write this in Python to get that kind of data frame?
If values in A are strings use:
print (df.A.tolist())
["['1']", "['1,2']", "['2']", "['1,3']"]
df['C'] = [int(str(b) in a.strip("[]'").split(',')) for a, b in zip(df.A, df.B)]
print (df)
A B C
0 ['1'] 1 1
1 ['1,2'] 2 1
2 ['2'] 3 0
3 ['1,3'] 2 0
Or if values are one element lists use:
print (df.A.tolist())
[['1'], ['1,2'], ['2'], ['1,3']]
df['C'] = [int(str(b) in a[0].split(',')) for a, b in zip(df.A, df.B)]
print (df)
         A  B  C
0    ['1']  1  1
1  ['1,2']  2  1
2    ['2']  3  0
3  ['1,3']  2  0
My code:
df = pd.read_clipboard()
df
'''
A B
0 ['1'] 1
1 ['1,2'] 2
2 ['2'] 3
3 ['1,3'] 2
'''
(
    df.assign(A=df.A.str.replace("'", '').map(eval))  # "['1,2']" -> [1, 2]
    .assign(C=lambda d: d.apply(lambda s: s.B in s.A, axis=1))
    .assign(C=lambda d: d.C.astype(int))
)
'''
A B C
0 [1] 1 1
1 [1, 2] 2 1
2 [2] 3 0
3 [1, 3] 2 0
'''
df['C'] = np.where([str(b) in a for a, b in zip(df.A, df.B)], 1, 0)
Basically you need to transform column B to string, since column A contains strings, and then search for column B's value inside column A. Note this is a plain substring check, so it can misfire with multi-digit values (B = 1 would also match a row containing '12').
The result will be as you defined.
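If multi-digit values are possible, substring checks are unsafe. A sketch that instead parses column A with `ast.literal_eval` and does exact membership tests (recreating the sample frame):

```python
import ast
import pandas as pd

df = pd.DataFrame({'A': ["['1']", "['1,2']", "['2']", "['1,3']"], 'B': [1, 2, 3, 2]})

# Parse the string "['1,2']" into the list ['1,2'], then split its element on commas
parsed = df['A'].map(ast.literal_eval).map(lambda lst: lst[0].split(','))
df['C'] = [int(str(b) in items) for items, b in zip(parsed, df['B'])]
print(df)
```

`ast.literal_eval` only accepts Python literals, so it is safer than `eval` on untrusted data.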
I am trying to drop rows in one data frame based on values in another data frame. I would appreciate your expertise on this, please.
Data frame 1 – df1:
| A | C  |
|---|----|
| f | 10 |
| c | 15 |
| b | 20 |
| d | 30 |
| h | 35 |
| e | 40 |
Data frame 2 – df2:
| A | B |
|---|---|
| a | w |
| b | 1 |
| c | w |
| d | 1 |
| e | w |
| f | 0 |
| g | 1 |
| h | 1 |
I want to modify df1 and drop (eliminate) rows whose value in column A has a corresponding value of 'w' in column B of df2.
The resulting data frame looks like this:
| A | C  |
|---|----|
| f | 10 |
| b | 20 |
| d | 30 |
| h | 35 |
You can first create a list from df2 of the values of A whose associated value in B is 'w', and then use isin and ~ (which essentially means not in):
a = df2.loc[df2['B'].str.contains(r'w',case=False,na=False),'A'].tolist()
b = df1[~df1['A'].isin(a)]
And get back your desired outcome:
print(b)
A C
0 f 10
2 b 20
3 d 30
4 h 35
I find this link particularly helpful if you want to read more on Python's operators:
https://www.w3schools.com/python/python_operators.asp
What you're looking for is a merge with certain conditions:
# recreating your data
>>> import pandas as pd
>>> df1 = pd.DataFrame.from_dict({'A': list('fcbdhe'), 'B': [10, 15, 20, 30, 35, 40]})
>>> df2 = pd.DataFrame.from_dict({'A': list('abcdefgh'), 'B': list('w1w1w011')})
# merge but we further need to project that to fit the desired output
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')
A B_x B_y
0 f 10 0
1 b 20 1
2 d 30 1
3 h 35 1
# what you're looking for
>>> df1.merge(df2[df2['B'] != 'w'], how='inner', on='A')[['A', 'B_x']].rename(columns={'B_x': 'C'})
A C
0 f 10
1 b 20
2 d 30
3 h 35
I assume that your first dataframe is df1 and your second dataframe is df2.
First, list all the values of column A of df2 whose value in column B of df2 is 'w'.
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
# >>>output = ['a','c','e']
This lists every value of column A of df2 that has 'w' in column B.
Then you can drop the rows of df1 whose A value lies in that list:
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:
        df1.drop(i, axis=0, inplace=True)
Full code:
df2_A = df2[df2['B'] == 'w']['A'].values.tolist()
for i, val in enumerate(df1['A'].values.tolist()):
    if val in df2_A:  # check whether this row's A value is in the 'w' list
        df1.drop(i, axis=0, inplace=True)
df1
df1.merge(df2[df2['B']!='w'],how='inner').drop(["B"], axis=1)
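Another way to express the same filter, without a merge or an explicit loop, is a `Series.map` lookup (a sketch, recreating the sample frames):

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('fcbdhe'), 'C': [10, 15, 20, 30, 35, 40]})
df2 = pd.DataFrame({'A': list('abcdefgh'), 'B': list('w1w1w011')})

# Look up each df1 row's B value in df2, then keep only rows whose B is not 'w'
keep = df1['A'].map(df2.set_index('A')['B']).ne('w')
out = df1[keep].reset_index(drop=True)
print(out)
```

Unlike the inner merge, this also keeps df1 rows whose A value is absent from df2 (`map` yields NaN there, and `NaN != 'w'` is True), which may or may not be what you want.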
I have the following dataframe with multiple columns and rows:
A | B | C | D | E |....
2 | b | c | NaN | 1 |
3 | c | b | NaN | 0 |
4 | b | b | NaN | 1 |
.
.
.
Is there a way, using Python, to add Excel formulas (for some columns) to an output Excel file, in the manner shown below?
For instance, I want the output to be something like this:
=SUM(A0:A2) | | | | =SUM(E0:E2)
A | B | C | D | E
0 2 | b | c | =IF(B0=C0, "Yes", "No") | 1
1 3 | c | b | =IF(B1=C1, "Yes", "No") | 0
2 4 | b | b | =IF(B2=C2, "Yes", "No") | 1
.
.
.
Final output,
9 | | | | 2
A | B | C | D | E
0 2 | b | c | No | 1
1 3 | c | b | No | 0
2 4 | b | b | Yes | 1
.
.
.
I want to add formulas to the final output Excel file so that if any column values change (in that file), the other columns are updated in real time. For instance:
15 | | | | 3
A | B | C | D | E
0 2 | b | b | Yes | 1
1 9 | c | b | No | 1
2 4 | b | b | Yes | 1
.
.
.
If I change the value of A1 from 3 to 9, the sum of the column changes to 15; if I change the value of C0 from "c" to "b", its corresponding row value D0 changes from "No" to "Yes". The same goes for column E.
I know you can use the xlsxwriter library to write formulas, but I cannot figure out how to add them in the manner stated in the example above.
Any help would be really appreciated, thanks in advance!
You're best off writing all of the formulas you wish to keep via xlsxwriter, not pandas.
You would use pandas if you only wanted to export the computed result; since you want to preserve the formulas, write them when you write your spreadsheet.
The code below will write out the dataframe and formula to an xlsx file called test.
import xlsxwriter
import pandas as pd
from numpy import nan
data = [[2, 'b', 'c', nan, 1], [3, 'c', 'b', nan, 0], [4, 'b', 'b', nan, 1]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D', 'E'])
## Send values to a list so we can iterate over to allow for row:column matching in formula ##
values = df.values.tolist()
## Create Workbook ##
workbook = xlsxwriter.Workbook('test.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
## Iterate over the data we extracted from the DF, generating our cell formula for 'D' each iteration ##
for line in values:
    d = f'=IF(B{row + 1}=C{row + 1}, "Yes", "No")'
    a, b, c, _, e = line
    ## Write cells into spreadsheet ##
    worksheet.write(row, col, a)
    worksheet.write(row, col + 1, b)
    worksheet.write(row, col + 2, c)
    worksheet.write(row, col + 3, d)
    worksheet.write(row, col + 4, e)
    row += 1
## Write the total sums to the bottom row of the sheet utilising the row counter to specify our stop point ##
worksheet.write(row, 0, f'=SUM(A1:A{row})')
worksheet.write(row, 4, f'=SUM(E1:E{row})')
workbook.close()
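One easy slip here is the off-by-one between pandas' 0-based rows and Excel's 1-based rows. The formula strings are plain Python, so the row arithmetic can be sanity-checked without opening the workbook (a sketch with a hypothetical `build_formulas` helper mirroring the loop above):

```python
def build_formulas(n_rows):
    # One IF formula per data row; Excel rows start at 1, hence r + 1
    d_formulas = [f'=IF(B{r + 1}=C{r + 1}, "Yes", "No")' for r in range(n_rows)]
    # Footer sums stop at the last data row
    footer = (f'=SUM(A1:A{n_rows})', f'=SUM(E1:E{n_rows})')
    return d_formulas, footer

d, footer = build_formulas(3)
print(d[0], footer)
```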
A while ago I asked this question, but that does not cover the case where two merged categories have a category in common.
There I wanted to merge the categories A and B into AB. What if I have categories A, B, C and I want to merge A, B into AB, and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 2 |
| C | 1 | 1 |
+--------+---+---+
I think you can use crosstab on all unique values and then sum rows selected by category labels in the index:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print (df)
Y D E
X
C 1 1
A or B 3 3
B or C 3 3
EDIT: If you want a general solution it is not easy, because you need to repeat the overlapping groups under renamed labels, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y D E
X
A or B 3 3
B or C 3 3
C 1 1
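The pattern of summing selected crosstab rows generalizes: because the groups overlap, you can describe them as a mapping from new label to member categories and build each output row from the plain crosstab (a sketch on the sample data; the group labels and the `groups` dict are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'X': list('ABBBACCB'), 'Y': list('DDEDEDEE')})

ct = pd.crosstab(df.X, df.Y)
# Overlapping groups as row label -> member categories
groups = {'A or B': ['A', 'B'], 'B or C': ['B', 'C'], 'C': ['C']}
out = pd.DataFrame({name: ct.loc[members].sum() for name, members in groups.items()}).T
print(out)
```

Adding a new overlapping group is then a one-line change to the dict rather than another concat/rename pass.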
I would like to groupby and sum a dataframe without changing the number of rows, applying the aggregated result to the first occurrence of each group only.
Initial DF:
C1 | Val
a | 1
a | 1
b | 1
c | 1
c | 1
Wanted DF:
C1 | Val
a | 2
a | 0
b | 1
c | 2
c | 0
I tried to apply the following code:
df.groupby(['C1'])['Val'].transform('sum')
which helps propagate the aggregated result to all rows. However, transform does not seem to have an argument to apply the result to the first or last occurrence only.
Indeed, what I currently get is:
C1 | Val
a | 2
a | 2
b | 1
c | 2
c | 2
Using pandas.DataFrame.groupby:
s = df.groupby('C1')['Val']
v = s.sum().values
df.loc[:, 'Val'] = 0
df.loc[s.head(1).index, 'Val'] = v
print(df)
Output:
C1 Val
0 a 2
1 a 0
2 b 1
3 c 2
4 c 0
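The same result can also be had in one line by combining `transform('sum')` with a `duplicated` mask, which leaves the index untouched (a sketch on the sample data):

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'b', 'c', 'c'], 'Val': [1, 1, 1, 1, 1]})

# Group sum on every row, then zero out every row after the first of each group
df['Val'] = df.groupby('C1')['Val'].transform('sum').mask(df['C1'].duplicated(), 0)
print(df)
```

`duplicated()` marks every occurrence after the first, so this also works when a group's rows are not contiguous.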