Compare two columns in pandas - Python

I have a dataframe like this
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 |   |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 |   |
If the value of B is present in A (inside the quotes), then C should be 1; if it is not present in A, it should be 0. The expected output is:
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 | 1 |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 | 0 |
I want to do this for every row. How do I write this in Python to get this kind of DataFrame?

If the values in A are strings, use:
print (df.A.tolist())
["['1']", "['1,2']", "['2']", "['1,3']"]
df['C'] = [int(str(b) in a.strip("[]'").split(',')) for a, b in zip(df.A, df.B)]
print (df)
         A  B  C
0    ['1']  1  1
1  ['1,2']  2  1
2    ['2']  3  0
3  ['1,3']  2  0
Or, if the values are one-element lists, use:
print (df.A.tolist())
[['1'], ['1,2'], ['2'], ['1,3']]
df['C'] = [int(str(b) in a[0].split(',')) for a, b in zip(df.A, df.B)]
print (df)
       A  B  C
0    [1]  1  1
1  [1,2]  2  1
2    [2]  3  0
3  [1,3]  2  0
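For reference, here is a minimal self-contained sketch of the first variant; constructing the frame this way is an assumption based on the tolist() output shown above:
import pandas as pd

df = pd.DataFrame({'A': ["['1']", "['1,2']", "['2']", "['1,3']"],
                   'B': [1, 2, 3, 2]})

# strip the outer brackets/quotes, split on commas, then test membership
df['C'] = [int(str(b) in a.strip("[]'").split(',')) for a, b in zip(df.A, df.B)]
print(df)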

My code:
df = pd.read_clipboard()
df
'''
         A  B
0    ['1']  1
1  ['1,2']  2
2    ['2']  3
3  ['1,3']  2
'''
(
    df.assign(A=df.A.str.replace("'", '').map(eval))
      .assign(C=lambda d: d.apply(lambda s: s.B in s.A, axis=1))
      .assign(C=lambda d: d.C.astype(int))
)
'''
        A  B  C
0     [1]  1  1
1  [1, 2]  2  1
2     [2]  3  0
3  [1, 3]  2  0
'''
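A side note on the eval step: if the strings in A come from an untrusted source, ast.literal_eval is the safer parser for list literals. A sketch of the same idea, applied to the raw strings:
import ast

# parse each "['1,2']" string into a real list, then split its single element on commas
parsed = df.A.map(ast.literal_eval).map(lambda lst: lst[0].split(','))
df['C'] = [int(str(b) in vals) for vals, b in zip(parsed, df.B)]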

df['C'] = np.where([str(b) in a for a, b in zip(df.A, df.B)], 1, 0)
Basically, you need to convert column B to string, since column A holds strings, and then look for B inside A. An element-wise check is needed here; isin would compare B against whole values of A, not substrings. Also beware that a plain substring test can misfire when one value is a prefix of another (e.g. B=1 against '11'); splitting on commas, as in the answer above, avoids that. The result will be as you defined.

Related

Efficient way of reducing a DataFrame from wide to long based on the logic mentioned

I have a DataFrame with column names 'a', 'b', 'c':
#Input
import pandas as pd

list_of_dicts = [
    {'a': 0, 'b': 4, 'c': 3},
    {'a': 1, 'b': 1, 'c': 2},
    {'a': 0, 'b': 0, 'c': 0},
    {'a': 1, 'b': 0, 'c': 3},
    {'a': 2, 'b': 1, 'c': 0},
]
df = pd.DataFrame(list_of_dicts)
#Input DataFrame
   |  a  |  b  |  c  |
---|-----|-----|-----|
 0 |  0  |  4  |  3  |
 1 |  1  |  1  |  2  |
 2 |  0  |  0  |  0  |
 3 |  1  |  0  |  3  |
 4 |  2  |  1  |  0  |
I want to reduce the wide DataFrame to one column, where each column name appears as a value, repeated as many times as the corresponding cell value. The operation must be done row-wise.
#Output
   | Values |
---|--------|
 0 |   b    |
 1 |   b    |
 2 |   b    |
 3 |   b    |
 4 |   c    |
 5 |   c    |
 6 |   c    |
 7 |   a    |
 8 |   b    |
 9 |   c    |
10 |   c    |
11 |   a    |
12 |   c    |
13 |   c    |
14 |   c    |
15 |   a    |
16 |   a    |
17 |   b    |
Explanation:
Row 0 of the input has 4 'b' and 3 'c', so the first seven elements of the output are bbbbccc.
Row 1 has 1 'a', 1 'b' and 2 'c', so the output continues with abcc as the next 4 elements.
Row 2 has 0's across, so it is skipped entirely.
The order of the output is very important: because column 'b' comes before column 'c', the first row must produce bbbbccc. The operation must be row-wise, from left to right.
I'm trying to find an efficient way to accomplish this; the real dataset is too big for me to compute naively. Please provide a Python 3 solution.
Stack the data (you could melt as well), and drop rows where the count is zero. Finally use numpy.repeat to build a new array, and build your new dataframe from that.
reshape = df.stack().droplevel(0).loc[lambda x: x != 0]
pd.DataFrame(np.repeat(reshape.index, reshape), columns=['values'])
   values
0       b
1       b
2       b
3       b
4       c
5       c
6       c
7       a
8       b
9       c
10      c
11      a
12      c
13      c
14      c
15      a
16      a
17      b
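As the answer mentions, melt works as well; here is a sketch of that variant (pandas 1.1+ for ignore_index; the stable sort on the original row index restores the row-wise, left-to-right order):
long = (df.melt(ignore_index=False)             # stack the columns, keeping the row index
          .reset_index()
          .sort_values('index', kind='stable')  # row-wise order, columns left to right
          .query('value != 0'))
pd.DataFrame({'values': np.repeat(long['variable'].to_numpy(),
                                  long['value'].to_numpy())})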
I don't think pandas buys you anything in this process, and especially if you have a large amount of data you don't want to read that all into memory and reprocess it into another large data structure.
import csv

with open('input.csv', 'r') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        for key in reader.fieldnames:  # DictReader exposes the header as fieldnames
            value = int(row[key])
            for i in range(value):
                print(key)
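If the expanded data should go back to disk rather than stdout, the same loop can stream straight into an output file; the file names here are illustrative:
import csv

with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.writer(fout)
    writer.writerow(['values'])
    for row in reader:
        for key in reader.fieldnames:
            # emit the column name once per count in this row
            for _ in range(int(row[key])):
                writer.writerow([key])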

How do I merge categories for crosstab in pandas where some categories are common?

A while ago I asked this question, but that does not cover the case where two merged categories share a common category.
There I wanted to merge the categories A and B into AB. What if I have categories A, B, C and I want to merge A, B into AB, and B, C into BC?
Suppose I have the data:
+---+---+
| X | Y |
+---+---+
| A | D |
| B | D |
| B | E |
| B | D |
| A | E |
| C | D |
| C | E |
| B | E |
+---+---+
I want the cross-tab to look like:
+--------+---+---+
| X/Y | D | E |
+--------+---+---+
| A or B | 3 | 3 |
| B or C | 3 | 2 |
| C | 1 | 1 |
+--------+---+---+
I think you can build the crosstab over all unique values and then sum the rows by selecting the relevant categories from the index:
df = pd.crosstab(df.X, df.Y)
df.loc['A or B'] = df.loc[['A','B']].sum()
df.loc['B or C'] = df.loc[['C','B']].sum()
df = df.drop(['A','B'])
print (df)
Y       D  E
X
C       1  1
A or B  3  3
B or C  3  3
EDIT: If you want a general solution, it is not easy, because it is necessary to repeat groups under renamed labels, like:
df1 = df[df['X'] == 'B'].assign(X = 'B or C')
df2 = df[df['X'] == 'C']
df = pd.concat([df, df1], ignore_index=True)
df['X'] = df['X'].replace({'A':'A or B', 'B': 'A or B', 'C': 'B or C'})
df = pd.concat([df, df2], ignore_index=True)
df = pd.crosstab(df.X, df.Y)
print (df)
Y       D  E
X
A or B  3  3
B or C  3  3
C       1  1
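If the overlapping groups change often, one possible generalization of the first approach is a mapping from each output label to the categories it covers; the groups dict below is illustrative, and df is the raw data again:
groups = {'A or B': ['A', 'B'], 'B or C': ['B', 'C'], 'C': ['C']}

ct = pd.crosstab(df.X, df.Y)
# sum the crosstab rows listed for each output label
out = pd.DataFrame({label: ct.loc[members].sum() for label, members in groups.items()}).T
print(out)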

How to get the aggregate result only on the first occurrence, and 0 for the others, in a computation with groupby and transform?

I would like to group and sum a DataFrame without changing the number of rows, applying the aggregation to the first occurrence of each group only.
Initial DF:
C1 | Val
a | 1
a | 1
b | 1
c | 1
c | 1
Wanted DF:
C1 | Val
a | 2
a | 0
b | 1
c | 2
c | 0
I tried to apply the following code:
df.groupby(['C1'])['Val'].transform('sum')
which propagates the aggregated result to all rows of each group. However, transform does not seem to have an argument that applies the result to the first or last occurrence only.
Indeed, what I currently get is:
C1 | Val
a | 2
a | 2
b | 1
c | 2
c | 2
Using pandas.DataFrame.groupby:
s = df.groupby('C1')['Val']
v = s.sum().values                   # one sum per group, in group-key order
df.loc[:, 'Val'] = 0                 # zero everything out first
df.loc[s.head(1).index, 'Val'] = v   # write the sums onto each group's first row
print(df)
Output:
C1 Val
0 a 2
1 a 0
2 b 1
3 c 2
4 c 0
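One caveat: writing v onto s.head(1).index relies on the order of sum() matching the order of first occurrences, which holds when the frame is sorted by C1 (as here). A sketch of a variant that avoids that assumption, using transform and duplicated:
# group sum on the first row of each C1 value, 0 everywhere else
df['Val'] = df.groupby('C1')['Val'].transform('sum').where(~df.duplicated('C1'), 0)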

How to split columns when content isn't aligned

I have a CSV file with survey data. One of the columns contains responses from a multi-select question. The values in that column are separated by ";"
| Q10        |
|------------|
| A; B; C    |
| A; B; D    |
| A; D       |
| A; D; E    |
| B; C; D; E |
I want to split the column into multiple columns, one for each option:
| A | B | C | D | E |
---------------------
| A | B | C | | |
| A | B | | D | |
| A | | | D | |
| A | | | D | E |
| | B | C | D | E |
Is there any way to do this in Excel or Python, or some other way?
Here is a simple formula that does what is asked:
=IF(ISNUMBER(SEARCH("; "&B$1&";","; "&$A2&";")),B$1,"")
This assumes there is always a space between the ";" and the lookup value. If not, we can strip the spaces with SUBSTITUTE:
=IF(ISNUMBER(SEARCH(";"&B$1&";",";"&SUBSTITUTE($A2," ","")&";")),B$1,"")
I know this question has been answered, but for those looking for a Python way to solve it, here it is (maybe not the most efficient way, though):
First split the column values, explode them, and get the dummies. Next, group the dummy values back together across the given 5 (or N) columns:
df['Q10'] = df['Q10'].str.split('; ')
df = df.explode('Q10')
df = pd.get_dummies(df, columns=['Q10'])
dummy_col_list = df.columns.tolist()
df['New'] = df.index
new_df = df.groupby('New')[dummy_col_list].sum().reset_index()
del new_df['New']
You will get:
   Q10_A  Q10_B  Q10_C  Q10_D  Q10_E
0      1      1      1      0      0
1      1      1      0      1      0
2      1      0      0      1      0
3      1      0      0      1      1
4      0      1      1      1      1
Now, if you want, you can rename the columns and replace each 1 with the column name:
colName = new_df.columns.tolist()
newColList = []
for i in colName:
    newColName = i.split('_', 1)[1]
    newColList.append(newColName)
new_df.columns = newColList
for col in list(new_df.columns):
    new_df[col] = np.where(new_df[col] == 1, col, '')
Final output:
   A  B  C  D  E
0  A  B  C
1  A  B     D
2  A        D
3  A        D  E
4     B  C  D  E
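For what it's worth, pandas can also build the indicator columns in one step with Series.str.get_dummies; a shorter sketch of the same result, starting again from the raw single-column df:
dummies = df['Q10'].str.get_dummies(sep='; ')   # one 0/1 column per option
result = dummies.apply(lambda col: np.where(col == 1, col.name, ''))
print(result)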
If you want to do the job in Python:
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
df['A'] = np.where(df.Q10.str.contains('A'), 'A', '')
df['B'] = np.where(df.Q10.str.contains('B'), 'B', '')
df['C'] = np.where(df.Q10.str.contains('C'), 'C', '')
df['D'] = np.where(df.Q10.str.contains('D'), 'D', '')
df['E'] = np.where(df.Q10.str.contains('E'), 'E', '')
df.drop('Q10', axis=1, inplace=True)
df
Output:
   A  B  C  D  E
0  A  B  C
1  A  B     D
2  A        D
3  A        D  E
4     B  C  D  E
It's not the most efficient way, but it works ;)
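One caveat with str.contains here: it does substring matching, so it would misfire if option labels overlapped (say 'A' and 'AB'). Anchoring the match between the ';' separators avoids that; a sketch using the example's option names, applied before Q10 is dropped:
for opt in ['A', 'B', 'C', 'D', 'E']:
    # match the option only as a whole token between ';' separators
    pattern = rf'(?:^|;\s*){opt}(?:\s*;|$)'
    df[opt] = np.where(df.Q10.str.contains(pattern), opt, '')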

Rolling up data frame along with count of rows in python

I am still in a learning phase with Python and wanted to know how to roll up the data and count the duplicate rows in a column called Count.
The data frame structure is as follows
Col1| Value
A | 1
B | 1
A | 1
B | 1
C | 3
C | 3
C | 3
C | 3
My result should be as follows
Col1|Value|Count
A | 1 | 2
B | 1 | 2
C | 3 | 4
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
Col1 Value Count
0 A 1 2
1 B 1 2
2 C 3 4
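A small simplification: the rename step can be folded into reset_index, which accepts a name for the counts column:
>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index(name='Count')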
Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
Col1 Value Count
0 a 1 2
1 a 2 1
2 b 3 1
3 c 2 1
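Passing the column labels directly is equivalent and slightly shorter:
df.groupby(list(df.columns)).size().reset_index(name='Count')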
You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')
