Pandas - Sum of Column A where Column B is in Column C - python

I have the following dataframe. Notice that Column B is a series of lists; this is what is giving me trouble.
Dataframe 1:
   Column A  Column B
0        10       [X]
1        20     [X,Y]
2        15   [X,Y,Z]
3        25       [A]
4        60       [B]
I want to take all of the values in Column C (below), check if they exist in Column B, and then sum their values from Column A.
DataFrame 2: (Desired Output)
  Column C  Sum of Column A
0        X               45
1        Y               35
2        Z               15
3        Q                0
4        R                0
I know this can be accomplished using a for-loop, but I am looking for the "pandonic method" to solve this.

update
Here is a shorter and faster answer, beginning with your second dataframe:
df2['C'].apply(lambda x: df.loc[df['B'].apply(lambda y: x in y), 'A'].sum())
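A hedged note on scaling: this nests one apply inside another, so it rescans df once for every row of df2. That is fine at this size, but the lookup-table approach below only passes over df once.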
original answer
You can first 'normalize' the data in Column B.
df_normal = pd.concat([df.A, df.B.apply(pd.Series)], axis=1)
    A  0    1    2
0  10  X  NaN  NaN
1  20  X    Y  NaN
2  15  X    Y    Z
3  25  A  NaN  NaN
4  60  B  NaN  NaN
And then stack and groupby to get the lookup table.
df_lookup = df_normal.set_index('A') \
                     .stack() \
                     .reset_index(name='group') \
                     .groupby('group')['A'].sum()
group
A    25
B    60
X    45
Y    35
Z    15
Name: A, dtype: int64
And then join to df2.
df2.join(df_lookup, on='C').fillna(0)
   C     A
0  X  45.0
1  Y  35.0
2  Z  15.0
3  Q   0.0
4  R   0.0
And in one line:
df2.join(
    df.set_index('A')['B']
      .apply(pd.Series)
      .stack()
      .reset_index('A', name='group')
      .groupby('group')['A']
      .sum(), on='C') \
   .fillna(0)
And if you wanted to loop, which isn't that bad in this situation:
d = {}
for _, row in df.iterrows():
    for var in row['B']:
        if var in d:
            d[var] += row['A']
        else:
            d[var] = row['A']

df2.join(pd.Series(d, name='Sum of A'), on='C').fillna(0)
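As a side note, on pandas 0.25 or later DataFrame.explode builds the same lookup table more directly. A minimal sketch, assuming the df/df2 with single-letter column names used throughout this answer:
lookup = df.explode('B').groupby('B')['A'].sum()   # one row per list element, then sum A per letter
df2.join(lookup.rename('Sum of A'), on='C').fillna(0)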

Based on your example data:
df1 = df.set_index('Column A')['Column B'] \
        .apply(pd.Series).stack().reset_index() \
        .groupby([0])['Column A'].sum().to_frame()
df2['Sum of Column A']=df2['Column C'].map(df1['Column A'])
df2.fillna(0)
Out[604]:
  Column C  Sum of Column A
0        X             45.0
1        Y             35.0
2        Z             15.0
3        Q              0.0
4        R              0.0
Data input:
df = pd.DataFrame({'Column A':[10,20,15,25,60],'Column B':[['X'],['X','Y'],['X','Y','Z'],['A'],['B']]})
df2 = pd.DataFrame({'Column C':['X','Y','Z','Q','R']})

I'd use a list comprehension like:
df2['Sum of Column A'] = [sum(a for a, b in zip(df['Column A'], df['Column B']) if c in b)
                          for c in df2['Column C']]
Since the sum of an empty sequence is 0, letters like Q and R that appear in no list come out as 0 with no fillna step needed.

Related

How to insert a pandas dataframe having a single csv column into MySQL Database

I have a pandas dataframe that I read from a Google Sheet.
I then added the tag column using:
df['tag'] = df.filter(like = 'Subject', axis = 1).apply(lambda x: np.where(x == 'Y', x.name,'')).values.tolist()
df['tag'] = df['tag'].apply(lambda x: [i for i in x if i!= ''])
Resultant sample DataFrame:
   Id Name Subject-A Subject-B  Total                     tag
0   1    A         Y              100             [Subject-A]
1   2    B                   Y     98             [Subject-B]
2   3    C         Y         Y    191  [Subject-A, Subject-B]
3   4    D                   Y    100             [Subject-B]
4   5    E                   Y     95             [Subject-B]
Then I export the dataframe to a MySQL database after converting the tag column into a comma-separated string with:
df['tag'] = df['tag'].map(lambda x : ', '.join(str(i) for i in x)).str.replace('Subject-','')
df
   Id Name Subject-A Subject-B  Total   tag
0   1    A         Y              100     A
1   2    B                   Y     98     B
2   3    C         Y         Y    191  A, B
3   4    D                   Y    100     B
4   5    E                   Y     95     B
df.to_sql(name = 'table_name', con = conn, if_exists = 'replace', index = False)
But in the MySQL database the tag column is:
A,
,B
A,B
,B
,B
My actual data has many such "Subject" columns so the result looks like:
, , , D
A, ,C,
...
...
Could someone please let me know why it gives the expected output in Pandas, but the column looks different when I save the dataframe to Cloud SQL? The expected output in the MySQL database is the same as how the tag column appears in Pandas.
Here is an alternative solution; it seems to be a data-related problem.
First filter the Subject columns and remove the Subject- prefix, then use DataFrame.dot with the column names plus a separator, and finally strip the trailing separator from the right side:
df1 = df.filter(like='Subject').rename(columns=lambda x: x.replace('Subject-', ''))
print (df1)
     A    B
0    Y  NaN
1  NaN    Y
2    Y    Y
3  NaN    Y
4  NaN    Y
df['tag'] = df1.eq('Y').dot(df1.columns + ', ').str.rstrip(', ')
print (df)
   Id Name Subject-A Subject-B  Total   tag
0   1    A         Y       NaN    100     A
1   2    B       NaN         Y     98     B
2   3    C         Y         Y    191  A, B
3   4    D       NaN         Y    100     B
4   5    E       NaN         Y     95     B
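If the stray commas persist, the cells that look empty may hold empty strings rather than NaN, which is common when reading from a Google Sheet. A self-contained sketch under that assumption (the sample frame here is hypothetical):
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'Name': list('ABCDE'),
                   'Subject-A': ['Y', '', 'Y', '', ''],
                   'Subject-B': ['', 'Y', 'Y', 'Y', 'Y'],
                   'Total': [100, 98, 191, 100, 95]})

df1 = df.filter(like='Subject').rename(columns=lambda x: x.replace('Subject-', ''))
# eq('Y') is False for both '' and NaN, so the dot trick builds clean tags either way
df['tag'] = df1.eq('Y').dot(df1.columns + ', ').str.rstrip(', ')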

Change values of one column based on values of other column pandas dataframe

I have this pandas dataframe:
id    A  B
1   NaN  0
2   NaN  1
3     6  0
4   NaN  1
5    12  1
6    14  0
I want to change the value of all NaN in 'A' based on the value of 'B'.
For example, if B = 0, A should be a random number in [0, 1];
if B = 1, A should be a random number in [1, 3].
How do I do this?
Solution if performance is important - generate random values by the length of the DataFrame and then assign values by conditions:
Use numpy.random.randint to generate the random values and pass them to numpy.select, with conditions chained by & for bitwise AND; the comparisons use Series.isna and Series.eq:
a = np.random.randint(0,2, size=len(df)) #generate 0,1
b = np.random.randint(1,4, size=len(df)) #generate 1,2,3
m1 = df.A.isna()
m2 = df.B.eq(0)
m3 = df.B.eq(1)
df['A'] = np.select([m1 & m2, m1 & m3],[a, b], df.A)
print (df)
   id     A  B
0   1   1.0  0
1   2   3.0  1
2   3   6.0  0
3   4   3.0  1
4   5  12.0  1
5   6  14.0  0
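For repeatable results you could draw from a seeded generator instead; a sketch under the same setup, using NumPy's default_rng (NumPy 1.17+):
rng = np.random.default_rng(0)           # seeded, so reruns give identical draws
a = rng.integers(0, 2, size=len(df))     # values from {0, 1}
b = rng.integers(1, 4, size=len(df))     # values from {1, 2, 3}
df['A'] = np.select([df.A.isna() & df.B.eq(0), df.A.isna() & df.B.eq(1)], [a, b], df.A)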

Why does pd.DataFrame with pd.isnull fail?

tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(pd.isnull(tt).astype(int), index = tt.index, columns=map(lambda x: x + '_'+'NA',tt.columns))
bb
I want to create this dataframe with pd.isnull(tt), with the column names containing _NA, but why does this fail?
Using .values:
tt = pd.DataFrame({'a':[1,2,None,3],'b':[None,3,4,5]})
bb=pd.DataFrame(data=pd.isnull(tt).astype(int).values, index = tt.index, columns=list(map(lambda x: x + '_'+'NA',tt.columns)))
The reason why:
pandas carries over the columns and index when a DataFrame is passed to the constructor, and pd.isnull(tt).astype(int) already has a and b as its column names, so your new column names don't align and the data comes out as NaN.
More information
bb = pd.DataFrame(data=pd.isnull(tt).astype(int), index=tt.index, columns=['a', 'b', 'a_NA', 'b_NA'])
bb
Out[399]:
   a  b  a_NA  b_NA
0  0  1   NaN   NaN
1  0  0   NaN   NaN
2  1  0   NaN   NaN
3  0  0   NaN   NaN
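A shorter route that sidesteps the alignment issue entirely is to rename on the result itself; a sketch using DataFrame.add_suffix:
bb = tt.isna().astype(int).add_suffix('_NA')   # columns become a_NA, b_NA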

Python, how to fill nulls in a data frame using a dictionary

I have a dataframe df something like this
A B C
1 'x' 15.0
2 'y' NA
3 'z' 25.0
and a dictionary dc something like
dc = {'x':15,'y':35,'z':25}
I want to fill all nulls in column C of the dataframe using values of column B from the dictionary. So that my dataframe will become
A B C
1 'x' 15
2 'y' 35
3 'z' 25
Could anyone help me with how to do that, please?
Thanks,
Manoj
You can use fillna with map:
dc = {'x':15,'y':35,'z':25}
df['C'] = df.C.fillna(df.B.map(dc))
df
# A B C
#0 1 x 15.0
#1 2 y 35.0
#2 3 z 25.0
Alternatively, with np.where:
df['C'] = np.where(df['C'].isnull(), df['B'].apply(lambda x: dc[x]), df['C'])
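One note, hedged: df.B.map(dc) simply yields NaN for any B value missing from the dictionary, so those nulls stay NaN, whereas dc[x] in the apply version raises a KeyError. A sketch using dict.get to make the second version safe for that case:
df['C'] = np.where(df['C'].isnull(), df['B'].apply(lambda x: dc.get(x, np.nan)), df['C'])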

create binary columns in a dataframe from condition on its value

I have a dataframe that looks like this one:
df = pd.DataFrame(np.nan, index=[0,1,2,3], columns=['A','B','C'])
df.iloc[0,0] = 'a'
df.iloc[1,0] = 'b'
df.iloc[1,1] = 'c'
df.iloc[2,0] = 'b'
df.iloc[3,0] = 'c'
df.iloc[3,1] = 'b'
df.iloc[3,2] = 'd'
df
out:
   A    B    C
0  a  NaN  NaN
1  b    c  NaN
2  b  NaN  NaN
3  c    b    d
And I would like to add new columns to it whose names are the values inside the dataframe (here 'a', 'b', 'c', and 'd'). Those columns are binary and reflect whether the values 'a', 'b', 'c', and 'd' appear in the row.
In one picture, the output I'd like is:
   A    B    C  a  b  c  d
0  a  NaN  NaN  1  0  0  0
1  b    c  NaN  0  1  1  0
2  b  NaN  NaN  0  1  0  0
3  c    b    d  0  1  1  1
To do this I first create the columns filled with zeros:
cols = pd.Series(df.values.ravel()).value_counts().index
for col in cols:
    df[col] = 0
(It doesn't create the columns in the right order, but that doesn't matter)
Then I... use a loop over the rows and columns...
for row in df.index:
    for col in cols:
        if col in df.loc[row].values:
            df.loc[row, col] = 1
You'll see why I'm looking for another way to do it: even though my dataframe is relatively small (76k rows), this takes around 8 minutes, which is far too long.
Any idea?
You're looking for get_dummies. Here I choose to use the .str version:
df.fillna('', inplace=True)
(df.A + '|' + df.B + '|' + df.C).str.get_dummies()
Output:
   a  b  c  d
0  1  0  0  0
1  0  1  1  0
2  0  1  0  0
3  0  1  1  1
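Another route, sketched under the assumption that df still holds its original NaN (the fillna above mutates it in place): stack drops the NaN, get_dummies encodes what remains, and a max over the row level collapses back to one row per index. (As a side note, str.get_dummies splits on '|' by default, which is why that separator was used in the concatenation above.)
dummies = pd.get_dummies(df[['A', 'B', 'C']].stack()).groupby(level=0).max()
df.join(dummies)   # appends the binary a, b, c, d columns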
