Combine two dataframes into one and aggregate the sums - python

I have two dataframes, df1 and df2:
import pandas as pd
df1 = pd.DataFrame({'name': ['A', 'B', 'C'],
                    'value': [100, 300, 150]})
df2 = pd.DataFrame({'name': ['A', 'B', 'D'],
                    'value': [20, 50, 7]})
I want to combine these two dataframes into a new dataframe df3 (the rows of df1 followed by the rows of df2). Then I want a fourth dataframe df4 where the rows are aggregated into sums, like this:
df4 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'value': [120, 350, 150, 7]})
How can I do this?

You can concatenate the DataFrames together, then use groupby and sum:
df3 = pd.concat([df1, df2])                    # stack the rows of df1 and df2
df4 = df3.groupby('name').sum().reset_index()  # sum 'value' per name
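For reference, the intermediate df3 is simply the two frames stacked, keeping each frame's original index:
  name  value
0    A    100
1    B    300
2    C    150
0    A     20
1    B     50
2    D      7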
Result of df4:
  name  value
0    A    120
1    B    350
2    C    150
3    D      7
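A small equivalent variant: grouping with as_index=False keeps 'name' as a column and avoids the reset_index step:
df4 = df3.groupby('name', as_index=False)['value'].sum()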

Another way is append (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0):
df1.append(df2, ignore_index=True).groupby('name')['value'].sum().to_frame()
      value
name
A       120
B       350
C       150
D         7
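On pandas 2.0+, where DataFrame.append no longer exists, the same one-liner can be written with pd.concat:
pd.concat([df1, df2], ignore_index=True).groupby('name')['value'].sum().to_frame()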

Related

How to replace selected columns in a few rows

My dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'c1': [10, 11, 12, 13],
                   'c2': [100, 110, 120, 130],
                   'c3': [100, 110, 120, 130],
                   'c4': ['A', np.nan, np.nan, 'B']})
I need to replace the values in columns c2 and c3 with values from another dataframe, matching on column 'c4'.
The replacer df:
df_replacer = pd.DataFrame({'c2': [11, 22], 'c3': [99, 299], 'c4': ['A', 'B']})
Below is how I am doing it. (Is there a cleaner way?)
df = df.merge(df_replacer, on=['c4'], how='left')
df.loc[~df.c4.isna(), 'c2_x'] = df['c2_y']
df.loc[~df.c4.isna(), 'c3_x'] = df['c3_y']
df = df.rename({'c2_x': 'c2', 'c3_x': 'c3'}, axis=1)
df = df[['c1', 'c2', 'c3', 'c4']]
I don't see a way to do it without the merge, but you could do something like this:
df = df.merge(df_replacer, on='c4', how='left', suffixes=('', '_replacer'))
df['c2'] = np.where(df['c2_replacer'].notnull(), df['c2_replacer'], df['c2'])
df['c3'] = np.where(df['c3_replacer'].notnull(), df['c3_replacer'], df['c3'])
df = df.drop(['c2_replacer', 'c3_replacer'], axis=1)
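The two np.where lines can also be written a bit more tersely with fillna, which takes the replacer value where it exists and keeps the original otherwise (an equivalent sketch, to run before the drop):
df['c2'] = df['c2_replacer'].fillna(df['c2'])
df['c3'] = df['c3_replacer'].fillna(df['c3'])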
Another option is DataFrame.update, matching the two DataFrames on c4 via the index:
# list of columns to update
cols = ['c2', 'c3']
# set the index to the column used for matching the two DFs
df.set_index('c4', inplace=True)
df_replacer.set_index('c4', inplace=True)
# use update to replace values in df where the indexes align
df.update(df_replacer[cols])
# reset the index
df.reset_index()
    c4  c1     c2     c3
0    A  10   11.0   99.0
1  NaN  11  110.0  110.0
2  NaN  12  120.0  120.0
3    B  13   22.0  299.0
(Rows whose c4 is NaN have no match in df_replacer, so they keep their original values.)
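Starting again from the original df and df_replacer, another sketch (assuming c4 values are unique in df_replacer) builds one lookup Series per column with set_index and map, keeping the old value where c4 has no match:
lookup = df_replacer.set_index('c4')
for col in ['c2', 'c3']:
    df[col] = df['c4'].map(lookup[col]).fillna(df[col])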

How to select rows multiple times from a data frame when their IDs appear multiple times in a list, without changing the column order?

Suppose I have a dataframe:
data = {'Date': ['22-08-2021', '12-09-2021', '02-10-2021', '22-11-2021'],
        'ID': ['A', 'B', 'C', 'O'],
        'Item': ['Apple', 'Banana', 'Carrot', 'Orange'],
        'Cost': [10, 12, 15, 13]}
dataframe = pd.DataFrame(data)
dataframe
And a list of indices,
index_list = ['A', 'A', 'B', 'B', 'O', 'C', 'C']
And I want to select rows by ID, repeated as many times as each ID appears in this list, so that the above dataframe becomes
data2 = {'Date': ['22-08-2021', '22-08-2021', '12-09-2021', '12-09-2021', '22-11-2021', '02-10-2021', '02-10-2021'],
         'ID': ['A', 'A', 'B', 'B', 'O', 'C', 'C'],
         'Item': ['Apple', 'Apple', 'Banana', 'Banana', 'Orange', 'Carrot', 'Carrot'],
         'Cost': [10, 10, 12, 12, 13, 15, 15]}
dataframe2 = pd.DataFrame(data2)
dataframe2
What's the best way to do this using Pandas?
My approach:
I wrote the following for loop to achieve this, but I think there should be built-in pandas functions that do this in a more elegant and efficient way.
dataframe2 = pd.DataFrame(columns=dataframe.columns)
for i in index_list:
    idx = dataframe.index[dataframe['ID'] == i]
    dataframe2 = pd.concat([dataframe2, dataframe.loc[idx]])
dataframe2
Any help will be appreciated.
You can use .reset_index() to add the index as a normal column, then set_index() and .loc[] to fetch rows by ID. Once you know the original indexes of the rows you want, you can use .loc[] again to get them.
>>> orig_indexes = dataframe.reset_index().set_index('ID').loc[index_list, 'index']
>>> dataframe.loc[orig_indexes]
         Date ID    Item  Cost
0  22-08-2021  A   Apple    10
0  22-08-2021  A   Apple    10
1  12-09-2021  B  Banana    12
1  12-09-2021  B  Banana    12
3  22-11-2021  O  Orange    13
2  02-10-2021  C  Carrot    15
2  02-10-2021  C  Carrot    15
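A more compact sketch of the same idea: index by ID, select with the list, then restore the original column order (reset_index would otherwise move ID to the front, and the original row index is not preserved):
out = dataframe.set_index('ID').loc[index_list].reset_index()
out = out[dataframe.columns]  # back to Date, ID, Item, Cost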

Add a column to a dataframe by looking up values in another dataframe

Consider these two dataframes:
index = [0, 1, 2, 3]
columns = ['col0', 'col1']
data = [['A', 'D'],
        ['B', 'E'],
        ['C', 'F'],
        ['A', 'D']]
df1 = pd.DataFrame(data, index, columns)
df2 = pd.DataFrame(data=[10, 20, 30, 40],
                   index=pd.MultiIndex.from_tuples([('A', 'D'), ('B', 'E'), ('C', 'F'), ('X', 'Z')]),
                   columns=['col2'])
I want to add a column to df1 holding the value looked up in df2. The expected result would look like this:
index = [0, 1, 2, 3]
columns = ['col0', 'col1', 'col2']
data = [['A', 'D', 10],
        ['B', 'E', 20],
        ['C', 'F', 30],
        ['A', 'D', 10]]
df3 = pd.DataFrame(data, index, columns)
What is the best way to achieve this? I am wondering if it should be done with a dictionary and map, or perhaps something simpler.
Merge normally:
pd.merge(df1, df2, left_on=["col0", "col1"], right_index=True, how="left")
Output:
  col0 col1  col2
0    A    D    10
1    B    E    20
2    C    F    30
3    A    D    10
Try this:
indexes = list(map(tuple, df1.values))
df1["col2"] = df2.loc[indexes, "col2"].values
Output of print(df1):
  col0 col1  col2
0    A    D    10
1    B    E    20
2    C    F    30
3    A    D    10
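The dictionary-and-map idea from the question also works as a sketch, starting from the original df1 and df2: a Series with a MultiIndex converts to a dict keyed by tuples:
lookup = df2['col2'].to_dict()  # keys like ('A', 'D')
df1['col2'] = [lookup.get(pair) for pair in zip(df1['col0'], df1['col1'])]
Unmatched pairs come back as None here, whereas .loc on a missing key would raise a KeyError.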

Lowercase columns by name using dataframe method

I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name, to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.
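One caveat worth noting with the question's .astype(str) approach: it turns NaN into the literal string 'nan'. If NaN should stay NaN, .str.lower() alone already skips missing values, as in this sketch:
df[to_lower] = df[to_lower].apply(lambda s: s.str.lower())  # NaN stays NaN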

Reshaping pandas data frame into as many columns as there are repeating rows

I have this data frame:
>>> df = pd.DataFrame({'Place': ['A', 'A', 'B', 'B', 'C', 'C'],
...                    'Var': ['All', 'French', 'All', 'German', 'All', 'Spanish'],
...                    'Values': [250, 30, 120, 12, 200, 112]})
>>> df
  Place  Values      Var
0     A     250      All
1     A      30   French
2     B     120      All
3     B      12   German
4     C     200      All
5     C     112  Spanish
It has a repeating pattern of two rows for every Place. I want to reshape it so it's one row per Place and the Var column becomes two columns, one for "All" and one for the other value.
Like so:
Place  All  Language  Value
A      250  French       30
B      120  German       12
C      200  Spanish     112
A pivot table would make a column for each unique value, and I don't want that.
What's the reshaping method for this?
Because the data appears in an alternating pattern, we can conceptualize the transformation in two steps.
Step 1:
Go from
a,a,a
b,b,b
To
a,a,a,b,b,b
Step 2: drop redundant columns.
The following solution applies reshape to the values of the DataFrame; the arguments to reshape are (-1, df.shape[1] * 2), which say "give me a frame that has twice as many columns and as many rows as you can manage".
Then I hardwired the column indexes for the filter, [0, 1, 4, 5], based on your data layout. The resulting NumPy array has 4 columns, so we pass it into a DataFrame constructor along with the correct column names.
It is an unreadable solution that depends on the df layout and produces the columns in the wrong order:
import pandas as pd
df = pd.DataFrame({'Place': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Var': ['All', 'French', 'All', 'German', 'All', 'Spanish'],
                   'Values': [250, 30, 120, 12, 200, 112]})
df = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2)[:, [0, 1, 4, 5]],
                  columns=['Place', 'All', 'Value', 'Language'])
A different approach:
df = pd.DataFrame({'Place': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Var': ['All', 'French', 'All', 'German', 'All', 'Spanish'],
                   'Values': [250, 30, 120, 12, 200, 112]})
df1 = df.set_index('Place').pivot(columns='Var')
df1.columns = df1.columns.droplevel()
df1 = df1.set_index('All', append=True).stack().reset_index()
print(df1)
Output:
  Place    All      Var      0
0     A  250.0   French   30.0
1     B  120.0   German   12.0
2     C  200.0  Spanish  112.0
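A more readable sketch that does not depend on the physical row order: split the 'All' rows from the language rows and merge them back on Place:
all_rows = df.loc[df['Var'] == 'All', ['Place', 'Values']].rename(columns={'Values': 'All'})
lang_rows = df.loc[df['Var'] != 'All'].rename(columns={'Var': 'Language', 'Values': 'Value'})
result = all_rows.merge(lang_rows, on='Place')[['Place', 'All', 'Language', 'Value']]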
