Sum of only specific columns in a pandas dataframe - python

Consider a dataframe with a few columns
and a list ['salary','gross exp']
I want to sum only the columns named in the list and save the result to the dataframe
To put this in perspective, the columns in the list ['salary','gross exp'] are money related, so it makes sense to sum them and leave the other columns untouched
P.S.: I have several Excel workbooks to work on, each with a few tens of sheets, so doing it manually is not an option
An Excel macro would also be fine, if that's possible
TIA

Working with the following example:
import pandas as pd
list_ = ['sallary', 'gross exp']
d = {'sallary': [1,2,3], 'gross exp': [2,2,2], 'another column': [10,10,10]}
df = pd.DataFrame(d)
df
   sallary  gross exp  another column
0        1          2              10
1        2          2              10
2        3          2              10
You can then compute the sum of the columns from the list and append it as a new row. DataFrame.append was removed in pandas 2.0, so pd.concat is used here; columns that are not in the list are left as NaN in the new row:
totals = df[list_].sum()
df = pd.concat([df, totals.to_frame().T], ignore_index=True)
df
   sallary  gross exp  another column
0        1          2            10.0
1        2          2            10.0
2        3          2            10.0
3        6          6             NaN
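Since the question mentions many workbooks with many sheets, the same idea can be wrapped in a small helper and applied to every sheet. The sketch below is only an illustration: append_totals and the two in-memory sheets are hypothetical names, and the dictionary stands in for what pd.read_excel(path, sheet_name=None) returns (a dict mapping sheet names to dataframes).

```python
import pandas as pd

def append_totals(df, cols):
    """Return a copy of df with one extra row holding the sums of `cols`.

    Columns not listed in `cols` are left as NaN in the new row.
    (Hypothetical helper, not part of pandas.)
    """
    totals = df[cols].sum().to_frame().T
    return pd.concat([df, totals], ignore_index=True)

# Stand-in for pd.read_excel(path, sheet_name=None): sheet name -> dataframe
sheets = {
    "sheet_a": pd.DataFrame({"salary": [1, 2, 3], "gross exp": [2, 2, 2], "dept": ["x", "y", "z"]}),
    "sheet_b": pd.DataFrame({"salary": [10, 20], "gross exp": [5, 5], "dept": ["x", "y"]}),
}

money_cols = ["salary", "gross exp"]
with_totals = {name: append_totals(frame, money_cols) for name, frame in sheets.items()}
```

Each resulting dataframe could then be written back with DataFrame.to_excel, one sheet at a time.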

Related

Merge dataframes based on substrings

I want to merge/join two large dataframes, where the values in the 'matching' column of the right dataframe are assumed to be substrings of the 'id' column of the left dataframe.
For illustration purposes:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So, as described, I want to merge the additional columns only when the 'matching' value is a substring of the 'id' value. If it is the other way around, e.g. 'abc' is a substring of 'abcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where 'id's contain multiple 'matching's. For the moment, it is okish to ignore these cases but I'd like to learn how I can tackle this issue. And additionally, is it possible to mark out the rows that are merged based on substrings and the rows that are merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
If your processing can handle a CROSS JOIN (problematic with large datasets), you can cross join the two dataframes and then keep only the rows you want.
cross = df1.merge(df2, how='cross') #all combinations of rows from df1 and df2
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1) #create mask of booleans
final = cross[mask] #get only those where condition was met
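The question also asks how to tell exact matches from substring matches and, judging by the desired output, wants ids without any match to be kept. A minimal sketch of one way to do this on top of the cross join is shown below; the match_type column name is an assumption made for illustration, and how='cross' needs pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': ['abc', 'adcfek', 'acefeasdq'],
                    'numbers': [1, 2, np.nan],
                    'add_info': [3123, np.nan, 312441]})
df2 = pd.DataFrame({'matching': ['adc', 'fek', 'acefeasdq', 'abcef', 'acce', 'dcf'],
                    'needed_info': [1, 2, 3, 4, 5, 6],
                    'other_info': [22, 33, 11, 44, 55, 66]})

# All combinations of rows, then keep the pairs where 'matching' occurs inside 'id'
cross = df1.merge(df2, how='cross')
is_sub = np.array([m in i for i, m in zip(cross['id'], cross['matching'])])
matched = cross[is_sub].copy()

# Label each kept pair (hypothetical column name 'match_type')
matched['match_type'] = np.where(matched['id'] == matched['matching'], 'exact', 'substring')

# Left-merge back onto df1 so that ids without any match survive with NaN
extra_cols = ['id', 'matching', 'needed_info', 'other_info', 'match_type']
result = df1.merge(matched[extra_cols], on='id', how='left')
```

With the sample data, 'adcfek' picks up three substring matches ('adc', 'fek' and 'dcf'), so an id that contains several 'matching' values simply yields several rows.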

How do I create a rank table for a given pandas dataframe with multiple numerical columns?

I would like to create a rank table based on a multi-column pandas dataframe, with several numerical columns.
Let's use the following df as an example:
Name  Sales  Volume  Reviews
A      1000     100      100
B      2000     200       50
C      5400     500       10
I would like to create a new table, ranked_df that ranks the values in each column by descending order while maintaining essentially the same format:
Name  Sales_rank  Volume_rank  Reviews_rank
A              3            3             1
B              2            2             2
C              1            1             3
Now, I can iteratively do this by looping through the columns, i.e.
df = pd.DataFrame({
    "Name": ['A', 'B', 'C'],
    "Sales": [1000, 2000, 5400],
    "Volume": [100, 200, 500],
    "Reviews": [1000, 2000, 5400]
})
# make a copy of the original df
ranked_df = df.copy()
# define our interested columns
interest_cols = ['Sales', 'Volume', 'Reviews']
for col in interest_cols:
    # descending order: the largest value gets rank 1
    ranked_df[f"{col}_rank"] = df[col].rank(ascending=False)
# drop the cols not needed
...
But my question is this: is there a more elegant - or pythonic way of doing this? Maybe an apply for the dataframe? Or some vectorized operation by throwing it to numpy?
Thank you.
df.set_index('Name').rank().reset_index()
Name Sales Volume Reviews
0 A 1.0 1.0 1.0
1 B 2.0 2.0 2.0
2 C 3.0 3.0 3.0
You could use transform/apply to hit each column
df.set_index('Name').transform(pd.Series.rank, ascending = False)
Sales Volume Reviews
Name
A 3.0 3.0 1.0
B 2.0 2.0 2.0
C 1.0 1.0 3.0
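To get exactly the layout asked for (integer ranks, a _rank suffix, Name as a regular column), the calls above can be chained. This is a sketch that uses the values from the question's example table (Reviews of 100, 50, 10); method="min" is an assumption about how ties should be handled.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sales": [1000, 2000, 5400],
    "Volume": [100, 200, 500],
    "Reviews": [100, 50, 10],
})

ranked_df = (
    df.set_index("Name")                     # keep Name out of the ranking
      .rank(ascending=False, method="min")   # largest value gets rank 1
      .astype(int)                           # ranks come back as floats
      .add_suffix("_rank")                   # Sales -> Sales_rank, ...
      .reset_index()                         # Name back as a column
)
```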

Python adding two dataframes based on index (edited)

(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with pandas and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
df2 = df2.set_index('Index')
I would like to sum the quantities of the columns based on Index. The columns do not have the same names and may contain strings (I have in fact 50 columns in each df). I want to consider nan as 0. The result should look like this:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
# df3 is assumed to be pre-allocated with the same shape as df1
numbercolumns = len(df1.columns)
indices = 0
while indices < len(df1.index):
    columna = 0
    while columna < numbercolumns:
        df3.iloc[indices, columna] = df1.iloc[indices, columna] + df2.iloc[indices, columna]
        columna += 1
    indices += 1
Thank you.
You can try concatenating both dataframes and then summing within each index group:
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0
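For the frames in the question, where cells may hold strings, a minimal sketch is to coerce everything to numbers first and then add the frames after pairing the columns by position. Pairing by position is an assumption here, since the column names differ between the two frames.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Index': ['0', '1', '2'], 'number': [3, 'dd', 1], 'people': [3, 's', 3]}).set_index('Index')
df2 = pd.DataFrame({'Index': ['0', '1', '2'], 'quantity': [3, 2, 'hi'], 'persons': [1, 5, np.nan]}).set_index('Index')

# Turn every non-numeric cell into NaN, column by column
a = df1.apply(pd.to_numeric, errors='coerce')
b = df2.apply(pd.to_numeric, errors='coerce')

# Assumption: the i-th column of df2 belongs with the i-th column of df1
b.columns = a.columns

# Element-wise sum aligned on the index; NaN on either side gives NaN
df3 = a + b
```

This reproduces the table in the question (6 and 4 in the first row, NaN elsewhere). If a missing value on one side should count as 0 instead, a.add(b, fill_value=0) does that, but note that it also treats the coerced strings as 0.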

Replace values in dataframe from another dataframe with Pandas

I have 3 dataframes: df1, df2, df3. I am trying to fill the NaN values of df1 with some values contained in df2. The values taken from df2 are selected according to the output of a simple function (mul_val), which processes some data stored in df3.
I was able to get this result, but I would like to find a simpler way, with more readable code.
Here is what I have so far:
import pandas as pd
import numpy as np
# simple function
def mul_val(a,b):
    return a*b
# dataframe 1
data = {'Name':['PINO','PALO','TNCO' ,'TNTO','CUCO' ,'FIGO','ONGF','LABO'],
'Id' :[ 10 , 9 ,np.nan , 14 , 3 ,np.nan, 7 ,np.nan]}
df1 = pd.DataFrame(data)
# dataframe 2
infos = {'Info_a':[10,20,30,40,70,80,90,50,60,80,40,50,20,30,15,11],
'Info_b':[10,30,30,60,10,85,99,50,70,20,30,50,20,40,16,17]}
df2 = pd.DataFrame(infos)
dic = {'Name': {0: 'FIGO', 1: 'TNCO'},
'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)
#---------------Modify from here in the most efficient way!-----------------
for idx,row in df3.iterrows():
    store_val = []
    print(row['Name'])
    for j in row['index']:
        store_val.append([mul_val(df2['Info_a'][j],df2['Info_b'][j]),j])
    store_val = np.asarray(store_val)
    # - Identify the position of the minimum value in the first column
    indx_min_val = np.argmin(store_val[:,0])
    # - Get the corresponding number contained in the second column
    col_value = row['index'][indx_min_val]
    # - Write that value into the row of df1 having the same row['Name']
    df1.loc[df1['Name']==row['Name'], 'Id'] = col_value
By printing store_val at every iteration I get:
FIGO
[[6800 5]
[8910 6]]
TNCO
[[2500 11]
[ 400 12]
[1200 13]]
Let's do a simple example: considering FIGO, I identify 6800 as the minimum between 6800 and 8910. Therefore I select the number 5, which is placed in df1. Repeating this operation for the remaining rows of df3 (in this case I have only 2 rows, but there could be many more), the final result should look like this:
In[0]: before In[0]: after
Out[0]: Out[0]:
Id Name Id Name
0 10.0 PINO 0 10.0 PINO
1 9.0 PALO 1 9.0 PALO
2 NaN TNCO -----> 2 12.0 TNCO
3 14.0 TNTO 3 14.0 TNTO
4 3.0 CUCO 4 3.0 CUCO
5 NaN FIGO -----> 5 5.0 FIGO
6 7.0 ONGF 6 7.0 ONGF
7 NaN LABO 7 NaN LABO
Note: you can also remove the for loops if needed and use different formats to store the data (lists, arrays...); the important thing is that the final result is still a dataframe.
I can offer two similar options that achieve the same result as your loop in a couple of lines:
1. Using apply and fillna() (fillna is faster than combine_first by a factor of two):
# idxmin returns the index label of the smallest product among the candidate rows
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].idxmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()
2. Using a function (lambda doesn't support assignment, so you have to apply a func):
def f(row):
    df1.loc[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()
df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
    df1.loc[df1.Name==row['Name'], 'Id'] = (df2.Info_a*df2.Info_b).loc[row['index']].idxmin()
df3.apply(f, args=(df1,df2,), axis=1)
Note that your solution, even though much more verbose, will take the least amount of time with this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speed would be similar, since in both cases it's a matter of looping on the rows of df3
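If the row-wise apply becomes a bottleneck when df3 has many rows, one possible alternative is to flatten the candidate lists with explode and let groupby pick the minimum per name. This is only a sketch under the question's data; the names long, product and best are introduced here for illustration, and explode(ignore_index=True) needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['PINO', 'PALO', 'TNCO', 'TNTO', 'CUCO', 'FIGO', 'ONGF', 'LABO'],
                    'Id': [10, 9, np.nan, 14, 3, np.nan, 7, np.nan]})
df2 = pd.DataFrame({'Info_a': [10, 20, 30, 40, 70, 80, 90, 50, 60, 80, 40, 50, 20, 30, 15, 11],
                    'Info_b': [10, 30, 30, 60, 10, 85, 99, 50, 70, 20, 30, 50, 20, 40, 16, 17]})
df3 = pd.DataFrame({'Name': ['FIGO', 'TNCO'], 'index': [[5, 6], [11, 12, 13]]})

# One row per (Name, candidate index) pair
long = df3.explode('index', ignore_index=True)
long['index'] = long['index'].astype(int)

# Product Info_a * Info_b for each candidate row of df2
long['product'] = (df2['Info_a'] * df2['Info_b']).loc[long['index']].to_numpy()

# For each Name, keep the candidate index with the smallest product
best = long.loc[long.groupby('Name')['product'].idxmin()].set_index('Name')['index']

# Fill only the missing Ids; names without candidates (LABO) stay NaN
df1['Id'] = df1['Id'].fillna(df1['Name'].map(best))
```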

openpyxl Python Iterating Through Large Data List

I have a large Excel workbook with 1 sheet of roughly 45,000 rows and 45 columns. I want to iterate through the columns looking for duplicates and unique items, and it's taking a very long time to go through individual columns. Is there any way to optimize my code or make this go faster? I either want to print the information or save it to a txt file. I'm on Windows 10 and Python 2.7, using the openpyxl module:
from openpyxl import load_workbook
# read the workbook in read-only mode to get the data
wb = load_workbook(filename='file.xlsx', read_only=True)
ws = wb['file']
seen = set()
uniq = []
doubles = []
# read row by row and look only at the first column
for (value,) in ws.iter_rows(min_col=1, max_col=1, values_only=True):
    if value in seen:
        doubles.append(value)
    else:
        uniq.append(value)
        seen.add(value)
print("Unique: {}".format(uniq))
print("Doubles: {}".format(doubles))
EDIT: Lets say I have 5 columns A,B,C,D,E and 10 entries, so 10 rows, 5x10. In column A I want to extract all the duplicates and separate them from the unique values.
As VedangMehta mentioned, Pandas will do it very quickly for you.
Run this code:
import pandas as pd
#read in the dataset:
df = pd.read_excel('file.xlsx', sheet_name = 'file')
#flag, column by column, every value that already appeared earlier in that column
df_dup = df.apply(lambda col: col.duplicated())
#save duplicated values from first column
df[df_dup].iloc[:,0].to_csv("file_duplicates_col1.csv")
#save unique values from first column
df[~df_dup].iloc[:,0].to_csv("file_unique_col1.csv")
#save duplicated values from all columns:
df[df_dup].to_csv("file_duplicates.csv")
#save unique values from all columns:
df[~df_dup].to_csv("file_unique.csv")
For details, see below:
Suppose your dataset looks as follows:
df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df.head()
Out[24]:
a b
0 1 13
1 3 3
2 1 5
3 13 3
You can find which values are duplicated in each column:
df_dup = df.apply(lambda col: col.duplicated())
the result:
df_dup
Out[26]:
a b
0 False False
1 False False
2 True False
3 False True
you can find the duplicated values by subsetting the df using the boolean dataframe df_dup
df[df_dup]
Out[27]:
a b
0 NaN NaN
1 NaN NaN
2 1.0 NaN
3 NaN 3.0
Again, you can save that using:
#save the above using:
df[df_dup].to_csv("duplicated_values.csv")
to see the duplicated values in the first column, use:
df[df_dup].iloc[:,0]
to get
Out[11]:
0 NaN
1 NaN
2 1.0
3 NaN
Name: a, dtype: float64
For unique values, use ~, which is the element-wise NOT operator for boolean pandas objects. So you're essentially subsetting df by the values that are not duplicates:
df[~df_dup]
Out[29]:
a b
0 1.0 13.0
1 3.0 3.0
2 NaN 5.0
3 13.0 NaN
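Note that duplicated() with its default keep='first' only flags the second and later occurrences, so the "unique" frame above still contains the first copy of each repeated value. If the goal is to separate every value that occurs more than once from the values that occur exactly once, a short sketch for a single column (using the same toy frame) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 1, 13], 'b': [13, 3, 5, 3]})

col = df['a']
# keep=False marks every occurrence of a repeated value, not just the later ones
dup_mask = col.duplicated(keep=False)

duplicates = col[dup_mask]    # values that occur more than once
uniques = col[~dup_mask]      # values that occur exactly once
```

Either series can then be written out with to_csv, as in the snippets above.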
When working with read-only mode don't use the columns property to read a worksheet. This is because data is stored in rows so columns require the parser to continually re-read the file.
This is an example of using openpyxl to convert worksheets into Pandas dataframes. It requires openpyxl 2.4 or higher, which at the time of writing must be checked out.
