Difference between rows in pandas - python

I have data in a CSV file which I am reading with pandas. The data is in this format:
name  company  income  saving
A     AA          100      10
B     AA          200      20
I wish to create a new row with name A, company AA, and income and saving being the difference between A and B.
Expected output:
name  company  income  saving
A     AA         -100     -10

I believe you need:
print (df)
  name company  income  saving
0    A      AA     100      10
1    B      AA     200      20
2    C      AA     300      40
# select columns by name
df1 = df[['name','company']].join(df[['income','saving']].diff(-1))
# select columns by position
#df1 = df.iloc[:, :2].join(df.iloc[:, 2:].diff(-1))
print (df1)
  name company  income  saving
0    A      AA  -100.0   -10.0
1    B      AA  -100.0   -20.0
2    C      AA     NaN     NaN
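If only the single row for A is wanted (as in the expected output), one option is to keep just the first row of the result:

df1_first = df1.head(1)  # keeps only the A row: income -100.0, saving -10.0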

Related

Match, update and replace values from one dataset to another in Pandas

I have two datasets, one with old data and one with updated data. I'd like to create a new dataset by updating values based on whether the area, date and column values match.
Data

df1
area   date       aa  bb  cc
japan  10/1/2027   1   0   0
us     1/1/2022    5   5   5
fiji   11/2/2026   1   1   1

df2
area   date       aa  bb  cc  stat
japan  10/1/2027   0   5   5  yes
fiji   11/2/2026   0   0  10  no
I wish to replace the values in the aa, bb, and cc columns of df2 with the updated values from df1 wherever the area and date values match.
Desired
area   date       aa  bb  cc  stat
japan  10/1/2027   1   0   0  yes
fiji   11/2/2026   1   1   1  no
Doing
df['date'] = df.date.apply(lambda x: np.nan if x == ' ' else x)
I am not exactly sure how to set this up, but I have an idea. Any suggestion is appreciated.
You can merge and combine_first:
cols = ['area', 'date']
out = (df2[cols].merge(df1, on=cols, how='left')
.combine_first(df2)[df2.columns]
)
Output:
    area       date  aa  bb  cc stat
0  japan  10/1/2027   1   0   0  yes
1   fiji  11/2/2026   1   1   1   no
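A possible alternative sketch (my own variant, not part of the answer above, so treat the exact calls as an assumption): DataFrame.update aligns on the index, so setting area and date as the index lets df1 overwrite the matching aa/bb/cc cells of df2 in place while stat is left untouched.

tmp = df2.set_index(['area', 'date'])
tmp.update(df1.set_index(['area', 'date']))  # overwrites only where df1 has a matching (area, date)
out = tmp.reset_index()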
Using .merge, and making sure the date columns in both dataframes are converted to datetime:
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])
df3 = pd.merge(left=df1, right=df2, on=["area", "date"], how="right").filter(regex=r".*(?<!_y)$")
df3.columns = df3.columns.str.split("_").str[0]
print(df3)
    area       date  aa  bb  cc stat
0  japan 2027-10-01   1   0   0  yes
1   fiji 2026-11-02   1   1   1   no
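Roughly the same result can be reached without the regex filter by giving the merge explicit suffixes and dropping the suffixed duplicates; a minimal sketch, assuming the same df1/df2 as above:

df3 = pd.merge(left=df1, right=df2, on=["area", "date"], how="right", suffixes=("", "_old"))
df3 = df3.loc[:, ~df3.columns.str.endswith("_old")]  # keep df1's aa/bb/cc, plus df2's stat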
I think this can possibly be simplified to:
output = df1[df1['area'].isin(df2['area']) & df1['date'].isin(df2['date'])]
OUTPUT:
area   date       aa  bb  cc
japan  10/1/2027   1   0   0
fiji   11/2/2026   1   1   1
Even when df1 looks like this:
DF1:
    area       date  aa  bb  cc
0  japan  10/1/2027   1   0   0
1     us   1/1/2022   5   5   5
2   fiji  11/2/2026   1   1   1
3   fiji  12/5/2025   9   9   9
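One thing worth noting (my own observation, not part of the answer): isin checks area and date independently, so a row whose area matches one df2 row and whose date matches a different df2 row would also pass. To match area and date as pairs, and to bring stat along from df2, an inner merge on both key columns is one way to express it:

output = df1.merge(df2[['area', 'date', 'stat']], on=['area', 'date'], how='inner')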

Pivot select tables in dataframe to make values column headers in Python

I have a dataframe, df, where I would like to transform and pivot select values.
I wish to groupby id and date, sum the 'pwr' values and then count the type values.
df (the 'type' values 'hi' and 'hey' will become column headers)
id  date  type  pwr  de_id  de_date  de_type  de_pwr  base  base_pos
aa  q1    hey    10  aa     q1       hey           5   200        40
aa  q1    hi      5                                     200        40
aa  q1    hey     5                                     200        40
aa  q2    hey     2  aa     q2       hey           3   200        40
aa  q2    hey     2  aa     q2       hey           3   200        40
bb  q1            0  bb     q1       hi            6   500        10
bb  q1            0  bb     q1       hi            6   500        10
Desired
id  date  hey  hi  total  pwr_sum  hey  hi  total_de  de_pwr_sum  base  base_pos
aa  q1      2   1      3       20    1   0         1           5   200        40
aa  q2      2   0      2        4    2   0         2           6   200        40
bb  q1      0   0      0        0    0   2         2          12   500        10
Doing
sum1 = df.groupby(['id','date']).agg({'pwr': 'sum', 'type': 'count', 'de_pwr': 'sum', 'de_type': 'count'})
pd.pivot_table(df, values = '' , columns = 'type')
Any suggestion will be helpful.
So, this is definitely not a 'clean' way to go about it, but since you have 2 separate totals summing along columns, I don't know how much cleaner it could get (and the output seems accurate).
You don't mention what aggregation you use to get base and base_pos values, so I went with mean (might need to change it).
# counts of 'type' per (id, date), with 'hey'/'hi' as columns
type_col = pd.crosstab(index=[df['id'], df['date']], columns=df['type'])
type_col['total'] = type_col.sum(axis=1)
# summed pwr per (id, date)
pwr_sum = df.groupby(['id', 'date'])['pwr'].sum()
# same two steps for the de_ columns
de_type_col = pd.crosstab(index=[df['id'], df['date']], columns=df['de_type'])
de_type_col['total_de'] = de_type_col.sum(axis=1)
pwr_de_sum = df.groupby(['id', 'date'])['de_pwr'].sum()
# base / base_pos aggregated with mean (see note above)
base_and_pos = df.groupby(['id', 'date'])[['base', 'base_pos']].mean()
# everything shares the (id, date) index, so concatenate along columns
out = pd.concat([type_col, pwr_sum, de_type_col, pwr_de_sum, base_and_pos], axis=1).fillna(0).astype('int')
Essentially use crosstab to get value counts and sum them along columns. The index of resulting DataFrame is the same as groupby(['id','date']), so you can then concatenate results of groupby without issue. Repeat the same process for de columns, apply groupby with your choice of aggregation to base and base_pos columns, and concatenate all results along axis = 1. Obviously, you can group some operations together (such as pwr sum, de_pwr sum and base/base_pos aggregation), but you'll need to reorder your columns after that to get the desired order.
Output:
id  date  hey  hi  total  pwr  hey  hi  total_de  de_pwr  base  base_pos
aa  q1      2   1      3   20    1   0         1       5   200        40
aa  q2      2   0      2    4    2   0         2       6   200        40
bb  q1      0   0      0    0    0   2         2      12   500        10
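As mentioned above, the columns may need some renaming at the end; a small cleanup sketch, where the target names are assumptions chosen to disambiguate the duplicated hey/hi headers:

out = out.reset_index()
out.columns = ['id', 'date', 'hey', 'hi', 'total', 'pwr_sum',
               'de_hey', 'de_hi', 'total_de', 'de_pwr_sum', 'base', 'base_pos']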

sum duplicate row with condition using pandas

I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply a condition when a name is duplicated:
if the duplicated rows have the same value in a column (for example, the duplicated rows for A both have 180 in the rent column), keep that value only once (do not sum it);
otherwise, sum the values (for example, the duplicated rows for A have different values 2 and 5 in the sale column, and the duplicated rows for M have different values in both the rent and sale columns).
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code but it's not working as I want:
import pandas as pd
df=pd.DataFrame({'Name':['A','B','M','O','A','M'],
'rent':[180,1,12,10,180,2],
'sale':[2,4,1,1,5,19]})
df2 = (df.drop_duplicates()
         .groupby('Name', sort=False, as_index=False)
         .agg(Name=('Name', 'first'),
              rent=('rent', 'sum'),
              sale=('sale', 'sum')))
print(df2)
I got this output
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
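The same two-step idea, spelled out with an intermediate variable (just a sketch of what the one-liner does, assuming the df defined in the question):

# first collapse exact (Name, rent) pairs, summing sale within them
dedup = df.groupby(['Name', 'rent'], as_index=False).sum()
# then sum what is left per Name
df2 = dedup.groupby('Name', as_index=False).sum()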

change value of one dataframe by value from other dataframe pandas

I have a dataframe df1:
id value
1 100
2 100
3 100
4 100
5 100
I have another dataframe df2:
id value
2 50
5 30
I want to replace the values in df1, for the ids that appear in df2, with the values from df2.
Final modified df1:
id value
1 100
2 50
3 100
4 100
5 30
I will be running this in a loop, i.e. df2 will change from time to time (df1 stays outside the loop).
What would be the best way to change the values?
Use combine_first, but first set_index by id in both DataFrames.
Note: the id column in df2 has to be unique.
df = df2.set_index('id').combine_first(df1.set_index('id')).reset_index()
print (df)
id value
0 1 100.0
1 2 50.0
2 3 100.0
3 4 100.0
4 5 30.0
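The float dtype appears because aligning on the union of the two indexes introduces NaN before the fill; if integer values matter, the column can presumably be cast back afterwards:

df['value'] = df['value'].astype(int)  # restore integer dtype after combine_first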
A loc-based solution:
i = df1.set_index('id')
j = df2.set_index('id')
i.loc[j.index, 'value'] = j['value']
df2 = i.reset_index()
df2
id value
0 1 100
1 2 50
2 3 100
3 4 100
4 5 30
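Yet another small sketch (my own variant, not taken from the answers above): map the replacement values by id and fall back to the original value where there is no match.

df1['value'] = (df1['id'].map(df2.set_index('id')['value'])  # new value where id is in df2, else NaN
                         .fillna(df1['value'])               # keep the old value otherwise
                         .astype(int))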

Merge Pandas Dataframe on a column with structured data

Scenario: Following up from a previous question on how to read an excel file from a server into a dataframe (How to read an excel file directly from a Server with Python), I am trying to merge the contents of multiple dataframes (which contain data from excel worksheets).
Issue: Even after searching for similar issues here in SO, I still was not able to solve the problem.
Format of data (each sheet is read into a dataframe):
Sheet 1 (db1)
Name  CUSIP        Date  Price
A     XXX    01/01/2001    100
B     AAA    02/05/2005     90
C     ZZZ    03/07/2006     95
Sheet 2 (db2)
Ident  CUSIP  Value  Class
123    XXX      0.5     AA
444    AAA      1.3     AB
555    ZZZ      2.8     AC
Wanted output (fnl):
Name  CUSIP        Date  Price  Ident  Value  Class
A     XXX    01/01/2001    100    123    0.5     AA
B     AAA    02/05/2005     90    444    1.3     AB
C     ZZZ    03/07/2006     95    555    2.8     AC
What I already tried: I am trying to use the merge function to match each dataframe, but I am getting an error on the "how" part.
fnl = db1
fnl = fnl.merge(db2, how='outer', on=['CUSIP'])
fnl = fnl.merge(db3, how='outer', on=['CUSIP'])
fnl = fnl.merge(bte, how='outer', on=['CUSIP'])
I also tried concatenating, but I just get a list of dataframes instead of a single output.
wsframes = [db1 ,db2, db3]
fnl = pd.concat(wsframes, axis=1)
Question: What is the proper way to do this operation?
It seems you need:
from functools import reduce
#many dataframes
dfs = [df1,df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
  Name CUSIP        Date  Price  Ident  Value Class
0    A   XXX  01/01/2001    100    123    0.5    AA
1    B   AAA  02/05/2005     90    444    1.3    AB
2    C   ZZZ  03/07/2006     95    555    2.8    AC
But the columns in each dataframe have to be different (no shared columns apart from the merge key, CUSIP here), otherwise you get _x and _y suffixes:
dfs = [df1,df1, df2]
df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer'), dfs)
print (df)
Name_x CUSIP Date_x Price_x Name_y Date_y Price_y Ident Value \
0 A XXX 01/01/2001 100 A 01/01/2001 100 123 0.5
1 B AAA 02/05/2005 90 B 02/05/2005 90 444 1.3
2 C ZZZ 03/07/2006 95 C 03/07/2006 95 555 2.8
Class
0 AA
1 AB
2 AC
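If some of the non-key columns do overlap and the default suffixes are unwanted, one option (a sketch under that assumption, not something from the original answer) is to pass explicit suffixes so the duplicated columns are easy to spot and drop:

df = reduce(lambda x, y: x.merge(y, on='CUSIP', how='outer', suffixes=('', '_dup')), dfs)
df = df.loc[:, ~df.columns.str.endswith('_dup')]  # keep the first occurrence of each overlapping column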
