I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want: for each person it holds the difference in money at each expiry point. I highlighted the other rows in colors to make it clearer.
Thanks a lot.
Example
[image: dataset with the desired Difference column highlighted in yellow]
import pandas as pd

example = pd.DataFrame({
    'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
            '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
    'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})
# split the data by day and build a merge key from Name + Expiry
example_0830 = example[example['Day'] == '2020-08-30'].reset_index()
example_0829 = example[example['Day'] == '2020-08-29'].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
# keep only the key and the previous day's Money, then merge it onto the current day
example_0829 = example_0829[['key', 'Money']]
example_0830 = pd.merge(example_0830, example_0829, on='key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y', 'index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken against the previous date, you can define date variables at the start for today (t) and the previous day (t-1) and use them to filter the original dataframe, as sketched below.
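A minimal sketch of that idea, assuming Day can be parsed as a date and the two most recent dates in the frame are the ones to compare:
import pandas as pd

example['Day'] = pd.to_datetime(example['Day'])
# pick the two most recent dates present in the data
t, t_minus_1 = sorted(example['Day'].unique(), reverse=True)[:2]
current = example[example['Day'] == t]
previous = example[example['Day'] == t_minus_1]
# merge the previous day's Money onto the current day by Name and Expiry
merged = current.merge(previous[['Name', 'Expiry', 'Money']],
                       on=['Name', 'Expiry'], suffixes=('', '_prev'))
merged['Difference'] = merged['Money'] - merged['Money_prev']
print(merged)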
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want (assign back: sort_values is not in-place)
df = df.sort_values('Day', ascending=False)
# groupby and get the difference from the next row in each group;
# diff(1) takes the difference from the previous row, so diff(-1) uses the next row
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
Related
I am trying to melt the following data so that the result has 4 rows: one for "John-1" containing his before data, one for "John-2" containing his after data, one for "Kelly-1" containing her before data, and one for "Kelly-2" containing her after data. The columns would be "Name", "Weight", and "Height". Can this be done solely with the melt function?
df = pd.DataFrame({'Name': ['John', 'Kelly'],
'Weight Before': [200, 175],
'Weight After': [195, 165],
'Height Before': [6, 5],
'Height After': [7, 6]})
Use the pandas.wide_to_long function as shown below; sep=' ' tells it the stub names ('Weight', 'Height') are separated from the 'Before'/'After' suffix by a space, and suffix='\w+' allows non-numeric suffixes:
pd.wide_to_long(df, ['Weight', 'Height'], 'Name', 'grp', ' ', '\\w+').reset_index()
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
Or you could use pivot_longer from pyjanitor as follows:
import janitor
df.pivot_longer('Name', names_to = ['.value', 'grp'], names_sep = ' ')
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
I have a dataframe like the one shown below
tdf = pd.DataFrame(
{'Unnamed: 0' : ['Region','Asean','Asean','Asean','Asean','Asean','Asean'],
'Unnamed: 1' : ['Name', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR','STU'],
'2017Q1' : ['target_achieved',2345,5678,7890,1234,6789,5454],
'2017Q1' : ['target_set', 3000,6000,8000,1500,7000,5500],
'2017Q1' : ['score', 86, 55, 90, 65, 90, 87],
'2017Q2' : ['target_achieved',245,578,790,123,689,454],
'2017Q2' : ['target_set', 300,600,800,150,700,500],
'2017Q2' : ['score', 76, 45, 70, 55, 60, 77]})
As you can see, my column names are duplicated: there are three 2017Q1 columns and three 2017Q2 columns.
A Python dict can't hold duplicate keys, so the snippet above doesn't really reproduce the duplicated columns; the real data comes from an Excel file.
I tried the below to get my expected output
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
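For reference, a frame with genuinely duplicated column names can be built by passing the rows with an explicit columns list; this is only a reconstruction of what the Excel sheet presumably contains:
import pandas as pd

columns = ['Unnamed: 0', 'Unnamed: 1',
           '2017Q1', '2017Q1', '2017Q1',
           '2017Q2', '2017Q2', '2017Q2']
rows = [
    ['Region', 'Name', 'target_achieved', 'target_set', 'score',
     'target_achieved', 'target_set', 'score'],
    ['Asean', 'DEF', 2345, 3000, 86, 245, 300, 76],
    ['Asean', 'GHI', 5678, 6000, 55, 578, 600, 45],
    ['Asean', 'JKL', 7890, 8000, 90, 790, 800, 70],
    ['Asean', 'MNO', 1234, 1500, 65, 123, 150, 55],
    ['Asean', 'PQR', 6789, 7000, 90, 689, 700, 60],
    ['Asean', 'STU', 5454, 5500, 87, 454, 500, 77],
]
tdf = pd.DataFrame(rows, columns=columns)  # duplicate column labels are allowed here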
Update
After reading the Excel file, based on jezrael's answer, I get the display below.
I expect my output to look like the one shown below.
First create a MultiIndex in both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that's not possible, here is an alternative using your sample data: convert the columns plus the first row of data into a MultiIndex for the columns, and the first two columns into a MultiIndex for the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
.set_index(tdf.columns[:2].tolist())
.rename_axis(index=['Region','Name'], columns=['Year',None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
('Asean', 'GHI'),
('Asean', 'JKL'),
('Asean', 'MNO'),
('Asean', 'PQR'),
('Asean', 'STU')],
names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
('2017Q1', 'target_set'),
('2017Q1', 'score'),
('2017Q2', 'target_achieved'),
('2017Q2', 'target_set'),
('2017Q2', 'score')],
names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
Hi, I have df_input whose column names need to be sorted, not the rows (i.e. restructuring the dataframe).
df_input.columns
Out[143]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d4', 'v_d2', 'v_d3'], dtype='object')
The required output column names should be sorted after the first N columns (here, after the first 3 columns).
df_out.columns
Out[144]: Index(['product_name', 'price', 'make', 'v_d1', 'v_d2', 'v_d3', 'v_d4'], dtype='object')
My input dataframe is as follows:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10]
}
df_input = pd.DataFrame(data)
print (df_input)
Required output dataframe:
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d2':[54,12,45,77,23],
'v_d3':[88,69,37,15,10],
'v_d4':[66,12,55,7,89]
}
df_out = pd.DataFrame(data)
Thanks in advance
If the numeric suffixes of the column names are single digits (0 to 9), it is possible to use sorted with slicing:
df = df[df.columns[:3].tolist() + sorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d2 v_d3 v_d4
0 laptop 1200 Dell 2 54 88 66
1 printer 150 hp 44 12 69 12
2 tablet 300 Lenove 55 45 37 55
3 desktop 450 iPhone 2 77 15 7
4 chair 200 xyz 1 23 10 89
A more general solution uses natural sorting (plain sorted would put v_d10 before v_d2):
from natsort import natsorted
data = {'product_name': ['laptop', 'printer', 'tablet', 'desktop', 'chair'],
'price': [1200, 150, 300, 450, 200],
'make':['Dell','hp','Lenove','iPhone','xyz'],
'v_d1':[2,44,55,2,1],
'v_d4':[66,12,55,7,89],
'v_d10':[54,12,45,77,23],
'v_d20':[88,69,37,15,10]
}
df = pd.DataFrame(data)
df = df[df.columns[:3].tolist() + natsorted(df.columns[3:])]
print (df)
product_name price make v_d1 v_d4 v_d10 v_d20
0 laptop 1200 Dell 2 66 54 88
1 printer 150 hp 44 12 12 69
2 tablet 300 Lenove 55 55 45 37
3 desktop 450 iPhone 2 7 77 15
4 chair 200 xyz 1 89 23 10
I would like to flatten a dataframe that is nested inside another dataframe. In this example, the column account has a dataframe as its value. I would like to flatten this into a single dataframe.
Example: (Updated)
import pandas as pd
account1 = pd.DataFrame([{'nr': '123', 'balance': 56}, {'nr': '230', 'balance': 55}])
account2 = pd.DataFrame([{'nr': '456', 'balance': 575}])
account3 = pd.DataFrame([{'nr': '350', 'balance': 59}])
df = pd.DataFrame([{'id': 1, 'age': 23, 'name': 'anna', 'account': account1},
{'id': 2, 'age': 71, 'name': 'mary', 'account': account2},
{'id': 3, 'age': 42, 'name': 'bob', 'account': account3}])
print(df)
gives the dataframe:
id age name account
0 1 23 anna nr balance
0 123 56
1 230 55
1 2 71 mary nr balance
0 456 575
2 3 42 bob nr balance
0 350 59
And I would like to get:
id name age account|nr|0 account|balance|0 account|nr|1 account|balance|1
0 1 anna 23 123 56 230 55
1 2 mary 71 456 575
2 3 bob 42 350 59
How can I flatten a dataframe inside a dataframe into a single dataframe? Is this type of structure called a hierarchical DataFrame?
This is the solution that I have found.
list_accounts = []
for index_j, row_j in df.iterrows():
    # flatten each nested account dataframe into a single row
    account = row_j["account"]
    account = pd.DataFrame(account).stack().to_frame().T
    # column names come out as e.g. '0|nr', '0|balance', '1|nr', '1|balance'
    account.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in account.columns]
    list_accounts.append(account)

# concatenate the flattened rows side by side with the original frame
df = pd.concat([df, pd.concat(list_accounts).reset_index(drop=True)], axis=1)
df.drop(columns="account", inplace=True)
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = ['45', 'Dean', 'male']  # adding a row (the value order must match df's column order, here age, name, sex)
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to #Serenity's solution):
Demo:
data = []
# always insert new rows at the first position; the last row inserted ends up on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS: I wouldn't call .append(), pd.concat(), or .sort_index() too frequently (i.e. for every single row), as they're pretty expensive. So the idea is to do it in chunks...
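A rough sketch of why that matters (made-up row counts; timings vary by machine): per-row concatenation copies the whole frame on every call, while collecting the rows first builds the frame once.
import time
import pandas as pd

rows = [{'name': f'person{i}', 'age': i % 80, 'sex': 'male'} for i in range(2000)]

# per-row concat: each call copies everything built so far (roughly quadratic work)
start = time.perf_counter()
df_slow = pd.DataFrame([rows[0]])
for row in rows[1:]:
    df_slow = pd.concat([df_slow, pd.DataFrame([row])], ignore_index=True)
t_slow = time.perf_counter() - start

# batched: collect the rows first, build the frame once
start = time.perf_counter()
df_fast = pd.DataFrame(rows)
t_fast = time.perf_counter() - start

print(f'per-row: {t_slow:.2f}s  batched: {t_fast:.4f}s')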
#edyvedy13's solution worked great for me. However it needs to be updated for the deprecation of pandas' sort method - now replaced with sort_index.
df.loc[-1] = ['45', 'Dean', 'male'] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reset the index of the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate the two dataframes (df.ix has been removed from pandas; the frame itself is enough)
df2 = pd.concat([line, df]).reset_index(drop=True)
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex': ['male']})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df1 = pd.concat([df1, df])
df1 = df1.reset_index(drop=True)
That works
This works for me.
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male