Using melt() in Pandas

I'm trying to melt the following data so that the result is 4 rows: one for "John-1" containing his before data, one for "John-2" containing his after data, one for "Kelly-1" containing her before data, and one for "Kelly-2" containing her after data. The columns would be "Name", "Weight", and "Height". Can this be done solely with the melt function?
df = pd.DataFrame({'Name': ['John', 'Kelly'],
                   'Weight Before': [200, 175],
                   'Weight After': [195, 165],
                   'Height Before': [6, 5],
                   'Height After': [7, 6]})

Use the pandas.wide_to_long function as shown below:
pd.wide_to_long(df, ['Weight', 'Height'], 'Name', 'grp', ' ', '\\w+').reset_index()
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
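For readability, the same call can be spelled with keyword arguments (equivalent semantics: stubnames are the measure prefixes, i is the id column, j names the new suffix column, and sep/suffix describe how the wide column names split):
pd.wide_to_long(df, stubnames=['Weight', 'Height'], i='Name', j='grp',
                sep=' ', suffix='\\w+').reset_index()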
Alternatively, you can use pivot_longer from pyjanitor as follows:
import janitor
df.pivot_longer('Name', names_to = ['.value', 'grp'], names_sep = ' ')
Name grp Weight Height
0 John Before 200 6
1 Kelly Before 175 5
2 John After 195 7
3 Kelly After 165 6
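As for doing it solely with melt: melt alone can't split 'Weight Before' into two fields, but melt followed by a string split and a pivot gets there. A minimal sketch of that route:
long = df.melt(id_vars='Name')
long[['measure', 'grp']] = long['variable'].str.split(' ', expand=True)
out = (long.pivot_table(index=['Name', 'grp'], columns='measure', values='value')
           .reset_index())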

Related

How to retain duplicate column names and melt dataframe using pandas?

I have a dataframe like the one shown below
tdf = pd.DataFrame(
{'Unnamed: 0' : ['Region','Asean','Asean','Asean','Asean','Asean','Asean'],
'Unnamed: 1' : ['Name', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR','STU'],
'2017Q1' : ['target_achieved',2345,5678,7890,1234,6789,5454],
'2017Q1' : ['target_set', 3000,6000,8000,1500,7000,5500],
'2017Q1' : ['score', 86, 55, 90, 65, 90, 87],
'2017Q2' : ['target_achieved',245,578,790,123,689,454],
'2017Q2' : ['target_set', 300,600,800,150,700,500],
'2017Q2' : ['score', 76, 45, 70, 55, 60, 77]})
As you can see, my column names are duplicated: there are three 2017Q1 columns and three 2017Q2 columns. (A Python dict can't hold duplicate keys, so the sample literal above collapses them; in the real Excel file the columns really are duplicated.)
I tried the below to get my expected output:
tdf.columns = tdf.iloc[0]  # but this still ignores the columns with duplicate names
Update
After reading the Excel file based on jezrael's answer, I get the display below.
I expect my output to be as shown below.
First create a MultiIndex in both the columns and the index:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
If that's not possible, here is an alternative built from your sample data: convert the columns plus the first row of data to a MultiIndex in the columns, and the first two columns to a MultiIndex in the index:
tdf = pd.read_excel(file)
tdf.columns = pd.MultiIndex.from_arrays([tdf.columns, tdf.iloc[0]])
df = (tdf.iloc[1:]
         .set_index(tdf.columns[:2].tolist())
         .rename_axis(index=['Region', 'Name'], columns=['Year', None]))
print (df.index)
MultiIndex([('Asean', 'DEF'),
            ('Asean', 'GHI'),
            ('Asean', 'JKL'),
            ('Asean', 'MNO'),
            ('Asean', 'PQR'),
            ('Asean', 'STU')],
           names=['Region', 'Name'])
print (df.columns)
MultiIndex([('2017Q1', 'target_achieved'),
            ('2017Q1', 'target_set'),
            ('2017Q1', 'score'),
            ('2017Q2', 'target_achieved'),
            ('2017Q2', 'target_set'),
            ('2017Q2', 'score')],
           names=['Year', None])
And then reshape:
df1 = df.stack(0).reset_index()
print (df1)
Region Name Year score target_achieved target_set
0 Asean DEF 2017Q1 86 2345 3000
1 Asean DEF 2017Q2 76 245 300
2 Asean GHI 2017Q1 55 5678 6000
3 Asean GHI 2017Q2 45 578 600
4 Asean JKL 2017Q1 90 7890 8000
5 Asean JKL 2017Q2 70 790 800
6 Asean MNO 2017Q1 65 1234 1500
7 Asean MNO 2017Q2 55 123 150
8 Asean PQR 2017Q1 90 6789 7000
9 Asean PQR 2017Q2 60 689 700
10 Asean STU 2017Q1 87 5454 5500
11 Asean STU 2017Q2 77 454 500
EDIT: The solution for the edited question is similar:
df = pd.read_excel(file, header=[0,1], index_col=[0,1])
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0).reset_index()
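A side note if you are on a recent pandas (2.1+): stack grew a future_stack=True mode there and the legacy implementation is deprecated, so the reshape above would become (assuming such an environment):
df1 = df.rename_axis(index=['Region','Name'], columns=['Year',None]).stack(0, future_stack=True).reset_index()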

Python: Merge several columns of a dataframe without duplicating data

Let's say that I have this dataframe:
Name = ['Lolo', 'Mike', 'Tobias','Luke','Sam']
Age = [19, 34, 13, 45, 52]
Info_1 = ['Tall', 'Large', 'Small', 'Small','']
Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
Data = [123,268,76,909,87]
Sex = ['F', 'M', 'M','M','M']
df = pd.DataFrame({'Name' : Name, 'Age' : Age, 'Info_1' : Info_1, 'Info_2' : Info_2, 'Info_3' : Info_3, 'Data' : Data, 'Sex' : Sex})
print(df)
     Name  Age Info_1    Info_2  Info_3  Data Sex
0    Lolo   19   Tall  New York    Tall   123   F
1    Mike   34  Large     Paris   Paris   268   M
2  Tobias   13  Small    Lisbon      Hi    76   M
3    Luke   45  Small              Small   909  M
4     Sam   52          Berlin   Thanks    87   M
I want to merge the data of four columns of this dataframe: Info_1, Info_2, Info_3, Data.
I want to merge them without duplicating data within a row. That means for row "0", I do not want to have "Tall" twice. So at the end I would like to get something like this:
     Name  Age                Info Sex
0    Lolo   19   Tall New York 123   F
1    Mike   34     Large Paris 268   M
2  Tobias   13  Small Lisbon Hi 76   M
3    Luke   45           Small 909   M
4     Sam   52    Berlin Thanks 87   M
I tried this to merge the data:
df['period'] = df[['Info_1', 'Info_2', 'Info_3', 'Data']].agg('-'.join, axis=1)
However I get an error because join expects strings. How can I merge in the data of the "Data" column? And how can I check that I do not create duplicates?
Thank you
Your Data column seems to be of int type. Convert it to string first:
df['Data'] = df['Data'].astype(str)
df['period'] = (df[['Info_1', 'Info_2', 'Info_3', 'Data']]
                .apply(lambda x: ' '.join(x[x != ''].unique()), axis=1))
Output:
     Name  Age Info_1    Info_2  Info_3 Data Sex              period
0    Lolo   19   Tall  New York    Tall  123   F   Tall New York 123
1    Mike   34  Large     Paris   Paris  268   M     Large Paris 268
2  Tobias   13  Small    Lisbon      Hi   76   M  Small Lisbon Hi 76
3    Luke   45  Small             Small  909   M           Small 909
4     Sam   52           Berlin  Thanks   87   M    Berlin Thanks 87
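To then match the target layout exactly, a small follow-up sketch that drops the merged columns and renames period:
out = (df.drop(columns=['Info_1', 'Info_2', 'Info_3', 'Data'])
         .rename(columns={'period': 'Info'})[['Name', 'Age', 'Info', 'Sex']])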
I think it's probably easiest to first just concatenate all the fields you want with a space in between:
df['Info'] = df.Info_1 + ' ' + df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
Then you can write a function to remove the duplicate words from a string, something like this:
def remove_dup_words(s):
    words = s.split(' ')
    unique_words = pd.Series(words).drop_duplicates().tolist()
    return ' '.join(unique_words)
and apply that function to the Info field:
df['Info'] = df.Info.apply(remove_dup_words)
all the code together:
import pandas as pd
def remove_dup_words(s):
    words = s.split(' ')
    unique_words = pd.Series(words).drop_duplicates().tolist()
    return ' '.join(unique_words)
Name = ['Lolo', 'Mike', 'Tobias','Luke','Sam']
Age = [19, 34, 13, 45, 52]
Info_1 = ['Tall', 'Large', 'Small', 'Small','']
Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
Data = [123,268,76,909,87]
Sex = ['F', 'M', 'M','M','M']
df = pd.DataFrame({'Name' : Name, 'Age' : Age, 'Info_1' : Info_1, 'Info_2' : Info_2, 'Info_3' : Info_3, 'Data' : Data, 'Sex' : Sex})
df['Info'] = df.Info_1 + ' ' + df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
df['Info'] = df.Info.apply(remove_dup_words)
print(df)
     Name  Age Info_1    Info_2  Info_3 Data Sex                Info
0    Lolo   19   Tall  New York    Tall  123   F   Tall New York 123
1    Mike   34  Large     Paris   Paris  268   M     Large Paris 268
2  Tobias   13  Small    Lisbon      Hi   76   M  Small Lisbon Hi 76
3    Luke   45  Small             Small  909   M           Small 909
4     Sam   52           Berlin  Thanks   87   M    Berlin Thanks 87
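One caveat with plain concatenation: blank fields leave empty tokens (and double spaces) behind. A variant of remove_dup_words that filters them out, sketched with dict.fromkeys to keep first-occurrence order without needing pandas:
def remove_dup_words(s):
    words = [w for w in s.split(' ') if w]  # drop empty tokens left by blank fields
    return ' '.join(dict.fromkeys(words))   # dict preserves insertion order (Python 3.7+)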

Pandas - Create column with difference in values

I have the dataset below. How can I create a new column that shows the difference in money for each person, for each expiry?
The column in yellow is what I want: the difference in money at each expiry point for the person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example
import pandas as pd
import numpy as np

example = pd.DataFrame(data={'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
                                     '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
                             'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
                             'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                             'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})
example_0830 = example[example['Day'] == '2020-08-30'].reset_index()
example_0829 = example[example['Day'] == '2020-08-29'].reset_index()
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame(example_0829, columns=['key', 'Money'])
example_0830 = pd.merge(example_0830, example_0829, on='key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y', 'index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken against the previous date, you can define a date variable at the start to find today (t) and the previous day (t-1), and use those to filter the original dataframe.
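A minimal sketch of that idea, reusing the example frame above and assuming the ISO date strings sort correctly:
t = example['Day'].max()                    # today
t_1 = sorted(example['Day'].unique())[-2]   # previous day
today = example[example['Day'] == t]
prev = example[example['Day'] == t_1][['Name', 'Expiry', 'Money']]
merged = today.merge(prev, on=['Name', 'Expiry'], suffixes=('', '_prev'))
merged['Difference'] = merged['Money'] - merged['Money_prev']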
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want
df = df.sort_values('Day', ascending=False)
# group by and take the difference from the next row in each group:
# diff(1) calculates the difference from the previous row, so diff(-1) points to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
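If, as in the first answer, you only want the rows of the most recent day (where the difference is defined), a follow-up filter like this sketch works:
latest = df[df['Day'] == df['Day'].max()].dropna(subset=['Difference'])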

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns. What I want is to calculate a string-similarity score on a particular column ("fName"), grouped on the "_id" column.
     _id    fName   lName age
0   ABCD   Andrew  Schulz
1   ABCD  Andreww          23
2   DEFG     John     boy
3   DEFG    Johnn     boy  14
4   CDGH      Bob   TANNA  13
5  ABCD.    Peter  Parker  45
6  DEFGH    Clark    Kent  25
What I am looking for is whether, for the same id, I am getting similar entries, so that I can merge those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
    _id   fName   lName age
0  ABCD  Andrew  Schulz  23
2  DEFG    John     boy  14
4  CDGH     Bob   TANNA  13
5  ABCD   Peter  Parker  45
6  DEFG   Clark    Kent  25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
I have tried using the recordlinkage library for this, and I have ended up with a dataframe of pairs of similar indexes, called matches. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need a way to combine the similar rows so that the merged row takes data from each of them.
I just wanted to clear up some doubts regarding your question; I couldn't ask in the comments due to low reputation.
Like here if i run it for col "fName". I should be able to reduce this
dataframe to based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by id)? That would produce the resultant DataFrame listed above.
    _id   fName   lName age
0  ABCD  Andrew  Schulz  23
2  DEFG    John     boy  14
4  CDGH     Bob   TANNA  13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '',       '23']
r2 = ['DEFG', 'John',   'boy',    '']
r3 = ['DEFG', 'John',   'boy',    '14']
r4 = ['CDGH', 'Bob',    'TANNA',  '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

Dict = dict()
for i in Rx:
    if i[0] in Dict:  # an entry for this id already exists; fill in its gaps
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
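If you then want the merged rows back as a dataframe, a small follow-up sketch (assuming pandas is imported as pd):
out = pd.DataFrame(Rx, columns=['_id', 'fName', 'lName', 'age'])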
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna pulls each gap's value from the next non-null value found in the column, which may belong to another Id, but since you will discard the duplicated rows, drop_duplicates keeping the first occurrence should do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
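If the back fill pulling values across ids is a concern, a groupby variant confines the fill to each id; a sketch using GroupBy.bfill:
cols = ['fName', 'lName', 'Age']
filled = df.copy()
filled[cols] = filled.groupby('Id')[cols].bfill()  # back fill within each Id only
df_filled = filled.drop_duplicates('Id', keep='first')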

How to turn dictionary keys stored in one column of a Pandas dataframe into columns?

I have a dataframe with one column containing a stringified list of dictionaries. I was wondering how I can make new columns from these dictionaries' keys.
I am looking for a solution using pandas methods like apply, stack, etc., and NOT a for loop, as far as possible.
Here is the problem:
import ast
import pandas as pd

speakers = ['Einstein', 'Newton']
views = [1000, 2000]
ratings0 = ("[{'id': 7, 'name': 'Funny', 'count': 100}, {'id': 1, 'name': 'Sad', "
            "'count': 110}, {'id': 9, 'name': 'Happy', 'count': 120}]")
ratings1 = ("[{'id': 7, 'name': 'Happy', 'count': 200}, {'id': 3, 'name': 'Funny', "
            "'count': 210}, {'id': 2, 'name': 'Sad', 'count': 220}]")
ratings = [ratings0, ratings1]
df = pd.DataFrame({'speaker': speakers, 'ratings': ratings, 'views': views})
print(df)
speaker ratings views
0 Einstein [{'id': 7, 'name': 'Funny', 'count': 100}, {'i... 1000
1 Newton [{'id': 7, 'name': 'Happy', 'count': 200}, {'i... 2000
My attempt so far:
# new dataframe only for ratings
dfr = df['ratings'].apply(ast.literal_eval)
dfr = dfr.apply(pd.DataFrame)
dfr = dfr.apply(lambda x: x.sort_values(by='name'))
dfr = dfr.apply(pd.DataFrame.stack)
print(dfr)
0 1 2
count id name count id name count id name
0 100 7 Funny 110 1 Sad 120 9 Happy
1 200 7 Happy 210 3 Funny 220 2 Sad
This gives a multi-index dataframe. I tried sorting the dictionaries, but the columns still come out unsorted and the name columns do not hold the same values. I am also unsure how to promote the values of the name column to column labels in place of count, and how to remove the other unwanted columns.
Final Wanted Solution
speaker views Funny Sad Happy
Einstein 1000 100 110 120
Newton 2000 210 220 200
Update
I am using pandas 0.20, so the .explode() method is unavailable at my workplace, and I am not permitted to update pandas.
For pandas >= 0.25.0 you can use ast.literal_eval + explode + pivot
ii = df.set_index('speaker')['ratings'].apply(ast.literal_eval).explode()
u = pd.DataFrame(ii.tolist(), index=ii.index).reset_index()
u.pivot('speaker', 'name', 'count')
name Funny Happy Sad
speaker
Einstein 100 120 110
Newton 210 200 220
For older versions of pandas
a = df['speaker']
b = df['ratings']
ii = [
{**{'speaker': name}, **row}
for name, element in zip(a, b) for row in ast.literal_eval(element)
]
pd.DataFrame(ii).pivot('speaker', 'name', 'count')
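To also carry views into the result, as in the desired output, you can thread it through the comprehension and pivot on both keys (a sketch along the same lines):
ii = [
    {'speaker': name, 'views': v, **row}
    for name, v, element in zip(df['speaker'], df['views'], df['ratings'])
    for row in ast.literal_eval(element)
]
out = (pd.DataFrame(ii)
         .pivot_table(index=['speaker', 'views'], columns='name', values='count')
         .reset_index())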
You may use sum and index.repeat to construct a new dataframe, join it with df[['speaker', 'views']], and assign the result to df1. Next, set_index, unstack, and reset_index:
df['ratings'] = df['ratings'].apply(ast.literal_eval)
df1 = (pd.DataFrame(df.ratings.sum(), index=df.index.repeat(df.ratings.str.len()))
         .drop('id', 1)
         .join(df[['speaker', 'views']]))
df1.set_index(['speaker', 'views', 'name'])['count'].unstack().reset_index()
Out[213]:
name speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220
Note: name in the final output is the label of the columns axis. If you don't want to see it, chain an additional rename_axis as follows:
df1.set_index(['speaker', 'views', 'name'])['count'].unstack().reset_index() \
.rename_axis([None], axis=1)
Out[214]:
speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220
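A small aside: df.ratings.sum() works here because summing a Series of lists concatenates them into one flat list; on large frames this is quadratic, so something like list(itertools.chain.from_iterable(df.ratings)) would be a cheaper way to build the same list.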
For loops are not always bad. You can give it a try:
dfr = pd.DataFrame(columns=['id', 'name', 'count'])
for i in range(len(df)):
    x = pd.DataFrame(df['ratings'].apply(ast.literal_eval)[i])
    x.index = [i] * len(x)
    dfr = dfr.append(x)
dfr = dfr.reset_index()
dfr = (dfr.drop('id', axis=1)
          .pivot_table(index=['index'], columns='name',
                       values='count', aggfunc='sum')
          .rename_axis(None, axis=1).reset_index())
df_final = df.join(dfr)
df_final.drop(['index', 'ratings'], axis=1, inplace=True)
df_final
Gives:
speaker views Funny Happy Sad
0 Einstein 1000 100 120 110
1 Newton 2000 210 200 220
