Long to wide format using a dictionary - python

I would like to make a long to wide transformation of my dataframe, starting from
match_id  player  goals  home
1         John    1      home
1         Jim     3      home
...
2         John    0      away
2         Jim     2      away
...
ending up with:
match_id  player_1  player_2  player_1_goals  player_2_goals  player_1_home  player_2_home  ...
1         John      Jim       1               3               home           home
2         John      Jim       0               2               away           away
...
Since I'm going to have columns with new names, I thought that maybe I should build a dictionary for that, where the outer key is the match id, like so:
data = {
    1: {
        'player_1': 'John',
        'player_1_goals': 1,
        'player_1_home': 'home',
        'player_2': 'Jim',
        'player_2_goals': 3,
        'player_2_home': 'home',
    },
    2: {
        'player_1': 'John',
        'player_1_goals': 0,
        'player_1_home': 'away',
        'player_2': 'Jim',
        'player_2_goals': 2,
        'player_2_home': 'away',
    },
}
and then:
pd.DataFrame.from_dict(data).T
In the real scenario, however, the number of players will vary, so I can't hardcode it.
Is this the best way of doing this with dictionaries? If so, how can I build and populate the dict from my original pandas dataframe?

It looks like you want to pivot the dataframe. The problem is that there is no column in your dataframe that "enumerates" the players for you. If you add such a column via the assign() method, then pivot() becomes easy.
Otherwise this is a standard long-to-wide pivot. The only wrinkle is that the column names need a specific format, with the string "player" prepended to each one; the set_axis() call below handles that.
(df
    # number the players within each match: 1, 2, ...
    .assign(ind=df.groupby('match_id').cumcount().add(1).astype(str))
    # one row per match, one column per (value, player-number) pair
    .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
    # flatten the MultiIndex columns into player_1, player_1_goals, ...
    .pipe(lambda x: x.set_axis([
        '_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
        for (c, i) in x
    ], axis=1))
    .reset_index()
)
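For reference, here is a minimal, self-contained run of that pipeline on the question's sample data (the four-row frame below is an assumption based on the example above):
import pandas as pd

# sample data from the question, assuming exactly two players per match
df = pd.DataFrame({
    'match_id': [1, 1, 2, 2],
    'player': ['John', 'Jim', 'John', 'Jim'],
    'goals': [1, 3, 0, 2],
    'home': ['home', 'home', 'away', 'away'],
})

wide = (df
    .assign(ind=df.groupby('match_id').cumcount().add(1).astype(str))
    .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
    .pipe(lambda x: x.set_axis([
        '_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
        for (c, i) in x
    ], axis=1))
    .reset_index()
)
print(wide)
#    match_id player_1 player_2  player_1_goals  player_2_goals player_1_home player_2_home
# 0         1     John      Jim               1               3          home          home
# 1         2     John      Jim               0               2          away          away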

Set operations on cell level

Let's say I have a DataFrame with values in one column being a set:
import pandas as pd

df = pd.DataFrame([{'song': 'Despacito', 'genres': {'pop', 'rock'}},
                   {'song': 'We will rock you', 'genres': {'rock'}},
                   {'song': 'Fur Eliza', 'genres': {'piano'}}])
print(df)
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
2 Fur Eliza {piano}
How do I select rows with genres overlapping with expected genres? For instance, I would expect
df[df['genres'].intersection({'rock', 'metal'})]
to return first two songs:
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
Obviously, this will fail, because Series does not have an intersection() method, but you get the idea.
Is there a way to implement it with pandas DataFrame or DataFrame is not the right structure for my goal?
Use the set method isdisjoint with Series.map: mapping the bound method {'rock', 'metal'}.isdisjoint over the column tests each cell's set, and negating the result keeps the rows that overlap:
df = df[~df['genres'].map({'rock', 'metal'}.isdisjoint)]
print(df)
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
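The same test can also be written with an explicit intersection via apply, which reads closer to the question's pseudocode (a sketch rebuilding the original frame):
import pandas as pd

df = pd.DataFrame([{'song': 'Despacito', 'genres': {'pop', 'rock'}},
                   {'song': 'We will rock you', 'genres': {'rock'}},
                   {'song': 'Fur Eliza', 'genres': {'piano'}}])

wanted = {'rock', 'metal'}
# keep rows whose genre set shares at least one element with `wanted`
print(df[df['genres'].apply(lambda g: bool(g & wanted))])
#                song       genres
# 0         Despacito  {rock, pop}
# 1  We will rock you       {rock}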

Use contains to merge data frame

I have two separate files: one from our service providers and one internal (HR).
The service providers write our employees' names in different ways: sometimes firstname lastname, sometimes the first letter of the firstname plus the last name, sometimes lastname firstname... while the HR file stores the first and last names in separate columns.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that, please?
Regards
If you are just checking whether a name exists or not, this could be useful: since it is rare for two people to share exactly the same family name, I recommend splitting the names in DF1 and comparing family names first; you can then check first names to disambiguate.
You can easily do it with a loop (see the runnable sketch below):
for i in range(len(df1_split)):
    if df1_split[i].str.contains('family name you are searching for').any():
        print("yes")
If you need to compare on other aspects, just let me know.
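A runnable version of that idea might look like the following (a sketch using the sample frames from the question; exact-substring matching is deliberately naive, so misspelled surnames such as "Nicklson" will not match):
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jacl', 'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})

# for each HR employee, check whether the last name appears anywhere
# in a provider's free-form full name (case-insensitive)
for last in df2['Last']:
    mask = df1['Full Name'].str.contains(last, case=False)
    if mask.any():
        print(last, '->', df1.loc[mask, 'Full Name'].tolist())
# 'Nicklson' finds no match because the provider spelled it 'Nickolson',
# which is why the fuzzy approaches below are more robust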
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames :
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(f"{row['First']} {row['Last']}") for _, row in df2.iterrows()]
After that you can try comparing the HumanName instances, which expose the parsed fields. An instance looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
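As an illustration of the comparison step (a sketch assuming the names1 and names2 lists built above), matching on lowercased last name plus first initial could look like:
# hypothetical comparison rule: same last name and same first initial
for n1 in names1:
    for n2 in names2:
        if (n1.last.lower() == n2.last.lower()
                and n1.first[:1].lower() == n2.first[:1].lower()):
            print(n1, '<->', n2)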
I have used this approach to process thousands of names, merging them with the same names from other documents, and the results were good.
More about the module can be found at https://nameparser.readthedocs.io/en/latest/
Hey, you could use fuzzy string matching with fuzzywuzzy.
First, create a Full Name column for df2:
df2_ = df2[['First', 'Last']].agg(lambda a: a['First'] + ' ' + a['Last'], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio to get the similarity:
from fuzzywuzzy import fuzz
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
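If the rows of the two files are not guaranteed to line up by index, each provider spelling can instead be scored against all HR names with fuzzywuzzy's process helper (a sketch, assuming the df1 and df2_ frames above):
from fuzzywuzzy import process

# find the best-scoring HR name for every provider spelling
choices = df2_['Full Name'].tolist()
for name in df1['Full Name']:
    best, score = process.extractOne(name, choices)
    print(f"{name!r} -> {best!r} (score {score})")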

Conditional Filling in Missing Values in a Pandas Data frame using non-conventional means

TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenges we were given in a tutorial was the following:
"There are X missing entries in the data frame, each with an associated code but a 'blank' entry next to the code. This occurs at random across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I used is as follows:
Loop through the entire dataframe and identify entries with blanks (""). Replace each blank by copying the correct name for its (ordered) code.
code_names = [
    "",
    'Economic management',
    'Public sector governance',
    'Rule of law',
    'Financial and private sector development',
    'Trade and integration',
    'Social protection and risk management',
    'Social dev/gender/inclusion',
    'Human development',
    'Urban development',
    'Rural development',
    'Environment and natural resources management',
]

df_copy = df_.copy()

# looks through each entry and, if its name is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]

# preview the first 25 entries to verify the fill
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    if counter >= limit:
        break
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works, is there a better, more pythonic way of achieving what I'm after? I feel the approach I used is very clunky, since my skills are not very well developed yet.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
import numpy as np

new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
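On recent pandas versions fillna(method='ffill') is deprecated in favour of .ffill(), so the same chain can be written as (a sketch):
new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .ffill()
            .sort_index())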
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series to map the codes to non-blank names, and use .map to map that series to your code column:
df['name'] = (df['code']
.map(
df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.groupby('code')
.first()
.squeeze()
))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
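The same idea carries over to the asker's nested records: precompute the code-to-name mapping from the non-blank entries and use it in place of the hand-maintained code_names list (a sketch, assuming mjtheme_namecode is a column of lists of {'code', 'name'} dicts as in the question, and that every code appears with a name at least once):
# build code -> name from every non-blank entry seen in the data
lookup = {}
for records in df_copy.mjtheme_namecode:
    for rec in records:
        if rec['name']:
            lookup[rec['code']] = rec['name']

# fill the blanks from the learned mapping
for records in df_copy.mjtheme_namecode:
    for rec in records:
        if not rec['name']:
            rec['name'] = lookup[rec['code']]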

Creating new column with pieces of 3 columns

I would like to create a new column in a dataframe containing pieces of 3 different columns: the first 5 letters of the last name (after removing non-alphabetical characters) if it is that long, else the whole last name; then the first 2 letters of the first name; and a code appended to the end.
The code below doesn't work, but that's where I am, and it isn't close to working:
df['namecode'] = df.Last.str.replace('[^a-zA-Z]', '')[:5]+df.First.str.replace('[^a-zA-Z]', '')[:2]+str(jr['code'])
First  Last   code   namecode
jeff   White  0989   Whiteje0989
Zach   Bunt   0798   Buntza0798
ken    Black  5764   Blackke5764
Here is one approach.
Use pandas str.slice instead of trying to do string indexing.
For example:
import pandas as pd
df = pd.DataFrame(
{
'First': ['jeff', 'zach', 'ken'],
'Last': ['White^', 'Bun/t', 'Bl?ack'],
'code': ['0989', '0798', '5764']
}
)
print(df['Last'].str.replace('[^a-zA-Z]', '', regex=True).str.slice(0, 5)
      + df['First'].str.slice(0, 2) + df['code'])
#0 Whiteje0989
#1 Buntza0798
#2 Blackke5764
#dtype: object
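If the first-name column can also contain punctuation, the same cleaning can be applied to it before slicing, with the result assigned back to the frame (a sketch on the sample df above):
import pandas as pd

df = pd.DataFrame({
    'First': ['jeff', 'zach', 'ken'],
    'Last': ['White^', 'Bun/t', 'Bl?ack'],
    'code': ['0989', '0798', '5764'],
})

# strip non-letters, then take the leading slices and append the code
clean = lambda s: s.str.replace('[^a-zA-Z]', '', regex=True)
df['namecode'] = (clean(df['Last']).str.slice(0, 5)
                  + clean(df['First']).str.slice(0, 2)
                  + df['code'])
print(df['namecode'])
# 0    Whiteje0989
# 1     Buntza0798
# 2    Blackke5764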

Using if/else in pandas series to create new series based on conditions

I have a pandas df.
Say I have a column "activity" which can be "fun" or "work" and I want to convert it to an integer.
What I do is:
df["activity_id"] = 1*(df["activity"]=="fun") + 2*(df["activity"]=="work")
This works, but only because I do not know how to put an if/else in there (and with 10 activities it would get complicated).
However, say I now have the opposite problem and want to convert from an id back to a string. I cannot use this trick any more, because I cannot multiply a string by a Boolean. How do I do it? Is there a way to use if/else?
You can create a dictionary with id as the key and the string as the value and then use the map series method to convert the integer to a string.
my_map = {1:'fun', 2:'work'}
df['activity']= df.activity_id.map(my_map)
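If there are many activities and you would rather not write the dictionary by hand, pd.factorize can generate both directions at once (a sketch; note that the codes start at 0 rather than 1):
import pandas as pd

df = pd.DataFrame({'activity': ['fun', 'work', 'fun']})

# factorize assigns an integer id per distinct value
codes, labels = pd.factorize(df['activity'])
df['activity_id'] = codes
reverse_map = dict(enumerate(labels))   # id -> string, for mapping back
print(reverse_map)
# {0: 'fun', 1: 'work'}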
You could instead convert your activity column to categorical dtype:
df['activity'] = pd.Categorical(df['activity'])
Then you would automatically have access to integer labels for the values via df['activity'].cat.codes.
import pandas as pd
df = pd.DataFrame({'activity':['fun','work','fun']})
df['activity'] = pd.Categorical(df['activity'])
print(df['activity'].cat.codes)
0 0
1 1
2 0
dtype: int8
Meanwhile the string values can still be accessed as before while saving memory:
print(df)
still yields
activity
0 fun
1 work
2 fun
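To go the other way with this approach, the integer codes index straight into cat.categories (a short sketch continuing the example above):
# recover the strings from the integer codes via the categories index
codes = df['activity'].cat.codes            # 0, 1, 0
labels = df['activity'].cat.categories      # Index(['fun', 'work'])
print(list(labels[codes.to_numpy()]))       # ['fun', 'work', 'fun']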
You could also use a dictionary and list comprehension to recalculate values for an entire column. This makes it easy to define the reverse mapping as well:
>>> import pandas as pd
>>> forward_map = {'fun': 1, 'work': 2}
>>> reverse_map = {v: k for k, v in forward_map.items()}
>>> df = pd.DataFrame(
{'activity': ['work', 'work', 'fun', 'fun', 'work'],
'detail': ['reports', 'coding', 'hiking', 'games', 'secret games']})
>>> df
activity detail
0 work reports
1 work coding
2 fun hiking
3 fun games
4 work secret games
>>> df['activity'] = [forward_map[i] for i in df['activity']]
>>> df
activity detail
0 2 reports
1 2 coding
2 1 hiking
3 1 games
4 2 secret games
>>> df['activity'] = [reverse_map[i] for i in df['activity']]
>>> df
activity detail
0 work reports
1 work coding
2 fun hiking
3 fun games
4 work secret games
