I have a pandas DataFrame with string input columns. df looks like:
news label1 label2 label3 label4
COVID Hospitalizations .... health
will pets contract covid.... health pets
High temperature will cause.. health weather
...
Expected output
news health pets weather tech
COVID Hospitalizations .... 1 0 0 0
will pets contract covid.... 1 1 0 0
High temperature will cause.. 1 0 1 0
...
Currently I use sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df['labels'] = df[['label1', 'label2', 'label3', 'label4']].values.tolist()
mlb.fit(df['labels'])
temp = mlb.transform(df['labels'])
ff = pd.DataFrame(temp, columns=list(mlb.classes_))
df_final = pd.concat([df['news'], ff], axis=1)
This works so far. I'm just wondering if there is a way to avoid using sklearn.preprocessing.MultiLabelBinarizer?
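For reference, the answers below can be tested against a reconstruction of the sample frame (the exact news text and the use of empty strings for the blank label cells are assumptions; if the blanks are NaN instead, the fillna('') step mentioned below applies):
import pandas as pd

# assumed reconstruction of the question's sample data
df = pd.DataFrame({
    'news': ['COVID Hospitalizations ....',
             'will pets contract covid....',
             'High temperature will cause..'],
    'label1': ['health', 'health', 'health'],
    'label2': ['', 'pets', 'weather'],
    'label3': ['', '', ''],
    'label4': ['', '', ''],
})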
One idea is to join the values by | and then use Series.str.get_dummies:
# if there are missing values (NaN), fill them first:
# df = df.fillna('')
df_final = df.set_index('news').agg('|'.join, axis=1).str.get_dummies().reset_index()
print (df_final)
news health pets weather
0 COVID Hospitalizations .... 1 0 0
1 will pets contract covid.... 1 1 0
2 High temperature will cause.. 1 0 1
Or use get_dummies:
df_final = (pd.get_dummies(df.set_index('news'), prefix='', prefix_sep='')
.groupby(level=0,axis=1)
.max()
.reset_index())
# the second column's name is an empty string, hence the difference from the solution above
print (df_final)
news health pets weather
0 COVID Hospitalizations .... 1 1 0 0
1 will pets contract covid.... 1 1 1 0
2 High temperature will cause.. 1 1 0 1
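If that unnamed column is not wanted in the second result, a small follow-up sketch (assuming the blank column label is the empty string, as noted in the comment above) is to drop it:
# drop the column produced by the empty label cells
df_final = df_final.drop(columns='', errors='ignore')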
Related
I've been searching around for a while now, but I can't seem to find the answer to this small problem.
I have this code that replaces values:
import pandas as pd

df = {'Name': ['al', 'el', 'naila', 'dori', 'jlo'],
      'living': ['Alvando', 'Georgia GG', 'Newyork NY', 'Indiana IN', 'Florida FL'],
      'sample2': ['malang', 'kaltim', 'ambon', 'jepara', 'sragen'],
      'output': ['KOTA', 'KAB', 'WILAYAH', 'KAB', 'DAERAH']}
df = pd.DataFrame(df)

df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0)
df = df.replace('KAB', 1)
But I am actually expecting this output from simpler code that doesn't repeat replace:
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
I've tried using np.where, but it doesn't give the desired result: all results display 0, even for the rows whose value should map to 1.
df['output'] = pd.DataFrame({'output':np.where(df == "KAB", 1, 0).reshape(-1, )})
This code should work for you:
df = df.replace(['KOTA', 'WILAYAH', 'DAERAH'], 0).replace('KAB', 1)
Output:
>>> df
Name living sample2 output
0 al Alvando malang 0
1 el Georgia GG kaltim 1
2 naila Newyork NY ambon 0
3 dori Indiana IN jepara 1
4 jlo Florida FL sragen 0
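Since the question asks for something that does not repeat replace, one possible single-pass variant (assuming only the output column should change and that 'KAB' is the only value that maps to 1) is:
# 1 where output is 'KAB', 0 otherwise; only the 'output' column is touched
df['output'] = df['output'].eq('KAB').astype(int)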
I have a dataframe that looks like this:
team_1 score_1 team_2 score_2
AUS 2 SCO 1
ENG 1 ARG 0
JPN 0 ENG 2
I can retrieve all the data for a single team by using:
# list specifying the team of interest
team = ['ENG']
# slice the dataframe to only the rows where 'team_1' or 'team_2' is in the list 'team'
df.loc[df['team_1'].isin(team) | df['team_2'].isin(team)]
team_1 score_1 team_2 score_2
ENG 1 ARG 0
JPN 0 ENG 2
How can I now return only the score for my 'team' such as:
team score
ENG 1
ENG 2
Maybe I should create an index for each team so I can filter on it? Or encode the team_1 and team_2 columns to filter on?
new_df_1 = df[df.team_1 == 'ENG'][['team_1', 'score_1']]
new_df_1 = new_df_1.rename(columns={"team_1": "team", "score_1": "score"})
#   team  score
# 1  ENG      1

new_df_2 = df[df.team_2 == 'ENG'][['team_2', 'score_2']]
new_df_2 = new_df_2.rename(columns={"team_2": "team", "score_2": "score"})
#   team  score
# 2  ENG      2
Then concat the two DataFrames:
pd.concat([new_df_1, new_df_2])
The output is:
  team  score
1  ENG      1
2  ENG      2
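If a fresh 0-based index is preferred for the combined frame, concat can reset it directly:
pd.concat([new_df_1, new_df_2], ignore_index=True)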
Melt the team columns, filter for the values in team, take the score from whichever side matched, and keep only the team and score columns:
import numpy as np

team = ["ENG"]
cols = ["score_1", "score_2"]   # identifier columns kept while melting

(
    df
    .melt(cols, value_name="team")
    .query("team in @team")
    .assign(score=lambda x: np.where(x["variable"].eq("team_1"),
                                     x["score_1"], x["score_2"]))
    .loc[:, ["team", "score"]]
)
team score
1 ENG 1
5 ENG 2
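A possible alternative reshape, sketched here and not part of the original answer, is pd.wide_to_long, which pairs each team_n column with its score_n directly:
team = ['ENG']
# reset_index provides the unique id column that wide_to_long needs
long_df = pd.wide_to_long(df.reset_index(), ['team', 'score'],
                          i='index', j='side', sep='_')
long_df.query("team in @team")[['team', 'score']]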
So here is my data in pandas
Movie Tags
0 War film tank;plane
1 Spy film car;plane
I would like to create new indicator columns (0/1) from the Tags column and add a prefix like 'T_' to the column names.
Like:
Movie Tags T_tank T_plane T_car
0 War film tank;plane 1 1 0
1 Spy film car;plane 0 1 1
I have some ideas on how to do it, like going line by line with a split(";") and a df.loc[:, 'T_plane'] assignment, for example, but I think that may not be the optimal way to do it.
Regards
Using the sklearn library:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
res = df.join(pd.DataFrame(mlb.fit_transform(df['Tags'].str.split(';')),
columns=mlb.classes_).add_prefix('T_'))
print(res)
Movie Tags T_car T_plane T_tank
0 War film tank;plane 0 1 1
1 Spy film car;plane 1 1 0
With .str.get_dummies
df.join(df.Tags.str.get_dummies(';').add_prefix('T_'))
Movie Tags T_car T_plane T_tank
0 War film tank;plane 0 1 1
1 Spy film car;plane 1 1 0
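For reference, both snippets above can be run against a reconstruction of the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'Movie': ['War film', 'Spy film'],
                   'Tags': ['tank;plane', 'car;plane']})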
I have two dataframes I am working with, one which contains a list of players and another that contains play by play data for the players from the other dataframe. Portions of the rows of interest within these two dataframes are shown below.
0 Matt Carpenter
1 Jason Heyward
2 Peter Bourjos
3 Matt Holliday
4 Jhonny Peralta
5 Matt Adams
...
Name: Name, dtype: object
0 Matt Carpenter grounded out to second (Grounder).
1 Jason Heyward doubled to right (Liner).
2 Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think you need findall with a regex pattern built by joining all values of Name, then create indicator columns with MultiLabelBinarizer, and add any missing columns with reindex:
s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
columns=mlb.classes_,
index=df2.index).reindex(columns=s, fill_value=0)
print (df)
Name Matt Carpenter scored Jason Heyward scored Peter Bourjos scored \
0 0 0 0
1 0 0 0
2 0 1 0
Name Matt Holliday scored Jhonny Peralta scored Matt Adams scored
0 0 0 0
1 0 0 0
2 0 0 0
Last, if necessary, join to df2:
df = df2.join(df)
print (df)
Play Matt Carpenter scored \
0 Matt Carpenter grounded out to second (Grounder). 0
1 Jason Heyward doubled to right (Liner). 0
2 Matt Holliday singled to right (Liner). Jason ... 0
Jason Heyward scored Peter Bourjos scored Matt Holliday scored \
0 0 0 0
1 0 0 0
2 1 0 0
Jhonny Peralta scored Matt Adams scored
0 0 0
1 0 0
2 0 0
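If the per-player count asked for in the question is the end goal rather than the per-play indicators, a simpler sketch (the column name 'R vs SP' comes from the question; player names are assumed to contain no regex metacharacters, so regex matching is turned off) is:
# count plays mentioning '<Name> scored' for each player in df1
df1['R vs SP'] = [df2['Play'].str.contains(name + ' scored', regex=False).sum()
                  for name in df1['Name']]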
I have a tab-delimited file with movie genre and year in 2 columns:
Comedy 2013
Comedy 2014
Drama 2012
Mystery 2011
Comedy 2013
Comedy 2013
Comedy 2014
Comedy 2013
News 2012
Sport 2012
Sci-Fi 2013
Comedy 2014
Family 2013
Comedy 2013
Drama 2013
Biography 2013
I want to group the genres together by year and print out in the following format (does not have to be in alphabetical order):
Year 2011 2012 2013 2014
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
How should I approach it? At the moment I'm creating my output through MS Excel, but I would like to do it through Python.
If you don't want to use pandas, you can do it as follows:
from collections import Counter

# load file
with open('tab.txt') as f:
    lines = f.read().split('\n')

# drop empty lines and replace separating whitespace with exactly one space
lines = [' '.join(l.split()) for l in lines if l.strip()]

# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))

# count genre-year combinations
C = Counter(lines)

# print table
print("Year".ljust(10), end=' ')
for y in years:
    print(y.rjust(6), end=' ')
print()
for g in genres:
    print(g.ljust(10), end=' ')
    for y in years:
        print(str(C[g + ' ' + y]).rjust(6), end=' ')
    print()
The most interesting function is probably Counter, which counts the number of occurrences of each element. To make sure that the length of the separating whitespace does not influence the counting, I replace it with a single space beforehand.
The easiest way to do this is using the pandas library, which provides lots of ways of interacting with tables of data:
import pandas as pd

# sample data copied onto the clipboard; the two columns have no header row
df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc='size', fill_value=0)
Output:
year 2011 2012 2013 2014
genre
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
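The snippet above reads the sample data from the clipboard; to read the actual tab-delimited file instead (the filename here is an assumption), the rest stays the same:
import pandas as pd

# read the tab-separated file; the sample data has no header row
df = pd.read_csv('movies.tsv', sep='\t', names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc='size', fill_value=0)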
If you're only just starting with Python, you might find trying to learn pandas is a bit too much on top of learning the language, but once you have some Python knowledge, pandas provides very intuitive ways to interact with data.