Create new indicator columns from a delimited column - python

So here is my data in pandas:

      Movie        Tags
0  War film  tank;plane
1  Spy film   car;plane

I would like to create new 0/1 columns from the Tags column, adding a prefix like 'T_' to the name of each new column, like:

      Movie        Tags  T_tank  T_plane  T_car
0  War film  tank;plane       1        1      0
1  Spy film   car;plane       0        1      1

I have some ideas on how to do it, like going line by line with a split(";") and a df.loc[:, 'T_plane'] for example, but I think that may not be the optimal way to do it.
Regards
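For reference, the line-by-line idea described in the question can be sketched like this (using the two-row sample data; the loop builds one indicator column per distinct tag):

```python
import pandas as pd

df = pd.DataFrame({"Movie": ["War film", "Spy film"],
                   "Tags": ["tank;plane", "car;plane"]})

# Collect every distinct tag, then build one 0/1 column per tag.
all_tags = sorted(set(t for tags in df["Tags"] for t in tags.split(";")))
for tag in all_tags:
    df["T_" + tag] = df["Tags"].str.split(";").apply(lambda ts: int(tag in ts))

print(df)
```

This works, but as the answers below show, vectorized one-liners exist for the same task.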

Using the sklearn library:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
res = df.join(pd.DataFrame(mlb.fit_transform(df['Tags'].str.split(';')),
                           columns=mlb.classes_).add_prefix('T_'))
print(res)

      Movie        Tags  T_car  T_plane  T_tank
0  War film  tank;plane      0        1       1
1  Spy film   car;plane      1        1       0

With .str.get_dummies:

df.join(df.Tags.str.get_dummies(';').add_prefix('T_'))

      Movie        Tags  T_car  T_plane  T_tank
0  War film  tank;plane      0        1       1
1  Spy film   car;plane      1        1       0
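Another pure-pandas route to the same result is a sketch with explode plus crosstab (note that pd.crosstab counts pairs, which yields 0/1 here because each tag appears at most once per row):

```python
import pandas as pd

df = pd.DataFrame({"Movie": ["War film", "Spy film"],
                   "Tags": ["tank;plane", "car;plane"]})

# One row per (movie, tag) pair, then cross-tabulate back to indicators.
exploded = df.assign(Tag=df["Tags"].str.split(";")).explode("Tag")
dummies = pd.crosstab(exploded.index, exploded["Tag"]).add_prefix("T_")
res = df.join(dummies)
print(res)
```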

Related

pandas dataframe label columns encoding

Have a pandas DataFrame with string input columns. df looks like:

news                           label1  label2   label3  label4
COVID Hospitalizations ....    health
will pets contract covid....   health  pets
High temperature will cause..  health  weather
...

Expected output:

news                           health  pets  weather  tech
COVID Hospitalizations ....         1     0        0     0
will pets contract covid....        1     1        0     0
High temperature will cause..       1     0        1     0
...
Currently I use sklearn:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df['labels'] = df[['label1','label2','label3','label4']].values.tolist()
mlb.fit(df['labels'])
temp = mlb.transform(df['labels'])
ff = pd.DataFrame(temp, columns=list(mlb.classes_))
df_final = pd.concat([df['news'], ff], axis=1)

This works so far. I am just wondering if there is a way to avoid using sklearn.preprocessing.MultiLabelBinarizer?
One idea is to join the values by | and then use Series.str.get_dummies:

# if there are missing values (NaNs)
# df = df.fillna('')
df_final = df.set_index('news').agg('|'.join, 1).str.get_dummies().reset_index()
print(df_final)

                            news  health  pets  weather
0    COVID Hospitalizations ....       1     0        0
1   will pets contract covid....       1     1        0
2  High temperature will cause..       1     0        1
Or use get_dummies:

df_final = (pd.get_dummies(df.set_index('news'), prefix='', prefix_sep='')
              .groupby(level=0, axis=1)
              .max()
              .reset_index())
# the second column name is an empty string - that is the difference
# from the solution above
print(df_final)

                            news     health  pets  weather
0    COVID Hospitalizations ....  1       1     0        0
1   will pets contract covid....  1       1     1        0
2  High temperature will cause..  1       1     0        1
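If the empty-string column is unwanted, one way is to drop it afterwards. A sketch with a hypothetical two-row frame (the blank label reproduces the empty-string column; `.T.groupby(level=0).max().T` stands in for the axis=1 groupby, which newer pandas deprecates):

```python
import pandas as pd

# Hypothetical data: a blank label2 entry produces the '' dummy column.
df = pd.DataFrame({"news": ["a", "b"],
                   "label1": ["health", "health"],
                   "label2": ["", "pets"]})

out = (pd.get_dummies(df.set_index("news"), prefix="", prefix_sep="")
         .T.groupby(level=0).max().T   # merge duplicate column names
         .astype(int))
out = out.drop(columns="", errors="ignore").reset_index()
print(out)
```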

How to append to a table with a condition and re-rank it with pandas

I have a data frame like this, for example:

  user  Top Genre
     a     Horror
     b    Romance

and I have the content-based table for each genre, for example:

     Genre     Rec  Rank
    Horror  Action     1
    Horror  Comedy     2
   Romance   Asian     1
   Romance  Comedy     2

I want to join the tables so the output will be, for example:

  User      Rec  Rank
     a   Horror     1
     a   Action     2
     a   Comedy     3
     b  Romance     1
     b    Asian     2
     b   Comedy     3

How do I process the two tables so that the output looks like the table above with pandas?
Use DataFrame.merge with a right join, append the same DataFrame extended with the new columns via DataFrame.assign, sort by both columns, and last add 1 to Rank (pd.concat is used here because DataFrame.append was removed in pandas 2.0):

df11 = df1.rename(columns={'Top Genre': 'Genre'})
df = pd.concat([df11.merge(df2, how='right'),
                df11.assign(Rec=df11['Genre'], Rank=0)])
df = df.sort_values(['user', 'Rank'], ignore_index=True)
df['Rank'] += 1
print(df)

  user    Genre      Rec  Rank
0    a   Horror   Horror     1
1    a   Horror   Action     2
2    a   Horror   Comedy     3
3    b  Romance  Romance     1
4    b  Romance    Asian     2
5    b  Romance   Comedy     3

How to extract the first occurrence of Mandarin characters from a column in pandas and put it in another column

I have a dataframe df:

import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                   "eng_mand": ["後山 4.7·3 reviews Community Center 竹杉園休閒農場",
                                "Taipei City 台北市Taiwan",
                                "綠山谷海芋園餐廳 3.8·52 reviews",
                                "名陽匍休閒農莊minyangpu大賞園",
                                "Menghuanhu"]})

It looks like:

   ID                                   eng_mand
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場
1   2                      Taipei City 台北市Taiwan
2   3                     綠山谷海芋園餐廳 3.8·52 reviews
3   4                        名陽匍休閒農莊minyangpu大賞園
4   5                                 Menghuanhu

I want to extract the first occurrence of the Mandarin characters from the column eng_mand and put it in another column, mandarin_char. My final output must look like:

   ID                                   eng_mand mandarin_char
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場            後山
1   2                      Taipei City 台北市Taiwan           台北市
2   3                     綠山谷海芋園餐廳 3.8·52 reviews      綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園        名陽匍休閒農莊
4   5                                 Menghuanhu

How can I do this in python - pandas?
Use str.extract with a pattern matching a run of Chinese characters, and add fillna to replace NaNs with empty strings if necessary:

df['mandarin_char'] = df['eng_mand'].str.extract(r'([\u4e00-\u9fff]+)').fillna('')
print(df)

   ID                                   eng_mand mandarin_char
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場            後山
1   2                      Taipei City 台北市Taiwan           台北市
2   3                     綠山谷海芋園餐廳 3.8·52 reviews      綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園        名陽匍休閒農莊
4   5                                 Menghuanhu
Use str.findall and pass the regex for the Mandarin range:

In [14]:
df['mandarin_char'] = df['eng_mand'].str.findall(r'[\u4e00-\u9fff]+').str[0]
df

Out[14]:
   ID                                   eng_mand mandarin_char
0   1  後山 4.7·3 reviews Community Center 竹杉園休閒農場            後山
1   2                      Taipei City 台北市Taiwan           台北市
2   3                     綠山谷海芋園餐廳 3.8·52 reviews      綠山谷海芋園餐廳
3   4                        名陽匍休閒農莊minyangpu大賞園        名陽匍休閒農莊
4   5                                 Menghuanhu           NaN

You can call fillna('') on the result to replace NaN if required.
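One caveat on the regex used in both answers: \u4e00-\u9fff covers only the basic CJK Unified Ideographs block. A sketch that additionally matches Extension A (U+3400-U+4DBF); further extensions sit outside the BMP and would need \U0001xxxx escapes:

```python
import re

# Basic CJK Unified Ideographs plus Extension A.
pattern = re.compile(r'[\u3400-\u4dbf\u4e00-\u9fff]+')

match = pattern.search("Taipei City 台北市Taiwan")
print(match.group(0) if match else "")  # first run of CJK characters
```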

Counting the occurrences of a substring from one column within another column

I have two dataframes I am working with: one contains a list of players, and the other contains play-by-play data for the players from the first dataframe. Portions of the rows of interest within these two dataframes are shown below.

0    Matt Carpenter
1     Jason Heyward
2     Peter Bourjos
3     Matt Holliday
4    Jhonny Peralta
5        Matt Adams
...
Name: Name, dtype: object

0    Matt Carpenter grounded out to second (Grounder).
1              Jason Heyward doubled to right (Liner).
2    Matt Holliday singled to right (Liner). Jason Heyward scored.
...
Name: Play, dtype: object
What I am trying to do is create a column in the first dataframe that counts the number of occurrences of the string (df['Name'] + ' scored') in the column in the other dataframe. For example, it would search for instances of "Matt Carpenter scored", "Jason Heyward scored", etc. I know you can use str.contains to do this type of thing, but it only seems to work if you put in the explicit string. For example,
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains('Jason Heyward scored')].index)
works fine but if I try
batter_game_logs_df['R vs SP'] = len(play_by_play_SP_df[play_by_play_SP_df['Play'].str.contains(batter_game_logs_df['Name'].astype(str) + ' scored')].index)
it returns the error 'Series' objects are mutable, thus they cannot be hashed. I have looked at various similar questions but cannot find the solution to this problem for the life of me. Any assistance on this would be greatly appreciated, thank you!
I think you need findall with a regex built by joining all values of Name, then create the indicator columns with MultiLabelBinarizer, and add all missing columns with reindex:

from sklearn.preprocessing import MultiLabelBinarizer

s = df1['Name'] + ' scored'
pat = r'\b{}\b'.format('|'.join(s))

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df2['Play'].str.findall(pat)),
                  columns=mlb.classes_,
                  index=df2.index).reindex(columns=s, fill_value=0)
print(df)

Name  Matt Carpenter scored  Jason Heyward scored  Peter Bourjos scored  \
0                         0                     0                     0
1                         0                     0                     0
2                         0                     1                     0

Name  Matt Holliday scored  Jhonny Peralta scored  Matt Adams scored
0                        0                      0                  0
1                        0                      0                  0
2                        0                      0                  0

Last, if necessary, join to df2:

df = df2.join(df)
print(df)

                                                Play  Matt Carpenter scored  \
0  Matt Carpenter grounded out to second (Grounder).                      0
1            Jason Heyward doubled to right (Liner).                      0
2  Matt Holliday singled to right (Liner). Jason ...                      0

   Jason Heyward scored  Peter Bourjos scored  Matt Holliday scored  \
0                     0                     0                     0
1                     0                     0                     0
2                     1                     0                     0

   Jhonny Peralta scored  Matt Adams scored
0                      0                  0
1                      0                  0
2                      0                  0
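If all that is needed is a per-player count (as in the original R vs SP assignment) rather than indicator columns, a simpler sketch is to loop over the names with apply, shown here with a reduced three-row version of the sample data:

```python
import pandas as pd

batters = pd.DataFrame({"Name": ["Matt Carpenter", "Jason Heyward", "Matt Holliday"]})
plays = pd.DataFrame({"Play": [
    "Matt Carpenter grounded out to second (Grounder).",
    "Jason Heyward doubled to right (Liner).",
    "Matt Holliday singled to right (Liner). Jason Heyward scored.",
]})

# For each batter, count the plays whose text contains "<name> scored".
batters["R vs SP"] = batters["Name"].apply(
    lambda name: int(plays["Play"].str.contains(name + " scored", regex=False).sum()))
print(batters)
```

This avoids the "Series objects are mutable" error because each str.contains call receives a single string, not a Series.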

Python group by 2 columns, output multiple columns

I have a tab-delimited file with movie genre and year in 2 columns:
Comedy 2013
Comedy 2014
Drama 2012
Mystery 2011
Comedy 2013
Comedy 2013
Comedy 2014
Comedy 2013
News 2012
Sport 2012
Sci-Fi 2013
Comedy 2014
Family 2013
Comedy 2013
Drama 2013
Biography 2013
I want to group the genres together by year and print out in the following format (does not have to be in alphabetical order):
Year       2011  2012  2013  2014
Biography     0     0     1     0
Comedy        0     0     5     3
Drama         0     1     1     0
Family        0     0     1     0
Mystery       1     0     0     0
News          0     1     0     0
Sci-Fi        0     0     1     0
Sport         0     1     0     0
How should I approach it? At the moment I'm creating my output through MS Excel, but I would like to do it through Python.
If you don't want to use pandas, you can do it as follows:

from collections import Counter

# load file
with open('tab.txt') as f:
    lines = f.read().split('\n')

# replace separating whitespace with exactly one space (and skip blank lines)
lines = [' '.join(l.split()) for l in lines if l.strip()]

# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))

# count genre-year combinations
C = Counter(lines)

# print table
print('Year'.ljust(10), end='')
for y in years:
    print(y.rjust(6), end='')
print()
for g in genres:
    print(g.ljust(10), end='')
    for y in years:
        print(str(C[g + ' ' + y]).rjust(6), end='')
    print()
The most interesting function is probably Counter, which counts the number of occurrences of each element. To make sure that the length of the separating whitespace does not influence the counting, I replace it with a single space beforehand.
The easiest way to do this is using the pandas library, which provides lots of ways of interacting with tables of data:
df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc=len, fill_value=0)
Output:
year       2011  2012  2013  2014
genre
Biography     0     0     1     0
Comedy        0     0     5     3
Drama         0     1     1     0
Family        0     0     1     0
Mystery       1     0     0     0
News          0     1     0     0
Sci-Fi        0     0     1     0
Sport         0     1     0     0
If you're only just starting with Python, you might find trying to learn pandas is a bit too much on top of learning the language, but once you have some Python knowledge, pandas provides very intuitive ways to interact with data.
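The same table can also be produced with groupby/size/unstack; a self-contained sketch with the sample data inlined instead of read from the clipboard:

```python
import io
import pandas as pd

data = """Comedy\t2013
Comedy\t2014
Drama\t2012
Mystery\t2011
Comedy\t2013
Comedy\t2013
Comedy\t2014
Comedy\t2013
News\t2012
Sport\t2012
Sci-Fi\t2013
Comedy\t2014
Family\t2013
Comedy\t2013
Drama\t2013
Biography\t2013
"""

df = pd.read_csv(io.StringIO(data), sep="\t", names=["genre", "year"])
# Count each (genre, year) pair, then pivot years into columns.
table = df.groupby(["genre", "year"]).size().unstack(fill_value=0)
print(table)
```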
