I have a dataframe with data in a format similar to this:
   song    lyric                                     tokenized_lyrics
0  Song 1  Look at her face, it's a wonderful face  [look, at, her, face, it's, a, wonderful, face]
1  Song 2  Some lyrics of the song taken            [Some, lyrics, of, the, song, taken]
I want to count the number of words in the lyrics per song, with output like:
song    count
Song 1  8
Song 2  6
I tried an aggregate function, but it is not yielding the correct result. Code I tried:
df.groupby(['song']).agg(
    word_count=pd.NamedAgg(column='text', aggfunc='count')
)
How can I achieve the desired result?
I couldn't copy tokenized_lyrics as a list; it came in as a string, so I tokenized the lyrics myself, with the assumption that the delimiter is whitespace:
df['token_count'] = df.lyric.str.replace(',','').str.split().str.len()
df.filter(['song','token_count'])
song token_count
0 Song 1 8
1 Song 2 6
Note that you can also apply str.len directly to the tokenized lyrics to get your count: since each entry is a list, it will count the individual items.
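A self-contained sketch of this split-and-count approach, with the sample data inlined (the lyric strings are assumed from the question):

```python
import pandas as pd

# Sample data assumed from the question; lyrics are plain strings.
df = pd.DataFrame({
    "song": ["Song 1", "Song 2"],
    "lyric": ["Look at her face, it's a wonderful face",
              "Some lyrics of the song taken"],
})

# Strip commas so "face," counts as one word, then split on whitespace
# and take the length of each resulting list.
df["token_count"] = df["lyric"].str.replace(",", "").str.split().str.len()
print(df[["song", "token_count"]])  # Song 1 -> 8, Song 2 -> 6
```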
Use Series.str.len to count the values; if there are duplicated song values, aggregate with sum:
df1 = (df.assign(count = df['tokenized_lyrics'].str.len())
.groupby('song', as_index=False)['count'].sum())
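For comparison, a runnable sketch of the str.len approach, assuming tokenized_lyrics holds real Python lists rather than strings:

```python
import pandas as pd

# Sample data with tokenized_lyrics as actual lists (assumed from the question).
df = pd.DataFrame({
    "song": ["Song 1", "Song 2"],
    "tokenized_lyrics": [
        ["look", "at", "her", "face", "it's", "a", "wonderful", "face"],
        ["Some", "lyrics", "of", "the", "song", "taken"],
    ],
})

# str.len counts the elements of each list; the groupby-sum handles
# songs that appear on more than one row.
df1 = (df.assign(count=df["tokenized_lyrics"].str.len())
         .groupby("song", as_index=False)["count"].sum())
print(df1)  # Song 1 -> 8, Song 2 -> 6
```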
user_id  partner_name  order_sequence
2        Star Bucks    1
2        KFC           2
2        MCD           3
6        Coffee Store  1
6        MCD           2
9        KFC           1
I am trying to figure out which two-restaurant combinations occur most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array containing [[Star Bucks, KFC], [KFC, MCD]].
However, each time the user_id changes (for instance, between rows 3 and 4), the code should skip adding that combination.
Also, if a user has only one entry in the table (for instance, the user with user_id 9), then this user should not be added to the list, because he did not visit two or more restaurants.
The final result I am expecting for this table are:
[[Star Bucks, KFC], [KFC,MCD], [Coffee Store, MCD]]
I have written the following lines of code but so far, I am unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx, x in enumerate(df['order_sequence']):
    if x != 1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx + 1])
        arr2.append(arr1)
You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
      .map(lambda l: list(zip(l, l[1:])))
      .sum()
)
with the same result.
It might be that you have to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
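Putting the pieces together, a self-contained sketch on the sample table (the dataframe construction is assumed from the question):

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "user_id": [2, 2, 2, 6, 6, 9],
    "partner_name": ["Star Bucks", "KFC", "MCD", "Coffee Store", "MCD", "KFC"],
    "order_sequence": [1, 2, 3, 1, 2, 1],
})

# Make sure visits are in order within each user.
df = df.sort_values(["user_id", "order_sequence"])

# Pair each visit with the next one inside the same user group;
# users with a single visit contribute no pairs.
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
print(res)  # [('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
```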
I have a dataframe idf as below:
feature_name idf_weights
2488 kralendijk 11.221923
3059 night 0
1383 ebebf 0
I have another Dataframe df
message Number of Words in each message
0 night kralendijk ebebf 3
For each word of each message in df, I want to look up its idf weight in idf, and add a new column counting the words with idf_score > 0.
The output will look like the below:
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
Here is what I've tried so far, but it gives the total count of words instead of the count of words with idf_weight > 0:
words_weights = dict(idf[['feature_name', 'idf_weights']].values)
df['> zero'] = df['message'].apply(lambda x: len([words_weights.get(word, 11.221923) for word in x.split()]))
Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 3
Thank you.
Try using a list comprehension:
# set up a dictionary for easy feature->weight indexing
d = idf.set_index('feature_name')['idf_weights'].to_dict()
# {'kralendijk': 11.221923, 'night': 0.0, 'ebebf': 0.0}
df['> zero'] = [sum(d.get(w, 0)>0 for w in x.split()) for x in df['message']]
## OR, slightly faster alternative
# df['> zero'] = [sum(1 for w in x.split() if d.get(w, 0)>0) for x in df['message']]
output:
message Number of Words in each message > zero
0 night kralendijk ebebf 3 1
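A self-contained version of the dictionary-lookup approach, with both sample frames inlined (values assumed from the question):

```python
import pandas as pd

# Sample frames assumed from the question.
idf = pd.DataFrame({
    "feature_name": ["kralendijk", "night", "ebebf"],
    "idf_weights": [11.221923, 0.0, 0.0],
})
df = pd.DataFrame({"message": ["night kralendijk ebebf"]})

# Map each feature name to its weight, then count the words in each
# message whose weight is greater than zero (unknown words count as 0).
d = idf.set_index("feature_name")["idf_weights"].to_dict()
df["> zero"] = [sum(d.get(w, 0) > 0 for w in x.split()) for x in df["message"]]
print(df["> zero"].tolist())  # [1]
```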
You can use str.findall: the goal here is to create a list of feature names with a weight greater than 0 to find in each message.
pattern = fr"({'|'.join(idf.loc[idf['idf_weights'] > 0, 'feature_name'])})"
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()
print(df)
# Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
I have a dataframe with information like the below stored in one column:
>>> Results.Category[:5]
0 issue delivery wrong master account
1 data wrong master account batch
2 order delivery wrong data account
3 issue delivery wrong master account
4 delivery wrong master account batch
Name: Category, dtype: object
Now I want to keep each word only in the first row of the Category column where it appears.
For example:
In the first row the word "wrong" is present, so I want to remove it from all of the remaining rows and keep "wrong" in the first row only.
In the second row the word "data" is available, so I want to remove it from all of the remaining rows and keep "data" in the second row only.
I found that duplicates within a row can be removed using the below, but I need to remove duplicate words across rows. Can anyone please help me here?
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))
It seems you want something like,
out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)
df['FinalCategoryN'] = out
df
Category FinalCategoryN
0 issue delivery wrong master account issue delivery wrong master account
1 data wrong master account batch data batch
2 order delivery wrong data account order
3 issue delivery wrong master account
4 delivery wrong master account batch
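Here is that loop as a self-contained script on the sample column (dataframe construction assumed):

```python
import pandas as pd

# Sample Category column from the question.
df = pd.DataFrame({"Category": [
    "issue delivery wrong master account",
    "data wrong master account batch",
    "order delivery wrong data account",
    "issue delivery wrong master account",
    "delivery wrong master account batch",
]})

out = []
seen = set()  # every word observed in earlier rows
for c in df["Category"]:
    words = c.split()
    # keep only words not seen in any earlier row, preserving order
    out.append(" ".join(w for w in words if w not in seen))
    seen.update(words)

df["FinalCategoryN"] = out
print(df["FinalCategoryN"].tolist())
# ['issue delivery wrong master account', 'data batch', 'order', '', '']
```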
If you don't care about the ordering, you can use set logic:
u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')
0 account delivery issue master wrong
1 batch data
2 order
3
4
Name: Category, dtype: object
In your case you need to split it first, then remove duplicates with drop_duplicates:
df.c.str.split(expand=True).stack().drop_duplicates().\
groupby(level=0).apply(','.join).reindex(df.index)
Out[206]:
0 issue,delivery,wrong,master,account
1 data,batch
2 order
3 NaN
4 NaN
dtype: object
What you want cannot be vectorized, so let us forget about pandas and use a Python set:
total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))
You get that list: ['wrong issue master delivery account', 'batch data', 'order', '', '']
You can use it to populate a dataframe column:
AFResults['FinalCategoryN'] = result
Use apply with set and sorted, keyed by str.index to preserve the original word order, then str.join:
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))
I have a dataframe df with a column named title.
title
I have a pen tp001
I have rt0024 apple
I have wtw003 orange
I need to return the new title as the following (tokens that begin with a letter and end with a digit):
title
tp001
rt0024
wtw003
So I used df['new_title'] = df['title'].str.extract(r'^[a-z].*\d$'), but it didn't work. The error is: ValueError: pattern contains no capture groups
I updated the question; each token has a different number of letters and digits.
You can use:
df['title'] = df['title'].str.extract(r'(\w+\d+)', expand=False)
>>> df
title
0 tp001
1 rt0024
2 wtw003
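A runnable sketch showing why the fix works: str.extract requires at least one capture group, so the pattern is wrapped in parentheses (sample titles assumed from the question):

```python
import pandas as pd

# Sample titles from the question.
df = pd.DataFrame({"title": [
    "I have a pen tp001",
    "I have rt0024 apple",
    "I have wtw003 orange",
]})

# (\w+\d+) captures a run of word characters that ends in digits,
# regardless of how many letters or digits each token has.
df["title"] = df["title"].str.extract(r"(\w+\d+)", expand=False)
print(df["title"].tolist())  # ['tp001', 'rt0024', 'wtw003']
```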
By using extract
df.title.str.extract(r'([a-z]{2}[0-9]{3})',expand=True)
Out[250]:
0
0 tp001
1 rt002
2 wt003
I am new to Python/pandas and have a data frame with two columns, one a series and another a string.
I am looking to split the contents of a column (Series) into multiple columns. Appreciate your inputs in this regard.
This is my current dataframe content
Songdetails Density
0 ["'t Hof Van Commerce", "Chance", "SORETGR12AB... 4.445323
1 ["-123min.", "Try", "SOERGVA12A6D4FEC55"] 3.854437
2 ["10_000 Maniacs", "Please Forgive Us (LP Vers... 3.579846
3 ["1200 Micrograms", "ECSTACY", "SOKYOEA12AB018... 5.503980
4 ["13 Cats", "Please Give Me Something", "SOYLO... 2.964401
5 ["16 Bit Lolitas", "Tim Likes Breaks (intermez... 5.564306
6 ["23 Skidoo", "100 Dark", "SOTACCS12AB0185B85"] 5.572990
7 ["2econd Class Citizen", "For This We'll Find ... 3.756746
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
The desired output is SONG, ARTIST, SONG ID, DENSITY, i.e. split the song details into columns.
for e.g. for the sample data
SONG DETAILS DENSITY
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
SONG ARTIST SONG ID DENSITY
2tall Demonstration SOYYQZR12A8C144F9D 5.472524
Thanks
The following worked for me:
In [275]:
pd.DataFrame(data = list(df['Song details'].values), columns = ['Song', 'Artist', 'Song Id'])
Out[275]:
Song Artist Song Id
0 2tall Demonstration SOYYQZR12A8C144F9D
1 2tall Demonstration SOYYQZR12A8C144F9D
For you please try: pd.DataFrame(data = list(df['Songdetails'].values), columns = ['SONG', 'ARTIST', 'SONG ID'])
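If Songdetails actually holds strings that merely look like lists, one hedged sketch is to parse them with ast.literal_eval first (column names and the sample row are assumed from the question):

```python
import ast
import pandas as pd

# One sample row assumed from the question; Songdetails is a string.
df = pd.DataFrame({
    "Songdetails": ['["2tall", "Demonstration", "SOYYQZR12A8C144F9D"]'],
    "Density": [5.472524],
})

# Parse each string into a real Python list, then spread into columns.
parsed = df["Songdetails"].apply(ast.literal_eval)
out = pd.DataFrame(parsed.tolist(), columns=["SONG", "ARTIST", "SONG ID"])
out["DENSITY"] = df["Density"]
print(out)
```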
Thank you, I had to do an insert of a column into the new data frame and was able to achieve what I needed. Thanks!
df2 = pd.DataFrame(series.apply(lambda x: pd.Series(x.split(','))))
df2.insert(3, 'Density', finaldf['Density'])