I have a dataframe with data in a format similar to this:
   song    lyric                                     tokenized_lyrics
0  Song 1  Look at her face, it's a wonderful face  [look, at, her, face, it's, a, wonderful, face]
1  Song 2  Some lyrics of the song taken            [Some, lyrics, of, the, song, taken]
I want to count the number of words in the lyrics per song, with output like:
song    count
Song 1  8
Song 2  6
I tried an aggregate function, but it is not yielding the correct result. Code I tried:
df.groupby(['song']).agg(
    word_count=pd.NamedAgg(column='text', aggfunc='count')
)
How can I achieve the desired result?
I couldn't copy tokenized_lyrics as a list; it came in as a string, so I tokenized the lyrics myself, with the assumption that the delimiter is whitespace:
df['token_count'] = df.lyric.str.replace(',','').str.split().str.len()
df.filter(['song','token_count'])
song token_count
0 Song 1 8
1 Song 2 6
Note that you can also apply str.len directly to the tokenized lyrics to get your count: since each entry is a list, it will count the individual items.
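A self-contained sketch of this split-and-count approach, with the sample data inlined (the lyric strings are assumed from the question):

```python
import pandas as pd

# Sample data assumed from the question; lyrics are plain strings.
df = pd.DataFrame({
    "song": ["Song 1", "Song 2"],
    "lyric": ["Look at her face, it's a wonderful face",
              "Some lyrics of the song taken"],
})

# Strip commas so "face," counts as one word, then split on whitespace
# and take the length of each resulting list.
df["token_count"] = df["lyric"].str.replace(",", "").str.split().str.len()
print(df[["song", "token_count"]])  # Song 1 -> 8, Song 2 -> 6
```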
Use Series.str.len to count the values; if there are duplicated song values, aggregate with sum:
df1 = (df.assign(count = df['tokenized_lyrics'].str.len())
.groupby('song', as_index=False)['count'].sum())
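For comparison, a runnable sketch of the str.len approach, assuming tokenized_lyrics holds real Python lists rather than strings:

```python
import pandas as pd

# Sample data with tokenized_lyrics as actual lists (assumed from the question).
df = pd.DataFrame({
    "song": ["Song 1", "Song 2"],
    "tokenized_lyrics": [
        ["look", "at", "her", "face", "it's", "a", "wonderful", "face"],
        ["Some", "lyrics", "of", "the", "song", "taken"],
    ],
})

# str.len counts the elements of each list; the groupby-sum handles
# songs that appear on more than one row.
df1 = (df.assign(count=df["tokenized_lyrics"].str.len())
         .groupby("song", as_index=False)["count"].sum())
print(df1)  # Song 1 -> 8, Song 2 -> 6
```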
user_id  partner_name  order_sequence
2        Star Bucks    1
2        KFC           2
2        MCD           3
6        Coffee Store  1
6        MCD           2
9        KFC           1
I am trying to figure out which two-restaurant combinations occur most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array containing [[Star Bucks, KFC], [KFC, MCD]].
However, each time the user_id changes (for instance, between rows 3 and 4), the code should skip adding that combination.
Also, if a user has only one entry in the table (for instance, the user with user_id 9), then this user should not be added to the list, because he did not visit two or more restaurants.
The final result I am expecting for this table are:
[[Star Bucks, KFC], [KFC,MCD], [Coffee Store, MCD]]
I have written the following lines of code but so far, I am unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx, x in enumerate(df['order_sequence']):
    if x != 1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx + 1])
        arr2.append(arr1)
You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
      .map(lambda l: list(zip(l, l[1:])))
      .sum()
)
with the same result.
It might be that you have to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
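Putting the pieces together, a self-contained sketch on the sample table (the dataframe construction is assumed from the question):

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "user_id": [2, 2, 2, 6, 6, 9],
    "partner_name": ["Star Bucks", "KFC", "MCD", "Coffee Store", "MCD", "KFC"],
    "order_sequence": [1, 2, 3, 1, 2, 1],
})

# Make sure visits are in order within each user.
df = df.sort_values(["user_id", "order_sequence"])

# Pair each visit with the next one inside the same user group;
# users with a single visit contribute no pairs.
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
print(res)  # [('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
```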
I have a dataframe idf as below:
feature_name idf_weights
2488 kralendijk 11.221923
3059 night 0
1383 ebebf 0
I have another Dataframe df
message Number of Words in each message
0 night kralendijk ebebf 3
For each word of each message in df, I want to look up its idf weight in idf, and add a new column counting the words with idf_score > 0.
The output will look like the below:
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
Here is what I've tried so far, but it gives the total count of words instead of the count of words with idf_weight > 0:
words_weights = dict(idf[['feature_name', 'idf_weights']].values)
df['> zero'] = df['message'].apply(lambda x: len([words_weights.get(word, 11.221923) for word in x.split()]))
Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 3
Thank you.
Try using a list comprehension:
# set up a dictionary for easy feature->weight indexing
d = idf.set_index('feature_name')['idf_weights'].to_dict()
# {'kralendijk': 11.221923, 'night': 0.0, 'ebebf': 0.0}
df['> zero'] = [sum(d.get(w, 0)>0 for w in x.split()) for x in df['message']]
## OR, slightly faster alternative
# df['> zero'] = [sum(1 for w in x.split() if d.get(w, 0)>0) for x in df['message']]
output:
message Number of Words in each message > zero
0 night kralendijk ebebf 3 1
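A self-contained version of the dictionary-lookup approach, with both sample frames inlined (values assumed from the question):

```python
import pandas as pd

# Sample frames assumed from the question.
idf = pd.DataFrame({
    "feature_name": ["kralendijk", "night", "ebebf"],
    "idf_weights": [11.221923, 0.0, 0.0],
})
df = pd.DataFrame({"message": ["night kralendijk ebebf"]})

# Map each feature name to its weight, then count the words in each
# message whose weight is greater than zero (unknown words count as 0).
d = idf.set_index("feature_name")["idf_weights"].to_dict()
df["> zero"] = [sum(d.get(w, 0) > 0 for w in x.split()) for x in df["message"]]
print(df["> zero"].tolist())  # [1]
```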
You can use str.findall: the goal here is to create a list of feature names with a weight greater than 0 to find in each message.
pattern = fr"({'|'.join(idf.loc[idf['idf_weights'] > 0, 'feature_name'])})"
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()
print(df)
# Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
I have a dataframe with information like the below stored in one column:
>>> Results.Category[:5]
0 issue delivery wrong master account
1 data wrong master account batch
2 order delivery wrong data account
3 issue delivery wrong master account
4 delivery wrong master account batch
Name: Category, dtype: object
Now I want to keep each word only in the first row of the Category column where it appears.
For example:
In the first row the word "wrong" is present, so I want to remove it from all of the remaining rows and keep "wrong" in the first row only.
In the second row the word "data" is available, so I want to remove it from all of the remaining rows and keep "data" in the second row only.
I found that duplicates within a row can be removed using the below, but I need to remove duplicate words across rows. Can anyone please help me here?
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))
It seems you want something like,
out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)
df['FinalCategoryN'] = out
df
Category FinalCategoryN
0 issue delivery wrong master account issue delivery wrong master account
1 data wrong master account batch data batch
2 order delivery wrong data account order
3 issue delivery wrong master account
4 delivery wrong master account batch
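Here is that loop as a self-contained script on the sample column (dataframe construction assumed):

```python
import pandas as pd

# Sample Category column from the question.
df = pd.DataFrame({"Category": [
    "issue delivery wrong master account",
    "data wrong master account batch",
    "order delivery wrong data account",
    "issue delivery wrong master account",
    "delivery wrong master account batch",
]})

out = []
seen = set()  # every word observed in earlier rows
for c in df["Category"]:
    words = c.split()
    # keep only words not seen in any earlier row, preserving order
    out.append(" ".join(w for w in words if w not in seen))
    seen.update(words)

df["FinalCategoryN"] = out
print(df["FinalCategoryN"].tolist())
# ['issue delivery wrong master account', 'data batch', 'order', '', '']
```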
If you don't care about the ordering, you can use set logic:
u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')
0 account delivery issue master wrong
1 batch data
2 order
3
4
Name: Category, dtype: object
In your case you need to split it first, then remove duplicates with drop_duplicates:
df.c.str.split(expand=True).stack().drop_duplicates().\
groupby(level=0).apply(','.join).reindex(df.index)
Out[206]:
0 issue,delivery,wrong,master,account
1 data,batch
2 order
3 NaN
4 NaN
dtype: object
What you want cannot be vectorized, so let us forget about pandas and use a Python set:
total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))
You get that list: ['wrong issue master delivery account', 'batch data', 'order', '', '']
You can use it to populate a dataframe column:
AFResults['FinalCategoryN'] = result
Use apply with set and sorted, keyed by str.index to preserve the original word order, then str.join:
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))
I have a dataframe df with a column named title.
title
I have a pen tp001
I have rt0024 apple
I have wtw003 orange
I need to return the new title as the following (tokens that begin with a letter and end with a digit):
title
tp001
rt0024
wtw003
So I used df['new_title'] = df['title'].str.extract(r'^[a-z].*\d$'), but it didn't work. The error is: ValueError: pattern contains no capture groups
I updated the question; each token has a different number of letters and digits.
You can use:
df['title'] = df['title'].str.extract(r'(\w+\d+)', expand=False)
>>> df
title
0 tp001
1 rt0024
2 wtw003
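A runnable sketch showing why the fix works: str.extract requires at least one capture group, so the pattern is wrapped in parentheses (sample titles assumed from the question):

```python
import pandas as pd

# Sample titles from the question.
df = pd.DataFrame({"title": [
    "I have a pen tp001",
    "I have rt0024 apple",
    "I have wtw003 orange",
]})

# (\w+\d+) captures a run of word characters that ends in digits,
# regardless of how many letters or digits each token has.
df["title"] = df["title"].str.extract(r"(\w+\d+)", expand=False)
print(df["title"].tolist())  # ['tp001', 'rt0024', 'wtw003']
```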
By using extract
df.title.str.extract(r'([a-z]{2}[0-9]{3})',expand=True)
Out[250]:
0
0 tp001
1 rt002
2 wt003
I am new to Python/pandas and have a data frame with two columns, one a series and another a string.
I am looking to split the contents of a column (Series) into multiple columns. Appreciate your inputs in this regard.
This is my current dataframe content
Songdetails Density
0 ["'t Hof Van Commerce", "Chance", "SORETGR12AB... 4.445323
1 ["-123min.", "Try", "SOERGVA12A6D4FEC55"] 3.854437
2 ["10_000 Maniacs", "Please Forgive Us (LP Vers... 3.579846
3 ["1200 Micrograms", "ECSTACY", "SOKYOEA12AB018... 5.503980
4 ["13 Cats", "Please Give Me Something", "SOYLO... 2.964401
5 ["16 Bit Lolitas", "Tim Likes Breaks (intermez... 5.564306
6 ["23 Skidoo", "100 Dark", "SOTACCS12AB0185B85"] 5.572990
7 ["2econd Class Citizen", "For This We'll Find ... 3.756746
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
The desired output is SONG, ARTIST, SONG ID, DENSITY, i.e. split the song details into columns.
for e.g. for the sample data
SONG DETAILS DENSITY
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
SONG ARTIST SONG ID DENSITY
2tall Demonstration SOYYQZR12A8C144F9D 5.472524
Thanks
The following worked for me:
In [275]:
pd.DataFrame(data = list(df['Song details'].values), columns = ['Song', 'Artist', 'Song Id'])
Out[275]:
Song Artist Song Id
0 2tall Demonstration SOYYQZR12A8C144F9D
1 2tall Demonstration SOYYQZR12A8C144F9D
For you please try: pd.DataFrame(data = list(df['Songdetails'].values), columns = ['SONG', 'ARTIST', 'SONG ID'])
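If Songdetails actually holds strings that merely look like lists, one hedged sketch is to parse them with ast.literal_eval first (column names and the sample row are assumed from the question):

```python
import ast
import pandas as pd

# One sample row assumed from the question; Songdetails is a string.
df = pd.DataFrame({
    "Songdetails": ['["2tall", "Demonstration", "SOYYQZR12A8C144F9D"]'],
    "Density": [5.472524],
})

# Parse each string into a real Python list, then spread into columns.
parsed = df["Songdetails"].apply(ast.literal_eval)
out = pd.DataFrame(parsed.tolist(), columns=["SONG", "ARTIST", "SONG ID"])
out["DENSITY"] = df["Density"]
print(out)
```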
Thank you, I had to do an insert of a column into the new data frame and was able to achieve what I needed. Thanks!
df2 = pd.DataFrame(series.apply(lambda x: pd.Series(x.split(','))))
df2.insert(3, 'Density', finaldf['Density'])