How do you remove standalone letters from a column in Python?

I am currently working with lyric data of several artists.
When I was working with BTS lyric data, I noticed that the data had names of the member who sang that line at the front as you can see in the example below.
Artist Lyric
bts(방탄소년단) jungkook 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake love 가사 v jungkook 널 위해서라면 난 슬퍼도...
I tried removing their names with the str.replace() method.
However, one of the members' names is "v", and when I try to remove "v" it removes all v's from the column, as demonstrated below:
Artist Lyric
bts(방탄소년단) 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake lo e 가사 널 위해서라면 난 슬퍼도...
Is there any way to remove a "v" that stands alone while keeping the v's that are actually inside a word, such as the v in love?
Thank you in advance!!

If you want to remove all single-letter words from your data, say you have the following dataframe:
artist lyrics
0 sdsddssd fdgdg v sdsssdvsdd
1 sffxxvxv sddsdsdvdfdf
2 vxvxvxv fdagsgvs v v v
3 zcczc xdfsddfds
4 zcczc vxvxdfdvdvd
then
df['lyrics'].map(lambda x: ' '.join(word for word in x.split() if len(word) > 1))
returns:
0 fdgdg sdsssdvsdd
1 sddsdsdvdfdf
2 fdagsgvs
3 xdfsddfds
4 vxvxdfdvdvd
Name: lyrics, dtype: object
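The answer above removes every single-letter word. If you only want to drop specific member names such as "v", a word-boundary regex keeps the v inside words like love. A minimal sketch (the names list is illustrative; the column name Lyric is taken from the question):
import pandas as pd

df = pd.DataFrame({
    "Lyric": ["방탄소년단의 fake love 가사 v jungkook 널 위해서라면 난 슬퍼도"]
})

# \b is a word boundary, so r'\bv\b' matches a standalone "v"
# but not the "v" inside "love".
names = ["jungkook", "v"]  # names to strip; extend as needed
pattern = r"\b(?:" + "|".join(names) + r")\b"
df["Lyric"] = (
    df["Lyric"]
    .str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)  # collapse the leftover double spaces
    .str.strip()
)
print(df["Lyric"][0])  # 방탄소년단의 fake love 가사 널 위해서라면 난 슬퍼도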

Related

How to extract the top N rows from a dataframe with most frequent occurrences of a word in a list?

I have a Python dataframe with multiple rows and columns, a sample of which I have shared below -
DocName  Content
Doc1     Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2     The food is great. James loves his food. You not so much right ?
Doc3     Yeah he is alright.
I also have a list of 100 words as follows -
words = ["food", "you", ...]
Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc1 and once in Doc2.
Hence, desired output is :
[food:[doc2, doc1], you:[doc1, doc2], .....]
where N = 2 (top 2 rows having the most frequent occurrence of each word).
I have tried something as follows but unsure how to move further -
words = ["food", "you", ...]
result = []
for word in words:
    result.append(df.Content.apply(lambda row: row.count(word)))
How can I implement an efficient solution to the above requirement in Python ?
Second attempt (initially I misunderstood your requirements): with df being your dataframe, you could try something like:
words = ["food", "you"]
n = 2 # Number of top docs
res = (
    df
    .assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
    .explode("Content")
    .loc[lambda df: df["Content"].isin(set(words))]
    .groupby("DocName").value_counts().rename("Counts")
    .sort_values(ascending=False).reset_index(level=0)
    .assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
    .groupby(level=0).agg({"DocName": list})
    .assign(DocName=lambda df: df["DocName"].str[:n])
    .to_dict()["DocName"]
)
The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:
DocName Content
0 Doc1 you
0 Doc1 you
0 Doc1 food
1 Doc2 food
1 Doc2 food
1 Doc2 you
The next 2 lines count the words per doc (.groupby and .value_counts), sort the result by the counts in descending order (.sort_values), and append the count to the doc strings. For the sample:
DocName Counts
Content
you Doc1_2 2
food Doc2_2 2
food Doc1_1 1
you Doc2_1 1
Then .groupby the words (index) and put the respective docs in a list via .agg, and restrict the list to the n first items (.str[:n]). For the sample:
DocName
Content
food [Doc2_2, Doc1_1]
you [Doc1_2, Doc2_1]
Finally dumping the result in a dictionary.
Result for the sample dataframe
DocName Content
0 Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
1 Doc2 The food is great. James loves his food. You not so much right ?
2 Doc3 Yeah he is alright.
is
{'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}
It seems like this problem can be broken down into two sub-problems:
Get the frequency of words per "Content" cell
For each word in the list, extract the top N rows
Luckily, the first sub-problem has many neat approaches, as shown here. TL;DR: use the collections module (e.g. Counter) to do a frequency count; or, if you aren't allowed to import libraries, call .split() and count in a loop. But again, there are many potential solutions.
The second sub-problem is a bit trickier. From our first solution, what we have now is a dictionary of frequency counts per row. To get to our desired answer, the naive method would be to "query" every dictionary for the word in question, e.g. run
doc1.dict["food"]
doc2.dict["food"]
...
and compare the results in order.
There should be enough to get going, and also opportunity to find more streamlined/elegant solutions. Best of luck!
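Putting the two sub-problems together with collections.Counter, here is a minimal sketch (re-using the sample dataframe from above, and tokenising with \w+ so trailing punctuation like "food." does not skew the counts):
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "DocName": ["Doc1", "Doc2", "Doc3"],
    "Content": [
        "Hi how you are doing ? Hope you are well. I hear the food is great!",
        "The food is great. James loves his food. You not so much right ?",
        "Yeah he is alright.",
    ],
})
words = ["food", "you"]
n = 2

# Sub-problem 1: a frequency count per row.
counts = df["Content"].str.casefold().str.findall(r"\w+").apply(Counter)

# Sub-problem 2: for each word, rank the docs by that word's count
# and keep the top n doc names.
result = {
    word: list(
        df.assign(c=counts.apply(lambda cnt: cnt[word]))
          .sort_values("c", ascending=False)
          .head(n)["DocName"]
    )
    for word in words
}
print(result)  # {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}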

How to filter rows with non Latin characters

I am stuck on a problem with a dataframe that has a column of film names containing a bunch of non-Latin names, such as Japanese or Chinese (and maybe Russian too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I just want an output that removes every title with non-Latin characters, i.e. I want to remove every row containing characters like those in rows 3 and 4, so my desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
You can use str.match with the Latin-1 character range (\x00-\xFF) to identify non-Latin titles, and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
You can encode your title column with unicode_escape and then decode it as latin1. If this round trip does not match your original data, remove the row, because it contains some non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
You can use the isascii() method (if you're using Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
Even a single non-ASCII letter makes the isascii() method return False.
(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
We can easily make a function that returns whether a string is ASCII, and then filter our dataframe based on that.
import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    # str.isascii already returns a boolean, so return it directly
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng'] == True]
df2
Output:
   col1          col2  is_eng
0     1   I am legend    True
1     2  wonder women    True
4     5      dead sea    True
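Since str.isascii is itself a predicate, the helper function is optional; the same filter can be written in one line (a sketch, assuming the same df as above):
df2 = df[df['col2'].map(str.isascii)]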

My Regex to remove RT is not working for some reason

I am cleaning the tweet column of my dataframe with the following loop:
import re

for i in df.index:
    txt = df.loc[i]["tweet"]
    txt = re.sub(r'#[A-Z0-9a-z_:]+', '', txt)          # replace username-tags
    txt = re.sub(r'^[RT]+', '', txt)                   # replace RT-tags
    txt = re.sub(r'https?://[A-Za-z0-9./]+', '', txt)  # replace URLs
    txt = re.sub(r'[^a-zA-Z]', ' ', txt)               # replace hashtags and other non-letters
    df.at[i, "tweet"] = txt
However, running this does not remove the 'RT' tags. I would also like to remove the leading 'b' tag.
Raw result tweet column:
b Yal suppose you would people waiting for a tub of paint and garden furniture the league is gone and any that thinks anything else is a complete tool of a human who really needs to get down off that cloud lucky to have it back for
b RT watching porn aftern normal people is like no turn it off they don xe x x t love each other
b RT If not now when nn
b Used red wine as a chaser for Captain Morgan xe x x s Fun times
b RT shackattack Hold the front page s Lockdown property project sent me up the walls
Your regular expression is not working because the sign ^ anchors the match at the beginning of the string, but the two characters you want to remove are not at the beginning.
Change r'^[RT]+' to r'[RT]+' and the two letters will be removed. But be careful, because every other R and T will be removed, too.
If you want to remove the letter b as well, try r'^b\s([RT]+)?'.
I suggest you try it yourself on https://regex101.com/
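As a sketch, both fixes can be done in one pass with pandas; the pattern assumes the tweets really do start with a literal "b " marker, optionally followed by "RT ", as in the sample above:
# Strip a leading "b " and, when present, the "RT " that follows it.
df['tweet'] = df['tweet'].str.replace(r'^b\s+(?:RT\s+)?', '', regex=True)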

remove words starting with "#" in a column from a dataframe

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
it does not seem to work. Can somebody help?
Thanks in advance
You can use str.replace to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', '', regex=True)
The # can also be matched without escaping, as noted by baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', '', regex=True)
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a function rather than using a lambda, mainly for readability purposes.
def clean_text(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)
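Note that the str.replace approach above leaves a stray double space where each #word was removed (e.g. "News via :"). A sketch that also collapses the leftover whitespace:
tweetscrypto['clean_text'] = (
    tweetscrypto['text']
    .str.replace(r'#\w+', '', regex=True)   # drop the #words
    .str.replace(r'\s+', ' ', regex=True)   # collapse leftover whitespace
    .str.strip()
)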

Remove specific characters from a pandas column?

Hello, I have a dataframe where I want to remove the specific set of characters 'Fwd:' from every row that starts with it. The issue I am facing is that my code also strips leading letters from rows that merely start with the letter 'F'.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd: Please take action on the action needed items
4 Fix all the mistakes please
When i used the code:
df['Clean Summary'] = individual_receivers['summary'].map(lambda x: x.lstrip('Fwd:'))
I end up with a dataframe that looks like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 ix all the mistakes please
I don't want the last row to lose the F in 'Fix'.
You should use a regex, remembering that ^ anchors the match at the start of the string:
df['Clean Summary'] = df['summary'].str.replace(r'^Fwd: ', '', regex=True)
Here's an example:
df = pd.DataFrame({'msg':['Fwd: o','oe','Fwd: oj'],'B':[1,2,3]})
df['clean_msg'] = df['msg'].str.replace(r'^Fwd: ', '', regex=True)
print(df)
Output:
msg B clean_msg
0 Fwd: o 1 o
1 oe 2 oe
2 Fwd: oj 3 oj
You are not only losing 'F' but also 'w', 'd', and ':'. This is the way lstrip works: it strips all leading characters that appear anywhere in the passed string, regardless of their order.
You should actually use x.replace('Fwd:', '', 1).
The third argument, 1, ensures that only the first occurrence of the string is removed.
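On Python 3.9+, str.removeprefix is another option: it strips the marker only when the string actually starts with it (a sketch, assuming the question's summary column):
df['Clean Summary'] = df['summary'].map(lambda x: x.removeprefix('Fwd: '))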
