I have the following string:
"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"
I have collected many tweets like that and assigned them to a dataframe. How can I clean those rows by removing "hhhhhhhhhhhhhhhhhh" and keeping only the rest of the string in that row?
I'm also using CountVectorizer later, so the vocabulary ended up with a lot of entries like 'hhhhhhhhhhhhhhhhhhhhhhh'.
Using regex. Ex:
import pandas as pd
df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
#df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df)
Output:
Col
0 hello, I'm going to eat to the fullest today
1 Hello World
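One caveat, sketched here with made-up sample strings: \s+(.)\1+\b only matches a run that starts right after whitespace, so it removes any standalone token made of one repeated character (like "mmmm" or "xx"), but leaves an elongation attached to a word (like "ahhhh") alone:

```python
import pandas as pd

# Hypothetical samples to show the pattern's scope: it drops a
# whitespace-preceded token consisting of one repeated character,
# but does not touch a run glued to another letter ("ahhhh").
df = pd.DataFrame({"Col": ["great movie ahhhh", "so good mmmm", "lol xx"]})
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df["Col"].tolist())
```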
You may try this:
df["Col"] = df["Col"].str.replace(r"h{4,}", "", regex=True)
where you can set the minimum number of repeated characters to match (4 in my case). For example, given:
Col
0 hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1 Hello World
the output is:
Col
0 hello, I'm today hh
1 Hello World
I originally used a Unicode pattern (u"h{4,}") since you mentioned tweets, but in Python 3 all strings are Unicode, so a raw string works the same.
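If the goal is to clean up the vocabulary before CountVectorizer, a variant (a sketch, with invented example strings) is to collapse any character repeated three or more times down to a single character using a backreference, so elongated words shrink instead of being deleted:

```python
import pandas as pd

df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhh",
                           "soooo goooood"]})
# (.)\1{2,} matches any character repeated three or more times in a row;
# \1 in the replacement keeps a single copy of that character.
df["Col"] = df["Col"].str.replace(r"(.)\1{2,}", r"\1", regex=True)
print(df["Col"].tolist())
```

Note that a pure run like "hhhhhhhh" still leaves a single stray "h" behind, which you may want to drop separately.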
Related
I would like to keep only the first two words of each cell in a dataframe column.
For instance:
df = pd.DataFrame(["I'm learning Python", "I don't have money"])
I would like the resulting column to be:
"I'm learning" ; "I don't"
After that, if possible, I would like to add '*' around each word, like:
"*I'm* *learning*" ; "*I* *don't*"
Thanks for all the help!
You can use a regex with str.replace:
df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
0 *I'm* *learning*
1 *I* *don't*
Name: 0, dtype: object
As a new column:
df['new'] = df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
                     0               new
0  I'm learning Python  *I'm* *learning*
1   I don't have money       *I* *don't*
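If you prefer to avoid regex, the same two results can be sketched with str.split and list slicing (equivalent output, just a different route):

```python
import pandas as pd

df = pd.DataFrame(["I'm learning Python", "I don't have money"])

# Step 1: keep only the first two whitespace-separated words.
first_two = df[0].str.split().str[:2].str.join(' ')

# Step 2: wrap each of the two words in '*'.
df['new'] = df[0].str.split().str[:2].apply(lambda ws: ' '.join(f'*{w}*' for w in ws))

print(first_two.tolist())
print(df['new'].tolist())
```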
How to apply stemming on Pandas Dataframe column
I am using this function for stemming, and it works perfectly on a plain string:
xx='kenichan dived times ball managed save 50 rest'
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy pipeline, given the token.lemma_ usage below

def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))

make_to_base(xx)
But when I apply this function to my pandas dataframe column, it does not work, and it doesn't raise any error either:
x = list(df['text'])  # my df column
x = str(x)  # converting to string, otherwise it raises an error
make_to_base(x)
I've tried different things but nothing works, e.g.:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value built inside the make_to_base function; use:
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma = str(token.lemma_)
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
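Side note: the function above does lemmatization with spaCy. If actual stemming (as in the question title) is what you want, a sketch using NLTK's PorterStemmer (assuming nltk is installed) would be:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each whitespace-separated token and join them back together.
    return " ".join(stemmer.stem(tok) for tok in text.split())

print(stem_text("kenichan dived times ball managed save 50 rest"))
```

It can then be applied the same way: df['texts'] = df['text'].apply(stem_text).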
I have a pandas dataframe df:
A
["I need PEN",
 "something went wrong in LAPTOP",
 "I eat MANGO",
 "I dont know anything"]
and a Python list matches = ["BAT", "PEN", "LAPTOP", "I", "SCHOOL", ...]
I need a new column B that collects the matching strings from the list. I tried:
df['B'] = df['A'].str.extract("(" + "|".join(matches) + ")", expand=True)
Use str.findall and then join:
import pandas as pd
import re
df = pd.DataFrame({"A":["I need PEN",
"something went wrong in LAPTOP",
"I eat MANGO",
"I dont know anything about school"]})
matches = ["BAT","PEN","LAPTOP","I","SCHOOL"]
pattern = "|".join(rf"\b{m}\b" for m in matches)
df["B"] = df['A'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
print(df)
                                   A         B
0                         I need PEN     I,PEN
1     something went wrong in LAPTOP    LAPTOP
2                        I eat MANGO         I
3  I dont know anything about school  I,school
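One caution, sketched with an invented term list: if any match term contains a regex metacharacter, escape it with re.escape before joining, otherwise the pattern can match unintended strings:

```python
import re
import pandas as pd

# Hypothetical terms: 'U.S' contains a regex metacharacter ('.'), so each
# term is escaped before being joined into the alternation.
matches = ["U.S", "PEN"]
pattern = "|".join(rf"\b{re.escape(m)}\b" for m in matches)

df = pd.DataFrame({"A": ["I live in the U.S", "UPS delivered my PEN"]})
df["B"] = df["A"].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
print(df)
# Without re.escape, the unescaped '.' in 'U.S' would also match 'UPS'.
```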
Just use the df.apply function:
def fn_apply(x):
    default_list = ["BAT", "PEN", "LAPTOP", "I", "SCHOOL"]
    b_list = []
    for item in default_list:
        if item.upper() in x.A.upper().split():
            b_list.append(item)
    return ",".join(b_list)

df['B'] = df.apply(fn_apply, axis=1)
df
                                   A         B
0                         I need PEN     PEN,I
1     something went wrong in LAPTOP    LAPTOP
2                        I eat MANGO         I
3  I dont know anything about school  I,SCHOOL
Let me know if this works for you.
Or with a simpler pattern:
import re
df['B'] = df['A'].str.findall('(' + '|'.join(matches) + ')', flags=re.IGNORECASE).str.join(',')
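Be aware that without the \b word boundaries used in the earlier answer, single-letter terms like 'I' also match inside other words under IGNORECASE, e.g. on one of the question's rows:

```python
import re
import pandas as pd

matches = ["BAT", "PEN", "LAPTOP", "I", "SCHOOL"]
df = pd.DataFrame({"A": ["I dont know anything about school"]})

# Without \b boundaries, the lone 'I' also matches the 'i' inside 'anything'.
loose = df['A'].str.findall('(' + '|'.join(matches) + ')', flags=re.IGNORECASE).str.join(',')
print(loose.tolist())
```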
I want to remove all words from df2, which are not in df1.
My df1 looks like this:
id text
1 Hello world how are you people
2 Hello people I am fine people
3 Good Morning people
4 Good Evening
My df2 looks like this:
id text
1 Hello world how are you all
2 Hello everyone I am fine everyone
3 Good Afternoon people
4 Good Night
Expected output of df2:
id text
1 Hello world how are you
2 Hello I am fine
3 Good people
4 Good
Edit: It would be good if I could also print the words that were deleted, and their count (the total number of words removed).
One way would be to work with sets: split each corresponding pair of strings and take the intersection of the resulting word lists. Then we can use sorted to order the result according to df1.text and join the items back together:
res = [' '.join(sorted(set(s1.split()) & set(s2.split()), key=s1.split().index))
for s1, s2 in zip(df1.text, df2.text)]
out = pd.DataFrame(res, columns=['Text'])
print(out)
Text
0 Hello world how are you
1 Hello I am fine
2 Good people
3 Good
For a more readable solution:
res = []
for s1, s2 in zip(df1.text, df2.text):
    set_s2 = s2.split()
    set_int = set(set_s2) & set(s1.split())
    s_int = sorted(set_int, key=set_s2.index)
    res.append(' '.join(s_int))
out = pd.DataFrame(res, columns=['Text'])
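To also keep df2's word order and duplicates, and to collect the deleted words with their count (the edit in the question), a plain filtering loop can be sketched on the question's sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'text': ['Hello world how are you people',
                             'Hello people I am fine people',
                             'Good Morning people',
                             'Good Evening']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'text': ['Hello world how are you all',
                             'Hello everyone I am fine everyone',
                             'Good Afternoon people',
                             'Good Night']})

kept, removed = [], []
for s1, s2 in zip(df1.text, df2.text):
    allowed = set(s1.split())          # words permitted by df1's row
    words = s2.split()
    kept.append(' '.join(w for w in words if w in allowed))
    removed.extend(w for w in words if w not in allowed)

df2['text'] = kept
print(df2)
print('Removed:', removed, '| total:', len(removed))
```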
I have a dataframe as follows:
Name Rating
0 ABC Good
1 XYZ Good #
2 GEH Good
3 ABH *
4 FEW Normal
Here I want to replace parts of the Rating column: if a value contains '#', that character should be replaced by "Can be improve"; if it contains '*', by "Very Poor". I tried the following, but it replaces the whole string, whereas I only want to replace the special character when it is present (replacing the whole value is fine when the value is just the special character).
import pandas as pd
df = pd.DataFrame() # Load with data
df['Rating'] = df['Rating'].str.replace('.*#+.*', 'Can be improve', regex=True)
is returning
Name Rating
0 ABC Good
1 XYZ Can be improve
2 GEH Good
3 ABH Very Poor
4 FEW Normal
Can anybody help me out with this?
import pandas as pd
df = pd.DataFrame({"Rating": ["Good", "Good #", "*"]})
# regex=False treats '#' and '*' as literal characters (a bare '*' is an invalid regex)
df["Rating"] = df["Rating"].str.replace("#", "Can be improve", regex=False)
df["Rating"] = df["Rating"].str.replace("*", "Very Poor", regex=False)
print(df)
Output:
0 Good
1 Good Can be improve
2 Very Poor
You replaced the whole string because .* matches any character zero or more times.
If your special markers are always at the end of the string, you can anchor them instead:
.str.replace(r'#$', "Can be improve", regex=True)
.str.replace(r'\*$', "Very Poor", regex=True)
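Putting the '$'-anchored patterns together on the question's sample data (a sketch; note that recent pandas versions default to regex=False in str.replace, so the flag is passed explicitly):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["ABC", "XYZ", "GEH", "ABH", "FEW"],
                   "Rating": ["Good", "Good #", "Good", "*", "Normal"]})

# '$' anchors each marker to the end of the string, so only the marker
# itself is rewritten, not the whole value.
df["Rating"] = (df["Rating"]
                .str.replace(r"#$", "Can be improve", regex=True)
                .str.replace(r"\*$", "Very Poor", regex=True))
print(df)
```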