Keep in cells only the first two words with pandas - python

I would like to keep only the first two words of each cell in a dataframe column.
For instance:
df = pd.DataFrame(["I'm learning Python", "I don't have money"])
I would like that the results in the column have the following output:
"I'm learning" ; "I don't"
After that, if possible, I would like to wrap each word in '*'. So it would look like:
"*I'm* *learning*" ; "*I* *don't*"
Thanks for all the help!

You can use a regex with str.replace:
df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
0 *I'm* *learning*
1 *I* *don't*
Name: 0, dtype: object
As a new column:
df['new'] = df[0].str.replace(r'(\S+)\s(\S+).*', r'*\1* *\2*', regex=True)
output:
                     0               new
0  I'm learning Python  *I'm* *learning*
1   I don't have money       *I* *don't*
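If the regex feels hard to read, a split-based variant might be easier to follow. This is only a sketch on the sample frame from the question; the names first_two, plain, and starred are mine and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(["I'm learning Python", "I don't have money"])

# Split each cell on whitespace and keep the first two tokens (a list per row).
first_two = df[0].str.split().str[:2]

# Join the two tokens back into a single string.
plain = first_two.str.join(" ")

# Wrap every kept word in asterisks before joining.
starred = first_two.apply(lambda words: " ".join(f"*{w}*" for w in words))
```

Unlike the regex, this also handles cells with fewer than two words without leaving them untouched by accident, since slicing a short list simply returns what is there.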

Related

How do I change the same string within a column and make it permanent using Pandas

I'm trying to change the string "SLL" in the competitions column to "League", but when I tried this:
messi_dataset.replace("SLL", "League",regex = True)
It only changed the first "SLL" to "League", and the other strings that were "SLL" became "UCL". I have no idea why. I also tried changing regex = True to inplace = True, but no luck.
https://drive.google.com/file/d/1ldq6o70j-FsjX832GbYq24jzeR0IwlEs/view?usp=sharing
https://drive.google.com/file/d/1OeCSutkfdHdroCmTEG9KqnYypso3bwDm/view?usp=sharing
Suppose you have a dataframe as below:
import pandas as pd
import re
df = pd.DataFrame({'Competitions': ['SLL', 'sll','apple', 'banana', 'aabbSLL', 'ccddSLL']})
# build a regex pattern that matches 'SLL'
# assuming the match should be case-insensitive
regex_pat = re.compile(r'SLL', flags=re.IGNORECASE)
df['Competitions'].str.replace(regex_pat, 'league', regex=True)
# Input DataFrame
Competitions
0 SLL
1 sll
2 apple
3 banana
4 aabbSLL
5 ccddSLL
Output:
0 league
1 league
2 apple
3 banana
4 aabbleague
5 ccddleague
Name: Competitions, dtype: object
Hope it clarifies.
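As a possible shortcut, str.replace also accepts a case argument, so the compiled pattern may not be needed. A minimal sketch on the same sample frame, assigning the result back to the column so the change is kept:

```python
import pandas as pd

df = pd.DataFrame({'Competitions': ['SLL', 'sll', 'apple', 'banana', 'aabbSLL', 'ccddSLL']})

# case=False makes the regex match case-insensitively;
# assigning back to the column keeps the change.
df['Competitions'] = df['Competitions'].str.replace('SLL', 'league', case=False, regex=True)
```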
Based on this answer, try this code:
messi_dataset['competitions'] = messi_dataset['competitions'].replace("SLL", "League")
There are also other ways to do this, such as this one, which I tested (assign the result back to messi_dataset to keep the change):
messi_dataset.replace({'competitions': 'SLL'}, "League")
For cases where 'SLL' is part of a longer string:
messi_dataset.replace({'competitions': 'SLL'}, "League", regex=True)
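To illustrate the "permanent" part: replace returns a new object and leaves the original alone unless you assign the result back. The sketch below uses a small made-up frame as a stand-in, since the real data is only linked in the question; the values and the names discarded and still_original are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real data, which is only linked above.
messi_dataset = pd.DataFrame({"competitions": ["SLL", "UCL", "SLL", "CDR"]})

# replace() returns a new DataFrame; the original is untouched here.
discarded = messi_dataset.replace("SLL", "League")
still_original = messi_dataset["competitions"].tolist()

# Assigning the result back is what makes the change stick.
messi_dataset["competitions"] = messi_dataset["competitions"].replace("SLL", "League")
```

Without regex=True the match is on the whole cell value, so only cells that are exactly "SLL" change.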

spacy stemming on pandas df column not working

How to apply stemming on Pandas Dataframe column
I am using this function for stemming, and it works perfectly on a single string:
xx='kenichan dived times ball managed save 50 rest'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(xx)
But when I apply this function to my pandas dataframe column, it does not work, and it does not give any error either:
x = list(df['text']) #my df column
x = str(x)#converting into string otherwise it is giving error
make_to_base(x)
I've tried different things, but nothing works. For example:
df["texts"] = df.text.apply(lambda x: make_to_base(x))
make_to_base(df['text'])
my dataset looks like this:
df['text'].head()
Out[17]:
0 Hope you are having a good week. Just checking in
1 K..give back my thanks.
2 Am also doing in cbe only. But have to pay.
3 complimentary 4 STAR Ibiza Holiday or £10,000 ...
4 okmail: Dear Dave this is your final notice to...
Name: text, dtype: object
You need to actually return the value you got inside the make_to_base method, use
def make_to_base(x):
    x_list = []
    for token in nlp(x):
        lemma=str(token.lemma_)
        if lemma=='-PRON-' or lemma=='be':
            lemma=token.text
        x_list.append(lemma)
    return " ".join(x_list)
Then, use
df['texts'] = df['text'].apply(lambda x: make_to_base(x))
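To see why the return matters, here is a small sketch that leaves spaCy out entirely. The two helper functions are hypothetical stand-ins for make_to_base (they just lowercase the text), and the sample rows are shortened from the question.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Hope you are having a good week", "K..give back my thanks."]})

def lower_print(x):
    # Prints the result but implicitly returns None.
    print(x.lower())

def lower_return(x):
    # Returns the result, so apply() can collect it.
    return x.lower()

printed = df["text"].apply(lower_print)    # a column of None values
returned = df["text"].apply(lower_return)  # a column of processed strings
```

apply builds the new column from whatever the function returns, so a function that only prints fills the column with None.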

How can I merge and format multiple columns which contains int data type in pandas?

I have a dataframe like this (shown below), and I want to create a new column which combines them together (Note that I have other columns which contain numbers):
Program Season Episode
AAA 1 1
AAA 1 2
...
...
This is the code I tried:
#create a new column
series['series_name'] = series[['Program', 'Season','Episode']].apply(lambda x: ''.join(str(x)), axis=1)
It gave me something like this:
'Program AAA\nSeason 1\nEpisode 1\nName: 0, dtype: object'
My expected output should be something like:
'AAA-Season 1-Episode 1'
Can someone help me, many thanks.
df['series_name'] = df['Program'].str.cat(
    ['Season ' + df['Season'].astype(str),
     'Episode ' + df['Episode'].astype(str)],
    sep='-'
)
Here is one way
df=df1.copy()
df[['Season','Episode']]=df[['Season','Episode']].astype(str).radd(['Season ','Episode '],1)
s=df.apply('-'.join,1)
s
Out[79]:
0 AAA-Season 1-Episode 1
1 AAA-Season 1-Episode 2
dtype: object
#df1['series_name']=s
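For what it's worth, the original attempt fails because str(x) turns the whole row Series into its printed representation. A sketch of two alternatives on a frame rebuilt from the question's two sample rows; the column name series_name_rowwise is my own, added only to compare the two approaches.

```python
import pandas as pd

series = pd.DataFrame({"Program": ["AAA", "AAA"], "Season": [1, 1], "Episode": [1, 2]})

# Vectorized: cast the integer columns to str and concatenate with literals.
series["series_name"] = (
    series["Program"]
    + "-Season " + series["Season"].astype(str)
    + "-Episode " + series["Episode"].astype(str)
)

# Row-wise: format each row explicitly instead of calling str() on the row.
series["series_name_rowwise"] = series.apply(
    lambda r: f"{r['Program']}-Season {r['Season']}-Episode {r['Episode']}",
    axis=1,
)
```

The vectorized form is usually the faster of the two on large frames; the row-wise form is closer to the code in the question.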

How to remove repeating letter in a dataframe?

I have the following string:
"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"
I have collected many tweets like that and assigned them to a dataframe. How can I clean those rows in dataframe by removing "hhhhhhhhhhhhhhhhhh" and only let the rest of the string in that row?
I'm also using CountVectorizer later, and the vocabulary ended up containing many entries like 'hhhhhhhhhhhhhhhhhhhhhhh'.
Using Regex.
Ex:
import pandas as pd
df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
#df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df)
Output:
Col
0 hello, I'm going to eat to the fullest today
1 Hello World
You may try this:
df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)
Here you can set the minimum number of repeated characters to match; in my case, 4.
Before:
Col
0 hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1 Hello World
After:
Col
0 hello, I'm today hh
1 Hello World
I used a unicode string for the pattern, since you mentioned you are working with tweets.
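If the repeated letter is not always "h", a backreference can generalize the idea. The sketch below removes any whole word made of one character repeated four or more times; the threshold of four, the third sample row, and the names pattern and cleaned are my assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Col": [
    "hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh",
    "Hello World",
    "soooo good aaaaaa",  # made-up row to show a different repeated letter
]})

# \b(\w)\1{3,}\b: a whole word consisting of the same character 4+ times.
pattern = r"\b(\w)\1{3,}\b"

cleaned = (
    df["Col"]
    .str.replace(pattern, "", regex=True)  # drop the repeated-letter words
    .str.replace(r"\s+", " ", regex=True)  # collapse leftover whitespace
    .str.strip()
)
```

Words that merely contain a repeated run, such as "soooo", are left alone because the word boundaries require the entire word to be the same character.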

Replace partial string/char in column data of a pandas dataframe

I have a dataframe as follows:
Name Rating
0 ABC Good
1 XYZ Good #
2 GEH Good
3 ABH *
4 FEW Normal
In the Rating column, I want to replace # with Can be improve and * with Very Poor. I tried the following, but it replaces the whole string, whereas I want to replace only the special character when it is present. It does work for the case where the cell contains only the special character.
import pandas as pd
df = pd.DataFrame() # Load with data
df['Rating'] = df['Rating'].str.replace('.*#+.*', 'Can be improve')
is returning
Name Rating
0 ABC Good
1 XYZ Can be improve
2 GEH Good
3 ABH Very Poor
4 FEW Normal
Can anybody help me out with this?
import pandas as pd
df = pd.DataFrame({"Rating": ["Good", "Good #", "*"]})
# regex=False treats "#" and "*" as literal characters
df["Rating"] = df["Rating"].str.replace("#", "Can be improve", regex=False)
df["Rating"] = df["Rating"].str.replace("*", "Very Poor", regex=False)
print(df)
Output:
0 Good
1 Good Can be improve
2 Very Poor
You replace the whole string because .* matches any character zero or more times.
If your special values are always at the end of the string you might use:
.str.replace(r'#$', "Can be improve", regex=True)
.str.replace(r'\*$', "Very Poor", regex=True)
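Putting it together on the frame from the question, one possible sketch is to loop over a small mapping and treat each symbol as a literal string (regex=False), so that * does not need escaping. The replacements dict is my own naming, not something from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ABC", "XYZ", "GEH", "ABH", "FEW"],
    "Rating": ["Good", "Good #", "Good", "*", "Normal"],
})

# Map each special character to the text that should take its place.
replacements = {"#": "Can be improve", "*": "Very Poor"}

for symbol, text in replacements.items():
    # regex=False replaces only the literal symbol, wherever it appears.
    df["Rating"] = df["Rating"].str.replace(symbol, text, regex=False)
```

Only the symbol itself is swapped, so "Good #" keeps its "Good " prefix.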
