How can I remove non-characters from a dataframe? - python

I have a dataframe df:
ID col1
1 The quick brown fox jumped hf_093*&
2 fox run jump *& #7
How can I parse out non-characters in this dataframe?
I tried this, but it doesn't work:
posts = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", posts).split())

You could use the built-in string methods:
import pandas as pd
df = pd.DataFrame({'ID': [1,2], 'col1': ['The quick brown fox jumped hf_093*&', 'fox run jump *& #7']}).set_index('ID')
df['col1'] = df['col1'].str.replace(r'[^\w\s]+', '', regex=True)
print(df)
Which yields
col1
ID
1 The quick brown fox jumped hf_093
2 fox run jump 7
This removes everything that is not in [a-zA-Z0-9_] or whitespace.
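Note that \w covers digits and the underscore as well, which is why hf_093 survives. If you want letters and whitespace only, a minimal sketch on the same dataframe:
df['col1'] = df['col1'].str.replace(r'[^a-zA-Z\s]+', '', regex=True)
This turns 'hf_093*&' into 'hf'; to drop such tokens entirely, see the function-based approach below.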
If you want finer control, you could use a function:
import re
rx = re.compile(r'(?i)\b[a-z]+\b')
def remover(row):
    words = " ".join([word
                      for word in row.split()
                      if rx.match(word)])
    return words
df['col1'] = df['col1'].apply(remover)
print(df)
Which would yield
col1
ID
1 The quick brown fox jumped
2 fox run jump
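One caveat: rx.match only requires the pattern at the start of a token, so a token like 'fox.bar' would be kept whole. A sketch using fullmatch to keep purely alphabetic tokens only:
rx_full = re.compile(r'[a-z]+', re.IGNORECASE)

def remover_strict(row):
    # keep a token only if every character in it is a letter
    return " ".join(word for word in row.split() if rx_full.fullmatch(word))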

If what you're looking for is removing the strings that contain special characters:
Regex:
df.applymap(lambda x: re.sub(r"(?:\w*[^\w ]+\w*)", "", x).strip())
Output:
0
0 The quick brown fox jumped
1 fox run jump
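Note that DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on a recent pandas the same call would be:
df.map(lambda x: re.sub(r"(?:\w*[^\w ]+\w*)", "", x).strip())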
An alternative, non-regex solution for the crazy list comprehension enthusiasts:
unwanted = '!@#$%^&*()'
df.applymap(lambda x: ' '.join([i for i in x.split() if not any(c in i for c in unwanted)]))
Output:
0
0 The quick brown fox jumped
1 fox run jump
Removes any string that has the unwanted special characters in it.
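A sketch of the same idea using the standard library's full punctuation set instead of a hand-typed string:
import string

# string.punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (note it includes the underscore)
unwanted = set(string.punctuation)
df.applymap(lambda x: ' '.join(w for w in x.split() if not any(c in unwanted for c in w)))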

Related

Remove string row in pandas data frame when number of words is less than N

I am pre-processing a dataset for an NLP classification task. I want to drop the sentences with fewer than 3 words, but the code I tried drops the words with fewer than 3 letters:
import re
text = "The quick brown fox jumps over the lazy dog."
# remove words of 1 to 3 letters
shortword = re.compile(r'\W*\b\w{1,3}\b')
print(shortword.sub('', text))
How do I do this in Python?
Using Pandas dataframe:
import pandas
text = {"header":["The quick fox","The quick fox brown jumps hight","The quick"]}
df = pandas.DataFrame(text)
df = df[df['header'].str.split().str.len().gt(2)]
print(df)
The above snippet keeps only the rows whose 'header' column contains more than 2 words.
For more on pandas dataframe, refer https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Hope this helps you.
import re
text= "Hi, Yaman Afadar. Welcome to stackoverflow website. You are pre-processing dataset for NLP classification task. you want to drop the sentences with less than 3 words. Here a sample code. Coud you try it please! The quick brown fox jumps over the lazy dog. So, Hello everyone."
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
print (sentences)
output = '\n'.join(s for s in sentences if len(s.split()) > 3)
print (output)
[Output]:
['Hi, Yaman Afadar', 'Welcome to stackoverflow website', 'You are pre-processing dataset for NLP classification task', 'you want to drop the sentences with less than 3 words', 'Here a sample code', 'Coud you try it please', 'The quick brown fox jumps over the lazy dog', 'So, Hello everyone', '']
Sentences with more than 3 words
Welcome to stackoverflow website
You are pre-processing dataset for NLP classification task
you want to drop the sentences with less than 3 words
Here a sample code
Coud you try it please
The quick brown fox jumps over the lazy dog
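To run the same filter over a dataframe column (the df and 'text' column here are just illustrative), a sketch:
import re
import pandas as pd

df = pd.DataFrame({'text': ['Hi there. The quick brown fox jumps over the lazy dog.']})
# keep only sentences with more than 3 words; sentence-final punctuation is dropped
df['text'] = df['text'].apply(
    lambda t: ' '.join(s for s in re.split(r' *[\.\?!][\'"\)\]]* *', t)
                       if len(s.split()) > 3))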

How to group words of a string into different strings using pre-defined word groups in python?

I would like to convert a string which contains words like this: The Red Fox The Cat The Dog Is Blue, into 3 strings: The Red Fox for the first one, The Cat for the second, and The Dog Is Blue for the last one.
More simply explained, it should do like so:
# String0 = The Red Fox The Cat The Dog Is Blue
# The line above should transform to the lines below
# String1 = The Red Fox
# String2 = The Cat
# String3 = The Dog Is Blue
Note that the words that form the expressions are meant to change (while still forming known expressions), so I was thinking about making a dictionary that would help recognize the words and define how they should group together, if that is possible.
I hope this is understandable and that someone has an answer to my question.
You can use regex:
import re
string = "The Red Fox The Cat The Dog Is Blue"
# create a regex by joining your words using pipe (|)
pattern = "(The(\\s(Red|Fox|Cat|Dog|Is|Blue))+)"
print([x[0] for x in re.findall(pattern, string)]) # ['The Red Fox', 'The Cat', 'The Dog Is Blue']
In the above example, you can dynamically create your pattern from a list of words that you have.
EDIT: Dynamically constructing the pattern:
pattern = f"(The(\\s({'|'.join(list_of_words)}))+)"
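If the words may contain regex metacharacters, it is safer to escape them when building the pattern (list_of_words below is just an example vocabulary):
import re

list_of_words = ['Red', 'Fox', 'Cat', 'Dog', 'Is', 'Blue']  # example vocabulary
pattern = f"(The(\\s({'|'.join(re.escape(w) for w in list_of_words)}))+)"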
This gets you what you need, the basic code:
def separate():
    string0 = "The Red Fox The Cat The Dog Is Blue"
    sentences = ["The " + sentence.strip()
                 for sentence in string0.lower().split("the")
                 if sentence != ""]
    for sentence in sentences:
        print(sentence)
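Note that this approach lowercases everything except the re-added 'The'. A sketch that preserves the original casing by splitting on a lookahead instead (zero-width splits require Python 3.7+):
import re

string0 = "The Red Fox The Cat The Dog Is Blue"
# split immediately before each 'The' without consuming it
sentences = [s.strip() for s in re.split(r'(?=\bThe\b)', string0) if s.strip()]
print(sentences)  # ['The Red Fox', 'The Cat', 'The Dog Is Blue']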

How do I pad all punctuation with a whitespace for every row of text in a pandas dataframe?

I have a data frame with df['text'].
A sample value of df['text'] could be:
"The quick red.fox jumped over.the lazy brown, dog."
I want the output to be:
"The quick red . fox jumped over . the lazy brown , dog . "
I've tried using the str.replace() method, but I don't quite understand how to make it do what I'm looking for.
import pandas as pd
# read csv into dataframe
df=pd.read_csv('./data.csv')
#add a space before and after every punctuation
df['text'] = df['text'].str.replace('.',' . ')
df['text'].head()
# write dataframe to csv
df.to_csv('data.csv', index=False)
You have to escape the dot to literally match a period, using .str.replace:
df['Text'].str.replace(r'\.', ' . ', regex=True).str.replace(',', ' , ')
0 The quick red . fox jumped over . the lazy brown , dog .
Name: Text, dtype: object
To replace all punctuation, use a regex with a capturing group and \1 in the replacement to add spaces before and after the matched values:
df['text'] = df['text'].str.replace(r'([^\w\s]+)', r' \1 ', regex=True)
Try with
df['text'] = df['text'].replace({r'\.': ' . ', ',': ' , '}, regex=True)
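All of these replacements insert a space on each side of the punctuation, so text that already had a space next to the punctuation ends up with doubles. A sketch of a final pass to collapse them:
df['text'] = (df['text'].str.replace(r'([^\w\s]+)', r' \1 ', regex=True)
                        .str.replace(r'\s+', ' ', regex=True))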

How would I extract a substring from a string that contains parentheses using python?

I have the following string:
The quick brown fox, the cat in the (hat) and the dog in the pound. The Cat in THE (hat):
I need help with extracting the following text:
1) the cat in the (hat)
2) The Cat in THE (hat)
I have tried the following:
p1 = """The quick brown fox, the cat in the (hat) and the dog in the pound. The Cat in THE (hat)"""
pattern = r'\b{var}\b'.format(var=p1)
with io.open(os.path.join(directory, file), 'r', encoding='utf-8') as textfile:
    for line in textfile:
        result = re.findall(pattern, line)
        print(result)
Strictly matching that string, you can use the regex below. The (?i) at the beginning makes it case-insensitive, and the backslashes escape the parentheses so they match literally.
import re
regex = re.compile(r'(?i)the cat in the \(hat\)')
string = 'The quick brown fox, the cat in the (hat) and the dog in the pound. The Cat in THE (hat):'
regex.findall(string)
Result:
['the cat in the (hat)', 'The Cat in THE (hat)']
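To generalize without hand-escaping, re.escape can build the pattern from the plain phrase (the phrase variable is just for illustration):
import re

phrase = 'the cat in the (hat)'  # parentheses are escaped automatically by re.escape
regex = re.compile(re.escape(phrase), re.IGNORECASE)
string = 'The quick brown fox, the cat in the (hat) and the dog in the pound. The Cat in THE (hat):'
print(regex.findall(string))  # ['the cat in the (hat)', 'The Cat in THE (hat)']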

Python Regex findall But Not Including the conditional string

I have this string:
The quick red fox jumped over the lazy brown dog lazy
And I wrote this regex:
s = 'The quick red fox jumped over the lazy brown dog lazy'
re.findall(r'[\s\w\S]*?(?=lazy)', s)
which gives me the output below:
['The quick red fox jumped over the ', '', 'azy brown dog ', '']
But I am trying to get output like this:
['The quick red fox jumped over the ']
The regex should give me everything up to the first 'lazy' instead of the last one, and I only want to use findall.
Make the pattern non-greedy with the ? and use re.search, which stops at the first match:
>>> m = re.search(r'[\s\w\S]*?(?=lazy)', s)
# ^
>>> m.group()
'The quick red fox jumped over the '
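If you really must stick with findall, anchoring the pattern at the start of the string also limits it to the first match:
>>> re.findall(r'\A[\s\S]*?(?=lazy)', s)
['The quick red fox jumped over the ']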
