Remove space delimited single characters - python

I have texts that look like this:
the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe
What's a good regex (for python) that can remove the single-characters so that the output looks like this:
the quick brown fox 狐狸 jumps over the lazy dog 跳過懶狗 best wishes John Doe
I've tried some combinations of \s{1}\S{1}\s{1}\S{1}, but they inevitably end up removing more letters than I need.

You can replace the following with empty string:
(?<!\S)\S(?!\S).?
Match a non-space that has no non-spaces on either side of it (i.e. surrounded by spaces), plus the character after that (if any).
The reason why I used negative lookarounds is because it neatly handles the start/end of string case. We match the extra character that follows the \S to remove the space as well.
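As a rough sketch of how this pattern could be applied in Python (the variable names are only illustrative, and the sample text is assumed to be separated by ordinary spaces):

```python
import re

text = ("the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト "
        "jumps over the lazy dog 跳過懶狗 best wishes : John Doe")

# A lone non-space character, plus the one character after it (if any)
pattern = re.compile(r'(?<!\S)\S(?!\S).?')

cleaned = pattern.sub('', text)
print(cleaned)
# the quick brown fox 狐狸 jumps over the lazy dog 跳過懶狗 best wishes John Doe
```

Note that the lone : is removed too, since it is also a single non-space character surrounded by spaces.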

A non-regex version might look like:
source_string = r"this is a string I created"
modified_string = ' '.join(x for x in source_string.split() if len(x) > 1)
print(modified_string)

Try the code below. It uses a regex that keeps only runs of at least two word characters, so isolated single characters are dropped.
import re
s = 'the quick brown fox 狐狸 m i c r o s o f t マ イ ク ロ ソ フ ト jumps over the lazy dog 跳過懶狗 best wishes : John Doe'
output = re.findall(r'\w{2,}', s)
output = ' '.join(output)
print(output)
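One thing to keep in mind with \w{2,} is that it only looks at word characters, so punctuation inside a token splits it. A small sketch of the difference from the whitespace-splitting approach, using a made-up sample string:

```python
import re

sample = "it's a well-known fact"  # made-up example text

# \w{2,} only sees word characters, so punctuation splits tokens
by_regex = ' '.join(re.findall(r'\w{2,}', sample))

# Splitting on whitespace keeps each token intact
by_split = ' '.join(w for w in sample.split() if len(w) > 1)

print(by_regex)  # it well known fact
print(by_split)  # it's well-known fact
```

Which behaviour is preferable depends on whether punctuation should survive in the output.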

Related

Delete the first 2 words and the last 2 words in a string in a dataframe using Regex Python

I have a column in a df contains the following strings:
>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['The cat is jumping off the bridge', 'The dog jumped over the brown fox, the bus is coming now', 'The bus is coming']})
>>> df
Sentence
0 The cat is jumping off the bridge
1 The dog jumped over the brown fox, the bus is coming now
2 The bus is coming
I would like to use regex to delete the first 2 words and the last 2 words of all the strings. One row can contain multiple strings (row 1). In case the string is less than 4 words, nothing should be returned for that string (row 2). The output should be as below:
>>> df
Sentence String
0 The cat is jumping off the bridge is jumping off
1 The dog jumped over the brown fox, the bus is coming now jumped over the, is
2 The bus is coming
I tried with this code just to see how it works for the first 2 words, but it is not working. Any suggestions would greatly be appreciated.
df['String']= df.Sentence.str.join(line.split()[2:])
You can use a single call to Series.str.replace with
df['Sentence'].str.replace(r'(?<![^,])\s*\w+(?:\W+\w+)?\s*|\s*\w+(?:\W+\w+)?\s*(?![^,])', '', regex=True)
See the Pandas demo:
>>> pattern = r'(?<![^,])\s*\w+(?:\W+\w+)?\s*|\s*\w+(?:\W+\w+)?\s*(?![^,])'
>>> df['Sentence'].str.replace(pattern, '', regex=True)
0 is jumping off
1 jumped over the,is
2
Regex details
(?<![^,]) - a comma or start of string must appear immediately to the left of the current location
\s* - 0+ whitespaces
\w+ - one or more word chars
(?:\W+\w+)? - an optional occurrence of one or more non-word chars followed with one or more word chars
\s* - 0+ whitespaces
| - or
\s* - 0+ whitespaces
\w+ - a word (one or more word chars)
(?:\W+\w+)? - an optional occurrence of one or more non-word chars followed with one or more word chars
\s* - 0+ whitespaces
(?![^,]) - end of string, or a location that is immediately followed with a comma.
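The same pattern can be tried without pandas. The following is only a sketch using the standard re module on the three sample sentences; the names sentences and trimmed are illustrative:

```python
import re

pattern = re.compile(
    r'(?<![^,])\s*\w+(?:\W+\w+)?\s*'   # up to two words at the start of a comma-separated segment
    r'|\s*\w+(?:\W+\w+)?\s*(?![^,])'   # up to two words at the end of a segment
)

sentences = [
    'The cat is jumping off the bridge',
    'The dog jumped over the brown fox, the bus is coming now',
    'The bus is coming',
]

# Remove every match, leaving only the middle of each segment
trimmed = [pattern.sub('', s) for s in sentences]
print(trimmed)
# ['is jumping off', 'jumped over the,is', '']
```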
Try this:
import pandas as pd
df = pd.DataFrame({'Sentence':['The cat is jumping off the bridge', 'The dog jumped over the brown fox, the bus is coming now', 'The bus is coming']})
df['try'] = df['Sentence'].apply(lambda s: ', '.join([' '.join(x.split()[2:-2]) for x in s.split(',')]))
print(df)
Output:
Sentence try
0 The cat is jumping off the bridge is jumping off
1 The dog jumped over the brown fox, the bus is ... jumped over the, is
2 The bus is coming
Without using regex:
string = "The cat is jumping off the bridge"  # replace with your string
words = string.split()
# Drop the first two and last two words; the result is empty for four words or fewer
string = " ".join(words[2:-2])

Regular expression for returning lines of dialogue

I am a beginner Python coder, so bear with me!
A line of complete dialogue is defined as text that starts on its own line and both starts and ends with a double quotation mark (i.e. ").
What I have so far is:
def q_4():
    pattern = r'^\"\w*\"'
    return re.compile(pattern, re.M|re.IGNORECASE)
But for some reason it only returns one instance, with a single word between the two double quotes. How can I go about capturing full lines?
Try searching on the pattern \"[^"]+\":
import re

inp = """Here is a quote: "the quick brown fox jumps over
the lazy dog" and here is another "blah
blah blah" the end"""
dialogs = re.findall(r'\"([^"]+)\"', inp)
print(dialogs)
This prints:
['the quick brown fox jumps over\nthe lazy dog', 'blah\nblah blah']
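The original pattern stops after one word because \w does not match spaces or punctuation. If the match should also follow the question's definition (the quoted text occupies its own line from start to end), one possible sketch is to keep the ^ and $ anchors with re.M and allow any characters except quotes and newlines in between. The sample text here is made up:

```python
import re

text = '''"Hello there," said no one.
"This is a complete line of dialogue."
He said "not this one".
"Another full line!"'''

# ^ and $ match at line boundaries under re.M; [^"\n]* stays on one line
line_dialogue = re.compile(r'^"[^"\n]*"$', re.M)

lines = line_dialogue.findall(text)
print(lines)
# ['"This is a complete line of dialogue."', '"Another full line!"']
```

The first and third lines are skipped because they do not both begin and end with a double quote.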

Regex for Matching Apostrophe 's' words

I'm trying to create a regex to match a word whether or not it has an apostrophe before the 's' at the end. For the example below, I'd like to replace the apostrophe with a regex that matches either an apostrophe 's' or just an 's'.
Philip K Dick's Electric Dreams
Philip K Dicks Electric Dreams
What I am trying so far is below, but I'm not getting it to match correctly. Any help here is great. Thanks!
Philip K Dick[\'[a-z]|[a-z]] Electric Dreams
Just set the apostrophe as optional in the regex pattern.
Like this: [a-zA-Z]+\'?s
For example, using your test strings:
import re
s1 = "Philip K Dick's Electric Dreams"
s2 = "Philip K Dicks Electric Dreams"
>>> re.findall("[a-zA-Z]+\'?s", s1)
["Dick's", 'Dreams']
>>> re.findall("[a-zA-Z]+\'?s", s2)
['Dicks', 'Dreams']
You can use the regex (\w+)'s to represent any letters followed by 's. Then you can substitute back that word followed by just s.
>>> s = "Philip K Dick's Electric Dreams"
>>> re.sub(r"(\w+)'s", r'\1s', s)
'Philip K Dicks Electric Dreams'
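If the goal is a single pattern that accepts both spellings of the whole title, the optional apostrophe can be placed directly in the title. A minimal sketch, where title_re, titles and results are illustrative names and the third title is added only as a negative case:

```python
import re

# '? makes the apostrophe optional, so both spellings are accepted
title_re = re.compile(r"Philip K Dick'?s Electric Dreams")

titles = [
    "Philip K Dick's Electric Dreams",
    "Philip K Dicks Electric Dreams",
    "Philip K Dick Electric Dreams",   # no trailing s, should not match
]

results = [bool(title_re.fullmatch(t)) for t in titles]
print(results)
# [True, True, False]
```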

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer that are immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
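In Python, the same matches can be collected by reading group 2 of each match. This is a sketch on the sentence from the question; pattern and found are illustrative names:

```python
import re

txt = '''Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".'''

# Group 1 is the opening quote, group 2 the capitalized words, \1 the same quote again
pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

found = [m.group(2) for m in pattern.finditer(txt)]
print(found)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```

re.finditer is used rather than re.findall here because findall would return a tuple of both groups for every match.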
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Extracting characters from text file

I have a text file that contains:
The quick brown fox jumps over the lazy dog.
I want to extract the characters in even positions, starting at zero, and create a string from them like string_even =Teqikbonfxjmsoe h aydg
as well as the characters in odd positions like string_odd = h uc rw o up vrtelz o.
I am just learning how to read text files and do not know how to approach this problem.
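with open('input.txt') as f:  # 'input.txt' is a placeholder for your file's name
    txt = f.read()
print(txt[0::2])  # characters at even positions
print(txt[1::2])  # characters at odd positions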
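To see what the two slices produce without needing a file, here is a small sketch on the sentence itself, assuming Python 3. A slice of the form txt[start::step] takes every step-th character beginning at index start:

```python
txt = "The quick brown fox jumps over the lazy dog."

# Every second character, starting at index 0 and at index 1 respectively
string_even = txt[0::2]
string_odd = txt[1::2]

print(string_even)  # Teqikbonfxjmsoe h aydg
print(string_odd)   # h uc rw o up vrtelz o.
```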
