If we have the sentence = "George coudn't play football in y. 1998 but plays football at θ. 226",
where by letter I mean any letter from the Greek or English alphabet, is there any way to remove each letter, the dot after it, and the number that follows, so that the output = "George coudn't play football in but plays football at"?
I tried this one, which removed only the numbers
re_numb = re.compile(r'\d+')
sent = re_numb.sub('', sent)
Just use a Unicode range as in
\s+[\u03b1-\u03c9]+\.\s+\d+
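See a demo on regex101.com and a Unicode table for Greek letters. Note that this range covers only the lowercase Greek letters α to ω, so a Latin letter such as y is not matched.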
In Python this could be
import re
pattern = re.compile(r'\s+[\u03b1-\u03c9]+\.\s+\d+')
sentence = "George coudn't play football in γ. 1998 but plays football at θ. 226"
sentence = pattern.sub('', sentence)
print(sentence)
And yields
George coudn't play football in but plays football at
In Python 3, \w also matches Greek letters (str patterns are Unicode-aware by default). So you can use:
\b\w\. +\d+\b
where \b is a word boundary.
If you don't want \w to also match underscores:
\b[^_\W]\. +\d+\b
See demo
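As a rough sketch of how the word-boundary idea could be applied with re.sub in Python 3: the character class [^\W\d_] (letters only) and the leading \s* are my own additions, not part of the answer above, and are there so that digits are not treated as labels and no double space is left behind.

```python
import re

# [^\W\d_] keeps letters only (no digits, no underscore); \s* also consumes
# the whitespace before the label so no double space is left behind.
label_and_number = re.compile(r'\s*\b[^\W\d_]\.\s+\d+\b')

sentence = "George coudn't play football in y. 1998 but plays football at θ. 226"
cleaned = label_and_number.sub('', sentence)
print(cleaned)
```

This handles both the Latin y and the Greek θ, since both count as word characters in a Python 3 str pattern.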
The following regex can capture the parts of the sentence before y. 1998 and θ. 226
(\D)+(?=\s.\.\s\d+)
Demo
If you want to capture only up to the first match, add ^ to anchor the match at the start of the string
^(\D)+(?=\s.\.\s\d+)
Demo
EDIT
Code sample
To extract each match
import re
text = "George couldn't play football in y. 1998 but plays football at θ. 226"
for match in re.finditer(r'(\D)+(?=\s.\.\s\d+)', text):
    print(match.group(), end='') # print without new line
Output
George couldn't play football in but plays football at
To extract only the first match
import re
text = "George couldn't play football in y. 1998 but plays football at θ. 226"
for match in re.finditer(r'^(\D)+(?=\s.\.\s\d+)', text):
    print(match.group(), end='')
Output
George couldn't play football in
I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters, cutting off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?
Here's another approach that loops through each sentence using a regex. We keep a running character count, and once we're in the last 200 characters we check for 'Bill' in the sentence. If found, we exclude everything from this sentence onward.
Hope the code is readable enough.
import re

def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        # Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
Here is how you can use re:
import re
stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)
# walk backwards so the earliest qualifying sentence wins
for i in range(len(l)-1,0,-1):
    if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
        stringy = ' '.join(l[:i])
print(stringy)
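For comparison, here is a minimal sketch of a different technique from the two answers above: it uses str.find with a start offset to search only the tail of the text, then cuts at the last sentence break before the hit. The function name cut_from_late_target, its parameters, and the filler test article are all made up for illustration, and the sentence-break regex is deliberately naive (it would be fooled by abbreviations such as "Mr.").

```python
import re

def cut_from_late_target(text, target="Bill", window=200):
    """Drop the sentence holding a late `target` and everything after it."""
    window_start = max(len(text) - window, 0)
    pos = text.find(target, window_start)  # search only the tail
    if pos == -1:
        return text  # no late occurrence: leave the text untouched
    # End offsets of every sentence break that precedes the match.
    breaks = [m.end() for m in re.finditer(r'[.!?]\s+', text[:pos])]
    return text[:breaks[-1]] if breaks else ''

filler = "Filler sentence number one. " * 10
article = ("This is Bill Everest here. " + filler
           + "This is Bill, signing off. Thank you for reading.")
print(cut_from_late_target(article))
```

The early "Bill" in the opening sentence is ignored because it lies outside the final 200 characters, which is the behaviour the question asks for.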
Here is an example string:
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
I am trying to separate the words "patties", "burgers", "fingers", and "meat" from this text. I want to extract the words after chicken but before the closing quotation mark.
I have gotten stumped on how to even separate a single one. I can split after "chicken '", but then how can I select the text up until the next ' ?
I would like to iterate through a list to save the variables to an array. Thanks for any help you can provide.
You can use regular expressions:
import re
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
match = re.findall(r'chicken \'(\S+)\'', text)
print (match)
Outputs:
['patties', 'burgers', 'fingers', 'meat']
This is a good use-case for regex.
import re
print(re.findall(r"chicken '(.*?)'", text))
Here's an explanation of the regex: https://regex101.com/r/8IdseD/1
Here's the python code running: https://repl.it/repls/SquareQuerulousModes
The regex, part by part:
chicken ' - matches that literal text
( - starts a capture group - the part that re.findall will spit out.
. - matches any character...
*? - ...any number of times, but as few as possible (this is to ensure we don't capture the final ')
) - end the capture group
' - match a literal '.
So re.findall will give you a list of all the substrings that are captured in the group.
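To see why the lazy *? matters, this small sketch runs the same pattern with and without the ?. The shortened sample string is mine, chosen only to keep the output easy to read.

```python
import re

short_text = "chicken 'patties' and chicken 'burgers'"

lazy = re.findall(r"chicken '(.*?)'", short_text)
greedy = re.findall(r"chicken '(.*)'", short_text)

print(lazy)    # ['patties', 'burgers']
print(greedy)  # ["patties' and chicken 'burgers"]
```

The greedy version runs on to the last quote in the string, so everything in between ends up in a single capture.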
You can use zero-width lookarounds to match the surroundings:
(?<=chicken\s')[^']+(?=')
(?<=chicken\s') is zero-width positive lookbehind that matches chicken '
[^']+ matches the portion up to the next single quote, i.e. the desired substring
(?=') is zero-width positive lookahead that matches ' after the desired substring
Example:
In [713]: text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
In [714]: re.findall(r"(?<=chicken\s')[^']+(?=')", text)
Out[714]: ['patties', 'burgers', 'fingers', 'meat']
Select just the portion of the sentence from the first occurrence of "chicken":
chicken_text = text[text.find("chicken"):]
Split that text on spaces:
chicken_words = chicken_text.split(" ")
Scan the list for words that begin and end with a single quote:
for word in chicken_words:
    # the length check skips empty strings produced by consecutive spaces
    if len(word) > 1 and word[0] == "'" and word[-1] == "'":
        print(word[1:-1])
This won't work if the single-quoted words themselves contain spaces, but that isn't the case in the sample text you gave.
I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
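A short sketch of how this pattern might be used from Python. It relies on finditer and group(2), because re.findall would return a tuple per match when the pattern has more than one group. The shortened sample string and the variable names are mine.

```python
import re

# Opening quote, one to three capitalized words, then the same quote again.
quoted_names = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

sample = ("Saul \"Canelo\" Alvarez and Floyd 'Money' Mayweather. "
          "\"Here Goes a String\". 'Money Mayweather'. "
          "\"This IS a\" \"Three Word String\".")

found = [m.group(2) for m in quoted_names.finditer(sample)]
print(found)
```

Because of the optional trailing space inside the repeated group, a quoted string such as "Money " would also match, with the space included in group 2; strip the result if that matters.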
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1
The goal of the function is to split one single string into multiple lines to make it more readable. It should replace with "\n" the first space found after at least n characters (counted from the beginning of the string, or from the last "\n" dropped in the string).
Assumption:
you can assume no \n in the string
Example
Marcus plays soccer in the afternoon
f(10) should result in
Marcus plays\nsoccer in\nthe afternoon
The first space in Marcus plays soccer in the afternoon is skipped because Marcus is only 6 chars long. We then put a \n after plays and start counting again. The space after soccer is therefore skipped, etc.
So far I have tried
def replace_space_w_newline_every_n_chars(n,s):
    return re.sub("(?=.{"+str(n)+",})(\s)", "\\1\n", s, 0, re.DOTALL)
inspired by this
Try replacing
(.{10}.*?)\s
with
$1\n
Check it out here.
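Example in Python, where the backreference is written \1 instead of $1 and the count is 9 because the space itself is the tenth character:
>>> import re
>>> s = 'Marcus plays soccer in the afternoon'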
>>> re.sub(r'(.{9}.*?)\s', r'\1\n', s)
'Marcus plays\nsoccer in\nthe afternoon'
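The same substitution can be wrapped in a function that takes n as a parameter. This is only a sketch: the name break_after_n_chars is made up, and it assumes n >= 1 and no existing "\n" in the string, as stated in the question.

```python
import re

def break_after_n_chars(n, s):
    # n - 1 arbitrary characters, then lazily on to the next whitespace,
    # which is therefore the n-th character or later.
    pattern = r'(.{%d}.*?)\s' % (n - 1)
    return re.sub(pattern, r'\1\n', s)

wrapped = break_after_n_chars(10, 'Marcus plays soccer in the afternoon')
print(repr(wrapped))
```

Since re.sub scans the original string for non-overlapping matches, the count effectively restarts after each replaced space.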
I am using Python 2.7 and need 2 functions to find the longest and shortest sentence (in terms of word count) in a random paragraph. For example, if I choose to put in this paragraph:
"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
The output for this should be 36 and 16 with 36 meaning there are 36 words in the longest sentence and 16 words in the shortest sentence.
def MaxMinWords(paragraph):
    # skip empty pieces, such as the one after the final period
    numWords = [len(sentence.split()) for sentence in paragraph.split('.') if sentence.strip()]
    return max(numWords), min(numWords)
EDIT : As many have pointed out in the comments, this solution is far from robust. The point of this snippet is to simply serve as a pointer to the OP.
You need a way to split the paragraph into sentences and to count words in a sentence. You could use nltk package for both:
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count
EDIT: As has been mentioned in the comments below, programmatically determining what constitutes the sentences in a paragraph is quite a complex task. However, given the example you provided, I have elucidated a nice start to perhaps solving your problem below.
First, we want to tokenize the paragraph into sentences. We do this by splitting the text on every occurrence of a . (period). This returns a list of strings, each of which is a sentence.
We then want to break each sentence into its corresponding list of words. Then, using this list of lists, we want the sentence (represented as a list of words) whose length is a maximum and the sentence whose length is a minimum. Consider the following code:
par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."
# split paragraph into sentences
sentences = par.split(". ")
# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]
# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)
# get shortest sentence and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)
print longest_sen_len
print shortest_sen_len
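If sentences may also end in "!" or "?", a regex split is one possible refinement of the approach above. The sketch below is not from any of the answers; the function name max_min_words and the sample text are mine, and it still treats abbreviations such as "Jr." as sentence ends.

```python
import re

def max_min_words(paragraph):
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    counts = [len(s.split()) for s in sentences if s]
    return max(counts), min(counts)

result = max_min_words("One two three. Four five! Six seven eight nine?")
print(result)  # (4, 2)
```

The single-argument print(...) form runs on both Python 2.7 and Python 3.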