How to split text based on numbers with dots in Python?


I have the following simple text:
2 of 5 deliveries some text some text... 1. 3 of 5 items some text some text... 2. 1 of 5 items found in box some text...
Now I want the text to be split on the numbered markers (a number followed by a dot, such as 1. or 2.), as follows (each row represents one entry in a list):
2 of 5 deliveries some text some text...,
3 of 5 items some text some text...,
1 of 5 items found in box some text...
This is the desired output. However, my attempt with re.split('([0\.-9\.]+)', text) does not work: it always separates by numbers only. What would be the cleanest way to do this in Python?

You can use the following pattern:
>>> re.split(r'\s+\d+\.\s+', text)
['2 of 5 deliveries some text some text...',
'3 of 5 items some text some text...',
'1 of 5 items found in box some text...']
EXPLANATION:
>>> re.split(r'''
\s+ # Matches the whitespace before the separator
\d+ # Matches one or more digits
\. # Matches a literal '.' character
\s+ # Matches the whitespace after the separator
''', text, flags=re.VERBOSE)
['2 of 5 deliveries some text some text...',
'3 of 5 items some text some text...',
'1 of 5 items found in box some text...']
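To see why the pattern from the question splits on every number, it may help to list what its character class actually matches. Inside square brackets, `\.-9` is read as a range from '.' to '9', so the class matches any run of digits, dots and slashes rather than "a number followed by a dot". A minimal sketch (the `sample` string is a shortened, made-up version of the question's text):

```python
import re

# Hypothetical, shortened sample in the same shape as the question's text.
sample = '2 of 5 deliveries... 1. 3 of 5 items...'

# The question's character class: any run of digits and dots.
loose = re.findall(r'[0\.-9\.]+', sample)

# The answer's separator: whitespace, digits, a literal dot, whitespace.
strict = re.findall(r'\s+\d+\.\s+', sample)

print(loose)   # every number and every run of dots
print(strict)  # only the numbered marker
```

Only the second pattern isolates the marker, which is why it works as a separator for re.split.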

Try:
import re
text = '2 of 5 deliveries some text some text... 1. 3 of 5 items some text some text... 2. 1 of 5 items found in box some text...'
print(re.split(r'[0-9]\.', text))
Output (note that the whitespace around each separator is kept):
['2 of 5 deliveries some text some text... ', ' 3 of 5 items some text some text... ', ' 1 of 5 items found in box some text...']
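One thing to watch with a single-digit class is what happens once the numbering reaches 10. The sketch below compares both patterns on a made-up string (`numbered` is not from the question) and assumes the markers are always surrounded by whitespace:

```python
import re

# Hypothetical input whose markers go past 9.
numbered = 'first item... 9. second item... 10. third item...'

# [0-9]\. only consumes the last digit of "10.", leaving a stray "1".
single_digit = [part.strip() for part in re.split(r'[0-9]\.', numbered)]

# \d+ consumes the whole number, and \s+ on both sides removes the padding.
any_digits = re.split(r'\s+\d+\.\s+', numbered)

print(single_digit)
print(any_digits)
```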

Related

Change whitespace to underscore at specific positions

I have string like this:
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
What I want is to replace the whitespace between the words of each cat breed with an underscore, while leaving the whitespace between .jpg and the first breed word, and around the numbers, untouched.
Expected output:
['pic1.jpg siberian_cat 24 25', 'pic2.jpg siemese_cat 14 32', 'pic3.jpg american_bobtail cat 8 13', 'pic4.jpg cat 9 1']
I tried to construct patterns as follows:
[re.sub(r'(?<!jpg\s)([a-z])\s([a-z])\s([a-z])', r'\1_\2_\3', x) for x in strings ]
However, it adds an underscore between .jpg and the next word.
The problem is that "cat" is not always put at the end of the word combination.

Here is one approach using re.sub with a callback function:
import re
strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32', 'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
output = [re.sub(r'(?<!\S)\w+(?: \w+)* cat\b', lambda x: x.group().replace(' ', '_'), x) for x in strings]
print(output)
This prints:
['pic1.jpg siberian_cat 24 25',
'pic2.jpg siemese_cat 14 32',
'pic3.jpg american_bobtail_cat 8 13',
'pic4.jpg cat 9 1']
Here is an explanation of the regex pattern used:
(?<!\S) assert that what precedes the first word is either whitespace or the start of the string
\w+ match a word, which is then followed by
(?: \w+)* a space and another word, zero or more times
[ ] match a single space
cat\b followed by 'cat' and a word boundary
In other words, taking the third list element as an example, the regex pattern matches american bobtail cat, then replaces all spaces by underscore in the lambda callback function.

Try this [re.sub(r'jpg\s((\S+\s)+)cat', "jpg " + "_".join(x.split('jpg')[1].split('cat')[0].strip().split()) + "_cat", x) for x in strings ]
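If "cat" is not guaranteed to close the breed name, a regex anchored on that word may miss cases. A possible alternative, sketched below without regular expressions, joins every non-numeric token after the file name. It assumes each entry is non-empty and laid out as file name, breed words, then numbers; `underscore_breed` is a hypothetical helper name.

```python
def underscore_breed(entry):
    """Join the breed words of one entry with underscores."""
    # Assumption: the first token is always the file name.
    filename, *tokens = entry.split()
    words = [t for t in tokens if not t.isdigit()]
    numbers = [t for t in tokens if t.isdigit()]
    parts = [filename]
    if words:
        parts.append('_'.join(words))
    return ' '.join(parts + numbers)


strings = ['pic1.jpg siberian cat 24 25', 'pic2.jpg siemese cat 14 32',
           'pic3.jpg american bobtail cat 8 13', 'pic4.jpg cat 9 1']
print([underscore_breed(s) for s in strings])
```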

In Python, how do I extract multiple blocks of text that begin with same pattern, but no distinct end?

Given a test string:
teststr= 'chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.'
I want to create a list of results like this:
result=['chapter 1 Here is a block of text from chapter one.','chapter 2 Here is another block of text from the second chapter.','chapter 3 Here is the third and final block of text.']
Using re.findall('chapter [0-9]',teststr)
I get ['chapter 1', 'chapter 2', 'chapter 3']
That's fine if all I wanted were the chapter numbers, but I want the chapter number plus all the text up to the next chapter number. In the case of the last chapter, I want to get the chapter number and the text all the way to the end.
Trying re.findall('chapter [0-9].*',teststr) yields the greedy result:
['chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.']
I'm not great with regular expressions so any help would be appreciated.

In general, an extraction regex looks like
(?s)pattern.*?(?=pattern|$)
Or, if the pattern is at the start of a line,
(?sm)^pattern.*?(?=\npattern|\Z)
Here, you could use
re.findall(r'chapter [0-9].*?(?=chapter [0-9]|\Z)', teststr)
See this regex demo. Details:
chapter [0-9] - chapter + space and a digit
.*? - any zero or more chars, as few as possible
(?=chapter [0-9]|\Z) - a positive lookahead that matches a location immediately followed with chapter, space, digit, or end of the whole string.
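The pieces described above can be put together roughly as follows. This is only a sketch: `extract_chapters` is a hypothetical helper, `\d+` is used instead of `[0-9]` so multi-digit chapter numbers are tolerated, and re.DOTALL is added on the assumption that a chapter may span several lines.

```python
import re

# Lazy match from one heading up to (but not including) the next heading or the end.
CHAPTER_RE = re.compile(r'chapter \d+.*?(?=chapter \d+|\Z)', re.DOTALL)


def extract_chapters(text):
    """Return each chapter heading together with the text that follows it."""
    return [block.strip() for block in CHAPTER_RE.findall(text)]


sample = 'chapter 1 Intro\ntext. chapter 2 More text.'
print(extract_chapters(sample))
```

The lookahead does not consume the next heading, so that heading is still available as the start of the following match.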
Here, since the text starts with the keyword, you may use
import re
teststr= 'chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.'
my_result = [x.strip() for x in re.split(r'(?!^)(?=chapter \d)', teststr)]
print( my_result )
# => ['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']
See the Python demo. The (?!^)(?=chapter \d) regex means:
(?!^) - find a location that is not at the start of string and
(?=chapter \d) - is immediately followed with chapter, space and any digit.
The pattern is used to split the string at the found locations, and does not consume any chars, hence, the results are stripped from whitespace in a list comprehension.
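Splitting on a zero-width match like this relies on re.split supporting empty matches, which to my knowledge was added in Python 3.7. Where that cannot be assumed, a similar result can be sketched by slicing the string at the start position of every heading. `split_at_markers` is a hypothetical helper, and any text before the first heading is dropped in this version.

```python
import re


def split_at_markers(text, marker=r'chapter \d'):
    """Slice text into blocks that each start at a match of marker."""
    starts = [m.start() for m in re.finditer(marker, text)]
    # Pair each start with the next one; the last block runs to the end.
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


teststr = ('chapter 1 Here is a block of text from chapter one. '
           'chapter 2 Here is another block of text from the second chapter. '
           'chapter 3 Here is the third and final block of text.')
print(split_at_markers(teststr))
```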

If you don't have to use a regex, try this:
def split(text):
    chapters = []
    this_chapter = ""
    for i, c in enumerate(text):
        # A new chapter starts at "chapter " followed by a digit;
        # slicing avoids an IndexError if the text ends right after "chapter ".
        if text[i:].startswith("chapter ") and text[i+8:i+9].isdigit():
            if this_chapter.strip():
                chapters.append(this_chapter.strip())
            this_chapter = c
        else:
            this_chapter += c
    chapters.append(this_chapter.strip())
    return chapters
print(split('chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.'))
Output:
['chapter 1 Here is a block of text from chapter one.', 'chapter 2 Here is another block of text from the second chapter.', 'chapter 3 Here is the third and final block of text.']

You're looking for re.split. Assuming up to 99 chapters (note that this drops the chapter headings themselves):
import re
teststr= 'chapter 1 Here is a block of text from chapter one. chapter 2 Here is another block of text from the second chapter. chapter 3 Here is the third and final block of text.'
chapters = [i.strip() for i in re.split(r'chapter \d{1,2}', teststr)[1:]]
print(chapters)
Output:
['Here is a block of text from chapter one.',
'Here is another block of text from the second chapter.',
'Here is the third and final block of text.']
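If the headings should be kept, one option is to wrap the separator in a capturing group: re.split then returns the separators between the pieces, and each heading can be glued back onto its body. A sketch under the same "up to 99 chapters" assumption:

```python
import re

teststr = ('chapter 1 Here is a block of text from chapter one. '
           'chapter 2 Here is another block of text from the second chapter. '
           'chapter 3 Here is the third and final block of text.')

# With a capturing group, the result alternates: prefix, heading, body, heading, body, ...
parts = re.split(r'(chapter \d{1,2})', teststr)

# parts[0] is whatever precedes the first heading (empty here), so skip it.
chapters = [(heading + body).strip()
            for heading, body in zip(parts[1::2], parts[2::2])]
print(chapters)
```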

Find date from image/text

I have dates like these and I need a regex that finds all of them:
12-23-2019
29 10 2019
1:2:2018
9/04/2019
22.07.2019
Here's what I did.
First I removed all spaces from the text, so it looks like this:
12-23-2019291020191:02:2018
And this is my regex:
re.findall(r'((\d{1,2})([.\/-])(\d{2}|\w{3,9})([.\/-])(\d{4}))',new_text)
It finds 12-23-2019, 9/04/2019 and 22.07.2019, but not 29 10 2019 or 1:02:2018.

You may use
(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)
See the regex demo
Details
(?<!\d) - no digit right before
\d{1,2} - 1 or 2 digits
([.:/ -]) - a dot, colon, slash, space or hyphen (captured in Group 1)
(?:\d{1,2}|\w{3,}) - 1 or 2 digits or 3 or more word chars
\1 - same value as in Group 1
\d{4} - four digits
(?!\d) - no digit allowed right after
Python sample usage:
import re
text = 'Aaaa 12-23-2019, bddd 29 10 2019 <=== 1:2:2018'
pattern = r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)'
results = [x.group() for x in re.finditer(pattern, text)]
print(results) # => ['12-23-2019', '29 10 2019', '1:2:2018']
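The backreference is what keeps the two separators consistent. The sketch below is a variation on the pattern above with extra capturing groups around the three date parts, so the separator becomes Group 2 and the backreference becomes `\2`. `DATE_RE` and `find_dates` are hypothetical names, and no attempt is made to decide which part is the day and which is the month.

```python
import re

# Same idea as the pattern above, with the date parts captured as groups 1, 3 and 4.
DATE_RE = re.compile(r'(?<!\d)(\d{1,2})([.:/ -])(\d{1,2}|\w{3,})\2(\d{4})(?!\d)')


def find_dates(text):
    """Return (first, second, year) string tuples for every date-like match."""
    return [(m.group(1), m.group(3), m.group(4)) for m in DATE_RE.finditer(text)]


print(find_dates('12-23-2019 and 1:2:2018'))
print(find_dates('12-23/2019'))  # mixed separators are rejected
```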

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?

Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces (captured in Group 2)
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (rest of the word)
_? - optional space separating the capitalized words (_ stands for a space); the ? allows a single word without a trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
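In Python, the same pattern could be applied along these lines. This is a sketch: `QUOTED_RE` and `quoted_titles` are hypothetical names, and `sample` is a shortened, made-up stand-in for the sentence in the question.

```python
import re

# Group 1 is the opening quote, Group 2 the one to three capitalized words.
QUOTED_RE = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')


def quoted_titles(text):
    """Return the contents of Group 2 for every match."""
    return [m.group(2) for m in QUOTED_RE.finditer(text)]


sample = '''Saul "Canelo" Alvarez and 'Money Mayweather'. "Here Goes a String". "This IS a" "Three Word String".'''
print(quoted_titles(sample))
```

re.finditer is used rather than re.findall so that only Group 2 is collected instead of a tuple of both groups.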

Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
import re
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

How to extract a substring separated by square brackets and generate substrings

I would like to extract and construct some strings after identifying a substring that matches a pattern contained within square brackets:
e.g. if my text is '2 cups [9 oz] [10 g] flour '
I want to generate 4 strings out of this input:
"2 cups" -> us
"9 oz" -> uk imperial
"10 g" -> metric
"flour" -> ingredient name
As a start, I have tried to identify any square bracket that contains the oz keyword and wrote the following code, but the match is not occurring. Any ideas and best practices to accomplish this?
import re
p_oz = re.compile(r'\[(.+) oz\]', re.IGNORECASE) # to match uk imperial
text = '2 cups [9 oz] flour'
m = p_oz.match(text)
if m:
    found = m.group(1)
    print(found)

You need to use search instead of match.
m = p_oz.search(text)
re.match only tries to match the regex at the beginning of the input string. That's not what you want. You want to find a substring that matches your regex anywhere in the input, and that's what re.search is for.
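The difference can be checked directly on the pattern from the question. The sketch below also includes fullmatch for comparison, since that is the method that requires the whole string to match:

```python
import re

p_oz = re.compile(r'\[(.+) oz\]', re.IGNORECASE)
text = '2 cups [9 oz] flour'

at_start = p_oz.match(text)           # None: the text does not begin with '['
anywhere = p_oz.search(text)          # finds '[9 oz]' in the middle of the text
whole_text = p_oz.fullmatch(text)     # None: the pattern does not cover the whole text
whole_unit = p_oz.fullmatch('[9 oz]') # matches: the string is exactly one bracketed unit

print(at_start)
print(anywhere.group(1))
print(whole_text)
print(whole_unit.group(1))
```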

I'm just expanding upon BrenBarn's accepted answer. I like a good problem to solve during lunch. Below is my full implementation of your question:
Given the string 2 cups [9 oz] [10 g] flour
import re
text = '2 cups [9 oz] [10 g] flour'
units = {'oz': 'uk imperial',
         'cups': 'us',
         'g': 'metric'}
# strip out brackets & trim white space
text = text.replace('[', '').replace(']', '').strip()
# prefix numbers with a quote, e.g. 9 becomes "9
text = re.sub(r'(\d+)', r'"\1', text)
# expand units like `cups` to `cups" -> us~`
for unit in units:
    text = text.replace(unit, unit + '" -> ' + units[unit] + "~")
# matches the last word in the string
text = re.sub(r'(\w+$)', r'"\1" -> ingredient name', text)
print("raw text: \n" + text + "\n")
print("Array:")
print(text.split('~ '))
Will return an array of strings:
raw text:
"2 cups" -> us~ "9 oz" -> uk imperial~ "10 g" -> metric~ "flour" -> ingredient name
Array: [
'"2 cups" -> us',
'"9 oz" -> uk imperial',
'"10 g" -> metric',
'"flour" -> ingredient name'
]
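Because the string-replacement approach rewrites every occurrence of a unit's letters (a bare 'g' in an ingredient name would be expanded too), a more structured parse may be easier to extend. The following is only a sketch: `UNIT_SYSTEMS` and `parse_ingredient` are hypothetical names, and it assumes the text outside the brackets starts with exactly two tokens (number and unit) followed by the ingredient name.

```python
import re

# Hypothetical lookup table, mirroring the labels used in the question.
UNIT_SYSTEMS = {'cups': 'us', 'oz': 'uk imperial', 'g': 'metric'}


def parse_ingredient(text):
    """Return (string, label) pairs for each quantity and the ingredient name."""
    bracketed = re.findall(r'\[([^\]]+)\]', text)
    # Whatever is left outside the brackets: leading quantity, then the name.
    tokens = re.sub(r'\[[^\]]*\]', ' ', text).split()
    leading = ' '.join(tokens[:2])
    name = ' '.join(tokens[2:])
    result = []
    for quantity in [leading] + bracketed:
        parts = quantity.split()
        if not parts:
            continue
        result.append((quantity.strip(), UNIT_SYSTEMS.get(parts[-1].lower(), 'unknown')))
    result.append((name, 'ingredient name'))
    return result


for value, label in parse_ingredient('2 cups [9 oz] [10 g] flour '):
    print('"%s" -> %s' % (value, label))
```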
