How to find all matches in a string with regex in Python?

I have a string and I want to find all 13-digit numbers in it.
I wrote the code below, but the problem is that I get a list containing just the first 13-digit number.
Does anyone know where the problem is?
my text:
9311105005816 POTTING MIX OSMOCOTE PRO 501 PREMIUM 107899 4711414284189 BBQ ACC CLEANING SMALL GRILL BRUSH^ 9312566048022 SPRAY PAINT FIDDLY BITS 250G GREY PRIMER 9312324001115 GARDEN BASICS 25L ALL PURPOSE POTTING MIX 2 # $3.50 8711167004368 FIRE IGNITION FIRELIGHTR SAMBA 36PK WHITE BRICK SAKF36 6 #
my code:
import re

with open("Receipt.txt") as f:
    lines = f.readlines()

index_subtotal = lines[0].find("SubTotal")
index_tax_invoice = lines[0].find("TAX INVOICE 'Kip' ")
len_tax_invoice = len("TAX INVOICE 'Kip' ")
print(lines[0][index_tax_invoice + len_tax_invoice:index_subtotal])
print("*" * 30)
my_pattern = lines[0][index_tax_invoice + len_tax_invoice:index_subtotal]
my_pattern_list = re.findall("^(?:.*?(\d{10,13}).*|.*)$", my_pattern)
print(my_pattern_list)
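The anchored pattern above wraps the whole line in a single match, so findall can only return one group per line. An unanchored pattern returns every standalone 13-digit run; a minimal sketch on the sample text (hard-coded here rather than read from Receipt.txt):

```python
import re

text = ("9311105005816 POTTING MIX OSMOCOTE PRO 501 PREMIUM 107899 "
        "4711414284189 BBQ ACC CLEANING SMALL GRILL BRUSH^ 9312566048022 "
        "SPRAY PAINT FIDDLY BITS 250G GREY PRIMER 9312324001115 GARDEN "
        "BASICS 25L ALL PURPOSE POTTING MIX 2 # $3.50 8711167004368 FIRE "
        "IGNITION FIRELIGHTR SAMBA 36PK WHITE BRICK SAKF36 6 #")

# \b word boundaries keep each match to a standalone 13-digit run,
# so longer digit strings or digits glued to letters are not matched.
matches = re.findall(r"\b\d{13}\b", text)
print(matches)
```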

Related

How do I remove some text from "get_text()" output in BeautifulSoup

I'm making a web scraping program to get the retail trading sentiment from IG Markets.
The output I would like to be displayed in the console is:
"EUR/USD: 57% of clients accounts are short on this market".
The output I get right now is:
"EUR/USD: 57% of client accounts are short on this market The percentage of IG client
accounts with positions in this market that are currently long or short. Calculated
to the nearest 1%."
How do I remove this text:
"The percentage of IG client accounts with positions in this market that are
currently long or short. Calculated to the nearest 1%."
Thank you.
Here's the code:
import bs4, requests

def getIGsentiment(pairUrl):
    res = requests.get(pairUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('.price-ticket__sentiment')
    return elems[0].get_text(" ", strip=True)

retail_positions = getIGsentiment('https://www.ig.com/us/forex/markets-forex/eur-usd')
print('EUR/USD: ' + retail_positions)
You can use a regular expression (regex) for that:
>>> import re
>>> print('EUR/USD: ' + re.match('^.*on this market',retail_positions).group())
EUR/USD: 57% of client accounts are short on this market
You express a search pattern (^.*on this market); re.match() returns a re.Match object, and you retrieve the matched text with its group() method.
This search pattern consists of 3 parts:
^ matches the start of the line
.* matches zero or more (*) instances of any character (.)
on this market matches this string literally
Regexes are widely used and supported, but beware of dialect differences; Python's re module, for example, doesn't support the POSIX [[:digit:]] character class.
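Since the snippet above depends on a live request to IG, here is a self-contained sketch of the same re.match call, with the scraped text hard-coded (the sample string is an assumption based on the output quoted in the question):

```python
import re

# Stand-in for the text returned by getIGsentiment() (hard-coded assumption).
retail_positions = ("57% of client accounts are short on this market The "
                    "percentage of IG client accounts with positions in this "
                    "market that are currently long or short. Calculated to "
                    "the nearest 1%.")

# re.match anchors at the start of the string; .group() returns the full match.
match = re.match(r'^.*on this market', retail_positions)
print('EUR/USD: ' + match.group())
```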
If your string changes but the capitalization does not, you can simply loop through the string looking for the 7th uppercase character and split the string on it. In this case, it's the letter 'T'.
Something like this:
phrase = ("EUR/USD: 57 % of client accounts are short on this market The percentage of "
          "IG client accounts with positions in this market that are currently long or short. "
          "Calculated to the nearest 1 % .")

upperchars = []
for char in phrase:
    if char.isupper():
        upperchars.append(char)

final = phrase.split(upperchars[6])[0]
print(final)
The result would be:
EUR/USD: 57 % of client accounts are short on this market

Delete based on presence

I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)

text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters, cutting off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?
Here's another approach that loops through each sentence using a regex. We keep a line count and once we're in the last 200 characters we check for 'Bill' in the line. If found, we exclude from this line onward.
Hope the code is readable enough.
import re

def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        # Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
Here is how you can use re:
import re

stringy = """..."""
target = "Bill"
l = re.findall(r'([A-Z][^\.!?]*[\.!?])', stringy)
for i in range(len(l) - 1, 0, -1):
    if target in l[i] and sum(len(a) for a in l[i:]) - sum(len(a) for a in l[i].split(target)[:-1]) < 200:
        stringy = ' '.join(l[:i])
        break
print(stringy)
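For comparison, a regex-free sketch of the same idea (cut_from_late_target is a hypothetical helper, and the sample text here is synthetic): search for the target only within the final 200-character window using str.find, then cut the text at the end of the preceding sentence with str.rfind.

```python
def cut_from_late_target(stringy, target="Bill", window=200):
    # Search for the target starting inside the last `window` characters only.
    pos = stringy.find(target, max(0, len(stringy) - window))
    if pos == -1:
        return stringy  # no late occurrence: leave the text unchanged
    # Cut at the end of the sentence preceding the occurrence.
    end = stringy.rfind(".", 0, pos)
    return stringy if end == -1 else stringy[:end + 1]

# Synthetic sample: an early "Bill" (kept) and a late "Bill" (cut).
sample = ("This is Bill Everest here. " + "Filler sentence. " * 20 +
          "This is Bill, signing off. Thank you for reading.")
print(cut_from_late_target(sample))
```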

cutting strings onto new lines

My string is too long to fit in Tkinter, so I'm trying to split it onto a new line after every 15 spaces.
So far I have counted the spaces, and every time I get to 15 it adds the string '\n', which should put the text on a new line; however, it just places '\n' in the string.
How can I fix this?
def stringCutter(movie):
    n = 0
    strings = []
    spaces = 0
    curFilms = db.CurrentFilm(movie)
    tempOverview = curFilms[5]
    for i in tempOverview:
        n += 1
        if i == ' ':
            spaces += 1
            if (spaces % 15) == 0:
                string = tempOverview[:n]
                tempOverview = tempOverview[n:]
                strings.append(string)
                n = 0
                spaces = 0
        if n == len(tempOverview):
            strings.append(tempOverview)
    overview = '\n'.join(strings)
    return overview
curFilms holds lots of movie info, and the 5th element is the overview, which is a long string.
I want it to return the overview like this:
After a global war the seaside kingdom known as the Valley Of The Wind remains
one of the last strongholds on Earth untouched by a poisonous jungle and the powerful
insects that guard it. Led by the courageous Princess Nausicaa the people of the Valley
engage in an epic struggle to restore the bond between humanity and Earth.
Instead of that though, it does this:
After a global war the seaside kingdom known as the Valley Of The Wind remains \none of the last strongholds on Earth untouched by a poisonous jungle and the powerful \ninsects that guard it. Led by the courageous Princess Nausicaa the people of the Valley \nengage in an epic struggle to restore the bond between humanity and Earth.
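One thing to note: '\n'.join() does produce real newlines; the literal \n shows up when the string's repr is displayed (for example, by evaluating the return value in the interactive shell) instead of printing it or inserting it into the widget. Separately, grouping by words is simpler than counting spaces character by character; a minimal sketch (wrap_every_n_words is a hypothetical stand-in, without the db lookup):

```python
def wrap_every_n_words(text, n=15):
    # Split into words, group them n at a time, and join groups with newlines.
    words = text.split()
    lines = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    return "\n".join(lines)

overview = ("After a global war the seaside kingdom known as the Valley Of The "
            "Wind remains one of the last strongholds on Earth untouched by a "
            "poisonous jungle and the powerful insects that guard it.")
print(wrap_every_n_words(overview))
```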

Understanding Pyparsing for street addresses

While searching for ways to build a better address locator for processing a single-field address table, I came across the Pyparsing module. On the Examples page there is a script called "streetAddressParser" (author unknown) that I've copied in full below. While I've read the documentation and looked at O'Reilly recursive descent parser tutorials, I'm still confused about the code for this address parser. I'm aware that this parser would represent just one component of an address locator application, but my Python experience is limited to GIS scripting and I'm struggling to understand certain parts of this code.
First, what is the purpose of defining numbers as "Zero One Two Three...Eleven Twelve Thirteen...Ten Twenty Thirty..."? If we know an address field starts with integers representing the street number why not just extract that as the first token?
Second, why does this script use so many bitwise operators (^, |, ~)? Is this because of performance gains or are they treated differently in the Pyparsing module? Could other operators be used in place of them and produce the same result?
I'm grateful for any guidance offered and I appreciate your patience in reading this.
Thank you!
from pyparsing import *
# define number as a set of words
units = oneOf("Zero One Two Three Four Five Six Seven Eight Nine Ten "
              "Eleven Twelve Thirteen Fourteen Fifteen Sixteen Seventeen Eighteen Nineteen",
caseless=True)
tens = oneOf("Ten Twenty Thirty Forty Fourty Fifty Sixty Seventy Eighty Ninety",caseless=True)
hundred = CaselessLiteral("Hundred")
thousand = CaselessLiteral("Thousand")
OPT_DASH = Optional("-")
numberword = ((( units + OPT_DASH + Optional(thousand) + OPT_DASH +
Optional(units + OPT_DASH + hundred) + OPT_DASH +
Optional(tens)) ^ tens )
+ OPT_DASH + Optional(units) )
# number can be any of the forms 123, 21B, 222-A or 23 1/2
housenumber = originalTextFor( numberword | Combine(Word(nums) +
Optional(OPT_DASH + oneOf(list(alphas))+FollowedBy(White()))) +
Optional(OPT_DASH + "1/2")
)
numberSuffix = oneOf("st th nd rd").setName("numberSuffix")
streetnumber = originalTextFor( Word(nums) +
Optional(OPT_DASH + "1/2") +
Optional(numberSuffix) )
# just a basic word of alpha characters, Maple, Main, etc.
name = ~numberSuffix + Word(alphas)
# types of streets - extend as desired
type_ = Combine( MatchFirst(map(Keyword, "Street St Boulevard Blvd Lane Ln Road Rd Avenue Ave "
                                         "Circle Cir Cove Cv Drive Dr Parkway Pkwy Court Ct Square Sq "
                                         "Loop Lp".split())) + Optional(".").suppress())
# street name
nsew = Combine(oneOf("N S E W North South East West NW NE SW SE") + Optional("."))
streetName = (Combine( Optional(nsew) + streetnumber +
Optional("1/2") +
Optional(numberSuffix), joinString=" ", adjacent=False )
^ Combine(~numberSuffix + OneOrMore(~type_ + Combine(Word(alphas) + Optional("."))), joinString=" ", adjacent=False)
^ Combine("Avenue" + Word(alphas), joinString=" ", adjacent=False)).setName("streetName")
# PO Box handling
acronym = lambda s : Regex(r"\.?\s*".join(s)+r"\.?")
poBoxRef = ((acronym("PO") | acronym("APO") | acronym("AFP")) +
Optional(CaselessLiteral("BOX"))) + Word(alphanums)("boxnumber")
# basic street address
streetReference = streetName.setResultsName("name") + Optional(type_).setResultsName("type")
direct = housenumber.setResultsName("number") + streetReference
intersection = ( streetReference.setResultsName("crossStreet") +
( '#' | Keyword("and",caseless=True)) +
streetReference.setResultsName("street") )
streetAddress = ( poBoxRef("street")
^ direct.setResultsName("street")
^ streetReference.setResultsName("street")
^ intersection )
tests = """\
3120 De la Cruz Boulevard
100 South Street
123 Main
221B Baker Street
10 Downing St
1600 Pennsylvania Ave
33 1/2 W 42nd St.
454 N 38 1/2
21A Deer Run Drive
256K Memory Lane
12-1/2 Lincoln
23N W Loop South
23 N W Loop South
25 Main St
2500 14th St
12 Bennet Pkwy
Pearl St
Bennet Rd and Main St
19th St
1500 Deer Creek Lane
186 Avenue A
2081 N Webb Rd
2081 N. Webb Rd
1515 West 22nd Street
2029 Stierlin Court
P.O. Box 33170
The Landmark # One Market, Suite 200
One Market, Suite 200
One Market
One Union Square
One Union Square, Apt 22-C
""".split("\n")
# how to add Apt, Suite, etc.
suiteRef = (
oneOf("Suite Ste Apt Apartment Room Rm #", caseless=True) +
Optional(".") +
Word(alphanums+'-')("suitenumber"))
streetAddress = streetAddress + Optional(Suppress(',') + suiteRef("suite"))
for t in map(str.strip, tests):
    if t:
        #~ print "1234567890"*3
        print t
        addr = streetAddress.parseString(t, parseAll=True)
        #~ # use this version for testing
        #~ addr = streetAddress.parseString(t)
        print "Number:", addr.street.number
        print "Street:", addr.street.name
        print "Type:", addr.street.type
        if addr.street.boxnumber:
            print "Box:", addr.street.boxnumber
        print addr.dump()
        print
In some addresses, the primary number is spelt out as a word, as you can see from a few addresses in their tests, near the end of the list. Your statement, "If we know an address field starts with integers representing the street number..." is a big "if". Many, many addresses do not start with a number.
The ^, | and ~ here aren't doing bitwise arithmetic: pyparsing overloads these operators to combine parser elements. | builds a MatchFirst (try alternatives in order and take the first that matches), ^ builds an Or (try all alternatives and take the longest match), and ~ builds a NotAny (a negative lookahead that matches only if the following element does not). Other operators could not be substituted without changing the grammar's meaning.
It's refreshing to see a parser that attempts to parse street addresses without using a regular expression... also see this page about some of the challenges of parsing freeform addresses.
However, it's worth noting that this parser looks like it will miss a wide variety of addresses. It doesn't seem to consider some of the special address formats common in Utah, Wisconsin, and rural areas. It also is missing a significant number of secondary designators and street suffixes.

Splitting paragraphs containing abbreviations in Python using regular expressions

I tried using this function on a paragraph consisting of 3 sentences and abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
    and return a list '''
    import re
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()
The first character of the next sentence is eliminated.
Output received:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.
Thus the string got split into only 2 strings, and the first character of the next sentence was eliminated. Also, some strange characters can be seen; I guess Python wasn't able to convert the hyphen.
If I alter the regex to [.!?][\s]{1,2}, I get:
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s
Thus even the abbreviations get split.
The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
The reason only the first one got matched is that there is no space after the 2nd period ("worldwide.In").
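A minimal sketch of the corrected pattern on a toy paragraph (the sample sentence is illustrative only):

```python
import re

# Lookahead (?=[A-Z]) checks for the capital letter without consuming it,
# so the first character of each following sentence survives the split.
sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
p = "First sentence here. Second one follows! No space.Third stays attached."
print(sentenceEnders.split(p))
```

Note that "No space.Third" stays in one piece: without whitespace after the period, the pattern does not match, which is also why the question's original paragraph only split once.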
