Grab a number from a text file using regex [duplicate]


This question already has an answer here:
Regex to match the last float in a string
(1 answer)
Closed 3 years ago.
I have a text file that has lines of the format something like:
1 12.345 12345.12345678 56.789 textextext
Using python, I want to be able to grab the number that has the format nn.nnn, but only the one in the penultimate column, i.e. for this row, I would like to grab 56.789 (and not 12.345).
I know I can do something like:
re.findall(r' \d\d\.\d\d\d',<my_line>)[0]
but I'm not sure how to make sure I only grab one of the two numbers with this same format.

You may use a greedy match before matching your number:
>>> s = '1 12.345 12345.12345678 56.789 textextext'
>>> print(re.findall(r'.*(\b\d+\.\d+)', s)[0])
56.789
RegEx Details:
.* is greedy: it consumes as much of the line as possible and backtracks only as far as needed, so the group that follows matches the last float on the line
\b is for word boundary
\d+\.\d+ matches a floating point number
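As a minimal sketch of the greedy-prefix idea above (the variable names `line`, `last_float` and `all_floats` are illustrative, not from the original post), the following compares the pattern with and without the leading `.*`:

```python
import re

line = '1 12.345 12345.12345678 56.789 textextext'

# Greedy prefix: .* swallows the whole line, then backtracks just far enough
# for the capturing group to match, so the group lands on the last float.
last_float = re.findall(r'.*(\b\d+\.\d+)', line)[0]

# Without the prefix, findall returns every float from left to right.
all_floats = re.findall(r'\b\d+\.\d+', line)

print(last_float)   # 56.789
print(all_floats)   # ['12.345', '12345.12345678', '56.789']
```

If the penultimate column is always the last float on the line, the greedy form is enough; otherwise splitting the line on whitespace and taking index -2 may be simpler.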

How to match a full string, instead of partial string? [duplicate]

This question already has answers here:
Order of regular expression operator (..|.. ... ..|..)
(1 answer)
Checking whole string with a regex
(5 answers)
Closed 2 years ago.
pattern = (1|2|3|4|5|6|7|8|9|10|11|12)
str = '11'
This only matches '1', not '11'. How to match the full '11'? I changed it to:
pattern = (?:1|2|3|4|5|6|7|8|9|10|11|12)
It is the same.
I am testing here first:
https://regex101.com/

It is matching 1 instead of 11 because you have 1 before 11 in your alternation. If you use re.findall then it will match 1 twice for input string 11.
However, to match numbers from 1 to 12 you can avoid the long alternation and use:
\b(?:1[0-2]?|[2-9])\b
It is safer to use word boundary to avoid matching within word digits.
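A quick check of a grouped 1-to-12 pattern with word boundaries might look like this. It is only a sketch: the name `month` and the sample strings are assumptions made for illustration.

```python
import re

# The longer branch (1, 10, 11, 12) comes first, and the non-capturing group
# makes both word boundaries apply to every branch.
month = re.compile(r'\b(?:1[0-2]?|[2-9])\b')

print(month.findall('1 7 10 11 12 13 x9'))  # ['1', '7', '10', '11', '12']

# fullmatch checks the whole string instead of searching inside it.
print(bool(month.fullmatch('11')))  # True
print(bool(month.fullmatch('13')))  # False
```

The boundaries keep `13` and the `9` in `x9` from matching, because neither is a standalone number in the 1 to 12 range.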

Regex always matches left before right.
On an alternation you'd put the longest first.
However, factoring should take precedence.
(1|2|3|4|5|6|7|8|9|10|11|12)
then it turns into
1[012]?|[2-9]
https://regex101.com/r/qmlKr0/1
I purposely didn't add boundary parts as
everybody has their own preference.
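To illustrate the left-before-right rule, here is a small sketch (the reduced two-branch patterns are chosen for brevity and are not from the original answer):

```python
import re

text = '11'

# Branches are tried left to right; the first one that succeeds wins.
short_first = re.findall(r'1|11', text)
long_first = re.findall(r'11|1', text)

# Factored form: a single branch covers 1, 10, 11 and 12, so order stops mattering.
factored = re.findall(r'1[012]?|[2-9]', text)

print(short_first)  # ['1', '1']
print(long_first)   # ['11']
print(factored)     # ['11']
```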

Do you mean this solution?
[\d]+

How to find values in specific format including parenthesis using regular expression in python [duplicate]

This question already has answers here:
Regular expression to return text between parenthesis
(11 answers)
Closed 2 years ago.
I have long string S, and I want to find value (numeric) in the following format "Value(**)", where ** is values I want to extract.
For example, S is "abcdef Value(34) Value(56) Value(13)", then I want to extract values 34, 56, 13 from S.
I tried to use regex as follows.
import re
regex = re.compile('\Value(.*'))
re.findall(regex, S)
But the code yields the result I did not expect.
Edit. I edited some mistakes.

You should escape the parentheses, correct the typo of Value (as opposed to Values), use a lazy repeater *? instead of *, add the missing right parenthesis, and capture what's enclosed in the escaped parentheses with a pair of parentheses:
regex = re.compile(r'Value\((.*?)\)')
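Applied to the sample string from the question, the corrected pattern could be used as follows. This is a sketch; converting the captures to `int` is an extra step assumed here for illustration.

```python
import re

S = 'abcdef Value(34) Value(56) Value(13)'

# \( and \) match literal parentheses; (.*?) lazily captures what lies between them.
regex = re.compile(r'Value\((.*?)\)')

values = regex.findall(S)
numbers = [int(v) for v in values]  # assumes every capture is numeric

print(values)   # ['34', '56', '13']
print(numbers)  # [34, 56, 13]
```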

Only one of your numbers follows the word 'Value', so you can extract anything inside parentheses. You also need to escape the parentheses which are special characters.
regex = re.compile(r'\(.*?\)')
re.findall(regex, S)
Output:
['(34)', '(56)', '(13)']

I think what you're looking for is a capturing group that can return multiple matches. The pattern is (\(\d{2}\))?. \d matches a digit and {2} requires exactly 2 of them; {1,2} would match 1 or 2 digits, etc. The trailing ? makes the group optional (0 or 1 occurrences), so with re.findall you would normally drop it to avoid empty matches. Each group can be indexed and combined into a list. This is a simple implementation and will return the numbers within parentheses.
eg. 'asdasd Value(102222), fgdf(20), he(77)' will match 20 and 77 but not 102222.
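A sketch of the two-digit idea, using a slightly different pattern from the one quoted above: the optional `?` is dropped and the capturing group is moved inside the escaped parentheses, so only the digits are returned. The variable names are illustrative.

```python
import re

text = 'asdasd Value(102222), fgdf(20), he(77)'

# Exactly two digits between literal parentheses; (102222) fails because
# the closing parenthesis does not follow the second digit.
two_digit = re.findall(r'\((\d{2})\)', text)

print(two_digit)  # ['20', '77']
```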

Regex: How to find substring that does NOT contain a certain word [duplicate]

This question already has answers here:
Regular expressions: Ensuring b doesn't come between a and c
(4 answers)
Closed 3 years ago.
I have this string;
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
I would like to match and capture all substrings that appear in between START and FINISH but only if the word "poison" does NOT appear in that substring. How do I exclude this word and capture only the desired substrings?
re.findall(r'START(.*?)FINISH', string)
Desired captured groups:
candy
sugar

Using a tempered dot, we can try:
import re
string = "STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH STARTBlobpoisonFINISH STARTpoisonBlobFINISH"
matches = re.findall(r'START((?:(?!poison).)*?)FINISH', string)
print(matches)
This prints:
['candy', 'sugar']
For an explanation of how the regex pattern works, we can have a closer look at:
(?:(?!poison).)*?
This uses a tempered dot trick. It will match, one character at a time, so long as what follows is not poison.
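To see why the lookahead has to be repeated at every character, the sketch below compares the tempered dot with a single lookahead placed right after START. The second pattern is not from the answer; it is added here only as a contrast.

```python
import re

string = ("STARTcandyFINISH STARTsugarFINISH STARTpoisonFINISH "
          "STARTBlobpoisonFINISH STARTpoisonBlobFINISH")

# Tempered dot: the negative lookahead is checked before each consumed character.
tempered = re.findall(r'START((?:(?!poison).)*?)FINISH', string)

# Single lookahead: only rejects "poison" immediately after START,
# so "Blobpoison" slips through.
single = re.findall(r'START(?!poison)(.*?)FINISH', string)

print(tempered)  # ['candy', 'sugar']
print(single)    # ['candy', 'sugar', 'Blobpoison']
```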

Extracting numbers from a string under certain conditions

I have some strings that are stored in a dataframe using pandas and I want to extract all numbers out of them if it exists. The conditions these numbers must meet are quite specific and I'm not really sure if I can use regex to solve my problem. The conditions are:
The number CANNOT be at the start of the string
It CANNOT appear after the word "No. " or after the word "Question "
Also if possible, if the number has an e right after it I would want to keep that as well. However this is less important.
This is what I have so far to find all the numbers, but I do not know how to code the conditions I mentioned above.
testNumbers = re.findall(r'\d+', row['Name'])
For a given string: " Test T860 Article No. 9712250 787"
I would want the regex expression to return
[860, 787]

You may use
(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)
In Python, declare as a raw string literal:
pattern = r'(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)'
Details
(?!^) - not at the start of the string
(?<!\d) - no digit immediately before the current location is allowed
(?<!\bNo\.\s) - no No. and a whitespace immediately before is allowed
(?<!\bQuestion\s) - no Question and a whitespace immediately before is allowed
(\d+) - Group 1: one or more digits
(?!\d) - no digit immediately after the current location is allowed.
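Outside pandas, the lookarounds listed above can be checked with plain `re`. This is a minimal sketch; the second test string is an assumption added to exercise the start-of-string rule.

```python
import re

pattern = re.compile(r'(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)')

name = ' Test T860 Article No. 9712250 787'
numbers = [int(n) for n in pattern.findall(name)]
print(numbers)  # [860, 787]

# A number at the very start of the string is skipped: (?!^) blocks position 0,
# and (?<!\d) stops a match from starting in the middle of "42".
leading = pattern.findall('42 apples 7')
print(leading)  # ['7']
```

Python's `re` requires fixed-width lookbehinds, which is why "No. " and "Question " are handled by two separate lookbehinds rather than one alternation of different lengths.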
In Pandas, you may use it like
import pandas as pd
df = pd.DataFrame({'text':[" Test T860 Article No. 9712250 787"," Test F199 Article Question 9712250787"]})
df['numbers'] = df['text'].str.findall(r'(?!^)(?<!\d)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)').apply(','.join)
Output:
>>> df
text numbers
0 Test T860 Article No. 9712250 787 860,787
1 Test F199 Article Question 9712250787 199

Here we can use an expression with word boundaries and quantifier:
\b[A-Z]+(\d+)\b|\b([0-9]{1,3})\b
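Because this expression has two capturing groups, `re.findall` returns a tuple per match, with one slot empty. A sketch of flattening the result follows (the names are illustrative). Note that this pattern excludes 9712250 only because it is longer than three digits, not because it follows "No. ".

```python
import re

pattern = re.compile(r'\b[A-Z]+(\d+)\b|\b([0-9]{1,3})\b')

name = ' Test T860 Article No. 9712250 787'

# Each match fills exactly one of the two groups; keep whichever is non-empty.
pairs = pattern.findall(name)
numbers = [a or b for a, b in pairs]

print(pairs)    # [('860', ''), ('', '787')]
print(numbers)  # ['860', '787']
```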

Match specific pattern with regular expression

I have to write a regex that matches exactly this kind of pattern.
Here is an example:
JK+6.00,PP*2,ZZ,GROUPO
having a match for every group like
Match 1
JK
+
6.00
Match 2
PP
*
2
Match 3
ZZ
Match 4
GROUPO
So comma separated blocks of
(2 to 12 capital letters) [optionally followed by (+ or *) and a (positive number 0[.0[0]])]
This block successfully parse the pattern
(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)
we have the subject group
(?P<subject>[A-Z]{2,12})
The value
(?P<value>\d+(?:.?\d{1,2})?)
All the optional operation section (value within)
(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?
But the regex must fail if the string doesn't match EXACTLY the pattern
and that's the problem
I tried this, but it doesn't work:
^(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>\*|\+)(?P<value>\d+(?:.?\d{1,2})?))?)(?:,(?P=block))*$
Any suggestion?
PS. I use Python re

I'd personally go for a 2 step solution, first check that the whole string fits to your pattern, then extract the groups you want.
For the overall check you might want to use ^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$ as a pattern, which contains basically your pattern, the (?:,|$) to match the delimiters and anchors.
I have also adjusted your pattern a bit, to (?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?). I replaced \*|\+ with [*+] in your operation pattern and .? with \. in your value pattern, so the decimal point is matched literally and is required before the fractional digits.
A (very basic) python implementation could look like
import re
s = 'JK+6.00,PP*2,ZZ,GROUPO'  # named s rather than str to avoid shadowing the built-in
full_pattern = r'^(?:[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?(?:,|$))*$'
extract_pattern = r'(?P<block>(?P<subject>[A-Z]{2,12})(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?)'
# Step 1: validate the whole string. Step 2: extract each block.
if re.fullmatch(full_pattern, s):
    for match in re.finditer(extract_pattern, s):
        print(match.groups())
http://ideone.com/kMl9qu
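The two-step approach can also be wrapped in a helper. The sketch below is a variation, not the answerer's code: `parse_blocks`, `BLOCK`, `FULL` and `EXTRACT` are hypothetical names, and the validation pattern is written as "block, then zero or more comma-plus-block" because the `(?:,|$)` form above also accepts an empty string and a trailing comma such as `JK,`.

```python
import re

# One block: 2 to 12 capitals, optionally followed by + or * and a number
# with up to two decimals.
BLOCK = r'[A-Z]{2,12}(?:[*+]\d+(?:\.\d{1,2})?)?'

# Whole string: a block followed by zero or more ",block" repetitions.
FULL = re.compile(rf'{BLOCK}(?:,{BLOCK})*')

EXTRACT = re.compile(
    r'(?P<subject>[A-Z]{2,12})'
    r'(?:(?P<operation>[*+])(?P<value>\d+(?:\.\d{1,2})?))?'
)


def parse_blocks(text):
    """Return (subject, operation, value) tuples, or None if text is invalid."""
    if not FULL.fullmatch(text):
        return None
    return [m.group('subject', 'operation', 'value')
            for m in EXTRACT.finditer(text)]


print(parse_blocks('JK+6.00,PP*2,ZZ,GROUPO'))
# [('JK', '+', '6.00'), ('PP', '*', '2'), ('ZZ', None, None), ('GROUPO', None, None)]
print(parse_blocks('JK,'))  # None
```

The backreference `(?P=block)` in the question cannot do this job on its own, since it matches the same text that the group captured earlier rather than re-applying the group's pattern.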

I'm guessing this is the pattern you were looking for:
(2 different letters)+(time stamp),(2 of the same letter)*(1 number),(2 of the same letter),(a string)
If that's the case, this regex would do the trick:
^(\w{2}\+\d{1,2}\.\d{2}),((\w)\3\*\d),((\w)\5),(\w+)$
Demo: https://regex101.com/r/8B3C6e/2
