extracting numbers from list of strings with python - python

I have a list of strings that I am trying to parse for data that is meaningful to me. I need an ID number that is contained within the string. Sometimes it might be two or even three of them. Example string might be:
lst1 = [
"(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3999595(tower 4rd floor window corner : option 3_floor: : whatever else is in iit " new floor : id 3999999)",
"(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3998895(tower 4rd floor window corner : option 3_floor: : id 5555456 whatever else is in iit " new floor : id 3998899)"
]
I would like to be able to iterate over that list of strings and extract only those highlighted id values.
Output would be a lst1 = ["3999595; 3999999", "3998895; 5555456; 3998899"] where each id values from the same input string is separated by a colon but list order still matches the input list.

You can use id\s(\d{7}) regular expression.
Iterate over items in a list and join the results of findall() call by ;:
import re
lst1 = [
'(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3999595(tower 4rd floor window corner : option 3_floor: : whatever else is in iit " new floor : id 3999999)',
'(Tower 3rd floor window corner_ : option 3_floor cut out_large : GA - floors : : model lines : id 3998895(tower 4rd floor window corner : option 3_floor: : id 5555456 whatever else is in iit " new floor : id 3998899)'
]
pattern = re.compile(r'id\s(\d{7})')
print ["; ".join(pattern.findall(item)) for item in lst1]
prints:
['3999595; 3999999', '3998895; 5555456; 3998899']

Based on #alecxe solution you can also do it without any imports.
If your id numbers are always after id and have a fixed (7) number of digits I would probably just use .split('id ') to separate it and get the 7 digits from the second block onwards.
You can put them together in the desired format by using '; '.join()
Putting everything together:
pattern = ['; '.join([value[:7] for value in valueList.split('id ')[1:]]) for valueList in lst1]
Which prints out:
['3999595; 3999999', '3998895; 5555456; 3998899']

Related

Selenium innerHTML list, print specific value

First of all, I'm new to working with Python, especially Selenium. So I connected to a page with the webdriver and also already grabbed the InnerHTML I need. Here's my problem, InnerHTML is a "list" and I only want to output one value. It looks something like this:
<html>
<body>
<pre style="example" xpath="1">
"amount": 12{
"value" : 3
},
</pre>
</body>
</html>
^It's just for illustration, because the actual thing is much longer. InnerHTML looks like this:
"amount": 12{
"value" : 3
},
^This is where I am now. I can't specify a line because the page is not static. How do I make python find "value" from a variable in InnerHTML ? Please note that there is a colon after "value"!
Thank you very much in advance!
I suggest using regular expression to find the value. I assume that you only need the number part, so here's the code:
innerHTML = '''
"amount": 12{
"value" : 3
},"value":4
'value': 5
'''
import re
regex = re.compile(r'''("|')value("|')\s*:\s*(?P<number>\d+)''')
startpos = 0
m = None
while 1:
m = regex.search(innerHTML, startpos)
if m is None: break
print(m.group("number"))
startpos = m.start() + 1
# output:
# 3
# 4
# 5
This will print out all the value numbers found, as strings. You can convert them to integers afterwards, for example.
NOTE: My code also accounts for the case value is surrounded by single quotes ' rather than double quotes ". This is for your convenience; if not, you can change the appropriate line above to:
regex = re.compile(r'''"value"\s*:\s*(?P<number>\d+)''')
In that case, the output would not include the value 5.

Regex sub only removes certain expressions

I'm running a program which creates product labels based on csv data. The function which I am struggling with takes a data structure which consists of a number combination(width of a wooden plank) and a string (name of product). Possible combinations I search for are as follows:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
if size:
# remove size from description
desc = re.sub(size.group(), '', desc)
size = size.group() # extract match from obj
else:
size = "None"
The function does as intended with the first two samples, however when it comes across the last product, it recognizes the size but does not remove it from description. Screen shot below shows the output after I print (size + \n + desc)
Is there an issue with my re expression or elsewhere?
Thanks
re.sub() expects its first argument to be a regex. It works for the first two because they don't contain any characters that have special meaning in the context, however the third contains +, which is special.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
Why replace and not simply match what you need?
import re
text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')
print(text)
for t in text:
pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
m = re.search(pattern,t)
print(m.group('size'))
print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
2 named groups, between them 0-n spaces.
1st group only 0123456789-+/ allowed
2nd group any but 0123456789 allowed

Replace integer with number of spaces

If I have these names:
bob = "Bob 1"
james = "James 2"
longname = "longname 3"
And priting these gives me:
Bob 1
James 2
longname 3
How can I make sure that the numbers would be aligned (without using \t or tabs or anything)? Like this:
Bob 1
James 2
longname3
This is a good use for a format string, which can specify a width for a field to be filled with a character (including spaces). But, you'll have to split() your strings first if they're in the format at the top of the post. For example:
"{: <10}{}".format(*bob.split())
# output: 'Bob 1'
The < means left align, and the space before it is the character that will be used to "fill" the "emtpy" part of that number of characters. Doesn't have to be spaces. 10 is the number of spaces and the : is just to prevent it from thinking that <10 is supposed to be the name of the argument to insert here.
Based on your example, it looks like you want the width to be based on the longest name. In which case you don't want to hardcode 10 like I just did. Instead you want to get the longest length. Here's a better example:
names_and_nums = [x.split() for x in (bob, james, longname)]
longest_length = max(len(name) for (name, num) in names_and_nums)
format_str = "{: <" + str(longest_length) + "}{}"
for name, num in names_and_nums:
print(format_str.format(name, num))
See: Format specification docs

Extract a number of continuous digits from a random string in python [duplicate]

This question already has answers here:
extracting numbers from list of strings with python
(2 answers)
Closed 8 years ago.
I am trying to parse this list of strings that contains ID values as a series of 7 digits but I am not sure how to approach this.
lst1=[
"(Tower 3rd fl floor_WINDOW CORNER : option 2_ floor cut out_small_wood) : GA -
Floors : : Model Lines : id 3925810
(Tower 3rd fl floor_WINDOW CORNER : option 2_ floor cut out_small_wood) : GA - Floors : Floors : Floor : Duke new core floors : id 3925721",
"(Tower 3rd fl floor_WINDOW CORNER : option 3_ floor cut out_large_wood) : GA - Floors : : Model Lines : id 3976019
(Tower 3rd fl floor_WINDOW CORNER : option 3_ floor cut out_large_wood) : GA - Floors : Floors : Floor : Duke new core floors : id 3975995"
]
I really want to pull out just the digit values and combine them into one string separated by a colon ";".
The resulting list would be something like this:
lst1 = ["3925810; 3925721", "3976019; 3975995"]
You can use regular expression, like this
import re
pattern = re.compile(r"\bid\s*?(\d+)")
print ["; ".join(pattern.findall(item)) for item in lst1]
# ['3925810; 3925721', '3976019; 3975995']
Debuggex Demo
If you want to make sure that the numbers you pick will only be of length 7, then you can do it like this
pattern = re.compile(r"\bid\s*?(\d{7})\D*?")
Debuggex Demo
The \b refers to the word boundary. So, it makes sure that id will be a separate word and it is followed by 0 or more whitespace characters. And then we match numeric digits [0-9], \d is just a shorthand notation for the same. {7} matches only seven times and followed by \D, its the inverse of \d.
Assuming your list is meant to be:
lst1=[
"(Tower 3rd fl floor_WINDOW CORNER : option 2_ floor cut out_small_wood) : GA - Floors : : Model Lines : id 3925810",
"(Tower 3rd fl floor_WINDOW CORNER : option 2_ floor cut out_small_wood) : GA - Floors : Floors : Floor : Duke new core floors : id 3925721",
"(Tower 3rd fl floor_WINDOW CORNER : option 3_ floor cut out_large_wood) : GA - Floors : : Model Lines : id 3976019",
"(Tower 3rd fl floor_WINDOW CORNER : option 3_ floor cut out_large_wood) : GA - Floors : Floors : Floor : Duke new core floors : id 3975995"
]
This will do the trick:
lst2 = [i.rsplit(" ",1)[1] for i in lst1]
which gives:
['3925810', '3925721', '3976019', '3975995']

Execute only if string contains a ','?

I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file and the column (annoyingly) contains the name, degree, and area of practice:
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
def parse_ieca_gc(s):
########################## HANDLE NAME ELEMENT ###############################
degrees = ['M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
degrees_list = []
# separate area of practice from name and degree and bind this to var 'area'
split_area_nmdeg = s['name'].split(',')
area = split_area_nmdeg.pop() # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list, that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
print 'split area nmdeg'
print area
print split_area_nmdeg
# Split the name and deg by spaces. If there's a deg, it will match with one of elements and will be stored deg list. The deg is removed name_deg list and all that's left is the name.
split_name_deg = re.split('\s',split_area_nmdeg[0])
for word in split_name_deg:
for deg in degrees:
if deg == word:
degrees_list.append(split_name_deg.pop())
name = ' '.join(split_name_deg)
# area of practice
category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
if re.match(...) is not None :
instead of looking for boolean use that. Match returns a MatchObject instance on success, and None on failure.
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 2:
print "Your old code goes here"
else:
print "Your new code goes here"

Categories

Resources