Hey I'm trying to figure out a regular expression to do the following.
Here is my string
Place,08/09/2010,"15,531","2,909",650
I need to split this string by the commas. However, because commas are also used inside the numeric fields, the split doesn't work correctly. So I want to remove the commas in the numbers before splitting the string.
Thanks.
new_string = re.sub(r'"(\d+),(\d+)"', r'\1.\2', original_string)
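This replaces the , inside each quoted number with a . (and drops the quotes), so you can then just use the string's split method. Use r'\1\2' as the replacement if you want the comma removed rather than turned into a . character.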
>>> from StringIO import StringIO
>>> import csv
>>> r = csv.reader(StringIO('Place,08/09/2010,"15,531","2,909",650'))
>>> r.next()
['Place', '08/09/2010', '15,531', '2,909', '650']
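The session above is Python 2. A minimal sketch of the same idea, assuming Python 3 (io.StringIO instead of the StringIO module) and adding a step that drops the thousands separators the question asked about; the names fields and cleaned are only illustrative:

```python
import csv
import io

line = 'Place,08/09/2010,"15,531","2,909",650'

# csv.reader treats commas inside double quotes as part of the field.
fields = next(csv.reader(io.StringIO(line)))

# Drop the commas from fields that consist only of digits and commas.
cleaned = [f.replace(",", "") if f.replace(",", "").isdigit() else f
           for f in fields]

print(cleaned)
```

Letting the csv module handle the quoting means no regex is needed to tell field separators from thousands separators.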
Another way of doing it using regex directly:
>>> import re
>>> data = "Place,08/09/2010,\"15,531\",\"2,909\",650"
>>> res = re.findall(r"(\w+),(\d{2}/\d{2}/\d{4}),\"([\d,]+)\",\"([\d,]+)\",(\d+)", data)
>>> res
[('Place', '08/09/2010', '15,531', '2,909', '650')]
You could parse a string of that format using pyparsing:
import pyparsing as pp
import datetime as dt
st='Place,08/09/2010,"15,531","2,909",650'
def line_grammar():
    integer=pp.Word(pp.nums).setParseAction(lambda s,l,t: [int(t[0])])
    sep=pp.Suppress('/')
    date=(integer+sep+integer+sep+integer).setParseAction(
        lambda s,l,t: dt.date(t[2],t[1],t[0]))
    comma=pp.Suppress(',')
    quoted=pp.Regex(r'("|\').*?\1').setParseAction(
        lambda s,l,t: [int(e) for e in t[0].strip('\'"').split(',')])
    line=pp.Word(pp.alphas)+comma+date+comma+quoted+comma+quoted+comma+integer
    return line
line=line_grammar()
print(line.parseString(st))
# ['Place', datetime.date(2010, 9, 8), 15, 531, 2, 909, 650]
The advantage is that you parse, convert, and validate in a few lines. Note that the numeric fields are converted to ints and the date to a datetime.date object.
a = """Place,08/09/2010,"15,531","2,909",650""".split(',')
result = []
i=0
while i<len(a):
    if not "\"" in a[i]:
        result.append(a[i])
    else:
        string = a[i]
        i+=1
        while True:
            string += ","+a[i]
            if "\"" in a[i]:
                break
            i+=1
        result.append(string)
    i+=1
print result
Result:
['Place', '08/09/2010', '"15,531"', '"2,909"', '650']
Not a big fan of regular expressions unless you absolutely need them
If you need a regex solution, this should do:
r"(\d+),(?=\d\d\d)"
then replace with:
"\1"
It will replace any comma-delimited numbers anywhere in your string with their number-only equivalent, thus turning this:
Place,08/09/2010,"15,531","548,122,909",650
into this:
Place,08/09/2010,"15531","548122909",650
I'm sure there are a few holes to be found and places you don't want this done, and that's why you should use a parser!
Good luck!
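As a rough sketch of how that pattern behaves with re.sub (the name THOUSANDS and the second sample string are made up for illustration), including one of the holes mentioned above:

```python
import re

# Remove a comma that is preceded by digits and followed by three digits.
THOUSANDS = re.compile(r"(\d+),(?=\d\d\d)")

quoted = 'Place,08/09/2010,"15,531","548,122,909",650'
print(THOUSANDS.sub(r"\1", quoted))

# A hole: two unquoted fields that look like a grouped number get merged.
unquoted = "a,12,345,b"
print(THOUSANDS.sub(r"\1", unquoted))
```

The first call gives the result shown above; the second turns two separate fields into the single field 12345, which is the kind of case a real CSV parser avoids.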
I have a spreadsheet with text values like A067, A002, A104, and I want the numeric part of each as an integer. What is the most efficient way to do this? Right now I am doing the following:
str = 'A067'
str = str.replace('A','')
n = int(str)
print n
Depending on your data, the following might be suitable:
import string
print int('A067'.strip(string.ascii_letters))
Python's strip() method takes a string of characters to be removed from the start and end of a string. By passing string.ascii_letters, it removes any leading and trailing letters from the string.
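A small sketch of what "start and end" means in practice (to_int is a hypothetical helper name, and the extra sample values are invented): letters in the middle of the string are left alone, so this only suits values shaped like the ones in the question.

```python
import string

def to_int(label):
    # strip() only trims the given characters from the two ends of the string.
    return int(label.strip(string.ascii_letters))

print(to_int("A067"))
print(to_int("AB067X"))

# A letter in the middle survives, so int() would fail on this result.
print("A0B67".strip(string.ascii_letters))
```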
If the only non-number part of the input will be the first letter, the fastest way will probably be to slice the string:
s = 'A067'
n = int(s[1:])
print n
If you believe that you will find more than one number per string though, the regex answers will most likely be easier to work with.
You could use regular expressions to find numbers.
import re
s = 'A067'
s = re.findall(r'\d+', s) # This will find all numbers in the string
n = int(s[0]) # This will get the first number. Note: if there are no numbers, s is empty and this raises an IndexError. A simple check can avoid this
print n
Here's some example output of findall with different strings
>>> a = re.findall(r'\d+', 'A067')
>>> a
['067']
>>> a = re.findall(r'\d+', 'A067 B67')
>>> a
['067', '67']
You can use the search method of a compiled regex from the re module.
import re
line = 'A067'
regex = re.compile(r"(?P<numbers>\d+)")
matcher = regex.search(line)
if matcher:
    numbers = int(matcher.groupdict()["numbers"]) # this will give you the numbers from the captured group
import string
str = 'A067'
print (int(str.strip(string.ascii_letters)))
I'm reading in a large text file with lots of columns, dollar related and not, and I'm trying to figure out how to strip the dollar fields ONLY of $ and , characters.
so say I have:
a|b|c
$1,000|hi,you|$45.43
$300.03|$MS2|$55,000
where a and c are dollar-fields and b is not.
The output needs to be:
a|b|c
1000|hi,you|45.43
300.03|$MS2|55000
I was thinking that regex would be the way to go, but I can't figure out how to express the replacement:
f=open('sample1_fixed.txt','wb')
for line in open('sample1.txt', 'rb'):
    new_line = re.sub(r'(\$\d+([,\.]\d+)?k?)',????, line)
    f.write(new_line)
f.close()
Anyone have an idea?
Thanks in advance.
Unless you are really tied to the idea of using a regex, I would suggest doing something simple, straight-forward, and generally easy to read:
def convert_money(inval):
    # only touch fields that start with '$' and are numeric once ',' is removed
    if inval.startswith('$'):
        test_val = inval[1:].replace(",", "")
        try:
            _ = float(test_val)
        except ValueError:
            pass
        else:
            inval = test_val
    return inval
def convert_string(s):
    return "|".join(map(convert_money, s.split("|")))
a = '$1,000|hi,you|$45.43'
b = '$300.03|$MS2|$55,000'
print convert_string(a)
print convert_string(b)
OUTPUT
1000|hi,you|45.43
300.03|$MS2|55000
A simple approach:
>>> import re
>>> exp = '\$\d+(,|\.)?\d+'
>>> s = '$1,000|hi,you|$45.43'
>>> '|'.join(i.translate(None, '$,') if re.match(exp, i) else i for i in s.split('|'))
'1000|hi,you|45.43'
It sounds like you are addressing the entire line of text at once. I think your first task would be to break up your string by columns into an array or some other variables. Once you've done that, your solution for converting strings of currency into numbers doesn't have to worry about the other fields.
Once you've done that, I think there is probably an easier way to do this task than with regular expressions. You could start with this SO question.
If you really want to use regex though, then this pattern should work for you:
/[$,]/g
Demo on regex101
Replace matches with empty strings. The pattern gets a little more complicated if you have other kinds of currency present.
Try this regex, if it helps:
\$(\d+)[\,]*([\.]*\d*)
SEE DEMO : http://regex101.com/r/wM0zB6/2
Use the regex
((?<=\d),(?=\d))|(\$(?=\d))
e.g.
>>> import re
>>> x="$1,000|hi,you|$45.43"
>>> re.sub( r'((?<=\d),(?=\d))|(\$(?=\d))', r'', x)
'1000|hi,you|45.43'
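A quick sketch of that lookaround pattern run against both sample rows from the question (MONEY_CHARS is an illustrative name, and the last input is an invented edge case):

```python
import re

# A comma between two digits, or a dollar sign directly before a digit.
MONEY_CHARS = re.compile(r"(?<=\d),(?=\d)|\$(?=\d)")

for row in ("$1,000|hi,you|$45.43", "$300.03|$MS2|$55,000"):
    print(MONEY_CHARS.sub("", row))

# Caveat: a text column that contains digit,digit is rewritten as well.
print(MONEY_CHARS.sub("", "x1,2y"))
```

The pattern leaves hi,you and $MS2 alone because it never looks at column boundaries, only at the neighbouring characters. That is also why the last example changes: if text columns can contain such values, splitting on | first is the safer route.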
Try the below regex and then replace the matched strings with \1\2\3
\$(\d+(?:\.\d+)?)(?:(?:,(\d{2}))*(?:,(\d{3})))?
DEMO
Defining a blacklist and checking whether each character is in it is an easy way to do this. Note that it removes $ and , from every column, not only from the dollar fields:
blacklist = ("$", ",") # define characters to remove
with open('sample1_fixed.txt','wb') as f:
    for line in open('sample1.txt', 'rb'):
        clean_line = "".join(c for c in line if c not in blacklist)
        f.write(clean_line)
\$(?=(?:[^|]+,)|(?:[^|]+\.))
Try this. Replace with an empty string. Use the re.M option. See the demo.
http://regex101.com/r/gT6kI4/6
I am trying to split a string such as: add(ten)sub(one) into add(ten) sub(one).
I can't figure out how to match the close parenthesis. I have used re.sub(r'\\)', '\\) ') and every variation of escaping the parentheses I can think of. It is hard to tell in this font, but I am trying to add a space between these commands so I can split it into a list later.
There's no need to escape ) in the replacement string. ) has a special meaning only in the regex pattern, so it needs to be escaped there in order to match it in the string, but in a normal string it can be used as is.
>>> strs = "add(ten)sub(one)"
>>> re.sub(r'\)(?=\S)',r') ', strs)
'add(ten) sub(one)'
As @StevenRumbalski pointed out in the comments, the above operation can be done more simply using str.replace and str.strip:
>>> strs.replace(')',') ').strip()
'add(ten) sub(one)'
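Since the stated goal is a list, the substitution and the split can be combined. This is only a sketch: split_calls is a made-up name, and it assumes the commands contain no whitespace and no nested parentheses.

```python
import re

def split_calls(text):
    # Insert a space after each ')' that is directly followed by a non-space
    # character, then split on whitespace.
    spaced = re.sub(r"\)(?=\S)", ") ", text)
    return spaced.split()

print(split_calls("add(ten)sub(one)"))
```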
d = ')'
my_str = 'add(ten)sub(one)'
result = [t+d for t in my_str.split(d) if len(t) > 0]
# result is now ['add(ten)', 'sub(one)']
Create a list of all substrings
import re
a = 'add(ten)sub(one)'
print [ b for b in re.findall('(.+?\(.+?\))', a) ]
Output:
['add(ten)', 'sub(one)']
I am completely new to Python and don't know how to get a sub-string which matches some wildcard condition from a string.
I am trying to get a timestamp from the following string:
sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data
I want to get only "1360922654.97671" part out of the string.
Please help.
Because you mentioned wildcards, you can use re:
In [77]: import re
In [78]: s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
In [79]: re.findall("\d+\.\d+", s)
Out[79]: ['1360922654.97671']
If the dots and dashes have their specific function within your string, you can use this:
>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
Step by step:
>>> s.rsplit('.', 1)
['sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671', 'data']
>>> s.rsplit('.', 1)[0]
'sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671'
>>> s.rsplit('.', 1)[0].split('-')
['sdc4', '251504', '7f5', 'f59c349f0e516894fc89d2686a0d57f5', '1360922654.97671']
>>> s.rsplit('.', 1)[0].split('-')[-1]
'1360922654.97671'
This will work for any strings in the form:
anything-WHATYOUWANT.stringwithoutdots
>>> s = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"
>>> s.split('-')[-1][:-5]
'1360922654.97671'
Slightly fewer characters, but it only works where the last part of the string is .data or another five-character suffix.
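If the extension can vary in length, one possible variant anchors a regex to the end of the name instead. This is a sketch that assumes the timestamp is always the last dash-separated part and is followed by a single extension; the variable names are illustrative.

```python
import re

name = "sdc4-251504-7f5-f59c349f0e516894fc89d2686a0d57f5-1360922654.97671.data"

# '-', then digits.digits (captured), then one extension up to the end.
match = re.search(r"-(\d+\.\d+)\.[^.-]+$", name)
timestamp = match.group(1) if match else None
print(timestamp)
```

Checking match before calling group() avoids an AttributeError on names that do not follow the pattern.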
I'm getting started with RegEx and I was wondering if anyone could help me craft a statement to convert coordinates as follows:
145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16
to
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
(Strip off the last comma and value and turn it into a line break.)
I can't figure out how to use wildcards to do something like that. Any help would be greatly appreciated! Thanks.
"Some people, when confronted with a
problem, think 'I know, I'll use
regular expressions.' Now they have
two problems." --Jamie Zawinski
Avoid that problem and use string methods:
s="145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
lines = s.split(' ') # each line is separated by ' '
for line in lines:
    a,b,c=line.split(',') # three parts, separated by ','
    print a+','+b
Regex have their uses, but this is not one of them.
>>> import re
>>> s="145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
>>> print re.sub(r",\d+\s?","\n",s)
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
String methods seem to suffice here, regex are overkill:
>>> s='145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> print('\n'.join(line.rpartition(',')[0] for line in s.split()))
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
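For readers who have not met rpartition: it splits at the last occurrence of the separator and returns a 3-tuple, which is why [0] keeps everything before the final comma. A short sketch using a shortened copy of the input (parts is an illustrative name):

```python
s = "145.00694,-37.80421,9 145.00686,-37.80382,9"

# rpartition(',') -> (text before the last comma, ',', text after it)
parts = [item.rpartition(",") for item in s.split()]

for head, sep, tail in parts:
    print((head, sep, tail))
```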
>>> s = '145.00694,37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> patt = '(%s,%s),%s' % (('[+-]?\d+\.?\d*', )*3)
>>> m = re.findall(patt, s)
>>> m
['145.00694,37.80421', '145.00686,-37.80382', '145.00595,-37.8035', '145.00586,-37.80301']
>>> print '\n'.join(m)
145.00694,37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
But I would prefer not to use regular expressions in this case.
I like SilentGhost's solution.