Cleaning up commas in numbers w/ regular expressions in Python - python

I have been googling this one fervently, but I can't really narrow it down. I am attempting to interpret a csv file of values, common enough sort of behaviour. But I am being punished by values over a thousand, i.e. in quotations and involving a comma. I have kinda gotten around it by using the csv reader, which creates a list of numbers from the row, but I then have to pick the commas out afterwards.
For purely academic reasons, is there a better way to edit a string with regular expressions? Going from 08/09/2010,"25,132","2,909",650 to 08/09/2010,25132,2909,650.
(If you are into Vim, basically I want to put Python on this:
:1,$s/"\([0-9]*\),\([0-9]*\)"/\1\2/g :D )

Use the csv module for first-stage parsing, and a regex only for seeing if the result can be transformed to a number.
import csv, re
num_re = re.compile('^[0-9]+[0-9,]+$')
for row in csv.reader(open('input_file.csv')):
for el_num in len(row):
if num_re.match(row[el_num]):
row[el_num] = row[el_num].replace(',', '')
...although it would probably be faster not to use the regular expression at all:
for row in ([item.replace(',', '') for item in row]
for row in csv.reader(open('input_file.csv'))):
do_something_with_your(row)

I think what you're looking for is, assuming that commas will only appear in numbers, and that those entries will always be quoted:
import re
def remove_commas(mystring):
return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)
UPDATE:
Adding cdarke's comments below, the following should work for arbitrary-length numbers:
import re
def remove_commas_and_quotes(mystring):
return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)',r'\1',mystring))

Python has a regular expressions module, "re":
http://docs.python.org/library/re.html
However, in this case, you might want to consider using the "partition" function:
>>> s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'
>>> left_part,quote_mark,right_part = s.partition(")
>>> right_part
'12,345",more_string,"56,6789",and_some_more'
>>> number,quote_mark,remainder = right_part.partition(")
'12,345'
string.partition("character") splits a string into 3 parts, stuff to the left of the first occurrence of "character", "character" itself and stuff to the right.

Here's a simple regex for removing commas from numbers of any length:
re.sub(r'(\d+),?([\d+]?)',r'\1\2',mystring)

Related

How to find the indexes of certain character not in quotes in Python?

I ultimately want to split a string by a certain character. I tried Regex, but it started escaping \, so I want to avoid that with another approach (all the attempts at unescaping the string failed). So, I want to get all positions of a character char in a string that is not within quotes, so I can split them up accordingly.
For example, given the phase hello-world:la\test, I want to get back 11 if char is :, as that is the only : in the string, and it is in the 11th index. However, re does split it, but I get ['hello-world,lat\\test'].
EDIT:
#BoarGules made me realize that re didn't actually change anything, but it's just how Python displays slashes.
Here's a function that works:
def split_by_char(string,char=':'):
PATTERN = re.compile(rf'''((?:[^\{char}"']|"[^"]*"|'[^']*')+)''')
return [string[m.span()[0]:m.span()[1]] for m in PATTERN.finditer(string)]
string = 'hello-world:la\test'
char = ':'
print(string.find(char))
Prints
11
char_index = string.find(char)
string[:char_index]
Returns
'hello-world'
string[char_index+1:]
Returns
'la\test'
Solution for the case you're likely encountering (a pseudo-CSV format you're hand-rolling a parser for; if you're not in that situation, it's still a likely situation for people finding this question later):
Just use the csv module.
import csv
import io
test_strings = ['field1:field2:field3', 'field1:"field2:with:embedded:colons":field3']
for s in test_strings:
for row in csv.reader(io.StringIO(s), delimiter=':'):
print(row)
Try it online!
which outputs:
['field1', 'field2', 'field3']
['field1', 'field2:with:embedded:colons', 'field3']
correctly ignoring the colons within the quoted field, requiring no kludgy, hard-to-verify hand-written regexes.

Extracting part of a string with regex

I am trying to extract two characters from the middle of a string. From the strings ['data_EL27.dat', 'data.BV256.dat', 'data_BV257.dat'], I wish to output ['EL', 'BV', 'BV'].
Here's what I have using regex, this seems overly clunky for such a simple task, I feel like I'm missing some nicer way of doing this.
import re
str = 'data_EL27.dat'
m = re.search(r'(?:data[._])(\w{2})', str)
m.group(1) # 'EL'
I realise that in this case I can just take str[5:7] but I don't want to rely on the length of the prefix remaining the same.
Shouldn't it be just
m=re.search("data.(..)",str)
?

in python find index in list if combination of strings exist

I'm writing my first script and trying to learn python.
But I'm stuck and can't get out of this one.
I'm writing a script to change file names.
Lets say I have a string = "this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv"
I want the result to be string = "This Is Test3 E00"
this is what I have so far:
l = list(string)
//Transform the string into list
for i in l:
if "E" in l:
p = l.index("E")
if isinstance((p+1), int () is True:
if isinstance((p+2), int () is True:
delp = p+3
a = p-3
del l[delp:]
new = "".join(l)
new = new.replace("."," ")
print (new)
get in index where "E" and check if after "E" there are 2 integers.
Then delete everything after the second integer.
However this will not work if there is an "E" anyplace else.
at the moment the result I get is:
this is tEst
because it is finding index for the first "E" on the list and deleting everything after index+3
I guess my question is how do I get the index in the list if a combination of strings exists.
but I can't seem to find how.
thanks for everyone answers.
I was going in other direction but it is also not working.
if someone could see why it would be awesome. It is much better to learn by doing then just coping what others write :)
this is what I came up with:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
anyone can tell me why this isn't working. I get an error.
Thank you so much
Have you ever heard of a Regular Expression?
Check out python's re module. Link to the Docs.
Basically, you can define a "regex" that would match "E and then two integers" and give you the index of it.
After that, I'd just use python's "Slice Notation" to choose the piece of the string that you want to keep.
Then, check out the string methods for str.replace to swap the periods for spaces, and str.title to put them in Title Case
An easy way is to use a regex to find up until the E followed by 2 digits criteria, with s as your string:
import re
up_until = re.match('(.*?E\d{2})', s).group(1)
# this.is.tEst3.E00
Then, we replace the . with a space and then title case it:
output = up_until.replace('.', ' ').title()
# This Is Test3 E00
The technique to consider using is Regular Expressions. They allow you to search for a pattern of text in a string, rather than a specific character or substring. Regular Expressions have a bit of a tough learning curve, but are invaluable to learn and you can use them in many languages, not just in Python. Here is the Python resource for how Regular Expressions are implemented:
http://docs.python.org/2/library/re.html
The pattern you are looking to match in your case is an "E" followed by two digits. In Regular Expressions (usually shortened to "regex" or "regexp"), that pattern looks like this:
E\d\d # ('\d' is the specifier for any digit 0-9)
In Python, you create a string of the regex pattern you want to match, and pass that and your file name string into the search() method of the the re module. Regex patterns tend to use a lot of special characters, so it's common in Python to prepend the regex pattern string with 'r', which tells the Python interpreter not to interpret the special characters as escape characters. All of this together looks like this:
import re
filename = 'this.is.tEst3.E00.erfeh.ervwer.vwtrt.rvwrv'
match_object = re.search(r'E\d\d', filename)
if match_object:
# The '0' means we want the first match found
index_of_Exx = match_object.end(0)
truncated_filename = filename[:index_of_Exx]
# Now take care of any more processing
Regular expressions can get very detailed (and complex). In fact, you can probably accomplish your entire task of fully changing the file name using a single regex that's correctly put together. But since I don't know the full details about what sorts of weird file names might come into your program, I can't go any further than this. I will add one more piece of information: if the 'E' could possibly be lower-case, then you want to add a flag as a third argument to your pattern search which indicates case-insensitive matching. That flag is 're.I' and your search() method would look like this:
match_object = re.search(r'E\d\d', filename, re.I)
Read the documentation on Python's 're' module for more information, and you can find many great tutorials online, such as this one:
http://www.zytrax.com/tech/web/regex.htm
And before you know it you'll be a superhero. :-)
The reason why this isn't working:
for i in l:
if i=="E" and isinstance((i+1), int ) is True:
p = l.index(i)
print (p)
...is because 'i' contains a character from the string 'l', not an integer. You compare it with 'E' (which works), but then try to add 1 to it, which errors out.

A more powerful method than Python's find? A regex issue?

I'm looking for a list of strings and their variations within a very large string.
What I want to do is find even the implicit matches between two strings.
For example, if my start string is foo-bar, I want the matching to find Foo-bAr foo Bar, or even foo(bar.... Of course, foo-bar should also return a match.
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
How do I write an expression to meet these conditions?
I realize this might require some tricky regex. The thing is, I have a large list of strings I need to search for, and I feel regex is just the tool for making this as robust as I need.
Perhaps regex isn't the best solution?
Thanks for your help guys. I'm still learning to think in regex.
>>> def findString(inputStr, targetStr):
... if convertToStringSoup(targetStr).find(convertToStringSoup(inputStr)) != -1:
... return True
... return False
...
>>> def convertToStringSoup(testStr):
... testStr = testStr.lower()
... testStr = testStr.replace(" ", "")
... testStr = testStr.replace("(", "")
... testStr = testStr.replace(")", "")
... return testStr
...
>>>
>>> findString("hello", "hello")
True
>>> findString("hello", "hello1")
True
>>> findString("hello", "hell!o1")
False
>>> findString("hello", "hell( o)1")
True
should work according to your specification. Obviously, could be optimized. You're asking about regex, which I'm thinking hard about, and will hopefully edit this question soon with something good. If this isn't too slow, though, regexps can be miserable, and readable is often better!
I noticed that you're repeatedly looking in the same big haystack. Obviously, you only have to convert that to "string soup" once!
Edit: I've been thinking about regex, and any regex you do would either need to have many clauses or the text would have to be modified pre-regex like I did in this answer. I haven't benchmarked string.find() vs re.find(), but I imagine the former would be faster in this case.
I'm going to assume that your rules are right, and your examples are wrong, mainly since you added the rules later, as a clarification, after a bunch of questions. So:
EDIT: More specifically, I need the following matches.
The string itself, case insenstive.
The string with spaces separating any of the characters
The string with parentheses separating any of the characters.
The simplest way to do this is to just remove spaces and parens, then do a case-insensitive search on the result. You don't even need regex for that. For example:
haystack.replace(' ', '').replace('(', '').upper().find(needle.upper())
Try this regex:
[fF][oO]{2}[- ()][bB][aA][rR]
Test:
>>> import re
>>> pattern = re.compile("[fF][oO]{2}[- ()][bB][aA][rR]")
>>> m = pattern.match("foo-bar")
>>> m.group(0)
'foo-bar'
Using a regex, a case-insensitive search matches upper/lower case invariants, '[]' matches any contained characters and '|' lets you do multiple compares at once. Putting it all together, you can try:
import re
pairs = ['foo-bar', 'jane-doe']
regex = '|'.join(r'%s[ -\)]%s' % tuple(p.split('-')) for p in pairs)
print regex
results = re.findall(regex, your_text_here, re.IGNORECASE)

Split string with caret character in python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first "general menu" has white spaces, the second part (a subtitle) each word is separate with "_" character and finally a number (a pag number). I want to split each line in 3 (obvious) parts, because I want to create some sort of directory in python.
I was trying with re module, but as the caret character has a strong meaning in such module, I couldn't figure it out how to do it.
Could someone please help me????
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want three pieces you can accomplish this through a generator expression:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into an array containing each segment. The only caveat is that it will divide consecutive caret characters into an empty string. You could protect against this by either collapsing consecutive carats down into a single one, or detecting empty strings in the resultant array.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format that's compatible with the csv module, you could also look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best.
Also, for the re module, any special characters can be escaped with \, like r'\^'. I'd suggest before jumping to use re to 1) learn how to write regular expressions, 2) first look for a solution to your problem instead of jumping to regular expressions - «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. »

Categories

Resources