Split string with caret character in python - python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") contains whitespace, in the second part (a subtitle) the words are separated with the "_" character, and the line ends with a number (a page number). I want to split each line into these three (obvious) parts, because I want to build some sort of directory in Python.
I was trying the re module, but since the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help?

>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']

If you only want the three non-empty pieces, you can filter out the empty string with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']

What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.

You could just call string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters will produce an empty string. You can protect against this by either collapsing consecutive carets into a single one, or filtering empty strings out of the resulting list.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
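Putting the pieces together, here is one way the directory the question mentions could be built. The sample lines and the nested-dict layout are only assumptions for illustration:

```python
# Hypothetical sample lines standing in for the huge file.
lines = [
    "Some sort of general menu^a_sub_menu_title^^pagNumber",
    "Some sort of general menu^another_title^^42",
]

directory = {}
for line in lines:
    # split("^") yields an empty string for the doubled caret; drop it.
    menu, title, page = [part for part in line.split("^") if part]
    directory.setdefault(menu, {})[title.replace("_", " ")] = page

print(directory)
```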

It's also possible that your file uses a format compatible with the csv module, so you could look into that as well, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and is just delimiters and text, line.split is probably best.
Also, for the re module, any special character can be escaped with \, as in r'\^'. Before jumping to re, though, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a simpler solution to your problem: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»
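For completeness, a sketch of the csv route: the module accepts a one-character delimiter, so '^' can be passed directly (note it still yields the empty field for the doubled caret):

```python
import csv
import io

# In-memory stand-in for the real file.
data = io.StringIO("Some sort of general menu^a_sub_menu_title^^pagNumber\n")
for row in csv.reader(data, delimiter="^"):
    print(row)  # ['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
```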

Using a single replacement operation replace all leading tabs with spaces

In my text I want to replace all leading tabs with two spaces but leave the non-leading tabs alone.
For example:
a
\tb
\t\tc
\td\te
f\t\tg
("a\n\tb\n\t\tc\n\td\te\nf\t\tg")
should turn into:
a
  b
    c
  d\te
f\t\tg
("a\n  b\n    c\n  d\te\nf\t\tg")
For my case I could do that with multiple replacement operations, repeating as many times as the many maximum nesting level or until nothing changes.
But wouldn't it also be possible to do in a single run?
I tried but didn't manage to come up with something, the best I came up yet was with lookarounds:
re.sub(r'(^|(?<=\t))\t', '  ', a, flags=re.MULTILINE)
which makes only one wrong replacement (the second tab between f and g).
Now it might be that this is simply impossible to do with regex in a single pass, because the already-replaced parts can't be matched again (or rather, the replacement does not happen right away) and you can't really "count" in regex. In that case I would love to see a more detailed explanation of why (as long as this doesn't shift too much into [cs.se] territory).
I am working in Python currently but this could apply to pretty much any similar regex implementation.
You can match the tabs at the start of each line and use a lambda inside re.sub to replace them with two spaces multiplied by the length of the match:
import re
s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"
print(re.sub(r"^\t+", lambda m: "  "*len(m.group()), s, flags=re.M))
It is also possible to do this without regex, using replace() in a one-liner:
>>> s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"
>>> "\n".join(x.replace("\t","  ",len(x)-len(x.lstrip("\t"))) for x in s.split("\n"))
'a\n  b\n    c\n  d\te\nf\t\tg'
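Unrolled into ordinary statements, the same logic reads more easily:

```python
s = "a\n\tb\n\t\tc\n\td\te\nf\t\tg"

result_lines = []
for x in s.split("\n"):
    # Number of leading tabs = total length minus length with leading tabs stripped.
    leading = len(x) - len(x.lstrip("\t"))
    # replace() with a count only replaces the first `leading` tabs,
    # which are exactly the leading ones.
    result_lines.append(x.replace("\t", "  ", leading))

print("\n".join(result_lines))
```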
This here is kinda crazy, but it works:
"\n".join([re.sub(r"^(\t+)", " "*(2*len(re.sub(r"^(\t+).*", r"\1", x))), x) for x in "a\n\tb\n\t\tc\n\td\te\nf\t\tg".splitlines()])

Matching characters in two Python strings

I am trying to print the characters shared between two strings in Python. I am hoping to do this using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since in both variables the characters that are shared are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Casting a string to a set returns an iterable with unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
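If you need the result back as a single string rather than a set, join the intersection. Note that a set has no defined order, so sorting gives the deterministic 'ap' rather than the question's 'pa':

```python
first_word = "peepa"
second_word = "poopa"

match = set(first_word.lower()) & set(second_word.lower())
print("".join(sorted(match)))  # ap
```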
Using regular expressions
This problem is tailor-made for sets. But you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
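If the first word isn't hard-coded, it's safer to build the character class with re.escape, since characters such as ^ or ] have special meaning inside [...] (a sketch):

```python
import re

first_word = "peepa"
second_word = "poopa"

# re.escape neutralises any regex metacharacters in first_word.
pattern = "[^" + re.escape(first_word) + "]"
print(re.sub(pattern, "", second_word))  # ppa
```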

Best way to parse string with delimiters (thinking regex)?

I'm currently trying to parse a python string for some specific text inside of it. It should actually be really straightforward.
But more importantly, I want to know if regex is a "tool set" type thing, where you know a certain number of tricks? Some people are very, very proficient with them, and I want to attain that proficiency.
So while I am asking how to match this string, I'd also like an explanation of the thought process you went through as you arrived at your solution.
I basically want text A, text-B, and text_C, delimited only by commas.
The desired output string:
"text A,text-B,text_C"
The original text is as follows:
"(1, u'text A', u'text-B', u'text_C')"
In my limited understanding, the main thing separating each piece is a single quote, so I would start with that. But ultimately I might have strings such as text-'A, and I want to make sure that I don't run into errors because I parsed the string incorrectly.
Thanks for your time. Remember: thought process.
Since the string you're dealing with is a repr version of a Python tuple, the most Pythonic way is to use ast.literal_eval which can take that object and safely convert back to a Python object retaining the correct types:
import ast
text = "(1, u'text A', u'text-B', u'text_C')"
tup = ast.literal_eval(text)
Then if you only wish to join each item that's a string together:
joined = ', '.join(el for el in tup if isinstance(el, str))  # basestring on Python 2
# text A, text-B, text_C
Otherwise just slice the tuple tup[1:] and join the items in that...
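Spelled out, the slicing variant looks like this (using the question's sample string; a bare ',' join matches the desired output exactly):

```python
import ast

text = "(1, u'text A', u'text-B', u'text_C')"
tup = ast.literal_eval(text)  # (1, 'text A', 'text-B', 'text_C')

# Drop the leading integer and join the remaining strings.
joined = ",".join(tup[1:])
print(joined)  # text A,text-B,text_C
```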
In terms of a regex, a quick and dirty, non-robust method, that will break easily and possibly even provide incorrect matches under some circumstances is to use:
import re
string_vals = re.findall("'(.*?)'", text)
This finds anything after a ' up until the very next '... Again, using ast.literal_eval is much nicer here...
Must it be regex? :(
a_str = "(1, u'text A', u'text-B', u'text_C')"
print(",".join(a_str[1:-1].split(",")[1:]).replace('u', '').replace("'", ''))
Yields:
text A, text-B, text_C
EDIT: well if it must be regex .. don't mind this post, it doesn't work for many cases.

A simple regexp in python

My program is a simple calculator, so I need to parse the expression the user types to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding single quotes inside the parentheses. How can I do that?
UPDATE:
To be clearer: I want to add single quotes after every occurrence of "MM(" and before the ")" that follows it, and after every occurrence of "func(" and before the "," that follows it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
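Run end-to-end on the question's input, with the second substitution applied to the result of the first:

```python
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"

# Quote the argument of MM(...): anything up to the next ')'.
step1 = re.sub(r"MM\(([^)]*)\)", r"MM('\1')", input_user)
# Quote the first argument of func(...): anything up to the next ','.
output = re.sub(r"func\(([^,]*),", r"func('\1',", step1)

print(output)  # 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))
```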

Cleaning up commas in numbers w/ regular expressions in Python

I have been googling this one fervently, but I can't really narrow it down. I am attempting to interpret a csv file of values, common enough sort of behaviour. But I am being punished by values over a thousand, i.e. in quotations and involving a comma. I have kinda gotten around it by using the csv reader, which creates a list of numbers from the row, but I then have to pick the commas out afterwards.
For purely academic reasons, is there a better way to edit a string with regular expressions? Going from 08/09/2010,"25,132","2,909",650 to 08/09/2010,25132,2909,650.
(If you are into Vim, basically I want to put Python on this:
:1,$s/"\([0-9]*\),\([0-9]*\)"/\1\2/g :D )
Use the csv module for first-stage parsing, and a regex only for seeing if the result can be transformed to a number.
import csv, re
num_re = re.compile('^[0-9]+[0-9,]+$')
for row in csv.reader(open('input_file.csv')):
    for el_num in range(len(row)):
        if num_re.match(row[el_num]):
            row[el_num] = row[el_num].replace(',', '')
...although it would probably be faster not to use the regular expression at all:
for row in ([item.replace(',', '') for item in row]
            for row in csv.reader(open('input_file.csv'))):
    do_something_with_your(row)
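A self-contained version of the same generator pipeline, reading from an in-memory string instead of a file on disk:

```python
import csv
import io

data = io.StringIO('08/09/2010,"25,132","2,909",650\n')

for row in ([item.replace(",", "") for item in row]
            for row in csv.reader(data)):
    print(row)  # ['08/09/2010', '25132', '2909', '650']
```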
I think what you're looking for is, assuming that commas will only appear in numbers, and that those entries will always be quoted:
import re
def remove_commas(mystring):
    return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)
UPDATE:
Following cdarke's comments, the following should work for arbitrary-length numbers:
import re
def remove_commas_and_quotes(mystring):
    return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)', r'\1', mystring))
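For reference, both helpers applied to the question's sample line (a self-contained check):

```python
import re

def remove_commas(mystring):
    return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)

def remove_commas_and_quotes(mystring):
    return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)', r'\1', mystring))

line = '08/09/2010,"25,132","2,909",650'
print(remove_commas(line))             # 08/09/2010,25132,2909,650
print(remove_commas_and_quotes(line))  # 08/09/2010,25132,2909,650
```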
Python has a regular expressions module, "re":
http://docs.python.org/library/re.html
However, in this case, you might want to consider using the "partition" function:
>>> s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'
>>> left_part,quote_mark,right_part = s.partition('"')
>>> right_part
'12,345",more_string,"56,6789",and_some_more'
>>> number,quote_mark,remainder = right_part.partition('"')
>>> number
'12,345'
string.partition("character") splits a string into 3 parts, stuff to the left of the first occurrence of "character", "character" itself and stuff to the right.
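Extending the partition idea into a loop that strips the commas from every quoted section (a sketch; it assumes quotes are always balanced, and drops the quotes as well):

```python
s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'

result = []
remainder = s
while True:
    left, quote, rest = remainder.partition('"')
    result.append(left)
    if not quote:  # no more quoted sections
        break
    # The text up to the closing quote is the number; drop its commas.
    number, _, remainder = rest.partition('"')
    result.append(number.replace(",", ""))

print("".join(result))  # some_long_string,12345,more_string,566789,and_some_more
```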
Here's a simple regex that removes a comma sitting between two digits, which works for numbers of any length (it assumes digit,digit only ever occurs inside a number):
re.sub(r'(\d),(?=\d)', r'\1', mystring)
