Python Complicated Splitting of a String - python

Say that I have this string:
"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"
How can I make it so string.split(";") or string.split(":") will ignore any characters in quotation marks?
Thanks,
PM

If you can't get a cleaner input than that, I'd recommend using a regular expression and creating a list of tuples with findall():
>>> import re
>>> mystring = '"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"'
>>> result = re.findall(r'"(.+?)":"(.+?)":"(.+?)"', mystring)
>>> for item in result:
...     print(*item)
...
hello noun a greeting
hello verb the;action;of;greeting
You can format the output with str.format():
>>> for item in result:
...     print('{} - {}, {}'.format(*(part.replace(';', ' ') for part in item)))
...
hello - noun, a greeting
hello - verb, the action of greeting

Your question doesn't make it 100% clear whether all strings are inside quotation marks. Anyway, this should work. It doesn't remove the quotation marks around each string (you can do that afterwards if you want to).
In [18]: import re
In [19]: s = '"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"'
In [20]: [x for x in re.split(r'(:|;|".*?")', s) if x not in [":",";",""]]
Out[20]:
['"hello"',
 '"noun"',
 '"a greeting"',
 '"hello"',
 '"verb"',
 '"the;action;of;greeting"']
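For the literal question (making a split on ";" or ":" ignore anything inside quotation marks), one possible approach is a lookahead that only accepts a delimiter when an even number of double quotes follows it. The sketch below is not from the answers above; the helper name split_outside_quotes is made up for illustration, and it assumes the quotes are balanced and never escaped.

```python
import re

def split_outside_quotes(text, delimiter):
    # Hypothetical helper: split on `delimiter` only where an even number of
    # double quotes follows, i.e. where the delimiter is outside any quotes.
    # Assumes balanced, unescaped double quotes.
    pattern = re.escape(delimiter) + r'(?=(?:[^"]*"[^"]*")*[^"]*$)'
    return re.split(pattern, text)

mystring = '"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"'

# Split into records on ';', then into fields on ':', then drop the quotes.
records = [
    [field.strip('"') for field in split_outside_quotes(record, ':')]
    for record in split_outside_quotes(mystring, ';')
]
print(records)
```

The lookahead rescans the rest of the string for every candidate delimiter, so this is probably only reasonable for short inputs.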

Related

Python print split take char 5-10 for instance

I want to split a string that I have.
Let's say the string is hello how are you.
I want to print only the how are (meaning start after hello and finish after are).
My code for now just starts after the hello, but prints all the rest.
I want to avoid the you.
ReadJSONFile = JSONResponseFile.read()  # this is the txt file with the line
print(ReadJSONFile.split('hello', 1)[1])  # this gives me everything after hello
You could use string slicing:
>>> s = "hello how are you"
>>> s[6:13]
'how are'
Combine two str.split calls:
>>> s = 'hello how are you'
>>> s.split('hello', 1)[-1]
' how are you'
>>> s.split('hello', 1)[-1].split('you', 1)[0]
' how are '
>>> s.split('hello', 1)[-1].split('you', 1)[0].strip() # remove surrounding spaces
'how are'
If you have the start and end indices, you can extract a slice of the string using slice notation:
s = 'Hello how are you'
# you want from index 6 (h) to 12 (e)
print(s[6:12+1])
This should help (using index and slicing):
>>> h = 'hello how are you'
>>> start = h.index('hello') + len('hello')
>>> end = h.index('you')
>>> h[start:end].strip()
'how are'
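Note that str.index raises ValueError when a marker is missing. If that matters, the same idea can be wrapped in a small function built on str.find, which returns -1 instead. This is only a sketch; the name text_between and the choice to return None are assumptions made for illustration.

```python
def text_between(text, start_marker, end_marker):
    """Return the stripped text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()

print(text_between("hello how are you", "hello", "you"))
```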

Get string between 2 other strings - Python 2.7.8

So I have a huge string, where some strings occur a lot. I need the text in between.
"I don't need this""This is what I need""I also don't need this."
This happens many times, and I'd like all the strings I need in a list.
There's also a lot of special characters, but no ' so I can use them for strings.
I have tried with the re library, but I can't get it to work.
I tried splitting too
listy = hugestring.split('delim1')
for element in listy:
    element = element.split('delim2')
But the second splitting doesn't work.
You could use a regex like this
>>> import re
>>> your_str = "foo This is what I need bar foo This is what I need too bar"
>>> left_delim = "foo "
>>> right_delim = " bar"
>>> pattern = r"(?<={})[ \w]*?(?={})".format(re.escape(left_delim), re.escape(right_delim))
>>> re.findall(pattern,your_str)
['This is what I need', 'This is what I need too']
This will give you a list of all strings within quotes contained in a string:
import re
in_str = "I don't need this\"This is what I need\"I also don't need this."
out_str = re.findall(r'\"(.+?)\"', in_str)
print(out_str)
In the above example, out_str is a list, and print(out_str[0]) will give you what you need, as there's only one quoted string in there.
This is the result of what you describe in your comment, so what is the problem now?
>>> s = "I don't need thisThis is what I needI also don't need this."
>>> n = s.split("I don't need this")
>>> n
['', "This is what I needI also don't need this."]
>>> [i.split("I also don't need this") for i in n]
[[''], ['This is what I need', '.']]
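Since the question mentions a lot of special characters, a more general variant might escape both delimiters with re.escape and capture everything between them lazily. The function name all_between is hypothetical, and re.DOTALL is an assumption that the wanted text may span several lines.

```python
import re

def all_between(text, left, right):
    # Escape the delimiters so special characters are matched literally,
    # then capture lazily so each match stops at the nearest right delimiter.
    pattern = re.escape(left) + r'(.*?)' + re.escape(right)
    return re.findall(pattern, text, flags=re.DOTALL)

hugestring = "I don't need thisThis is what I needI also don't need this."
print(all_between(hugestring, "I don't need this", "I also don't need this."))
```

Because the pattern has exactly one capturing group, findall returns a flat list of the captured strings.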

Difference between re.search() and re.findall()

The following code is very strange:
>>> words = "4324324 blahblah"
>>> print re.findall(r'(\s)\w+', words)
[' ']
>>> print re.search(r'(\s)\w+', words).group()
blahblah
The () operator seems to behave poorly with findall. Why is this? I need it for a csv file.
Edit for clarity: I want to display blahblah using findall.
I discovered that re.findall(r'\s(\w+)', words) does what I want, but have no idea why findall treats groups in this way.
One character off:
>>> re.search(r'(\s)\w+', words).groups()
(' ',)
>>> re.search(r'(\s)\w+', words).group(1)
' '
When the pattern contains a capturing group, findall returns what the group captured rather than the whole match. You're getting a space back because that's what you capture. Stop capturing, and it works fine:
>>> print re.findall(r'\s\w+', words)
[' blahblah']
Since the data comes from a csv file, use the csv module instead.
If you prefer to keep the capturing groups in your regex, but you still want to find the entire contents of each match instead of the groups, you can use the following:
[m.group() for m in re.finditer(r'(\s)\w+', words)]
For example:
>>> [m.group() for m in re.finditer(r'(\s)\w+', '4324324 blahblah')]
[' blahblah']
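To summarize how findall treats groups, here is a small side-by-side sketch on the same input. The variable names are only for illustration.

```python
import re

words = "4324324 blahblah"

# No group: findall returns the whole match.
no_group = re.findall(r'\s\w+', words)
# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'\s(\w+)', words)
# Non-capturing group (?:...): grouping without changing what findall returns.
non_capturing = re.findall(r'(?:\s)\w+', words)
# Several capturing groups: findall returns a list of tuples.
two_groups = re.findall(r'(\d+)\s(\w+)', words)

print(no_group, one_group, non_capturing, two_groups)
```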

splitting merged words in python

I am working with a text where all "\n"s have been deleted (which merges two words into one, like "I like bananasAnd this is a new line.And another one.") What I would like to do now is tell Python to look for combinations of a small letter followed by capital letter/punctuation followed by capital letter and insert a whitespace.
I thought this would be easy with regular expressions, but it is not: I couldn't find an "insert" function or anything similar, and the string methods don't seem helpful either. How do I do this?
Any help would be greatly appreciated, I am despairing over here...
Thanks, patrick
Try the following:
re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", your_string)
For example:
import re
lines = "I like bananasAnd this is a new line.And another one."
print(re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", lines))
# I like bananas And this is a new line. And another one.
If you want to insert a newline instead of a space, change the replacement to r"\1\n\2".
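Using re.sub you should be able to make a pattern that grabs a lowercase letter (optionally followed by punctuation) and an uppercase letter, and substitutes them with the same characters but with a newline in between (use a space in the replacement instead of \n if you want a space):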
import re
re.sub(r'([a-z][.?]?)([A-Z])', '\\1\n\\2', mystring)
You're looking for the sub function. See http://docs.python.org/library/re.html for documentation.
Hmm, interesting. You can use regular expressions to replace text with the sub() function:
>>> import re
>>> string = 'fooBar'
>>> re.sub(r'([a-z][.!?]*)([A-Z])', r'\1 \2', string)
'foo Bar'
If you really don't have any caps except at the beginning of a sentence, it will probably be easiest to just loop through the string.
>>> import string
>>> s = "a word endsA new sentence"
>>> lastend = 0
>>> sentences = list()
>>> for i in range(0, len(s)):
...     if s[i] in string.ascii_uppercase:
...         sentences.append(s[lastend:i])
...         lastend = i
...
>>> sentences.append(s[lastend:])
>>> print(sentences)
['a word ends', 'A new sentence']
Here's another approach, which avoids regular expressions and does not use any imported libraries, just built-ins...
s = "I like bananasAnd this is a new line.And another one."
with_whitespace = ''
last_was_upper = True
for c in s:
    if c.isupper():
        # insert a space only when a capital follows a non-capital
        if not last_was_upper:
            with_whitespace += ' '
        last_was_upper = True
    else:
        last_was_upper = False
    with_whitespace += c
print(with_whitespace)
Yields:
I like bananas And this is a new line. And another one.
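A variation on the re.sub answers uses zero-width lookarounds, so nothing has to be captured and re-emitted: the separator is inserted at every position that sits between a lowercase letter or sentence punctuation and a capital letter. This is only a sketch; the function name unmerge_lines and its separator parameter are made up for illustration.

```python
import re

def unmerge_lines(text, separator=" "):
    # Match the empty position preceded by a lowercase letter or . ! ?
    # and followed by an uppercase letter, and insert the separator there.
    # A function is used as the replacement so the separator is taken literally.
    return re.sub(r'(?<=[a-z.!?])(?=[A-Z])', lambda match: separator, text)

lines = "I like bananasAnd this is a new line.And another one."
print(unmerge_lines(lines))
```

Like the other patterns here, it would also split legitimate mixed-case words such as camelCase identifiers.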

python regex to get all text until a (, and get text inside brackets

I need help with two regex operations.
Get all text until an open bracket.
e.g. 'this is so cool (234)' => 'this is so cool'
Get the text inside the brackets, so the number '234'
Up until the paren: regex = re.compile(r"(.*?)\s*\(")
Inside the first set of parens: regex = re.compile(r".*?\((.*?)\)")
Edit: Single regex version: regex = re.compile(r"(.*?)\s*\((.*?)\)")
Example output:
>>> import re
>>> r1 = re.compile(r"(.*?)\s*\(")
>>> r2 = re.compile(r".*?\((.*?)\)")
>>> text = "this is so cool (234)"
>>> m1 = r1.match(text)
>>> m1.group(1)
'this is so cool'
>>> m2 = r2.match(text)
>>> m2.group(1)
'234'
>>> r3 = re.compile(r"(.*?)\s*\((.*?)\)")
>>> m3 = r3.match(text)
>>> m3.group(1)
'this is so cool'
>>> m3.group(2)
'234'
>>>
Note of course that this won't work right with multiple sets of parens, as it's only expecting one parenthesized block of text (as per your example). The language of matching opening/closing parens of arbitrary recurrence is not regular.
Sounds to me like you could just do this:
re.findall('[^()]+', mystring)
Splitting would work, too:
re.split('[()]', mystring)
Either way, the text before the first parenthesis will be the first item in the resulting list, and the text inside the first set of parens will be the second item.
No need for regular expression.
>>> s="this is so cool (234)"
>>> s.split("(")[0]
'this is so cool '
>>> s="this is so cool (234) test (123)"
>>> for i in s.split(")"):
...     if "(" in i:
...         print(i.split("(")[-1])
...
234
123
Here is my own library function version without regex.
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after
s="this is so cool (234)"
print('\n'.join(between('(',')',s)))
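If the string can contain several parenthesized blocks, one option is to let findall collect (text before, text inside) pairs in a single pass. This is a sketch under the assumption that the parentheses are never nested; the function name label_and_value_pairs is hypothetical.

```python
import re

def label_and_value_pairs(text):
    # Group 1: the text leading up to an opening paren (no parens allowed in it).
    # Group 2: the text inside that pair of parens.
    pairs = re.findall(r'([^()]*?)\s*\(([^()]*)\)', text)
    # Strip the whitespace left over between one block and the next label.
    return [(label.strip(), value) for label, value in pairs]

print(label_and_value_pairs("this is so cool (234) test (123)"))
```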
