I have a huge string in which certain delimiter strings occur many times, and I need the text in between them.
"I don't need this""This is what I need""I also don't need this."
This happens many times, and I'd like all the strings I need in a list.
There are also a lot of special characters, but no ' characters, so I can use single quotes for the string literals.
I have tried the re library, but I can't get it to work.
I tried splitting too:
listy = hugestring.split('delim1')
for element in listy:
    element = element.split('delim2')
But the second splitting doesn't work.
You could use a regex like this
>>> import re
>>> your_str = "foo This is what I need bar foo This is what I need too bar"
>>> left_delim = "foo "
>>> right_delim = " bar"
>>> pattern = r"(?<={})[ \w]*?(?={})".format(left_delim, right_delim)
>>> re.findall(pattern,your_str)
['This is what I need', 'This is what I need too']
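The pattern above only matches letters, digits, underscores and spaces between the delimiters, and it assumes the delimiters contain no regex metacharacters. A minimal sketch that lifts both restrictions, assuming a hypothetical helper name `extract_between` (not from the original answer):

```python
import re

def extract_between(text, left, right):
    """Return every piece of text found between `left` and `right`."""
    # re.escape lets the delimiters contain characters such as ( or *.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    # DOTALL allows the captured text to span line breaks.
    return re.findall(pattern, text, flags=re.DOTALL)

sample = "foo This is what I need bar foo This (too)! bar"
print(extract_between(sample, "foo ", " bar"))
# ['This is what I need', 'This (too)!']
```

The non-greedy `.*?` stops at the first right delimiter, so each result is the shortest span that starts right after a left delimiter.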
This will give you a list of all strings within quotes contained in a string:
import re
in_str = "I don't need this\"This is what I need\"I also don't need this."
out_str = re.findall(r'\"(.+?)\"', in_str)
print(out_str)
In the example above, print(out_str[0]) gives you the text you need, since the string contains only one quoted section.
This is the result of what you describe in your comment, so what is the problem now?
>>> s = "I don't need thisThis is what I needI also don't need this."
>>> n = s.split("I don't need this")
>>> n
['', "This is what I needI also don't need this."]
>>> [i.split("I also don't need this") for i in n]
[[''], ['This is what I need', '.']]
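The loop in the question fails because `element = element.split('delim2')` only rebinds the loop variable and never stores the result. A rough sketch of the two-step split that collects the results in a new list, assuming a hypothetical helper name `split_between`:

```python
def split_between(text, left, right):
    """Collect the text between each `left` and the next `right` using str.split."""
    # Everything before the first `left` is discarded.
    chunks = text.split(left)[1:]
    # Keep only chunks that contain `right`, cut at its first occurrence.
    return [chunk.split(right, 1)[0] for chunk in chunks if right in chunk]

s = "I don't need thisThis is what I needI also don't need this."
print(split_between(s, "I don't need this", "I also don't need this"))
# ['This is what I need']
```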
How do I do it Pythonically?
I know how to delete the first word, but now I need to remove three.
Note that words can be delimited by any amount of whitespace, not just a single space (although I could enforce a single space if it must be so).
[Update] I mean any X words; I don't know what they are.
I am considering looping and repeatedly removing the first word, joining together again, rinsing and repeating.
s = "this is my long sentence"
print(' '.join(s.split()[3:]))
This will print
long sentence
Which I think is what you need (calling split() with no argument treats any run of whitespace as a single separator).
Try:
import re
print(re.sub(r"\w+\s*", "", "a sentence is cool", count=3))
Prints cool
If you want to drop the first X characters rather than the first X words, plain slicing is enough:
In [7]: s = 'Hello, this is long string'
In [8]: s = s[3:]
In [9]: s
Out[9]: 'lo, this is long string'
Replace the 3 on line In [8] with your X.
You can use the split function to do this. Essentially, it splits the string up into individual (space separated, by default) words. These words are stored in a list and then from that list, you can access the words you want, just like you would with a normal list of other data types. Using the desired words you can then join the list to form a string.
For example:
s = 'This is a bunch of words'
string_list = s.split()
# The string is now stored in a list that looks like:
# ['This', 'is', 'a', 'bunch', 'of', 'words']
new_string_list = string_list[3:]
# the list is now: ['bunch', 'of', 'words']
new_string = ' '.join(new_string_list)
# you now have the string 'bunch of words'
You can also do this in fewer lines, if desired (not sure if this is pythonic though):
s = 'this is a bunch of words'
new_string = ' '.join(s.split()[3:])
print(new_string)
# output would be 'bunch of words'
You can use split:
>>> x = 3 # number of words to remove from beginning
>>> s = 'word1 word2 word3 word4'
>>> s = " ".join(s.split()) # remove multiple spacing
>>> s = s.split(" ", x)[x] # split and keep elements after index x
>>> s
'word4'
This will handle multiple spaces as well.
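If you would rather keep the spacing of the remaining text untouched and avoid an IndexError when the string has fewer than X words, here is a small sketch built on `str.split(None, maxsplit)`. The helper name `drop_first_words` is made up for illustration:

```python
def drop_first_words(sentence, count):
    """Remove the first `count` whitespace-separated words from `sentence`."""
    # sep=None splits on runs of any whitespace; maxsplit keeps the rest intact.
    parts = sentence.split(None, count)
    # Fewer than count + 1 parts means there is nothing left after the removal.
    return parts[count] if len(parts) > count else ""

print(drop_first_words("this  is\tmy long   sentence", 3))
# long   sentence
print(repr(drop_first_words("one two", 3)))
# ''
```

Unlike the join-based versions, this leaves the whitespace inside the remainder exactly as it was.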
Say that I have this string:
"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"
How can I make it so string.split(";") or string.split(":") will ignore any characters in quotation marks?
Thanks,
PM
If you can't get a cleaner input than that, I'd recommend using a regular expression and creating a list of tuples with findall():
>>> import re
>>> mystring = '"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"'
>>> result = re.findall(r'"(.+?)":"(.+?)":"(.+?)"', mystring)
>>> for item in result:
...     print(*item)
...
hello noun a greeting
hello verb the;action;of;greeting
You can format the output with str.format():
>>> for item in result:
...     print('{} - {}, {}'.format(*(part.replace(';', ' ') for part in item)))
...
hello - noun, a greeting
hello - verb, the action of greeting
Your question doesn't make it 100% clear whether all strings are inside quotation marks. Anyway, this should work. It doesn't remove the quotation marks around the strings (you can do this afterwards if you want to).
In [20]: [x for x in re.split(r'(:|;|".*?")', s) if x not in [":",";",""]]
Out[20]:
['"hello"',
'"noun"',
'"a greeting"',
'"hello"',
'"verb"',
'"the;action;of;greeting"']
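If you also want the fields grouped per record with the quotes stripped, one possible sketch is to tokenise the string into quoted fields and bare separators, then group on `;`. This assumes every field is quoted and contains no escaped quotes; `parse_records` is a hypothetical name:

```python
import re

def parse_records(text):
    """Split on ';' (records) and ':' (fields), ignoring separators inside quotes."""
    records, current = [], []
    # Each token is either a whole quoted field or a single bare separator.
    for token in re.findall(r'"[^"]*"|[;:]', text):
        if token == ";":
            records.append(current)
            current = []
        elif token != ":":
            current.append(token[1:-1])  # strip the surrounding quotes
    if current:
        records.append(current)
    return records

mystring = '"hello":"noun":"a greeting";"hello":"verb":"the;action;of;greeting"'
print(parse_records(mystring))
# [['hello', 'noun', 'a greeting'], ['hello', 'verb', 'the;action;of;greeting']]
```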
For instance, we have:
word = 'Some Random Word'
print('"' + word + '"')
Is there a better way to print double quotes around a variable?
Update :
From Python 3.6, you can use f-strings
>>> print(f'"{word}"')
"Some Random Word"
Original Answer :
You can try %-formatting
>>> print('"%s"' % word)
"Some Random Word"
OR str.format
>>> print('"{}"'.format(word))
"Some Random Word"
OR escape the quote character with \
>>> print("\"%s\"" % word)
"Some Random Word"
And, if double quotes are not a requirement (i.e. single quotes would do)
>>> from pprint import pprint, pformat
>>> print(pformat(word))
'Some Random Word'
>>> pprint(word)
'Some Random Word'
OR like others have already said (include it in your declaration)
>>> word = '"Some Random Word"'
>>> print(word)
"Some Random Word"
Use whichever you feel to be better or less confusing.
And, if you need to do it for multiple words, you might as well create a function
def double_quote(word):
    return '"%s"' % word
print(double_quote(word), double_quote(word2))
And (if you know what you're doing &) if you're concerned about performance of these, see this comparison.
How about json.dumps:
>>> import json
>>> print(json.dumps("hello world"))
"hello world"
The advantage over other approaches mentioned here is that it escapes quotes inside the string as well (take that str.format!), always uses double quotes and is actually intended for reliable serialization (take that repr()!):
>>> print(json.dumps('hello "world"!'))
"hello \"world\"!"
You can try repr
Code:
word = "This is a random text"
print(repr(word))
Output:
'This is a random text'
It seems silly, but it works fine for me. It's easy to read.
word = "Some Random Word"
quotes = '"'
print(quotes + word + quotes)
word = '"Some Random Word"' # <-- did you try this?
Using the format method or an f-string with repr(), you can write it more elegantly.
a = "foo"
print("{!r}".format(a))
b = "bar"
print(f"{b!r}")
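One caveat with the repr()-based variants: for most strings they print single quotes, not double quotes. If double quotes are required, a small sketch built on json.dumps (mentioned in another answer) could look like this; the helper name `json_quote` is only for illustration:

```python
import json

def json_quote(word):
    """Wrap `word` in double quotes, escaping any quotes it already contains."""
    # ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX escapes.
    return json.dumps(word, ensure_ascii=False)

print(repr("Some Random Word"))        # 'Some Random Word'
print(json_quote("Some Random Word"))  # "Some Random Word"
print(json_quote('hello "world"!'))    # "hello \"world\"!"
```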
Use an escape sequence.
Example (Java):
int x = 10;
System.out.println("\"" + x + "\"");
Output:
"10"
I am working with a text where all "\n"s have been deleted (which merges two words into one, like "I like bananasAnd this is a new line.And another one.") What I would like to do now is tell Python to look for combinations of a small letter followed by capital letter/punctuation followed by capital letter and insert a whitespace.
I thought this would be easy with regular expressions, but it is not: I couldn't find an "insert" function or anything similar, and the string methods don't seem to help either. How do I do this?
Any help would be greatly appreciated, I am despairing over here...
Thanks, patrick
Try the following:
re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", your_string)
For example:
import re
lines = "I like bananasAnd this is a new line.And another one."
print(re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", lines))
# I like bananas And this is a new line. And another one.
If you want to insert a newline instead of a space, change the replacement to r"\1\n\2".
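Since re has no dedicated "insert" function, another way to express the same idea is a zero-width match: look behind for the lowercase letter or punctuation, look ahead for the capital, and substitute the empty match with a space. A minimal sketch, with `insert_missing_spaces` as a made-up helper name:

```python
import re

def insert_missing_spaces(text, separator=" "):
    """Insert `separator` wherever a lowercase letter or . ! ? is directly followed by a capital."""
    # Both lookarounds are zero-width, so no characters are consumed or replaced.
    return re.sub(r"(?<=[a-z.!?])(?=[A-Z])", separator, text)

lines = "I like bananasAnd this is a new line.And another one."
print(insert_missing_spaces(lines))
# I like bananas And this is a new line. And another one.
```

Passing `separator="\n"` would restore the line breaks instead of inserting spaces.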
Using re.sub you should be able to make a pattern that grabs a lowercase letter (optionally followed by punctuation) and an uppercase letter, and replaces them with the same characters, but with a space in between:
import re
re.sub(r'([a-z][.?]?)([A-Z])', r'\1 \2', mystring)
You're looking for the sub function. See http://docs.python.org/library/re.html for documentation.
Hmm, interesting. You can use regular expressions to replace text with the sub() function:
>>> import re
>>> string = 'fooBar'
>>> re.sub(r'([a-z][.!?]*)([A-Z])', r'\1 \2', string)
'foo Bar'
If you really don't have any caps except at the beginning of a sentence, it will probably be easiest to just loop through the string.
>>> import string
>>> s = "a word endsA new sentence"
>>> lastend = 0
>>> sentences = list()
>>> for i in range(0, len(s)):
...     if s[i] in string.ascii_uppercase:
...         sentences.append(s[lastend:i])
...         lastend = i
...
>>> sentences.append(s[lastend:])
>>> print(sentences)
['a word ends', 'A new sentence']
Here's another approach, which avoids regular expressions and does not use any imported libraries, just built-ins...
s = "I like bananasAnd this is a new line.And another one."
with_whitespace = ''
last_was_upper = True
for c in s:
    if c.isupper():
        if not last_was_upper:
            with_whitespace += ' '
        last_was_upper = True
    else:
        last_was_upper = False
    with_whitespace += c
print(with_whitespace)
Yields:
I like bananas And this is a new line. And another one.
I need help with two regex operations.
Get all text until an open bracket.
e.g. 'this is so cool (234)' => 'this is so cool'
Get the text inside the brackets, so the number '234'
Up until the paren: regex = re.compile(r"(.*?)\s*\(")
Inside the first set of parens: regex = re.compile(r".*?\((.*?)\)")
Edit: Single regex version: regex = re.compile(r"(.*?)\s*\((.*?)\)")
Example output:
>>> import re
>>> r1 = re.compile(r"(.*?)\s*\(")
>>> r2 = re.compile(r".*?\((.*?)\)")
>>> text = "this is so cool (234)"
>>> m1 = r1.match(text)
>>> m1.group(1)
'this is so cool'
>>> m2 = r2.match(text)
>>> m2.group(1)
'234'
>>> r3 = re.compile(r"(.*?)\s*\((.*?)\)")
>>> m3 = r3.match(text)
>>> m3.group(1)
'this is so cool'
>>> m3.group(2)
'234'
Note of course that this won't work correctly with multiple or nested sets of parens, as it only expects one parenthesized block of text (as per your example). The language of balanced parens nested to arbitrary depth is not regular, so a plain regular expression cannot match it in general.
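For input that may contain several or nested parenthesized blocks, a plain depth counter is one way around that limitation. This is only a sketch under the assumption that you want the contents of each top-level block; `top_level_groups` is a hypothetical name:

```python
def top_level_groups(text):
    """Return the contents of every top-level (...) block, keeping nested parens intact."""
    groups, depth, start = [], 0, None
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index + 1  # contents begin right after the opening paren
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:index])
    return groups

print(top_level_groups("this is so cool (234)"))
# ['234']
print(top_level_groups("a (b (c) d) e (f)"))
# ['b (c) d', 'f']
```

Unmatched closing parens are ignored, and an opening paren that is never closed produces no group.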
Sounds to me like you could just do this:
re.findall('[^()]+', mystring)
Splitting would work, too:
re.split('[()]', mystring)
Either way, the text before the first parenthesis (including any trailing space) will be the first item in the resulting list, and the text inside the first set of parens will be the second item.
No need for regular expression.
>>> s="this is so cool (234)"
>>> s.split("(")[0]
'this is so cool '
>>> s="this is so cool (234) test (123)"
>>> for i in s.split(")"):
...     if "(" in i:
...         print(i.split("(")[-1])
...
234
123
Here is my own library function version without regex.
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after
s="this is so cool (234)"
print('\n'.join(between('(',')',s)))