Python regex - characters between certain characters

Edit: I should add that the string in the test is supposed to contain every possible char there is (i.e. * + $ § € / etc.), so I thought a regexp would help best.
I am using regex to find all characters between certain characters ([" and "]). My example goes like this:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
The expected output should look like this:
['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
My code including the regex looks like this:
import re
my_list = re.findall(r'(?<=\[").*(?="\])*[^ ,\n]', test)
print (my_list)
And my outcome is the following:
['this is a text and its supposed to contain every possible char."]', 'another one after a newline."]', 'and another one even with']
So there are two problems:
1) It's not removing "] at the end of a text, as I want it to do with (?="\]).
2) It's not capturing the third text in brackets, I guess because of the newlines. So far I wasn't able to capture those; when I try .*\n it gives me back an empty string.
I am thankful for any help or hints with this issue. Thank you in advance.
Btw, I am using Python 3.6 on Anaconda/Spyder and the newest regex module (2018).
EDIT 2: One alteration to the test:
test = """[
"this is a text and its supposed to contain every possible char."
],
[
"another one after a newline."
],
[
"and another one even with
newlines
in it."
]"""
Once again I have trouble removing the newlines from it; I guessed the whitespace could be removed with \s, so I thought a regexp like this could solve it:
my_list = re.findall(r'(?<=\[\S\s\")[\w\W]*(?=\"\S\s\])', test)
print (my_list)
But that returns only an empty list. How do I get the expected output above from that input?

In case you might also accept a non-regex solution, you can try:
result = []
for l in eval(' '.join(test.split())):
    result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
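If you'd rather not use eval on data you don't fully control, ast.literal_eval is a safer drop-in here, since it only accepts Python literals. A minimal sketch, assuming the test string keeps the bracketed shape shown above:
import ast
result = []
# literal_eval parses the comma-separated lists as a tuple of lists without executing code
for l in ast.literal_eval(' '.join(test.split())):
    result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']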

You can try this mate.
(?<=\[\")[\w\s.]+(?=\"\])
What you missed in your regex: .* will not match a newline.
P.S. I am not matching special characters; if you want, that can be achieved very easily.
This one matches special characters too:
(?<=\[\")[\w\W]+?(?=\"\])
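For reference, applying that second pattern to the test string from the question would look roughly like this; the third match keeps its inner newlines, so the re.sub step (my addition, not part of the pattern above) collapses them into single spaces:
import re
my_list = re.findall(r'(?<=\[\")[\w\W]+?(?=\"\])', test)
# normalize the literal newlines inside the third element
my_list = [re.sub(r'\s+', ' ', s) for s in my_list]
print(my_list)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']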

So here's what I came up with:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
for i in test.replace('\n', ' ').replace('  ', ' ').split(','):
    print(i.lstrip(r' ["').rstrip(r'"]'))
Which results in the following being printed to the screen
this is a text and its supposed to contain every possible char.
another one after a newline.
and another one even with newlines in it.
If you want a list of those exact strings, we could modify it to:
newList = []
for i in test.replace('\n', ' ').replace('  ', ' ').split(','):
    newList.append(i.lstrip(r' ["').rstrip(r'"]'))
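Printing newList should then give roughly the list asked for in the question (assuming the same test string as above):
print(newList)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']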

Related

Python - Regex - match characters between certain characters

I have a text file and I want to match/findall/parse all characters that are between certain characters ([\n" text to match "\n]). The texts themselves can differ a lot from each other with respect to their structure and the characters they contain (they can contain every possible char there is).
I posted this question before (sorry for the duplicate), but so far the problem couldn't be solved, so now I am trying to be even more precise about the problem.
The text in the file is built up like this:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
My desired output should be a list (for example) with each text in between the separators as an element, like the following:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
I tried to solve it with regex; here are two attempts I came up with, along with their output:
my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']
Well, this one was close. It's listing the first two elements as it's supposed to, but unfortunately not the third one, as it has newlines within it.
my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char."\n ], \n [\n "like *.;#]§< and many "" more."\n ], \n [\n "plus there are even\nnewlines\n \n in it.']
Okay, this time every element is included, but the list has only one element in it and the lookahead doesn't seem to be working as I thought it would.
So what's the right regex to use to get my desired output?
Why doesn't the lookahead seem to apply in the second approach?
Or is there an even cleaner, faster way to get what I want (BeautifulSoup or other methods)?
I am very thankful for any help and hints.
I am using Python 3.6.
You should use the DOTALL flag so that . also matches newlines:
print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))
Output
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
You can use the pattern
(?s)\[[^"]*"(.*?)"[^]"]*\]
to capture everything between the double quotes inside the brackets:
https://regex101.com/r/SguEAU/1
Then, you can use a list comprehension with re.sub to replace whitespace characters (including newlines) in every captured substring with a single normal space:
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]
Result:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']

Python replace - Treat multiple instances as one

I am trying to replace all the carriage returns with a comma for incoming lines. It works fine, except when multiple carriage returns exist. I see no information in Python's string.replace() function on how to treat multiple instances of the same item as though they were one. Is this possible?
For instance, this line:
This is\nA sentence\nwith multiple\nbreaklines\n\npython.
Should end up like this:
This is, A sentence, with multiple, breaklines, python.
But it actually turns into this:
This is, A sentence, with multiple, breaklines, , python.
You can use regex.
In [48]: mystr = "This is\nA sentence\nwith multiple\nbreaklines\n\npython."
In [49]: re.sub(r'\n+', ', ', mystr)
Out[49]: 'This is, A sentence, with multiple, breaklines, python.'
The regex pattern matches runs of one or more consecutive \n characters and replaces each run with a single ', '.
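If the incoming lines may also use \r\n line endings, a slightly broader character class covers both; this is my variation, not part of the original answer:
import re
mystr = "This is\r\nA sentence\r\nwith multiple\nbreaklines\n\npython."
# [\r\n]+ treats any run of carriage returns and newlines as one separator
print(re.sub(r'[\r\n]+', ', ', mystr))
# This is, A sentence, with multiple, breaklines, python.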

Removing phrase/sentence that \r\n is located randomly (R / Python)

How can I remove a phrase or sentence when the \r\n can be located in all different places within it?
For example, I want to remove a sentence like this:
If you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
But there are many different variations of this sentence like:
If
you are having trouble viewing this message or would like to share
it on a social network, you can view the message online.
or
If you are having trouble
viewing this message or would like to share
it on a social network, you can view the message online.
I tried to specify every variation in regular expressions, but that is only feasible when the sentence or phrase is short.
For example, if I want to remove Please contact me immediately,
I can specify Please\r\ncontact me immediately, Please contact\r\nme immediately, Please contact me\r\n immediately, or Please contact me\r\nimmediately to remove this sentence. But if I want to remove a longer sentence like in my first example, I cannot write every possible variation.
In summary, how can I remove phrases and sentences that have the same words but have \r\n in all different places?
Give this a try.
>>> import re
>>> remove_text = lambda x, y: re.sub('\s?\r?\n?'.join(x.split()), "", y)
>>> remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World")
'Hello  World'
You can also remove extra spaces later.
>>> re.sub('\s+', ' ', remove_text("Please contact me immediately", "Hello Please contact\r\nme immediately World"))
'Hello World'
This method has its limitations: since every whitespace element in the pattern is optional, a run-together text like Pleasecontact meimmediately will be treated the same way.
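If that edge case matters, a slightly stricter sketch (my variation) escapes each word with re.escape and requires at least one whitespace character between words, so Pleasecontact would no longer match:
import re

def remove_phrase(phrase, text):
    # \s+ covers spaces, \r and \n; re.escape protects any regex metacharacters in the phrase
    pattern = r'\s+'.join(re.escape(word) for word in phrase.split())
    return re.sub(pattern, '', text)

print(remove_phrase("Please contact me immediately", "Hello Please contact\r\nme immediately World"))
# Hello  World   (the leftover double space can be collapsed with re.sub(r'\s+', ' ', ...) as above)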
This regex pattern will find all paragraphs (as opposed to sentences):
((?:[^\n\r]+[\n\r])+(?:[^\n\r]+[\n\r])(?=[\n\r]))
Explanation:
Find ( [ 1 or more non-newline characters ] followed by a [ newline character ] ) on 1 or more lines
(?:[^\n\r]+[\n\r])+
Find an additional line which matches the above pattern
(?:[^\n\r]+[\n\r])
Look ahead for an additional [ newline character ],
i.e. the blank line in between two groups of text
(?=[\n\r])
The 2nd & 3rd groups combined equate to the final line of the paragraph.
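A quick usage sketch with made-up example text; note that a paragraph needs at least two lines plus a following blank line for this pattern to match it:
import re
pattern = r'((?:[^\n\r]+[\n\r])+(?:[^\n\r]+[\n\r])(?=[\n\r]))'
text = ("If you are having trouble\n"
        "viewing this message online.\n"
        "\n"
        "Please contact me\n"
        "immediately.\n"
        "\n")
print(re.findall(pattern, text))
# ['If you are having trouble\nviewing this message online.\n', 'Please contact me\nimmediately.\n']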

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two markers (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
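Putting those two fixes together, the loop might look like this (a sketch of the correction just described, keeping the Python 2 print syntax used in the question):
for text in shakes:
    match = re.search(r'JOL":"(.+?)\.tr', text)  # raw string, escaped dot
    if match:
        print >> needed, match.group(1)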
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some of the punctuation, they don't seem to work on multiple punctuation marks, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
I got it from Best way to strip punctuation from a string in Python, which was good, but I still encounter problems with double punctuation.
I added the while loop in the hope of iterating over each word to remove multiple punctuation marks, but that does not seem to work; it just gets stuck on the first item "[ and does not exit.
Am I just missing some obvious piece that I am being oblivious to?
I worked around the problem by adding redundancy and double-looping my lists; this takes an extremely long time (well into the minutes) due to fairly large sets.
I use Python 2.7
Your code doesn't work because regex.match only matches at the beginning of the string.
Also, you did not do anything with the return value of regex.sub(). sub doesn't work in place; you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print i
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
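With that simplification (still on Python 2, as in the question), the loop body reduces to a single sub call per word, since sub already replaces every match in the string:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    print regex.sub('', i)
# abc
# cddgffdfe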
This will replace everything that is not alphanumeric or a space:
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!##%!##$L&&&ove Sal*mon", None,'~`!##$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using this str.translate function will eliminate the punctuation. I usually use this for eliminating numbers from DNA sequence reads.
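For what it's worth, the Python 3 equivalent of that idea uses str.maketrans with a third argument listing the characters to delete; a small sketch (the question itself is about Python 2.7):
import string
s = "My&& Dog's {{{%!##%!##$L&&&ove Sal*mon"
# maketrans('', '', chars) builds a table that deletes every character in chars
print(s.translate(str.maketrans('', '', string.punctuation)))
# My Dogs Love Salmon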
