Python re to replace the subtitles with new formats - python

I have a file with the following format:
===Subtitle 1===
text ....
===Subtitle 2===
text ....
How could I replace ===Subtitle 1=== with Section Subtitle 1 using python?
I tried this:
import re
s = '===Subtitle 1==='
lst = re.findall('===[\S+a-zA-Z0-9]===', s)
print lst
But I cannot print out anything.

You don't need to use regex here; just use str.replace() like this:
>>> a = '===Subtitle 1==='
>>> a.replace('=', '')
'Subtitle 1'
>>>
But if you'd like to use regex:
>>> import re
>>> re.findall('=+(.+?)=+', a)
['Subtitle 1']
>>> re.findall('=+(.+?)=+', a)[0]
'Subtitle 1'
>>>
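Neither snippet above produces the "Section Subtitle 1" form the question asks for. A minimal sketch using re.sub() follows, assuming every subtitle sits alone on its own line; the sample text and the names pattern and converted are illustrative only.

```python
import re

# Sample input in the format described in the question.
text = "===Subtitle 1===\ntext ....\n===Subtitle 2===\ntext ...."

# With re.MULTILINE, ^ and $ match at each line boundary, so only lines
# that consist entirely of ===title=== are rewritten.
pattern = re.compile(r"^===\s*(.+?)\s*===$", flags=re.MULTILINE)

# \1 refers to the captured title without the surrounding equals signs.
converted = pattern.sub(r"Section \1", text)
print(converted)
```

The body lines are left untouched because they never match the whole-line pattern.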

Related

how do i extract value inside quotes using regex python?

My text is
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
I am trying to extract value of posted_data which is 2e54eba66f8f2881c8e78be8342428xd
My code :
extract_posted_data = re.search(r'(\"posted_data\": \")(\w*)', my_text)
print (extract_posted_data)
and it prints None
Thank you
This particular example doesn't seem like it needs regular expressions at all.
>>> my_text
'"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> import json
>>> result = json.loads('{%s}' % my_text)
>>> result
{'posted_data': '2e54eba66f8f2881c8e78be8342428xd', 'isropa': False, 'rx': 'NO', 'readal': 'false'}
>>> result['posted_data']
'2e54eba66f8f2881c8e78be8342428xd'
With BeautifulSoup:
>>> import json
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script type="text/javascript"> "posted_data":"2738273283723hjasda" </script>', 'html.parser')
>>> result = json.loads('{%s}' % soup.script.text)
>>> result
{'posted_data': '2738273283723hjasda'}
>>> result['posted_data']
'2738273283723hjasda'
Your search returns None because the pattern has an extra space after the colon. It should be:
extract_posted_data = re.search(r'(\"posted_data\":\")(\w*)', my_text)
In fact, the backslashes before the quotes are unnecessary here. Just:
extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
Then:
extract_posted_data.group(2)
is what you want.
>>> my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
>>> extract_posted_data.group(2)
'2e54eba66f8f2881c8e78be8342428xd'
You can also use lookarounds, so that the match contains only the value:
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
extract_posted_data = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
print(extract_posted_data[0])
This prints 2e54eba66f8f2881c8e78be8342428xd.
re.search() returns a Match object, so indexing it with 0 (Python 3.6+, equivalent to .group(0)) gives the full text of the match.
As others have mentioned, json would be a better tool for this data, but you can also use this regex (I added \s* in case spaces appear after the colon in the future):
regex: "posted_data":\s*"(?P<posted_data>[^"]+)"
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
m = re.search(r'"posted_data":\s*"(?P<posted_data>[^"]+)"', my_text)
if m:
    print(m.group('posted_data'))

how do I separate a string that contains number

So, I have this string:
a='test32'
I want to separate this string so I get the text and the number in two separate variables, in Python.
>>> import re
>>> r = re.compile("([a-zA-Z]+)([0-9]+)")
>>> m = r.match('test32')
>>> m.group(1)
'test'
>>> m.group(2)
'32'
>>>
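The same pattern can be wrapped in a small helper. This is only a sketch: the name split_text_number is hypothetical, and it assumes the input is letters followed by digits with nothing else, which is why it uses fullmatch() and converts the digits to int.

```python
import re

# Letters followed by digits, as in the answer above.
_TEXT_NUMBER = re.compile(r"([a-zA-Z]+)([0-9]+)")


def split_text_number(value):
    """Return (text, number) for inputs such as 'test32'."""
    # fullmatch() rejects trailing characters that match() would ignore.
    m = _TEXT_NUMBER.fullmatch(value)
    if m is None:
        raise ValueError("expected letters followed by digits: %r" % value)
    return m.group(1), int(m.group(2))


text, number = split_text_number("test32")
print(text, number)
```

Raising ValueError on a non-matching input avoids the AttributeError that calling .group() on None would cause.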

Extracting a substring of a string in Python based on presence of another string

common is always present regardless of string. Using that information, I'd like to grab the substring that comes just before it, in this case, "banana":
string = "apple_orange_banana_common_fruit"
In this case, "fruit":
string = "fruit_common_apple_banana_orange"
How would I go about doing this in Python?
You can use re.search() to extract the substring:
>>> import re
>>> s = 'apple_orange_banana_common_fruit'
>>> re.search(r'([a-zA-Z]+)_common', s).group(1)
'banana'
This will return a list of matches:
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.findall("[A-Za-z]+(?=_common)", string)
If common only occurs once per string, you might be better off using the re.search() solution above.
import re
string = "apple_orange_banana_common_fruit"
preceding_word = re.search('([a-zA-Z]+)(?=_common)', string)
print(preceding_word.group(1))
>>> string = "fruit_common_apple_banana_orange"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
fruit
>>> string = "apple_orange_banana_common_fruit"
>>> parts = string.split('_')
>>> print(parts[parts.index('common') - 1])
banana
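One edge case in the split-based approach: if common is the first word, parts.index('common') - 1 evaluates to -1 and silently returns the last word. A hedged sketch that guards against this is shown here; the helper name word_before and its marker and sep parameters are assumptions made for illustration.

```python
def word_before(string, marker="common", sep="_"):
    """Return the word immediately before marker, or None if there is none."""
    parts = string.split(sep)
    if marker not in parts:
        return None
    position = parts.index(marker)
    # Index 0 means nothing precedes the marker; avoid wrapping around to parts[-1].
    if position == 0:
        return None
    return parts[position - 1]


print(word_before("apple_orange_banana_common_fruit"))
print(word_before("fruit_common_apple_banana_orange"))
print(word_before("common_apple_banana"))
```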

get substring from the main string Python

I have a string and I want to extract a substring from that main string
Some sample strings are:
http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y
http://domain.com/xxxxx/xxxxxx?tags=%7C12784%7C102496&index=28&showFromBeginning=true&
I want to get the tags value.
In this case:
val = %7C105651%7C102496
val = %7C12784%7C102496
Is there any chance to get that?
Edit
tags = re.search('tags=(.+?)&Asidebar', url)
print tags
if tags:
    found = tags.group(1)
    print (found)
output: None
Note: I've just tried to get something from the first string only
Using urlparse.urlparse and cgi.parse_qs (Python 2.x):
>>> import urlparse
>>> import cgi
>>>
>>> s = 'http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'
>>> cgi.parse_qs(urlparse.urlparse(s).query)
{'dnr': ['y'], 'Asidebar': ['1'], 'tags': ['|105651|102496']}
>>> cgi.parse_qs(urlparse.urlparse(s).query)['tags'][0]
'|105651|102496'
In Python 3.x, use urllib.parse.urlparse and urllib.parse.parse_qs:
>>> import urllib.parse
>>>
>>> s = 'http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y'
>>> urllib.parse.parse_qs(urllib.parse.urlparse(s).query)['tags'][0]
'|105651|102496'
You're almost there. You don't need Asidebar in your regex, because your second input string doesn't contain it.
tags = re.search('tags=(.+?)&', url)
if tags:
    found = tags.group(1)
    print (found)
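The two approaches return different things: parse_qs() decodes %7C to |, while the regex keeps the raw percent-encoded value the question asks for. A Python 3 sketch covering both sample URLs follows; the helper names raw_tags and decoded_tags are hypothetical, and the regex uses [^&#]* so it also works when tags is the last parameter.

```python
import re
from urllib.parse import parse_qs, urlparse

urls = [
    "http://domain.com/xxxxx/xxxxxxxx?tags=%7C105651%7C102496&Asidebar=1&dnr=y",
    "http://domain.com/xxxxx/xxxxxx?tags=%7C12784%7C102496&index=28&showFromBeginning=true&",
]


def raw_tags(url):
    """Return the tags value exactly as written in the URL (still percent-encoded)."""
    m = re.search(r"[?&]tags=([^&#]*)", url)
    return m.group(1) if m else None


def decoded_tags(url):
    """Return the tags value with percent-encoding decoded, via the standard library."""
    values = parse_qs(urlparse(url).query).get("tags")
    return values[0] if values else None


for url in urls:
    print(raw_tags(url), decoded_tags(url))
```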

python regex to get all text until a (, and get text inside brackets

I need help with two regex operations.
Get all text until an open bracket.
e.g. 'this is so cool (234)' => 'this is so cool'
Get the text inside the brackets, so the number '234'
Up until the paren: regex = re.compile(r"(.*?)\s*\(")
Inside the first set of parens: regex = re.compile(r".*?\((.*?)\)")
Edit: Single regex version: regex = re.compile(r"(.*?)\s*\((.*?)\)")
Example output:
>>> import re
>>> r1 = re.compile(r"(.*?)\s*\(")
>>> r2 = re.compile(r".*?\((.*?)\)")
>>> text = "this is so cool (234)"
>>> m1 = r1.match(text)
>>> m1.group(1)
'this is so cool'
>>> m2 = r2.match(text)
>>> m2.group(1)
'234'
>>> r3 = re.compile(r"(.*?)\s*\((.*?)\)")
>>> m3 = r3.match(text)
>>> m3.group(1)
'this is so cool'
>>> m3.group(2)
'234'
>>>
Note of course that this won't work right with multiple sets of parens, as it's only expecting one parenthesized block of text (as per your example). The language of matching opening/closing parens of arbitrary recurrence is not regular.
Sounds to me like you could just do this:
re.findall('[^()]+', mystring)
Splitting would work, too:
re.split('[()]', mystring)
Either way, the text before the first parenthesis will be the first item in the resulting array, and the text inside the first set of parens will be the second item.
No need for regular expressions.
>>> s="this is so cool (234)"
>>> s.split("(")[0]
'this is so cool '
>>> s="this is so cool (234) test (123)"
>>> for i in s.split(")"):
...     if "(" in i:
...         print(i.split("(")[-1])
...
234
123
Here is my own library function version without regex.
def between(left,right,s):
    before,_,a = s.partition(left)
    a,_,after = a.partition(right)
    return before,a,after
s="this is so cool (234)"
print('\n'.join(between('(',')',s)))
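For strings with several non-nested parenthesized groups, re.findall() with a character class that excludes parentheses collects every group in one pass. This is a sketch under that non-nesting assumption; the variable names before and inside are illustrative.

```python
import re

text = "this is so cool (234) test (123)"

# Everything before the first opening paren, with trailing whitespace removed.
before = text.split("(", 1)[0].rstrip()

# [^()]* cannot cross a paren, so each match is one innermost (...) group.
inside = re.findall(r"\(([^()]*)\)", text)

print(before)
print(inside)
```

Nested parentheses are still out of reach for this pattern, for the reason given earlier in the thread.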
