How can I split a string between groups of words in Python?

How can I split "Value1" and "Value2" out of this string?
my_str = 'Value1Value2'
I tried this, but it doesn't work.
my_str = 'Value1Value2'
for i in my_str:
    i = str(i).split('^<a.*>$|</a>')
    print(i)

You can use bs4.BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(my_str, 'html.parser')  # name the parser explicitly to avoid a warning
out = [st.string for st in soup.find_all('a')]
Output:
['Value1', 'Value2']
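If installing bs4 is not an option, the standard library's html.parser module can do a similar job. The following is only a minimal sketch: the AnchorText class is a hypothetical helper, and the sample string assumes the values were wrapped in two <a> elements, which is a guess based on the patterns used in the question.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text found inside each <a>...</a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.values.append("")  # start a new entry for this anchor

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self.values[-1] += data


# Assumed input: the real string from the question is not known.
my_str = '<a href="#">Value1</a><a href="#">Value2</a>'
parser = AnchorText()
parser.feed(my_str)
parser.close()
print(parser.values)
```

Like the bs4 version, this relies on a parser rather than on pattern matching, so attributes inside the tags do not affect the result.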

Another way is to split on one delimiter and then strip the unwanted characters from each piece.
Here's the code I used:
my_str = 'Value1Value2'
strList = my_str.split('/a>', maxsplit=2)
for i in strList:
    try:
        print(i.split('>')[1].replace('<', ''))
    except IndexError:
        pass
This will get you Value1 and Value2.

If you want to use a regex on HTML, which you generally shouldn't (see the bs4 answer above for a much better approach), you can do this:
import re
my_str = 'Value1Value2'
split_str = re.findall(r'(?<=>)\w*?(?=<\/a>)', my_str)

This works if you want the entire HTML element for each:
import re
re.sub("(a>)(<a)", "\\1[SEP]\\2", my_str).split("[SEP]")
If you just want the values, do this:
re.findall(r">(.[^<]+)</a>", my_str)

Related

How to get specific parts from a text? Python

I have a string like this.
'hsa:578\tup:Q16611\nhsa:578\tup:A0A0S2Z391\nhsa:9373\tup:Q9Y263\nhsa:9344\tup:Q9UL54\nhsa:5894\tup:P04049\nhsa:5894\tup:L7RRS6\nhsa:673\tup:P15056\n'
I want to get only the values that begin with "up:".
Like this:
up:A0A0S2Z391
up:Q9Y263
up:Q9UL54.
How can I do that with Python?
Use the re module for regular expressions.
import re
text = 'hsa:578\tup:Q16611\nhsa:578\tup:A0A0S2Z391\nhsa:9373\tup:Q9Y263\nhsa:9344\tup:Q9UL54\nhsa:5894\tup:P04049\nhsa:5894\tup:L7RRS6\nhsa:673\tup:P15056\n'
pattern = r'up:.*'
values = re.findall(pattern, text)
print(values)
Output:
['up:Q16611', 'up:A0A0S2Z391', 'up:Q9Y263', 'up:Q9UL54', 'up:P04049', 'up:L7RRS6', 'up:P15056']
You could use the split() method for that.
Here is a link to the documentation:
https://docs.python.org/3/library/stdtypes.html?#str.split
Something like this could work for the string you posted:
s = 'hsa:578\tup:Q16611\nhsa:578\tup:A0A0S2Z391\nhsa:9373\tup:Q9Y263\nhsa:9344\tup:Q9UL54\nhsa:5894\tup:P04049\nhsa:5894\tup:L7RRS6\nhsa:673\tup:P15056\n'
res = []
for i in s.split('up')[1:]:
    res.append('up' + i.split()[0])
print(res)
Output:
['up:Q16611', 'up:A0A0S2Z391', 'up:Q9Y263', 'up:Q9UL54', 'up:P04049', 'up:L7RRS6', 'up:P15056']
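Splitting on 'up' would also cut inside any other field that happens to contain those two letters. A slightly more defensive sketch, assuming each line holds tab-separated fields as in the posted string, splits on the real separators and keeps only the fields that start with "up:" (the variable names are illustrative):

```python
s = ('hsa:578\tup:Q16611\nhsa:578\tup:A0A0S2Z391\nhsa:9373\tup:Q9Y263\n'
     'hsa:9344\tup:Q9UL54\nhsa:5894\tup:P04049\nhsa:5894\tup:L7RRS6\n'
     'hsa:673\tup:P15056\n')

# Walk over each line, split it on tabs, and keep the "up:" fields only.
values = [
    field
    for line in s.splitlines()
    for field in line.split('\t')
    if field.startswith('up:')
]
print(values)
```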

How to use Python re.findall and a regex so that 2 conditions can be run simultaneously?

Here is my data string:
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
I use this code, but when I run it, it shows nothing for DATANORMAL:
mydata = re.findall(r'MYDATA=(.*)' r'_.*', mystring)
print(mydata)
It just shows: NOTNORMAL
I want both to work and to display the data like this:
DATANORMAL
NOTNORMAL
How do I do it? Thanks.
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""
mydata = re.findall(r'^\s*MYDATA=(?:.+_)?(.+?)\s*$', mystring, re.M)
print(mydata)
If you need the word before the underscore rather than the one after it, use the regex r'^\s*MYDATA=(.+?)(?:_.+)?\s*$' in the code above instead.
Based on what you describe, you might want to use an alternation here:
\bMYDATA=((?:DATA|(?:DATA_))\S+)\b
Script:
import re
inp = "some text MYDATA=DATANORMAL more text MYDATA=DATA_NOTNORMAL"
mydata = re.findall(r'\bMYDATA=((?:DATA|(?:DATA_))\S+)\b', inp)
print(mydata)
This prints:
['DATANORMAL', 'DATA_NOTNORMAL']
I guess you need to add flags=re.M?
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL"""
pattern = re.compile(r"MYDATA=(?:DATA_)?(\w+)", flags=re.M)
print(pattern.findall(mystring))
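The same result can be reached without a regex. The sketch below assumes every relevant line has the form MYDATA=<value> and that the wanted part is whatever follows the last underscore, or the whole value when there is no underscore. The helper name is made up for illustration.

```python
def extract_values(text):
    """Return the part after the last '_' of each MYDATA value (hypothetical helper)."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("MYDATA="):
            continue  # skip blank or unrelated lines
        value = line.split("=", 1)[1]
        # rpartition returns ('', '', value) when '_' is absent,
        # so the last element is the full value in that case.
        results.append(value.rpartition("_")[2])
    return results


mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""
print(extract_values(mystring))
```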

Replacing 3 line string

When calling an external API, I get this kind of response. It is 2 lines of 44 characters each, 88 in total, which is perfect.
r.text = "P<RUSBASZNAGDCIEWS<<AZIZAS<<<<<<<<<<<<<<<<<<"
"00000000<ORUS5911239F160828525911531023<<<10"
But sometimes I get this kind of response, and I need to make it the same as in example 1: 2 lines of 44 characters.
Every wide く should be replaced with a normal <, and the spaces should also be removed.
r.text = "P<RUSALUZAFEE<<ZUZILLAS<<<<
くくくくくくくくくく、
00000000<ORUS7803118 F210127747803111025<<<64"
expected OUTPUT:
string = "P<RUSALUZAFEE<<ZUZILLAS<<<<<<<<<<<<<<<<<<<<<
00000000<ORUS7803118F210127747803111025<<<64"
Here is my best attempt; I hope you find it helpful.
import re
txt =""" P<RUSALUZAFEE<<ZUZILLAS<<<<
くくくくくくくくくく、
00000000<ORUS7803118 F210127747803111025<<<64"""
txt_1 = re.sub('(く |く)', '<', txt).replace('、','')
txt_2 = re.sub(r'\s+', '', txt_1)
regex = r"(\w<?\w+<+\w+<+)(\w*<?\w+<+\w+)"
result = re.match(regex, txt_2)
print(f'{result.group(1)}\n{result.group(2)}')
Output
P<RUSALUZAFEE<<ZUZILLAS<<<<<<<<<<<<<<
00000000<ORUS7803118F210127747803111025<<<64
import re
pattern = r'\n.*く.*\n'
s = re.compile(pattern)
string = s.sub('\n', r.text)
You can do it with re.sub from the re module, like this:
new_txt = re.sub("く", "<", old_txt)
or with str.replace, like this:
new_str = OldStr.replace("く", "<")
or check for the character first and replace only when it is present:
if "く" in old_txt:
    new_txt = old_txt.replace("く", "<")
else:
    new_txt = old_txt
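None of the answers above pads the first line back to 44 characters, which the expected output in the question seems to require. A rough sketch of the full clean-up follows. It assumes the last non-empty line is always the second 44-character line, that everything before it belongs to the first line, and that missing characters should be filled with "<". The function name and the width parameter are illustrative.

```python
def normalize_lines(raw, width=44):
    """Return the two cleaned lines as a tuple (hypothetical helper)."""
    # Map the wide character to '<' and drop the stray ideographic comma.
    cleaned = raw.replace("く", "<").replace("、", "")
    # Remove whitespace inside each line and skip empty lines.
    lines = ["".join(line.split()) for line in cleaned.splitlines()]
    lines = [line for line in lines if line]
    # Assumption: the last line is line 2; the rest is a broken-up line 1.
    first = "".join(lines[:-1]).ljust(width, "<")
    second = lines[-1]
    return first, second


raw = ("P<RUSALUZAFEE<<ZUZILLAS<<<<\n"
       + "く" * 10 + "、\n"
       + "00000000<ORUS7803118 F210127747803111025<<<64")
first, second = normalize_lines(raw)
print(first)
print(second)
```

If the API can also return a response whose second line is split or shortened, the assumption about the last line no longer holds and the helper would need a different rule.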

I want to change the URL using Python

I'm new to Python and can't figure out a way to do this, so I'm asking for help.
I have a URL like this: https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4. I want to remove the last part of the URL (go_cc_Jpterxvid_avi_mp4) and also replace /f/ with /d/, so that the URL becomes https://abc.xyz/d/b
The /b part changes regularly. I tried something like this, but it didn't work:
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0])
Late answer, but you can use re.sub to replace "/f/.+" with "/d/b", i.e.:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub("/f/.+", r"/d/b", old_url)
# https://abc.xyz/d/b
You can apply re.sub twice:
import re
s = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
new_s = re.sub(r'(?<=\.\w{3}/)\w', 'd', re.sub(r'(?<=/)\w+$', '', s))
Output:
'https://abc.xyz/d/b/'
import re
domain_str = 'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
# Find all appearances of the first part of the URL
matches = re.findall(r'(https?://\w*\.\w*/?)', domain_str)
# Append the new path segment to each of the results
d_extension = 'd'
altered_domains = []
for res in matches:
    altered_domains.append(res + d_extension)
print(altered_domains)
Example input:
'https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4'
and output:
['https://abc.xyz/d']
What you had almost worked. The change is to remove the trailing right paren ) at the end of your assignment to newurl. The following works in both Python 2 and 3:
oldurl = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
newurl = oldurl.replace('/f/','/d/').rsplit("/", 1)[0]
print(newurl)
A more compact alternative uses the re module from the standard library:
import re
old_url = "https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"
new_url = re.sub("/f/.+", r"/d/b", old_url)
print(new_url)
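Note that the re.sub version hardcodes the "b" segment. If that segment can vary, as the question suggests, one option is to work on the path segments with urllib.parse instead. This is only a sketch under that assumption; rewrite_url is a made-up name.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_url(old_url):
    """Drop the last path segment and turn a leading /f/ into /d/ (hypothetical helper)."""
    parts = urlsplit(old_url)
    segments = parts.path.split("/")  # e.g. ['', 'f', 'b', 'go_cc_...']
    segments = segments[:-1]          # remove the last segment
    if len(segments) > 1 and segments[1] == "f":
        segments[1] = "d"             # swap the first segment only
    new_path = "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, new_path,
                       parts.query, parts.fragment))


print(rewrite_url("https://abc.xyz/f/b/go_cc_Jpterxvid_avi_mp4"))
print(rewrite_url("https://abc.xyz/f/other/some_file"))
```

Because only the first segment is compared against "f", an "/f/" that appears later in the path is left alone.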

How to use regex to parse a number from HTML?

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:
Your number is <b>123</b>
Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?
import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b> fdjsk")
if m:
    print(m.groups()[0])
Given s = "Your number is <b>123</b>" then:
import re
m = re.search(r"\d+", s)
will work and give you
m.group()
'123'
The regular expression looks for 1 or more consecutive digits in your string.
Note that in this specific case we knew there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is not None, because calling m.group() on None raises an AttributeError exception.
Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
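The check on the return value of re.search() can be wrapped in a small helper. The sketch below is illustrative only: extract_number is a made-up name, and it assumes the number always sits in a <b> element directly after the text "Your number is".

```python
import re


def extract_number(html):
    """Return the number as an int, or None when the pattern is absent (hypothetical helper)."""
    m = re.search(r"Your number is <b>(\d+)</b>", html)
    if m is None:
        return None  # avoid AttributeError from calling .group() on None
    return int(m.group(1))


print(extract_number("xxx Your number is <b>123</b> fdjsk"))
print(extract_number("no number in this text"))
```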
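import re
x = 'Your number is <b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
This searches for the number that follows the 'Your number is' string. Note that group(1) returns the captured digits, whereas group(0) would return the whole match, including the <b> tags.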
import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))
The simplest way is to just extract the digits:
re.search(r"\d+", text)
val = "Your number is <b>123</b>"
Option 1:
m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)
m.group(2)
Option 2:
re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")
if found:
    print(found.group(1))
Here (\d+) is a capturing group. Since there is only one group, group(1) returns it. When there are several groups, pass the index of the group you need.
To extract the matches as a Python list, you can use findall:
>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern, string)
['123']
You can use the following example to solve your problem:
import re
text = 'Your number is <b>123</b>'
search = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched Number:", search.group(0))
print("Starting Index Of Digit:", search.start())
print("Ending Index Of Digit:", search.end())
import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)
