python encoding regex issue

I am trying to get this line out from a page:
$ 55 326
I have made this regex to get the numbers:
player_info['salary'] = re.compile(r'\$ \d{0,3} \d{1,3}')
When I get the text I use bs4 and the text is of type 'unicode'
for a in soup_ntr.find_all('div', id='playerbox'):
    player_box_text = a.get_text()
    print(type(player_box_text))
I can't seem to get a result.
I have also tried regexes like these:
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}')
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}', re.UNICODE)
But I can't figure out how to get the data.
The page I am reading has this header:
Content-Type: text/html; charset=utf-8
I hope someone can help me figure it out.

re.compile doesn't match anything. It only creates a compiled version of the regex.
You want something like this (re.search rather than re.match, because re.match only matches at the very start of the string):
matchObj = re.search(r'\$ (\d{0,3}) (\d{1,3})', player_box_text)
if matchObj:
    player_info['salary'] = matchObj.group(1) + matchObj.group(2)
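A correct pattern can still fail on scraped text if the gaps between the digits are not ordinary spaces. Pages often use a non-breaking space (U+00A0) as a thousands separator. Whether this page does is an assumption, not something the question confirms. A minimal Python 3 sketch that tolerates either kind of space, using a made-up sample string:

```python
import re

# Hypothetical sample text; the gap inside the number is a non-breaking space.
player_box_text = "Name: Some Player\nSalary: $ 55\u00a0326\nAge: 24"

# In Python 3, \s in a str pattern matches Unicode whitespace, including U+00A0.
salary_re = re.compile(r"\$\s*(\d{1,3}(?:\s\d{3})*)")

match = salary_re.search(player_box_text)
# Strip every whitespace character before converting to an integer.
salary = int(re.sub(r"\s", "", match.group(1))) if match else None
print(salary)
```

If the page uses plain spaces, the same pattern still matches, so nothing is lost by allowing both.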

This is a good site for getting to grips with regex.
http://txt2re.com/
#!/usr/bin/python
# URL that generated this code:
# http://txt2re.com/index-python.php3?s=$%2055%20326&2&1
import re
txt='$ 55 326'
re1='.*?' # Non-greedy match on filler
re2='(\\d+)' # Integer Number 1
re3='.*?' # Non-greedy match on filler
re4='(\\d+)' # Integer Number 2
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    int1 = m.group(1)
    int2 = m.group(2)
    print("(" + int1 + ")" + "(" + int2 + ")" + "\n")

Related

How to extract some url from html?

I need to extract all image links from a local html file. Unfortunately, I can't install bs4 and cssutils to process html.
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
I tried to extract data using a regex:
images = []
for line in html.split('\n'):
    images.append(re.findall(r'(https://s2.*\?lastmod=\d+)', line))
print(images)
[['https://s2.example.com/path/image0.jpg?lastmod=1625296911'],
['https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912']]
I suppose my regular expression is greedy because I used .*
How can I get the following outcome?
images = ['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
If it helps, all links are enclosed in src="..." or url(...).
Thanks for your help.

import re
indices_start = sorted(
    [m.start() + 5 for m in re.finditer("src=", html)]
    + [m.start() + 4 for m in re.finditer("url", html)])
# Escape the dot so that only a literal ".jpg" counts as the end of a path.
indices_end = [m.end() for m in re.finditer(r"\.jpg", html)]
image_list = []
for start, end in zip(indices_start, indices_end):
    image_list.append(html[start:end])
print(image_list)
That's the solution that comes to mind. It finds the start and end indices of the image path strings. It obviously has to be adjusted if there are other image types.
Edit: Changed the start criteria, in case there are other URLs in the document

You can use
import re
html = """<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911"><br>
<div><a style="background-image:url(https://s2.example.com/path/image1.jpg?lastmod=1625296911)"</a><a style="background-image:url(https://s2.example.com/path/image2.jpg?lastmod=1625296912)"></a><a style="background-image:url(https://s2.example.com/path/image3.jpg?lastmod=1625296912)"></a></div>"""
images = re.findall(r'https://s2[^\s?]*(?=\?lastmod=\d)', html)
print(images)
Output:
['https://s2.example.com/path/image0.jpg',
'https://s2.example.com/path/image1.jpg',
'https://s2.example.com/path/image2.jpg',
'https://s2.example.com/path/image3.jpg']
The pattern means:
https://s2 - some literal text
[^\s?]* - zero or more chars other than whitespace and ? chars
(?=\?lastmod=\d) - immediately to the right, there must be ?lastmod= and a digit (this text is not added to the match because it sits inside a positive lookahead, a non-consuming pattern).
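The question's pattern over-matched because .* is greedy: it runs to the end of the line and backtracks only as far as the last ?lastmod=. A lazy quantifier with a capture group is another way to get the same URLs. This is a sketch on a shortened copy of the sample HTML, not a general HTML parser:

```python
import re

# Shortened copy of the sample from the question, on a single line.
html = ('<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
        '<a style="background-image:url(https://s2.example.com/path/image1.jpg'
        '?lastmod=1625296911)"></a>')

# Greedy: .* swallows everything up to the LAST "?lastmod=" on the line.
greedy = re.findall(r'https://s2.*\?lastmod=\d+', html)

# Lazy: .*? stops at the FIRST "?lastmod="; the group keeps only the URL part.
lazy = re.findall(r'(https://s2.*?)\?lastmod=\d+', html)

print(len(greedy))
print(lazy)
```

The greedy version returns a single over-long match, while the lazy version returns one clean URL per image.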

import re
xx = '<img src="https://s2.example.com/path/image0.jpg?lastmod=1625296911" alt="asdasd"><img a src="https://s2.example.com/path/image0.jpg?lastmod=1625296911">'
r1 = re.findall(r"<img(?=\s|>)[^>]*>",xx)
url = []
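for x in r1:
    # Find the src attribute inside the <img> tag.
    x = re.findall(r"src\s{0,}=\s{0,}['\"][\w\d:/.=]{0,}", x)
    if len(x) == 0:
        continue
    # Keep only the URL; '?' is not in the character class, so the query string is dropped.
    x = re.findall(r"http[s]{0,1}[\w\d:/.=]{0,}", x[0])
    if len(x) == 0:
        continue
    url.append(x[0])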
print(url)

Regular Expression, extract the number with a decimal place from API input

I am trying to extract the number with its decimals from my API input. The data comes with text and commas, but I only need the decimal number. I'm pretty sure this isn't the right way to write a regex (I tested it on regex101). I'm a beginner in coding, so I don't know much about regular expressions.
1: {"symbol":"BTCUSDT","price":"34592.99000000"}
Attempt to extract: 34592.99000000 using regex101 "\d+........"
2: {"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08,"highestBid":1100610.1,"percentChange":2.94,"baseVolume":202.54340749,"quoteVolume":221380256.57,"isFrozen":0,"high24hr":1108001,"low24hr":1061412.72,"change":31496.06,"prevClose":1102999.13,"prevOpen":1071503.07}}
Attempt to extract: 1102999.13 using regex101 "\d\d....."
These attempts only get me close to the target, not 100% there. I believe there is a right way of doing this.
Here's my code:
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
result.text
result1.text
api0 = re.compile(r"\d+........").findall(result.text)[0]
api1 = re.compile(r"\d\d.....").findall(result1.text)[0]
print(result.text)
print(result1.text)
If you have any advice, please share it. Thanks in advance.

An easier and better way to do this, without regex, is to parse the JSON response:
import requests
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT").json()
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10").json()
data_1 = format(float(result['price']), '.2f')
data_2 = format(float(result1['THB_BTC']['last']), '.2f')
print(data_1, data_2)
34602.98 1101999.95
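The same idea can be tried offline with the standard json module. This sketch uses the sample payloads quoted in the question (the second one shortened) instead of live requests, so the variable names and the trimmed data are illustrative only:

```python
import json

# Sample payloads copied from the question; the second is shortened.
raw_binance = '{"symbol":"BTCUSDT","price":"34592.99000000"}'
raw_bitkub = '{"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08}}'

# Binance sends the price as a string, Bitkub sends a JSON number.
price = float(json.loads(raw_binance)["price"])
last = json.loads(raw_bitkub)["THB_BTC"]["last"]

formatted = (format(price, ".2f"), format(last, ".2f"))
print(formatted)
```

Looking values up by key keeps working if the API reorders or adds fields, which a positional regex does not.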

You can try something like this: change your regex to \d+\.\d+
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
api0 = re.compile(r"\d+\.\d+").findall(result.text)[0]
api1 = re.compile(r"\d+\.\d+").findall(result1.text)[0]
print(result.text)
print(result1.text)
print(api0)
print(api1)
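Note that findall(...)[0] returns whichever decimal number happens to come first in the text. If regex is still preferred over JSON parsing, tying the pattern to the key name is less fragile. A sketch on the sample payloads from the question (the second one shortened):

```python
import re

# Sample payloads copied from the question; the second is shortened.
text0 = '{"symbol":"BTCUSDT","price":"34592.99000000"}'
text1 = '{"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08}}'

# Anchor each pattern on the key so the field order does not matter.
api0 = re.search(r'"price":"(\d+\.\d+)"', text0).group(1)
api1 = re.search(r'"last":(\d+(?:\.\d+)?)', text1).group(1)

print(api0)
print(api1)
```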

Find values using regex (includes brackets)

It's my first time with regex and I have some issues that I hope you can help me with. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last pair of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description:. This was managed with the following pattern: [\n\r].*description:\s*([^\n\r]*) The output includes the quotes ("9710"), but that is fine and needs no changes.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, all I have managed so far is to get the values inside every pair of brackets. For that, I used the pattern \(([^)]*)\) The result is that it picks up all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for regex, which would allow me to construct a pattern that retrieves the data in brackets only after the specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the second part?
Thanks in advance.

You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
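If the goal is a real date object, the captured group can be converted as sketched below. One detail to keep in mind: JavaScript's setFullYear takes a zero-based month index, so 10 means November. The variable names here are illustrative:

```python
from datetime import date

# Second group as returned by the regex above.
captured = '\n2007,\n10,\n1 '

year, month_index, day = (int(s.strip()) for s in captured.split(","))

# JavaScript months are zero-based (0 = January), Python's are one-based.
parsed = date(year, month_index + 1, day)
print(parsed.isoformat())
```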

import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)

The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do is repeatedly match all following lines that do not start with newDate.setFullYear(
Then, when you do encounter that text, match it and capture in group 2 all characters except parentheses.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option that gets group 1 and the separate digits in group 2, using the regex module from PyPI (the pattern relies on \G, which the built-in re module does not support):
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)

Python Regex Google App Engine

I'm using python on GAE
I'm trying to get the following from html
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to get everything that has a "V" followed by 7 or more digits and has </FONT> behind it.
My code is:
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It's throwing an invalid expression error. I don't know which part is invalid, because the pattern works in an online regex tester.
How can I do this in XPath format?

>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
(.*) is greedy and not needed here. re.search(r'V[0-9]{7,}',s) will do the extraction without it.
EDIT: as @Kaneg said, you can use findall to get all instances. You will get a list with all occurrences of 'V[0-9]{7,}'.

How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
>>> mls
'V1068078'
Or you can use a regular expression
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath(r"//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
>>> mls
'V1068078'

The example below matches multiple occurrences:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print(m)

The following will work:
result = re.search(r'V\d{7,}',s)
print(result.group(0)) # prints 'V1068078'
It matches the letter V followed by a run of 7 or more digits.
EDIT
If you want it to find all instances, replace search with findall:
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.findall(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']

For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print(taxon.text)
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo

And here is one without capturing groups, using a lookbehind:
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7,}', s)
>>> print(m.group(0))
V1068078

How to eliminate email formatting in received email?

I am practicing sending emails with Google App Engine with Python. This code checks to see if message.sender is in the database:
class ReceiveEmail(InboundMailHandler):
    def receive(self, message):
        querySender = User.all()
        querySender.filter("userEmail =", message.sender)
        senderInDatabase = None
        for match in querySender:
            senderInDatabase = match.userEmail
This works in the development server because I send the email as "az@example.com" and message.sender="az@example.com".
But I realized that in the production server emails come formatted as "az <az@example.com>", and my code fails because now message.sender="az <az@example.com>" while the email in the database is simply "az@example.com".
I searched for how to do this with regex and it is possible but I was wondering if I can do this with Python lists? Or, what do you think is the best way to achieve this result? I need to take just the email address from the message.sender.
App Engine documentation acknowledges the formatting but I could not find a specific way to select the email address only.
Thanks!
EDIT2 (re: Forest answer)
@Forest:
parseaddr() appears to be simple enough:
>>> e = "az <az@example.com>"
>>> parsed = parseaddr(e)
>>> parsed
('az', 'az@example.com')
>>> parsed[1]
'az@example.com'
>>>
But this still does not cover the other type of formatting that you mention: user@example.com (Full Name)
>>> e2 = "<az@example.com> az"
>>> parsed2 = parseaddr(e2)
>>> parsed2
('', 'az@example.com')
>>>
Is there really a formatting where full name comes after the email?
EDIT (re: Adam Bernier answer)
My try about how the regex works (probably not correct):
r # raw string
< # first limit character
( # what is inside () is matched
[ # indicates a set of characters
^ # start of string
> # start with this and go backward?
] # end set of characters
+ # repeat the match
) # end group
> # end limit character

Rather than storing the entire contents of a To: or From: header field as an opaque string, why don't you parse incoming email and store email address separately from full name? See email.utils.parseaddr(). This way you don't have to use complicated, slow pattern matching when you want to look up an address. You can always reassemble the fields using formataddr().
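A minimal sketch of that suggestion with the standard library. The sender strings are illustrative, and the datastore lookup from the question is left out:

```python
from email.utils import formataddr, parseaddr

# Illustrative sender value in the "Name <address>" form.
sender = "az <az@example.com>"

# parseaddr splits a header value into (display name, bare address).
name, address = parseaddr(sender)
print(address)

# A bare address passes through unchanged, so both formats can be handled alike.
bare_address = parseaddr("az@example.com")[1]

# formataddr reassembles the header form when it is needed again.
rebuilt = formataddr((name, address))
print(rebuilt)
```

Storing and querying on address alone means the lookup no longer depends on how the sending server formats the header.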

If you want to use regex try something like this:
>>> import re
>>> email_string = "az <az@example.com>"
>>> re.findall(r'<([^>]+)>', email_string)
['az@example.com']
Note that the above regex handles multiple addresses...
>>> email_string2 = "az <az@example.com>, bz <bz@example.com>"
>>> re.findall(r'<([^>]+)>', email_string2)
['az@example.com', 'bz@example.com']
but this simpler regex doesn't:
>>> re.findall(r'<(.*)>', email_string2)
['az@example.com>, bz <bz@example.com'] # matches too much
Using slices (which I think you meant to say instead of "lists") seems more convoluted, e.g.:
>>> email_string[email_string.find('<')+1:-1]
'az@example.com'
and if multiple:
>>> email_strings = email_string2.split(',')
>>> for s in email_strings:
...     s[s.find('<')+1:-1]
...
'az@example.com'
'bz@example.com'
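For the question's attempt at reading the regex piece by piece, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The key point is that ^ placed first inside square brackets negates the set; it does not mean "start of string" there. A sketch with illustrative addresses:

```python
import re

angle_addr = re.compile(r"""
    <          # a literal opening angle bracket
    (          # start of capture group 1
      [^>]+    # one or more characters that are NOT '>' (^ negates the set)
    )          # end of capture group 1
    >          # a literal closing angle bracket
""", re.VERBOSE)

found = angle_addr.findall("az <az@example.com>, bz <bz@example.com>")
print(found)
```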
