I'm trying to extract a number with two decimal places from my APIs' responses. The data comes back as text with other fields, but I only need the decimal number. I'm pretty sure I'm not using regex101 the right way. I'm a beginner, so I don't know much about regular expressions yet.
1: {"symbol":"BTCUSDT","price":"34592.99000000"}
Attempt to extract: 34592.99000000 using regex101 "\d+........"
2: {"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08,"highestBid":1100610.1,"percentChange":2.94,"baseVolume":202.54340749,"quoteVolume":221380256.57,"isFrozen":0,"high24hr":1108001,"low24hr":1061412.72,"change":31496.06,"prevClose":1102999.13,"prevOpen":1071503.07}}
Attempt to extract: 1102999.13 using regex101 "\d\d....."
These attempts only get me close, never exactly on target; I believe there is a right way to do this.
here's my code
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
result.text
result1.text
api0 = re.compile(r"\d+........").findall(result.text)[0]
api1 = re.compile(r"\d\d.....").findall(result1.text)[0]
print(result.text)
print(result1.text)
If you have any advice, please share it. Thanks very much in advance.
An easier and better way to do this, without regex:
import requests
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT").json()
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10").json()
data_1 = format(float(result['price']), '.2f')
data_2 = format(float(result1['THB_BTC']['last']), '.2f')
print(data_1, data_2)
34602.98 1101999.95
You can try something like this. Change your regex to \d+\.\d+:
import requests
import re
result = requests.get("https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT")
result1 = requests.get("https://api.bitkub.com/api/market/ticker/?sym=THB_BTC&lmt=10")
api0 = re.compile(r"\d+\.\d+").findall(result.text)[0]
api1 = re.compile(r"\d+\.\d+").findall(result1.text)[0]
print(result.text)
print(result1.text)
print(api0)
print(api1)
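The pattern \d+\.\d+ grabs the first decimal number found anywhere in the response, which happens to be the right one here. If you want to be robust against other decimals appearing earlier in the payload, you can anchor the pattern to the key name. A small sketch using the sample payloads from the question (key names taken from the JSON shown above):

```python
import re

binance_text = '{"symbol":"BTCUSDT","price":"34592.99000000"}'
bitkub_text = '{"THB_BTC":{"id":1,"last":1102999.13,"lowestAsk":1102999.08}}'

# Anchor each number to its key so another decimal in the payload can't match first.
price = re.search(r'"price":"(\d+\.\d+)"', binance_text).group(1)
last = re.search(r'"last":(\d+\.\d+)', bitkub_text).group(1)

print(price)  # 34592.99000000
print(last)   # 1102999.13
```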
I am trying to get the ASIN number for each product on Amazon which is the first ten digits after dp/. I have gotten to the point where I have the digits but still have the junk after it. Any help?
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    product_lst = url.split("dp/")
    for url in product_lst:
        del product_lst[::2]
    print(product_lst)
Output:
['B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ']
['B08SLYY1WD/?encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref=pd_gw_deals']
['B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae']
['B089RDSML3']
['B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2']
For searches in text the module re (regex) is a good choice:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
import re
results = []
for url in product_lst:
    m = re.search(r"/dp/([^/?]+)", url)
    if m:
        results.append(m.groups()[0])
print(results)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I use r"/dp/([^/?]+)" as the pattern, which boils down to a grouped match for anything after /dp/, matching everything up to the next / or ?.
You can test regexes online - I use http://regex101.com (for complex ones) - it can even generate Python code based on what you enter in its fields (I don't use that, though).
You can change your own code to
for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:  # "blablubb dp/ more stuff" => 2 or more parts
        print(part[1])  # print what is left after dp/
to avoid overwriting your list product_lst - but you will still need to trim stuff after / and ? with it.
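That trimming can be done with two more split() calls; a sketch using a few shortened example URLs (the path fragments are made up for illustration):

```python
product_lst = [
    "https://www.amazon.com/Bentgo-Kids/dp/B07R2CNSTK/ref=zg_bs_toys",
    "https://www.amazon.com/Fire-TV-Stick/dp/B079QHML21?ref=deals",
    "https://www.amazon.com/dp/B089RDSML3",
]

asins = []
for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:
        # cut off everything from the next "/" or "?" onward
        asins.append(part[1].split("/")[0].split("?")[0])

print(asins)  # ['B07R2CNSTK', 'B079QHML21', 'B089RDSML3']
```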
After you split() on the 'dp/', there is absolutely no reason to loop. You know exactly where the data is that you want, so just get it directly:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
    split_lst = url.split("dp/")
    print(split_lst[1][:10])
I assume that the ASIN is always 10 characters. Adjust the slice if it is longer but still fixed-width. Otherwise you will need to find a different approach.
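If you do rely on the fixed-width slice, it is cheap to sanity-check the result afterwards. A sketch, assuming ASINs are always exactly 10 uppercase letters or digits (verify that assumption against your own data):

```python
import re

def looks_like_asin(s):
    # Assumption: an ASIN is exactly 10 uppercase letters/digits.
    return re.fullmatch(r"[A-Z0-9]{10}", s) is not None

print(looks_like_asin("B07R2CNSTK"))      # True
print(looks_like_asin("B07R2CNSTK/ref"))  # False
```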
You can directly get the ASIN without splitting the data.
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
ASIN = []
for url in product_lst:
    idx = url.find("/dp/")
    ASIN.append(url[idx+4:idx+14])
print(ASIN)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I am trying to isolate the domain name for a database full of URLs, but I'm running into some regex problems.
Starting example:
examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
Desired goal:
['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com']
I've been trying a two stage process where I add in a "." before "www", then replace the "www.", but that doesn't quite lead to the results I'd like.
Any regex wizards out there able to help?
Thanks in advance!
import re
def extract(domain):
    return re.sub(r'^www[\d-]*\.?', '', domain)
examples = ['www2.chccs.k12.nc.us', 'wwwsco.com', 'www-152.aig.com', 'www.google.com']
result = [extract(d) for d in examples]
assert result == ['chccs.k12.nc.us', 'sco.com', 'aig.com', 'google.com'], result
I'm using python on GAE
I'm trying to extract the following from some HTML:
<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>
I want to match everything that has a "V" followed by 7 or more digits, with </FONT> right after it.
Here's my code:
response = urllib2.urlopen(url)
html = response.read()
tree = etree.HTML(html)
mls = tree.xpath('/[V]\d{7,10}</FONT>')
self.response.out.write(mls)
It throws an "invalid expression" error. I don't know which part is invalid because the expression works on the online regex tester.
How can i do this in the xpath format?
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> a = re.search(r'(.*)(V[0-9]{7,})',s)
>>> a.group(2)
'V1068078'
EDIT
(.*) is a greedy match; re.search(r'V[0-9]{7,}', s) does the extraction without the greedy prefix.
EDIT: as @Kaneg said, you can use findall for all instances. You will get a list with every occurrence matching 'V[0-9]{7,}'.
How can I do this in the XPath?
You can use starts-with() here.
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[starts-with(text(),'V')]")[0].text
'V1068078'
Or you can use a regular expression
>>> from lxml import etree
>>> html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> tree = etree.fromstring(html)
>>> mls = tree.xpath("//TD/FONT[re:match(text(), 'V\d{7,}')]",
...                  namespaces={'re': 'http://exslt.org/regular-expressions'})[0].text
'V1068078'
Below example can match multiple cases:
import re
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V10683333</FONT></TD>,' \
' <TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068333333</FONT></TD>'
m = re.findall(r'V\d{7,}', s)
print m
The following will work:
result = re.search(r'V\d{7,}',s)
print result.group(0) # prints 'V1068078'
It will match any string of numeric digit of length 7 or more that follows the letter V
EDIT
If you want it to find all instances, replace search with findall
s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>V1068078 V1068078 V1068078'
re.findall(r'V\d{7,}',s)
['V1068078', 'V1068078', 'V1068078', 'V1068078']
For everyone that keeps posting purely regex solutions, you need to read the question -- the problem is not just formulating a regular expression; it is an issue of isolating the right nodes of the XML/HTML document tree, upon which regex can be employed to subsequently isolate the desired strings.
You didn't show any of your import statements -- are you trying to use ElementTree? In order to use ElementTree you need to have some understanding of the structure of your XML/HTML, from the root down to the target tag (in your case, "TD/FONT"). Next you would use the ElementTree methods, "find" and "findall" to traverse the tree and get to your desired tags/attributes.
As has been noted previously, "ElementTree uses its own path syntax, which is more or less a subset of xpath. If you want an ElementTree compatible library with full xpath support, try lxml." ElementTree does have support for xpath, but not the way you are using it here.
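For the single tag in this question, ElementTree's own limited path syntax is already enough. A minimal sketch parsing the snippet from the question (written for Python 3's module layout; the snippet happens to be well-formed XML):

```python
import xml.etree.ElementTree as ET

html = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
root = ET.fromstring(html)

# ElementTree paths support tag names, '*', './/', and simple predicates,
# but not full XPath expressions.
font = root.find(".//FONT")
print(font.text)  # V1068078
```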
If you indeed do want to use ElementTree, you should provide an example of the html you are trying to parse so everybody has a notion of the structure. In the absence of such an example, a made up example would look like the following:
import xml, urllib2
from xml.etree import ElementTree
url = "http://www.uniprot.org/uniprot/P04637.xml"
response = urllib2.urlopen(url)
html = response.read()
tree = xml.etree.ElementTree.fromstring(html)
# namespace prefix, see https://stackoverflow.com/questions/1249876/alter-namespace-prefixing-with-elementtree-in-python
ns = '{http://uniprot.org/uniprot}'
root = tree.getiterator(ns+'uniprot')[0]
taxa = root.find(ns+'entry').find(ns+'organism').find(ns+'lineage').findall(ns+'taxon')
for taxon in taxa:
    print taxon.text
# Output:
Eukaryota
Metazoa
Chordata
Craniata
Vertebrata
Euteleostomi
Mammalia
Eutheria
Euarchontoglires
Primates
Haplorrhini
Catarrhini
Hominidae
Homo
And here is one without capturing groups:
>>> import re
>>> s = '<TD><FONT FACE="Arial,helvetica" SIZE="-2">V1068078</FONT></TD>'
>>> m = re.search(r'(?<=>)V\d{7,}', s)
>>> print m.group(0)
V1068078
With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.
The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:
==Biography==
===Early life and education===
blah blah blah
What I'd really like to do is feed the retrieved text to a function and wrap all the top level sections in bold html tags and the second level sections in italics, like this:
<b>Biography</b>
<i>Early life and education</i>
blah blah blah
But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions?
Any suggestions greatly appreciated.
PS Sorry if "parsing" is too strong a word for what I'm trying to do here.
I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content
which returns the raw wikitext and
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse
which returns the parsed HTML.
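The only difference between the two requests is the rvparse flag, so both URLs are easy to build programmatically. A sketch that only constructs the URLs without sending anything (whether rvparse is still honored should be checked against the current MediaWiki API docs):

```python
from urllib.parse import urlencode

base = "http://en.wikipedia.org/w/api.php"
common = {
    "action": "query",
    "prop": "revisions",
    "titles": "Albert Einstein",
    "rvprop": "content",
}

wikitext_url = base + "?" + urlencode(common)                   # raw wikitext
parsed_url = base + "?" + urlencode({**common, "rvparse": ""})  # parsed HTML

print(wikitext_url)
print(parsed_url)
```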
You can use regex together with scraping modules like Scrapy and BeautifulSoup to parse and scrape wiki pages.
Now that you have clarified your question, I suggest the py-wikimarkup module hosted on GitHub: https://github.com/dcramer/py-wikimarkup/. I hope that helps.
I ended up doing this:
def parseWikiTitles(x):
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===', '<i>', 1)
            counter = 2
        else:
            x = x.replace('===', '</i>', 1)
            counter = 1
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==', '<b>', 1)
            counter = 2
        else:
            x = x.replace('==', '</b>', 1)
            counter = 1
    x = x.replace('<b> ', '<b>', 50)
    x = x.replace(' </b>', '</b>', 50)
    x = x.replace('<i> ', '<i>', 50)
    x = x.replace(' </i>', '</i>', 50)
    return x
I pass a string of text with wiki titles to that function and it returns the same text with == and === replaced by bold and italic HTML tags. The last four replace calls strip the spaces around titles, so for example == title == becomes <b>title</b> instead of <b> title </b>.
Has worked without problem so far.
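The same replacement can also be written with two re.sub calls anchored to whole heading lines, which avoids touching any stray == inside the body text. A sketch (note the === pattern must run before the == one, or the italic headings get half-matched):

```python
import re

def parse_wiki_titles(text):
    # Handle ===Heading=== lines first, then ==Heading== lines.
    text = re.sub(r"^===\s*(.*?)\s*===\s*$", r"<i>\1</i>", text, flags=re.M)
    text = re.sub(r"^==\s*(.*?)\s*==\s*$", r"<b>\1</b>", text, flags=re.M)
    return text

sample = "==Biography==\n=== Early life and education ===\nblah blah blah"
print(parse_wiki_titles(sample))
```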
Thanks for the help guys,
Alex
I am trying to get this line out from a page:
$ 55 326
I have made this regex to get the numbers:
player_info['salary'] = re.compile(r'\$ \d{0,3} \d{1,3}')
When I get the text I use bs4 and the text is of type 'unicode'
for a in soup_ntr.find_all('div', id='playerbox'):
    player_box_text = a.get_text()
    print(type(player_box_text))
I can't seem to get the result.
I have also tried with a regex like these
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}')
player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}', re.UNICODE)
But I can't figure out how to get the data.
The page I am reading has this header:
Content-Type: text/html; charset=utf-8
Hoping for some help figuring it out.
re.compile doesn't match anything. It just creates a compiled version of the regex.
You want something like this:
matchObj = re.match(r'\$ (\d{0,3}) (\d{1,3})', player_box_text)
player_info['salary'] = matchObj.group(1) + matchObj.group(2)
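Since group(1) and group(2) come back as strings, you will usually want to join them and convert to an int, guarding against boxes where the pattern is absent. A sketch using the sample value from the question:

```python
import re

player_box_text = "$ 55 326"

matchObj = re.match(r"\$ (\d{0,3}) (\d{1,3})", player_box_text)
if matchObj:
    salary = int(matchObj.group(1) + matchObj.group(2))
else:
    salary = None  # pattern not found in this player box

print(salary)  # 55326
```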
This is a good site for getting to grips with regex.
http://txt2re.com/
#!/usr/bin/python
# URL that generated this code:
# http://txt2re.com/index-python.php3?s=$%2055%20326&2&1
import re
txt='$ 55 326'
re1='.*?' # Non-greedy match on filler
re2='(\\d+)' # Integer Number 1
re3='.*?' # Non-greedy match on filler
re4='(\\d+)' # Integer Number 2
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    int1 = m.group(1)
    int2 = m.group(2)
    print "("+int1+")"+"("+int2+")"+"\n"