First of all, I'm new to working with Python, especially Selenium. So I connected to a page with the webdriver and also already grabbed the InnerHTML I need. Here's my problem, InnerHTML is a "list" and I only want to output one value. It looks something like this:
<html>
<body>
<pre style="example" xpath="1">
"amount": 12{
"value" : 3
},
</pre>
</body>
</html>
^It's just for illustration, because the actual thing is much longer. InnerHTML looks like this:
"amount": 12{
"value" : 3
},
^This is where I am now. I can't specify a line because the page is not static. How do I make python find "value" from a variable in InnerHTML ? Please note that there is a colon after "value"!
Thank you very much in advance!
I suggest using regular expression to find the value. I assume that you only need the number part, so here's the code:
innerHTML = '''
"amount": 12{
"value" : 3
},"value":4
'value': 5
'''
import re
regex = re.compile(r'''("|')value("|')\s*:\s*(?P<number>\d+)''')
startpos = 0
m = None
while 1:
m = regex.search(innerHTML, startpos)
if m is None: break
print(m.group("number"))
startpos = m.start() + 1
# output:
# 3
# 4
# 5
This will print out all the value numbers found, as strings. You can convert them to integers afterwards, for example.
NOTE: My code also accounts for the case value is surrounded by single quotes ' rather than double quotes ". This is for your convenience; if not, you can change the appropriate line above to:
regex = re.compile(r'''"value"\s*:\s*(?P<number>\d+)''')
In that case, the output would not include the value 5.
Related
I am trying to get the ASIN number for each product on Amazon which is the first ten digits after dp/. I have gotten to the point where I have the digits but still have the junk after it. Any help?
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
product_lst = url.split("dp/")
for url in product_lst:
del product_lst[::2]
print(product_lst)
Output:
['B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ']
['B08SLYY1WD/?encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref=pd_gw_deals']
['B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae']
['B089RDSML3']
['B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2']
For searches in text the module re (regex) is a good choice:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
import re
results = []
for url in product_lst:
m = re.search(r"/dp/([^/?]+)",url)
if m:
results.append(m.groups()[0])
print(results)
Output:
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
I use r"/dp/([^/?]+)" as pattern wich boils down to a grouped match for anything after /dp/ and then matches all things up to the next / or ?.
You can test regexes online - I use http://regex101.com (for complex ones) - it can even provide python code based on what you insert in its fields (not using that though ;o) )
You can change your own code to
for url in product_lst:
part = url.split("dp/")
if len(part) > 1: # blablubb dp/ more things => 2 or more parts
print(part[1]) # print whats is left after dp/
to avoid overwriting your list product_lst - but you will still need to trim stuff after / and ? with it.
After you split() on the 'dp/', there is absolutely no reason to loop. You know exactly where the data is that you want, so just get it directly:
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
for url in product_lst:
split_lst = url.split("dp/")
print(split_lst[1][:10]
I assume that the ASIN is always 10 characters. Adjust the splice if there are more characters and it is always fixed. Otherwise you will need to find a different appproach.
You can directly get the ASIN without splitting the data.
product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]
ASIN=[]
for url in product_lst:
idx = url.find("/dp/")
ASIN.append(url[idx+4:idx+14])
print(ASIN)
output
['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']
So. I'm trying to extract the Document is original from the string below.
c:1:{s:7:"note";s:335:"Document is original-no need to register again";}
Two thoughts:
A little bit of work, get most components of that structure:
string <- 'c:1:{s:7:"note";s:335:"Document is original-no need to register again";}'
strcapture("(.*):(.*):(.*)",
strsplit(regmatches(string, gregexpr('(?<={)[^}]+(?=})', string, perl = TRUE))[[1]], ";")[[1]],
proto = list(s="", len=1L, x=""))
# s len x
# 1 s 7 "note"
# 2 s 335 "Document is original-no need to register again"
A simpler approach, perhaps a little more hard-coded:
regmatches(string, gregexpr('(?<=")([^;"]+)(?=")', string, perl = TRUE))[[1]]
# [1] "note"
# [2] "Document is original-no need to register again"
From here, you need to figure out how to dismiss "note" and then perhaps strsplit(.., "-") to get the substring you want.
it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo
Ok say I have a string in python:
str="martin added 1 new photo to the <a href=''>martins photos</a> album."
the string contains a lot more css/html in real world use
What is the fastest way to change the 1 ('1 new photo') to say '2 new photos'. of course later the '1' may say '12'.
Note, I don't know what the number is, so doing a replace is not acceptable.
I also need to change 'photo' to 'photos' but I can just do a .replace(...).
Unless there is a neater, easier solution to modify both?
Update
Never mind. From the comments it is evident that the OP's requirement is more complicated than it appears in the question. I don't think it can be solved by my answer.
Original Answer
You can convert the string to a template and store it. Use placeholders for the variables.
template = """%(user)s added %(count)s new %(l_object)s to the
<a href='%(url)s'>%(text)s</a> album."""
options = dict(user = "Martin", count = 1, l_object = 'photo',
url = url, text = "Martin's album")
print template % options
This expects the object of the sentence to be pluralized externally. If you want this logic (or more complex conditions) in your template(s) you should look at a templating engine such as Jinja or Cheetah.
It sounds like this is what you want (although why is another question :^)
import re
def add_photos(s,n):
def helper(m):
num = int(m.group(1)) + n
plural = '' if num == 1 else 's'
return 'added %d new photo%s' % (num,plural)
return re.sub(r'added (\d+) new photo(s?)',helper,s)
s = "martin added 0 new photos to the <a href=''>martins photos</a> album."
s = add_photos(s,1)
print s
s = add_photos(s,5)
print s
s = add_photos(s,7)
print s
Output
martin added 1 new photo to the <a href=''>martins photos</a> album.
martin added 6 new photos to the <a href=''>martins photos</a> album.
martin added 13 new photos to the <a href=''>martins photos</a> album.
since you're not parsing html, just use an regular expression
import re
exp = "{0} added ([0-9]*) new photo".format(name)
number = int(re.findall(exp, strng)[0])
This assumes that you will always pass it a string with the number in it. If not, you'll get an IndexError.
I would store the number and the format string though, in addition to the formatted string. when the number changes, remake the format string and replace your stored copy of it. This will be much mo'bettah' then trying to parse a string to get the count.
In response to your question about the html mattering, I don't think so. You are not trying to extract information that the html is encoding so you are not parsing html with regular expressions. This is just a string as far as that concern goes.
My html looks like:
<td>
<table ..>
<tr>
<th ..>price</th>
<th>$99.99</th>
</tr>
</table>
</td>
So I am in the current table cell, how would I get the 99.99 value?
I have so far:
td[3].findChild('th')
But I need to do:
Find th with text 'price', then get next th tag's string value.
Think about it in "steps"... given that some x is the root of the subtree you're considering,
x.findAll(text='price')
is the list of all items in that subtree containing text 'price'. The parents of those items then of course will be:
[t.parent for t in x.findAll(text='price')]
and if you only want to keep those whose "name" (tag) is 'th', then of course
[t.parent for t in x.findAll(text='price') if t.parent.name=='th']
and you want the "next siblings" of those (but only if they're also 'th's), so
[t.parent.nextSibling for t in x.findAll(text='price')
if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:
Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.
for t in x.findAll(text='price'):
p = t.parent
if p.name != 'th': continue
ns = p.nextSibling
if ns and not ns.name: ns = ns.nextSibling
if not ns or ns.name not in ('td', 'th'): continue
print ns.string
I've added ns.string, that will give the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analize further at this point, depends on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:
from pyparsing import makeHTMLTags, Combine, Word, nums
th,thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
th + "$" + floatnum("price") + thEnd)
tokens,startloc,endloc = priceEntry.scanString(html).next()
print tokens.price
Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern is much more than just adding "<>"s around the given string, but also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, note that even though I specified "TH" as the table head tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace skipping behavior will also handle extra spaces, between tag and "$", between "$" and numeric price, etc., without having to sprinkle "zero or more whitespace chars could go here" indicators. Lastly, by assigning the results name "price" (following floatum in the definition of priceEntry), it makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.
(Combine is used for 2 purposes: it disallows whitespace between the components of the number; and returns a single combined token "99.99" instead of the list ["99", ".", "99"].)