Python lxml html builder

Python lxml html builder - python

hey guys i want to make a html in python. I read a xml with python requests. And i counted the elements of an attribute.
count = len(nodeData.xpath("//user[#condition='good']"))
print (count)`
like this.
but now i want to get a table in which the number of the count stays.
nodeRow = html.TR(html.TD(count , style="background-color:#FF0000")
nodeTable.append(nodeRow)
print etree.tostring(nodeTable)
with open("out3.html", "wb") as f:
f.write(etree.tostring(nodeTable))
But that doesn't work. The error is
TypeError: bad argument type: int(2746)

The error-code is pretty clear - you can't put strings into the text-content of an Element. As you have an int, Python balks. Convert it to a string first:
nodeRow = html.TR(html.TD(str(count) , style="background-color:#FF0000")
You should consider using a template library though, it will make doing this much easier, as it takes care of these little obstacles, and allows a more natural writing of longer HTML snippets.

Related

JSON Parsing with python from Rethink database [Python]

Im trying to retrieve data from a database named RethinkDB, they output JSON when called with r.db("Databasename").table("tablename").insert([{ "id or primary key": line}]).run(), when doing so it outputs [{'id': 'ValueInRowOfid\n'}] and I want to parse that to just the value eg. "ValueInRowOfid". Ive tried with JSON in Python, but I always end up with the typeerror: list indices must be integers or slices, not str, and Ive been told that it is because the Database outputs invalid JSON format. My question is how can a JSON format be invalid (I cant see what is invalid with the output) and also what would be the best way to parse it so that the value "ValueInRowOfid" is left in a Operator eg. Value = ("ValueInRowOfid").
This part imports the modules used and connects to RethinkDB:
import json
from rethinkdb import RethinkDB
r = RethinkDB()
r.connect( "localhost", 28015).repl()
This part is getting the output/value and my trial at parsing it:
getvalue = r.db("Databasename").table("tablename").sample(1).run() # gets a single row/value from the table
print(getvalue) # If I print that, it will show as [{'id': 'ValueInRowOfid\n'}]
dumper = json.dumps(getvalue) # I cant use `json.loads(dumper)` as JSON object must be str. Which the output of the database isnt (The output is a list)
parsevalue = json.loads(dumper) # After `json.dumps(getvalue)` I can now load it, but I cant use the loaded JSON.
print(parsevalue["id"]) # When doing this it now says that the list is a str and it needs to be an integers or slices. Quite frustrating for me as it is opposing it self eg. It first wants str and now it cant use str
print(parsevalue{'id'}) # I also tried to shuffle it around as seen here, but still the same result
I know this is janky and is very hard to comprehend this level of stupidity that I might be on. As I dont know if it is the most simple problem or something that just isnt possible (Which it should or else I cant use my data in the database.)
Thank you for reading this through and not jumping straight into the comments and say that I have to read the JSON documentation, because I have and I havent found a single piece that could help me.
I tried reading the documentation and watching tutorials about JSON and JSON parsing. I also looked for others whom have had the same problems as me and couldnt find.

It looks like it's returning a dictionary ({}) inside a list ([]) of one element.
Try:
getvalue = r.db("Databasename").table("tablename").sample(1).run()
print(getvalue[0]['id'])

What is the Python way to express a GREL line that is creating as many tags as needed in an XML document?

I'm using Open Refine to do something that I KNOW Python can do. I'm using it to convert a csv into an XML metadata document. I can figure out most of it, but the one thing that trips me up, is this GREL line:
{{forEach(cells["subjectTopicsLocal"].value.split('; '), v, '<subject authority="local"><topic>'+v.escape("xml")+'</topic></subject>')}}
What this does, is beautiful for me. I've got a "subject" field in my Excel spreadsheet. My volunteers enter keywords, separated with a "; ". I don't know how many keywords they'll come up with, and sometimes there is only one. That GREL line creates a new <subject authority="local"><topic></topic></subject> for each term created, and of course slides it into the field.
I know there has to be a Python expression that can do this. Could someone recommend best practice for this? I'd appreciate it!

Basically you want to use 'split' in Python to convert the string from your subject field into a Python list, and then you can iterate over the list.
So assuming you've read the content of the 'subject' field from a line in your csv/excel document already and assigned it to a string variable 'subj' you could do something like:
subjList = subj.split(";")
for subject in subjList:
#do what you need to do to output 'subject' in an xml element here

This Python expression is the equivalent to your GREL expression:
['<subject authority="local"><topic>'+escape(v)+'</topic></subject>') for v in split(value,'; ')]
It will create an array of XML snippets containing your subjects. It assumes that you've created or imported an appropriate escape function, such as
from xml.sax.saxutils import escape

Scraping data from a http & javaScript site

I currently want to scrape some data from an amazon page and I'm kind of stuck.
For example, lets take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
Which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in variationValues size_name section) and the color 'Teal' (same idea as before)
I want to scrape both the variationValues and the asinToDimentionIndexMap, so i can associate the IndexMap numbers to the variationValues one.
Another person in the site (thanks for the help btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_frist()
import re
# capture everything between {}
data = re.findall(script, '(\{.+?\}_')
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of json is not that great and reading some stuff about it didn't help that much.
Is it there a way to get, from that data, 2 dictionaries or lists with the variationValues and asinToDimentionIndexMap? (maybe using some regular expressions in the middle to get some data out of a big string). Or explain a little bit what happens with the json part.
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap

I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is if there are any errors when accessing a particular "box" inside you json object.
Your code format looks correct, but your access within "each box" may look different.
Eg. If your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slash a little bit so you can better understand the structure of your particular json file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in an look at structure this way to truly find what you're looking for.

variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to json as use them combine as you wish.

How do I find text after a given key word?

I am practicing my programming skills (in Python) and I realized that I don't know what to do when I need to find a value that is unknown but introduced by a key word. I am taking the information for this off a website where in the page source it says, '"size":"10","stockKeepingUnitId":"(random number)"'
How can I figure out what that number is.
This is what I have so far --
def stock():
global session
endpoint = '(website)'
reponse = session.get(endpoint)
soup = bs(response.text, "html.parser")
sizes = soup.find('"size":"10","stockKeepingUnitId":')

Off the top of my head there are two ways to do this. Say you have the string mystr = 'some text...content:"67588978"'. The first way is just to search for "content:" in the string and use string slicing to take everything after it:
num = mystr[mystr.index('content:"') + len('content:"'):-1]
Alternatively, as probably a better solution, you could use regular expressions
import re
nums = re.findall(r'.*?content:\"(\d+)\"')
As you haven't provided an example of the dataset you're trying to analyze, there could also be a number of other solutions. If you're trying to parse a JSON or YAML file, there are simple libraries to turn them into python dicts (json is part of the standard library, and PyYaml handles YAML files easily).

How to do parsing in python?

I'm kinda new to Python. And I'm trying to find out how to do parsing in Python?
I've got a task: to do parsing with some piece of unknown for me symbols and put it to DB. I guess I can create DB and tables with help of SQLAlchemy, but I have no idea how to do parsing and what all these symbols below mean?
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance those who can give me some advices what to start from and how the parsing is going on?

The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.

As other suggest take a look to some regex tutorials, and also re module help.
Probably you're looking to something like this:
import re
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
'batchDate': (22,29), 'batchTime': (30,35)}
poaBatchHeaders = re.findall('\$\$HDR\d{30}', text)
parsedBatchHeaders = []
batchHeaderDict = {}
for poaHeader in poaBatchHeaders:
for key in headerMapping:
start = headerMapping[key][0]-1
end = headerMapping[key][1]
batchHeaderDict.update({key: poaHeader[start:end]})
parsedBatchHeaders.append(batchHeaderDict)
Then you have list with dicts, each dict contains data for each attribute. I assume that you have your datafile in text which is string. Each dict is made for one found structure (POA Batch Header in example).
If you want to parse it further, you have to made a function to parse each date in each attribute.
def batchDate(batch):
return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's an example and I don't know documentation of your data! I guess it works like that, but rest is up to you.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python lxml html builder - python

Related

JSON Parsing with python from Rethink database [Python]

What is the Python way to express a GREL line that is creating as many tags as needed in an XML document?

Scraping data from a http & javaScript site

How do I find text after a given key word?

How to do parsing in python?

Categories

Resources