Parsing a YAML file in Python, and accessing the data? - python

I am new to YAML and have been searching for ways to parse a YAML file and use/access the data from the parsed YAML.
I have come across explanations on how to parse the YAML file, for example, the PyYAML tutorial, "How can I parse a YAML file in Python", "Convert Python dict to object?", but what I haven't found is a simple example on how to access the data from the parsed YAML file.
Assume I have a YAML file such as:
treeroot:
branch1: branch1 text
branch2: branch2 text
How do I access the text "branch1 text"?
"YAML parsing and Python?" provides a solution, but I had problems accessing the data from a more complex YAML file. And, I'm wondering if there is some standard way of accessing the data from a parsed YAML file, possibly something similar to "tree iteration" or "elementpath" notation or something which would be used when parsing an XML file?

Since PyYAML's yaml.load() function parses YAML documents to native Python data structures, you can just access items by key or index. Using the example from the question you linked:
import yaml
with open('tree.yaml', 'r') as f:
doc = yaml.load(f)
To access branch1 text you would use:
txt = doc["treeroot"]["branch1"]
print txt
"branch1 text"
because, in your YAML document, the value of the branch1 key is under the treeroot key.

Just an FYI on #Aphex's solution -
In case you run into -
"YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated"
you may want to use the Loader=yaml.FullLoader or Loader=yaml.SafeLoader option.
import yaml
with open('cc_config.yml', 'r') as f:
doc = yaml.load(f, Loader=yaml.FullLoader) # also, yaml.SafeLoader
txt = doc["treeroot"]["branch1"]
print (txt)

Related

File not converting to JSON properly from xml

I am using Python to convert data from a xml file to json and putting it in a file. I am using xmltodict to convert to dictionary using 'parse' and then converting into json using 'dumps'. Here is the code below: -
import xmltodict
import json
with open('note.xml') as xml_file:
my_dict=xmltodict.parse(xml_file.read())
xml_file.close()
json_data=json.dumps(my_dict)
with open('note.json', 'w') as f:
json.dump(json_data, f)
Here is a sample xml file that I have used. However, in the output, I am getting something not quite json like, with the added backslash. Looks like gibberish: -
"{\"note\": {\"to\": \"Tove\", \"from\": \"Jani\", \"heading\": \"Reminder\", \"body\": \"Don't forget me this weekend!\"}}"
I am trying to understand why I am not able to get data in proper json form. Is there anything wrong with my code? Please note that I am not a very skilled programmer and have only used Python sporadically.
You need to add below code after json_data=json.dumps(my_dict) to convert string to json object
json_data = json.loads(json_data)

How to JSON dump multiple dictionaries without wrapping them in a list

I'm trying to manipulate a JSON file and dump it back out.
Below is the JSON file - Note how there are two dictionaries wrapped together below...:
{"installed":
{"client_id":"xxx",
"project_id":"quickstart-1557411376659",
"auth_uri":"xxx",
"token_uri":"xxx",
"auth_provider_x509_cert_url":"xxx",
"client_secret":"xxx",
"redirect_uris":["urn:ietf:wg:oauth:2.0:oob","http://localhost"]
}
}
I'm trying to use the following python code to read in the JSON file manipulate it, and write it back out.
with open('google_sheets_credentials.json', 'r+') as file:
google_sheets_auth_dict = json.load(file)
#Manipulate file contents here
with open('google_sheets_credentials.json', 'r+') as file:
json.dump(google_sheets_auth_dict, file)
This code fails after a couple of runs because multiple dictionaries need to be wrapped in a list to be written out as JSON - like so:
The reasoning behind this requirement is explained here
with open('google_sheets_credentials.json', 'r+') as file:
json.dump([google_sheets_auth_dict], file)
The problem is that I can't do that in this case because this JSON is ultimately being fed into google sheets' API where it does not expect the JSON to be wrapped in a list.
How might I read this file in, manipulate, and spit it back out in the format expected by google?
This example case:
import json
example_json = '''{"installed":
{"client_id":"xxx",
"project_id":"quickstart-1557411376659",
"auth_uri":"xxx",
"token_uri":"xxx",
"auth_provider_x509_cert_url":"xxx",
"client_secret":"xxx",
"redirect_uris":["urn:ietf:wg:oauth:2.0:oob","http://localhost"]
}
}'''
google_sheets_auth_dict = json.loads(example_json)
google_sheets_auth_dict['client_id'] = 'yyy'
print(json.dumps(google_sheets_auth_dict))
seems to work fine
I'm guessing that something is going wrong in the #Manipulate file contents here bit, which is not shown. Or, the example JSON does not look like the failure case.

Can a text file and a json file be used interchangeably? And if so how can I use it in python?

Question: I was wondering if JSON and txt files could be used interchangeably in python.
More Details: I found this on the internet and this on stack overflow to find what a JSON file is but it did not say if json and txt could be used interchangeably ie using the same commands. For example, can both use the same code with open('filename')as file: or does JSON require a different code. Also if they can be used in the same general manner is linking and using commands for a JSON file and a txt file the same process?
OS: windows 10
IDE: IDLE 64-bit
Version: Python 3.7
A .txt file can contain JSON data, and using open() in Python can open any file, with any content, and any file extension (granted the user running the code has permissions to do so)
It's not until you try to load a non JSON string or file using json.loads or json.load, respectively, where the problem starts.
In other words, a file contains binary data. The data can be represented as a string, that string could be XHTML, JSON, CSV, YAML, whatever, and you must use the appropriate parser to extract the relevant data from that format (but it's not always the file extensions that determine what to use)
does JSON require a different code
It requires another module
import json
with open(name) as f:
data = json.load(f)
You can read the raw data out of any file the same way; the difference is in reading the structure in the data.

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to text file. (The file is Facebook home page's source). Assume the text file has a specific keyword in some places of it. For example: "some_keyword: [bla bla]". How would I print all the different bla blas that are followed by some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names followed by "name:", considering the text is very large and crashes when you read() it or try to search through its lines.
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writting the data to the file. Write the data in JSON format and read it from file using json.loads() as:
import json
json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)
for item in json_data:
print item['name']
Explanation:
Lets say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which will be dynamically changing within your code where you are performing write operation in the file. Instead append it to the list as:
a = []
for item in page_content:
# data = some xy logic on HTML file
a.append(data)
Now write this list to the file using: json.dump()
I just wanted to throw this out there even though I agree with all the comments about just dealing with the html directly or using Facebook's API (probably the safest way), but open file objects in Python can be used as a generator yielding lines without reading the entire file into memory and the re module can be used to extract information from text.
This can be done like so:
import re
regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")
with open("filename.txt", "r") as fp:
for line in fp:
for match in regex.findall(line):
print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
here is the Python 2 docs for the re module
here is the Python 3 docs for the re module
I cannot find documentation which details the generator capabilities of file objects in Python, it seems to be one of those well-known secrets...Please feel free to edit and remove this paragraph if you know where in the Python docs this is detailed.

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.

Categories

Resources