Can't parse XML tree without bounds element from Pyosmium - Python

I downloaded some data from OpenStreetMap and have been filtering it so I only have the nodes and ways I need for my project (highways and the corresponding nodes in their references). To filter the XML file and create a new one, I use the library Pyosmium. Everything works except that I can't parse the resulting XML file with xml.etree.ElementTree. When I sort my data into a new file, I'm not copying over the bounds element that contains the min and max longitude and latitude. If I manually copy in the bounds, the file parses.
I read through the Pyosmium docs and only found osmium.io.Reader and osmium.io.Header, as well as some geometry attributes that describe the box (containing what I need), but I found no help with reading the bounds from my file and using my writer to write them to the new one.
So far this is what I have in my main method, which just handles the nodes and ways using SimpleHandlers:
import os
import xml.etree.ElementTree as ET
import XMLhandlers  # user module containing the Pyosmium handlers

wayHandler = XMLhandlers.StreetHandler()
nodeHandler = XMLhandlers.NodeHandler()
wayHandler.apply_file('data/map_2.osm')
nodeHandler.apply_file('data/map_2.osm')

# Remove any stale output before writing the new file
if os.path.exists('data/map_2_TEST.osm'):
    os.remove('data/map_2_TEST.osm')

writer = XMLhandlers.wayWriter('data/map_2_TEST.osm')
writer.apply_file('data/map_2.osm')
tree = ET.parse('data/map_2_TEST.osm')
This produces the following error:
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
Pastebin of original XML file: https://pastebin.com/i8uyCneC
Pastebin of sorted XML file that won't parse: https://pastebin.com/WZUcsZg4
EDIT:
The error is not in the parsing itself. If I comment out the part that generates the new XML and only parse the new XML file (generated beforehand), it works for some reason.
EDIT 2:
The error was that I forgot to call close() on my SimpleWriter to flush the remaining buffers and close the writer.

The issue happens because the code never closes the writer when done. Calling writer.close() flushes the remaining buffers and closes the writer; until then, the output file can still be empty on disk, which is why ElementTree reports "no element found: line 1, column 0".
The following code has the line added, and the tree parses as expected.
wayHandler = XMLhandlers.StreetHandler()
nodeHandler = XMLhandlers.NodeHandler()
wayHandler.apply_file('data/map_2.osm')
nodeHandler.apply_file('data/map_2.osm')

if os.path.exists('data/map_2_TEST.osm'):
    os.remove('data/map_2_TEST.osm')

writer = XMLhandlers.wayWriter('data/map_2_TEST.osm')
writer.apply_file('data/map_2.osm')
writer.close()  # flush buffers and finish the file before parsing it
tree = ET.parse('data/map_2_TEST.osm')
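A slightly more defensive variant (a sketch, not from the original post) uses try/finally so that close() runs even if apply_file() raises, and the output file is never left truncated:

writer = XMLhandlers.wayWriter('data/map_2_TEST.osm')
try:
    writer.apply_file('data/map_2.osm')
finally:
    # Always flush and close, even on error
    writer.close()
tree = ET.parse('data/map_2_TEST.osm')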

Related

PDF File dedupe issue with same content, but generated at different time periods from a docx

I am working on a PDF file dedupe project and have analyzed many libraries in Python that read a file, generate a hash value for it, and then compare it with the next file to detect duplicates - similar to the logic below, or using Python's filecmp lib. But the issue I found with this logic is that if a PDF is generated from a source DOCX (Save to PDF) at different times, the outputs are not considered duplicates - even when the content is exactly the same. Why does this happen? Is there other logic to read the content and then create a unique hash value based on the actual content?
import hashlib

def calculate_hash_val(path, blocks=65536):
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        # Read and hash the file in fixed-size blocks
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
One of the things that happens is that metadata, including the time of creation, is saved to the file. It is invisible in the rendered PDF, but it will make the hash different.
Here is an explanation of how to find and strip out that data with at least one tool. I am sure that there are many others.
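For instance, here is a rough sketch that hashes only the extracted text, so metadata such as the creation timestamp no longer affects the digest. It assumes the third-party pypdf package, which the original answer does not mention:

# Sketch: hash extracted text instead of raw bytes (assumes pypdf is installed)
import hashlib
from pypdf import PdfReader

def content_hash(path):
    reader = PdfReader(path)
    hasher = hashlib.md5()
    for page in reader.pages:
        # extract_text() may return None for pages without text content
        text = page.extract_text() or ''
        hasher.update(text.encode('utf-8'))
    return hasher.hexdigest()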

When I import an array from another file, do I take just the data from it, or do I need to "build" the array the way the original file built it?

Sorry if the question is not well formulated; I will reformulate it if necessary.
I have a file with an array that I filled with data from an online JSON DB, and I imported this array into another file to use its data.
# file1
import json
from urllib.request import urlopen

response = urlopen(url1)
data = json.loads(response.read())
a = []
for i in range(len(data)):
    a.append(data[i]['name'])
#file2
from file1 import a
'''do something with "a"'''
Does importing the array mean I'm refilling the array each time I use it in file2?
If that is the case, what can I do to just keep the data from the array without "building" it each time?
If you saved a to a file and then read it back, you would not need to rebuild a; you could just open the file. For example, here's one way to open a text file and get the text from it:
# open the file and read everything into a variable, then act on that variable
with open(file_path, "r") as open_file:
    file_guts = open_file.read()
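To make that concrete, here is a minimal sketch (file names are illustrative, not from the original post) of saving the list as JSON in file1 and loading it in file2 instead of rebuilding it:

# file1: fetch once, then save the list to disk
import json

with open('names.json', 'w') as f:
    json.dump(a, f)

# file2: load the saved list instead of importing file1
import json

with open('names.json') as f:
    a = json.load(f)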
From the Python docs on the Modules section - link - you can read:
When you run a Python module with
python fibo.py <arguments>
the code in the module will be executed, just as if you imported it
This means that importing a module behaves the same as running it as a regular Python script, unless you use the __name__ check mentioned right after this quotation.
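A minimal illustration of that __name__ check (a generic sketch, not code from the question):

# demo.py
def main():
    print('running as a script')

if __name__ == '__main__':
    # This block runs when demo.py is executed directly,
    # but is skipped when demo.py is imported as a module.
    main()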
Also, if you think about it, you are opening something, reading from it, and then doing some operations. How can you be sure that the content you are now reading from is the same as the one you had read the first time?

How to in-place parse and edit huge xml file [duplicate]

This question already has an answer here:
How to use xml sax parser to read and write a large xml?
(1 answer)
Closed 3 years ago.
I have huge XML datasets (2-40 GB). Some of the data is confidential, so I am trying to edit the dataset to mask all of the confidential information. I have a long list of values that need to be masked; for example, if I have the ID 'GYT-1064', I need to find and replace every instance of it. These values can appear in different fields/levels/subclasses, so one object might have 'Order-ID = GYT-1064' whereas another might say 'PO-Name = GYT-1064'. I have looked into iterparse but cannot figure out how to edit the XML file in place instead of building the entire new tree in memory, because I have to loop through it multiple times to find each instance of each ID.
Ideal functionality:
For each element, if a given string is in element, replace the text and change the line in the XML file.
I have a solution that works if the dataset is small enough to load into memory, but I can't figure out how to correctly leverage iterparse. I've also looked at every answer that talks about lxml iterparse, but since I need to iterate through the entire file multiple times, I need to be able to edit it in place.
Simple version that works, but has to load the whole XML into memory (and isn't in-place):
import xml.etree.ElementTree as ET

values_to_mask = ['val1', 'GMX-103', 'etc-555']  # imported list of values to mask
with open(dataset_name, encoding='utf8') as f:
    tree = ET.parse(f)
root = tree.getroot()
for old in values_to_mask:
    new = mu.generateNew(old, randomnumber)  # utility to generate new value
    for elem in root.iter():
        try:
            elem.text = elem.text.replace(old, new)
        except AttributeError:
            pass
tree.write(output_name, encoding='utf8')
What I attempted with iterparse:
with open(output_name, mode='rb+') as f:
    context = etree.iterparse(f)
    for old in values_to_mask:
        new = mu.generateNew(old, randomnumber)
        mu.fast_iter(context, mu.replace_if_exists, old, new, f)

def replace_if_exists(elem, old, new, xf):
    try:
        if old in elem.text:
            elem.text = elem.text.replace(old, new)
            xf.write(elem)
    except AttributeError:
        pass
It runs but doesn't replace any text, and I get print(context.root) = 'Null'. Additionally, it doesn't seem like it would correctly write back to the file in place.
Basically, this is how the XML data looks (hierarchical objects with subclasses):
<Master_Data_Object>
  <Package>
    <PackageNr>1000</PackageNr>
    <Quantity>900</Quantity>
    <ID>FAKE_CONFIDENTIALGYO421</ID>
    <Item_subclass>
      <ItemType>C</ItemType>
      <MasterPackageID>FAKE_CONFIDENTIALGYO421</MasterPackageID>
      <Package>
        <Other_Types>
Since there is a lack of dataset, I would suggest that you:
1) use readlines() in a loop to read a substantial amount of data at a time
2) use a regular expression to identify confidential information (if possible), then replace it.
Let me know if it works.
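A rough sketch of that line-based suggestion (my own illustration, assuming masked values never span a line boundary; it writes a masked copy rather than truly editing in place):

import re

def mask_lines(in_path, out_path, replacements):
    # One alternation pattern covering every value to mask
    pattern = re.compile('|'.join(re.escape(old) for old in replacements))
    with open(in_path, encoding='utf8') as src, \
            open(out_path, 'w', encoding='utf8') as dst:
        for line in src:
            dst.write(pattern.sub(lambda m: replacements[m.group(0)], line))

# Usage (illustrative values):
# mask_lines('dataset.xml', 'masked.xml', {'GYT-1064': 'MASKED-0001'})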
You can pretty much use a SAX parser for big XML files.
Here is your answer:
Editing big xml files using sax parser
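For completeness, here is a rough sketch of the SAX approach (my own illustration, not the linked answer's code; it masks element text only, matching the elem.text logic above). XMLGenerator echoes each parse event back out, so the masked copy is written as the file streams through:

import re
import xml.sax
from xml.sax.saxutils import XMLGenerator

class MaskingHandler(XMLGenerator):
    def __init__(self, out, replacements):
        XMLGenerator.__init__(self, out, encoding='utf-8')
        self._pattern = re.compile('|'.join(re.escape(k) for k in replacements))
        self._replacements = replacements

    def characters(self, content):
        # Mask confidential values in text content before it is written out
        masked = self._pattern.sub(lambda m: self._replacements[m.group(0)], content)
        XMLGenerator.characters(self, masked)

replacements = {'GYT-1064': 'MASKED-0001'}  # illustrative values
with open('masked.xml', 'w', encoding='utf-8') as out:
    xml.sax.parse('dataset.xml', MaskingHandler(out, replacements))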

Constant first row of a .csv file?

I have Python code which logs some data into a .csv file.
import datetime

logging_file = 'test.csv'
dt = datetime.datetime.now()
f = open(logging_file, 'a')
f.write('\n "{:%H:%M:%S}",{},{}'.format(dt, x, y))
The above code is the core part, and it produces continuous data in the .csv file as:
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
Now I wish to add the following header line as the first row of this data: time,data1,data2. I expect output as:
time, data1, data2
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
I tried many ways, but none of them produced the result in the preferred format, and I am unable to get my expected output.
Please help me solve the problem.
I would recommend writing a class specifically for creating and managing logs. Have it initialize a file on creation with the expected first line (don't forget a \n character!), and keep track of any necessary information about that log (the name of the log it created, where it is, etc.). You can then have the class 'write' to the log (append to it, really), create new logs as necessary, and check for existing logs and decide whether to update what exists or scrap it and start over. A sketch of such a class follows.
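A minimal sketch of that idea (class and file names are illustrative, not from the original post):

import datetime
import os

class CsvLogger:
    def __init__(self, path, header='time,data1,data2'):
        self.path = path
        # Write the header line only when creating a new log file
        if not os.path.exists(path):
            with open(path, 'w') as f:
                f.write(header + '\n')

    def log(self, x, y):
        dt = datetime.datetime.now()
        with open(self.path, 'a') as f:
            f.write('"{:%H:%M:%S}",{},{}\n'.format(dt, x, y))

# logger = CsvLogger('test.csv')
# logger.log(23.05, 23.05)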

How to get what is stored in 'data' field of las point using liblas?

I am working with multipulse lidar data that collects points along a number of lines within the flight path. I am trying to determine the name and number of individual lines within the LAS file. I am using the liblas module in Python.
I found this documentation that explains the different fields stored in a LAS file. It mentions a data field (get_data and set_data) at the very bottom of the page.
The 'point data format' and 'point data record length' in the header set aside space for this 'data' field. My header says I have 28 bytes set aside for the data field, and there are 28 values stored in the data field. The 19th value (at least in two datasets from two different sensors) refers to the line number. I have a single value in single pulse data and 4 in multi-pulse data.
I was wondering if there is a standard for what is stored in this field or if it is proprietary.
Also, as a way to get the name of each scan line, I wrote the following code:
import liblas
from liblas import file as lasfile

# Get parameters
las_file = r"E:\Testing\00101.las"
f = lasfile.File(las_file, mode='r')
line_list = []
counter = 0
for p in f:
    line_num = p.data[18]
    if line_num not in line_list:
        line_list.append(line_num)
    counter += 1
print line_list
It results in the following error:
Traceback (most recent call last):
File "D:\Tools\Python_Scripts\point_info.py", line 46, in <module>
line_num = p.data[18]
File "C:\Python27\ArcGIS10.1\lib\site-packages\liblas\point.py", line 560, in get_data
length = self.header.data_record_length
File "C:\Python27\ArcGIS10.1\lib\site-packages\liblas\point.py", line 546, in get_header
return header.Header(handle=core.las.LASPoint_GetHeader(self.handle))
WindowsError: [Error -529697949] Windows Error 0xE06D7363
Does anyone know more about the line numbers stored in the LAS point/header? Can anyone explain the error? It seems to allocate nearly 2 GB of RAM before I get the error. I am on Win XP, so I'm guessing it's a memory error, but I don't understand why accessing this 'data' field hogs memory. Any help is greatly appreciated.
I don't pretend to be an expert in any of this, but I'm intrigued by GIS data so this caught my interest. I installed liblas and its dependencies on my Fedora 19 system and played with the example data files that came with liblas.
Using your code I ran into the same problem of watching all my memory get eaten up. I don't know why that should happen - perhaps unwanted references hanging around preventing the garbage collector from working as we'd hope. This could probably be fixed, but I won't try it.
I did notice some interesting features of the liblas module and decided to try them. I believe you can get the data you seek.
After opening your file, have a look at the XML description from the header.
h = f.get_header()
print(h.get_xml())
It's hard to look at (feel free to play with xml.dom.minidom or lxml.etree), but in my example files it showed the byte layout of the point data (mine had 28 bytes too). In mine, offset 18 was a single short (2 bytes) assigned to Point Source ID.
You should be able to retrieve this with p.data[18:19], p.get_data()[18:19], p.point_source_id, or p.get_point_source_id(). Unfortunately, the data references chew up memory, and p.point_source_id has a bug (a bug-fix pull request has been submitted to the developers). If we change your code to use the last access method, everything seems to work fine. So, try this in your for loop instead:
for p in f:
    line_num = p.get_point_source_id()
    if line_num not in line_list:
        line_list.append(line_num)
    counter += 1
Note that
counter == h.get_count()
If you just want the set of unique Point Source ID values ...
line_set = set(p.get_point_source_id() for p in f)
Hopefully your data value is also available as p.get_point_source_id(). Let me know how it works for you in the comments. Cheers!
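If the raw header XML is hard to read, one small sketch (standard library only, my own illustration) is to pretty-print it before hunting for the byte layout:

from xml.dom import minidom

# h.get_xml() returns the header description as an XML string
doc = minidom.parseString(h.get_xml())
print(doc.toprettyxml(indent='  '))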
