Finding XML data and Converting it into CSV - python

Just need some quick help. Let's say I have the following XML, formatted like so:
<Solution version="1.0">
<DrillHoles total_holes="238">
<description>
<hole hole_id="1">
<hole collar="5720.44, 3070.94, 2642.19" />
<hole toe="5797.82, 3061.01, 2576.29" />
<hole cost="102.12" />
</hole>
........
EDIT: Here is the code I used to create the hole collar, etc.:
for row in reader:
    if i > 0:
        x1,y1,z1,x2,y2,z2,cost = row
        if current_group is None or i != current_group.text:
            current_group = SubElement(description, 'hole',{'hole_id':"%s"%i})
        collar = SubElement (current_group, 'hole',{'collar':', '.join((x1,y1,z1))}),
        toe = SubElement (current_group, 'hole',{'toe':', '.join((x2,y2,z2))})
        cost = SubElement(current_group, 'hole',{'cost':cost})
    i+=1
and so on. How do I obtain the hole collar, hole toe, and hole cost data? Here is my code so far; I think I am really close.
with open(outfile, 'w') as file_:
    writer = csv.writer(file_, delimiter="\t")
    for a in zip(root.findall("drillholes/hole/hole collar"),
                 root.findall("drillholes/hole/hole toe"),
                 root.findall("drillholes/hole/hole cost")):
        writer.writerow([x.text for x in a])
Although my program generates a CSV file, the file is empty, which is why I think this piece of code failed to find the data because the search paths are wrong. Can anyone help?

You don't specify, but I assume you're using xml.etree.ElementTree. There are a few issues here:
1) XML is case-sensitive. "drillholes" is not the same thing as "DrillHoles".
2) Your path is missing the "description" element found in your XML.
3) To refer to attributes, you don't use a space, but another path element prefixed with an "@", as in "hole/@collar".
Taking those into consideration, the answer should just be this:
for a in zip(root.findall("DrillHoles/description/hole/hole/@collar"),
             root.findall("DrillHoles/description/hole/hole/@toe"),
             root.findall("DrillHoles/description/hole/hole/@cost")):
    writer.writerow([x.text for x in a])
But it's not. Testing this, it seems findall really doesn't like returning attribute nodes, probably because they don't exist as such in etree's API. So you could do this:
for a in zip(root.findall("DrillHoles/description/hole/hole[@collar]"),
             root.findall("DrillHoles/description/hole/hole[@toe]"),
             root.findall("DrillHoles/description/hole/hole[@cost]")):
    writer.writerow([a[0].get('collar'), a[1].get('toe'), a[2].get('cost')])
But if you have to list out the attributes in the statements in the for loop anyway, personally I'd just do away with the zip and do this:
for a in root.findall("DrillHoles/description/hole"):
    writer.writerow([a.find("hole[@"+attr+"]").get(attr) for attr in ("collar", "toe", "cost")])
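For reference, here is a minimal end-to-end sketch of that last approach using xml.etree.ElementTree from the standard library. It assumes the XML is shaped like the snippet in the question, with a closing </description> tag that the question does not show. The second hole is made-up sample data, and holes_to_rows is a hypothetical helper name. The rows are written to an in-memory buffer rather than to outfile so the example runs on its own.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample shaped like the question's XML; the second hole is invented for illustration.
SAMPLE_XML = """<Solution version="1.0">
  <DrillHoles total_holes="2">
    <description>
      <hole hole_id="1">
        <hole collar="5720.44, 3070.94, 2642.19" />
        <hole toe="5797.82, 3061.01, 2576.29" />
        <hole cost="102.12" />
      </hole>
      <hole hole_id="2">
        <hole collar="1.0, 2.0, 3.0" />
        <hole toe="4.0, 5.0, 6.0" />
        <hole cost="7.5" />
      </hole>
    </description>
  </DrillHoles>
</Solution>"""

ATTRS = ("collar", "toe", "cost")


def holes_to_rows(root):
    """Return one [collar, toe, cost] row per outer <hole> element."""
    rows = []
    for hole in root.findall("DrillHoles/description/hole"):
        row = []
        for attr in ATTRS:
            # "hole[@collar]" selects the child <hole> that has a collar attribute.
            child = hole.find("hole[@%s]" % attr)
            row.append(child.get(attr) if child is not None else "")
        rows.append(row)
    return rows


root = ET.fromstring(SAMPLE_XML)
rows = holes_to_rows(root)

# Write tab-separated rows to an in-memory buffer (stand-in for the output file).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue())
```

A missing attribute produces an empty cell instead of an AttributeError, which may or may not be what you want for real data.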

Related

How to get an unknown substring between two known substrings, within a giant string/file

I'm trying to get all the substrings under a "customLabel" tag, for example "Month" inside of ...,"customLabel":"Month"},"schema":"metric...
Unusual issue: this is a 1,071,552-character ndjson file consisting of a single line ("for line in file:" is pointless since there's only one).
The best I found was this:
How to find a substring of text with a known starting point but unknown ending point in python
but if I use this, the result obviously doesn't stop (at Month) and keeps writing the whole remainder of the file, the same as when using partition()[2].
Note that Month is only an example; customLabel has about 300 variants and they are not listed (I'm actually doing this to list them...).
To give some details, here's my script so far:
with open("file.ndjson","rt", encoding='utf-8') as ndjson:
    filedata = ndjson.read()
x="customLabel"
count=filedata.count(x)
for i in range (count):
    if filedata.find(x)>0:
        print("Found "+str(i+1))
So right now it correctly tells me how many occurrences of customLabel there are. Instead, I'd like to get the substring that comes after customLabel":" (Month in the example) and put them all in a list, to locate them more easily and enable the use of replace() for translations later on.
I'd guess regexes are the solution, but I'm pretty new to them, so I'm posting this question while I learn about them...
If you want to search for all (even nested) customLabel values like this:
{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}
you can use RegEx patterns with the re module:
import re
label_values = []
# [0-9a-zA-Z\"] matches digits, ASCII letters and the surrounding quotes
regex_pattern = r"\"customLabel\"[ ]?:[ ]?([0-9a-zA-Z\"]+)"
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        values = re.findall(regex_pattern, line)
        label_values.extend(values)
print(label_values) # ['"Month"', '23525235']
# If you don't want the items to have quotations
label_values = [i.replace('"', "") for i in label_values]
print(label_values) # ['Month', '23525235']
Note: If you're only talking about ndjson files and not nested searching, then it'd be better to use the json module to parse the lines and then easily get the value of your specific key which is customLabel.
import json
label = "customLabel"
label_values = []
with open("file.ndjson", "rt", encoding="utf-8") as ndjson:
    for line in ndjson:
        line_json = json.loads(line)
        if line_json.get(label) is not None:
            label_values.append(line_json.get(label))
print(label_values) # ['Month']
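If the labels can be nested at any depth, the two ideas can be combined: parse each line with json and walk the result recursively. The following is only a sketch under that assumption; collect_values is a hypothetical helper name, and the sample line is the one from the example above.

```python
import json


def collect_values(node, key, found=None):
    """Recursively collect every value stored under `key` in parsed JSON."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the value itself may contain more matches.
            collect_values(v, key, found)
    elif isinstance(node, list):
        for item in node:
            collect_values(item, key, found)
    return found


sample_line = '{"customLabel":"Month" , "otherJson" : {"customLabel" : 23525235}}'
labels = collect_values(json.loads(sample_line), "customLabel")
print(labels)  # ['Month', 23525235]
```

Unlike the regex version, this keeps the original JSON types (the number stays an int) and also handles labels containing spaces or punctuation.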

Creating a csv-file from an srt-file ("Friends" subtitles) in python

Currently, I am trying to create a csv file containing the subtitles of NBC's "Friends" and their corresponding starting time. So basically I am trying to turn an srt-file into a csv-file in python.
For those of you that are unfamiliar with srt-files, they look like this:
1
00:00:47,881 --> 00:00:49,757
[CAR HORNS HONKING]
2
00:00:49,966 --> 00:00:52,760
There's nothing to tell.
It's just some guy I work with.
3
00:00:52,969 --> 00:00:55,137
Come on.
You're going out with a guy.
…
Now I have used readlines() to turn it into a list like this:
['\ufeff1\n', '00:00:47,881 --> 00:00:49,757\n', '[CAR HORNS HONKING]\n',
'\n', '2\n', '00:00:49,966 --> 00:00:52,760\n',
"There's nothing to tell.\n", "It's just some guy I work with.\n",
'\n', '3\n', '00:00:52,969 --> 00:00:55,137\n', 'Come on.\n',
"You're going out with a guy.\n", ...]
Is there a way to create a dict or dataframe from this list (or the file it is based on) that contains the starting time (the end time is not needed) and the lines that belong to it? I've been struggling because sometimes just one line corresponds to a starting time, and other times there are two. (There are two lines at most per starting time in this file, but a solution that also works when more lines are present would be preferable.)
Lines that look like the first one ("[CAR HORNS HONKING]") or others that simply say e.g. "CHANDLER:", and their starting times, would ideally not be included, but that's not all that important right now.
Any help is very much appreciated!
I think this code covers your problem. The main idea is to use a regular expression to locate the starting time of each subtitle and extract its value and the corresponding lines. The code is not in its most polished form, but I think the main idea is well expressed. I hope it helps.
import re

# utf-8-sig drops the BOM ('\ufeff') visible at the start of the list above
with open('sub.srt', 'r', encoding='utf-8-sig') as h:
    sub = h.readlines()

re_pattern = r'[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} -->'
regex = re.compile(re_pattern)

# Get start times
start_times = list(filter(regex.search, sub))
start_times = [time.split(' ')[0] for time in start_times]

# Get lines
lines = [[]]
for sentence in sub:
    if regex.match(sentence):
        lines[-1].pop()  # drop the subtitle index that precedes each timestamp
        lines.append([])
    elif sentence.strip():  # skip the blank separator lines
        lines[-1].append(sentence.strip())
lines = lines[1:]

# Merge results
subs = {start_time: line for start_time, line in zip(start_times, lines)}
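To get from there to an actual CSV, and to drop cue lines such as "[CAR HORNS HONKING]" or "CHANDLER:", something along these lines might work. It is a sketch, not a full SRT parser: it assumes blocks are separated by blank lines, the NOISE pattern is only a guess at what such cue lines look like, and parse_srt and the "start"/"text" column names are hypothetical.

```python
import csv
import io
import re

TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) -->")
# Assumed shape of lines to skip: "[SOUND EFFECT]" or an upper-case "NAME:".
NOISE = re.compile(r"\[.*\]|[A-Z .']+:")


def parse_srt(text, drop_noise=True):
    """Return a list of (start_time, subtitle_text) pairs."""
    entries = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.lstrip("\ufeff").strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for pos, line in enumerate(lines):
            match = TIMESTAMP.match(line)
            if match:
                # Everything after the timestamp belongs to this subtitle.
                spoken = lines[pos + 1:]
                if drop_noise:
                    spoken = [s for s in spoken if not NOISE.fullmatch(s)]
                if spoken:
                    entries.append((match.group(1), " ".join(spoken)))
                break
    return entries


SAMPLE = """\ufeff1
00:00:47,881 --> 00:00:49,757
[CAR HORNS HONKING]

2
00:00:49,966 --> 00:00:52,760
There's nothing to tell.
It's just some guy I work with.

3
00:00:52,969 --> 00:00:55,137
Come on.
You're going out with a guy.
"""

entries = parse_srt(SAMPLE)

# Write to an in-memory buffer; swap in open('subs.csv', 'w', newline='') for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["start", "text"])
writer.writerows(entries)
print(buffer.getvalue())
```

Because each block keeps every line after its timestamp, this handles any number of lines per starting time. The list of pairs can also be handed to a dataframe constructor if you prefer that over the csv module.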

Python: extract text from docx to txt via parsing word/document.xml

I would like to extract text from docx files into simple txt file.
I know this problem might seem to be easy or trivial (I hope it will be) but I've looked over dozens of forum topics, spent hours trying to solve by myself and found no solution...
I have borrowed the following code from Etienne's blog.
It works perfectly if I need the content with no formatting. But...
Since my documents contain simple tables, I need these to keep their layout, simply by using tabs.
So instead of this:
Name
Age
Wage
John
30
2000
This should appear:
Name Age Wage
John 30 2000
So that columns don't run into each other, I'd prefer double tabs for longer lines.
I have examined the XML structure a little and found that new rows in tables are indicated by tr, and columns by tc.
So I've tried to modify this a thousand ways, but with no success...
Although it doesn't really work, here is my attempt at a solution:
from lxml.html.defs import form_tags
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile
WORD_NAMESPACE='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
ROW = WORD_NAMESPACE + 'tr'
COL = WORD_NAMESPACE + 'tc'
def get_docx_text(path):
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)
    paragraphs = []
    for item in tree.iter(ROW or COL or PARA):
        texts = []
        print(item)
        if item is ROW:
            texts.append('\n')
        elif item is COL:
            texts.append('\t\t')
        elif item is PARA:
            for node in item.iter(TEXT):
                if node.text:
                    texts.append(node.text)
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)
text_file = open("output.txt", "w")
text_file.write(get_docx_text('input.docx'))
text_file.close()
I'm not very sure what the syntax should look like. The output is empty, and in a few trials it produced something, but that was even worse than nothing.
I put print(item) in just for checking. But instead of every ROW, COL and PARA item, it lists ROWs only. So it seems that in the condition of the for loop the program ignores the or between the terms. If it cannot find ROW, it won't try the 2 remaining options but skips straight to the next item. I tried giving it a list of the terms as well.
Inside the if/elif blocks, I think e.g. if item is ROW should check whether 'item' and 'ROW' are identical (and they actually are).
X or Y or Z evaluates to the first of the three values that is truthy. Non-empty strings are always truthy, so for item in tree.iter(ROW or COL or PARA) evaluates to for item in tree.iter(ROW); this is why you are getting only row elements inside your loop.
The iter() method accepts only one tag name, so you should perhaps just iterate over the whole tree (this won't be a problem if the document is not big).
is is not going to work here. It is an identity operator and only returns True if the objects compared are identical (i.e. the variables compared refer to the same Python object). In your if... elif... you're comparing a constant str (ROW, COL, PARA) with an Element object, so these two are never the same object and each comparison will return False.
Instead you should use something like if item.tag == ROW.
All of the above taken into account, you should rewrite your loop section like this:
for item in tree.iter():
    texts = []
    print(item)
    if item.tag == ROW:
        texts.append('\n')
    elif item.tag == COL:
        texts.append('\t\t')
    elif item.tag == PARA:
        for node in item.iter(TEXT):
            if node.text:
                texts.append(node.text)
    if texts:
        paragraphs.append(''.join(texts))
The answer above won't work the way you asked. This should work for documents containing only tables; some additional parsing with findall should help you isolate non-table data and make this work for a document with tables and other text:
TABLE = WORD_NAMESPACE + 'tbl'
# inside get_docx_text(), after tree = XML(xml_content):
    texts = []  # collects the output pieces for the whole document
    for item in tree.iter(): # use this for loop instead
        #print(item.tag)
        if item.tag == TABLE:
            for row in item.iter(ROW):
                texts.append('\n')
                for col in row.iter(COL):
                    texts.append('\t')
                    for ent in col.iter(TEXT):
                        if ent.text:
                            texts.append(ent.text)
    return ''.join(texts)
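Putting the pieces from both answers together, here is a sketch that handles top-level paragraphs and tables in one pass. It assumes the usual word/document.xml layout, where w:body directly contains w:p and w:tbl elements. The helper names, the sep parameter and the tiny in-memory sample (including the "Staff" paragraph) are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % NS


def paragraph_text(par):
    """Concatenate all w:t text runs inside one paragraph."""
    return "".join(node.text or "" for node in par.iter(W + "t"))


def cell_text(cell):
    """Join the paragraphs of one table cell with spaces."""
    return " ".join(paragraph_text(p) for p in cell.iter(W + "p"))


def document_xml_to_text(xml_bytes, sep="\t"):
    body = ET.fromstring(xml_bytes).find(W + "body")
    out = []
    # Walk only the top-level blocks so paragraphs inside tables are not visited twice.
    for child in body:
        if child.tag == W + "p":
            out.append(paragraph_text(child))
        elif child.tag == W + "tbl":
            for row in child.findall(W + "tr"):
                out.append(sep.join(cell_text(c) for c in row.findall(W + "tc")))
    return "\n".join(out)


def get_docx_text(path, sep="\t"):
    with zipfile.ZipFile(path) as document:
        return document_xml_to_text(document.read("word/document.xml"), sep)


# Tiny hand-built stand-in for word/document.xml, for demonstration only.
def _cell(text):
    return "<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>" % text


def _row(*cells):
    return "<w:tr>%s</w:tr>" % "".join(_cell(c) for c in cells)


SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>' % NS
    + "<w:p><w:r><w:t>Staff</w:t></w:r></w:p>"
    + "<w:tbl>" + _row("Name", "Age", "Wage") + _row("John", "30", "2000") + "</w:tbl>"
    + "</w:body></w:document>"
)

result = document_xml_to_text(SAMPLE.encode("utf-8"))
print(result)
```

Passing sep="\t\t" gives the double tabs mentioned in the question. Joining cells with the separator, rather than appending a tab before each cell, avoids a leading tab on every row.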

Comparing csv files in python to see what is in both

I have 2 csv files that I want to compare: one is a master file of all the countries, and the other has only a few countries. This is an attempt I made for some rudimentary testing:
chars = {}
with open('all.csv', 'rb') as lookupfile:
    for number, line in enumerate(lookupfile):
        chars[line.strip()] = number
with open('locations.csv') as textfile:
    text = textfile.read()
    print text
    for char in text:
        if char in chars:
            print("Country found {0} found in row {1}".format(char, chars[char]))
I am trying to get a final output of the master file of countries with a secondary column indicating if it came up in the other list
Thanks !
Try this:
Write a function to turn the CSV into a Python dictionary whose keys are the countries you found in the CSV. It can just look like this:
{'US':True, 'UK':True}
Do this for both CSV files.
Now, iterate over the dictionary.keys() for the csv you're comparing against, and just check to see if the other dictionary has the same key.
This will be an extremely fast algorithm because dictionaries give us constant time lookup, and you have a data structure which you can easily use to see which countries you found.
As Eric mentioned in comments, you can also use set membership to handle this. This may actually be the simpler, better way to do this:
set1 = set() # A new empty set
set1.add("country")
if "country" in set1:
    pass # do something
You could use exactly the same logic as the original loop, but iterate over lines instead of characters:
with open('locations.csv') as textfile:
    for line in textfile:
        country = line.strip()
        if country in chars:
            print("Country found {0} found in row {1}".format(country, chars[country]))
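To produce the output described in the question (the master list plus a second column saying whether each country appears in the other file), the set approach could be sketched as follows. It assumes one country name per line in both files. The country names are made-up sample data, io.StringIO stands in for the open('all.csv') and open('locations.csv') calls, and mark_found and the column names are hypothetical.

```python
import csv
import io


def mark_found(master_lines, subset_lines):
    """Return (country, found) pairs for every non-empty line of the master list."""
    found = {line.strip() for line in subset_lines if line.strip()}
    countries = (line.strip() for line in master_lines)
    return [(country, country in found) for country in countries if country]


# In-memory stand-ins for all.csv and locations.csv.
master_file = io.StringIO("US\nUK\nFrance\n")
subset_file = io.StringIO("UK\n")

rows = mark_found(master_file, subset_file)

# Write the master list with the extra column to an in-memory buffer.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["country", "in_locations"])
writer.writerows(rows)
print(out.getvalue())
```

Building the set once makes each membership test constant time on average, so the whole comparison is linear in the size of the two files.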

Merge two XML files by matching elements by attribute value

I have two XML files that I'm trying to merge. I looked at other previous questions, but I don't feel like I can solve my problem from reading those. What I think makes my situation unique is that I have to find elements by attribute value and then merge to the opposite file.
I have two files. One is an English translation catalog and the second is a Japanese translation catalog. Please see below.
In the code below you'll see the XML has three elements which I will be merging children on - MessageCatalogueEntry, MessageCatalogueFormEntry, and MessageCatalogueFormItemEntry. I have hundreds of files and each file has thousands of lines. There may be more elements than the three I just listed, but I know for sure that all the elements have a "key" attribute.
My plan:
Iterate through File 1 and create a list of all the values of the "key" attribute.
In this example, the list would be key_values = [321, 260, 320]
Next, I'll go through the key_values list one by one.
I'll search File 1 for an element with attribute key=321.
Next, grab the child of the element with key=321 from File 1.
Next, in File 2, find the element with key=321 and add the child element I previously grabbed from File 1.
Next I'll continue the same process looping through the key_values list.
Next, I'll write the new xml root to a file being careful to keep the utf8 encoding.
File 1:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
File 2:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue[]>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
Output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE MessageCatalogue []>
<PackageEntry>
<MessageCatalogue designNotes="Undefined" isPrivate="false" lastKey="362" name="AddKMRichSearchEngineAdmin_AutoTranslationCatalogue" nested="false" version="3.12.0">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
<MessageCatalogueEntry_loc locale="ja" message="アクティブ" />
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration" title="Spider Configuration"/>
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定" title="Spider Configuration/スパイダー設定" />
</MessageCatalogueFormEntry>
<MessageCatalogueFormItemEntry key="320">
<MessageCatalogueFormItemEntry_loc hintText="" label="Manage Recognised Phrases" locale="" mnemonic="" scriptText=""/>
<MessageCatalogueFormItemEntry_loc hintText="" label="認識されたフレーズを管理" locale="ja" mnemonic="" scriptText="" />
</MessageCatalogueFormItemEntry>
</MessageCatalogue>
</PackageEntry>
I'm having trouble even grabbing elements, never mind grabbing them by key value. For example, I've been playing with the elementtree library and I wrote this code hoping to get just the MessageCatalogueEntry elements, but I'm only getting their children:
from xml.etree import ElementTree as et
tree_japanese = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue_JA.xml')
root_japanese = tree_japanese.getroot()
MC_japanese = root_japanese.findall("MessageCatalogue")
for x in MC_japanese:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
tree_english = et.parse('C:\\blah\\blah\\blah\\AddKMRichSearchEngineAdmin\\AddKMRichSearchEngineAdmin_AutoTranslationCatalogue.xml')
root_english = tree_english.getroot()
MC_english = root_english.findall("MessageCatalogue")
for x in MC_english:
    messageCatalogueEntry = x.findall("MessageCatalogueEntry")
    for m in messageCatalogueEntry:
        print et.tostring(m[0], encoding='utf8')
Any help would be appreciated. I've been at this for a few work days now and I'm not any closer to finishing than I was when I first started!
Actually, you are getting the MessageCatalogueEntry elements. The problem is in the print statement. An element acts like a list, so m[0] is the first child of the MessageCatalogueEntry. In
messageCatalogueEntry = x.findall("MessageCatalogueEntry")
for m in messageCatalogueEntry:
    print et.tostring(m[0], encoding='utf8')
change the print to print et.tostring(m, encoding='utf8') to see the right element.
I personally prefer lxml to elementtree. Assuming you want to associate entries by the 'key' attribute, you could use xpath to index one of the docs and then pull its entries into the other doc.
import lxml.etree
tree_english = lxml.etree.parse('english.xml')
tree_japanese = lxml.etree.parse('japanese.xml')
# index the japanese catalog
j_index = {}
for catalog in tree_japanese.xpath('MessageCatalogue/*[@key]'):
    j_index[catalog.get('key')] = catalog
# find catalog entries in english and merge the japanese
for catalog in tree_english.xpath('MessageCatalogue/*[@key]'):
    j_catalog = j_index.get(catalog.get('key'))
    if j_catalog is not None:
        print 'found match'
        for child in j_catalog:
            print 'add one'
            catalog.append(child)
print lxml.etree.tostring(tree_english, pretty_print=True, encoding='utf8')
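If you would rather stay with the standard library, roughly the same index-and-merge idea can be written with xml.etree.ElementTree under Python 3. This is a sketch on trimmed-down copies of the two files, with the XML declaration, DOCTYPE and most attributes left out. merge_by_key is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for File 1 (English) and File 2 (Japanese).
ENGLISH = """<PackageEntry>
<MessageCatalogue lastKey="362">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="" message="active"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="" shortTitle="Configuration"/>
</MessageCatalogueFormEntry>
</MessageCatalogue>
</PackageEntry>"""

JAPANESE = """<PackageEntry>
<MessageCatalogue lastKey="362">
<MessageCatalogueEntry key="321">
<MessageCatalogueEntry_loc locale="ja" message="アクティブ"/>
</MessageCatalogueEntry>
<MessageCatalogueFormEntry key="260">
<MessageCatalogueFormEntry_loc locale="ja" shortTitle="設定"/>
</MessageCatalogueFormEntry>
</MessageCatalogue>
</PackageEntry>"""


def merge_by_key(target_root, source_root, path="MessageCatalogue/*[@key]"):
    """Append the children of each keyed source entry to the target entry with the same key."""
    source_index = {entry.get("key"): entry for entry in source_root.findall(path)}
    merged = 0
    for entry in target_root.findall(path):
        match = source_index.get(entry.get("key"))
        if match is not None:
            entry.extend(list(match))  # add all children of the matching entry
            merged += len(match)
    return merged


root_english = ET.fromstring(ENGLISH)
root_japanese = ET.fromstring(JAPANESE)
merged = merge_by_key(root_english, root_japanese)

# encoding="unicode" returns a str, so non-ASCII text stays readable when printed.
print(ET.tostring(root_english, encoding="unicode"))
```

For real files, ET.parse(path) followed by tree.write(out_path, encoding="utf-8", xml_declaration=True) would keep the UTF-8 encoding. Note that ElementTree does not preserve the DOCTYPE line, so that part would need separate handling.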
