I want to convert a CSV file to an XML file with Python. I want to group rows with the same id in the CSV file together and convert the CSV into the XML shown below (see desired output). It is a bit more complex than it looks, with the indentation, looping, and grouping involved. All help is appreciated.
My CSV file:
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
my code:
import itertools
import csv
import os
csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>' + "\n" +'<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">' + "\n" )
xmlData.write(' '+'<Roughness-Profile>' + "\n")
rowNum = 0
for row in csvData:
    if rowNum == 0:
        tags = row
        # replace spaces w/ underscores in tag names
        for i in range(len(tags)):
            tags[i] = tags[i].replace(' ', '_')
    else:
        xmlData.write(' '+'<surfaces>' +"\n"+' '+'<surface>' + "\n")
        for i in range(len(tags)):
            xmlData.write(' ' +'<' + tags[i] + '>' \
                + row[i] + '</' + tags[i] + '>' + "\n")
        xmlData.write(' '+'</surface>' + "\n" + ' '+'</surfaces>' + "\n" + ' '+'</Roughness-Profile>' + "\n")
    rowNum += 1
xmlData.write('</Roughness-Profiles>' + "\n")
xmlData.close()
my xml output:
<?xml version="1.0" encoding="UTF-8"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<Roughness-Profile>
<surfaces>
<surface>
<id>a1</id>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c1</id>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>a1</id>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>b2</id>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c2</id>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<surfaces>
<surface>
<id>c1</id>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</Roughness-Profile>
</Roughness-Profiles>
Desired output should be:
<?xml version="1.0" encoding="UTF-8"?>
<R-Profiles xmlns="http://WKI/R-Profiles/1">
<R-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</R-Profile>
</R-Profiles>
I would do something very similar to what @Parfait suggested; use csv.DictReader and lxml to create the XML.
However, something is missing from that answer; the surface elements aren't grouped by id.
If I need to group XML during a transformation, the first thing I think of is XSLT.
Once you get the hang of it, grouping is easy with XSLT, especially 2.0 or greater. Unfortunately, lxml only supports XSLT 1.0, where you need to use Muenchian grouping.
Here's a full example of creating an intermediate XML and transforming it with XSLT.
CSV Input (test.csv)
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rp="http://WKI/Roughness-Profiles/1">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="surface" match="rp:surface" use="rp:id"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/*">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:for-each select="rp:surface[count(.|key('surface',rp:id)[1])=1]">
<xsl:element name="Roughness-Profile" namespace="http://WKI/Roughness-Profiles/1">
<xsl:copy-of select="rp:id"/>
<xsl:element name="surfaces" namespace="http://WKI/Roughness-Profiles/1">
<xsl:apply-templates select="key('surface',rp:id)"/>
</xsl:element>
</xsl:element>
</xsl:for-each>
</xsl:copy>
</xsl:template>
<xsl:template match="rp:id"/>
</xsl:stylesheet>
Python
import csv
import lxml.etree as etree
# INITIALIZING XML FILE WITH ROOT IN PROPER NAMESPACE
nsmap = {None: "http://WKI/Roughness-Profiles/1"}
root = etree.Element('Roughness-Profiles', nsmap=nsmap)
# READING CSV FILE
with open("test.csv") as f:
    reader = csv.DictReader(f)
    # WRITE INITIAL XML NODES
    for row in reader:
        surface_elem = etree.SubElement(root, "surface", nsmap=nsmap)
        for elem_name, elem_value in row.items():
            etree.SubElement(surface_elem, elem_name.strip(), nsmap=nsmap).text = str(elem_value)
# PARSE XSLT AND CREATE TRANSFORMER
xslt_root = etree.parse("test.xsl")
transform = etree.XSLT(xslt_root)
# TRANSFORM
# (Note the weird use of tostring/fromstring. This was used so
# namespaces in the XSLT would work the way they're supposed to.)
final_xml = transform(etree.fromstring(etree.tostring(root)))
# WRITE OUTPUT TO FILE
final_xml.write_output("test.xml")
XML Output (test.xml)
<?xml version="1.0"?>
<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<Roughness-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</Roughness-Profile>
<Roughness-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</Roughness-Profile>
</Roughness-Profiles>
First read all rows from CSV and sort them.
Later you can use the variable previous_id to open and close Roughness-Profile/surfaces only when the id in the new row is different from the one in the previous row.
I used StringIO to simulate the csv file and sys.stdout to simulate the xml file, so everybody can copy the code and run it to see how it works.
text ='''id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5'''
from io import StringIO
import csv
import sys
#csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
#xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
#csvData = csv.reader(open(csvFile))
#xmlData = open(xmlFile, 'w')
csvData = csv.reader(StringIO(text))
xmlData = sys.stdout
# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]] # headers
csvData = sorted(csvData[1:]) # sort data without headers
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')
previous_id = None
for row in csvData:
    row_id = row[0]
    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id
    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')
# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')
#xmlData.close()
Version without StringIO and sys.stdout
import csv
csvFile = r'C:\Users\Desktop\test XML\csvfile.csv'
xmlFile = r'C:\Users\Desktop\test XML\myData.xml'
csvData = csv.reader(open(csvFile))
xmlData = open(xmlFile, 'w')
# read all data to sort them
csvData = list(csvData)
tags = [item.replace(' ', '_') for item in csvData[0]] # headers
csvData = sorted(csvData[1:]) # sort data without headers
xmlData.write('<?xml version="1.0" encoding="UTF-8"?>\n<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">\n')
previous_id = None
for row in csvData:
    row_id = row[0]
    if row_id != previous_id:
        # close previous group - but only if it is not the first group
        if previous_id is not None:
            xmlData.write('</surfaces>\n</Roughness-Profile>\n')
        # open new group
        xmlData.write('<Roughness-Profile>\n<id>{}</id>\n<surfaces>\n'.format(row_id))
        # remember new group's id
        previous_id = row_id
    # surface
    xmlData.write('<surface>\n')
    for value, tag in zip(row[1:], tags[1:]):
        xmlData.write('<{}>{}</{}>\n'.format(tag, value, tag))
    xmlData.write('</surface>\n')
# close last group
xmlData.write('</surfaces>\n</Roughness-Profile>\n')
xmlData.write('</Roughness-Profiles>\n')
xmlData.close()
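Since the rows are sorted first anyway, the same grouping can also be written with itertools.groupby (which the question already imports but never uses). A sketch over a reduced three-column sample, assuming the same header/row shape:

```python
import csv
import itertools
from io import StringIO

text = """id,x1,c1
a1,1.3,B
b2,1.1,G
a1,2.2,S"""

rows = list(csv.reader(StringIO(text)))
tags, data = rows[0], sorted(rows[1:])  # sort so groupby sees equal ids together

parts = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<Roughness-Profiles xmlns="http://WKI/Roughness-Profiles/1">']
for row_id, group in itertools.groupby(data, key=lambda r: r[0]):
    parts.append('<Roughness-Profile>\n<id>{}</id>\n<surfaces>'.format(row_id))
    for row in group:
        parts.append('<surface>')
        parts.extend('<{0}>{1}</{0}>'.format(t, v) for t, v in zip(tags[1:], row[1:]))
        parts.append('</surface>')
    parts.append('</surfaces>\n</Roughness-Profile>')
parts.append('</Roughness-Profiles>')
xml_text = '\n'.join(parts)
```

groupby only merges adjacent equal keys, which is why the sort (or any equivalent ordering) has to happen first.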
Because XML files are not plain text files but structured, text-based documents adhering to W3C specifications, avoid building the document by string concatenation.
Instead, use the appropriate DOM libraries available in virtually all modern programming languages, including Python with its built-in xml.etree or the more robust third-party module lxml. In fact, because your desired output involves grouping nodes by id, consider running XSLT, the special-purpose language designed to transform XML files; lxml can run XSLT 1.0 scripts.
Below, the DictReader of the built-in csv module is used to build a nested dictionary (all columns grouped under id keys). The XML is then built by iterating through the contents of this dictionary, writing data to element nodes.
import csv
from collections import OrderedDict
import lxml.etree as ET
# BUILD NESTED ID DICTIONARY FROM CSV
with open("Input.csv") as f:
    reader = csv.DictReader(f)
    id_dct = OrderedDict({})
    for dct in reader:
        if dct["id"] not in id_dct.keys():
            id_dct[dct["id"]] = [OrderedDict({k: v for k, v in dct.items() if k != "id"})]
        else:
            id_dct[dct["id"]].append(OrderedDict({k: v for k, v in dct.items() if k != "id"}))
# INITIALIZING XML FILE WITH ROOT AND NAMESPACE
root = ET.Element('R-Profiles', nsmap={None: "http://WKI/Roughness-Profiles/1"})
# WRITING TO XML NODES
for k, v in id_dct.items():
    rpNode = ET.SubElement(root, "R-Profile")
    ET.SubElement(rpNode, "id").text = str(k)
    surfacesNode = ET.SubElement(rpNode, "surfaces")
    for dct in v:
        surfaceNode = ET.SubElement(surfacesNode, "surface")
        for tag, value in dct.items():  # renamed to avoid shadowing the outer k, v
            ET.SubElement(surfaceNode, tag).text = str(value)
# OUTPUT XML CONTENT TO FILE
tree_out = ET.tostring(root, pretty_print=True, xml_declaration=True, encoding="UTF-8")
with open('Output.xml', 'wb') as f:
    f.write(tree_out)
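The key-membership check above can be avoided with collections.defaultdict, which creates the list on first access. A sketch over a reduced sample with the same row shape (in Python 3.7+ plain dicts keep insertion order, so OrderedDict is optional):

```python
import csv
from collections import defaultdict
from io import StringIO

text = """id,x1,c1
a1,1.3,B
b2,1.1,G
a1,2.2,S"""

id_dct = defaultdict(list)
for dct in csv.DictReader(StringIO(text)):
    row_id = dct.pop("id")   # drop id, keep the remaining columns
    id_dct[row_id].append(dct)
```

Each key then maps to the list of its surface dicts, ready for the element-building loop.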
Input.csv
id,x1,y1,z1,x2,y2,z2,c1,R
a1,1.3,2.1,3.6,4.5,5.1,6.8,B,7.3
b2,1.1,2.1,3.1,4.1,5.1,6.1,G,7.1
c1,2.1,3.1,4.1,5.1,2.1,7.1,G,8.1
a1,2.2,3.2,4.2,5.2,6.2,7.2,S,8.2
b2,4.1,5.1,2.1,7.1,8.1,9.1,S,2.5
b2,3.6,4.5,5.1,6.3,7.4,8.2,G,3.1
c2,6.1,7.1,8.1,9.1,2.1,11.1,S,3.2
c1,1.5,1.5,1.5,1.5,1.5,1.5,A,1.5
Output.xml
<?xml version='1.0' encoding='UTF-8'?>
<R-Profiles xmlns="http://WKI/Roughness-Profiles/1">
<R-Profile>
<id>a1</id>
<surfaces>
<surface>
<x1>1.3</x1>
<y1>2.1</y1>
<z1>3.6</z1>
<x2>4.5</x2>
<y2>5.1</y2>
<z2>6.8</z2>
<c1>B</c1>
<R>7.3</R>
</surface>
<surface>
<x1>2.2</x1>
<y1>3.2</y1>
<z1>4.2</z1>
<x2>5.2</x2>
<y2>6.2</y2>
<z2>7.2</z2>
<c1>S</c1>
<R>8.2</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>b2</id>
<surfaces>
<surface>
<x1>1.1</x1>
<y1>2.1</y1>
<z1>3.1</z1>
<x2>4.1</x2>
<y2>5.1</y2>
<z2>6.1</z2>
<c1>G</c1>
<R>7.1</R>
</surface>
<surface>
<x1>4.1</x1>
<y1>5.1</y1>
<z1>2.1</z1>
<x2>7.1</x2>
<y2>8.1</y2>
<z2>9.1</z2>
<c1>S</c1>
<R>2.5</R>
</surface>
<surface>
<x1>3.6</x1>
<y1>4.5</y1>
<z1>5.1</z1>
<x2>6.3</x2>
<y2>7.4</y2>
<z2>8.2</z2>
<c1>G</c1>
<R>3.1</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c1</id>
<surfaces>
<surface>
<x1>2.1</x1>
<y1>3.1</y1>
<z1>4.1</z1>
<x2>5.1</x2>
<y2>2.1</y2>
<z2>7.1</z2>
<c1>G</c1>
<R>8.1</R>
</surface>
<surface>
<x1>1.5</x1>
<y1>1.5</y1>
<z1>1.5</z1>
<x2>1.5</x2>
<y2>1.5</y2>
<z2>1.5</z2>
<c1>A</c1>
<R>1.5</R>
</surface>
</surfaces>
</R-Profile>
<R-Profile>
<id>c2</id>
<surfaces>
<surface>
<x1>6.1</x1>
<y1>7.1</y1>
<z1>8.1</z1>
<x2>9.1</x2>
<y2>2.1</y2>
<z2>11.1</z2>
<c1>S</c1>
<R>3.2</R>
</surface>
</surfaces>
</R-Profile>
</R-Profiles>
Related
I am trying to access the name of a picture in a Word document using python-docx, but I also need to know which paragraph and run it is contained in, so I cannot use inline_shapes.
docx = Document()
section = docx.sections[0]
p = docx.add_paragraph()
run = p.add_run()
img = run.add_picture("pptExporter.png", 100000)
a = run._r.xml
print(a)
b = run._r.drawing.inline.graphic.graphicData.pic.nvPicPr.get('name')
p2 = docx.add_paragraph("hello", style="Caption")
docx.save("test.docx")
When I print the xml I get:
<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:drawing>
<wp:inline xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<wp:extent cx="100000" cy="75021"/>
<wp:docPr id="1" name="Picture 1"/>
<wp:cNvGraphicFramePr>
<a:graphicFrameLocks noChangeAspect="1"/>
</wp:cNvGraphicFramePr>
<a:graphic>
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic>
<pic:nvPicPr>
<pic:cNvPr id="0" name="pptExporter.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId9"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="100000" cy="75021"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
</w:r>
but I get the following error:
Traceback (most recent call last):
File "file", line 11, in <module>
b = run._r.drawing.inline.graphic.graphicData.pic.nvPicPr.get('name')
AttributeError: 'CT_R' object has no attribute 'drawing'
I have an XML file like this:
<?xml version="1.0"?>
<PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
</Object_Arrt>
<Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
</PropertySet>
I am hoping to extract the Name, Object_Num, Orig_Id and Attr_Name attributes from it using Python and convert them into .csv format.
The .csv format I'd like to see it in is simply:
ProductId Product AttributeId Attribute
2008a Laptop 6666p LP_Portable
2987d Mouse 7010p O_Portable
2987d Mouse 7012p O_Wireless
5463g Speaker "" ""
The relationship between the XML tags is:
All products are in the tags "ImpExp Type="PROD_DEF" ...".
All attributes are in the tags "ImpExp Type="CLASS_DEF" ...".
If a product has attributes, then there is a tag
<Object_Def Ancestor_Num="1023i" ...>
whose Ancestor_Num is equal to the Object_Num in the Type="CLASS_DEF" tags.
I have tried this:
from lxml import etree
import pandas
import HTMLParser
inFile = "./newm.xml"
outFile = "./new.csv"
ctx1 = etree.iterparse(inFile, tag=("ImpExp", "ListOfObject_Def", "ListOfObject_Arrt",))
hp = HTMLParser.HTMLParser()
csvData = []
csvData1 = []
csvData2 = []
csvData3 = []
csvData4 = []
csvData5 = []
for event, elem in ctx1:
    value1 = elem.get("Type")
    value2 = elem.get("Name")
    value3 = elem.get("Object_Num")
    value4 = elem.get("Ancestor_Num")
    value5 = elem.get("Orig_Id")
    value6 = elem.get("Attr_Name")
    if value1 == "PROD_DEF":
        csvData.append(value2)
        csvData1.append(value3)
    for event, elem in ctx1:
        if value4 is not None:
            csvData2.append(value4)
        elem.clear()
df = pandas.DataFrame({'Product': csvData, 'ProductId': csvData1, 'AncestorId': csvData2})
for event, elem in ctx1:
    if value1 == "Class Def":
        csvData3.append(value3)
        csvData4.append(value5)
        csvData5.append(value6)
    elem.clear()
df1 = pandas.DataFrame({'AncestorId':csvData3, 'AttribId':csvData4, 'AttribName':csvData5})
dff = pandas.merge(df, df1, on="AncestorId")
dff.to_csv(outFile, index = False)
Consider XSLT, the special-purpose language designed to transform XML files; it can directly convert XML to CSV (i.e., a text file) without the pandas DataFrame intermediary. Python's third-party module lxml (which you are already using) can run XSLT 1.0 scripts, and do so without for loops or if logic. However, due to the complex alignment of products and attributes, some longer XPath expressions are needed in the XSLT.
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="no" method="text"/>
<xsl:strip-space elements="*"/>
<xsl:param name="delimiter">,</xsl:param>
<xsl:template match="/PropertySet">
<xsl:text>ProductId,Product,AttributeId,Attribute
</xsl:text>
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="PropertySet|Message|ListOf_Class_Def|ListOf_Prod_Def|ImpExp">
<xsl:apply-templates select="*"/>
</xsl:template>
<xsl:template match="ListOfObject_Arrt">
<xsl:apply-templates select="Object_Arrt"/>
<xsl:if test="name(*) != 'Object_Arrt' and preceding-sibling::ListOfObject_Def/Object_Def/@Ancestor_Name = ''">
<xsl:value-of select="concat(ancestor::ImpExp/@Name, $delimiter,
ancestor::ImpExp/@Object_Num, $delimiter,
'', $delimiter,
'')"/><xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
<xsl:template match="Object_Arrt">
<xsl:variable name="attrName" select="ancestor::ImpExp/@Name"/>
<xsl:value-of select="concat(/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Name, $delimiter,
/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Object_Num, $delimiter,
@Orig_Id, $delimiter,
@Attr_Name)"/><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
# RUN TRANSFORMATION
transform = et.XSLT(xsl)
result = transform(xml)
# OUTPUT TO FILE
with open('Output.csv', 'wb') as f:
    f.write(str(result).encode('utf-8'))  # serialize the XSLT result before writing bytes
Output
ProductId,Product,AttributeId,Attribute
Laptop,2008a,6666p,LP_Portable
Mouse,2987d,7010p,O_Portable
Mouse,2987d,7012j,O_wireless
Speaker,5463g,,
You would need to preparse all of the CLASS_DEF entries into a dictionary. These can then be looked up when processing the PROD_DEF entries:
import csv
from lxml import etree
inFile = "./newm.xml"
outFile = "./new.csv"
tree = etree.parse(inFile)
class_defs = {}
# First extract all the CLASS_DEF entries into a dictionary
for impexp in tree.iter("ImpExp"):
    name = impexp.get('Name')
    if impexp.get('Type') == "CLASS_DEF":
        for list_of_object_arrt in impexp.findall('ListOfObject_Arrt'):
            class_defs[name] = [(obj.get('Orig_Id'), obj.get('Attr_Name')) for obj in list_of_object_arrt]
with open(outFile, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['ProductId', 'Product', 'AttributeId', 'Attribute'])
    for impexp in tree.iter("ImpExp"):
        object_num = impexp.get('Object_Num')
        name = impexp.get('Name')
        if impexp.get('Type') == "PROD_DEF":
            for list_of_object_def in impexp.findall('ListOfObject_Def'):
                for obj in list_of_object_def:
                    ancestor_num = obj.get('Ancestor_Num')
                    ancestor_name = obj.get('Ancestor_Name')
                    csv_output.writerow([object_num, name] + list(class_defs.get(ancestor_name, [['', '']])[0]))
This would produce new.csv containing:
ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
5463g,Speaker,,
If you are using Python 3.x, use:
with open(outFile, 'w', newline='') as f_output:
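Putting that together, the Python 3 version of the writing block looks like this (a sketch of just the file-handling change; the filename and sample row are illustrative):

```python
import csv

# text mode with newline="" is the Python 3 idiom for csv.writer
with open("new.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["ProductId", "Product", "AttributeId", "Attribute"])
    csv_output.writerow(["2008a", "Laptop", "6666p", "LP_Portable"])
```

Without newline="", the csv module would emit an extra blank line between rows on Windows.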
I am looking for a way to automate the conversion of CSV to XML.
Here is an example of a CSV file, containing a list of movies:
Here is the file in XML format:
<collection shelf="New Arrivals">
<movietitle="Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movietitle="Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
<movietitle="Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movietitle="Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
I've tried a few examples where I am able to read the csv and XML formats using Python with DOM and SAX, but I have yet to find a simple example of the conversion. So far I have:
import csv
f = open('movies2.csv')
csv_f = csv.reader(f)
def convert_row(row):
    return """<movietitle="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (
        row.Title, row.Type, row.Format, row.Year, row.Rating, row.Stars, row.Description)
print ('\n'.join(csv_f.apply(convert_row, axis=1)))
But I get the error:
File "moviesxml.py", line 16, in module
print ('\n'.join(csv_f.apply(convert_row, axis=1)))
AttributeError: '_csv.reader' object has no attribute 'apply'
I am pretty new to Python, so any help would be much appreciated!
I am using Python 3.5.2.
Thanks!
Lisa
A possible solution is to first load the csv into Pandas and then convert it row by row into XML, like so:
import pandas as pd
df = pd.read_csv('untitled.txt', sep='|')
With the sample data (assuming separator and so on) loaded as:
Title Type Format Year Rating Stars \
0 Enemy Behind War,Thriller DVD 2003 PG 10
1 Transformers Anime,Science Fiction DVD 1989 R 9
Description
0 Talk about...
1 A Schientific fiction
And then converting to xml with a custom function:
def convert_row(row):
    return """<movietitle="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (
        row.Title, row.Type, row.Format, row.Year, row.Rating, row.Stars, row.Description)
print('\n'.join(df.apply(convert_row, axis=1)))
This way you get a string containing the xml:
<movietitle="Enemy Behind">
<type>War,Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about...</description>
</movie>
<movietitle="Transformers">
<type>Anime,Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>9</stars>
<description>A Schientific fiction</description>
</movie>
that you can dump in to a file or whatever.
Inspired by this great answer.
Edit: Using the loading method you posted (or a version that actually loads the data to a variable):
import csv
f = open('movies2.csv')
csv_f = csv.reader(f)
data = []
for row in csv_f:
    data.append(row)
f.close()
print(data[1:])
We get:
[['Enemy Behind', 'War', 'Thriller', 'DVD', '2003', 'PG', '10', 'Talk about...'], ['Transformers', 'Anime', 'Science Fiction', 'DVD', '1989', 'R', '9', 'A Schientific fiction']]
And we can convert to XML with minor modifications:
def convert_row(row):
    return """<movietitle="%s">
<type>%s</type>
<format>%s</format>
<year>%s</year>
<rating>%s</rating>
<stars>%s</stars>
<description>%s</description>
</movie>""" % (row[0], row[1], row[2], row[3], row[4], row[5], row[6])
print('\n'.join([convert_row(row) for row in data[1:]]))
Getting very similar results (note that the unquoted commas in the Type field get split across columns here):
<movietitle="Enemy Behind">
<type>War</type>
<format>Thriller</format>
<year>DVD</year>
<rating>2003</rating>
<stars>PG</stars>
<description>10</description>
</movie>
<movietitle="Transformers">
<type>Anime</type>
<format>Science Fiction</format>
<year>DVD</year>
<rating>1989</rating>
<stars>R</stars>
<description>9</description>
</movie>
I tried to generalize robertoia's function convert_row for any header instead of writing it by hand.
import csv
import pandas as pd
f = open('movies2.csv')
csv_f = csv.reader(f)
data = []
for row in csv_f:
    data.append(row)
f.close()
df = pd.read_csv('movies2.csv')
header= list(df.columns)
def convert_row(row):
    str_row = """<%s>%s</%s> \n""" * (len(header)-1)
    str_row = """<%s>%s""" + "\n" + str_row + """</%s>"""
    var_values = [list_of_elements[k] for k in range(1, len(header)) for list_of_elements in [header, row, header]]
    var_values = [header[0], row[0]] + var_values + [header[0]]
    var_values = tuple(var_values)
    return str_row % var_values
text ="""<collection shelf="New Arrivals">"""+"\n"+'\n'.join([convert_row(row) for row in data[1:]])+"\n" +"</collection >"
print(text)
with open('output.xml', 'w') as myfile:
myfile.write(text)
Of course, with recent pandas it is simpler to just use to_xml():
df = pd.read_csv('movies2.csv')
with open('outputf.xml', 'w') as myfile:
    myfile.write(df.to_xml())
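to_xml also accepts root_name and row_name, so the collection/movie element names can be produced directly. A sketch on inline data (note this emits child elements like <Title> rather than the movie title="..." attribute form shown in the question, and parser="etree" avoids the lxml dependency):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Enemy Behind", "Transformers"],
    "Format": ["DVD", "DVD"],
})

# index=False drops the numeric index column from the output
xml_text = df.to_xml(index=False, root_name="collection",
                     row_name="movie", parser="etree")
```

For attribute-style output, to_xml also has an attr_cols parameter.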
I found an easier way to insert variables into a string or block of text:
'''Twas brillig and the slithy {what}
Did gyre and gimble in the {where}
All {how} were the borogoves
And the {who} outgrabe.'''.format(what='toves',
where='wabe',
how='mimsy',
who='momeraths')
Alternatively:
'''Twas brillig and the slithy {0}
Did gyre and gimble in the {1}
All {2} were the borogoves
And the {3} outgrabe.'''.format('toves',
'wabe',
'mimsy',
'momeraths')
(substitute name of incoming data variable for 'toves', 'wabe', 'mimsy', and 'momeraths')
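On Python 3.6+, f-strings do the same substitution inline, without a separate .format() call:

```python
what, where, how, who = 'toves', 'wabe', 'mimsy', 'momeraths'
verse = (f"Twas brillig and the slithy {what}\n"
         f"Did gyre and gimble in the {where}\n"
         f"All {how} were the borogoves\n"
         f"And the {who} outgrabe.")
```

The variables are evaluated where they appear, which keeps the template and the data visually together.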
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Chengam, Tamil Nadu 606701, India</formatted_address>
    <address_component>
      <long_name>Chengam</long_name>
      <short_name>Chengam</short_name>
      <type>locality</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>Tiruvannamalai</long_name>
      <short_name>Tiruvannamalai</short_name>
      <type>administrative_area_level_2</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>Tamil Nadu</long_name>
      <short_name>TN</short_name>
      <type>administrative_area_level_1</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>India</long_name>
      <short_name>IN</short_name>
      <type>country</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>606701</long_name>
      <short_name>606701</short_name>
      <type>postal_code</type>
    </address_component>
    <geometry>
      <location>
        <lat>12.3067864</lat>
        <lng>78.7957856</lng>
      </location>
      <location_type>APPROXIMATE</location_type>
      <viewport>
        <southwest>
          <lat>12.2982423</lat>
          <lng>78.7832165</lng>
        </southwest>
        <northeast>
          <lat>12.3213030</lat>
          <lng>78.8035583</lng>
        </northeast>
      </viewport>
      <bounds>
        <southwest>
          <lat>12.2982423</lat>
          <lng>78.7832165</lng>
        </southwest>
        <northeast>
          <lat>12.3213030</lat>
          <lng>78.8035583</lng>
        </northeast>
      </bounds>
    </geometry>
    <place_id>ChIJu8JCb3jxrDsRAOfhACQczWo</place_id>
  </result>
</GeocodeResponse>
I am new to XML and I don't know how to handle it with Python's xml.etree. The basics I read from https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml are useful, but I am still struggling to print out the latitude and longitude values under geometry --> location. I have tried something like this:
with open('data.xml', 'w') as f:
    f.write(xmlURL.text)

tree = ET.parse('data.xml')
root = tree.getroot()
lat = root.find(".//geometry/location")
print(lat.text)
You almost got it. Change root.find(".//geometry/location") to root.find(".//geometry/location/lat"):
lat = root.find(".//geometry/location/lat")
print(lat.text)
>> 12.3067864
Same goes for lng of course:
lng = root.find(".//geometry/location/lng")
print(lng.text)
>> 78.7957856
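Putting it together, a self-contained sketch that parses the response and pulls both values out as floats (the XML here is abbreviated to the relevant elements; element names follow the response above):

```python
import xml.etree.ElementTree as ET

# Abbreviated geocode response, keeping only the path we need
xml_text = """<GeocodeResponse>
  <result>
    <geometry>
      <location>
        <lat>12.3067864</lat>
        <lng>78.7957856</lng>
      </location>
    </geometry>
  </result>
</GeocodeResponse>"""

root = ET.fromstring(xml_text)
loc = root.find(".//geometry/location")
lat = float(loc.find("lat").text)
lng = float(loc.find("lng").text)
print(lat, lng)  # -> 12.3067864 78.7957856
```

Note this works because the response has no XML namespace; a namespaced document would need the `{uri}tag` form or a namespace map in find().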
I am trying to get some Python code working that finds peaks in data using the Lomb-Scargle method:
http://www.astropython.org/snippets/fast-lomb-scargle-algorithm32/
Using this method as below,
import lomb
x = np.arange(10)
y = np.sin(x)
fx,fy, nout, jmax, prob = lomb.fasper(x,y, 6., 6.)
print jmax
works fine, without problems. It prints 8. However, on another piece of data (data dump below),
df = pd.read_csv('extinct.csv',header=None)
Y = pd.rolling_mean(df[0],window=5)
fx,fy, nout, jmax, prob = lomb.fasper(np.array(Y.index),np.array(Y),6.,6.)
print jmax
it displays only 0. I tried passing different ofac, hifac values; none gives me sensible values.
Main function
"""
from numpy import *
from numpy.fft import *

def __spread__(y, yy, n, x, m):
    """
    Given an array yy(0:n-1), extirpolate (spread) a value y into
    m actual array elements that best approximate the "fictional"
    (i.e., possibly noninteger) array element number x. The weights
    used are coefficients of the Lagrange interpolating polynomial.
    Arguments:
      y  : value to spread
      yy : workspace array, updated in place
      n  : length of yy
      x  : (possibly noninteger) index at which to spread
      m  : number of actual array elements to spread over
    Returns:
      None (yy is modified in place)
    """
    nfac = [0, 1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880]
    if m > 10.:
        print 'factorial table too small in spread'
        return
    ix = long(x)
    if x == float(ix):
        yy[ix] = yy[ix] + y
    else:
        ilo = long(x - 0.5 * float(m) + 1.0)
        ilo = min(max(ilo, 1), n - m + 1)
        ihi = ilo + m - 1
        nden = nfac[m]
        fac = x - ilo
        for j in range(ilo + 1, ihi + 1):
            fac = fac * (x - j)
        yy[ihi] = yy[ihi] + y * fac / (nden * (x - ihi))
        for j in range(ihi - 1, ilo - 1, -1):
            nden = (nden / (j + 1 - ilo)) * (j - ihi)
            yy[j] = yy[j] + y * fac / (nden * (x - j))

def fasper(x, y, ofac, hifac, MACC=4):
    """ function fasper
    Given abscissas x (which need not be equally spaced) and ordinates
    y, and given a desired oversampling factor ofac (a typical value
    being 4 or larger), this routine creates an array wk1 with a
    sequence of nout increasing frequencies (not angular frequencies)
    up to hifac times the "average" Nyquist frequency, and creates
    an array wk2 with the values of the Lomb normalized periodogram at
    those frequencies. The arrays x and y are not altered. This
    routine also returns jmax such that wk2(jmax) is the maximum
    element in wk2, and prob, an estimate of the significance of that
    maximum against the hypothesis of random noise. A small value of prob
    indicates that a significant periodic signal is present.

    Reference:
      Press, W. H. & Rybicki, G. B. 1989
      ApJ vol. 338, p. 277-280.
      Fast algorithm for spectral analysis of unevenly sampled data
      (1989ApJ...338..277P)

    Arguments:
      X     : Abscissas array, (e.g. an array of times).
      Y     : Ordinates array, (e.g. corresponding counts).
      Ofac  : Oversampling factor.
      Hifac : Hifac * "average" Nyquist frequency = highest frequency
              for which values of the Lomb normalized periodogram will
              be calculated.

    Returns:
      Wk1  : An array of Lomb periodogram frequencies.
      Wk2  : An array of corresponding values of the Lomb periodogram.
      Nout : Wk1 & Wk2 dimensions (number of calculated frequencies).
      Jmax : The array index corresponding to the MAX( Wk2 ).
      Prob : False Alarm Probability of the largest Periodogram value.
      MACC : Number of interpolation points per 1/4 cycle
             of highest frequency.

    History:
      02/23/2009, v1.0, MF
      Translation of IDL code (orig. Numerical Recipes)
    """
    # Check dimensions of input arrays
    n = long(len(x))
    if n != len(y):
        print 'Incompatible arrays.'
        return

    nout = 0.5 * ofac * hifac * n
    nfreqt = long(ofac * hifac * n * MACC)  # Size the FFT as next power
    nfreq = 64L                             # of 2 above nfreqt.
    while nfreq < nfreqt:
        nfreq = 2 * nfreq
    ndim = long(2 * nfreq)

    # Compute the mean, variance
    ave = y.mean()
    # sample variance because the divisor is N-1
    var = ((y - y.mean()) ** 2).sum() / (len(y) - 1)
    # and range of the data.
    xmin = x.min()
    xmax = x.max()
    xdif = xmax - xmin

    # Extirpolate the data into the workspaces
    wk1 = zeros(ndim, dtype='complex')
    wk2 = zeros(ndim, dtype='complex')
    fac = ndim / (xdif * ofac)
    fndim = ndim
    ck = ((x - xmin) * fac) % fndim
    ckk = (2.0 * ck) % fndim
    for j in range(0L, n):
        __spread__(y[j] - ave, wk1, ndim, ck[j], MACC)
        __spread__(1.0, wk2, ndim, ckk[j], MACC)

    # Take the Fast Fourier Transforms
    wk1 = ifft(wk1) * len(wk1)
    wk2 = ifft(wk2) * len(wk1)
    wk1 = wk1[1:nout + 1]
    wk2 = wk2[1:nout + 1]
    rwk1 = wk1.real
    iwk1 = wk1.imag
    rwk2 = wk2.real
    iwk2 = wk2.imag
    df = 1.0 / (xdif * ofac)

    # Compute the Lomb value for each frequency
    hypo2 = 2.0 * abs(wk2)
    hc2wt = rwk2 / hypo2
    hs2wt = iwk2 / hypo2
    cwt = sqrt(0.5 + hc2wt)
    swt = sign(hs2wt) * (sqrt(0.5 - hc2wt))
    den = 0.5 * n + hc2wt * rwk2 + hs2wt * iwk2
    cterm = (cwt * rwk1 + swt * iwk1) ** 2. / den
    sterm = (cwt * iwk1 - swt * rwk1) ** 2. / (n - den)
    wk1 = df * (arange(nout, dtype='float') + 1.)
    wk2 = (cterm + sterm) / (2.0 * var)
    pmax = wk2.max()
    jmax = wk2.argmax()

    # Significance estimation
    # expy = exp(-wk2)
    # effm = 2.0*(nout)/ofac
    # sig = effm*expy
    # ind = (sig > 0.01).nonzero()
    # sig[ind] = 1.0-(1.0-expy[ind])**effm

    # Estimate significance of largest peak value
    expy = exp(-pmax)
    effm = 2.0 * (nout) / ofac
    prob = effm * expy
    if prob > 0.01:
        prob = 1.0 - (1.0 - expy) ** effm
    return wk1, wk2, nout, jmax, prob

def getSignificance(wk1, wk2, nout, ofac):
    """ Returns the peak false alarm probabilities:
    the lower the probability, the more significant the peak.
    """
    expy = exp(-wk2)
    effm = 2.0 * (nout) / ofac
    sig = effm * expy
    ind = (sig > 0.01).nonzero()
    sig[ind] = 1.0 - (1.0 - expy[ind]) ** effm
    return sig
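One detail worth checking in the calling code above (this is my conjecture, not something stated in the question): pd.rolling_mean(df[0], window=5) leaves NaN in the first window - 1 entries, and after np.array(Y) those NaNs make y.mean() and the variance inside fasper NaN, so the whole periodogram becomes NaN and argmax of an all-NaN array is 0, which would match jmax printing 0. A minimal sketch with the modern rolling API:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the extinct.csv column
s = pd.Series(np.sin(np.arange(20, dtype=float)))

smoothed = s.rolling(window=5).mean()   # modern spelling of pd.rolling_mean
n_nan = int(smoothed.isna().sum())      # first window-1 entries are NaN

# Converting to a plain numpy array, as the question does, lets the
# NaNs poison every downstream mean/variance computation:
y = np.array(smoothed)
mean_is_nan = bool(np.isnan(y.mean()))

clean = smoothed.dropna()               # one possible fix: drop the NaN head
```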
Data:
13.5945121951
13.5945121951
12.6615853659
12.6615853659
12.6615853659
4.10975609756
4.10975609756
4.10975609756
7.99695121951
7.99695121951
16.237804878
16.237804878
16.237804878
16.0823170732
16.237804878
16.237804878
8.92987804878
8.92987804878
10.6402439024
10.6402439024
28.0548780488
28.0548780488
28.0548780488
27.8993902439
27.8993902439
41.5823170732
41.5823170732
41.5823170732
41.5823170732
41.5823170732
41.5823170732
18.7256097561
15.9268292683
15.9268292683
15.9268292683
15.9268292683
15.9268292683
15.9268292683
14.0609756098
14.0609756098
14.0609756098
14.0609756098
14.0609756098
23.8567073171
23.8567073171
23.8567073171
23.8567073171
25.4115853659
25.4115853659
28.0548780488
40.0274390244
40.0274390244
40.0274390244
40.0274390244
40.0274390244
40.0274390244
20.5914634146
20.5914634146
20.4359756098
19.6585365854
18.2591463415
19.3475609756
18.2591463415
10.3292682927
27.743902439
27.743902439
27.743902439
27.743902439
27.743902439
27.743902439
22.3018292683
22.3018292683
21.368902439
21.368902439
21.368902439
21.5243902439
20.4359756098
20.4359756098
20.4359756098
20.4359756098
20.4359756098
20.4359756098
20.4359756098
11.8841463415
11.8841463415
1.0
11.1067073171
10.1737804878
14.5274390244
14.5274390244
14.5274390244
14.5274390244
14.5274390244
14.5274390244
11.7286585366
11.7286585366
12.6615853659
11.7286585366
8.15243902439
1.0
7.84146341463
6.90853658537
12.6615853659
12.6615853659
12.6615853659
12.6615853659
12.6615853659
12.6615853659
12.6615853659
12.6615853659
12.6615853659
13.1280487805
12.9725609756
12.9725609756
12.9725609756
10.3292682927
10.3292682927
10.3292682927
10.3292682927
9.55182926829
10.4847560976
29.9207317073
29.9207317073
29.9207317073
29.9207317073
30.0762195122
30.0762195122
26.1890243902
7.99695121951
25.256097561
7.99695121951
7.99695121951
7.99695121951
6.59756097561
6.59756097561
6.59756097561
6.59756097561
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
10.0182926829
10.0182926829
10.0182926829
10.0182926829
10.0182926829
10.0182926829
10.4847560976
15.9268292683
15.9268292683
15.9268292683
15.9268292683
15.9268292683
16.8597560976
15.9268292683
15.9268292683
16.8597560976
16.7042682927
16.7042682927
16.7042682927
9.08536585366
8.46341463415
8.46341463415
8.46341463415
8.46341463415
6.90853658537
7.84146341463
6.90853658537
4.26524390244
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
14.2164634146
14.2164634146
14.2164634146
14.0609756098
14.0609756098
14.0609756098
14.0609756098
16.8597560976
16.8597560976
16.7042682927
16.7042682927
16.7042682927
16.7042682927
17.9481707317
17.9481707317
19.6585365854
19.6585365854
19.6585365854
19.6585365854
10.7957317073
10.7957317073
10.7957317073
10.7957317073
10.7957317073
12.1951219512
12.1951219512
22.9237804878
22.9237804878
22.9237804878
22.9237804878
22.9237804878
22.9237804878
22.9237804878
7.84146341463
7.84146341463
7.84146341463
7.84146341463
8.7743902439
8.7743902439
7.84146341463
8.61890243902
8.61890243902
8.61890243902
8.61890243902
18.2591463415
18.2591463415
18.2591463415
18.2591463415
18.2591463415
18.2591463415
18.2591463415
18.2591463415
18.2591463415
9.39634146341
9.39634146341
9.24085365854
9.24085365854
9.24085365854
9.24085365854
9.08536585366
9.08536585366
9.08536585366
9.08536585366
9.55182926829
9.55182926829
9.55182926829
9.55182926829
9.55182926829
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
16.5487804878
1.0
16.0823170732
16.0823170732
16.0823170732
16.0823170732
16.0823170732
16.0823170732
16.0823170732
16.0823170732
16.0823170732
17.1707317073
17.0152439024
21.9908536585
21.9908536585
21.9908536585
21.9908536585
21.9908536585
21.9908536585
21.9908536585
7.84146341463
8.7743902439
7.84146341463
6.75304878049
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
3.95426829268
7.06402439024
7.06402439024
7.06402439024
11.262195122
11.262195122
11.262195122
11.262195122
11.262195122
11.262195122
9.08536585366
9.86280487805
7.99695121951
7.99695121951
14.2164634146
14.0609756098
14.0609756098
14.0609756098
14.0609756098
14.0609756098
2.24390243902
2.08841463415
3.02134146341
3.02134146341
2.08841463415
4.73170731707
4.73170731707
4.73170731707
4.73170731707
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.44207317073
6.59756097561
6.59756097561
6.59756097561
6.75304878049
1.0
6.28658536585
6.28658536585
7.21951219512
6.28658536585
10.6402439024
10.6402439024
10.6402439024
10.6402439024
10.6402439024
10.6402439024
10.6402439024
14.3719512195
14.3719512195
15.6158536585
15.6158536585
15.6158536585
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
35.6737804878
28.6768292683
28.6768292683
28.6768292683
28.6768292683
28.6768292683
51.8445121951
51.8445121951
51.8445121951
51.8445121951
51.8445121951
52.0
52.0
4.42073170732
4.42073170732
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
4.10975609756
3.95426829268
3.64329268293
3.64329268293
4.73170731707
4.73170731707
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
5.9756097561
5.82012195122
5.82012195122
5.82012195122
5.82012195122
5.82012195122
12.1951219512
12.1951219512
12.1951219512
12.1951219512
12.1951219512
12.1951219512
12.1951219512
12.1951219512
1.0
11.7286585366
11.7286585366
11.7286585366
11.7286585366
11.7286585366
11.7286585366
11.1067073171
11.1067073171
11.1067073171
11.1067073171
11.1067073171
11.1067073171
11.1067073171
11.1067073171
10.0182926829
10.0182926829
16.7042682927
16.7042682927
16.7042682927
16.7042682927
16.7042682927
16.7042682927
29.1432926829
29.1432926829
29.1432926829
29.1432926829
29.1432926829
29.1432926829
29.1432926829
29.1432926829
29.1432926829
1.15548780488
2.71036585366
2.71036585366
2.71036585366
2.71036585366
2.71036585366
2.71036585366
2.71036585366
3.17682926829
4.10975609756
4.10975609756
5.9756097561
5.9756097561
5.9756097561
6.90853658537
5.9756097561
10.1737804878
10.1737804878
10.1737804878
8.61890243902
8.46341463415
8.46341463415
9.39634146341
8.46341463415
8.46341463415
5.35365853659
5.35365853659
5.35365853659
5.35365853659
5.35365853659
5.35365853659
3.33231707317
4.42073170732
3.33231707317
6.59756097561
6.44207317073
5.82012195122
6.75304878049
5.82012195122
5.82012195122
5.82012195122
4.73170731707
5.66463414634
5.66463414634
4.73170731707
4.73170731707
5.66463414634
5.66463414634
5.50914634146
2.71036585366
5.50914634146
2.71036585366
2.71036585366
5.50914634146
5.50914634146
5.50914634146
6.28658536585
6.28658536585
5.9756097561
5.9756097561
7.06402439024
5.9756097561
7.53048780488
8.46341463415
8.46341463415
13.2835365854
13.2835365854
13.2835365854
13.2835365854
2.55487804878
2.55487804878
2.55487804878
2.55487804878
4.10975609756
3.17682926829
3.17682926829
4.26524390244
3.64329268293
3.64329268293
3.64329268293
3.33231707317
3.33231707317
3.33231707317
2.24390243902
3.33231707317
2.24390243902
2.24390243902
3.64329268293
3.64329268293
3.64329268293
3.64329268293
3.64329268293
3.64329268293
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
6.28658536585
6.28658536585
7.21951219512
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
3.7987804878
4.73170731707
3.7987804878
3.7987804878
3.7987804878
3.7987804878
3.7987804878
3.7987804878
4.26524390244
4.26524390244
5.19817073171
5.19817073171
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
7.53048780488
3.7987804878
3.7987804878
3.95426829268
3.02134146341
3.02134146341
3.02134146341
1.0
1.93292682927
2.55487804878
2.55487804878
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
5.9756097561
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
6.28658536585
16.0823170732
16.0823170732
31.3201219512
31.3201219512
31.3201219512
31.3201219512
31.3201219512
31.3201219512
31.3201219512
31.3201219512
3.64329268293
3.64329268293
4.26524390244
4.26524390244
3.7987804878
4.73170731707
3.7987804878
3.7987804878
2.55487804878
3.48780487805
2.55487804878
2.55487804878
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.17682926829
3.33231707317
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
12.3506097561
4.73170731707
4.73170731707
4.73170731707
4.73170731707
4.73170731707
4.73170731707
4.73170731707
4.73170731707
2.86585365854
2.86585365854
1.46646341463
1.46646341463
1.46646341463
1.46646341463
1.46646341463
1.46646341463
1.62195121951
1.62195121951
1.62195121951
1.77743902439
1.77743902439
4.42073170732
4.42073170732
4.42073170732
4.42073170732
4.42073170732
4.42073170732
4.42073170732
3.95426829268
3.95426829268
2.71036585366
2.71036585366
2.71036585366
2.71036585366
2.71036585366
1.77743902439
2.86585365854
3.02134146341
2.86585365854
2.86585365854
3.17682926829
3.17682926829
Plot
Any help would be appreciated,
After some digging, it looks like the AstroML method works best:
import numpy as np
from matplotlib import pyplot as plt
from astroML.time_series import lomb_scargle, search_frequencies
import pandas as pd

df = pd.read_csv('extinct.csv', header=None)
Y = df[0]
dy = 0.5 + 0.5 * np.random.random(len(df))
omega = np.linspace(10, 100, 1000)
sig = np.array([0.1, 0.01, 0.001])
PS, z = lomb_scargle(df.index, Y, dy, omega, generalized=True, significance=sig)

# plt.hold was removed from modern Matplotlib; overplotting is the default
plt.plot(omega, PS)
xlim = (omega[0], omega[-1])
for zi, pi in zip(z, sig):
    plt.plot(xlim, (zi, zi), ':k', lw=1)
    plt.text(xlim[-1] - 0.001, zi - 0.02, "$%.1g$" % pi, ha='right', va='top')
plt.show()
which gives
Significance levels are shown on the graph as well. I used the generalized Lomb-Scargle, with no smoothing.
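For readers without astroML installed, SciPy ships a Lomb-Scargle periodogram as well. A small sketch on synthetic data (the sinusoid and frequency grid here are made up for illustration; note that scipy.signal.lombscargle expects *angular* frequencies):

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic signal: a pure sinusoid at angular frequency 2*pi*0.1 ~= 0.628
x = np.arange(100, dtype=float)
y = np.sin(2 * np.pi * 0.1 * x)

freqs = np.linspace(0.01, 1.0, 500)   # angular frequencies to scan
pgram = lombscargle(x, y, freqs)      # classic Lomb-Scargle periodogram
peak = freqs[np.argmax(pgram)]        # should land near 0.628
```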