Unable to write data from StreamSets Jython Evaluator - python

I am trying to read data from a directory, parse it, and write the result to another directory.
For this I am using the Jython Evaluator. Here is my code:
import sys
sys.path.append('/usr/lib/python2.7/site-packages')
import feedparser
for record in records:
    myfeed = feedparser.parse(str(record))
    for item in myfeed['items']:
        title = item.title
        link = item.link
    output.write(record)
I am able to write data to the output, but my requirement is to write the title and link that are parsed from the input record.
Any suggestions, please?
Thanks in advance.

You need to write the values to the record. Below, the parsed title and link are assigned to fields of record.value before the record is written:
import sys
sys.path.append('/usr/lib/python2.7/site-packages')
import feedparser
for record in records:
    myfeed = feedparser.parse(str(record))
    for item in myfeed['items']:
        record.value["title"] = item.title
        record.value["link"] = item.link
    output.write(record)

Related

Web Scraping with Python - Satellites

I'm trying to get data on satellite positions several times a day from https://www.n2yo.com/. The satellite I'm focused on is MUOS 5. My problem is that I can't reach any of the changing data in the table.
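from bs4 import BeautifulSoup
import requests
import csv
# Assumed: the fetch-and-parse step was missing from the snippet; the URL is the MUOS 5 page
page = requests.get('https://www.n2yo.com/satellite/?s=41622')
soup = BeautifulSoup(page.text, 'html.parser')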
info = soup.find('div', class_='container-main')
info = info.find('div', id='trackinginfo')
info = info.find('div', id='paneldata')
info = info.find('table', id='tabledata')
info = info.find('tr')
print(info)
I expect to see the information shown, with 41622 in the second column, but I don't know how to access only the second td:
<tr>
<td>NORAD ID:</td><td><div id="noradid"></div></td>
</tr>
Any help/direction would be appreciated.
This is probably not a complete answer, but I believe it will get you started.
The page you linked to is loaded dynamically with JavaScript, so BeautifulSoup alone can't handle it. The data itself is served from another URL (see below; it can be found through the developer tools in your browser) and, since it's in JSON format, it can be loaded into Python.
The JSON contains historical information, and the most recent item is at the end of the JSON string. Once you have that item, you can extract the relevant data from it.
As you'll see below, I managed to connect some of the dynamic data to some of the types, but I'm not really familiar with the terminology, so you will probably have to do some extra work to complete it. But, as I said, it will at least get you started:
import requests
import json
req = requests.get('https://www.n2yo.com/sat/instant-tracking.php?s=41622&hlat=40.71427&hlng=-74.00597&d=300&r=547647090737.1928&tz=GMT-04:00&O=n2yocom&rnd_str=8fde3fd56c515d8fb110d5145c7df86b&callback=')
data = json.loads(req.text)
heads = ['LATITUDE','LONGITUDE', 'AZIMUTH','ELEVATION','??','DECLINATION','ALTITUDE [km]','???','NORAD ID','ABC','xxx'] #as I said, not sure exactly what's what...
target = list(data[0].values())[-1][-1] #this is the most recent data
dats = [item for item in list(target.values())[0].split('|')]
for v,d in zip(dats,heads):
    print(d,':',v)
Output:
LATITUDE : -6.18764033
LONGITUDE : -102.54010579
AZIMUTH : 216.12
ELEVATION : 28.65
?? : 242.47221474
DECLINATION : -12.92722520
ALTITUDE [km] : 35625.07
??? : 0.19091700104938
NORAD ID : 41622
ABC : 1598058725
xxx : 0
Hopefully this helps.
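The zip step above can also be packaged as a small helper that fails loudly if the number of fields ever changes. This is only a sketch: label_fields is a hypothetical name, RAW is rebuilt from the output shown above rather than fetched live, and the labels are the same guesses as in the answer.

```python
# Labels copied from the answer above; several of them are guesses.
HEADS = ['LATITUDE', 'LONGITUDE', 'AZIMUTH', 'ELEVATION', '??', 'DECLINATION',
         'ALTITUDE [km]', '???', 'NORAD ID', 'ABC', 'xxx']

# Pipe-delimited sample rebuilt from the printed output (not a live response).
RAW = ('-6.18764033|-102.54010579|216.12|28.65|242.47221474|-12.92722520|'
       '35625.07|0.19091700104938|41622|1598058725|0')


def label_fields(raw, heads):
    """Split a pipe-delimited string and pair each value with its label."""
    values = raw.split('|')
    if len(values) != len(heads):
        # Refuse to mislabel values if the upstream format changes.
        raise ValueError('expected %d fields, got %d' % (len(heads), len(values)))
    return dict(zip(heads, values))


if __name__ == '__main__':
    for label, value in label_fields(RAW, HEADS).items():
        print(label, ':', value)
```

Returning a dict means a single value can be looked up by label, for example label_fields(RAW, HEADS)['NORAD ID'].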
If I understood your question correctly, you are looking for data on this page: https://www.n2yo.com/satellite/?s=41622
Because, in my opinion, the HTML is not "made for BeautifulSoup", I would recommend regular expressions:
... </ul></p><br/><B>NORAD ID</B>: 41622 <a ...
import re
# read_data is assumed to hold the page's HTML as a string
m = re.search(r'\<B\>NORAD ID\<\/B\>: (.*?) \<a', read_data)
print(m.group(1))
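A self-contained variant of the same idea, run against the fragment quoted above instead of a live page. The helper name norad_id and the tighter \d+ pattern are my own assumptions; it returns None when the pattern is missing instead of raising on m.group(1).

```python
import re

# Fragment of the page source as quoted in the answer (not fetched live).
SAMPLE = '... </ul></p><br/><B>NORAD ID</B>: 41622 <a ...'


def norad_id(html):
    """Return the digits that follow the 'NORAD ID' label, or None if absent."""
    m = re.search(r'<B>NORAD ID</B>:\s*(\d+)', html)
    return m.group(1) if m else None


if __name__ == '__main__':
    print(norad_id(SAMPLE))
```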

parse XML with Python, key value namespaces

I have an XML file downloaded from WordPress that is structured like this:
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
My goal is to look through the XML file for all the country keys and print their values. I'm completely new to the XML library, so I'm not sure where to take it from here.
# load libraries
# importing os to handle directory functions
import os
# import XML handlers
from xml.etree import ElementTree
# importing json to handle structured data saving
import json
# dictionary with namespaces
ns = {'wp:meta_key', 'wp:meta_value'}
tree = ElementTree.parse('/var/www/python/file.xml')
root = tree.getroot()
# item
for item in root.findall('wp:post_meta', ns):
    print '- ', item.text
print "Finished running"
This throws an error about using wp as a namespace, but I'm not sure where to go from here; the documentation is unclear to me. Any help is appreciated.
Downvoters please let me know how I can improve my question.
I don't know XML, but I can treat it as a string like this.
from simplified_scrapy import SimplifiedDoc, req, utils
xml = '''
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
'''
doc = SimplifiedDoc(xml)
kvs = doc.select('wp:postmeta').selects('wp:meta_key|wp:meta_value').html
print (kvs)
Result:
['<![CDATA[country]]>', '<![CDATA[Germany]]>']
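For comparison, here is a sketch of the same lookup with the standard library's ElementTree. The wp prefix only resolves if it is mapped to the URI declared in the file's xmlns:wp attribute; the URI below is the one used by WordPress 1.2 exports and is an assumption here, as are the sample document and the helper name country_values.

```python
from xml.etree import ElementTree

# Assumption: must match the xmlns:wp declaration in the real export file.
WP_URI = 'http://wordpress.org/export/1.2/'
NS = {'wp': WP_URI}  # prefix -> URI mapping (a dict, not a set)

# Hypothetical sample wrapping the structure from the question.
SAMPLE = '''<rss xmlns:wp="http://wordpress.org/export/1.2/">
<channel><item>
<wp:postmeta>
<wp:meta_key><![CDATA[country]]></wp:meta_key>
<wp:meta_value><![CDATA[Germany]]></wp:meta_value>
</wp:postmeta>
<wp:postmeta>
<wp:meta_key><![CDATA[city]]></wp:meta_key>
<wp:meta_value><![CDATA[Berlin]]></wp:meta_value>
</wp:postmeta>
</item></channel></rss>'''


def country_values(xml_text):
    """Collect meta_value for every postmeta whose meta_key is 'country'."""
    root = ElementTree.fromstring(xml_text)
    values = []
    for meta in root.iterfind('.//wp:postmeta', NS):
        if meta.findtext('wp:meta_key', namespaces=NS) == 'country':
            values.append(meta.findtext('wp:meta_value', namespaces=NS))
    return values


if __name__ == '__main__':
    print(country_values(SAMPLE))
```

The parser hands back CDATA sections as ordinary text, so the values come out without the <![CDATA[...]]> wrapper.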

Can't create an xpath capable of meeting certain condition

I've created a script that extracts the links ending with the .html extension under the class tableFile from a webpage. The script does its job. However, my intention at this point is to get only those .html links that have EX- in their type field. I'm looking for a pure XPath solution (without using .getparent() or something similar).
Link to that site
Script I've tried with so far:
import requests
from lxml.html import fromstring
res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item:
        print(item)
When I try to get the links meeting above condition with the below approach, I get an error:
for item in root.xpath('//table[contains(@summary,"Document")]//td[@scope="row"]/a/@href'):
    if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
        print(item)
Error I get:
if ".htm" in item and "EX" in item.xpath("..//following-sibling::td/text"):
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'xpath'
If you need a pure XPath solution, you can use the one below:
import requests
from lxml.html import fromstring
res = requests.get("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
root = fromstring(res.text)
for item in root.xpath('//table[contains(@summary,"Document")]//tr[td[starts-with(., "EX-")]]/td/a[contains(@href, ".htm")]/@href'):
    print(item)
/Archives/edgar/data/1085596/000146970918000185/ex31_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex31_2apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_1apg.htm
/Archives/edgar/data/1085596/000146970918000185/ex32_2apg.htm
It looks like you want:
//td[following-sibling::td[starts-with(text(), "EX")]]/a[contains(@href, ".htm")]
There are a lot of different ways to do this with XPath. CSS is probably much simpler.
Here is a way using dataframes and pandas
import pandas as pd
tables = pd.read_html("https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/0001469709-18-000185-index.htm")
base = "https://www.sec.gov/Archives/edgar/data/1085596/000146970918000185/"
results = [base + row[1][2] for row in tables[0].iterrows() if row[1][2].endswith(('.htm', '.txt')) and str(row[1][3]).startswith('EX')]
print(results)

Use Minidom to parse XML But just crashes applet

Having some issues with Minidom for parsing an XML file on a remote server.
This is the XML I am trying to parse:
<mod n="1">
<body>
Random Body information will be here
</body>
<b>1997-01-27</b>
<d>1460321480</d>
<l>United Kingdom</l>
<s>M</s>
<t>About Denisstoff</t>
</mod>
I'm trying to return the <d> values with Minidom. This is the code I am trying to use to find the value:
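# Assumed imports: urlreq is taken to be an alias for urllib.request
import urllib.request as urlreq
from xml.dom import minidom
expired = True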
f = urlreq.urlopen("http://st.chatango.com/profileimg/"+args[:1]+"/"+args[1:2]+"/"+args+"/mod1.xml")
data = f.read().decode("utf-8")
dom = minidom.parseString(data)
itemlist = dom.getElementsByTagName('d')
print(itemlist)
It shows that the value is there, but when I followed an approach I found for reading the data (below), it crashed my Python app. This is the code I tried:
for s in itemlist:
    if s.hasAttribute('d'):
        print(s.attributes['d'].value)
This is the crash:
AttributeError: 'NodeList' object has no attribute 'value'
I also tried ElementTree but that didn't return any data at all. I have tested the URL and it's correct for the data I want, but I just can't get it to read the data in the tags. Any and all help is appreciated.
If you want to print the values from this XML, you should use this:
for s in itemlist:
    if hasattr(s.childNodes[0], "data"):
        print(s.childNodes[0].data)
I hope it helps :D
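The reason this works is that <d> carries its value as a text child node, not as an attribute, so hasAttribute('d') is never true for it. A minimal sketch using the XML from the question as an inline string, so it runs without the remote server; element_texts is a hypothetical helper name.

```python
from xml.dom import minidom

# The XML from the question, inlined instead of downloaded.
SAMPLE = """<mod n="1">
<body>
Random Body information will be here
</body>
<b>1997-01-27</b>
<d>1460321480</d>
<l>United Kingdom</l>
<s>M</s>
<t>About Denisstoff</t>
</mod>"""


def element_texts(xml_text, tag):
    """Return the stripped text of every element named `tag`."""
    dom = minidom.parseString(xml_text)
    texts = []
    for element in dom.getElementsByTagName(tag):
        child = element.firstChild
        # Skip empty elements and elements whose first child is not text.
        if child is not None and child.nodeType == child.TEXT_NODE:
            texts.append(child.data.strip())
    return texts


if __name__ == '__main__':
    print(element_texts(SAMPLE, 'd'))
```

Checking firstChild for None also covers an empty tag such as <d></d>, where childNodes[0] would raise an IndexError.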

I want to be able to choose which nodes to be printed when importing an XML document to Python

I am using Python to display information from an XML file hosted on a website. The code I am using is below:
#IMPORTS
from xml.dom import minidom
import urllib
#IMPORTING XML FILE
xmldocurl = 'http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/stats.xml'
settings = urllib.urlopen(xmldocurl).read()
final = minidom.parseString(settings)
date = final.getElementsByTagName('date')
for node in date:
    test = node.getAttribute('timestamp')
    print test
This returns the following:
1411853400
1411850700
1411847100
1411843500
1411839000
1411837200
1411831800
1411828200
1411822800
1411820100
I only want it to return the timestamp for the first node under the heading recent matches. At the moment this code returns every timestamp, but I only want a specific one.
How can I choose it?
Thanks
You need to get the recentMatches object and look at the date of the first match. One way to do that is:
#IMPORTS
from xml.dom import minidom
import urllib
#IMPORTING XML FILE
xmldocurl = 'http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/stats.xml'
settings = urllib.urlopen(xmldocurl).read()
final = minidom.parseString(settings)
recentMatches = final.getElementsByTagName('recentMatches')[0]
for node in recentMatches.childNodes:
    if node.nodeName == "match":
        nodes = node.getElementsByTagName('url')
        print nodes[0].childNodes[0].data
        nodes = node.getElementsByTagName('date')
        print nodes[0].getAttribute('timestamp')
        break
This will iterate over the matches and get you the first date timestamp.
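The same scoping idea as a Python 3 sketch that runs without the network. The sample document is hypothetical: only the recentMatches, match, url and date names and the timestamp attribute come from the code above; the root element, the upcomingMatches section and the helper name first_recent_timestamp are assumptions for illustration.

```python
from xml.dom import minidom

# Hypothetical stand-in for stats.xml; another section comes first on purpose.
SAMPLE = """<stats>
  <upcomingMatches>
    <match><date timestamp="1411999999"/></match>
  </upcomingMatches>
  <recentMatches>
    <match><url>http://example.com/match/1</url><date timestamp="1411853400"/></match>
    <match><url>http://example.com/match/2</url><date timestamp="1411850700"/></match>
  </recentMatches>
</stats>"""


def first_recent_timestamp(xml_text):
    """Return the timestamp of the first <date> inside <recentMatches>, or None."""
    dom = minidom.parseString(xml_text)
    sections = dom.getElementsByTagName('recentMatches')
    if not sections:
        return None
    # Searching from the section element limits matches to its descendants.
    dates = sections[0].getElementsByTagName('date')
    return dates[0].getAttribute('timestamp') if dates else None


if __name__ == '__main__':
    print(first_recent_timestamp(SAMPLE))
```

Calling getElementsByTagName on the recentMatches element instead of on the whole document is what keeps timestamps from other sections out of the result.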
