Apache Pig and User Defined Functions - python

I am trying to read in a log file using Apache Pig. After reading in the file I want to use my own User Defined Functions in Python. What I'm trying to do is something like the following code, but it results in ERROR 1066: Unable to open iterator for alias B, for which I have been unable to find a solution via Google.
register 'userdef.py' using jython as parser;
A = LOAD 'test_data' using PigStorage() as (row);
B = FOREACH A GENERATE parser.split(A.row);
DUMP B;
However, if I replace A.row with an empty string '', the function call completes and no error occurs (but no data is passed or processed either).
What is the proper way to pass the row of data to the UDF in string format?

You do not need to specify A.row; row alone or $0 should work.
$0 is the first column, $1 the second one.
Be careful: PigStorage will automatically split your data if it finds a delimiter, so row may contain only the first field of each row.
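For example, the FOREACH line becomes:
B = FOREACH A GENERATE parser.split(row);
or, equivalently, parser.split($0). Below is a minimal sketch of what userdef.py could look like, assuming split is meant to return the whitespace-separated tokens of the row (the outputSchema decorator is made available by Pig when the script is registered with Jython):
@outputSchema('tokens:bag{t:tuple(token:chararray)}')
def split(row):
    # return one tuple per whitespace-separated token in the row
    return [(t,) for t in row.split()]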
Antony.

Related

Evaluate Xquery in pyspark on RDD elements

We are trying to read a large number of XMLs (for example, the books XML) and run XQuery on them in pyspark. We are using the spark-xml-utils library.
We want to feed the directory containing the XMLs to pyspark and run XQuery on all of them to get our results.
reference answer: Calling scala code in pyspark for XSLT transformations
The XQuery processor is defined as follows, where xquery is the XQuery string:
proc = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(xquery)
We are reading the files in a directory using:
sc.wholeTextFiles("xmls/test_files")
This gives us an RDD containing all the files as a list of tuples:
[ (Filename1,FileContentAsAString), (Filename2,File2ContentAsAString) ]
The XQuery evaluates and gives us results if we run it directly on the string (FileContentAsAString):
whole_files = sc.wholeTextFiles("xmls/test_files").collect()
proc.evaluate(whole_files[1][1])
# Prints proper xquery result for that file
Problem:
If we try to run proc.evaluate() on the RDD using a lambda function, it fails.
test_file = sc.wholeTextFiles("xmls/test_files")
test_file.map(lambda x: proc.evaluate(x[1])).collect()
# Should give us a list of xquery results
Error:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
These calls work, but the evaluate above does not:
Print the content the XQuery is applied to:
test_file.map(lambda x: x[1]).collect()
# Outputs the content; with x[0] it gives us the list of filenames
Return the number of characters in the contents:
test_file.map(lambda x: len(x[1])).collect()
# Output: [15274, 13689, 13696]
Books example for reference:
books_xquery = """for $x in /bookstore/book
where $x/price>30
return $x/title/data()"""
proc_books = sc._jvm.com.elsevier.spark_xml_utils.xquery.XQueryProcessor.getInstance(books_xquery)
books_xml = sc.wholeTextFiles("xmls/books.xml")
books_xml.map(lambda x: proc_books.evaluate(x[1])).collect()
# Error
# I can share the stacktrace if you guys want
Unfortunately it is not possible to call a Java/Scala library directly within a map call from Python code. This answer gives a good explanation why there is no easy way to do this. In short, the reason is that the Py4J gateway (which is necessary to "translate" the Python calls into the JVM world) only lives on the driver node, while the map calls that you are trying to execute run on the executor nodes.
One way around that problem would be to wrap the XQuery function in a Scala UDF (explained here), but it still would be necessary to write a few lines of Scala code.
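If the number of files is small enough to handle on the driver, a stop-gap that gives up Spark's parallelism is to collect first and evaluate locally, as in the working snippet from the question:
# Runs entirely on the driver, where the Py4J gateway lives
whole_files = sc.wholeTextFiles("xmls/test_files").collect()
results = [proc.evaluate(content) for (filename, content) in whole_files]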
EDIT: If you are able to switch from XQuery to XPath, a probably easier option is to change the library. ElementTree is an XML library written in Python that also supports XPath.
The code
xmls = spark.sparkContext.wholeTextFiles("xmls/test_files")
import xml.etree.ElementTree as ET
xpathquery = "...your query..."
xmls.flatMap(lambda x: ET.fromstring(x[1]).findall(xpathquery)) \
    .map(lambda x: x.text) \
    .foreach(print)
would print all results of running the xpathquery against all documents loaded from the directory xmls/test_files.
First, flatMap is used because the findall call returns a list of all matching elements within each document; flatMap flattens those lists (the result may contain more than one element per file). In the second step, map converts each element to its text in order to get readable output.
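Note that foreach(print) runs on the executors, so in a cluster the output ends up in the executor logs; if the results should instead come back to the driver as a Python list, collect() can be used (a small variation on the code above):
results = xmls.flatMap(lambda x: ET.fromstring(x[1]).findall(xpathquery)) \
              .map(lambda x: x.text) \
              .collect()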

How to store and get strings with concatenated variables from database without losing variable value in python

I store in the database a string with concatenated variables, but when I fetch it, it behaves like a plain string and the variable values are not reflected.
Stored in the database field I have:
"""\
Please visit the following link to grant or revoke your consent:
"""+os.environ.get("PROTOCOL")+"""://"""+os.environ.get("DOMAIN")+"""/consent?id="""+consentHash+""""""
I need to be able to fetch it in python and store it in a variable but have the concatenated variable values reflected:
someVariable = database['field']
But like this the concatenated variable values are not processed and the whole thing behaves like one string.
When I print(someVariable) I am expecting
Please visit the following link to grant or revoke your consent:
https://somedomain/consent?id=123
But instead I get the original stored string as in database field:
"""\
Please visit the following link to grant or revoke your consent:
"""+os.environ.get("PROTOCOL")+"""://"""+os.environ.get("DOMAIN")+"""/consent?id="""+consentHash+""""""
You can call eval on your string to have it, uh, evaluate the string as an expression.
Using eval is considered dangerous, because it can be used to do pretty much anything you could write code for, without knowing just what that code will be ahead of time. This is more of an issue when using it on strings provided from an outside source.
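As a sketch of how that would look (consentHash is assumed to be defined in the scope where you evaluate, and the PROTOCOL and DOMAIN environment variables must be set, just as when the string was built):
import os

consentHash = '123'                      # assumed to exist in the evaluating scope
stored = database['field']               # the expression text exactly as stored
# eval() executes the stored text as a Python expression, so the
# os.environ.get() calls and consentHash are resolved at this point.
# Only do this if the stored string comes from a trusted source.
someVariable = eval(stored)
print(someVariable)                      # e.g. https://somedomain/consent?id=123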

What is the Python way to express a GREL line that is creating as many tags as needed in an XML document?

I'm using Open Refine to do something that I KNOW Python can do. I'm using it to convert a csv into an XML metadata document. I can figure out most of it, but the one thing that trips me up is this GREL line:
{{forEach(cells["subjectTopicsLocal"].value.split('; '), v, '<subject authority="local"><topic>'+v.escape("xml")+'</topic></subject>')}}
What this does is beautiful for me. I've got a "subject" field in my Excel spreadsheet. My volunteers enter keywords, separated with a "; ". I don't know how many keywords they'll come up with, and sometimes there is only one. That GREL line creates a new <subject authority="local"><topic></topic></subject> for each term created, and of course slides it into the field.
I know there has to be a Python expression that can do this. Could someone recommend best practice for this? I'd appreciate it!
Basically you want to use 'split' in Python to convert the string from your subject field into a Python list, and then you can iterate over the list.
So assuming you've read the content of the 'subject' field from a line in your csv/excel document already and assigned it to a string variable 'subj' you could do something like:
subjList = subj.split("; ")
for subject in subjList:
    # do what you need to do to output 'subject' in an xml element here
This Python expression is the equivalent to your GREL expression:
['<subject authority="local"><topic>' + escape(v) + '</topic></subject>' for v in value.split('; ')]
It will create an array of XML snippets containing your subjects. It assumes that you've created or imported an appropriate escape function, such as
from xml.sax.saxutils import escape
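Putting the pieces together, a minimal sketch that reads the csv and prints the XML snippets for each row (the filename metadata.csv is hypothetical; the column name subjectTopicsLocal and the '; ' separator are taken from the question):
import csv
from xml.sax.saxutils import escape

with open('metadata.csv', newline='') as f:          # hypothetical filename
    for row in csv.DictReader(f):
        snippets = ['<subject authority="local"><topic>' + escape(v) + '</topic></subject>'
                    for v in row['subjectTopicsLocal'].split('; ')]
        print(''.join(snippets))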

Working with Parameters containing Escaped Characters in Python Config file

I have a config file that I'm reading using the following code:
import configparser as cp
config = cp.ConfigParser()
config.read('MTXXX.ini')
MT=identify_MT(msgtext)
schema_file = config.get(MT,'kbfile')
fold_text = config.get(MT,'fold')
The relevant section of the config file looks like this:
[536]
kbfile=MT536.kb
fold=:16S:TRANSDET\n
Later I try to find text contained in a dictionary that matches the 'fold' parameter, using the following function:
def test(find_text):
    return {k for k, v in dictionary.items() if find_text in v}
I get different results if I call that function in one of two ways:
test(fold_text)
Fails to find the data I want, but:
test(':16S:TRANSDET\n')
returns the results I know are there.
And, if I print the content of the dictionary, I can see that it is, as expected, shown as
:16S:TRANSDET\n
So, it matches when I enter the search text directly, but doesn't find a match when I load the same text in from a config file.
I'm guessing that there's some magic being applied when reading/handling the \n character pattern from the config file, but I don't know how to get it to work the way I want it to.
I want to be able to parameterise using escape characters but it seems I'm blocked from doing this due to some internal mechanism.
Is there some switch I can apply to the config reader, or some extra parsing I can do, to get the behavior I want? Or perhaps there's an alternate solution. I do find the configparser module convenient to use, but perhaps this is a limitation that requires an alternative, or even a self-built module, to lift data out of a parameter file.
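To illustrate what seems to be happening: configparser reads the value literally, so fold_text ends with a backslash followed by 'n' rather than a real newline. A sketch, including one possible conversion (the unicode_escape round-trip is a suggestion, not something configparser does for you, and assumes the value is plain ASCII):
import configparser as cp

config = cp.ConfigParser()
config.read('MTXXX.ini')

fold_text = config.get('536', 'fold')
print(repr(fold_text))            # ':16S:TRANSDET\\n' -- a backslash followed by 'n'
print(repr(':16S:TRANSDET\n'))    # ':16S:TRANSDET\n'  -- an actual newline character

# Interpret the backslash escapes that were read literally from the file
fold_text = fold_text.encode().decode('unicode_escape')
print(repr(fold_text))            # now ends with a real newline and matches the dictionary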

Extracting data from MS Word

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.
I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.
Which is the best way to do this:
VBA macro from inside Word to create CSV and then upload to the DB?
VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
Python script via win32com then upload to DB?
The last one is attractive to me as the web-interface is being built with Django, but I've never used win32com or tried scripting Word from python.
EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I am having a problem though - all the text is in Tables, and when I pull the strings out of the CELLS I want, I get a strange little box character at the end of each string. My code looks like:
sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n
Close #fnum
What's up with the little control character box? Is some kind of character code coming across from Word?
Word has a little marker thingy that it puts at the end of every cell of text in a table.
It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.
Just use the Left() function to strip it out, i.e.
Left(Target, Len(Target) - 1)
By the way, instead of
num_rows = Application.ActiveDocument.Tables(2).Rows.Count
For n = 1 To num_rows
Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
Try this:
For Each row In Application.ActiveDocument.Tables(2).Rows
    Descr = row.Cells(2).Range.Text
Well, I've never scripted Word, but it's pretty easy to do simple stuff with win32com. Something like:
from win32com.client import Dispatch
word = Dispatch('Word.Application')
doc = word.Documents.Open('d:\\stuff\\myfile.doc')
doc.SaveAs(FileName='d:\\stuff\\text\\myfile.txt', FileFormat=?) # not sure what to use for ?
This is untested, but I think something like that will just open the file and save it as plain text (provided you can find the right fileformat) – you could then read the text into python and manipulate it from there. There is probably a way to grab the contents of the file directly, too, but I don't know it off hand; documentation can be hard to find, but if you've got VBA docs or experience, you should be able to carry them across.
Have a look at this post from a while ago: http://mail.python.org/pipermail/python-list/2002-October/168785.html Scroll down to COMTools.py; there's some good examples there.
You can also run makepy.py (part of the pythonwin distribution) to generate python "signatures" for the COM functions available, and then look through it as a kind of documentation.
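For reference, a minimal, untested sketch that reads the same table cells as the VBA above via win32com (it assumes pywin32 and Word are installed, the path is hypothetical, and the layout of Tables(2) matches the VBA):
from win32com.client import Dispatch

word = Dispatch('Word.Application')
doc = word.Documents.Open(r'D:\temp\minutes.doc')    # hypothetical path

table = doc.Tables(2)                                # same table as in the VBA
rows = []
for n in range(1, table.Rows.Count + 1):
    cells = [table.Cell(n, c).Range.Text for c in (2, 3, 4)]
    # Word ends each cell's text with Chr(13) + Chr(7); strip that marker off
    rows.append([c.rstrip('\r\x07') for c in cells])

doc.Close(False)
word.Quit()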
You could use OpenOffice. It can open Word files and can also run Python macros.
I'd say look at the related questions on the right -->
The top one seems to have some good ideas for going the python route.
How about saving the file as XML, then using Python or something else to pull the data out of Word and into the database?
It is possible to programmatically save a Word document as HTML and to import the tables it contains into Access. This requires very little effort.
