UDF (User Defined Function) python gives different answer in pig - python

I want to write a UDF python for pig, to read lines from the file called like
#'prefix.csv'
spol.
LLC
Oy
OOD
and match the names and if finds any matches, then replaces it with white space. here is my python code
def list_files2(name, f):
fin = open(f, 'r')
for line in fin:
final = name
extra = 'nothing'
if (name != name.replace(line.strip(), ' ')):
extra = line.strip()
final = name.replace(line.strip(), ' ').strip()
return final, extra,'insdie if'
return final, extra, 'inside for'
Running this code in python,
>print list_files2('LLC nakisa', 'prefix.csv' )
>print list_files2('AG company', 'prefix.csv' )
returns
('nakisa', 'LLC', 'insdie if')
('AG company', 'nothing', 'inside for')
which is exactly what I need. But when I register this code as a UDF in apache pig for this sample list:
nakisa company LLC
three Oy
AG Lans
Test OOD
pig returns wrong answer on the third line:
((nakisa company,LLC,insdie if))
((three,Oy,insdie if))
((A G L a n s,,insdie if))
((Test,OOD,insdie if))
The question is why UDF enters the if loop for the third entry which does not have any match in the prefix.csv file.

I don't know pig but the way you are checking for a match is strange and might be the cause of your problem.
If you want to check whether a string is a substring of another, python provides
the find method on strings:
if name.find(line.strip()) != -1:
# find will return the first index of the substring or -1 if it was not found
# ... do some stuff
additionally, your code might leave the file handle open. A way better approach to handle file operations is by using the with statement. This assures that in any case (except of interpreter crashes) the file handle will get closed.
with open(filename, "r") as file_:
# Everything within this block can use the opened file.
Last but not least, python provides a module called csv with a reader and a writer, that handle the parsing of the csv file format.
Thus, you could try the following code and check if it returns the correct thing:
import csv
def list_files2(name, filename):
with open(filename, 'rb') as file_:
final = name
extra = "nothing"
for prefix in csv.reader(file_):
if name.find(prefix) != -1:
extra = prefix
final = name.replace(prefix, " ")
return final, extra, "inside if"
return final, extra, "inside for"
Because your file is named prefix.csv I assume you want to do prefix substitution. In this case, you could use startswith instead of find for the check and replace the line final = name.replace(prefix, " ") with final = " " + name[name.find(prefix):]. This assures that only a prefix will be substituted with the space.
I hope, this helps

Related

modified textfile python script

I am totally new in python world. Here I am looking for some suggestion about my problem. I have three text file one is original text file, one is text file for updating original text file and write in a new text file without modifying the original text file. So file1.txt looks like
$ego_vel=x
$ped_vel=2
$mu=3
$ego_start_s=4
$ped_start_x=5
file2.txt like
$ego_vel=5
$ped_vel=5
$mu=6
$to_decel=5
outputfile.txt should be like
$ego_vel=5
$ped_vel=5
$mu=6
$ego_start_s=4
$ped_start_x=5
$to_decel=5
the code I tried till now is given below:
import sys
import os
def update_testrun(filename1: str, filename2: str, filename3: str):
testrun_path = os.path.join(sys.argv[1] + "\\" + filename1)
list_of_testrun = []
with open(testrun_path, "r") as reader1:
for line in reader1.readlines():
list_of_testrun.append(line)
# print(list_of_testrun)
design_path = os.path.join(sys.argv[3] + "\\" + filename2)
list_of_design = []
with open(design_path, "r") as reader2:
for line in reader1.readlines():
list_of_design .append(line)
print(list_of_design)
for i, x in enumerate(list_of_testrun):
for test in list_of_design:
if x[:9] == test[:9]:
list_of_testrun[i] = test
# list_of_updated_testrun=list_of_testrun
break
updated_testrun_path = os.path.join(sys.argv[5] + "\\" + filename3)
def main():
update_testrun(sys.argv[2], sys.argv[4], sys.argv[6])
if __name__ == "__main__":
main()
with this code I am able to get output like this
$ego_vel=5
$ped_vel=5
$mu=3
$ego_start_s=4
$ped_start_x=5
$to_decel=5
all the value I get correctly except $mu value.
Will any one provide me where I am getting wrong and is it possible to share a python script for my task?
Looks like your problem comes from the if statement:
if x[:9] == test[:9]:
Here you're comparing the first 8 characters of each string. For all other cases this is fine as you're not comparing past the '=' character, but for $mu this means you're evaluating:
if '$mu=3' == '$mu=6'
This obviously evaluates to false so the mu value is not updated.
You could shorten to if x[:4] == test[:4]: for a quick fix but maybe you would consider another method, such as using the .split() string function. This lets you split a string around a specific character which in your case could be '='. For example:
if x.split('=')[0] == test.split('=')[0]:
Would evaluate as:
if '$mu' == '$mu':
Which is True, and would work for the other statements too. Regardless of string length before the '=' sign.

String Cutting with multiple lines

so i'm new to python besides some experience with tKintner (some GUI experiments).
I read an .mbox file and copy the plain/text in a string. This text contains a registering form. So a Stefan, living in Maple Street, London working for the Company "MultiVendor XXVideos" has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put in a .csv row with column
"Name", "Adress", "Company",...
Now i tried to cut and slice everything. For debugging i use "print"(IDE = KATE/KDE + terminal... :-D ).
Problem is, that the data contains multiple lines after keywords but i only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string
fieldnames = ["ID","Subject","Name", "Adress", "Company"]
searchKeys = [ 'Name_OF_Person','Adress_HOME','Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"
if __name__ == "__main__":
with open(export_file_name,"w") as csvfile:
writer = csv.DictWriter(csvfile, dialect='excel',fieldnames=fieldnames)
writer.writeheader()
for message in mailbox.mbox(mbox_file):
if message.is_multipart():
content = '\n'.join(part.get_payload() for part in message.get_payload())
content = content.split('<')[0] # only want text/plain.. Ill split #right before HTML starts
#print content
else:
content = message.get_payload()
idea = message['message-id']
sub = message['subject']
fr = message['from']
date = message['date']
writer.writerow ('ID':idea,......) # CSV writing will work fine
for line in content.splitlines():
line = line.strip()
for pose in searchKeys:
if pose in line:
tmp = line.split(pose)
pmt = tmp[1].split(":")[1]
if next in line !=:
print pose +"\t"+pmt
sleep(1)
csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
#Yaniv
Thank you, iam still trying to understand every step, but just wanted to give a comment. I like the idea to work with the list/matrix/vector "key_value_pairs"
The amount of keywords in the emails is ~20 words. Additionally, my values are sometimes line broken by "=".
I was thinking something like:
Search text for Keyword A,
if true:
search text from Keyword A until keyword B
if true:
copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" are still there
should i replace ["="," = "] with "" ?
I would go for a "routine" parsing loop over the input lines, and maintain a current_key and current_value variables, as a value for a certain key in your data might be "annoying", and spread across multiple lines.
I've demonstrated such parsing approach in the code below, with some assumptions regarding your problem. For example, if an input line starts with a whitespace, I assumed it must be the case of such "annoying" value (spread across multiple lines). Such lines would be concatenated into a single value, using some configurable string (the parameter join_lines_using_this). Another assumption is that you might want to strip whitespaces from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespaces. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
key_value_pairs = []
current_key, current_value = None, ""
for line in text.splitlines():
line_split = line.split(':')
if line.startswith(" ") or len(line_split) == 1:
if current_key is None:
raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
current_value += join_lines_using_this + line.strip()
else:
if current_key is not None:
key_value_pairs.append((current_key, current_value))
current_key, current_value = None, ""
current_key = line_split[0].strip()
# We've just found a new key, so here you might want to perform additional checks,
# e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
current_value = ':'.join(line_split[1:]).strip()
# Don't forget the last parsed key, value
if current_key is not None:
key_value_pairs.append((current_key, current_value))
return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to using just string methods rather than inventing a huge function to do this. Reason: 1) Your process is already complex enough, and 2) your question really boils down to how to process the string data across multiple lines. If that is the case, and the pattern is consistent, this will get this one off job done
content = content.replace('\n', ' ')
Then you can split on each of the boundries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1] #take second element of the list
person = content.split("Adress_HOME:")[0] # take content before "Adress Home"
content = content.split("Adress_HOME:")[1] #take second element of the list
address = content.split("Company_NAME:")[0] # take content before
company = content.split("Adress_HOME:")[1] #take second element of the list (the remainder) which is company
Normally, I would suggest regex. (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex is going to pay dividends on time spend munging data. To make a regex function "cut" across multiple lines, you would use the re.MULTILINE option. So it might endup looking something like re.search('Name_OF_Person:(.*)Adress_HOME:', html_reg_form, re.MULTILINE)

Creating a dictionary using python and a .txt file

I have downloaded the following dictionary from Project Gutenberg http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB so if you're on a slow connection avoid clicking the link)
In the file the keywords I am looking for are in uppercases for instance HALLUCINATION, then in the dictionary there are some lines dedicated to the pronunciation which are obsolete for me.
What I want to extract is the definition, indicated by "Defn" and then print the lines. I have came up with this rather ugly 'solution'
def lookup(search):
find = search.upper() # transforms our search parameter all upper letters
output = [] # empty dummy list
infile = open('webster.txt', 'r') # opening the webster file for reading
for line in infile:
for part in line.split():
if (find == part):
for line in infile:
if (line.find("Defn:") == 0): # ugly I know, but my only guess so far
output.append(line[6:])
print output # uncertain about how to proceed
break
Now this of course only prints the first line that comes right after "Defn:". I am new when it comes to manipulate .txt files in Python and therefore clueless about how to proceed. I did read in the line in a tuple and noticed that there are special new line characters.
So I want to somehow tell Python to keep on reading until it runs out of new line characters I suppose, but also that doesn't count for the last line which has to be read.
Could someone please enhance me with useful functions I might could use to solve this problem (with a minimal example would be appreciated).
Example of desired output:
lookup("hallucination")
out: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
lookup("hallucination")
out: The perception of objects which have no reality, or of \r\n
sensations which have no corresponding external cause, arising from \r\n
disorder or the nervous system, as in delirium tremens; delusion.\r\n
Hallucinations are always evidence of cerebral derangement and are\r\n
common phenomena of insanity. W. A. Hammond.
from text:
HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]
Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]
1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.
2. (Med.)
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
Here is a function that returns the first definition:
def lookup(word):
word_upper = word.upper()
found_word = False
found_def = False
defn = ''
with open('dict.txt', 'r') as file:
for line in file:
l = line.strip()
if not found_word and l == word_upper:
found_word = True
elif found_word and not found_def and l.startswith("Defn:"):
found_def = True
defn = l[6:]
elif found_def and l != '':
defn += ' ' + l
elif found_def and l == '':
return defn
return False
print lookup('hallucination')
Explanation: There are four different cases we have to consider.
We haven't found the word yet. We have to compare the current line to the word we are looking for in uppercases. If they are equal, we found the word.
We have found the word, but haven't found the start of the definition. We therefore have to look for a line that starts with Defn:. If we found it, we add the line to the definition (excluding the six characters for Defn:.
We have already found the start of the definition. In that case, we just add the line to the definition.
We have already found the start of definition and the current line is empty. The definition is complete and we return the definition.
If we found nothing, we return False.
Note: There are certain entries, e.g. CRANE, that have multiple definitions. The above code is not able to handle that. It will just return the first definition. However, it is far from easy to code a perfect solution considering the format of the file.
You can split into paragraphs and use the index of the search word and find the first Defn paragraph after:
def find_def(f,word):
import re
with open(f) as f:
lines = f.read()
try:
start = lines.index("{}\r\n".format(word)) # find where our search word is
except ValueError:
return "Cannot find search term"
paras = re.split("\s+\r\n",lines[start:],10) # split into paragraphs using maxsplit = 10 as there are no grouping of paras longer in the definitions
for para in paras:
if para.startswith("Defn:"): # if para startswith Defn: we have what we need
return para # return the para
print(find_def("in.txt","HALLUCINATION"))
Using the whole file returns:
In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.
In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
A slightly shorter version:
def find_def(f,word):
import re
with open(f) as f:
lines = f.read()
try:
start = lines.index("{}\r\n".format(word))
except ValueError:
return "Cannot find search term"
defn = lines[start:].index("Defn:")
return re.split("\s+\r\n",lines[start+defn:],1)[0]
From here I learned an easy way to deal with memory mapped files and use them as if they were strings. Then you can use something like this to get the first definition for a term.
def lookup(search):
term = search.upper()
f = open('webster.txt')
s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
index = s.find('\r\n\r\n' + term + '\r\n')
if index == -1:
return None
definition = s.find('Defn:', index) + len('Defn:') + 1
endline = s.find('\r\n\r\n', definition)
return s[definition:endline]
print lookup('hallucination')
print lookup('hallucinate')
Assumptions:
There is at least one definition per term
If there are more than one, only the first is returned

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split to sentences in Python. I want to write to 2 tables. One which has the following columns: abstract id (which is the file number that I extracted from my document), sentence id (automatically generated) and each sentence of this abstract on a row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by(1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How to write sentences (each on a row) to table and assign sentenceId as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
fileA = open(name,'r');
for line in fileA:
if line.find(fileNo)!= -1:
file = line[14:]
if line.find(org) != -1:
nsfOrg = line[14:].split()
print file
print nsfOrg
fileA = open(name,'r')
content = fileA.read().split(':')
abstract = content[len(content)-1]
abstract = abstract.replace('\n','')
abstract = abstract.split();
abstract = ' '.join(abstract)
sentences = abstract.split('.')
print sentences
key = str(len(sentences))
print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob
for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
fh = open(filename, 'r')
abstract = fh.read().split(':')[-1]
fh.seek(0) # reset file pointer
# See comments below
for line in fh:
if line.find('File') != -1:
absID = line[14:]
print absID
if line.find('NSF Org') != -1:
print line[14:].split()
# End see comments
fh.close()
concat_abstract = ''.join(abstract.replace('\n', '').split())
for s_id, sentence in enumerate(concat_abstract.split('.')):
# Adjust numeric width arguments to prettify table
print absID.ljust(15),
print '{:06d}'.format(s_id).ljust(15),
print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collecting a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for reasons cited is best to avoid. In the smaller version, you can see how I substituted in string literals in places where they are only used once and called print statements immediately instead of storing the results for later. The results are usually more concise and easily understood.

Python; reading file and finding desired text

Need to create a function with two params, a filename to open and a pattern.
The pattern will be a search string.
Eg. the function will open sentence.txt that has something like "The quick brown fox" (can possibly be more than one line)
The pattern will be "brown fox"
So if found, as this will be, it should return a line number and index of the character the found string starts on. Else, return -1.
Catch is I've never programmed in python before so I don't know the syntax.
Previously coded in C, C#, Java, VB, etc..
EDIT:
.....Id
.....Name
#
my intent was for you to write HW3 code as iteration or
nested iterations that explicitly index the character
string as an array; i.e, the Python index() also known as
string.index() function is not allowed for this homework.
#
filename = raw_input('Enter filename: ')
pattern = raw_input('Enter pattern: ')
def findPattern(fname, pat):
Reading in one whole chunk
filetext = open(fname).read()
if pat in filetext:
print("Found it -- chunk")
else:
print("Nothing -- chunk")
Reading in line by line
for search in open(fname):
if pat in search:
print("Found it -- line")
else:
print("Nothing -- line")
findPattern(filename, pattern)
you can simulate simple "grep" with the "in" operator
def grep(filename, pattern):
for n,line in enumerate(open(filename)):
if pattern in line:
print line, n
To get index, you can use str.index() or str.find()
Here's a very simple grep. You could hack it out to use regular expressions pretty trivially. globbing wouldn't be much more difficult with glob. Also, the code you want is in there spread between grep and main so that might be of more interest than a custom grep ;)
def grep(filename, needle):
with open(filename) as f_in:
matches = ((i, line.find(needle), line) for i, line in enumerate(f_in))
return [match for match in matches if match[0] != -1]
def main(filename, needle):
matches = grep(filename, needle)
if matches:
print "{0} found on {1} lines in {2}".format(needle, len(matches), filename)
for line in matches:
print "{0}:{1}:{2}".format(*line)
return 1
else:
return -1
if __name__=='__main__':
import sys
filename = sys.argv[1]
needle = sys.argv[2]
return sys.exit(main(filename, needle))
Note that I haven't tested this code so there might be slight bugs. If it compiles, it should run fine though.
Also, you should tell your teacher that signalling failure with return codes is a terrible way to do things. If the caller of the function that you're going to write needs to know if no matches were found, it can just check for an empty list.

Categories

Resources