python: list index out of range in reducer - python

I'm writing the reduce part of my MapReduce program and I am getting a 'list index out of range' error on the line value = splitLine[1]. Why is this? I was fairly sure this was correct.
import sys
cKey = ""
cList = []
lines = sys.stdin.readlines()
for line in lines:
    line = line.rstrip()
    splitLine = line.split("\t")
    key = splitLine[0]
    value = splitLine[1]
    ....
Any thoughts? Thank you!

You are trying to access splitLine[1] when there is no [1] entry. Most likely, your input has either blank lines or lines with no \t in them.
A possible solution is to ignore entries that have fewer than 2 columns:
import sys
cKey = ""
cList = []
lines = sys.stdin.readlines()
for line in lines:
    line = line.rstrip()
    splitLine = line.split("\t")
    if len(splitLine) > 1:
        key = splitLine[0]
        value = splitLine[1]

You should do two things:
1. Filter out blank lines at the outset: if not re.match(r'^\s*$', line):
2. For non-blank lines, append a default value to cover border cases with no tabs (a blank space " " in this case): line+"\t "
Sample code:
import re
import sys
cKey = ""
cList = []
lines = sys.stdin.readlines()
for line in lines:
    # skip lines that are empty or contain only whitespace (\t, \n, \r, spaces)
    if not re.match(r'^\s*$', line):
        # drop the line ending, then add extra delimiter '\t' and default value ' ' to be safe
        line = line.rstrip("\r\n")+"\t "
        splitLine = line.split("\t")
        key = splitLine[0]
        # strip any blank spaces at end
        value = splitLine[1].rstrip()
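Another way to avoid the IndexError is str.partition, which always returns three parts, so a missing tab can be detected without indexing into a list. This is only a minimal sketch: the helper name parse_pairs and the sample lines are made up for illustration, and in a real reducer you would pass sys.stdin instead of the sample list.

```python
def parse_pairs(lines):
    """Yield (key, value) for each line that contains a tab; skip the rest."""
    for line in lines:
        line = line.rstrip("\r\n")
        # partition always returns (before, separator, after),
        # so there is no list index that could be out of range
        key, sep, value = line.partition("\t")
        if not sep:
            # blank line, or a line with no tab at all
            continue
        yield key, value


# made-up sample input; a reducer would use: parse_pairs(sys.stdin)
sample = ["apple\t1\n", "\n", "no tab here\n", "pear\t2\n"]
print(list(parse_pairs(sample)))  # [('apple', '1'), ('pear', '2')]
```

Note that partition splits on the first tab only, so any further tabs stay inside the value.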

Related

Python: how to skip the repeated lines in an input file?

I am reading from a file and I want to read each line alone, since each 3rd line in the output has to be a combination of the previous 2 lines. This is a small example:
Input:
<www.example.com/apple> <Anything>
<www.example.com/banana> <Anything>
Output:
<www.example.com/apple> <Anything>
<www.example.com/banana> <Anything>
<Apple> <Banana>
If any of the lines is repeated or if it is an empty line, then I do not want to process it, I want to get only 2 different lines each time.
This is a part of my real input:
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
In this case I want the output to be like this:
<http://catalog.data.gov/bread> <http://dbpedia.org>
<http://catalog.data.gov/roll> <http://dbpedia.org>
<bread> <roll>
This is my code:
file = open('rdfs.txt')
for id, line in enumerate(file):
    if id % 2 == 0:
        if line.isspace():
            continue
        line1 = line.split()
        sub_line1, rel_line1 = line1[0], line1[1]
        sub_line1 = sub_line1.lstrip("<").rstrip(">")
        print(sub_line1)
    else:
        if line.isspace():
            continue
        line2 = line.split()
        sub_line2, rel_line2 = line2[0], line2[1]
        sub_line2 = sub_line2.lstrip("<").rstrip(">")
        print(sub_line2)
It works, but I am getting all the lines. How can I skip a line that is equal to the line before it, and keep skipping until a new, different line is found?
The output I am getting now:
http://catalog.data.gov/bread
http://catalog.data.gov/bread
http://catalog.data.gov/roll
http://catalog.data.gov/roll
Thanks!!
You can declare a set() named lines_seen that holds every line seen so far, check whether each new line is already in lines_seen, and only process it if it is not:
Your code should look like this:
file = open('rdfs.txt')
lines_seen = set() # holds lines already seen
for id, line in enumerate(file):
    if line not in lines_seen: # not a duplicate
        lines_seen.add(line)
        if id % 2 == 0:
            if line.isspace():
                continue
            line1 = line.split()
            sub_line1, rel_line1 = line1[0], line1[1]
            sub_line1 = sub_line1.lstrip("<").rstrip(">")
            print(sub_line1)
        else:
            if line.isspace():
                continue
            line2 = line.split()
            sub_line2, rel_line2 = line2[0], line2[1]
            sub_line2 = sub_line2.lstrip("<").rstrip(">")
            print(sub_line2)
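Once duplicates are skipped, the even/odd test on the line index no longer says which line is first or second in a pair. A possible alternative, sketched below, is to collect the distinct subjects first and pair them up afterwards. The helper names unique_subjects, pair_up and short_name are hypothetical, and the sketch assumes every non-blank line starts with a <...> token as in the sample input.

```python
def unique_subjects(lines):
    """Return the subject of each distinct non-blank line, in first-seen order."""
    seen = set()
    subjects = []
    for line in lines:
        line = line.strip()
        if not line or line in seen:
            continue  # skip blank lines and exact repeats
        seen.add(line)
        # first token of the line, without the surrounding angle brackets
        subjects.append(line.split()[0].strip("<>"))
    return subjects


def pair_up(subjects):
    """Group subjects two at a time; a trailing unpaired subject is dropped."""
    return list(zip(subjects[0::2], subjects[1::2]))


def short_name(url):
    """Return the last path segment of a URL, e.g. 'bread'."""
    return url.rstrip("/").rsplit("/", 1)[-1]


sample = [
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "<http://catalog.data.gov/bread> <http://dbpedia.org>\n",
    "\n",
    "<http://catalog.data.gov/roll> <http://dbpedia.org>\n",
]
for first, second in pair_up(unique_subjects(sample)):
    print("<%s> <%s>" % (short_name(first), short_name(second)))  # <bread> <roll>
```

With a real file you would pass the open file object instead of the sample list.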

Adding words to list

I wrote a function that is supposed to add the words from a .txt file to a list while ignoring empty lines. However, my function adds '' to the list for every empty line.
def words(filename):
    word = []
    file = open(filename)
    for line in file:
        word.append(line.strip())
    return word
How can I fix this? Thanks.
What about a simple if test?
def words(filename):
    word = []
    file = open(filename)
    for line in file:
        # strip() never returns ' ', so compare against the empty string
        if line.strip() != '':
            word.append(line.strip())
    return word
EDIT: I forgot the .strip() after line
Besides, you could also use if line.strip():
Last, if you want to get a list of words but have several words per line, you need to split them. Assuming your separator is ' ':
def words(filename):
    word = []
    file = open(filename)
    for line in file:
        if line.strip() != '':
            word.extend(line.strip().split())
    return word
You can fix it like this:
def words(filename):
    word = []
    file = open(filename)
    for line in file:
        # keep only lines that are non-empty after stripping
        if line.strip():
            word.append(line.strip())
    return word
Your problem is that you're adding line.strip(), but what happens if line is actually an empty string? Look:
In [1]: line = ''
In [2]: line.strip()
Out[2]: ''
''.strip() returns an empty string.
You need to test for an empty line and skip the append in that case.
def words(filename):
    word = []
    file = open(filename)
    for line in file:
        line=line.strip()
        if len(line):
            word.append(line)
    return word
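The same idea can be written more compactly. The sketch below separates the filtering from the file handling so it can be tried without a file on disk; the helper name nonblank is made up for illustration, and the with-block is an addition that closes the file automatically.

```python
def nonblank(lines):
    """Return the stripped, non-empty lines from any iterable of strings."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]


def words(filename):
    # the with-block closes the file even if an error occurs while reading
    with open(filename) as handle:
        return nonblank(handle)


print(nonblank(["alpha\n", "\n", "  beta  \n", "   \n"]))  # ['alpha', 'beta']
```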

Replace string in line without adding new line?

I want to replace a string in the line that contains patternB, something like this:
from:
some lines
line contain patternA
some lines
line contain patternB
more lines
to:
some lines
line contain patternA
some lines
line contain patternB xx oo
more lines
I have code like this:
inputfile = open("d:\myfile.abc", "r")
outputfile = open("d:\myfile_renew.abc", "w")
obj = "yaya"
dummy = ""
item = []
for line in inputfile:
    dummy += line
    if line.find("patternA") != -1:
        for line in inputfile:
            dummy += line
            if line.find("patternB") != -1:
                item = line.split()
                dummy += item[0] + " xx " + item[-1] + "\n"
                break
outputfile.write(dummy)
It does not replace the line containing "patternB" as expected, but adds a new line below it, like this:
some lines
line contain patternA
some lines
line contain patternB
line contain patternB xx oo
more lines
What can I do with my code?
Of course it does: you append line to dummy at the beginning of the for loop and then append the modified version again in the "if" statement. Also, why check for patternA if you treat it the same as everything else?
inputfile = open("d:\myfile.abc", "r")
outputfile = open("d:\myfile_renew.abc", "w")
obj = "yaya"
dummy = ""
item = []
for line in inputfile:
    if line.find("patternB") != -1:
        item = line.split()
        dummy += item[0] + " xx " + item[-1] + "\n"
    else:
        dummy += line
outputfile.write(dummy)
The simplest approach is:
1. Read the whole file into a string
2. Call string.replace
3. Write the string to the output file
If you want to keep a line-by-line iterator (for a big file):
for line in inputfile:
    if line.find("patternB") != -1:
        dummy = line.replace('patternB', 'patternB xx oo')
        outputfile.write(dummy)
    else:
        outputfile.write(line)
This is slower than other responses, but enables big file processing.
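The line-by-line idea can be wrapped in a small generator so that nothing but the current line is held in memory. This is only a sketch under a few assumptions: the names append_to_matching and rewrite_file are hypothetical, and it appends the extra text at the end of the matching line rather than replacing the pattern inside it.

```python
def append_to_matching(lines, pattern, suffix):
    """Yield each line, adding suffix to the lines that contain pattern."""
    for line in lines:
        if pattern in line:
            # keep the original line ending and insert the suffix before it
            body = line.rstrip("\r\n")
            ending = line[len(body):]
            yield body + suffix + ending
        else:
            yield line


def rewrite_file(src_path, dst_path, pattern, suffix):
    # hypothetical helper: streams src to dst one line at a time
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(append_to_matching(src, pattern, suffix))


sample = ["some lines\n", "line contain patternB\n", "more lines\n"]
print("".join(append_to_matching(sample, "patternB", " xx oo")), end="")
```

Because the matching line is written once, in its modified form, no extra line appears below it.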
This should work:
import os
def replace():
    f1 = open("d:\myfile.abc","r")
    f2 = open("d:\myfile_renew.abc","w")
    # raw_input is Python 2; use input() on Python 3
    ow = raw_input("Enter word you wish to replace:")
    nw = raw_input("Enter new word:")
    for line in f1:
        templ = line.split()
        # swap matching words, then rejoin with spaces so words do not run together
        f2.write(' '.join(nw if i == ow else i for i in templ))
        f2.write('\n')
    f1.close()
    f2.close()
    os.remove("d:\myfile.abc")
    os.rename("d:\myfile_renew.abc","d:\myfile.abc")
replace()
You can use str.replace:
s = '''some lines
line contain patternA
some lines
line contain patternB
more lines'''
print(s.replace('patternB', 'patternB xx oo'))

read a specific string from a file in python?

I want to read the above file foo.txt, take only UDE from the first line and store it in a variable, then Unspecified from the second line and store it in a variable, and so on.
Should I use read or readlines? Should I use a regex for this?
My program below reads the entire line. How do I read a specific word in the line?
fo = open("foo.txt", "r+")
line = fo.readline()
left, right = line.split(':')
result = right.strip()
File_Info_Domain = result
print File_Info_Domain
line = fo.readline()
left, right = line.split(':')
result = right.strip()
File_Info_Intention = result
print File_Info_Intention
line = fo.readline()
left, right = line.split(':')
result = right.strip()
File_Info_NLU_Result = result
print File_Info_NLU_Result
fo.close()
You can use readline() (without the s in the name) to read lines one by one, and then use split(':') to get the value from each line.
fo = open("foo.txt", "r+")
# read first line
line = fo.readline()
# split line only on first ':'
elements = line.split(':', 1)
if len(elements) < 2:
    print("there is no ':' or there is no value after ':' ")
else:
    # remove spaces and "\n"
    result = elements[1].strip()
    print(result)
#
# time for second line
#
# read second line
line = fo.readline()
# split line only on first ':'
elements = line.split(':', 1)
if len(elements) < 2:
    print("there is no ':' or there is no value after ':' ")
else:
    # remove spaces and "\n"
    result = elements[1].strip()
    print(result)
# close
fo.close()
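Since every line is handled the same way, the repeated blocks can be folded into a loop that stores the values in a dictionary. The sketch below assumes each line has the form "name: value"; the contents of foo.txt are not shown in the question, so the sample lines and the helper name parse_fields are made up for illustration.

```python
def parse_fields(lines):
    """Map the text before the first ':' to the stripped text after it."""
    fields = {}
    for line in lines:
        # partition splits on the first ':' only, like split(':', 1)
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no ':' on this line, nothing to store
        fields[name.strip()] = value.strip()
    return fields


# made-up sample; with a real file use: parse_fields(open("foo.txt"))
sample = ["Domain: UDE\n", "Intention: Unspecified\n", "no separator here\n"]
info = parse_fields(sample)
print(info["Domain"])     # UDE
print(info["Intention"])  # Unspecified
```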
While you can use @furas's answer or a regex, I would recommend using a config file for this instead of a plain .txt file. Your config file would look like:
[settings]
Domain=UDE
Intention=Unspecified
nlu_slot_details={"Location": {"literal": "18 Slash 6/2015"}, "Search-phrase": {"literal": "18 slash 6/2015"}
And in your python code:
import configparser
config = configparser.RawConfigParser()
config.read("foo.cfg")
domain = config.get('settings', 'Domain')
intention = config.get('settings', 'Intention')
nlu_slot_details = config.get('settings', 'nlu_slot_details')

Read special characters from .txt file in python

The goal of this code is to find the frequency of words used in a book.
I am trying to read in the text of a book, but the following line keeps throwing my code off:
precious protégés. No, gentlemen; he'll always show 'em a clean pair
specifically the é character
I have looked at the following documentation, but I don't quite understand it: https://docs.python.org/3.4/howto/unicode.html
Here's my code:
import string
# Create word dictionary from the comprehensive word list
word_dict = {}
def create_word_dict ():
    # open words.txt and populate dictionary
    word_file = open ("./words.txt", "r")
    for line in word_file:
        line = line.strip()
        word_dict[line] = 1
# Removes punctuation marks from a string
def parseString (st):
    st = st.encode("ascii", "replace")
    new_line = ""
    st = st.strip()
    for ch in st:
        ch = str(ch)
        if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':
            print (ch)
            new_line += ch
        else:
            new_line += ""
    # now remove all instances of 's or ' at end of line
    new_line = new_line.strip()
    print (new_line)
    if (new_line[-1] == "'"):
        new_line = new_line[:-1]
    new_line.replace("'s", "")
    # Conversion from ASCII codes back to useable text
    message = new_line
    decodedMessage = ""
    for item in message.split():
        decodedMessage += chr(int(item))
    print (decodedMessage)
    return new_line
# Returns a dictionary of words and their frequencies
def getWordFreq (file):
    # Open file for reading the book.txt
    book = open (file, "r")
    # create an empty set for all Capitalized words
    cap_words = set()
    # create a dictionary for words
    book_dict = {}
    total_words = 0
    # remove all punctuation marks other than '[not s]
    for line in book:
        line = line.strip()
        if (len(line) > 0):
            line = parseString (line)
            word_list = line.split()
            # add words to the book dictionary
            for word in word_list:
                total_words += 1
                if (word in book_dict):
                    book_dict[word] = book_dict[word] + 1
                else:
                    book_dict[word] = 1
    print (book_dict)
    # close the file
    book.close()
def main():
    wordFreq1 = getWordFreq ("./Tale.txt")
    print (wordFreq1)
main()
The error that I received is as follows:
Traceback (most recent call last):
  File "Books.py", line 80, in <module>
    main()
  File "Books.py", line 77, in main
    wordFreq1 = getWordFreq ("./Tale.txt")
  File "Books.py", line 60, in getWordFreq
    line = parseString (line)
  File "Books.py", line 36, in parseString
    decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long
When you open a text file in Python without specifying an encoding, the platform's default encoding is used (on Windows this is often an ANSI code page), which may not decode your é character correctly. Try:
word_file = open ("./words.txt", "r", encoding='utf-8')
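With the file decoded as UTF-8, the accented letters can be kept as ordinary characters instead of being converted to ASCII codes. The sketch below counts words with collections.Counter and a regular expression that matches runs of letters, optionally joined by one apostrophe. It assumes the book is UTF-8 encoded, and the names WORD_RE, count_words and count_words_in_file are hypothetical.

```python
import re
from collections import Counter

# letters only (any alphabet), optionally followed by an apostrophe and more letters
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def count_words(lines):
    """Count lower-cased words in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def count_words_in_file(path, encoding="utf-8"):
    # assumption: the book is stored as UTF-8; pass another encoding if not
    with open(path, "r", encoding=encoding) as book:
        return count_words(book)


sample = ["precious protégés. No, gentlemen; he'll always show 'em a clean pair\n"]
freq = count_words(sample)
print(freq["protégés"], freq["he'll"])  # 1 1
```

Punctuation and digits never match the pattern, so no separate character-by-character filter is needed.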
The best way I could think of is to read each character as an ASCII value into an array, and then take the character value. For example, 97 is the ASCII code for "a", and chr(97) returns "a". Check out some online ASCII tables that also provide values for special characters.
Try:
def parseString(st):
    st = st.encode("ascii", "replace")
    # rest of code here
The new error you are getting is because you are calling isalpha on an int (i.e. a number)
Try this:
for ch in st:
    ch = str(ch)
    # a bare generator expression is always truthy, so test for digits directly
    if ch.isdigit() or ' ' in ch or ch.isspace() or ch == u'\xe9':
        print (ch)
