Handling word index in text files - python

wordlist A: book jesus christ son david son abraham jacob judah his brothers perez amminadab
wordlist B: akwụkwọ jizọs kraịst nwa devid nwa ebreham jekọb juda ya ụmụnne pirez aminadab
file.txt A:
the book of the history of jesus christ , son of david , son of abraham :
abraham became father to isaac ; isaac became father to jacob ; jacob became father to judah and his brothers ;
file.txt B:
akwụkwọ nke kọrọ akụkọ banyere jizọs kraịst , nwa devid , nwa ebreham :
ebreham mụrụ aịzik ; aịzik amụọ jekọb ; jekọb amụọ juda na ụmụnne ya ndị ikom ;
I have the 2 word-lists above (say A and B) in 2 different languages. They are word-for-word translations of each other, in order. My task is to run each word-list through the .txt file of its own language (word-list A through file.txt A, and likewise for B), then return one output line per line of the txt files. Each output line should contain the paired index numbers at which the words were found on that line of each txt file, like:
2:1 7:6 8:7 10:9 12:10 14:12 16:13 [2:1 means index 2 of "book" in file.txt A and index 1 of "akwụkwọ" in file.txt B, and so on]
1:1 11:6 13:8 17:10 19:12 20:13 [1:1 means index 1 of "abraham" in file.txt A and index 1 of "ebreham" in file.txt B, and so on].
See my code below:
import sys

def wordlist(filename):
    wordlist = []
    with open(filename, 'rb') as f:
        for line in f:
            wordlist.append(line)
    return wordlist

eng = []
for lines in open('eng_try.txt', 'rb'):
    line = lines.strip()
    eng.append(line)

igb = []
for lines in open('igb_try.txt', 'rb'):
    line = lines.strip()
    igb.append(line)

i = 0
while i < len(eng):
    eng_igb_verse_pair = eng[i] + " " + igb[i]
    line = eng_igb_verse_pair.strip().split()
    for n in range(0, len(wordlist('eng_wordlist.txt'))):
        eng_word = wordlist('eng_wordlist.txt').pop(n)
        igb_word = wordlist('igb_wordlist.txt').pop(n)
        if eng_word in line and igb_word in line:
            print '{0} {1}:{2}'.format(i, line.index[eng_word], line.index[igb_word])
    i += 1
This actually prints nothing. I know my problem is in the last segment of the program. Can someone help? I am not a very experienced Python programmer. Apologies if I didn't construct my explanation well.

You mean something like this:
import sys

def checkLine(line_eng, line_igb):
    eng_words = line_eng.split()
    igb_words = line_igb.split()
    for word in eng_words:
        if word in eng:
            igb_word = igb[eng.index(word)]
            print "%d:%d" % (eng_words.index(word) + 1, igb_words.index(igb_word) + 1),

def linelist(filename):
    lineslist = []
    for line in open(filename, 'rb'):
        lineslist.append(line)
    return lineslist

eng = []
for lines in open('eng_try.txt', 'rb'):
    line = lines.strip()
    for w in line.split():
        eng.append(w)

igb = []
for lines in open('igb_try.txt', 'rb'):
    line = lines.strip()
    for w in line.split():
        igb.append(w)

eng_lines = linelist("eng_wordlist.txt")
igb_lines = linelist("igb_wordlist.txt")
for n in range(0, len(eng_lines)):
    print "%d. " % (n + 1),
    checkLine(eng_lines[n], igb_lines[n])
    print
For your files I got this result:
1. 2:1 7:6 8:7 10:9 12:10 10:9 16:13
2. 1:1 11:7 11:7 17:11 19:14 20:13
BR
Parasit Hendersson
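One caveat about the approach above: list.index always returns the first occurrence of a word, which is why repeated words produce duplicated pairs in the output (10:9 appears twice where the expected result was 14:12). A duplicate-safe sketch, walking positions with enumerate and marking already-paired Igbo positions as used (Python 3 syntax, hypothetical helper name; eng and igb are the flattened word-lists as above):

def check_line_unique(line_eng, line_igb, eng, igb):
    # Pair each English word occurrence with the first *unused*
    # occurrence of its Igbo translation, so repeats get distinct indices.
    eng_words = line_eng.split()
    igb_words = line_igb.split()
    used = set()   # Igbo positions already paired
    pairs = []
    for pos, word in enumerate(eng_words):
        if word not in eng:
            continue
        target = igb[eng.index(word)]
        for ipos, iword in enumerate(igb_words):
            if iword == target and ipos not in used:
                used.add(ipos)
                pairs.append("%d:%d" % (pos + 1, ipos + 1))
                break
    return " ".join(pairs)

With the sample files, the second "son" then pairs with the second "nwa", giving 14:12 as in the expected output.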


Remove item from list Python

I have a document containing a list. When I write out the document, it prints
['γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece']
['australia']
[]
['brasil']
[]
['canada']
[]
['españa']
[]
What I want is to remove the [] characters. So far I've done the following.
for file_name in list_of_files:
    with open(file_name, 'r', encoding="utf-8") as inf:
        lst = []
        for line in inf:
            # special characters removal
            line = line.lower()
            line = re.sub('\W+', ' ', line)
            line = word_tokenize(line)
            # stopwords removal
            line = ' '.join([word for word in line if word not in stopwords_dict])
            line = line.split('\n')
            line = list(filter(None, line))
            lst.append(line)
    inf.close()
This removes some of the '' entries from inside the empty [], which seems reasonable. I have tried several approaches such as strip, remove() and [x for x in strings if x], without success. I am rather inexperienced; what am I missing?
Update:
The initial text looks like this:
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
Update:
And I am writing the list to a file like this:
for line in lst:
    outf.write("%s\n" % line)
outf.close()
It looks like you're writing the lists themselves after appending all of the items from the file. If you want to print the items of a list in Python without the surrounding '[' and ']', just loop over each item and print it like so:
for item in list:
    outf.write("%s\n" % item)
and for a list of lists
for list in lists:
    for item in list:
        outf.write("%s\n" % item)
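Equivalently, if a flat write-out is all that's needed, itertools.chain.from_iterable flattens one level of nesting; a sketch reusing the outf and lists names from above:

from itertools import chain

for item in chain.from_iterable(lists):
    outf.write("%s\n" % item)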
If your output line always contains exactly one of each '[' and ']', then you can get what's in between with something like:
for line in lst:
    open_split = line.split('[')
    # guard against lines with no '[' at all (the original > 0 check
    # was always true, since split returns at least one element)
    after_open = open_split[1] if len(open_split) > 1 else ""
    closed_split = after_open.split(']')
    in_between_brackets = closed_split[0]
    outf.write("%s\n" % in_between_brackets)
A shorthand (and fragile) version of the above split method can be done like so:
for line in lst:
    outf.write("%s\n" % line.split('[')[1].split(']')[0])
If the expected result from the cleaning is like this:
γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece
australia
brasil
canada
españa
Then this code below would help you.
import re

with open('original.txt') as f:
    data = f.read()

with open('cleaned.txt', 'w') as f:
    # Remove chars like [, ] and '
    result = re.sub("\[|\]|'", '', data)
    # Remove the extra blank lines (replace 2 \n by 1).
    result = re.sub('\\n\\n', '\\n', result)
    f.write(result)
After you remove the stop words you likely don't want to:
line = line.split('\n')
line = list(filter(None, line))
You likely want to inspect what is left and just continue if it is "nothing"
import re  # mocking for NLTK

stopwords_dict = {
    "huffpost": True
}

text_in = '''
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
'''

def word_tokenize(text):
    '''This emulates NLTK.word_tokenize'''
    return re.sub(r'[^\w\s]', '', text).split()

lst = []
for line in text_in.splitlines():
    line = line.lower()
    line = re.sub('\W+', ' ', line)
    line_tokens = word_tokenize(line)
    line_tokens = [token for token in line_tokens if token not in stopwords_dict]
    # after cleaning a line, if there is nothing left skip it
    if not line_tokens:
        continue
    line = ' '.join(line_tokens)
    lst.append(line)

with open("file_out.txt", "w", encoding='utf-8') as file_out:
    for line in lst:
        file_out.write("%s\n" % line)
This will give you a file with the contents of:
εκτέλεσαν τον δημοσιογράφο γιώργο καραϊβάζ στον άλιμο greece
australia
brasil
canada
españa
france
ελλάδα greece
india
italia
日本 japan
한국 korea
québec en français
united kingdom
united states
ελλάδα greece
Which is what you are hoping for (I think).
I managed to remove the empty lines by changing:
line = list(filter(None, line))
lst.append(line)
to
lst.append(line)
lst = list(filter(None, lst))
Apologies, it was a trivial mistake, thank you for your answers.
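For anyone puzzled about why that works: each element appended to lst is itself a list (after word_tokenize and the split('\n') step), and blank input lines end up as empty lists. filter(None, lst) drops falsy elements, and an empty list is falsy. A quick sketch with illustrative values:

lst = [['australia'], [], ['brasil'], []]
lst = list(filter(None, lst))
print(lst)   # [['australia'], ['brasil']]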

How to properly read double newline character in Python

This is a continuation of my previous question posted here, where I was struggling with parsing an RIS file. I have now combined some code into a new parser, which correctly reads a record. Unfortunately, the code stops after the first record, and I have no idea how to differentiate between the end of the file and the double newline characters which separate records. Any ideas?
The input file is provided here:
Record #1 of 306
ID: CN-01160769
AU: Uedo N
AU: Yao K
AU: Muto M
AU: Ishikawa H
TI: Development of an E-learning system.
SO: United European Gastroenterology Journal
YR: 2015
VL: 3
NO: 5 SUPPL. 1
PG: A490
XR: EMBASE 72267184
PT: Journal: Conference Abstract
DOI: 10.1177/2050640615601623
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/769/CN-01160769/frame.html
Record #2 of 306
ID: CN-01070265
AU: Krogh LQ
AU: Bjornshave K
AU: Vestergaard LD
AU: Sharma MB
AU: Rasmussen SE
AU: Nielsen HV
AU: Thim T
AU: Lofgren B
TI: E-learning in pediatric basic life support: A randomized controlled non-inferiority study.
SO: Resuscitation
YR: 2015
VL: 90
PG: 7-12
XR: EMBASE 2015935529
PT: Journal: Article
DOI: 10.1016/j.resuscitation.2015.01.030
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/265/CN-01070265/frame.html
Record #3 of 306
ID: CN-00982835
AU: Worm BS
AU: Jensen K
TI: Does peer learning or higher levels of e-learning improve learning abilities?
SO: Medical education online
YR: 2013
VL: 18
NO: 1
PG: 21877
PM: PUBMED 28166018
XR: EMBASE 24229729
PT: Journal Article; Randomized Controlled Trial
DOI: 10.3402/meo.v18i0.21877
US: http://onlinelibrary.wiley.com/o/cochrane/clcentral/articles/835/CN-00982835/frame.html
And the code is pasted below:
import re

# Function to process a single record
def read_record(infile):
    line = infile.readline()
    line = line.strip()
    if not line:
        # End of file
        return None
    if not line.startswith("Record"):
        raise TypeError("Not a proper file: %r" % line)
    # Read tags and fields
    tags = []
    fields = []
    while 1:
        line = infile.readline().rstrip()
        if line == "":
            # Reached the end of the record or end of the file
            break
        prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
        match = prog.match(line)
        tag = match.groups()[0]
        field = match.groups()[1]
        tags.append(tag)
        fields.append(field)
    return [tags, fields]

# Function to loop through records
def read_records(input_file):
    records = []
    while 1:
        record = read_record(input_file)
        if record is None:
            break
        records.append(record)
    return records

infile = open("test.txt")
for record in read_records(infile):
    print(record)
Learn how to iterate over a file line by line using for line in infile:. There is no need to test for end-of-file against ""; the for loop will do that for you:
for line in infile:
    # remove trailing newlines, and truncate lines that
    # are all-whitespace down to just ''
    line = line.rstrip()
    if line:
        pass  # there is something on this line
    else:
        pass  # this is a blank line - but it is definitely NOT the end-of-file
As suggested by @PaulMcG, here is a solution which iterates over the file line by line.
import re

records = []
count_records = 0
count_newlines = 0
prog = re.compile("^([A-Z][A-Z0-9][A-Z]?): (.*)")
bom = re.compile("^\ufeff")

with open("test.ris") as infile:
    for line in infile:
        line = line.rstrip()
        if bom.match(line):
            line = re.sub("^\ufeff", "", line)
        if line:
            if line.startswith("Record"):
                print("START NEW RECORD")
                count_records += 1
                count_newlines = 0
                current_record = {}
                continue
            match = prog.match(line)
            tag = match.groups()[0]
            field = match.groups()[1]
            if tag == "AU":
                if tag in current_record:
                    current_record[tag].append(field)
                else:
                    current_record[tag] = [field]
            else:
                current_record.update({tag: field})
        else:
            count_newlines += 1
            if count_newlines > 1 and count_records > 0:
                print("# of records: ", count_records)
                print("# of newlines: ", count_newlines)
                records.append(current_record)
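For reference, the end-of-file question disappears entirely if you read the whole file at once and split on the blank-line separators. A compact sketch, assuming records are separated by one or more blank lines and reusing the tag regex from above (utf-8-sig strips a leading BOM; the file name is illustrative):

import re

tag_re = re.compile(r"^([A-Z][A-Z0-9][A-Z]?): (.*)")

with open("test.ris", encoding="utf-8-sig") as f:
    blocks = re.split(r"\n\s*\n", f.read().strip())

records = []
for block in blocks:
    record = {}
    for line in block.splitlines():
        match = tag_re.match(line)
        if match is None:
            continue   # skips the "Record #N of 306" header lines
        tag, field = match.groups()
        if tag == "AU":
            # authors repeat, so collect them into a list
            record.setdefault(tag, []).append(field)
        else:
            record[tag] = field
    records.append(record)

print("# of records:", len(records))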

For-loop does not iterate to next element in a file

I have a problem with my two for-loops in this code:
import sys  # needed for sys.argv below

def batchm():
    searchFile = sys.argv[1]
    namesFile = sys.argv[2]
    writeFile = sys.argv[3]
    countDict = {}
    with open(searchFile, "r") as nlcfile:
        with open(namesFile, "r") as namesList:
            with open(writeFile, "a") as wfile:
                for name in namesList:
                    for line in nlcfile:
                        if name in line:
                            res = line.split("\t")
                            countValue = res[0]
                            countKey = res[-1]
                            countDict[countKey] = countValue
                    countDictMax = sorted(countDict, key=lambda x: x[1], reverse=True)
                    print(countDictMax)
The loop is iterating over this:
namesList:
Greene
Donald
Donald Duck
MacDonald
.
.
.
nlcfile:
123 1999–2000 Northampton Town F.C. season Northampton Town
5 John Simpson Kirkpatrick
167 File talk:NewYorkRangers1940s.png talk
234 Parshu Ram Sharma(Raj Comics) Parshuram Sharma
.
.
.
What I get looks like this:
['Lyn Greene\n', 'Rydbergia grandiflora (Torrey &amp; A. Gray in A. Gray) E. Greene\n', 'Tyler Greene\n', 'Ty Greene\n' ..... ]
and this list appears 48 times, which also happens to be the number of lines in namesList.
Desired output:
("string from namesList" -> "record with highest number in nlcfile")
Greene -> Ly Greene
Donald -> Donald Duck
.
.
.
I think that the two for-loops don't iterate the right way, but I have no clue why. Can anyone see where the problem is?
Thank you very much!
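A note for anyone hitting the same symptom: a file object in Python is a one-shot iterator. The inner for line in nlcfile: consumes the whole file during the first name of the outer loop, so for the remaining 47 names the inner loop body never runs, and the same countDict contents get printed each time. A minimal sketch of the usual fix, reading the searched file into a list once so it can be iterated repeatedly (file names are illustrative):

with open("nlcfile.txt") as nlcfile:
    nlc_lines = nlcfile.readlines()   # a list can be looped over many times

with open("namesList.txt") as namesList:
    for name in namesList:
        name = name.strip()
        for line in nlc_lines:        # restarts from the top for every name
            if name in line:
                print(name, "->", line.rstrip())

Alternatively, calling nlcfile.seek(0) before each inner loop rewinds the file, at the cost of re-reading it from disk once per name.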

Data Analysis using Python

I have 2 CSV files: one with city name, population and humidity; in the second, cities are mapped to states. I want to get the state-wise total population and average humidity. Can someone help? Here is an example:
CSV 1:
CityName,population,humidity
Austin,1000,20
Sanjose,2200,10
Sacramento,500,5
CSV 2:
State,city name
Ca,Sanjose
Ca,Sacramento
Texas,Austin
I would like to get this output (population summed and humidity averaged per state):
Ca,2700,7.5
Texas,1000,20
The posted solution doesn't work because a dictionary will hold only one value per key. I gave up and finally used a loop. The code below is working; I've included the input too.
CSV 1:
state_name,city_name
CA,sacramento
utah,saltlake
CA,san jose
Utah,provo
CA,sanfrancisco
TX,austin
TX,dallas
OR,portland
CSV 2:
city_name population humidity
sacramento 1000 1
saltlake 300 5
san jose 500 2
provo 100 7
sanfrancisco 700 3
austin 2000 4
dallas 2500 5
portland 300 6
import csv
from pandas import read_csv  # assumed: read_csv and .loc below come from pandas

def mapping_within_dataframe(self, file1, file2, file3):
    self.csv1 = file1
    self.csv2 = file2
    self.outcsv = file3
    one_state_data = 0
    outfile = csv.writer(open(self.outcsv, 'w'), delimiter=',')  # was open('self.outcsv', ...): quoting bug
    state_city = read_csv(self.csv1)
    city_data = read_csv(self.csv2)
    all_state = list(set(state_city.state_name))
    for one_state in all_state:
        one_state_cities = list(state_city.loc[state_city.state_name == one_state, "city_name"])
        one_state_data = 0
        for one_city in one_state_cities:
            one_city_data = city_data.loc[city_data.city_name == one_city, "population"].sum()
            one_state_data = one_state_data + one_city_data
        print one_state, one_state_data
        outfile.writerows(whatever)  # placeholder from the original post
def output(file1, file2):
    f = lambda x: x.strip()  # strips newline and whitespace characters
    with open(file1) as cities:
        with open(file2) as states:
            states_dict = {}
            cities_dict = {}
            for line in states:
                line = line.split(',')
                states_dict[f(line[0])] = f(line[1])
            for line in cities:
                line = line.split(',')
                cities_dict[f(line[0])] = (int(f(line[1])), int(f(line[2])))
            for state, city in states_dict.iteritems():
                try:
                    print state, cities_dict[city]
                except KeyError:
                    pass

output(CSV1, CSV2)  # these are the names of the files
This gives the output you wanted. Just make sure the names of cities in both files are the same in terms of capitalization.
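Since both inputs are plain CSV, the whole aggregation can also be done in a few lines of pandas. A sketch, assuming the two files from the original question are saved as cities.csv (CityName,population,humidity) and states.csv (State,city name); the file names are illustrative:

import pandas as pd

cities = pd.read_csv("cities.csv")   # CityName,population,humidity
states = pd.read_csv("states.csv")   # State,city name

# join on the city name, then aggregate per state
merged = states.merge(cities, left_on="city name", right_on="CityName")
result = merged.groupby("State").agg(population=("population", "sum"),
                                     humidity=("humidity", "mean"))
print(result)
# expected, from the sample data:
#        population  humidity
# State
# Ca           2700       7.5
# Texas        1000      20.0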

How to compare a zero padded number in dictionary with a non zero padded number

I have two files and I need to compare them and update the values of the 1st file from the 2nd file.
My first file is as below,
SeqNo City State
1 Chicago IL
2 Boston MA
3 New York NY
4 Los Angeles CA
5 Seattle WA
My second file is as below,
SeqNo City State NewSeqNo
005 Seattle WA 001
001 Chicago IL 002
004 Los Angeles CA 003
002 Boston MA 004
003 New York NY 005
I have the following code to update the SeqNo in the first file with the value of NewSeqNo from the second file and save the result as a third file. But the lookup never matches, because SeqNo is zero-padded in the second file whereas it's not in the first:
import csv

lookup = {}
with open('secondfile') as f:
    reader = csv.reader(f)
    for line in reader:
        oldseq, city, state, newseq = line
        lookup[oldseq] = newseq

with open('firstfile') as f, open('outfile', 'w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        if seq in lookup:
            seq = lookup[seq]
        writer.writerow([seq, city, state])
For example, the output of the third file should be:
NewSeqNo City State
002 Chicago IL
004 Boston MA
005 New York NY
003 Los Angeles CA
001 Seattle WA
Any help is appreciated.
Convert your 'numbers' to integers to remove the padding before storing in the dictionary:
import csv

lookup = {}
with open('secondfile') as f:
    reader = csv.reader(f)
    for line in reader:
        oldseq, city, state, newseq = line
        lookup[int(oldseq)] = newseq

with open('firstfile') as f, open('outfile', 'w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        if int(seq) in lookup:
            seq = lookup[int(seq)]
        writer.writerow([seq, city, state])
Now lookup has integer keys, and when looking up matching keys in the second loop, we pass in integer keys again.
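The key fact relied on here is that int() ignores leading zeros, so '005' and '5' collapse to the same dictionary key. For example:

print(int("005"))               # 5
print(int("005") == int("5"))   # True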
If you know that it is always padded to a length of 3, you can instead, when reading your first file, convert your seq to an int and use format to write a padded value:
with open('firstfile') as f, open('outfile', 'w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        # Convert to padded value
        seq = "{:03}".format(int(seq))
        if seq in lookup:
            seq = lookup[seq]
        writer.writerow([seq, city, state])
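As a side note, str.zfill performs the same left zero-padding and may read more directly than format; a tiny sketch of the equivalence:

seq = 5
print("{:03}".format(seq))   # 005
print(str(seq).zfill(3))     # 005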
#!/usr/bin/python
old_dict = dict()
new_dict = dict()

with open('old', 'r') as fh:
    for l in fh.readlines():
        r = l.split()
        if r:
            old_dict.setdefault(int(r[0]), None)
            old_dict[int(r[0])] = ' '.join(r[1:])

with open('new', 'r') as fh:
    for l in fh.readlines():
        r = l.split()
        if r:
            k = ' '.join(r[1:-1])
            new_dict.setdefault(k, None)
            new_dict[k] = int(r[-1])

for i, j in old_dict.iteritems():
    d = j.split()
    print '%0.3d %s %s' % (new_dict[j], ' '.join(d[0:-1]), d[-1])
Output:
002 Chicago IL
004 Boston MA
005 New York NY
003 Los Angeles CA
001 Seattle WA
