I have a document containing a list. When I write the document to a file, it prints:
['γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece']
['australia']
[]
['brasil']
[]
['canada']
[]
['españa']
[]
What I want is to remove the [] characters. So far I've done the following.
for file_name in list_of_files:
    with open(file_name, 'r', encoding="utf-8") as inf:
        lst = []
        for line in inf:
            # special characters removal
            line = line.lower()
            line = re.sub(r'\W+', ' ', line)
            line = word_tokenize(line)
            # stopwords removal
            line = ' '.join([word for word in line if word not in stopwords_dict])
            line = line.split('\n')
            line = list(filter(None, line))
            lst.append(line)
Which removes some '' from inside the empty [], which seems reasonable. I have tried several approaches such as strip, remove() and [x for x in strings if x] without success. I am rather inexperienced, what am I missing?
update:
the initial text looks like this
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
Update:
And I am writing the list to a file like this
for line in lst:
    outf.write("%s\n" % line)
outf.close()
It looks like you're writing each list itself (its string representation) rather than the items it contains. If you want to write the items of a list in Python without the surrounding '[' and ']', loop over each item and write it, like so:
for item in lst:
    outf.write("%s\n" % item)
and for a list of lists:
for sublist in lst:
    for item in sublist:
        outf.write("%s\n" % item)
If each output line is a string that always contains exactly one '[' and one ']', you can get what's in between with something like:
for line in lst:
    open_split = line.split('[')
    # split() always returns at least one element, so check for a second one
    after_open = open_split[1] if len(open_split) > 1 else ""
    closed_split = after_open.split(']')
    in_between_brackets = closed_split[0]
    outf.write("%s\n" % in_between_brackets)
A shorthand but fragile version of the above split method can be done like so:
for line in lst:
    outf.write("%s\n" % line.split('[')[1].split(']')[0])
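The loops above can be combined with a check that skips empty inner lists. This is a minimal sketch, assuming the data is a list of lists of strings as in the question. The helper name `write_non_empty` and the `io.StringIO` buffer standing in for the real output file are illustrative assumptions, not part of the original code.

```python
import io

def write_non_empty(rows, outf):
    """Write each non-empty inner list as one space-joined line."""
    for row in rows:
        if not row:  # skip empty lists, which would otherwise be written as []
            continue
        outf.write(" ".join(row) + "\n")

# Sample data mirroring the output shown in the question
rows = [["australia"], [], ["brasil"], [], ["canada"]]
buffer = io.StringIO()  # stands in for the real output file
write_non_empty(rows, buffer)
print(buffer.getvalue())
```

Joining the items avoids relying on the string form of the list, so no brackets or quotes ever reach the file.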
If the expected result after cleaning looks like this
γύρισε στις φυλακές δομοκού κουφοντίνας huffpost greece
australia
brasil
canada
españa
Then the code below should help you.
import re
with open('original.txt', encoding='utf-8') as f:
    data = f.read()
with open('cleaned.txt', 'w', encoding='utf-8') as f:
    # Remove chars like [, ] and '
    result = re.sub(r"\[|\]|'", '', data)
    # Remove the extra lines (replace 2 \n by 1).
    result = re.sub(r'\n\n', '\n', result)
    f.write(result)
After you remove the stop words you likely don't want to do this:
line = line.split('\n')
line = list(filter(None, line))
Instead, you likely want to inspect what is left and just continue if it is "nothing":
import re  # mocking for NLTK
stopwords_dict = {
    "huffpost": True
}
text_in = '''
Εκτέλεσαν τον δημοσιογράφο Γιώργο Καραϊβάζ στον Άλιμο | HuffPost Greece
Australia
Brasil
Canada
España
France
Ελλάδα (Greece)
India
Italia
日本 (Japan)
한국 (Korea)
Québec (en français)
United Kingdom
United States
Ελλάδα (Greece)
'''
def word_tokenize(text):
    '''
    This emulates NLTK.word_tokenize
    '''
    return re.sub(r'[^\w\s]', '', text).split()
lst = []
for line in text_in.splitlines():
    line = line.lower()
    line = re.sub(r'\W+', ' ', line)
    line_tokens = word_tokenize(line)
    line_tokens = [token for token in line_tokens if token not in stopwords_dict]
    # after cleaning a line, if there is nothing left skip
    if not line_tokens:
        continue
    line = ' '.join(line_tokens)
    lst.append(line)
with open("file_out.txt", "w", encoding='utf-8') as file_out:
    for line in lst:
        file_out.write("%s\n" % line)
This will give you a file with the contents of:
εκτέλεσαν τον δημοσιογράφο γιώργο καραϊβάζ στον άλιμο greece
australia
brasil
canada
españa
france
ελλάδα greece
india
italia
日本 japan
한국 korea
québec en français
united kingdom
united states
ελλάδα greece
Which is what you are hoping for (I think).
I managed to remove the empty lines by changing:
line = list(filter(None, line))
lst.append(line)
to
lst.append(line)
lst = list(filter(None, lst))
Apologies, it was a trivial mistake, thank you for your answers.
Related
I have been practicing on iterating through dictionary and list in Python.
The source file is a csv document containing Country and Capital. It seems I had to go through 2 for loops for country_dict in order to produce the same print result for country_list and capital_list.
Is there a better way to do this with a Python dictionary?
The code:
import csv
path = #Path_to_CSV_File
country_list=[]
capital_list=[]
country_dict={'Country':[],'Capital':[]}
with open(path, mode='r') as data:
    for line in csv.DictReader(data):
        locals().update(line)
        country_dict['Country'].append(Country)
        country_dict['Capital'].append(Capital)
        country_list.append(Country)
        capital_list.append(Capital)
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Iterating through Dictionary using for loops---------------------------
if i >= (len(country_dict['Country'])-1):
    print("out of bound")
for count1, element in enumerate(country_dict['Country']):
    if count1==i:
        print('Country = ' + element)
for count2, element in enumerate(country_dict['Capital']):
    if count2==i:
        print('Capital = ' + element)
#--------------------------------Direct print for list----------------------------------------
print('Country = ' + country_list[i] + '\nCapital = ' + capital_list[i])
The output:
Country = Djibouti
Capital = Djibouti (city)
Country = Djibouti
Capital = Djibouti (city)
The CSV file content:
Country,Capital
Algeria,Algiers
Angola,Luanda
Benin,Porto-Novo
Botswana,Gaborone
Burkina Faso,Ouagadougou
Burundi,Gitega
Cabo Verde,Praia
Cameroon,Yaounde
Central African Republic,Bangui
Chad,N'Djamena
Comoros,Moroni
"Congo, Democratic Republic of the",Kinshasa
"Congo, Republic of the",Brazzaville
Cote d'Ivoire,Yamoussoukro
Djibouti,Djibouti (city)
Egypt,Cairo
Equatorial Guinea,"Malabo (de jure), Oyala (seat of government)"
Eritrea,Asmara
Eswatini (formerly Swaziland),"Mbabane (administrative), Lobamba (legislative, royal)"
Ethiopia,Addis Ababa
Gabon,Libreville
Gambia,Banjul
Ghana,Accra
Guinea,Conakry
Guinea-Bissau,Bissau
Kenya,Nairobi
Lesotho,Maseru
Liberia,Monrovia
Libya,Tripoli
Madagascar,Antananarivo
Malawi,Lilongwe
Mali,Bamako
Mauritania,Nouakchott
Mauritius,Port Louis
Morocco,Rabat
Mozambique,Maputo
Namibia,Windhoek
Niger,Niamey
Nigeria,Abuja
Rwanda,Kigali
Sao Tome and Principe,São Tomé
Senegal,Dakar
Seychelles,Victoria
Sierra Leone,Freetown
Somalia,Mogadishu
South Africa,"Pretoria (administrative), Cape Town (legislative), Bloemfontein (judicial)"
South Sudan,Juba
Sudan,Khartoum
Tanzania,Dodoma
Togo,Lomé
Tunisia,Tunis
Uganda,Kampala
Zambia,Lusaka
Zimbabwe,Harare
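I am not sure I understand your question, but please check out the code below. It stores one dictionary per row, keyed by the row index, so a single lookup replaces both loops.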
import csv
path = #Path_to_CSV_File
country_dict={}
with open(path, mode='r') as data:
    lines = csv.DictReader(data)
    for idx,line in enumerate(lines):
        # read the fields from the row directly instead of via locals().update(line)
        country_dict[idx] = {"Country": line["Country"], "Capital": line["Capital"]}
i=14 #set pointer value to the 15th row in the csv document
#---------------------- Iterating through Dictionary using for loops---------------------------
country_info = country_dict.get(i)
#--------------------------------Direct print for list----------------------------------------
print('Country = ' + country_info['Country'] + '\nCapital = ' + country_info['Capital'])
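An alternative is to keep the rows in a plain list, since the lookup is by position. This is a minimal sketch, assuming a few sample rows in an `io.StringIO` object in place of the real CSV file. The variable names `sample` and `rows` are illustrative.

```python
import csv
import io

# A few rows standing in for the CSV file (assumed sample, not the full file)
sample = io.StringIO(
    "Country,Capital\n"
    "Algeria,Algiers\n"
    "Angola,Luanda\n"
    '"Congo, Republic of the",Brazzaville\n'
)
rows = list(csv.DictReader(sample))  # each row is a dict keyed by the header
i = 2  # zero-based index of the row to show
if 0 <= i < len(rows):
    print('Country = ' + rows[i]['Country'] + '\nCapital = ' + rows[i]['Capital'])
else:
    print("out of bound")
```

Because `csv.DictReader` already yields one dictionary per row, no loop over the stored data is needed to reach row `i`.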
My list is formatted like:
gymnastics_school,participant_name,all-around_points_earned
I need to divide it up by schools but keep the scores.
import collections
def main():
    names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
    Data = collections.namedtuple("Data", names)
    data = []
    with open('state_meet.txt','r') as f:
        for line in f:
            line = line.strip()
            items = line.split(',')
            items[2] = float(items[2])
            data.append(Data(*items))
These are examples of how they're set up:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3
Carrollton Boys,Cameron M.,61.6
Carrollton Boys,Zachary W.,58.7
Carrollton Boys,Samuel B.,58.6
La Fayette Boys,Nate S.,63
La Fayette Boys,Kaden C.,62
La Fayette Boys,Cohan S.,59.1
La Fayette Boys,Cooper J.,56.101
La Fayette Boys,Avi F.,53.401
La Fayette Boys,Frederic T.,53.201
Columbus,Noah B.,50.3
Savannah Metro,Levi B.,52.801
Savannah Metro,Taylan T.,52
Savannah Metro,Jacob S.,51.5
SAAB Gymnastics,Dawson B.,58.1
SAAB Gymnastics,Dean S.,57.901
SAAB Gymnastics,William L.,57.101
SAAB Gymnastics,Lex L.,52.501
Suwanee Gymnastics,Colin K.,57.3
Suwanee Gymnastics,Matthew B.,53.201
After processing it should look like:
Lanier City Gymnastics: participants (4)
as its own list
Carrollton Boys (3)
as its own list
La Fayette Boys (6)
etc.
I would recommend putting them in a dictionary keyed by school:
data = {}
with open('state_meet.txt','r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        if items[0] in data:
            data[items[0]].append(items[1:])
        else:
            data[items[0]] = [items[1:]]
A school's participants can then be accessed in the following way:
>>> data['Lanier City Gymnastics']
[['Ben W.', 55.301], ['Alex W.', 54.801], ['Sky T.', 51.2], ['William G.', 47.3]]
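The same grouping can be written with `collections.defaultdict`, which removes the explicit membership test. This is a minimal sketch, assuming a few sample lines in an `io.StringIO` object in place of `state_meet.txt`. The names `sample` and `by_school` are illustrative.

```python
import collections
import io

# Sample lines standing in for state_meet.txt (copied from the question)
sample = io.StringIO(
    "Lanier City Gymnastics,Ben W.,55.301\n"
    "Lanier City Gymnastics,Alex W.,54.801\n"
    "Carrollton Boys,Cameron M.,61.6\n"
)
by_school = collections.defaultdict(list)  # missing keys start as empty lists
for line in sample:
    school, name, points = line.strip().split(',')
    by_school[school].append((name, float(points)))
for school, participants in by_school.items():
    print("%s (%d)" % (school, len(participants)))
```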
EDIT:
Assuming you need the whole dataset as a list first and then want to divide it into smaller lists, you can generate the dictionary from the list:
data = []
with open('state_meet.txt','r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        data.append(items)
#perform median or other operation on your data
nested_data = {}
for items in data:
    if items[0] in nested_data:
        nested_data[items[0]].append(items[1:])
    else:
        nested_data[items[0]] = [items[1:]]
# nested_data[school_name] now holds the list of participants for that school
When you need to get a subset of a list you can use slicing:
mylist[start:stop:step]
where start, stop and step are optional (see link for more comprehensive introduction)
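A short illustration of the slicing forms, using a hypothetical `scores` list built from values in the question:

```python
scores = [55.301, 54.801, 51.2, 47.3]
print(scores[:2])   # the first two items
print(scores[1:])   # everything after the first item
print(scores[::2])  # every second item, starting from the first
```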
I have two files and I need to compare both of them & update the value of the 1st file from the 2nd file.
My first file is as below,
SeqNo City State
1 Chicago IL
2 Boston MA
3 New York NY
4 Los Angeles CA
5 Seattle WA
My second file is as below,
SeqNo City State NewSeqNo
005 Seattle WA 001
001 Chicago IL 002
004 Los Angeles CA 003
002 Boston MA 004
003 New York NY 005
I have the following code to update the SeqNo in the first file with the NewSeqNo value from the second file and save the result as a third file. But it throws a KeyError because SeqNo is zero-padded in the second file whereas it is not in the first:
import csv
lookup = {}
with open('secondfile') as f:
    reader = csv.reader(f)
    for line in reader:
        oldseq, city, state, newseq = line
        lookup[oldseq] = newseq
with open('firstfile') as f, open('outfile','w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        if seq in lookup:
            seq = lookup[seq]
        writer.writerow([seq, city, state])
For example, the output of the third file should be,
NewSeqNo City State
002 Chicago IL
004 Boston MA
005 New York NY
003 Los Angeles CA
001 Seattle WA
Any help is appreciated
Convert your 'numbers' to integers to remove the padding before storing in the dictionary:
import csv
lookup = {}
with open('secondfile') as f:
    reader = csv.reader(f)
    for line in reader:
        oldseq, city, state, newseq = line
        lookup[int(oldseq)] = newseq
with open('firstfile') as f, open('outfile','w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        if int(seq) in lookup:
            seq = lookup[int(seq)]
        writer.writerow([seq, city, state])
Now lookup has integer keys, and when looking up matching keys in the second loop, we pass in integer keys again.
If you know that it is always padded to a length of 3, then when reading your first file you can convert your seq to an int and use format to produce a padded value:
with open('firstfile') as f, open('outfile','w') as w:
    reader = csv.reader(f)
    writer = csv.writer(w)
    for line in reader:
        seq, city, state = line
        # Convert to padded value
        seq = "{:03}".format(int(seq))
        if seq in lookup:
            seq = lookup[seq]
        writer.writerow([seq, city, state])
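To see the padding step in isolation, here is a minimal sketch, assuming the keys in `lookup` are the zero-padded strings from the second file. The helper name `normalise` and the two sample entries are illustrative.

```python
lookup = {"005": "001", "001": "002"}  # zero-padded keys, as in the second file

def normalise(seq, width=3):
    """Return seq as a zero-padded string of the given width."""
    return str(int(seq)).zfill(width)

print(normalise("1"))            # 001
print(lookup[normalise("5")])    # 001
print("{:03}".format(int("5")))  # 005, the same result as zfill
```

Normalising both sides to one representation, either padded strings or integers, is what makes the dictionary lookup succeed.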
#!/usr/bin/python
old_dict = dict()
new_dict = dict()
with open('old', 'r') as fh:
    for l in fh.readlines():
        r = l.split()
        if r:
            old_dict.setdefault(int(r[0]), None)
            old_dict[int(r[0])] = ' '.join(r[1:])
with open('new', 'r') as fh:
    for l in fh.readlines():
        r = l.split()
        if r:
            k = ' '.join(r[1:-1])
            new_dict.setdefault(k, None)
            new_dict[k] = int(r[-1])
for i,j in old_dict.iteritems():
    d = j.split()
    print '%0.3d %s %s' % (new_dict[j], ' '.join(d[0:-1]), d[-1])
Output:
002 Chicago IL
004 Boston MA
005 New York NY
003 Los Angeles CA
001 Seattle WA
wordlist A: book jesus christ son david son abraham jacob judah his brothers perez amminadab
wordlist B: akwụkwọ jizọs kraịst nwa devid nwa ebreham jekọb juda ya ụmụnne pirez aminadab
file.txt A:
the book of the history of jesus christ , son of david , son of abraham :
abraham became father to isaac ; isaac became father to jacob ; jacob became father to judah and his brothers ;
file.txt B:
akwụkwọ nke kọrọ akụkọ banyere jizọs kraịst , nwa devid , nwa ebreham :
ebreham mụrụ aịzik ; aịzik amụọ jekọb ; jekọb amụọ juda na ụmụnne ya ndị ikom ;
I have the 2 word lists above (say A and B) in 2 different languages. Each contains the translations of the other's words, in the same order. My task is to run each word list through the text file of its own language (word list A through file.txt A, and word list B through file.txt B), then return one line for each line of the text files, containing the paired index numbers at which the words of both word lists were found on that line, like:
2:1 7:6 8:7 10:9 12:10 14:12 16:13 [ 2:1 = 2 index of book in txt.file A and 1-akwụkwọ in txt.file B and so on]
1:1 11:6 13:8 17:10 19:12 20:13 [ 1:1 = 1 index of abraham in txt.file A and 1- ebreham in txt.file B and so on].
See the code below:
import sys
def wordlist(filename):
    wordlist = []
    with open(filename, 'rb') as f:
        for line in f:
            wordlist.append(line)
    return wordlist
eng = []
for lines in open('eng_try.txt', 'rb'):
    line = lines.strip()
    eng.append(line)
igb = []
for lines in open('igb_try.txt', 'rb'):
    line = lines.strip()
    igb.append(line)
i = 0
while i < len(eng):
    eng_igb_verse_pair = eng[i] + " " + igb[i]
    line = eng_igb_verse_pair.strip().split()
    for n in range(0, len(wordlist('eng_wordlist.txt'))):
        eng_word = wordlist('eng_wordlist.txt').pop(n)
        igb_word = wordlist('igb_wordlist.txt').pop(n)
        if eng_word in line and igb_word in line:
            print '{0} {1}:{2}'.format(i, line.index[eng_word], line.index[igb_word])
    i += 1
This actually prints nothing. I know my problem is in the last segment of the program. Can someone help? I am not a very experienced Python programmer. Apologies if I didn't explain this well.
You mean something like this:
import sys
def checkLine(line_eng, line_igb):
    eng_words = line_eng.split()
    igb_words = line_igb.split()
    for word in eng_words:
        if word in eng:
            igb_word = igb[eng.index(word)]
            print "%d:%d" % ( eng_words.index(word)+1, igb_words.index(igb_word)+1),
def linelist(filename):
    lineslist = []
    for line in open(filename, 'rb'):
        lineslist.append(line)
    return lineslist
eng = []
for lines in open('eng_try.txt', 'rb'):
    line = lines.strip()
    for w in line.split():
        eng.append(w)
igb = []
for lines in open('igb_try.txt', 'rb'):
    line = lines.strip()
    for w in line.split():
        igb.append(w)
eng_lines = linelist("eng_wordlist.txt")
igb_lines = linelist("igb_wordlist.txt")
for n in range(0, len(eng_lines)):
    print "%d. " % (n+1),
    checkLine(eng_lines[n],igb_lines[n])
    print
For your files I got this result:
1. 2:1 7:6 8:7 10:9 12:10 10:9 16:13
2. 1:1 11:7 11:7 17:11 19:14 20:13
BR
Parasit Hendersson
In the case below I want to match the string "Singapore", where "S" must always be a capital and the rest of the letters may be in lower or upper case. But in the string below "s" is in lower case and it still gets matched by the search condition. Can anybody let me know how to implement this?
import re
st = "Information in sinGapore "
if re.search("S""(?i)(ingapore)" , st):
    print "matched"
Singapore => matched
sIngapore => notmatched
SinGapore => matched
SINGAPORE => matched
As commented, the Ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be matching Singapore case-insensitive, and then checking that the first letter is S:
reg=re.compile("singapore", re.I)
>>> s="Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
False
>>> s="Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
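The two-step check can be wrapped in a small helper. This is a minimal sketch for Python 3. The function name `has_capital_s_singapore` is hypothetical, and using `finditer` to inspect every occurrence rather than only the first is a variation on the code above, not part of it.

```python
import re

SINGAPORE_RE = re.compile("singapore", re.I)

def has_capital_s_singapore(text):
    """Return True if any case-insensitive 'singapore' in text starts with 'S'."""
    return any(m.group().startswith("S") for m in SINGAPORE_RE.finditer(text))

for sample in ("Singapore", "sIngapore", "SinGapore", "SINGAPORE"):
    print(sample, "=>", "matched" if has_capital_s_singapore(sample) else "notmatched")
```

Checking every occurrence means a text such as "singapore and Singapore" still counts as a match, which the single `search` call would miss.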
Since you want to set a GV code according to the matched phrase (a single name or several names separated by blanks, as I understand it), there must be a step in which the code is looked up in a dictionary according to the matched phrase.
So it is easy to take advantage of this step to perform the test on the first letter (which must be uppercase) or on the first name in the phrase, which no regex is capable of.
I chose certain conditions to make up the test. For example, a dot in a first name is not mandatory, but uppercase letters are. These conditions can easily be changed as needed.
EDIT 1
import re
def regexize(cntry):
    def doot(x):
        return '\.?'.join(ch for ch in x) + '\.?'
    to_join = []
    for c in cntry:
        cspl = c.split(' ',1)
        if len(cspl)==1: # 'Singapore','Austria',...
            to_join.append('(%s)%s'
                           % (doot(c[0]), doot(c[1:])))
        else: # 'Den LMM','LMM Den',....
            to_join.append('(%s) +%s'
                           % (doot(cspl[0]),
                              doot(cspl[1].strip(' ').lower())))
    pattern = '|'.join(to_join).join('()')
    return re.compile(pattern,re.I)
def code(X,CNTR,r = regexize):
    r = regexize(CNTR)
    for ma in r.finditer(X):
        beg = ma.group(1).split(' ')[0]
        if beg==ma.group(1):
            GV = countries[beg[0]+beg[1:].replace('.','').lower()] \
                 if beg[0].upper()==beg[0] else '- bad match -'
        else:
            try:
                k = (ki for ki in countries.iterkeys()
                     if beg.replace('.','')==ki.split(' ')[0]).next()
                GV = countries[k]
            except StopIteration:
                GV = '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'
print '\n'.join(res for res in code(s,countries))
EDIT 2
I improved the code. It's shorter and more readable.
The assert(...) statement verifies that the keys of the dictionary are well formed for the purpose.
import re
def doot(x):
    return '\.?'.join(ch for ch in x) + '\.?'
def regexize(labels,doot=doot,
             wg2 = '(%s) *( %s)',wnog2 = '(%s)(%s)',
             ri = re.compile('(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
    to_join = []
    modlabs = {}
    for K in labels.iterkeys():
        g1,g2,g3 = ri.match(K).groups()
        to_join.append((wg2 if g2 else wnog2)
                       % (doot(g1), doot(g3.lower())))
        modlabs[g1+g2+g3.lower()] = labels[K]
    return (re.compile('|'.join(to_join), re.I), modlabs)
def code(X,labels,regexize = regexize):
    reglab,modlabs = regexize(labels)
    for ma in reglab.finditer(X):
        a,b = tuple(x for x in ma.groups() if x)
        k = (a + b.lower()).replace('.','')
        GV = modlabs[k] if k in modlabs else '- bad match -'
        yield ' {!s:15} {!s:^13}'.format(a+b, GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
assert(all('.' not in k and
(k.count(' ')==1 or k[0].upper()==k[0])
for k in countries))
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'.join(res for res in code(s,countries))
You could write a simple lambda to generate the ugly-but-all-re-solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
(c.upper(),c.lower())
for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print leading_caps_re('Kuala Lumpur')
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore") , st):
    ...etc...
and the ugliness of the RE would be purely internal.
Interestingly,
/((S)((?i)ingapore))/
does the right thing in Perl but doesn't seem to work as needed in Python. To be fair, the Python docs spell it out clearly: (?i) alters the whole regexp.
This is the BEST answer:
(?-i:S)(?i)ingapore
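In Python, a global `(?i)` that is not at the start of the pattern has been deprecated since 3.6 and is an error from 3.11, so the pattern above may need adapting there. A minimal sketch, assuming Python 3.6 or later, that uses a scoped inline flag to get the same effect:

```python
import re

# Assumes Python 3.6+: (?i:...) switches on case-insensitivity for that group only
pattern = re.compile(r"S(?i:ingapore)")

for sample in ("Singapore", "sIngapore", "SinGapore", "SINGAPORE"):
    print(sample, "=>", "matched" if pattern.search(sample) else "notmatched")
```

The leading `S` stays case-sensitive because it sits outside the scoped group.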