Appending to a list from a read text file (python3)

I am attempting to read a txt file and create a dictionary from the text. A sample txt file is:
John likes Steak
John likes Soda
John likes Cake
Jane likes Soda
Jane likes Cake
Jim likes Steak
My desired output is a dictionary with the name as the key, and the "likes" as a list of the respective values:
{'John':('Steak', 'Soda', 'Cake'), 'Jane':('Soda', 'Cake'), 'Jim':('Steak')}
I keep running into an error when appending my stripped word to my list, and have tried a few different ways:
pred = ()
prey = ()
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    pred.append = (line[0])
    prey.append = (line[2])
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
and also:
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    if line[0] in chain:
        chain[line[0]] = [0, line[2]]
    else:
        chain[line[0]] = line[2]
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
any ideas?

This will do it (without needing to read the entire file into memory first):
likes = {}
for who, _, what in (line.split()
                     for line in (line.strip()
                                  for line in open('likes.txt', 'rt'))):
    likes.setdefault(who, []).append(what)
print(likes)
Output:
{'Jane': ['Soda', 'Cake'], 'John': ['Steak', 'Soda', 'Cake'], 'Jim': ['Steak']}
Alternatively, to simplify things slightly, you could use a temporary collections.defaultdict:
from collections import defaultdict

likes = defaultdict(list)
for who, _, what in (line.split()
                     for line in (line.strip()
                                  for line in open('likes.txt', 'rt'))):
    likes[who].append(what)
print(dict(likes))  # convert to plain dictionary and print

Your input is a sequence of sequences. Parse the outer sequence first, parse each item next.
Your outer sequence is:
Statement
<empty line>
Statement
<empty line>
...
Assume that f is the open file with the data. Read each statement and return a list of them:
def parseLines(f):
    result = []
    for line in f:  # file objects iterate over text lines
        line = line.strip()  # drop the trailing newline, else "empty" lines are never empty
        if line:  # line is non-empty
            result.append(line)
    return result
Note that the function above accepts a much wider grammar: it allows arbitrarily many empty lines between non-empty lines, and two non-empty lines in a row. But it does accept every correct input.
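For example, with the sample data from the question in a file (say likes.txt, a name assumed here):
with open('likes.txt') as f:
    statements = parseLines(f)
# statements == ['John likes Steak', 'John likes Soda', ..., 'Jim likes Steak']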
Then, your statement is a triple: X likes Y. Parse it by splitting it by whitespace, and checking the structure. The result is a correct pair of (x, y).
def parseStatement(s):
    parts = s.split()  # by default, it splits on all whitespace
    assert len(parts) == 3, "Syntax error: %r is not three words" % s
    x, likes, y = parts  # unpack the list of 3 items into variables
    assert likes == "likes", "Syntax error: %r instead of 'likes'" % likes
    return x, y
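A quick check of how it behaves (sketch):
>>> parseStatement("John likes Steak")
('John', 'Steak')
>>> parseStatement("John hates Steak")
Traceback (most recent call last):
  ...
AssertionError: Syntax error: 'hates' instead of 'likes'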
Make a list of pairs for each statement:
pairs = [parseStatement(s) for s in parseLines(f)]
Now you need to group values by key. Let's use defaultdict which supplies a default value for any new key:
from collections import defaultdict

the_answer = defaultdict(list)  # the default value is an empty list
for key, value in pairs:
    the_answer[key].append(value)
    # we can append because the_answer[key] is set to an empty list on first access
So the_answer is what you need, except that it uses lists as dict values instead of tuples. This should be enough for you to finish your homework.
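If you really want tuples as values, exactly as in your desired output, one extra line converts them at the end (a small sketch):
the_answer = {key: tuple(values) for key, values in the_answer.items()}
# {'John': ('Steak', 'Soda', 'Cake'), 'Jane': ('Soda', 'Cake'), 'Jim': ('Steak',)}
(Note that a one-element tuple is written ('Steak',) with a trailing comma; ('Steak') is just a parenthesized string.)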

dic = {}
for i in f.readlines():
    if i.strip():
        parts = i.split()
        if parts[0] in dic:
            dic[parts[0]].append(parts[2])
        else:
            dic[parts[0]] = [parts[2]]
print(dic)
This should do it.
Here we iterate through f.readlines(), f being the file object, and for each line we fill up the dictionary, using the first word of the split as the key and the last word as the value.

Related

Remove partial column duplicates from a txt file

How can I keep only the first line for each value of the first column? For example, in.txt contains the lines:
red,color,color
red,color,color
blue,color,color
blue,color,color
Desired outcome:
red,color,color
blue,color,color
with open(infile, 'r', encoding="cp437", errors="ignore") as in_file, open(outfile, 'w', encoding="cp437", errors="ignore") as out_file:
    seen = set()
    for line in in_file:
        if line.split(',')[0] == (str(x).split(',')[0] for x in seen):
            continue
        seen.add(line)
        out_file.write(line)
(str(x).split(',')[0] for x in seen) is a generator expression; it will never compare equal to a string like line.split(',')[0].
If you want to check if a string is equal to any string in an iterable, you could use any:
if any(line.split(',')[0] == str(x).split(',')[0] for x in seen):
or collect the results of the generator expression in a list and use the in operator for membership test:
if line.split(',')[0] in [str(x).split(',')[0] for x in seen]:
But: why not store just the first part of the line (line.split(',')[0]) in the seen set, instead of the whole line? This will greatly simplify your code:
seen = set()
for line in in_file:
    first_part = line.split(',')[0]
    if first_part in seen:
        continue
    seen.add(first_part)
    out_file.write(line)

python and iteration specifically "for line in first_names"

If I pop an item off a list in Python, I seem to shoot myself in the foot by messing up the list's total length? See the following example.
Also, am I just being an idiot, or is this normal behaviour? And is there a better way to achieve what I am trying to do?
first_names = []
last_names = []
approved_names = []
blacklisted_names = []
loopcounter = 0

with open("first_names.txt") as file:
    first_names = file.readlines()
    #first_names = [line.rstrip() for line in first_names]
    for line in first_names:
        line = line.strip("\r\n")
        line = line.strip("\n")
        line = line.strip(" ")
        if line == "":
            first_names.pop(loopcounter)
            #first_names.pop(first_names.index(""))  # Does not work as expected
            #loopcounter -= 1  # Does not work as expected either......
        loopcounter += 1
loopcounter = 0

def save_names():
    with open("first_names2.txt", 'wt', encoding="utf-8") as file:
        file.writelines(first_names)
and the resulting files:
first_names.txt
{
Abbey
Abbie
Abbott
Abby
Abe
Abie
Abstinence
Acton
}
And the output file
{
Abbey
Abbie
Abbott
Abe
Abie
Abstinence
Acton
}
list.pop() removes an item from a list and returns the value (see e.g. this ref). For the very basic task of cleaning and writing the list of names, an easy edit would be:
with open("first_names.txt") as file:
first_names = file.readlines()
cleaned_lines = []
for line in first_names:
clean_l = line.strip("\r\n").strip("\n").strip(" ")
if clean_l != "":
cleaned_lines.append(clean_l)
with open("first_names2.txt",'wt',encoding="utf-8") as file:
file.writelines(cleaned_lines)
If you don't want to create a cleaned copy of the list first_names, you could iteratively append single lines to the file as well.
with open("first_names.txt") as file:
first_names = file.readlines()
with open("first_names2.txt",'wt',encoding="utf-8") as file:
for line in first_names:
clean_l = line.strip("\r\n").strip("\n").strip(" ")
if clean_l != "":
file.writelines([clean_l, ])
In general it is not a good idea to mutate a list you are iterating over, as you noted in your question. If you pop an element from the list you don't necessarily mess up the list's length, but the remaining indices shift, so you may skip some elements.
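A tiny illustration of the skipping (made-up data):
names = ["Abbey", "", "", "Abbott"]
for i, name in enumerate(names):
    if name == "":
        names.pop(i)  # everything after shifts left, so the next item is skipped
print(names)  # ['Abbey', '', 'Abbott'] - one empty string survives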
A quick solution would be to make a copy of the list and use the built-in enumerate() method as follows:
copy = first_names.copy()
for i, line in enumerate(copy):
    line = line.strip("\r\n")
    line = line.strip("\n")
    line = line.strip(" ")
    if line == "":
        first_names.remove(copy[i])  # remove by value; list.remove(i) would search for the value i
More on enumerate() here.
The usual practice would be to filter or create a new list, rather than change the list you are iterating. It's not uncommon to create a new list with the changes you want, and then just assign it back to the original variable name. Here is a list comprehension. Note the if statement that filters out the undesirable blank lines.
first_names = [name.strip() for name in first_names if name.strip()]
https://docs.python.org/3/glossary.html#term-list-comprehension
And you can do the same with iterators using map to apply a function to each item in the list, and filter to remove the blank lines.
first_names_iterator = filter(lambda x: bool(x), map(lambda x: x.strip(), first_names))
first_names = list(first_names_iterator)
https://docs.python.org/3/library/functions.html#map
https://docs.python.org/3/library/functions.html#filter
The last line demonstrates that you could just pass the iterator to the list constructor to get a list, but iterators are better: you can consume them without holding the whole list in memory at once. If you actually want a list, the list comprehension above is the cleaner choice.
The lambda notation is just a fast way to write a function. I could have defined a function with a good name, but that's often overkill for things like map, filter, or a sort key.
Full code:
test_cases = [
    'Abbey',
    ' Abbie ',
    '',
    'Acton',
]
print(test_cases)

first_names = list(test_cases)
first_names = [name.strip() for name in first_names if name.strip()]
print(first_names)

first_names = list(test_cases)
for name in filter(lambda x: bool(x),
                   map(lambda x: x.strip(),
                       first_names)):
    print(name)

How to Convert a Text File into a List in Python3

In Python 3, from an existing .txt file which contains lyrics/subtitles/other text,
I want to make a simple list (without any nesting) of the words it contains, without spaces or other punctuation marks.
Based on other StackExchange answers, I made this:
import csv

crimefile = open('she_loves_you.txt', 'r')
reader = csv.reader(crimefile)
allRows = list(reader)  # result is a list with nested lists

ultimate = []
for i in allRows:
    ultimate += i  # result is a list with elements longer than one word

ultimate2 = []
for i in ultimate:
    ultimate2 += i  # result is a list with elements which are single letters
My desired result would be like:
['She', 'loves', 'you', 'yeah', 'yeah', 'yeah', 'She', 'loves', 'you', ...]
======================================================================
It would also be interesting to understand why the following code (run as an extension of the code above):
import re
print (re.findall(r"[\w']+", ultimate))
raises the following error:
Traceback (most recent call last):
  File "4.4.4.csv.into.list.py", line 72, in <module>
    print (re.findall(r"[\w']+", ultimate))
  File "/usr/lib/python3.7/re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
The error message is quite clear: "expected string or bytes-like object". It means re.findall() needs a string (str), while your ultimate is a list object:
>>> type(ultimate)
<class 'list'>
# or
>>> type([])
<class 'list'>
In your case:
print(re.findall(r"[\w']+", str(ultimate)))       # whole list coerced to one string
# or
print(re.findall(r"[\w']+", ' '.join(ultimate)))  # joined words (cleaner: no brackets or quotes from the list repr)
Try this:
import csv

crimefile = open('she_loves_you.txt', 'r')
reader = csv.reader(crimefile)
allRows = list(reader)  # result is a list of nested lists

ultimate = []
for row in allRows:
    for field in row:  # each row is a list of fields; only strings have split()
        ultimate += field.split(" ")
Below is the full code I produced while working on this question:
import csv
import re
import json

#1 def1
def decomposition(file):
    '''
    Open the text file and, in three steps, create a list
    containing the single words that appear in it.
    '''
    crimefile = open(file, 'r')
    reader = csv.reader(crimefile)
    #step1: list with nested lists
    allRows = list(reader)  # result is a list with nested lists, on which we work next
    #step2: one list, with elements longer than one word
    ultimate = []
    for i in allRows:
        ultimate += i
    #step3: one list, with elements that are single words
    #print(re.findall(r"[\w']+", ultimate))       # does not work
    #print(re.findall(r"[\w']+", str(ultimate)))  # works
    return re.findall(r"[\w']+", ' '.join(ultimate))  # works even better!

#2 def2
def saving():
    '''
    Create/open a writable file (as a variable)
    and save 'list_of_words' into it.
    '''
    with open('she_loves_you_list.txt', 'w') as fp:
        # save as JSON
        json.dump(list_of_words, fp)

#3 def3
def lyric_to_frequencies(lyrics):
    '''
    You provide a list
    and receive a dictionary that counts each unique word in this list.
    '''
    myDict = {}
    for word in lyrics:
        if word in myDict:
            myDict[word] += 1
        else:
            myDict[word] = 1
    #print(myDict)
    return myDict

#4 def4
def most_common_words(freqs):
    '''
    You provide a frequency dictionary ('freqs')
    and receive the most frequent words and how often they appear.
    '''
    values = freqs.values()
    best = max(values)  # finding the biggest value very easily
    words = []
    for k in freqs:  # and here we check which entries have the biggest (best) value
        if freqs[k] == best:
            words.append(k)  # just add it to the list
    print(words, best)
    return (words, best)

#5 def5
def words_often(freqs, minTimes):
    '''
    You provide a frequency dictionary ('freqs') AND minTimes, the minimum
    number of appearances a word needs in order to be printed out,
    and receive the words that appear at least that often.
    '''
    result = []
    done = False
    while not done:
        temp = most_common_words(freqs)
        if temp[1] >= minTimes:
            result.append(temp)
            for w in temp[0]:
                del(freqs[w])
        else:
            done = True
    return result

#1
list_of_words = decomposition('she_loves_you.txt')
#2
saving()
#3
lyric_to_frequencies(list_of_words)
#4
most_common_words(lyric_to_frequencies(list_of_words))
#5
words_often(lyric_to_frequencies(list_of_words), 5)

Enumerating and replacing all tokens in a string file in python

I have a question for you, dear python lovers.
I have a corpus file, as the following:
Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon &apos;s coordinator in here ?
Excuse me , aren &apos;t you Chae Yoon &apos;s coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?
I want to assign a specific number for each token, and replace it with the assigned number on the file.
What I mean by saying token is, basically each group of characters in the file separated by ' '. So, for example, ? is a token, also Excuse is a token as well.
I have a corpus file with more than 4 million lines, like the above. Can you show me the fastest way to do what I want?
Thanks,
Might be overkill but you could write your own classifier:
# Python 3.x
class Classifier(dict):
    def __init__(self, args=None):
        '''args is an iterable of keys (only)'''
        self.n = 1
        super().__init__()
        if args:
            for thing in args:
                self[thing] = self.n
    def __setitem__(self, key, value=None):
        ## print('setitem', key)
        if key not in self:
            super().__setitem__(key, self.n)
            self.n += 1
    def setdefault(self, key, default=None):
        increment = key not in self
        n = super().setdefault(key, self.n)
        self.n += int(increment)
        ## print('setdefault', n)
        return n
    def update(self, other):
        for k, v in other:
            self.setdefault(k)
    def transpose(self):
        return {v: k for k, v in self.items()}
Usage:
c = Classifier()
with open('foo.txt') as infile, open('classified.txt', 'w+') as outfile:
    for line in infile:
        line = (str(c.setdefault(token)) for token in line.strip().split())
        outfile.write(' '.join(line))
        outfile.write('\n')
To reduce the number of writes you could accumulate lines in a list and call writelines() each time the list reaches some set length.
If you have enough memory, you could read the entire file in, split it, and feed the tokens to Classifier.
De-classify
z = c.transpose()
with open('classified.txt') as f:
    for line in f:
        line = (z[int(n)] for n in line.strip().split())
        print(' '.join(line))
For Python 2.7 super() requires arguments - replace super() with super(Classifier, self).
If you are going to be working mainly with strings for the token numbers, in the class you should convert self.n to a string when saving it then you won't have to convert back and forth between strings and ints in your working code.
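For instance, __setitem__ would then become something like this (a sketch of that change, not part of the original class; setdefault would need the same str(self.n) treatment):
def __setitem__(self, key, value=None):
    if key not in self:
        # store the number as a string up front
        super().__setitem__(key, str(self.n))
        self.n += 1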
You also may be able to use LabelEncoder from sklearn.
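A minimal sketch of that route (assuming scikit-learn is installed; note that LabelEncoder numbers tokens in sorted order rather than order of first appearance):
from sklearn.preprocessing import LabelEncoder

tokens = "Excuse me , aren &apos;t you Chae Yoon &apos;s coordinator ?".split()
le = LabelEncoder()
ids = le.fit_transform(tokens)              # array of ints, one per token
print(' '.join(str(i) for i in ids))
print(' '.join(le.inverse_transform(ids)))  # back to the original tokens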
If you already have a specific dictionary for your values, you simply need to apply the mapping:
mapping = {'?': 1, 'Excuse': 2, ...}
for k, v in mapping.items():
    my_string = my_string.replace(k, str(v))
If you want to create a brand new dictionary:
mapping = list(set(my_string.split(' ')))
mapping = {x: i for i, x in enumerate(mapping)}
for k, v in mapping.items():
    my_string = my_string.replace(k, str(v))
from collections import defaultdict
from itertools import count

with open(filename) as f:
    with open(output, 'w+') as out:
        c = count()
        d = defaultdict(c.__next__)  # a brand-new key gets the next number automatically
        for line in f:
            line = line.split()
            line = ' '.join([str(d[token]) for token in line])
            out.write(line + '\n')
Using a defaultdict, we remember what tokens we've seen. Every time we see a new token, we get the next number and assign it to that token. This writes output to a different file.
split = "super string".split(' ')
map = []
result = ''
foreach word in split:
if not map.__contains__(word):
map[word] = len(map)
result += ' ' + str(map[word]
this way avoid to do my_string = my_string.replace(k, v) that makes it slow
Try the following: it assigns a number to each token, then replaces each token with the corresponding number.
a = """Ah , this is greasy .
I want to eat kimchee .
Is Chae Yoon &apos;s coordinator in here ?
Excuse me , aren &apos;t you Chae Yoon &apos;s coordinator ? Yes . Me ?
-Chae Yoon is done singing .
This lady right next to me ... everyone knows who she is right ?""".split(" ")
key_map = dict({(j, str(m)) for m, j in enumerate(set(a))})
" ".join(map(lambda x: key_map[x], a))
i.e. first map each unique token to a number, then you can use the key_map to assign the numeric value to each token
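If you later need to go the other way (numbers back to tokens), a small sketch, introducing encoded as a name for the result of the join above:
encoded = " ".join(map(lambda x: key_map[x], a))
reverse_map = {v: k for k, v in key_map.items()}  # number -> token
decoded = " ".join(reverse_map[n] for n in encoded.split(" "))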

String Object Not Callable When Using Tuples and Ints

I am utterly flustered. I've created a list of tuples from a text file and done all of the conversions to ints:
for line in f:
    if firstLine is True:  # first line of file is the total knapsack size and # of items
        knapsackSize, nItems = line.split()
        firstLine = False
    else:
        itemSize, itemValue = line.split()
        items.append((int(itemSize), int(itemValue)))
        print items
knapsackSize, nItems = int(knapsackSize), int(nItems)  # convert strings to ints
I have functions that access the tuples for more readable code:
def itemSize(item): return item[0]
def itemValue(item): return item[1]
Yet when I call these functions, i.e.,:
elif itemSize(items[nItems-1]) > sizeLimit
I get an inexplicable "'str' object is not callable" error, referencing the foregoing line of code. I have type-checked everything that should be a tuple or an int using isinstance, and it all checks out. What gives?
Because at this point:
itemSize, itemValue = line.split()
itemSize is still a string - you appended the int-converted values to items, but the name itemSize itself still refers to the last string from line.split(), so calling itemSize(...) tries to call a string...
I would also change your logic slightly for handling first line:
with open('file') as fin:
    knapsackSize, nItems = next(fin).split()  # take first line
    for other_lines in fin:  # everything after
        pass  # do stuff for rest of file
Or just change the whole lot (assuming it's a 2-column file of ints):
with open('file') as fin:
    lines = (map(int, line.split()) for line in fin)
    knapsackSize, nItems = next(lines)
    items = list(lines)
And possibly instead of your functions to return items - use a dict or a namedtuple...
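For example, with a namedtuple the accessor functions disappear entirely (a sketch with made-up data):
from collections import namedtuple

Item = namedtuple('Item', ['size', 'value'])
items = [Item(3, 25), Item(5, 40)]
items[0].size   # -> 3, instead of itemSize(items[0])
items[1].value  # -> 40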
Or if you want to stay with functions, use itemgetter from the operator module:
import operator
itemSize = operator.itemgetter(0)
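For instance (a small sketch, with a made-up item):
item = (4, 10)         # hypothetical (size, value) pair
print(itemSize(item))  # -> 4
Either way, the original error came from rebinding the name itemSize to a string inside the loop, so avoid reusing the accessor names as loop variables.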

Categories

Resources