Remove partial column duplicates from a txt file - python

How can I export only the lines whose first column has not appeared before? For example, in.txt contains the lines:
red,color,color
red,color,color
blue,color,color
blue,color,color
Desired outcome:
red,color,color
blue,color,color
with open(infile, 'r', encoding="cp437", errors="ignore") as in_file, \
     open(outfile, 'w', encoding="cp437", errors="ignore") as out_file:
    seen = set()
    for line in in_file:
        if line.split(',')[0] == (str(x).split(',')[0] for x in seen):
            continue
        seen.add(line)
        out_file.write(line)

(str(x).split(',')[0] for x in seen) is a generator expression, it won't be equal to any string, like line.split(',')[0].
If you want to check if a string is equal to any string in an iterable, you could use any:
if any(line.split(',')[0] == str(x).split(',')[0] for x in seen):
or collect the results of the generator expression in a list and use the in operator for membership test:
if line.split(',')[0] in [str(x).split(',')[0] for x in seen]:
But why not store only the first part of the line (line.split(',')[0]) in seen, instead of the whole line? Combined with the set membership test, this greatly simplifies your code:
seen = set()
for line in in_file:
    first_part = line.split(',')[0]
    if first_part in seen:
        continue
    seen.add(first_part)
    out_file.write(line)
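As a self-contained check of this approach, here is a sketch in which io.StringIO stands in for the real files, using the sample data from the question:

```python
import io

# io.StringIO stands in for the real input/output files here
in_file = io.StringIO(
    "red,color,color\n"
    "red,color,color\n"
    "blue,color,color\n"
    "blue,color,color\n"
)
out_file = io.StringIO()

seen = set()
for line in in_file:
    first_part = line.split(',')[0]
    if first_part in seen:
        continue
    seen.add(first_part)
    out_file.write(line)

print(out_file.getvalue())
# red,color,color
# blue,color,color
```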


python and iteration specifically "for line in first_names"

If I 'pop' an item off an array in python, I seem to shoot myself in the foot by messing up the array's total length? See the following example:
Also am I just being an idiot or is this normal behaviour? And is there a better way to achieve what I am trying to do?
first_names = []
last_names = []
approved_names = []
blacklisted_names = []
loopcounter = 0

with open("first_names.txt") as file:
    first_names = file.readlines()
    #first_names = [line.rstrip() for line in first_names]

for line in first_names:
    line = line.strip("\r\n")
    line = line.strip("\n")
    line = line.strip(" ")
    if line == "":
        first_names.pop(loopcounter)
        #first_names.pop(first_names.index("")) # Does not work as expected
        #loopcounter -= 1 # Does not work as expected either......
    loopcounter += 1
loopcounter = 0

def save_names():
    with open("first_names2.txt", 'wt', encoding="utf-8") as file:
        file.writelines(first_names)
and the resulting files:
first_names.txt
{
Abbey
Abbie
Abbott
Abby
Abe
Abie
Abstinence
Acton
}
And the output file
{
Abbey
Abbie
Abbott
Abe
Abie
Abstinence
Acton
}
list.pop() removes an item from a list and returns the value (see e.g. this ref). For the very basic task of cleaning and writing the list of names, an easy edit would be:
with open("first_names.txt") as file:
    first_names = file.readlines()

cleaned_lines = []
for line in first_names:
    clean_l = line.strip("\r\n").strip("\n").strip(" ")
    if clean_l != "":
        cleaned_lines.append(clean_l + "\n")  # re-add the newline stripped above

with open("first_names2.txt", 'wt', encoding="utf-8") as file:
    file.writelines(cleaned_lines)
If you don't want to create a cleaned copy of the list first_names, you could iteratively append single lines to the file as well.
with open("first_names.txt") as file:
    first_names = file.readlines()

with open("first_names2.txt", 'wt', encoding="utf-8") as file:
    for line in first_names:
        clean_l = line.strip("\r\n").strip("\n").strip(" ")
        if clean_l != "":
            file.write(clean_l + "\n")  # writelines adds no separators, so add '\n' back
In general it is not a good idea to mutate a list on which you're iterating, as you stated in your question. If you pop an element from the list you don't necessarily mess up the array's length, but you may encounter unexpected behavior when dealing with which index to pop. In this case you may skip some elements of the array.
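A minimal sketch of the skipping effect, using toy numbers rather than the name list: when an element is popped, everything after it shifts one position left, so the loop's next index jumps over an item.

```python
# Toy data (not the real name list): pop while iterating and watch an item get skipped
nums = [1, 2, 2, 3]
for i, n in enumerate(nums):
    if n == 2:
        nums.pop(i)  # later items shift left, so the second 2 is never visited
print(nums)  # [1, 2, 3]
```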
A quick solution would be to make a copy of the list and use the built-in enumerate() method as follows:
copy = first_names.copy()
for i, line in enumerate(copy):
    stripped = line.strip("\r\n").strip("\n").strip(" ")
    if stripped == "":
        first_names.remove(line)  # remove the original (unstripped) entry by value
More on enumerate() here.
The usual practice would be to filter or create a new list, rather than change the list you are iterating. It's not uncommon to create a new list with the changes you want, and then just assign it back to the original variable name. Here is a list comprehension. Note the if statement that filters out the undesirable blank lines.
first_names = [name.strip() for name in first_names if name.strip()]
https://docs.python.org/3/glossary.html#term-list-comprehension
And you can do the same with iterators using map to apply a function to each item in the list, and filter to remove the blank lines.
first_names_iterator = filter(lambda x: bool(x), map(lambda x: x.strip(), first_names))
first_names = list(first_names_iterator)
https://docs.python.org/3/library/functions.html#map
https://docs.python.org/3/library/functions.html#filter
The last line demonstrates that you can pass the iterator straight to the list constructor to get a list, but iterators have an advantage: you can consume items one at a time without materializing the whole list at once. If you do want a list, the list comprehension is the clearer choice.
The lambda notation is just a fast way to write a function. I could have defined a function with a good name, but that's often overkill for things like map, filter, or a sort key.
Full code:
test_cases = [
    'Abbey',
    ' Abbie ',
    '',
    'Acton',
]
print(test_cases)

first_names = list(test_cases)
first_names = [name.strip() for name in first_names if name.strip()]
print(first_names)

first_names = list(test_cases)
for name in filter(lambda x: bool(x),
                   map(lambda x: x.strip(),
                       first_names)):
    print(name)

Comparing 2 text files in python

I have 2 text files. I want to compare the 2 text files and return a list that has every line number that is different. Right now, I think my code returns the lines that are different, but how do I return the line number instead?
def diff(filename1, filename2):
    with open('./exercise-files/text_a.txt', 'r') as filename1:
        with open('./exercise-files/text_b.txt', 'r') as filename2:
            difference = set(filename1).difference(filename2)
    difference.discard('\n')
    with open('diff.txt', 'w') as file_out:
        for line in difference:
            file_out.write(line)
Testing on:
diff('./exercise-files/text_a.txt', './exercise-files/text_b.txt') == [3, 4, 6]
diff('./exercise-files/text_a.txt', './exercise-files/text_a.txt') == []
difference = [
    line_number + 1
    for line_number, (line1, line2) in enumerate(zip(filename1, filename2))
    if line1 != line2
]
zip takes two (or more) generators and returns a generator of tuples, where each tuple contains the corresponding entries of each generator. enumerate takes this generator and returns a generator of tuples, where the first element is the index and the second the value from the original generator. And it's straightforward from there.
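A toy illustration of this combination, with in-memory lists standing in for the two file objects:

```python
# In-memory lists stand in for the two open file objects
a = ["x\n", "y\n", "z\n"]
b = ["x\n", "q\n", "z\n"]
difference = [
    line_number + 1
    for line_number, (line1, line2) in enumerate(zip(a, b))
    if line1 != line2
]
print(difference)  # [2]
```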
Here is an example which will ignore any surplus lines if one file has more lines than the other. The key is to use enumerate when iterating to get the line number as well as the contents. next can be used to get a line from the file iterator which is not used directly by the for loop.
def diff(filename1, filename2):
    difference_line_numbers = []
    with open(filename1, "r") as file1, open(filename2, "r") as file2:
        for line_number, contents1 in enumerate(file1, 1):
            try:
                contents2 = next(file2)
            except StopIteration:
                break
            if contents1 != contents2:
                difference_line_numbers.append(line_number)
    return difference_line_numbers
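To see why the try/except is needed, here is a minimal sketch of next on a plain iterator: once the iterator is exhausted, next raises StopIteration, which the loop above catches to stop cleanly.

```python
# next() pulls one item from an iterator; StopIteration signals exhaustion
it = iter(["a\n", "b\n"])
first = next(it)
second = next(it)
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
print(first, second, exhausted)
```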

I'm looking for a string in a file, seems to not be working

My function first calculates all possible anagrams of the given word. Then, for each of these anagrams, it checks if they are valid words by checking whether they equal any of the words in the wordlist.txt file. The file is a giant file with a bunch of words line by line, so I decided to just read each line and check if each anagram is there. However, it comes up blank. Here is my code:
def perm1(lst):
    if len(lst) == 0:
        return []
    elif len(lst) == 1:
        return [lst]
    else:
        l = []
        for i in range(len(lst)):
            x = lst[i]
            xs = lst[:i] + lst[i+1:]
            for p in perm1(xs):
                l.append([x] + p)
        return l

def jumbo_solve(string):
    '''jumbo_solve(string) -> list
    returns list of valid words that are anagrams of string'''
    passer = list(string)
    allAnagrams = []
    validWords = []
    for x in perm1(passer):
        allAnagrams.append(''.join(x))
    for x in allAnagrams:
        if x in open("C:\\Users\\Chris\\Python\\wordlist.txt"):
            validWords.append(x)
    return validWords

print(jumbo_solve("rarom"))
I have put in many print statements to debug, and the generated list, allAnagrams, is fully functional. For example, with the input "rarom", one valid anagram is the word "armor", which is contained in the wordlist.txt file. However, when I run it, it does not detect it for some reason. Thanks again, I'm still a little new to Python, so all the help is appreciated!
You missed a tiny but important aspect of:
word in open("C:\\Users\\Chris\\Python\\wordlist.txt")
This will search the file line by line, as if open(...).readlines() was used, and attempt to match the entire line, with '\n' in the end. Really, anything that demands iterating over open(...) works like readlines().
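A quick check of this behavior; the file name demo_words.txt is an arbitrary throwaway file written just for the demonstration:

```python
# Throwaway file (the name demo_words.txt is arbitrary) to show that
# iterating over a file yields lines that keep their trailing '\n'
with open("demo_words.txt", "w") as f:
    f.write("armor\nrarom\n")

with open("demo_words.txt") as f:
    found_bare = "armor" in f       # False: every line ends with '\n'
with open("demo_words.txt") as f:
    found_newline = "armor\n" in f  # True
print(found_bare, found_newline)
```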
You would need
x+'\n' in open("C:\\Users\\Chris\\Python\\wordlist.txt")
to make this work, given that the file is a list of words on separate lines. But it is inefficient to reopen and rescan the file on every function call. Better to do it once:
wordlist = open("C:\\Users\\Chris\\Python\\wordlist.txt").read().split('\n')
This will create a list of words if the file is a '\n'-separated word list. Note you can use
`readlines()`
instead of read().split('\n'), but that keeps the \n on every word, like you have now, and you would need to include it in your search as shown above. Now you can use the list as a global variable or as a function argument.
if x in wordlist: stuff
Note Graphier raised an important suggestion in the comments. A set:
wordlist = set(open("C:\\Users\\Chris\\Python\\wordlist.txt").read().split('\n'))
is better suited for word lookups than a list: a set membership test is on average O(1) in the number of words (the cost is just hashing the word), whereas a list lookup scans every entry.
You have used the following code in the wrong way:
if x in open("C:\\Users\\Chris\\Python\\wordlist.txt"):
Instead, try the following code, it should solve your problem:
with open("words.txt", "r") as file:
    lines = file.read().splitlines()
for line in lines:
    # do something here
    ...
So, putting all advice together, your code could be as simple as:
from itertools import permutations

def get_valid_words(file_name):
    with open(file_name) as f:
        return set(line.strip() for line in f)

def jumbo_solve(s, valid_words=None):
    """jumbo_solve(s: str) -> list
    returns list of valid words that are anagrams of `s`"""
    if valid_words is None:
        valid_words = get_valid_words("C:\\Users\\Chris\\Python\\wordlist.txt")
    # permutations() yields tuples of characters, so join them into strings
    return [word for word in {''.join(p) for p in permutations(s)} if word in valid_words]

if __name__ == "__main__":
    print(jumbo_solve("rarom"))
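One subtlety worth checking: itertools.permutations yields tuples of characters, not strings, so they must be joined before comparing against the words in the list. A quick check:

```python
from itertools import permutations

# permutations() yields tuples of characters, so join them into strings
perms = {''.join(p) for p in permutations("rarom")}
print("armor" in perms)  # True: "armor" is an anagram of "rarom"
```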

appending a list from a read text file python3

I am attempting to read a txt file and create a dictionary from the text. a sample txt file is:
John likes Steak
John likes Soda
John likes Cake
Jane likes Soda
Jane likes Cake
Jim likes Steak
My desired output is a dictionary with the name as the key, and the "likes" as a list of the respective values:
{'John':('Steak', 'Soda', 'Cake'), 'Jane':('Soda', 'Cake'), 'Jim':('Steak')}
I keep running into an error when adding my stripped word to my list, and have tried a few different ways:
pred = ()
prey = ()

spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    pred.append = (line[0])
    prey.append = (line[2])
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
and also:
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line != "":
    line = line.split()
    if line[0] in chain:
        chain[line[0] = [0, line[2]]
    else:
        chain[line[0]] = line[2]
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
any ideas?
This will do it (without needing to read the entire file into memory first):
likes = {}
for who, _, what in (line.split()
                     for line in (line.strip()
                                  for line in open('likes.txt', 'rt'))):
    likes.setdefault(who, []).append(what)
print(likes)
Output:
{'Jane': ['Soda', 'Cake'], 'John': ['Steak', 'Soda', 'Cake'], 'Jim': ['Steak']}
Alternatively, to simplify things slightly, you could use a temporary collections.defaultdict:
from collections import defaultdict
likes = defaultdict(list)
for who, _, what in (line.split()
for line in (line.strip()
for line in open('likes.txt', 'rt'))):
likes[who].append(what)
print(dict(likes)) # convert to plain dictionary and print
Your input is a sequence of sequences. Parse the outer sequence first, parse each item next.
Your outer sequence is:
Statement
<empty line>
Statement
<empty line>
...
Assume that f is the open file with the data. Read each statement and return a list of them:
def parseLines(f):
    result = []
    for line in f:           # file objects iterate over text lines
        line = line.strip()  # a "blank" line is really "\n", so strip first
        if line:             # line is non-empty
            result.append(line)
    return result
Note that the function above accepts a much wider grammar: it allows arbitrarily many empty lines between non-empty lines, and two non-empty lines in a row. But it does accept any correct input.
Then, your statement is a triple: X likes Y. Parse it by splitting it by whitespace, and checking the structure. The result is a correct pair of (x, y).
def parseStatement(s):
    parts = s.split()  # by default, it splits by all whitespace
    assert len(parts) == 3, "Syntax error: %r is not three words" % s
    x, likes, y = parts  # unpack the list of 3 items into variables
    assert likes == "likes", "Syntax error: %r instead of 'likes'" % likes
    return x, y
Make a list of pairs for each statement:
pairs = [parseStatement(s) for s in parseLines(f)]
Now you need to group values by key. Let's use defaultdict which supplies a default value for any new key:
from collections import defaultdict
the_answer = defaultdict(list)  # the default value is an empty list
for key, value in pairs:
    the_answer[key].append(value)
    # we can append because the_answer[key] is set to an empty list on first access
So here the_answer is what you need, only it uses lists as dict values instead of tuples. This must be enough for you to understand your homework.
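Putting the pieces together on an in-memory sample (a sketch using io.StringIO in place of the open file f, with the blank-line stripping folded into parseLines):

```python
import io
from collections import defaultdict

def parseLines(f):
    # strip first so that a "blank" line ("\n") is treated as empty
    return [line for line in (raw.strip() for raw in f) if line]

def parseStatement(s):
    x, likes, y = s.split()
    assert likes == "likes", "Syntax error: %r instead of 'likes'" % likes
    return x, y

f = io.StringIO("John likes Steak\n\nJane likes Soda\n")
the_answer = defaultdict(list)
for key, value in (parseStatement(s) for s in parseLines(f)):
    the_answer[key].append(value)
print(dict(the_answer))  # {'John': ['Steak'], 'Jane': ['Soda']}
```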
dic = {}
for i in f.readlines():
    if i.strip():  # skip blank lines ("\n" alone is truthy, so strip first)
        parts = i.split()
        if parts[0] in dic:
            dic[parts[0]].append(parts[2])
        else:
            dic[parts[0]] = [parts[2]]
print(dic)
This should do it.
Here we iterate through f.readlines(), f being the file object, and on each line we fill up the dictionary, using the first part of the split as the key and the last part as the value.

String Object Not Callable When Using Tuples and Ints

I am utterly flustered. I've created a list of tuples from a text file and done all of the conversions to ints:
for line in f:
    if firstLine is True:  # first line of file is the total knapsack size and # of items
        knapsackSize, nItems = line.split()
        firstLine = False
    else:
        itemSize, itemValue = line.split()
        items.append((int(itemSize), int(itemValue)))
        print items
knapsackSize, nItems = int(knapsackSize), int(nItems)  # convert strings to ints
I have functions that access the tuples for more readable code:
def itemSize(item): return item[0]
def itemValue(item): return item[1]
Yet when I call these functions, i.e.,:
elif itemSize(items[nItems-1]) > sizeLimit:
I get an inexplicable "'str' object is not callable" error referencing the foregoing line of code. I have type-checked everything that should be a tuple or an int using isinstance, and it all checks out. What gives?
Because at this point:
itemSize, itemValue = line.split()
itemSize is still a string - and, more importantly, this assignment rebinds the names itemSize and itemValue, shadowing your helper functions of the same names. You appended the int-converted values to items, but calling itemSize(...) afterwards tries to call the last string read, hence "'str' object is not callable".
I would also change your logic slightly for handling first line:
with open('file') as fin:
    knapsackSize, nItems = next(fin).split()  # take first line
    for other_lines in fin:  # everything after
        pass  # do stuff for rest of file
Or just change the whole lot (assuming it's a two-column file of ints):
with open('file') as fin:
    lines = (map(int, line.split()) for line in fin)
    knapsackSize, nItems = next(lines)
    items = list(lines)
And possibly, instead of your accessor functions, use a dict or a namedtuple...
Or if you want to stay with functions, then go to the operator module and use:
itemSize = operator.itemgetter(0)
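A runnable sketch of both suggestions; the namedtuple name Item is my own, not from the question:

```python
import operator
from collections import namedtuple

# itemgetter works on the plain (size, value) tuples already in items
itemSize = operator.itemgetter(0)
itemValue = operator.itemgetter(1)
print(itemSize((3, 10)), itemValue((3, 10)))  # 3 10

# a namedtuple (the name Item is hypothetical) makes the fields self-documenting
Item = namedtuple("Item", ["size", "value"])
item = Item(3, 10)
print(item.size, item.value)  # 3 10
```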
