Need help deleting repeating lines in txt file

I need to produce output containing a single list with no duplicates. The list I am using has roughly 100k emails, many of them repeated (around 1000x). I want to remove those repeats.
I have tried a few things I found online,
but nothing is written to my new file and PyCharm just freezes when I run it:
def uniquelines(lineslist):
    unique = {}
    result = []
    for item in lineslist:
        if item.strip() in unique: continue
        unique[item.strip()] = 1
        result.append(item)
    return result
file1 = open("wordlist.txt","r")
filelines = file1.readlines()
file1.close()
output = open("wordlist_unique.txt","w")
output.writelines(uniquelines(filelines))
output.close()
I expect it to write all the emails, with none repeated, to a new text file.

Before I get into the few ways to hopefully solve the issue, one thing I see off the bat is that you are using both a dictionary and a list within your function. This almost doubles the memory you will need to process things. I suggest using one or the other.
Using a set will provide you with a guaranteed "list" of unique items. The set.add() function will ignore duplicates.
s = {1, 2, 3}
print(s) #{1, 2, 3}
s.add(4)
print(s) #{1, 2, 3, 4}
s.add(4)
print(s) #{1, 2, 3, 4}
With that, you can modify your function to the following to achieve what you want. For my example, I have input.txt as a series of lines just containing a single integer value with plenty of duplicates.
def uniquelines(lineslist):
    unique = set()
    for line in lineslist:
        unique.add(str(line).strip())
    return list(unique)
with open('input.txt', 'r') as f:
    lines = f.readlines()
output = uniquelines(lines)
with open('output.txt', 'w') as f:
    f.write("\n".join([i for i in output]))
output.txt is as follows without any duplicates!
2
0
4
5
3
1
9
6
You can accomplish the same thing by calling set() on a list comprehension. The disadvantage is that the comprehension first builds a complete second list of stripped lines, duplicates included, and only then removes the duplicates. The function above adds lines to the set one at a time, so it only ever holds the unique values on top of what readlines() returned. Depending on the size of your data, you will probably want to use the function.
with open('input.txt', 'r') as f:
    lines = f.readlines()
output = set([l.strip() for l in lines])
with open('output.txt', 'w') as f:
    f.write("\n".join([i for i in output]))
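If the order of first appearance matters, or the file is too large to read comfortably with readlines(), a streaming variant is one option. This is only a sketch and not part of the original answer: the helper names unique_in_order and dedupe_file are made up for illustration, and it assumes that lines should be compared after stripping whitespace and that blank lines can be dropped.

```python
def unique_in_order(lines):
    """Yield each distinct stripped line once, in order of first appearance."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:  # skip blank lines and repeats
            seen.add(key)
            yield key


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping repeated lines (hypothetical helper)."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # Iterating over the file object reads one line at a time,
        # so only the unique lines are kept in memory.
        for line in unique_in_order(src):
            dst.write(line + "\n")
```

With the file names from the question, the call would be dedupe_file("wordlist.txt", "wordlist_unique.txt").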
I couldn't quite tell if you were looking to maintain a running count of how many times each unique line occurred. If that's what you're going for, then you can use the in operator to check whether the line is already among the keys.
def uniquelines(lineslist):
    unique = {}
    for line in lineslist:
        line = line.strip()
        if line in unique:
            unique[line] += 1
        else:
            unique[line] = 1
    return unique
# {'9': 2, '0': 3, '4': 3, '1': 1, '3': 4, '2': 1, '6': 3, '5': 1}

Related

Write a program that reads the contents of a text file and return index of words into Values

I am doing an exercise from a textbook and I have been stuck for 3 days finally I decided to get help here.
The question is: write a program that reads the contents of a text file. The program should create a dictionary in which the key-value pairs are described as follows:
Key. The keys are the individual words found in the file.
Values. Each value is a list that contains the line numbers in the file where the word (the key) is found.
For example: suppose the word “robot” is found in lines 7, 18, 94, and 138. The dictionary would contain an element in which the key was the string “robot”, and the value was a list containing the numbers 7, 18, 94, and 138.
Once the dictionary is built, the program should create another text file, known as a word index, listing the contents of the dictionary. The word index file should contain an alphabetical listing of the words that are stored as keys in the dictionary, along with the line numbers where the words appear in the original file.
Figure 9-1 shows an example of an original text file (Kennedy.txt) and its index file (index.txt).
Here is the code I have tried so far. The function is not complete, and I am not sure what to do next:
def create_Kennedytxt():
    f = open('Kennedy.txt','w')
    f.write('We observe today not a victory\n')
    f.write('of party but a celebration\n')
    f.write('of freedom symbolizing an end\n')
    f.write('as well as a beginning\n')
    f.write('signifying renewal as well\n')
    f.write('as change\n')
    f.close()
create_Kennedytxt()
def split_words():
    f = open('Kennedy.txt','r')
    count = 0
    for x in f:
        y = x.strip()
        z = y.split(' ') #get individual character to find its index
        count+=1 #get index for each line during for loop
split_words()
Can anyone help me with the code or give me some hints? The answer shouldn't import anything; it should only use built-in methods and functions. I would really appreciate it!

You are on the right track. This is how it can be done
def build_word_index(txt):
    out = {}
    for i, line in enumerate(txt.split("\n")):
        for word in line.strip().split(" "):
            if word not in out:
                out[word] = [i + 1]
            else:
                out[word].append(i + 1)
    return out
# The text starts right after the opening quotes and ends right before the
# closing ones, so no blank first or last line shifts the numbering.
print(build_word_index('''We observe today not a victory
of party but a celebration
of freedom symbolizing an end
as well as a beginning
signifying renewal as well
as change'''))
This works by first defining a dictionary
out = {}
Then we loop over the input line by line (we use enumerate just so we have an index that starts from 0 and goes up by one each line):
for i, line in enumerate(txt.split("\n")):
Next we are going to loop for each word in that line
for word in line.strip().split(" "):
Finally we are going to examine two cases by checking if our dictionary does not contain the word
if word not in out:
If we haven't seen the word before, we need to create an entry in our dictionary for it. We use a list as the value so that we can handle words that appear on multiple lines. (We add 1 to i here to offset the index starting at 0.)
out[word] = [i + 1]
In the case we have seen the word before we can just add the line we are currently on to the end of it
out[word].append(i + 1)
This will get us a dictionary where each word is the key and the value is a list of what lines the word appears in.
I am going to leave how to actually output the dictionary correctly to you.
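For readers who want a starting point for that last step, here is one possible sketch that uses no imports. It is not from the answer above: the function names format_word_index and write_word_index are hypothetical, and the "word: 1 2 4" layout is an assumption, since the textbook's Figure 9-1 is not reproduced here.

```python
def format_word_index(index):
    """Return the index as text, one 'word: n n n' line per word, sorted by word."""
    lines = []
    for word in sorted(index):  # plain sorted() puts uppercase words before lowercase
        numbers = " ".join(str(n) for n in index[word])
        lines.append(word + ": " + numbers)
    return "\n".join(lines) + "\n"


def write_word_index(index, path):
    """Write the formatted index to the file at path (hypothetical helper)."""
    with open(path, "w") as f:
        f.write(format_word_index(index))
```

Calling write_word_index(build_word_index(text), 'index.txt') would then produce the word index file, where text is the content read from Kennedy.txt.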

This is a three step process:
Read the file line by line and split each line into words
Identify all unique words in each line (use set to do this)
For each word, check if word exists in the dictionary.
If it exists in the dictionary, then add the line number (enumerate starts at 0, so you need to add 1 to it).
If it does NOT exist in the dictionary, create a new key entry for the word and include the line number.
The dictionary will map keys (words) to lists (line numbers).
To do this, you can create a program like this:
keys_in_file = {}
with open ('Kennedy.txt', 'r') as f:
    for i,line in enumerate(f):
        words = line.strip().split()
        for word in set(words):
            keys_in_file.setdefault(word, []).append(i+1)
print (keys_in_file)
The output of the file you provided (Kennedy.txt) is:
{'today': [1], 'victory': [1], 'observe': [1], 'a': [1, 2, 4], 'We': [1], 'not': [1], 'celebration': [2], 'of': [2, 3], 'party': [2], 'but': [2], 'freedom': [3], 'an': [3], 'symbolizing': [3], 'end': [3], 'as': [4, 5, 6], 'well': [4, 5], 'beginning': [4], 'renewal': [5], 'signifying': [5], 'change': [6]}
If you want to ensure that all words (We, WE, we) get counted as same word, you need to convert words to lowercase.
words = line.lower().strip().split()
If you want the values to be printed in the format of index.txt, then you add the following to the code:
for k in sorted(keys_in_file):
    print (k+':', *keys_in_file[k])
The output will be as follows:
Note: I converted We to lowercase so it will show up later in the alphabetic order
a: 1 2 4
an: 3
as: 4 5 6
beginning: 4
but: 2
celebration: 2
change: 6
end: 3
freedom: 3
not: 1
observe: 1
of: 2 3
party: 2
renewal: 5
signifying: 5
symbolizing: 3
today: 1
victory: 1
we: 1
well: 4 5

from collections import Counter
fname = input("Enter file name: ")
with open (fname, 'r') as input_file:
    count = Counter(word for line in input_file
                    for word in line.split())
print(count.most_common(20))
f= open("index.txt","w+")
s = str(count.most_common(20))
f.write(s)
f.close()

Remove duplicates from large list but remove both if it does exist?

So I have a text file like this
123
1234
123
1234
12345
123456
You can see 123 appears twice so both instances should be removed. but 12345 appears once so it stays. My text file is about 70,000 lines.
Here is what I came up with.
file = open("test.txt",'r')
lines = file.read().splitlines() #to ignore the '\n' and turn to list structure
for appId in lines:
    if(lines.count(appId) > 1): #if element count is not unique remove both elements
        lines.remove(appId) #first instance removed
        lines.remove(appId) #second instance removed
writeFile = open("duplicatesRemoved.txt",'a') #output the left over unique elements to file
for element in lines:
    writeFile.write(element + "\n")
When I run this I feel like my logic is correct, but I know for a fact the output is supposed to be around 950 elements, and I'm still getting 23000 elements in my output, so a lot is not getting removed. Any ideas where the bug could be?
Edit: I FORGOT TO MENTION. An element can only appear twice MAX.

Use Counter from built in collections:
In [1]: from collections import Counter
In [2]: a = [123, 1234, 123, 1234, 12345, 123456]
In [3]: a = Counter(a)
In [4]: a
Out[4]: Counter({123: 2, 1234: 2, 12345: 1, 123456: 1})
In [5]: a = [k for k, v in a.items() if v == 1]
In [6]: a
Out[6]: [12345, 123456]
For your particular problem I would do it like this:
from collections import defaultdict
out = defaultdict(int)
with open('input.txt') as f:
    for line in f:
        out[line.strip()] += 1
with open('out.txt', 'w') as f:
    for k, v in out.items():
        if v == 1: #here you use logic suitable for what you want
            f.write(k + '\n')

Be careful about removing elements from a list while still iterating over that list. This changes the behavior of the list iterator, and can make it skip over elements, which may be part of your problem.
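The skipping is easy to reproduce on a small list. The sketch below mirrors the loop from the question; the function name remove_pairs_while_iterating and the sample data are made up for illustration.

```python
def remove_pairs_while_iterating(values):
    """Reproduce the question's approach: mutate the list being iterated."""
    values = list(values)  # work on a copy so the caller's list is untouched
    for value in values:
        if values.count(value) > 1:
            values.remove(value)  # first instance
            values.remove(value)  # second instance
    return values


leftover = remove_pairs_while_iterating([1, 1, 2, 2, 3, 3])
print(leftover)  # [3, 3]: the loop ended before it ever reached the 3s
```

Each pair of removals shortens the list by two while the iterator's position advances by one, so the loop runs out of elements before it has visited everything.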
Instead, I suggest creating a filtered copy of the list using a list comprehension. Rather than removing the elements that appear more than once, you keep the elements that appear exactly once:
with open("test.txt", "r") as file:
    lines = file.read().splitlines()
unique_lines = [line for line in lines if lines.count(line) == 1] # keep lines that appear exactly once
with open("duplicatesRemoved.txt", "w") as writefile:
    writefile.writelines(line + "\n" for line in unique_lines) # splitlines() dropped the newlines
You could also easily modify this code to allow up to two occurrences (if lines.count(line) <= 2) or to look for more than two occurrences.

You can count all of the elements and store the counts in a dictionary:
dic = {a:lines.count(a) for a in lines}
Then remove every duplicated value from the list:
for k in dic:
    if dic[k]>1:
        while k in lines:
            lines.remove(k)
NOTE: The while loop is needed because lines.remove(k) only removes the first occurrence of k from the list, so it must be repeated until no k is left in the list.
If the for loop is too complicated, you can use the dictionary in another way to get rid of duplicated values:
lines = [k for k, v in dic.items() if v==1]

Python - How do i build a dictionary from a text file?

For the Data Structures and Algorithms class at Tilburg University, I got this question in an in-class test:
Build a dictionary from testfile.txt with only unique entries; if an entry appears again, its value should be added to the total sum of that product class.
The text file looked like this (it was not a .csv file):
apples,1
pears,15
oranges,777
apples,-4
oranges,222
pears,1
bananas,3
so apples will be -3 and the output would be {"apples": -3, "oranges": 999...}
in the exams i am not allowed to import any external packages besides the normal: pcinput, math, etc. i am also not allowed to use the internet.
I have no idea how to accomplish this, and it seems to be a big gap in the development of my Python skills: this kind of question is not covered in a 'dictionaries in Python' video on YouTube (it may be too hard for that), but it is also not covered in an expert course, because there it would be too simple.
hope you guys can help!
from collections import Counter
from sys import exit
from os.path import exists, isfile
## i did not finish it, but what i wanted to achieve was build a list of the
## strings and their belonging integers. then use the counter method to add
## them together
## by splitting the string by marking the comma as the split point.
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    exit()
keys = []
values = []
with open(filename) as f:
    xs = f.read().split()
for i in xs:
    keys.append([i])
print(keys)
my_dict = {}
for i in range(len(xs)):
    my_dict[xs[i]] = xs.count(xs[i])
print(my_dict)
word_and_integers_dict = dict(zip(keys, values))
print(word_and_integers_dict)
values2 = my_dict.split(",")
for j in values2:
    print( value2 )
The output is this:
[['schijndel,-3'], ['amsterdam,0'], ['tokyo,5'], ['tilburg,777'], ['zaandam,5']]
{'zaandam,5': 1, 'tilburg,777': 1, 'amsterdam,0': 1, 'tokyo,5': 1, 'schijndel,-3': 1}
{}
so i got the dictionary from it, but i did not separate the values.
the error message is this:
28 values2 = my_dict.split(",") <-- here was the error
29 for j in values2:
30 print( value2 )
AttributeError: 'dict' object has no attribute 'split'

I don't understand what your code is actually doing, and I think you have lost track of what your variables contain, but this is an easy problem to solve in Python. Split into a list, split each item again, and sum:
>>> input = "apples,1 pears,15 oranges,777 apples,-4 oranges,222 pears,1 bananas,3"
>>> parts = input.split()
>>> parts
['apples,1', 'pears,15', 'oranges,777', 'apples,-4', 'oranges,222', 'pears,1', 'bananas,3']
Then split again. Behold the list comprehension. This is an idiomatic way to transform one list into another in Python. Note that the numbers are strings, not ints yet.
>>> pairs = [s.split(',') for s in parts]
>>> pairs
[['apples', '1'], ['pears', '15'], ['oranges', '777'], ['apples', '-4'], ['oranges', '222'], ['pears', '1'], ['bananas', '3']]
Now you want to iterate over pairs, and sum all the same fruits. This calls for a dict:
>>> result = {}
>>> for fruit, countstr in pairs:
...     if fruit not in result:
...         result[fruit] = 0
...     result[fruit] += int(countstr)
...
>>> result
{'pears': 16, 'apples': -3, 'oranges': 999, 'bananas': 3}
This pattern of adding an element if it doesn't exist comes up frequently. You should check out defaultdict in the collections module. If you use that, you don't even need the if.
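As a rough illustration of that suggestion, the same summing loop with defaultdict could look like the sketch below. The function name sum_by_key is hypothetical, and the sample lines are the ones from the question; collections is part of the standard library, though whether it is allowed in the exam is for the asker to check.

```python
from collections import defaultdict


def sum_by_key(lines):
    """Sum the integer after the comma for each name before the comma."""
    totals = defaultdict(int)  # a missing key starts at int() == 0
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        name, count_str = line.split(",")
        totals[name] += int(count_str)
    return dict(totals)


sample = ["apples,1", "pears,15", "oranges,777", "apples,-4",
          "oranges,222", "pears,1", "bananas,3"]
print(sum_by_key(sample))
# {'apples': -3, 'pears': 16, 'oranges': 999, 'bananas': 3}
```

To run it on the file itself, pass the open file object: iterating over it yields one line at a time.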

Let's walk through what you need to do. First, check if the file exists and read the contents to a variable. Second, parse each line: you need to split the line on the comma, convert the number from a string to an integer, and then pass the values to a dictionary. In this case I would recommend using defaultdict from collections, but we can also do it with a standard dictionary.
import sys
from os.path import isfile
filename = input("filename voor input: ")
if not isfile(filename):
    print(filename, "bestaat niet")
    sys.exit()
# this reads the file to a list, removing newline characters
with open(filename) as f:
    line_list = [x.strip() for x in f]
# create a dictionary
my_dict = {}
# update the value in the dictionary if it already exists,
# otherwise add it to the dictionary
for line in line_list:
    k, v_str = line.split(',')
    if k in my_dict:
        my_dict[k] += int(v_str)
    else:
        my_dict[k] = int(v_str)
# print the dictionary
table_str = '{:<30}{}'
print(table_str.format('Item','Count'))
print('='*35)
for k,v in sorted(my_dict.items()):
    print(table_str.format(k,v))

Creating a program that compares two lists

I am trying to create a program that checks whether items from one list are not in another. It keeps returning lines saying that x value is not in the list. Any suggestions? Sorry about my code, it's quite sloppy.
# Searching Within an Array
# Putting .txt files into arrays
with open('Barcodes', 'r') as f:
    barcodes = [line.strip() for line in f]
with open('EAN Staging', 'r') as f:
    EAN_staging = [line.strip() for line in f]
# Arrays
list1 = barcodes
list2 = EAN_staging
# Main Code
fixed = -1
for x in list1:
    for variable in list1: # Moves along each variable in the list, in turn
        if list1[fixed] in list2: # If the term is in the list, then
            fixed = fixed + 1
            location = list2.index(list1[fixed]) # Finds the term in the list
            print ()
            print ("Found", variable ,"at location", location) # Prints location of terms

Instead of lists, read the files as sets:
with open('Barcodes', 'r') as f:
    barcodes = {line.strip() for line in f}
with open('EAN Staging', 'r') as f:
    EAN_staging = {line.strip() for line in f}
Then all you need to do is calculate the difference between them:
diff = barcodes - EAN_staging # or barcodes.difference(EAN_staging)
An extracted example:
a = {1, 2, 3}
b = {3, 4, 5}
print(a - b)
>> {1, 2} # 1 and 2 are in a but not in b
Note that if you are operating with sets, information about how many times an element is present will be lost. If you care about situations when an element is present in barcodes 3 times, but only 2 times in EAN_staging, you should use Counter from collections.
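A minimal sketch of that Counter variant follows. The function name missing_with_counts and the sample barcodes are invented for illustration; subtracting one Counter from another keeps only the positive counts.

```python
from collections import Counter


def missing_with_counts(first, second):
    """Return how many copies of each item `first` has beyond `second`."""
    return Counter(first) - Counter(second)


barcodes = ["111", "111", "111", "222"]
ean_staging = ["111", "111", "333"]
surplus = missing_with_counts(barcodes, ean_staging)
print(dict(surplus))  # {'111': 1, '222': 1}
```

Here "111" shows up because barcodes holds three copies and ean_staging only two, which a plain set difference would not report.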

Your code doesn't seem to quite answer your question. If all you want to do is see which elements aren't shared, I think sets are the way to go.
set1 = set(list1)
set2 = set(list2)
in_first_but_not_in_second = set1.difference(set2) # outputs a set
not_in_both = set1.symmetric_difference(set2) # outputs a set

How to read specific lines of a large csv file

I am trying to read some specific rows of a large csv file, and I don't want to load the whole file into memory. The index of the specific rows are given in a list L = [2, 5, 15, 98, ...] and my csv file looks like this:
Col 1, Col 2, Col3
row11, row12, row13
row21, row22, row23
row31, row32, row33
...
Using the ideas mentioned here I use the following command to read the rows
with open('~/file.csv') as f:
    r = csv.DictReader(f) # I need to read it as a dictionary for my purpose
    for i in L:
        for row in enumerate(r):
            print row[i]
I immediately get the following error:
IndexError Traceback (most recent call last)
<ipython-input-25-78951a0d4937> in <module>()
6 for i in L:
7 for row in enumerate(r):
----> 8 print row[i]
IndexError: tuple index out of range
Question 1. It seems like my use of the for loops here is obviously wrong. Any ideas on how to fix this?
On the other hand, the following gets the job done, but it's too slow:
def read_csv_line(line_number):
    with open("~/file.csv") as f:
        r = csv.DictReader(f)
        for i, line in enumerate(r):
            if i == (line_number - 2):
                return line
    return None
for i in L:
    print read_csv_line(i)
Question 2. Any idea on how to improve this basic method of going through the whole file until I reach row i then print it?

A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such you cannot read the nth line without reading the lines before it, as you couldn't count the newline characters.
Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:
i=9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})
As you can see, row is a tuple with two members, calling row[i] means row[9], hence the IndexError.
Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):
def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row
So when you want to process the lines, you would do:
L = [2, 5, 15, 98, ...]
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)
* Edit *
This could further be improved to stop reading the file when you've read all the lines you wanted:
def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty. A plain return ends the generator;
            # raising StopIteration here is a RuntimeError on Python 3.7+.
            if not lines_set:
                return
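To try the early-stopping idea without a file on disk, the following self-contained sketch feeds csv.DictReader from an in-memory string. The name pick_rows and the sample data are assumptions made for this example. It ends the generator with a plain return, because since Python 3.7 a StopIteration raised inside a generator body is turned into a RuntimeError (PEP 479).

```python
import csv
import io


def pick_rows(csv_reader, wanted):
    """Yield (index, row) for the 0-based data-row indexes listed in wanted."""
    remaining = set(wanted)
    if not remaining:
        return
    for index, row in enumerate(csv_reader):
        if index in remaining:
            remaining.discard(index)
            yield index, row
            if not remaining:
                return  # nothing left to find, so stop reading the file


sample = io.StringIO(
    "Col 1,Col 2,Col 3\n"
    "row11,row12,row13\n"
    "row21,row22,row23\n"
    "row31,row32,row33\n"
)
picked = list(pick_rows(csv.DictReader(sample), [0, 2]))
for index, row in picked:
    print(index, row["Col 1"])
# 0 row11
# 2 row31
```

Note that DictReader consumes the header line itself, so index 0 refers to the first data row.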

Assuming L is a list containing the line numbers you want, you could do :
with open("~/file.csv") as f:
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L: # or (i+2) in L: from your second example
            print(line)
That way :
you read the file only once
you do not load the whole file in memory
you only get the lines you are interested in
The only caveat is that you read whole file even if L = [3]

for row in enumerate(r):
will pull tuples. You are then trying to select your ith element from a 2 element tuple.
for example
>> for i in enumerate({"a":1, "b":2}): print i
(0, 'a')
(1, 'b')
Additionally, since dictionaries are hash tables, your initial order is not necessarily preserved (this applies to Python versions before 3.7; from 3.7 on, dicts keep insertion order). For instance:
>>list({"a":1, "b":2, "c":3, "d":5})
['a', 'c', 'b', 'd']

Just to sum up the great ideas, I ended up using something like this: L can be sorted relatively quickly, and in my case it was actually already sorted. So, instead of several membership checks in L it pays off to sort it and then only check each index against the first entry of it. Here is my piece of code:
count=0
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for row in r:
        count += 1
        if L == []:
            break
        elif count == L[0]:
            print (row)
            L.pop(0)
Note that this stops as soon as we've gone through L once.
