I'm very new to Python. I am trying to extract data from a text file in the format:
85729 block addressing index approximate text retrieval
85730 automatic query expansion based divergence
etc...
The output text file should be a list of the words with no duplicate entries. The input text file can contain duplicates. The output will look like this:
block
addressing
index
approximate
etc....
With my code so far, I am able to get the list of words but the duplicates are included. I try to check for duplicates before I enter a word into the output file but the output does not reflect that. Any suggestions? My code:
infile = open("paper.txt", 'r')
outfile = open("vocab.txt", 'r+a')
lines = infile.readlines()
for i in lines:
    thisline = i.split()
    for word in thisline:
        digit = word.isdigit()
        found = False
        for line in outfile:
            if word in line:
                found = True
                break
        if (digit == False) and (found == False):
            outfile.write(word);
            outfile.write("\n");
I don't understand how for loops are closed in Python. In C++ or Java, curly braces can be used to define the body of a for loop, but I'm not sure how it's done in Python. Can anyone help?
Python loops are closed by dedenting; the whitespace on the left has semantic meaning. This saves you from furiously typing curly braces or do/od or whatever, and eliminates a class of errors where your indentation accidentally doesn't reflect your control flow accurately.
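For example, a minimal illustration: the loop body is everything indented under the for statement, and the first dedented line ends it:

for i in range(3):
    print(i)         # indented: inside the loop
    print(i * 2)     # still inside
print("done")        # dedented: runs once, after the loop ends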
Your input doesn't appear to be large enough to justify a loop over your output file (and if it did I'd probably use a gdbm table anyway), so you can probably do something like this (tested very briefly):
#!/usr/local/cpython-3.3/bin/python

with open('/etc/crontab', 'r') as infile, open('output.txt', 'w') as outfile:
    seen = set()                  # words already written to the output
    for line in infile:
        for word in line.split():
            if word not in seen:  # write each word only once
                seen.add(word)
                outfile.write('{}\n'.format(word))
New to coding and trying to figure out how to fix a broken CSV file so I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the CSV the tool does not wrap fields in quotation marks to mark them as a single string within the field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
The script shows no output, it just finishes silently. Any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        # if the line doesn't end with a date and ';', drop its newline
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ';', take off the newline and stuff the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of ways to do this. If you would rather use the logic of 'starts with 3 letters and a ;' and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        # a proper record starts with a 3-letter username followed by ';'
        if not re.search(r"^[A-Za-z]{3};", line):
            # continuation line: remove the newline previously added to all_the_data
            all_the_data = re.sub(r"\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic being: if the current line doesn't start right, take the previous line's newline out of all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line (string) in the txt file, because the continuation lines also begin with three letters, so a valid match to the pattern is always found. The if condition is never true and hence nothing prints.
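To see it concretely, with two sample lines from the question:

import re

# Both a record line and a continuation line begin with three letters,
# so the match always succeeds and `not re.match(...)` is always False:
print(bool(re.match(r'^[A-Za-z]{3}', 'tnn;125;3;I am writing a comment')))  # True
print(bool(re.match(r'^[A-Za-z]{3}', 'that contains new lines')))           # True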
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        # a new record has ';' right after the 3-letter username
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(' '.join(join_words))  # flush the completed record
            join_words = [line]          # start collecting the new one
        else:
            join_words.append(line)
    print(' '.join(join_words))          # flush the last record
I've tried to avoid regex here to keep it a little clearer, if possible. But regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':  # a real record: 3-letter username, then ';'
            yield previous
            previous = line
        else:               # a continuation: glue it onto the previous line
            previous = previous.strip() + " " + line
    yield previous          # don't forget last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
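As a quick illustration of that point, any iterable of strings can feed the csv module:

import csv

# a list of strings stands in for a file object here
rows = csv.reader(['a;b;c', 'd;e;f'], delimiter=';')
print(list(rows))  # [['a', 'b', 'c'], ['d', 'e', 'f']]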
But this is only a workaround; the correct way would be for the previous step to directly produce a correct CSV file.
I have multiple files, each containing a single line of, say, ~10M numbers. I want to check each file and print a 0 for each file that has repeated numbers and a 1 for each that doesn't.
I am using a list for counting frequency. Because of the large number of values per line, I want to update the frequency after accepting each number and break as soon as I find a repeated one. While this is simple in C, I have no idea how to do this in Python.
How do I input a line in a word-by-word manner without storing (or taking as input) the whole line?
EDIT: I also need a way for doing this from live input rather than a file.
Read the line, split the line, and copy the resulting list into a set. If the size of the set is less than the size of the list, the file contains repeated elements:
with open('filename', 'r') as f:
    for line in f:
        numbers = line.split()
        # a set drops duplicates, so a smaller set means repeats exist
        print(0 if len(set(numbers)) < len(numbers) else 1)
To read the file word by word, try this
import itertools

def readWords(file_object):
    word = ""
    # read one character at a time (itertools.imap: this is Python 2)
    for ch in itertools.takewhile(lambda c: bool(c), itertools.imap(file_object.read, itertools.repeat(1))):
        if ch.isspace():
            if word:  # In case of multiple spaces
                yield word
                word = ""
            continue
        word += ch
    if word:
        yield word  # Handles last word before EOF
Then you can do:
with open('filename', 'r') as f:
    seen = set()
    for num in itertools.imap(int, readWords(f)):
        if num in seen:
            break  # repeated number found; stop reading early
        seen.add(num)
This method should also work for streams because it only reads one byte at a time and yields one whitespace-delimited word at a time from the input stream.
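For the live-input part of the question, the same generator can read from sys.stdin, since it only ever calls .read(1) on the object it is given. A sketch (Python 2, to match the imap usage above):

import itertools
import sys

# readWords is the generator defined above
seen = set()
for num in itertools.imap(int, readWords(sys.stdin)):
    if num in seen:
        print 0  # a repeat: report and stop early
        break
    seen.add(num)
else:
    print 1  # the loop never broke, so no repeats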
After giving this answer, I've updated this method quite a bit. Have a look
https://gist.github.com/smac89/bddb27d975c59a5f053256c893630cdc
The way you are asking is not possible, I guess. You can't read word by word as such in Python. Something like this can be done:
f = open('words.txt')
for word in f.read().split():
    print(word)
After being faced with a syntax error for some noticeable time and realising I had made a foolish mistake, I corrected it, only to encounter a runtime error. I'm trying to write a program that reads the number of words in a file; however, instead of counting words, the program seems to count the number of letters, which is not beneficial for the outcome of my program. Please find the appropriate code below. Thanks for any and all contributions!
def GameStage02():
    global FileSelection
    global ReadFile
    global WordCount
    global WrdCount
    FileSelection = filedialog.askopenfilename(filetypes=(("*.txt files", ".txt"), ("*.txt files", "")))
    with open(FileSelection, 'r') as file:
        ReadFile = file.read()
    SelectTextLabel.destroy()
    WrdCount = 0
    for line in ReadFile:
        Words = line.split()
        WrdCount = WrdCount + len(Words)
    print(WrdCount)
    GameStage01Button.config(state=NORMAL)
Let's break it down:
ReadFile = file.read() will give you a string.
for line in ReadFile will iterate over the characters in that string.
Words=line.split() will give you a list with one or zero characters in it.
That's probably not what you want. Change
ReadFile = file.read()
to
ReadFile = file.readlines()
This will give you a list of lines, which you can iterate over and/or split into lists of words.
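For instance, a minimal sketch of the corrected counting loop (reusing the question's FileSelection variable):

with open(FileSelection, 'r') as f:
    ReadFile = f.readlines()  # a list of lines, not one big string

WrdCount = 0
for line in ReadFile:
    WrdCount = WrdCount + len(line.split())
print(WrdCount)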
In addition, note that file is not a good variable name (in Python2), because that's already the name of a builtin.
As a continuation of timgeb's answer, here is a working piece of code that does this:
import re

# open file.txt, read it, and split the content into lines
lines = open('file.txt').read().splitlines()
count = 0
for line in lines:
    # split the line on runs of whitespace to get the list of words in it
    words = re.split(r'\s+', line)
    count += len(words)
print(count)
I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple: (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)  # the name line plus the sequence line after it
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}

with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0].strip() in wanted:  # strip the trailing newline before comparing
            # write to output file, store in a list, or dict, ...
            wanted.discard(p[0].strip())
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional; remove if you only want the seq
            outfile.write(seq)
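Note that izip_longest is Python 2; on Python 3 the same two-lines-at-a-time pattern is spelled zip_longest:

from itertools import zip_longest

with open('refT.txt', 'r') as genomefile:
    # the same iterator is passed twice, so each pass pulls two lines
    for name, seq in zip_longest(*[genomefile] * 2):
        if seq is not None:  # guards against an odd trailing line
            print(name.rstrip(), seq.rstrip())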
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as genome:
        for line in genome:
            if line.startswith(valid_contigs):
                sequence = next(genome).strip()  # grab the sequence line that follows
                sequences.append(sequence)
    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the grabbed sequences to an output file is pretty straightforward (see the sketch after the example output below).
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
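As for that writing step, a short sketch (the output filename is just a placeholder):

# write one grabbed sequence per line to a hypothetical output file
with open('sequences_out.txt', 'w') as out:
    for seq in sequences:
        out.write(seq + '\n')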
This is supposed to be a compression function. We're supposed to read in a text file containing just words, and sort them by frequency and by the number of words, in both upper and lower case. I don't want an answer that solves it for me, I just want help.
for each word in the input list
    if the word is new to the list of unique words
        insert it into the list and set its frequency to 1
    otherwise
        increase its frequency by 1
sort unique word list by frequencies (function)
open input file again
open output file with appropriate name based on input filename (.cmp)
write out to the output file all the words in the unique words list
for each line in the file (delimited by newlines only!)
    split the line into words
    for each word in the line
        look up each word in the unique words list
        and write its location out to the output file
        don't output a newline character
    output a newline character after the line is finished
close both files
tell user compression is done
This is my next step:
def compression():
    for line in infile:
        words = line.split()

def get_file():
    opened = False
    fn = input("enter a filename ")
    while not opened:
        try:
            infile = open(fn, "r")
            opended = True
        except:
            print("Won't open")
            fn = input("enter a filename ")
    return infile

def compression():
    get_file()
    data = infile.readlines()
    infile.close()
    for line in infile:
        words = line.split()

words = []
word_frequencies = []

def main():
    input("Do you want to compress or decompress? Enter 'C' or 'D' ")

main()
So I'm not going to do an entire assignment for you, but I can try my best and walk you through it one by one.
It would seem the first step is to create a list of all the words in the text file (assuming you know file reading methods). For this, you should look into Python's split function (regular expressions can be used for more complex variations of this). So you need to store each word somewhere, and pair that value with the number of times it appears. Sounds like the job of a dictionary, to me. That should get you off on the right track.
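For example, a minimal sketch of the dictionary idea (the sample sentence is just a stand-in for the file's words):

words = "the cat sat on the mat".split()  # stand-in for the file's words
frequencies = {}
for word in words:
    # get() returns the running count, or 0 the first time a word appears
    frequencies[word] = frequencies.get(word, 0) + 1
print(frequencies)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}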
Thanks to the pseudocode I can more or less figure out what this should look like. I have to do some guesswork since you say you're restricted to what you've covered in class, and we have no way to know what that includes and what it doesn't.
I am assuming you can handle opening the input file and splitting it into words. That's pretty basic stuff - if your class is asking you to handle input files it must have covered it. If not, the main thing to look back to is that you can iterate over a file and get its lines:
with open('example.txt') as infile:
    for line in infile:
        words = line.split()
Now, for each word you need to keep track of two things - the word itself and its frequency. Your problem requires you to use lists to store your information. This means you have to either use two different lists, one storing the words and one storing their frequencies, or use one list that stores two facts for each of its positions. There are disadvantages either way - lists are not the best data structure to use for this, but your problem definition puts the better tools off limits for now.
I would probably use two different lists, one containing the words and one containing the frequency counts for the word in the same position in the word list. The advantage there is that that would allow using the index method on the word list to find the list position of a given word, and then increment the matching frequency count. That will be much faster than searching a list that stores both the word and its frequency count using a for loop. The down side is that sorting is harder - you have to find a way to retain the list position of each frequency when you sort, or to combine the words and their frequencies before sorting. In this approach, you probably need to build a list data that stores two pieces of information - the frequency count and then either the word list index or the word itself - and then sort that list by the frequency count.
I hope that part of the point of this exercise is to help drive home how useful the better data structures are when you're allowed to use them.
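As a sketch of that sorting step, assuming the two parallel lists have already been filled in:

words = ['cat', 'the', 'mat']    # example parallel lists
word_frequencies = [1, 2, 1]
# pair each count with its word, then sort with the highest count first
by_frequency = sorted(zip(word_frequencies, words), reverse=True)
print(by_frequency)  # [(2, 'the'), (1, 'mat'), (1, 'cat')]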
So the inner loop is going to look something like this:
words = []
word_frequencies = []

for line in infile:
    for word in line.split():
        try:
            word_position = words.index(word)
        except ValueError:
            # word is not in words yet
            words.append(word)
            # what do you think should happen to word_frequencies here?
        else:
            # now word_position gives the position of word in both words and word_frequencies
            # what do you think should happen to word_frequencies here?
            pass