Python: Issue with Writing over Lines? - python

So, this is the code I'm using in Python to remove lines, hence the name "cleanse." I have a list of a few thousand words and their parts-of-speech:
NN by
PP at
PP at
... This is the issue. For whatever reason (one I can't figure out and have been trying to for a few hours), the program I'm using to go through the word inputs isn't clearing duplicates, so the next best thing I can do is the former! Y'know, cycle through the file and delete the duplicates on run. However, whenever I do, this code instead takes the last line of the list and duplicates that hundreds of thousands of times.
Thoughts, please? :(
EDIT: The idea is that cleanseArchive() goes through a file called words.txt, takes any duplicate lines and deletes them. Since Python isn't able to delete lines, though, and I haven't had luck with any other methods, I've turned to essentially saving the non-duplicate data in a list (saveList) and then writing each object from that list into a new file (deleting the old). However, as of the moment as I said, it just repeats the final object of the original list thousands upon thousands of times.
EDIT2: This is what I have so far, taking suggestions from the replies:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
but ATM it's giving me this error:
Traceback (most recent call last):
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 154, in <module>
initialize()
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 100, in initialize
cleanseArchive()
File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 29, in cleanseArchive
f.write(saveList)
TypeError: must be str, not set

for i in saveList:
    f.write(n+"\n")
The loop variable is i, but you write n, so the same value of n is written on every iteration.
Try this:
for i in saveList:
    f.write(i+"\n")

If you just want to delete "duplicated lines", I've modified your reading code:
saveList = []
duplicates = []
with open("words.txt", "r") as ins:
    for line in ins:
        if line not in duplicates:
            duplicates.append(line)
            saveList.append(line)
Additionally take the correction above!

def cleanseArchive():
    f = open("words.txt", "r+")
    f.seek(0)
    given_line = f.readlines()
    saveList = set()
    for x,y in enumerate(given_line):
        t=(y)
        saveList.add(t)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    for i in saveList: f.write(i)
Finished product! I ended up digging into enumerate and essentially just using that to get the strings. Man, Python has some bumpy roads when you get into sets/lists, holy shit. So much stuff not working for very ambiguous reasons! Whatever the case, fixed it up.

Let's clean up this code you gave us in your update:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
We have bad names that don't respect the Style Guide for Python Code, we have superfluous code parts, we don't use the full power of Python and part of it is not working.
Let us start with dropping unneeded code while at the same time using meaningful names.
def cleanse_archive():
    infile = open("words.txt", "r")
    given_lines = infile.readlines()
    words = set(given_lines)
    infile.close()
    outfile = open("words.txt", "w")
    outfile.write(words)
The seek was not needed, the mode for opening a file to read is now just r, the mode for writing is now w, and we dropped the removal of the file because it will be overwritten anyway. Looking at this clearer code, we see that we forgot to close the file after writing. If we open the file with the with statement, Python will take care of that for us.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        outfile.write(words)
Now that we have clear code we'll deal with the error message that occurs when outfile.write is called: TypeError: must be str, not set. This message is clear: You can't write a set directly to the file. Obviously you'll have to loop over the content of the set.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        for word in words:
            outfile.write(word)
That's it.
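One thing to keep in mind with the set-based version: a set has no defined order, so the lines in the rewritten file may come out shuffled. If the order matters, a minimal sketch along the following lines keeps the first occurrence of each line. The function name `dedupe_lines` and its `path` parameter are illustrative and not from the original post, and the sketch assumes Python 3.7 or later, where dicts preserve insertion order.

```python
def dedupe_lines(path):
    """Rewrite the file at `path`, keeping only the first copy of each line."""
    with open(path, "r") as infile:
        # dict keys are unique and keep insertion order (Python 3.7+),
        # so this drops later duplicates without shuffling the lines.
        unique_lines = list(dict.fromkeys(infile))
    with open(path, "w") as outfile:
        outfile.writelines(unique_lines)
    return unique_lines
```

Calling `dedupe_lines("words.txt")` would then rewrite the file in place and return the lines that were kept.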

Related

Split and print the word before and after the \ of *n of lines, from a txt to two different txt's

I searched around a bit, but I couldn't find a solution that fits my needs.
I'm new to python, so I'm sorry if what I'm asking is pretty obvious.
I have a .txt file (for simplicity I will call it inputfile.txt) with a list of names of folder\files like this:
camisos\CROWDER_IMAG_1.mov
camisos\KS_HIGHENERGY.mov
camisos\KS_LOWENERGY.mov
What I need is to split the first word (the one before the \) and write it to a txt file (for simplicity I will call it outputfile.txt).
Then take the second (the one after the \) and write it in another txt file.
This is what i did so far:
with open("inputfile.txt", "r") as f:
    lines = f.readlines()
with open("outputfile.txt", "w") as new_f:
    for line in lines:
        text = input()
        print(text.split()[0])
This in my mind should print only the first word in the new txt, but I only got an empty txt file without any error.
Any advice is much appreciated, thanks in advance for any help you could give me.
You can read the file in a list of strings and split each string to create 2 separate lists.
with open("inputfile.txt", "r") as f:
    lines = f.readlines()
X = []
Y = []
for line in lines:
    X.append(line.split('\\')[0] + '\n')
    Y.append(line.split('\\')[1])
with open("outputfile1.txt", "w") as f1:
    f1.writelines(X)
with open("outputfile2.txt", "w") as f2:
    f2.writelines(Y)
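The indexing with `[1]` above raises an IndexError on any line that has no backslash (a blank line at the end of the file, for example). A hedged variant using `str.partition` might look like the sketch below; the helper name `split_paths` is made up for illustration, and treating a line without a backslash as a bare file name with an empty folder is an assumption, not something the question specifies.

```python
def split_paths(lines):
    """Split 'folder\\file' lines into two parallel lists of output lines."""
    folders, files = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # partition splits on the first backslash only and never raises
        folder, sep, name = line.partition("\\")
        if not sep:
            # Assumption: no backslash means the line is just a file name
            folder, name = "", line
        folders.append(folder + "\n")
        files.append(name + "\n")
    return folders, files
```

The two returned lists can be passed straight to `writelines()` on the two output files, exactly as in the answer above.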

improve search using a dict and pyahocorasick

I'm new to Python and I don't know how to program well. How do I edit this code so it works using pyahocorasick? My code is very slow, because I need to search for lots of strings in a very big file.
Any other way to improve the search?
import sys
with open('C:/dict_search.txt', 'r') as search_list:
    targets = [line.strip() for line in search_list]
with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if any(target in line for target in targets):
            fout.write(line)
Dict_search.txt
509344
827276
324194
782211
772854
727246
858908
280903
377881
247333
538710
182734
701212
379326
148310
542129
315285
840427
581092
485581
867746
434527
746814
749479
252045
189668
418513
624231
620284
(...)
source.txt
1,324194,20190103,0000048632,00000000000004870,0000045054!
1,701212,20190103,0000048632,00000000000147072,0000045055!
1,581092,20190103,0000048632,00000000000032900,0000045056!
(...)
I need to check whether each "word" from dict_search.txt appears in source.txt, and if a word is on a line, I need to copy that line to another file.
The problem is that my source.txt is very big and I have more than 100k words in dict_search.txt.
My code takes a very long time to execute. I tried using set(), but I got a blank file.
After looking at your files, it looks like each line in dict_search.txt matches the format of the second column in source.txt. If that is the case, the code below will work for you. It is a linear-time solution, so it is fast, at the cost of memory, because it builds a dictionary of the source file in memory.
d={}
with open("source.txt", 'r') as f:
    for index, line in enumerate(f):
        l = line.strip().split(",")
        d[l[1]]= line
with open("Dict_search.txt", 'r') as search, open('output.txt', 'w') as output:
    for line in search:
        row = line.strip()
        if row in d:
            output.write(d[row])
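Since source.txt is the big file, it may be cheaper to do it the other way around: keep the search keys in a set and stream the source file line by line. The sketch below is one way to do that, under the same assumption as the answer above (the key is always the whole second comma-separated field). It does not use pyahocorasick, and the name `filter_lines` and its parameters are illustrative only. Unlike the dictionary version, it keeps every matching line even when a key appears on several lines.

```python
def filter_lines(source_lines, targets, column=1, sep=","):
    """Yield the lines whose field at `column` is one of `targets`."""
    wanted = set(targets)  # set membership tests are O(1) on average
    for line in source_lines:
        fields = line.split(sep)
        if len(fields) > column and fields[column] in wanted:
            yield line


def filter_file(source_path, targets_path, out_path):
    """Stream source_path and copy matching lines to out_path."""
    with open(targets_path, "r") as search:
        targets = [row.strip() for row in search if row.strip()]
    with open(source_path, "r") as source, open(out_path, "w") as out:
        out.writelines(filter_lines(source, targets))
```

Only the keys are held in memory here, so the size of source.txt does not matter much. If the keys can appear anywhere in a line rather than in a fixed column, this approach no longer applies and a multi-pattern matcher is the better fit.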

"Move" some parts of the file to another file

Let's say I have a file with 48,222 lines. I then give an index value, say, 21,000.
Is there any way in Python to "move" the contents of the file starting from index 21,000, so that I end up with two files: the original one and a new one? The original one would then have 21,000 lines and the new one 27,222 lines.
I read this post which uses partition and is quite describing what I want:
with open("inputfile") as f:
    contents1, sentinel, contents2 = f.read().partition("Sentinel text\n")
with open("outputfile1", "w") as f:
    f.write(contents1)
with open("outputfile2", "w") as f:
    f.write(contents2)
Except that (1) it uses "Sentinel text" as the separator, and (2) it creates two new files and requires me to delete the old file. As of now, the way I do it is like this:
for r in result.keys():  # the filenames are in my dictionary, don't bother with that
    f = open(r)
    lines = f.readlines()
    f.close()
    with open("outputfile1.txt", "w") as fn:
        for line in lines[0:21000]:
            fn.write(line)  # write each line
    with open("outputfile2.txt", "w") as fn:
        for line in lines[21000:]:
            fn.write(line)  # write each line
Which is quite manual work. Is there a built-in or more efficient way?
You can also use writelines() and dump the sliced list of lines from 0 to 20999 into one file and another sliced list from 21000 to the end into another file.
with open("inputfile") as f:
    content = f.readlines()
    content1 = content[:21000]
    content2 = content[21000:]
with open("outputfile1.txt", "w") as fn1:
    fn1.writelines(content1)
with open('outputfile2.txt','w') as fn2:
    fn2.writelines(content2)
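If the goal really is to keep the original file and only cut its tail off, one possible sketch is to copy everything after the index into the new file and then truncate the original in place. The function name `move_tail` and its parameters are hypothetical. The files are opened in binary mode so that `tell()` and `truncate()` work on exact byte offsets.

```python
import shutil


def move_tail(src_path, dst_path, index):
    """Move every line after the first `index` lines of src into dst."""
    with open(src_path, "rb+") as src, open(dst_path, "wb") as dst:
        # Skip the lines that should stay in the original file.
        for _ in range(index):
            if not src.readline():
                break  # file has fewer than `index` lines
        cut = src.tell()  # byte offset where the tail starts
        shutil.copyfileobj(src, dst)  # copy the tail to the new file
        src.truncate(cut)  # drop the tail from the original
```

With the numbers from the question, `move_tail("inputfile", "outputfile2.txt", 21000)` would leave 21,000 lines in the original and put the remaining 27,222 in the new file. Note that the original is modified in place, so a crash between the copy and the truncate leaves the tail in both files rather than losing it.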

How to read first line of a file twice?

I have a big file with many lines and want to read the first line first and then loop through all lines, starting with the first line again.
I first thought that something like that would do it:
file = open("fileName", 'r')
first_line = file.readline()
DoStuff_1(first_line)
for line in file:
    DoStuff_2(line)
file.close()
But the issue with this script is that the first line that is passed to DoStuff_2 is the second line and not the first one. I don't have a good intuition of what kind of object file is. I think it is an iterator and don't really know how to deal with it. The bad solution I found is
file = open("fileName", 'r')
first_line = file.readline()
count = 0
for line in file:
    if count == 0:
        count = 1
        DoStuff_1(first_line)
    DoStuff_2(line)
file.close()
But it is pretty dumb and is computationally a bit costly, as it runs an if statement at each iteration.
You could do this:
with open('fileName', 'r') as file:
    first_line = file.readline()
    DoStuff_1(first_line)
    DoStuff_2(first_line)
    # remaining lines
    for line in file:
        DoStuff_2(line)
Note that I changed your code to use with so file is automatically closed.
I'd suggest using a generator to abstract your general control flow. Something like:
def first_and_file(file_obj):
    """
    :type file_obj: file
    :rtype: (str, __generator[str])
    """
    first_line = next(file_obj)
    def gen_rest():
        yield first_line
        yield from file_obj
    return first_line, gen_rest()
In Python 2.7, swap out the yield from for:
for line in file_obj:
    yield line
Another answer is to just open the file twice.
with open("file.txt", "r") as r:
    DoStuff_1(r.readline())
with open("file.txt", "r") as r:
    for line in r:
        DoStuff_2(line)
One solution for the general case of this question is to keep track of the line number you are on. After completing an operation that requires you to go back to a previous line, call file.seek(0) and then call file.readline() the required number of times to reach that line.
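For the specific case in the question, where the line to revisit is the first one, rewinding with `seek(0)` is enough. A minimal sketch, with `first_then_all`, `do_first` and `do_each` as placeholder names standing in for the asker's DoStuff_1 and DoStuff_2:

```python
def first_then_all(path, do_first, do_each):
    """Call do_first on the first line, then do_each on every line."""
    with open(path, "r") as f:
        do_first(f.readline())
        f.seek(0)  # rewind so the loop starts at the first line again
        for line in f:
            do_each(line)
```

This reads the first line twice from disk, which is normally negligible compared with the cost of processing a big file.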

Read a multielement list, look for an element and print it out in python

I am writing a Python script in order to write a tex file, but I have to use some information from another file. That file has the name of a menu on each line, which I need to use. I use split to get a list for each line of my "menu".
For example, I have to write a section for the second element of each of my lists, but after running the script I get nothing. What could I do?
This is roughly what I am doing:
texfile = open(outputtex.tex', 'w')
infile = open(txtfile.txt, 'r')
for line in infile.readlines():
    linesplit = line.split('^')
    for i in range(1,len(infile.readlines())):
        texfile.write('\section{}\n'.format(linesplit[1]))
        texfile.write('\\begin{figure*}[h!]\n')
        texfile.write('\centering\n')
        texfile.write('\includegraphics[scale=0.95]{pg_000%i.pdf}\n' %i)
        texfile.write('\end{figure*}\n')
        texfile.write('\\newpage\n')
texfile.write('\end{document}')
texfile.close()
By the way, in the includegraphics line, I have to increase the number after pg_ from "0001" to "25050". Any clues?
I really appreciate your help.
I don't quite follow your question. But I see several errors in your code. Most importantly:
for line in infile.readlines():
...
...
for i in range(1,len(infile.readlines())):
Once you read a file, it's gone. (You can get it back, but in this case there's no point.) That means that the second call to readlines is yielding nothing, so len(infile.readlines()) == 0. Assuming what you've written here really is what you want to do (i.e. write file_len * (file_len - 1) + 1 lines?) then perhaps you should save the file to a list. Also, you didn't put quotes around your filenames, and your indentation is strange. Try this:
with open('txtfile.txt', 'r') as infile: # (with automatically closes infile)
    in_lines = infile.readlines()
in_len = len(in_lines)
texfile = open('outputtex.tex', 'w')
for line in in_lines:
    linesplit = line.split('^')
    for i in range(1, in_len):
        # doubled braces are literal braces for str.format
        texfile.write('\\section{{{0}}}\n'.format(linesplit[1]))
        texfile.write('\\begin{figure*}[h!]\n')
        texfile.write('\\centering\n')
        texfile.write('\\includegraphics[scale=0.95]{pg_000%i.pdf}\n' %i)
        texfile.write('\\end{figure*}\n')
        texfile.write('\\newpage\n')
texfile.write('\\end{document}')
texfile.close()
Perhaps you don't actually want nested loops?
infile = open('txtfile.txt', 'r')
texfile = open('outputtex.tex', 'w')
for line_number, line in enumerate(infile):
    linesplit = line.split('^')
    texfile.write('\\section{{{0}}}\n'.format(linesplit[1]))
    texfile.write('\\begin{figure*}[h!]\n')
    texfile.write('\\centering\n')
    texfile.write('\\includegraphics[scale=0.95]{pg_000%i.pdf}\n' % line_number)
    texfile.write('\\end{figure*}\n')
    texfile.write('\\newpage\n')
texfile.write('\\end{document}')
texfile.close()
infile.close()
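On the page numbering: a hard-coded "pg_000" prefix only produces the right name for single-digit numbers. A format specification such as `{:04d}` pads to four digits and leaves longer numbers alone, which gives pg_0001 through pg_25050. The sketch below assumes pages are numbered from 1 in file order and that the section title is the second '^'-separated field; the helper names `figure_block` and `build_body` are made up for illustration.

```python
def figure_block(title, page):
    """Return the LaTeX for one section with its zero-padded page image."""
    # Doubled braces are literal braces; {} and {:04d} are the two fields.
    return (
        "\\section{{{}}}\n"
        "\\begin{{figure*}}[h!]\n"
        "\\centering\n"
        "\\includegraphics[scale=0.95]{{pg_{:04d}.pdf}}\n"
        "\\end{{figure*}}\n"
        "\\newpage\n"
    ).format(title, page)


def build_body(lines):
    """Build all blocks; assumes the title is the second '^'-separated field."""
    return "".join(
        figure_block(line.split("^")[1].strip(), page)
        for page, line in enumerate(lines, start=1)  # pages start at 1
    )
```

The result of `build_body(in_lines)` can be written to the tex file in one `write()` call, followed by the closing `\end{document}` line.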
