AWK to Python For Mbox - python

What would be the best Pythonic way of implementing this awk command in python?
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==500){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox
I'm using this now to split up enormous mailbox (mbox format) files.
I'm trying a recursive method right now.
def chunkUp(mbox, chunk=0):
    with open(mbox, 'r') as bigfile:
        msg = 0
        for line in bigfile:
            if msg == 0:
                with open("./TestChunks/chunks/chunk_"+str(chunk)+".txt", "a+") as cf:
                    if line.startswith("From "): msg += 1
                    cf.write(line)
                    if msg > 20: chunkUp(mbox, chunk+1)
I would love to be able to implement this in python and be able to resume progress if it is interrupted. Working on that bit now.
I'm tying my brain into knots! Cheers!

Your recursive approach is doomed to fail: you may end up with too many files open at once, since the with blocks don't exit until the recursion unwinds at the end of the program.
Better to keep one handle open and write to it, then close it and open a new handle when a line starting with "From " is encountered.
Also, open your files in write mode, not append. The code below tries to do the minimal operations and tests to write each line to a file, and closes the current file and opens another when a "From " line is found. At the end, the last file is closed.
def chunkUp(mbox):
    with open(mbox, 'r') as bigfile:
        handle = None
        chunk = 0
        for line in bigfile:
            if line.startswith("From "):
                # next (or first) file
                chunk += 1
                if handle is not None:
                    handle.close()
                    handle = None
            # file was closed / first file: create a new one
            if handle is None:
                handle = open("./TestChunks/chunks/chunk_{}.txt".format(chunk), "w")
            # write the line in the current file
            handle.write(line)
        if handle is not None:
            handle.close()
I haven't tested it, but it's simple enough that it should work. If the file doesn't start with a "From " line, all lines before the first one are stored in the chunk_0.txt file.
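The awk command in the question groups many messages per output file rather than one. A minimal sketch of that grouping with a single open handle follows. The function name split_mbox, the out_dir argument and the msgs_per_chunk parameter are illustrative assumptions, and binary mode is my own choice to sidestep encoding errors in mail bodies. Unlike the awk one-liner, which rolls over on the 500th message, this version puts exactly msgs_per_chunk messages in every chunk except possibly the last.

```python
import os

def split_mbox(mbox_path, out_dir, msgs_per_chunk=500):
    """Split an mbox file into chunk_N.txt files of msgs_per_chunk messages each."""
    os.makedirs(out_dir, exist_ok=True)
    chunk = 0            # index of the current output file
    msgs_in_chunk = 0    # messages written to the current output file
    handle = None
    paths = []
    with open(mbox_path, "rb") as bigfile:
        try:
            for line in bigfile:
                if line.startswith(b"From "):
                    # a new message starts: roll over if the current chunk is full
                    if handle is not None and msgs_in_chunk >= msgs_per_chunk:
                        handle.close()
                        handle = None
                        chunk += 1
                        msgs_in_chunk = 0
                    msgs_in_chunk += 1
                if handle is None:
                    path = os.path.join(out_dir, "chunk_{}.txt".format(chunk))
                    handle = open(path, "wb")
                    paths.append(path)
                handle.write(line)
        finally:
            # only one output handle is ever open, and it is always closed
            if handle is not None:
                handle.close()
    return paths
```

Calling split_mbox("mbox", "./TestChunks/chunks") would write chunk_0.txt, chunk_1.txt and so on, and return the list of paths it created.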

Related

How do I properly read large text files in Python so I dont clog up memory?

So today while buying BTC I messed up and lost my decryption passphrase to wallet that ATM sends automatically on email.
I remember the last 4 characters of the passphrase so I generated a wordlist and wanted to try to bruteforce my way into it. It was a 4MB file and the script checked all the possibilities with no luck. Then I realized that maybe the letters are wrong, but I still remember what numbers were in those 4 chars. Well suddenly, I have 2GB file that get SIGKILLed by Ubuntu.
Here is the whole code, it is very short.
#!/usr/bin/python
from zipfile import ZipFile
import sys
i = 0
found = False
with ZipFile("/home/kuskus/Desktop/wallet.zip") as zf:
    with open('/home/kuskus/Desktop/wl.txt') as wordlist:
        for line in wordlist.readlines():
            if(not found):
                try:
                    zf.extractall(pwd = str.encode(line))
                    print("password found: %s" % line)
                    found = True
                except:
                    print(i)
                    i += 1
            else: sys.exit()
I think the issue is that the text file fills up the memory so the OS kills it. I really don't know how I could read the file, maybe by 1000 lines, then clean it and do another 1000 lines. If anyone could help me I would be very grateful, thank you in advance :) Oh and the text file has about 300 million lines, if it matters.
Usually the best thing to do is to iterate over the file directly. The file handle acts as an iterator, producing lines one at a time rather than aggregating them all into a list in memory at once (as fh.readlines() does):
with open("somefile") as fh:
    for line in fh:
        pass  # do something with line
Furthermore, file handles allow you to read specific amounts of data if you so choose:
with open("somefile") as fh:
    chars = fh.read(15)  # read up to 15 characters (text mode)
    while chars:
        # do something with chars
        chars = fh.read(15)
Or, if you want to read a specific number of lines:
from itertools import islice
with open('somefile') as fh:
    while True:
        chunk_of_lines = list(islice(fh, 5))  # this will read up to 5 lines at a time
        if not chunk_of_lines:
            break
        # do something else here
Here islice(fh, 5) is analogous to calling next(fh) up to five times.
A while loop is used in the latter two examples because once the file has been completely read, fh.read(some_integer) returns an empty string and the list of lines comes back empty; both are falsy and terminate the loop.
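Applied to the wordlist in the question, direct iteration might look like the sketch below. The helper name find_first and the is_correct callback are hypothetical; the callback stands in for whatever check you run per candidate, for example a try/except around zf.extractall(pwd=candidate.encode()). Note that each line read from a file keeps its trailing newline, so the sketch strips it before testing the candidate.

```python
def find_first(wordlist_path, is_correct):
    """Return the first candidate accepted by is_correct, or None if none match."""
    with open(wordlist_path) as wordlist:
        for line in wordlist:               # only one line is held in memory at a time
            candidate = line.rstrip("\n")   # lines read from a file keep their newline
            if is_correct(candidate):
                return candidate
    return None
```

Because the function returns as soon as a candidate is accepted, there is no need for a found flag or sys.exit().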

How to read a csv file in tail -f manner using python?

I want to read the csv file in a manner similar to tail -f i.e. like reading an error log file.
I can perform this operation in a text file with this code:
while 1:
    where = self.file.tell()
    line = self.file.readline()
    if not line:
        print "No line waiting, waiting for one second"
        time.sleep(1)
        self.file.seek(where)
    if (re.search('[a-zA-Z]', line) == False):
        continue
    else:
        response = self.naturalLanguageProcessing(line)
        if(response is not None):
            response["id"] = self.id
            self.id += 1
            response["tweet"] = line
            self.saveResults(response)
        else:
            continue
How do I perform the same task for a csv file? I have gone through a link which can give me last 8 rows but that is not what I require. The csv file will be getting updated simultaneously and I need to get the newly appended rows.
Connecting A File Tailer To A csv.reader
In order to plug your code that looks for content newly appended to a file into a csv.reader, you need to put it into the form of an iterator.
I'm not intending to showcase correct code, but specifically to show how to adopt your existing code into this form, without making assertions about its correctness. In particular, the sleep() would be better replaced with a mechanism such as inotify to let the operating system assertively inform you when the file has changed; and the seek() and tell() would be better replaced with storing partial lines in memory rather than backing up and rereading them from the beginning over and over.
import csv
import time
class FileTailer(object):
    def __init__(self, file, delay=0.1):
        self.file = file
        self.delay = delay
    def __iter__(self):
        while True:
            where = self.file.tell()
            line = self.file.readline()
            if line and line.endswith('\n'): # only emit full lines
                yield line
            else:                            # for a partial line, pause and back up
                time.sleep(self.delay)       # ...not actually a recommended approach.
                self.file.seek(where)
csv_reader = csv.reader(FileTailer(open('myfile.csv')))
for row in csv_reader:
    print("Read row: %r" % (row,))
If you create an empty myfile.csv, start python csvtailer.py, and then echo "first,line" >>myfile.csv from a different window, you'll see the output of Read row: ['first', 'line'] immediately appear.
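The suggestion of keeping partial lines in memory, instead of calling seek() and tell(), could be sketched roughly as follows. The generator name tail_lines and the max_idle_polls parameter are assumptions added for illustration (the latter only so the loop can end in a test); it still polls with sleep() rather than using inotify.

```python
import time

def tail_lines(fh, delay=0.1, max_idle_polls=None):
    """Yield complete lines from fh, buffering partial lines in memory."""
    pending = ""   # text read so far that does not yet end in a newline
    idle = 0       # consecutive polls that returned no new data
    while max_idle_polls is None or idle < max_idle_polls:
        chunk = fh.readline()
        if not chunk:
            # nothing new yet: wait and poll again
            idle += 1
            time.sleep(delay)
            continue
        idle = 0
        pending += chunk
        if pending.endswith("\n"):   # only emit full lines
            yield pending
            pending = ""
```

Since this is a plain iterator of lines, it can be handed to csv.reader in the same way as the FileTailer instance above.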
Finding A Correct File Tailer In Python
For a correctly-implemented iterator that waits for new lines to be available, consider referring to one of the existing StackOverflow questions on the topic:
How to implement a pythonic equivalent of tail -F?
Reading infinite stream - tail
Reading updated files on the fly in Python

Open text file, print new lines only in python

I am opening a text file, which once created is constantly being written to, and then printing this out to a console any new lines, as I don't want to reprint the whole text file each time. I am checking to see if the file grows in size, if it is, just print the next new line. This is mostly working, but occasionally it gets a bit confused about the next new line, and new lines appear a few lines up, mixed in with the old lines.
Is there a better way to do this? Below is my current code.
infile = "Null"
while not os.path.exists(self.logPath):
    time.sleep(.1)
if os.path.isfile(self.logPath):
    infile = codecs.open(self.logPath, encoding='utf8')
else:
    raise ValueError("%s isn't a file!" % file_path)
lastSize = 0
lastLineIndex = 0
while True:
    wx.Yield()
    fileSize = os.path.getsize(self.logPath)
    if fileSize > lastSize:
        lines = infile.readlines()
        newLines = 0
        for line in lines[lastLineIndex:]:
            newLines += 1
            self.running_log.WriteText(line)
        lastLineIndex += newLines
        if "DBG-X: Returning 1" in line:
            self.subject = "FAILED! - "
            self.sendEmail(self)
            break
        if "DBG-X: Returning 0" in line:
            self.subject = "PASSED! - "
            self.sendEmail(self)
            break
    fileSize1 = fileSize
    infile.flush()
    infile.seek(0)
infile.close()
Also my application freezes whilst waiting for the text file to be created, as it takes a couple of seconds to appear, which isn't great.
Cheers.
This solution could help. You'd also have to do a bit of waiting until the file appears, using os.path.isfile and time.sleep.
Maybe you could:
open the file each time you need to read in it,
use lastSize as argument to seek directly to where you stopped at last reading.
Additional comment: I don't know if you need some protection, but I think you should not bother to test whether given filename is a file or not; just open it in a try...except block and catch problems if any.
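A rough sketch of those suggestions, assuming a hypothetical helper named read_new_text: it opens the file on each call, seeks to the stored offset, and catches the error if the file is not there yet. Binary mode is used here (my assumption, not part of the answer) so that the offset is an exact byte position, and only complete lines are consumed.

```python
def read_new_text(path, offset):
    """Return (new_text, new_offset) for complete lines appended to path since offset."""
    try:
        with open(path, "rb") as fh:
            fh.seek(offset)      # jump straight to where the last read stopped
            data = fh.read()
    except OSError:
        # the file is missing or unreadable for now: report nothing new
        return "", offset
    end = data.rfind(b"\n") + 1  # consume only up to the last complete line
    return data[:end].decode("utf8"), offset + end
```

The caller keeps the returned offset and passes it back on the next poll, so nothing is printed twice and a half-written line is picked up once it is finished.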
As for the freezing of your application, you may want to use some kind of Threading, for instance: one thread, your main one, is handling the GUI, and a second one would wait for the file to be created. Once the file is created, the second thread sends signals to the GUI thread, containing the data to be displayed.

Read line from file, process it, then remove it

I have a 22mb text file containing a list of numbers (1 number per line). I am trying to have python read the number, process the number and write the result in another file. All of this works but if I have to stop the program it starts all over from the beginning. I tried to use a mysql database at first but it was way too slow. I am getting about 4 times the number being processed this way. I would like to be able to delete the line after the number was processed.
with open('list.txt', 'r') as file:
    for line in file:
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print "File", filename, "exists, skipping!"
        else:
            #process number and write file
            #(need code to delete current line here)
As you can see every time it is restarted it has to search the hard drive for the file name to make sure it gets to the place it left off. With 1.5 million numbers this can take a while. I found an example with truncate but it did not work.
Are there any commands similar to array_shift (PHP) for python that will work with text files.
I would use a marker file to keep the number of the last line processed instead of rewriting the input file:
import os
start_from = 0
try:
    with open('last_line.txt', 'r') as llf:
        start_from = int(llf.read())
except (IOError, ValueError):
    pass  # no usable marker yet: start from the first line
with open('list.txt', 'r') as file:
    for i, line in enumerate(file):
        if i < start_from:
            continue
        filename = line.rstrip('\n') + ".txt"
        if os.path.isfile(filename):
            print("File %s exists, skipping!" % filename)
        else:
            pass  # process number and write file
        # record how many lines are done so a restart skips exactly those
        with open('last_line.txt', 'w') as outfile:
            outfile.write(str(i + 1))
This code first checks for the file last_line.txt and tries to read a number from it. The number is the count of lines that were processed during the previous attempt. Then it simply skips that many lines.
I use Redis for stuff like that. Install Redis and then the redis Python client (redis-py), and you can have a persistent set in memory. Then you can do:
import redis
r = redis.StrictRedis('localhost')
with open('list.txt', 'r') as file:
    for line in file:
        if r.sismember('done', line):
            continue
        else:
            # process number and write file
            r.sadd('done', line)
If you don't want to install Redis you can also use the shelve module, making sure that you open it with the writeback=False option. I really recommend Redis though, it makes things like this so much easier.
Reading the data file should not be a bottleneck. The following code read a 36 MB, 697997 line text file in about 0.2 seconds on my machine:
import time
start = time.perf_counter()
with open('procmail.log', 'r') as f:
    lines = f.readlines()
end = time.perf_counter()
print('Readlines time: %s' % (end - start))
It produced the following result:
Readlines time: 0.1953125
Note that this code produces a list of lines in one go.
To know where you've been, just write the number of lines you've processed to a file. Then if you want to try again, read all the lines and skip the ones you've already done:
import os
# Read the data file
with open('list.txt', 'r') as f:
    lines = f.readlines()
skip = 0
try:
    # Did we try earlier? If so, skip what has already been processed
    with open('lineno.txt', 'r') as lf:
        skip = int(lf.read())  # this should only be one number.
        del lines[:skip]  # Remove already processed lines from the list.
except (IOError, ValueError):
    pass
with open('lineno.txt', 'w+') as lf:
    for n, line in enumerate(lines):
        # Do your processing here.
        lf.seek(0)  # go to beginning of lf
        lf.write(str(n + skip + 1) + '\n')  # write the number of lines processed
        lf.flush()
        os.fsync(lf.fileno())  # flush and fsync make sure the lf file is written.
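The line-counter idea can be packaged so that the marker file is never left half-written if the program is stopped mid-update. The following is only a sketch under assumptions: the function names, the marker_path argument and the process callback are all hypothetical, and the write-to-temp-then-os.replace step is my own addition rather than something the answers use.

```python
import os

def load_checkpoint(marker_path):
    """Return the number of lines already processed, or 0 if there is no usable marker."""
    try:
        with open(marker_path) as fh:
            return int(fh.read())
    except (OSError, ValueError):
        return 0

def save_checkpoint(marker_path, count):
    """Write count to a temporary file, then rename it over the marker in one step."""
    tmp_path = marker_path + ".tmp"
    with open(tmp_path, "w") as fh:
        fh.write(str(count))
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, marker_path)

def process_resumable(list_path, marker_path, process):
    """Call process(item) for every line of list_path not handled by an earlier run."""
    done = load_checkpoint(marker_path)
    with open(list_path) as fh:
        for index, line in enumerate(fh):
            if index < done:
                continue  # handled in a previous run
            process(line.rstrip("\n"))
            save_checkpoint(marker_path, index + 1)
```

Saving a checkpoint after every line costs an fsync per item; if that is too slow for 1.5 million numbers, saving every few hundred lines trades a little repeated work after a restart for far fewer disk syncs.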

Changing contents of a file - Python

So I have a program which runs. This is part of the code:
FileName = 'Numberdata.dat'
NumberFile = open(FileName, 'r')
for Line in NumberFile:
    if Line == '4':
        print('1')
    else:
        print('9')
NumberFile.close()
A pretty pointless thing to do, yes, but I'm just doing it to enhance my understanding. However, this code doesn't work. The file remains as it is and the 4's are not replaced by 1's and everything else isn't replaced by 9's, they merely stay the same. Where am I going wrong?
Numberdata.dat is "444666444666444888111000444"
It is now:
FileName = 'Binarydata.dat'
BinaryFile = open(FileName, 'w')
for character in BinaryFile:
    if charcter == '0':
        NumberFile.write('')
    else:
        NumberFile.write('#')
BinaryFile.close()
You need to build up a string and write it to the file.
FileName = 'Numberdata.dat'
NumberFileHandle = open(FileName, 'r')
newFileString = ""
for Line in NumberFileHandle:
    for char in Line: # this will work for any number of lines.
        if char == '4':
            newFileString += "1"
        elif char == '\n':
            newFileString += char
        else:
            newFileString += "9"
NumberFileHandle.close()
NumberFileHandle = open(FileName, 'w')
NumberFileHandle.write(newFileString)
NumberFileHandle.close()
First, Line will never equal 4 because each line read from the file includes the newline character at the end. Try if Line.strip() == '4'. This will remove all white space from the beginning and end of the line.
Edit: I just saw your edit... naturally, if you have all your numbers on one line, the line will never equal 4. You probably want to read the file a character at a time, not a line at a time.
Second, you're not writing to any file, so naturally the file won't be getting changed. You will run into difficulty changing a file as you read it (since you have to figure out how to back up to the same place you just read from), so the usual practice is to read from one file and write to a different one.
Because you need to write to the file as well.
with open(FileName, 'w') as f:
f.write(...)
Right now you are just reading and manipulating the data, but you're not writing them back.
At the end you'll need to reopen your file in write mode and write to it.
If you're looking for references, take a look at the open() documentation and at the Reading and Writing Files section of the Python Tutorial.
Edit: You shouldn't read and write at the same time from the same file. You could either write to a temp file and at the end call shutil.move(), or load and manipulate your data and then re-open your original file in write mode and write them back.
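The temp-file route might look like the sketch below. The names rewrite_file and four_to_one are hypothetical, and creating the temporary file in the same directory as the target is my assumption so that the final move is a simple rename.

```python
import os
import shutil
import tempfile

def rewrite_file(path, transform):
    """Replace the contents of path with transform(old_contents)."""
    with open(path, "r") as src:
        new_text = transform(src.read())
    directory = os.path.dirname(os.path.abspath(path))
    # write the new contents to a temp file first, so the original stays
    # intact until everything has been written
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(new_text)
    shutil.move(tmp.name, path)

def four_to_one(text):
    """Map '4' to '1', keep newlines, and turn every other character into '9'."""
    return "".join("1" if c == "4" else c if c == "\n" else "9" for c in text)
```

With the file from the question, rewrite_file('Numberdata.dat', four_to_one) would apply the replacement character by character, so it also works when all the digits sit on one line.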
You are not sending any output to the file; you are simply printing 1 and 9 to stdout, which is usually the terminal or interpreter.
If you want to write to the file you have to use open again with w.
e.g.
out = open(FileName, 'w')
Then you can call out.write('1'), for example.
You can also use
print >>out, '1'
in Python 2, or print('1', file=out) in Python 3.
Also, it is a better idea to read the file first and write afterwards if you want to overwrite it.
According to your comment:
Numberdata is just a load of numbers all one line. Maybe that's where I'm going wrong? It is "444666444666444888111000444"
I can tell you that the for loop iterates over lines, not over characters. That is a logic error.
Moreover, you have to write the file, as Rik Poggi said (just remember to open it in write mode).
A few things:
The r flag to open indicates read-only mode. This obviously won't let you write to the file.
print() outputs things to the screen. What you really want to do is output to the file. Have you read the Python File I/O tutorial?
for line in file_handle: loops through files one line at a time. Thus, if line == '4' will only be true if the line consists of a single character, 4, all on its own.
If you want to loop over characters in a string, then do something like for character in line:.
Modifying bits of a file "in place" is a bit harder than you think.
This is because if you insert data into the middle of a file, the rest of the data has to shuffle over to make room - this is really slow because everything after your insertion has to be rewritten.
In theory, a one-byte for one-byte replacement can be done fast, but in general people don't want to replace byte-for-byte, so this is an advanced feature. (See seek().) The usual approach is to just write out a whole new file.
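For the special case in this question, where every character is replaced by exactly one other character, a same-length overwrite with seek() could be sketched as below. The function name overwrite_same_length is hypothetical, and the sketch assumes the file holds single-byte characters such as ASCII digits.

```python
def overwrite_same_length(path):
    """Overwrite path in place: '4' becomes '1', newlines stay, all else becomes '9'."""
    # build a 256-entry translation table, one output byte per possible input byte
    table = bytearray(b"9" * 256)
    table[ord("4")] = ord("1")
    table[ord("\n")] = ord("\n")
    with open(path, "r+b") as fh:   # r+b: read and write without truncating
        data = fh.read()
        fh.seek(0)                  # go back to the start before overwriting
        fh.write(data.translate(bytes(table)))
```

This only works because the output has exactly the same length as the input; as soon as a replacement changes the length, writing out a whole new file is the simpler option.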
Because print doesn't write to your file.
You have to open the file and read it, modify the string you obtain creating a new string, open again the file and write it again.
FileName = 'Numberdata.dat'
NumberFile = open(FileName, 'r')
data = NumberFile.read()
NumberFile.close()
dl = data.split('\n')
for i in range(len(dl)):
    if dl[i] =='4':
        dl[i] = '1'
    else:
        dl[i] = '9'
NumberFile = open(FileName, 'w')
NumberFile.write('\n'.join(dl))
NumberFile.close()
Try in this way. There are for sure different methods but this seems to be the most "linear" to me =)
