I have a text file in this format:
abc? cdfde" nhj.cde' dfwe-df$sde.....
How can I ignore all the special characters, blanks, numbers, line endings, etc., and write only the letters to another file? For example, the above file becomes
abccdfdenhjcdedfwedfsde.....
And from this output file,
I should be able to read it one character at a time until the end of the file.
I should be able to read two characters at a time, like ab, bc, cc, cd, df, ... from the above file.
I should be able to read three characters at a time, like abc, bcc, ccd, cdf, ... from the above file.
First of all, how can I read only the letters and write them to an external file?
I can read one character at a time using f.read(1) until the end of the file. How can I apply this to read 2 or 3 characters at a time while advancing by only one character (that is, given abcd I should read ab, bc, cd, and not ab, cd, which is what f.read(2) would give)? Thanks. I am doing this for cryptanalysis work, to analyze ciphertexts by frequency.
If you need to peek ahead (read a few extra characters at a time), you need a buffered file object. The following class does just that:
import io

class AlphaPeekReader(io.BufferedReader):
    def readalpha(self, count):
        "Read one alphabetic character, and peek ahead (count - 1) *extra* characters"
        val = [self.read1(1)]
        # Find the first alphabetic character
        while not val[0].isalpha():
            if val == [b'']:
                return b''  # EOF
            val = [self.read1(1)]
        require = count - len(val)
        if not require:
            return val[0]
        peek = self.peek(require * 3)  # Account for a lot of garbage
        if peek == b'':  # EOF
            return val[0]
        for c in peek:
            c = bytes([c])  # iterating over bytes yields integers on Python 3
            if c.isalpha():
                val.append(c)
                require -= 1
                if not require:
                    break
        # There is a chance that peek did not hold 'require' more alpha chars.
        # Return what we have anyway.
        return b''.join(val)

This attempts to find extra characters beyond the one character you are reading, but it doesn't guarantee it can satisfy your requirements: it can return fewer characters at the end of the file, or when there is a lot of non-alphabetic text in the next block.

Usage:

with AlphaPeekReader(io.open(filename, 'rb')) as alphafile:
    alphafile.readalpha(3)

Demo, using a file with your example input:

>>> f = io.open('/tmp/test.txt', 'rb')
>>> alphafile = AlphaPeekReader(f)
>>> alphafile.readalpha(3)
b'abc'
>>> alphafile.readalpha(3)
b'bcc'
>>> alphafile.readalpha(3)
b'ccd'
>>> alphafile.readalpha(10)
b'cdfdenhjcd'
>>> alphafile.readalpha(10)
b'dfdenhjcde'
To use the readalpha() calls in a loop, where you get each character separately plus the next 2 characters, use iter() with a sentinel:

for alpha_with_extra in iter(lambda: alphafile.readalpha(3), b''):
    # Do something with alpha_with_extra
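Putting the pieces together, here is a minimal sketch of the whole task (filter to letters only, then emit overlapping n-grams) written as a small generator. The name alpha_ngrams and the inline sample are mine, not part of the answer above; StringIO stands in for an open text file:

```python
import io

def alpha_ngrams(f, n):
    """Yield overlapping n-grams of the alphabetic characters in file-like f."""
    window = ''
    for ch in iter(lambda: f.read(1), ''):
        if not ch.isalpha():
            continue  # skip punctuation, digits, blanks, newlines
        window += ch
        if len(window) == n:
            yield window
            window = window[1:]  # slide forward by one character

sample = io.StringIO("abc? cdfde\" nhj.cde' dfwe-df$sde")
print(list(alpha_ngrams(sample, 3))[:4])  # -> ['abc', 'bcc', 'ccd', 'cdf']
```

Unlike the peeking reader, this variant advances through the file exactly once and keeps only an n-character window in memory.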
To read the file line by line and keep only the letters:

import fileinput

text_file = open("Output.txt", "w")
for line in fileinput.input("sample.txt"):
    outstring = ''.join(ch for ch in line if ch.isalpha())
    text_file.write("%s" % outstring)
text_file.close()
I'm trying to set a variable to the last character of a file. I am using Python, and I'm fairly new to it. If it is of any importance, my code appends a random number between 2 and 9 to the end of an HTML file. In a separate function, I want to set the last character of the HTML file (the last character being the random number between 2 and 9) to a variable, then delete the last character (so as not to affect the function of the HTML). Does anyone know how I could do this? I can attach my code below if needed, but I chose not to as it is 50 lines long and all 50 lines are needed for full context.
Try this. The file "a.txt" contains the numbers 1, 3, 4, 5. The code below reads the file and pulls out its last character:

file = open('a.txt', 'r')
lines = file.read()
print(lines[-1])
=> 5
Using @Jab's answer from the comment above, together with some assumptions, we can produce a more efficient solution for finding the last character and replacing it.
The assumptions that are made are common and most likely will be valid:
You will know whether there is a newline character at the very end of the file, or whether the random number is truly the last character in the file (meaning accounting for whitespace).
You know the encoding of the file. This is valid since almost all HTML is utf-8, (can be utf-16), and since you are the one editing it, you will know. Most times the encoding won't even matter.
So, this is what we can do:
with open("test.txt", "rb+") as f:  # note: binary mode cannot take an encoding argument
    f.seek(-2, 2)    # -1 or -2, may change depending on whitespace characters at end of the file
    var = f.read(1)  # read one byte for the number
    f.seek(-1, 1)    # step back over the byte just read
    print("last character:", str(var, 'utf-8'))
    f.write(bytes('variable', 'utf-8'))  # set whatever info here
    f.write(bytes('\n', 'utf-8'))        # you may want a newline character at the end of the file
    f.truncate()
This is efficient because we don't have to iterate through the entire file: we seek straight to the end and touch only the last character, once to read and once to write.
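As a self-contained illustration of the same seek-from-the-end idea (the file name and contents here are made up for the demo):

```python
import tempfile

# Create a throwaway "HTML" file that ends in a digit plus a newline.
tmp = tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8")
tmp.write("<p>hello</p>7\n")
tmp.close()

with open(tmp.name, "rb+") as f:
    f.seek(-2, 2)    # position on the digit, skipping the trailing newline
    var = f.read(1)  # the last meaningful character: b"7"
    f.seek(-1, 1)    # step back over the byte we just read
    f.truncate()     # drop the digit and the old newline
    f.write(b"\n")   # restore a trailing newline

print(var.decode())  # -> 7
```

After this runs, the file contains only `<p>hello</p>` followed by a newline, and the removed digit is available in `var`.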
You can do something like this:

# Open the file to read and the file to write
with open('file.txt') as f_in, open('new_file.txt', 'w') as f_out:
    # Read all the lines to memory (you can't find the last line lazily)
    lines = f_in.readlines()
    # Iterate over every line
    for i, line in enumerate(lines):
        # If the current index is the last index (i.e. the last line)
        if i == len(lines) - 1:
            # Get the last character
            last_char = line[-1]
            # Write to the output file the line without the last character
            print(line[:-1], file=f_out, end='')
        else:
            # Write to the output file the line as it is
            print(line, file=f_out, end='')

# Print the removed char
print(last_char)
If you don't want to create a new file, you can load all the file to memory as we're currently doing:

# Read all the lines into memory
with open('file.txt') as f:
    lines = f.readlines()

# Replace the lines inside the list using the previous logic
for i, line in enumerate(lines):
    if i == len(lines) - 1:
        last_char = line[-1]
        lines[i] = line[:-1]

# Write the changed lines to the same file
with open('file.txt', 'w+') as f:
    print(''.join(lines), file=f, end='')

# Print the removed char
print(last_char)
I have a rather large text document and would like to replace all instances of hexadecimals inside it with regular decimals, or if possible convert them into text surrounded by quotes, e.g. 'I01A' instead of $49303141.
The hexadecimals are currently marked by a leading $, but I can Ctrl+F change that into 0x if that helps, and I need the program to detect the end of the number, since some are short, like $A, while others are long, like $568B1F.
How could I do this with Python, or is it not possible?
Thank you for the help thus far, hoping to clarify my request a bit more to hopefully get a complete solution.
I used a version of Grismar's answer and the output it gives me is
"if not (GetItemTypeId(GetSoldItem())==I0KB) then
set int1= 2+($3E8*3)"
However, I would like to add the ' around the newly created text and convert hex strings shorter than 8 digits to decimals instead, so the output becomes
"if not (GetItemTypeId(GetSoldItem())=='I0KB') then
set int1= 2+(1000*3)"
Hoping for some more help to get the rest of the way.
def hex2dec(s):
    return int(s, 16)

was my attempt to convert the shorter hexadecimals to decimal, but it clearly has not worked; it throws syntax errors instead.
Also, I will manually deal with the few $ not used to denote a hexadecimal.
import re

# just creating an example file
with open(r'D:\Deprotect\wc3\mpq editor\Work\new 4.txt', 'w') as f:
    f.write('if not (GetItemTypeId(GetSoldItem())==$49304B42) then\n')
    f.write('set int1= 2+($3E8*3)\n')

def hex_match_to_string(m):
    return ''.join([chr(int(m.group(1)[i:i+2], 16)) for i in range(0, len(m.group(1)), 2)])

def hex2dec(s):
    return int(s, 16)

# open the file for reading
with open(r'D:\Deprotect\wc3\mpq editor\Work\new 4.txt', 'r') as file_in:
    # open the same file again for reading and writing
    with open(r'D:\Deprotect\wc3\mpq editor\Work\new 4.txt', 'r+') as file_out:
        # start writing at the start of the existing file, overwriting the contents
        file_out.seek(0)
        while True:
            line = file_in.readline()
            if line == '':
                # end of file
                break
            # replace the parts of the string matching the regex
            line = re.sub(r'\$((?:\w\w\w\w\w\w\w\w)+)', hex_match_to_string, line)
            # line = re.sub(r'\$\w+', hex2dec, line)  # broken: re.sub passes a match object, not a string
            file_out.write(line)
        # the resulting file is shorter, truncate it from the current position
        file_out.truncate()
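For the clarified request (quote the decoded text, and turn shorter hex runs into decimals), one hedged sketch is a single re.sub callback that picks the rule based on the length of the match. The 8-digit threshold and the function name convert are my assumptions drawn from your examples:

```python
import re

def convert(m):
    digits = m.group(1)
    if len(digits) >= 8 and len(digits) % 2 == 0:
        # long runs decode to text, wrapped in single quotes
        text = ''.join(chr(int(digits[i:i + 2], 16)) for i in range(0, len(digits), 2))
        return "'%s'" % text
    return str(int(digits, 16))  # short runs become plain decimals

line = "if not (GetItemTypeId(GetSoldItem())==$49304B42) then set int1= 2+($3E8*3)"
print(re.sub(r'\$([0-9A-Fa-f]+)', convert, line))
# -> if not (GetItemTypeId(GetSoldItem())=='I0KB') then set int1= 2+(1000*3)
```

Note that the character class `[0-9A-Fa-f]` is stricter than `\w`, so stray $ signs followed by non-hex text are left alone.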
See the answer https://stackoverflow.com/a/12597709/1780027 for how to use re.sub to replace specific content of a string with the output of a function. Using this, you could presumably use the int("FFFF", 16) snippet you're talking about to perform the conversion you want.
EG:
>>> import re
>>> def replace(match):
...     return str(int(match.group(1), 16))
...
>>> sample = "here's a hex $49303141 and there's a nother 1034B and another $8FD0B"
>>> re.sub(r'\$([a-fA-F0-9]+)', replace, sample)
"here's a hex 1227895105 and there's a nother 1034B and another 589067"
Since you are replacing parts of the file with something shorter, you can write to the same file you're reading from. But keep in mind that if the replacement were longer, you would need to write the result to a new file and replace the old file with the new file once you were done.
Also, from your description, it appears you are reading a text file, which makes reading the file line by line the easiest, but if your file was some sort of binary file, using re wouldn't be as convenient and you'd probably need a different solution.
Finally, your question doesn't mention whether $ might also appear elsewhere in the text file (not just in front of pairs of characters that should be read as hexadecimal numbers). This answer assumes $ only appears in front of strings of 2-character hexadecimal numbers.
Here's a solution:
import re
# just creating an example file
with open('test.txt', 'w') as f:
    f.write('example line $49303141\n')
    f.write('$49303141 example line, with more $49303141\n')
    f.write('\n')
    f.write('just some text\n')

def hex_match_to_string(m):
    return ''.join([chr(int(m.group(1)[i:i+2], 16)) for i in range(0, len(m.group(1)), 2)])

# open the file for reading
with open('test.txt', 'r') as file_in:
    # open the same file again for reading and writing
    with open('test.txt', 'r+') as file_out:
        # start writing at the start of the existing file, overwriting the contents
        file_out.seek(0)
        while True:
            line = file_in.readline()
            if line == '':
                # end of file
                break
            # replace the parts of the string matching the regex
            line = re.sub(r'\$((?:\w\w)+)', hex_match_to_string, line)
            file_out.write(line)
        # the resulting file is shorter, truncate it from the current position
        file_out.truncate()
The regex is simple, r'\$((?:\w\w)+)', which matches any string starting with an actual $ (the backslash avoids it being interpreted as 'the end of the string') followed by 1 or more (+) pairs of letters and numbers (\w\w).
The function hex_match_to_string(m) expects a regex match object and loops over pairs of characters in the first matched group. Each pair is turned into its decimal value by interpreting it as a hexadecimal string (int(pair, 16)) and that decimal value is then turned into a character with that ASCII value (chr(value)). All the resulting characters are joined into a single string (''.join(list)).
A different way of writing hex_match_to_string(m):

def hex_match_to_string(m):
    hex_nums = iter(m.group(1))
    return ''.join([chr(int(a, 16) * 16 + int(b, 16)) for a, b in zip(hex_nums, hex_nums)])
This may perform a bit better, since it avoids manipulating strings, but it does the same thing.
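A quick way to convince yourself the two variants agree (re.match is used here only to fabricate a match object for the check; v1 and v2 are my names for the two versions):

```python
import re

def v1(m):
    return ''.join([chr(int(m.group(1)[i:i + 2], 16)) for i in range(0, len(m.group(1)), 2)])

def v2(m):
    hex_nums = iter(m.group(1))
    return ''.join([chr(int(a, 16) * 16 + int(b, 16)) for a, b in zip(hex_nums, hex_nums)])

m = re.match(r'\$((?:\w\w)+)', '$49303141')
print(v1(m), v2(m))  # -> I01A I01A
```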
I am currently working on an application which requires reading all the input from a file until a certain character is encountered.
By using the code:
file = open("Questions.txt", 'r')
c = file.readlines()
c = [x.strip() for x in c]

strip() removes the trailing \n from each line, so every line becomes one element of the list c.
But instead I want to build the list up to the point where a special character is encountered, like this:
if the input file has the contents:
1.Hai
2.Bye\-1
3.Hello
4.OAPd\-1
then I want to get a list as
c = ['1.Hai\n2.Bye', '3.Hello\n4.OAPd']
Please help me in doing this.
The easiest way would be to read the file in as a single string and then split it across your separator:
with open('myFileName') as myFile:
    text = myFile.read()
result = text.split(separator)  # use your \-1 (whatever that means) here
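For instance, with the sample data from the question (assuming the separator is the literal two characters \-1; the strip is just my way of dropping the leftover line breaks around each piece):

```python
text = '1.Hai\n2.Bye\\-1\n3.Hello\n4.OAPd\\-1'
parts = [p.strip('\n') for p in text.split('\\-1') if p.strip('\n')]
print(parts)  # -> ['1.Hai\n2.Bye', '3.Hello\n4.OAPd']
```

strip('\n') only removes newlines at the ends of each piece, so the internal \n between lines 1 and 2 survives, matching the list the question asks for.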
In case your file is very large, holding the complete contents in memory as a single string for .split() may not be desirable (and holding the complete contents in the list after the split is probably also not desirable). Then you could read it in chunks:
CHUNK_SIZE = 4096  # I propose 4096 or so

def each_chunk(stream, separator):
    buffer = ''
    while True:  # until EOF
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:  # EOF?
            yield buffer
            break
        buffer += chunk
        while True:  # until no separator is found
            try:
                part, buffer = buffer.split(separator, 1)
            except ValueError:
                break
            else:
                yield part

with open('myFileName') as myFile:
    for chunk in each_chunk(myFile, separator='\\-1\n'):
        print(chunk)  # not holding in memory, but printing chunk by chunk
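To try the generator without a file on disk, here is a compact self-contained variant of the same idea; StringIO stands in for the open file, and the chunk size is shrunk to force several reads across separator boundaries:

```python
import io

def each_chunk(stream, separator, chunk_size=8):
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: emit whatever is left over
            yield buffer
            break
        buffer += chunk
        while separator in buffer:
            part, buffer = buffer.split(separator, 1)
            yield part

data = io.StringIO('1.Hai\n2.Bye\\-1\n3.Hello\n4.OAPd\\-1\n')
print(list(each_chunk(data, '\\-1\n')))  # -> ['1.Hai\n2.Bye', '3.Hello\n4.OAPd', '']
```

Because the buffer accumulates across reads, a separator that straddles a chunk boundary is still found once the next chunk arrives.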
I used "*" instead of "\-1"; I'll let you make the appropriate changes.

s = '1.Hai\n2.Bye*3.Hello\n4.OAPd*'
temp = ''
results = []
for char in s:
    if char == '*':
        results.append(temp)
        temp = ''
    else:
        temp += char
if len(temp) > 0:
    results.append(temp)
I want to read a list of numbers from a file one char at a time, to check whether each char is a digit, a period, a + or -, an e or E, or some other character, and then perform whatever operation I want based on that. How can I do this using the code I already have? Below is an example that I tried, but it didn't work. I am new to Python. Thanks in advance!
import sys

def is_float(n):
    state = 0
    src = ""
    ch = n
    if state == 0:
        if ch.isdigit():
            src += ch
            state = 1
    ...

f = open("file.data", 'r')
for n in f:
    sys.stdout.write("%12.8e\n" % is_float(n))
Here is a technique to make a one-character-at-a-time file iterator:
from functools import partial

with open("file.data") as f:
    for char in iter(partial(f.read, 1), ''):
        # now do something interesting with the characters
        ...
The with-statement opens the file and unconditionally closes it when you're finished.
The usual way to read one character is f.read(1).
The partial creates a function of zero arguments by always calling f.read with an argument of 1.
The two-argument form of iter() creates an iterator that loops until you see the empty-string end-of-file marker.
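The same pattern works with any file-like object, so it can be tried without a real file (StringIO stands in for the open file; the sample text is mine):

```python
import io
from functools import partial

buf = io.StringIO("1.5e3")
chars = list(iter(partial(buf.read, 1), ''))
print(chars)  # -> ['1', '.', '5', 'e', '3']
```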
In fact it's much easier. There is a nice utility in itertools that's often neglected. ;-)

import itertools

for character in itertools.chain.from_iterable(open('file.data')):
    process(character)
for x in open() iterates over the lines of a file. Instead, read the entire file in as a block of text, then go through each character of the text:
import sys

def is_float(n):
    state = 0
    src = ""
    ch = n
    if state == 0:
        if ch.isdigit():
            src += ch
            state = 1
    ...

data = open("file.data", 'r').read()
for n in data:  # characters
    sys.stdout.write("%12.8e\n" % is_float(n))
I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This is the expected behavior, since most applications, including print and echo, append a newline to every line they write; it has no effect on lines missing a trailing newline character.
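The three whence modes and the backward scan can be sketched against an in-memory buffer (io.BytesIO behaves like a file opened in binary mode; the sample bytes are mine):

```python
import io
import os

f = io.BytesIO(b"hello\nworld\n")     # 12 bytes
assert f.seek(0, os.SEEK_SET) == 0    # from the beginning
assert f.seek(0, os.SEEK_END) == 12   # from the end: returns the file size
f.seek(-2, os.SEEK_END)               # two bytes before the end
while f.read(1) != b"\n":             # scan backwards for the last EOL
    f.seek(-2, os.SEEK_CUR)
print(f.read())  # -> b'world\n'
```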
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing:
with open(file, "rb") as f:
    first = f.readline()   # Read and store the first line.
    for last in f: pass    # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/usr/bin/env python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o - bs - step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:     # - Until reaching 'sep': read a sep-sized block
            o = f.seek(o - step)     #   and then seek to the block to read next.
    except (OSError, ValueError):    # - Beginning of file reached.
        f.seek(0)
    return f.read()
def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = {'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00'}
    assert "\n".encode('utf8')      == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ),  # Empty file, empty string.
        ("X"   , "X" ),  # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match
if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline(), end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line

You don't need to bother with the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value of the position shift (say, 1 MB). This will help you estimate the value for the full run.
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2

print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
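If calling out to the shell is acceptable, the idea can be sketched with subprocess (this assumes a Unix-like system where head and tail exist; the temp file and its contents are made up for the demo):

```python
import subprocess
import tempfile

tmp = tempfile.NamedTemporaryFile("w", delete=False)
tmp.write("first\nmiddle\nlast\n")
tmp.close()

first = subprocess.run(["head", "-1", tmp.name], capture_output=True, text=True).stdout
last = subprocess.run(["tail", "-n", "1", tmp.name], capture_output=True, text=True).stdout
print(first.rstrip(), last.rstrip())  # -> first last
```

tail seeks from the end of the file rather than reading it through, so this stays fast even on multi-gigabyte files, at the cost of one process spawn per call.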
This is my solution, compatible with Python 3 as well. It also manages border cases, but it misses UTF-16 support:
import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)
                    if char == b"\n":
                        break

        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.

a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()       # Read the first line.
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            try:
                f.seek(-2, 1)      # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()        # Read last line.
    return last
Nobody mentioned using reversed:
f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, use os.lseek to back up some amount from SEEK_END, find the second-to-last line ending, and then readline() the last line.
def last_line(filename):
    # Needs to be in binary mode for the seek from the end to work
    with open(filename, "rb") as f:
        first = f.readline()
        if f.read(1) == b'':
            return first           # only one line in the file
        f.seek(-2, 2)              # Jump to the second last byte.
        while f.read(1) != b"\n":  # Until EOL is found...
            f.seek(-2, 1)          # ...jump back the read byte plus one more.
        return f.readline()        # Read last line.

This is a modified version of the answers above that handles the case where the file contains only one line.