I have a very large text file, where most of the lines are composed of ASCII characters, but a small fraction of lines have non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all the characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.
Edit: updated with code
#!/usr/bin/python

import string

def isAscii(s):
    for c in s:
        if ord(c) > 127 or ord(c) < 0:
            return False
    return True

f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')
linenumber = 1
for line in f:
    if isAscii(line):
        g.write(line)
    linenumber += 1
f.close()
g.close()
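For reference, the per-character loop can be avoided entirely in pure Python. A minimal sketch, assuming Python 3.7+ (where bytes.isascii() exists); reading in binary sidesteps any decoding issues:

# Sketch: filter ASCII-only lines without a per-character loop.
with open('data.tsv', 'rb') as f, open('data-ASCII-only.tsv', 'wb') as g:
    for line in f:
        if line.isascii():  # bytes.isascii() is available from Python 3.7
            g.write(line)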
You can use grep: -v inverts the match (keeping the lines that do not match), -P enables Perl-compatible regex syntax, and [\x80-\xFF] is the character range for non-ASCII bytes.
grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv
See the question "How do I grep for all non-ASCII characters in UNIX" for more about searching for non-ASCII characters with grep.
The following suggestion uses a command-line filter (i.e., you would use it on the shell command line). This example works in a shell on Linux or Unix systems, and probably OS X too (OS X is BSD-ish):
$ cat big_file | tr -dc '\000-\177' > big_file_ascii_only
It uses the tr (translate) filter. In this case, we are telling tr to delete all characters outside the range octal 000 to octal 177, i.e., outside 7-bit ASCII. Note that, unlike the grep approach above, this strips the offending characters but keeps the rest of each line, rather than dropping those lines entirely. You may wish to tweak the character set: check the man page for tr for ideas on other ways to specify the characters you want to keep (or delete).
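If you wanted the same effect from Python, bytes.translate can delete a whole byte range in one call. A rough sketch, reusing the file names from the example above:

# Sketch: delete all bytes outside the ASCII range, like tr -dc '\000-\177'.
# Like tr, this strips characters rather than dropping whole lines.
non_ascii = bytes(range(0x80, 0x100))
with open('big_file', 'rb') as src, open('big_file_ascii_only', 'wb') as dst:
    for chunk in iter(lambda: src.read(1 << 20), b''):
        dst.write(chunk.translate(None, non_ascii))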
The other approaches given will work if, and only if, the file is
encoded in such a way that "non-ASCII" is equivalent to "high bit
set", such as Latin-1 or UTF-8. Here's a program in Python 3 that will
work with any encoding.
#!/usr/bin/env python3

import codecs

in_fname = "utf16file"
in_encoding = "utf-16"
out_fname = "ascii_lines"
out_encoding = "ascii"

def is_ascii(s):
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

f_in = codecs.open(in_fname, "r", in_encoding)
f_out = codecs.open(out_fname, "w", out_encoding)
for s in f_in:
    if is_ascii(s):
        f_out.write(s)
f_in.close()
f_out.close()
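On current Python 3 the built-in open() accepts an encoding argument directly, so the codecs module is no longer needed; a minimal sketch of the same program:

# Sketch: the same filter using the built-in open() instead of codecs.
with open("utf16file", encoding="utf-16") as f_in, \
     open("ascii_lines", "w", encoding="ascii") as f_out:
    for s in f_in:
        if is_ascii(s):  # same helper as above; or s.isascii() on 3.7+
            f_out.write(s)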
Related
I am using a python script to create a shell script that I would ideally like to annotate with comments. If I want to add strings with hashtags in them to a code section like this:
with open(os.path.join("location", "filename")) as f:
    file = f.read()

file += """my_function() {{
    if [ $# -eq 0 ]
    then
        echo "Please supply an argument"
        return
    fi
    echo "argument is $1"
}}
"""

with open(os.path.join("location", "filename"), "w") as f:
    f.write(file)
What is the best way I can accomplish this?
You already have a # character in that string literal, in $#, so I'm not sure what the problem is.
Python considers a """ string literal one big string, newlines and comment-esque sequences included, until the closing """, as you've noticed.
To also pass escape sequences through raw (e.g. \n staying a literal \n rather than becoming a newline), you'd use r"""...""".
In other words, with
with open("x", "w") as f:
f.write("""x
hi # hello world
""")
you end up with a file containing
x
hi # hello world
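To see the raw-string difference mentioned above, a quick sketch:

print("""a\tb""")   # escape interpreted: prints a, a tab, then b
print(r"""a\tb""")  # raw string: prints the four characters a\tb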
In terms of your wider goal, writing a bash function to a file from a Python script seems a little wayward.
This is not really a reliable practice; if your use case specifically requires you to define a bash function via the script, please explain it further. A cleaner way to do this would be to define an .sh file and read its contents in from there:
# function.sh
my_function() {
    # Some code
}
Then in your script:
with open('function.sh', 'r') as function_fd:
    # Opened in 'append' mode so that content is automatically appended
    with open(os.path.join("location", "filename"), "a") as target_file:
        target_file.write(function_fd.read())
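For completeness: the doubled braces in your original snippet only make sense if the template is later passed through str.format(), which treats {{ and }} as escaped literal braces. A minimal sketch of why ({arg} is just an illustrative placeholder):

# {{ and }} become literal { and } when the string goes through .format().
template = """my_function() {{
    echo "argument is {arg}"
}}
"""
print(template.format(arg="$1"))  # braces come out single, {arg} becomes $1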
Problem
I'm running a Dataflow job with these steps: read a txt file from Cloud Storage using Dataflow/Beam's apache_beam.io.textio.ReadFromText(), which uses StrUtf8Coder (UTF-8) by default, and then load it into Postgres using StringIteratorIO with copy_from.
Data comes from the PCollection element by element; some elements look like this:
line = "some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
After that, I need to load it into Postgres (the delimiter here is "|"), but these kinds of elements are a problem, because Postgres tries to decode them and I get: invalid byte sequence for encoding "UTF8".
From F\226 we are getting this -> F\x96
This backslash is not visible, so I cannot just replace it like this:
line.replace("\\", "\\\\")
Using Python 3.8.
I have tried repr() and encode("unicode_escape").decode().
Also, every line has different elements, so the next one might contain, say, r\456.
I'm able to catch and change it with a regex only if I use a raw string literal, but I'm not sure how to treat a regular string as raw when it is already in a variable.
import re

line = r"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
updated = re.sub("([a-zA-Z])\\\\(\\d*)", "\\1\\\\\\\\\\2", line)
print(updated)
$ some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|
Goal
Add an extra backslash wherever a backslash is followed by some element, so the line needs to look like this:
line = "some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|"
Thanks for any help!
If you're able to read the file in binary or select the encoding, you could get a better starting point. This is how to do it in binary:
>>> line = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
>>> line.decode('cp1252')
'some information|more information S\\\\H, F–|DIST|local|app\\\\\\lock\\|'
This is how to decode the whole file:
f = open('file.txt', encoding='cp1252')
f.read()
The encoding CP-1252 is Microsoft's legacy encoding, a superset of Latin-1.
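Once the line is decoded with the right codec, the backslashes in it are ordinary characters, and the simple replace from the question does work. A sketch, assuming the target is PostgreSQL's COPY text format (where each literal backslash must be doubled):

# Decode first, then double every real backslash for COPY.
raw = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
line = raw.decode('cp1252')            # \x96 becomes the en dash '–'
escaped = line.replace("\\", "\\\\")   # safe now: every backslash is visible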
I'm trying to make a simple script which takes a list of names off the clipboard formatted as "Last, First", then pastes them back as "First Last". I'm using Python 3 and Pyperclip.
Here is the code:
import pyperclip

text = pyperclip.paste()
lines = text.split('\n')
for i in range(len(lines)):
    last, first = lines[i].split(', ')
    lines[i] = first + " " + last
text = '\n'.join(lines)
pyperclip.copy(text)
When I copy this to the clipboard:
Carter, Bob
Goodall, Jane
and then run the script, it produces Bob CarterJane Goodall, with the names just glued together and no new line. I'm not sure what's screwy.
Thanks for your help.
Apparently I need to use '\r\n' instead of just '\n'. I found that answer on the internet and it worked; presumably Windows applications generally expect CRLF line endings, so text placed on the clipboard with bare '\n' shows up with the lines run together when pasted.
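In other words, a sketch of the fixed script (splitlines() tolerates both kinds of line ending, and the join uses the CRLF that clipboard consumers evidently expect here):

import pyperclip

text = pyperclip.paste()
fixed = []
for entry in text.splitlines():        # handles '\n' and '\r\n' alike
    last, first = entry.split(', ')
    fixed.append(first + ' ' + last)
pyperclip.copy('\r\n'.join(fixed))     # CRLF line endings on the clipboard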
To include newlines in your file, you need to explicitly pass them to the file methods.
On Unix platforms, strings passed into .write should end with \n. Likewise, each of
the strings in the sequence passed into .writelines should end in \n. On
Windows, the newline string is \r\n.
To program in a cross platform manner, the linesep string found in the os module
defines the correct newline string for the platform:
>>> import os
>>> os.linesep # Unix platform
'\n'
Source: Illustrated Guide to Python 3
I converted a huge file that I wrote in Python 2.7.3, and now I want to upgrade to Python 3+ (I have 3.5).
What I have done so far:
installed the Python 3.5+ interpreter
updated the environment path to point at the Python 3+ folder
upgraded numpy and pandas
ran python 2to3.py -w viterbi.py to convert the file to version 3+
The section where I get the error:
import sys
import numpy as np
import pandas as pd

# Counting number of lines in the text file
lines = 0
buffer = bytearray(2048)
with open(inputFilePatheName) as f:
    while f.readinto(buffer) > 0:
        lines += buffer.count('\n')
My error is:
AttributeError: '_io.TextIOWrapper' object has no attribute 'readinto'
This is the first error, and I cannot proceed to see if there are any others. I don't know what the equivalent command for readinto is.
In 3.x, the readinto method is only available on binary I/O streams. Thus: with open(inputFilePatheName, 'rb') as f:.
Separately, buffer.count('\n') will not work any more, because Python 3.x handles text properly, as something distinct from a raw sequence of bytes. buffer, being a bytearray, stores bytes; it still has a .count method, but it has to be given either an integer (representing the numeric value of a byte to look for) or a "bytes-like object" (representing a subsequence of bytes to look for). So we also have to update that, as buffer.count(b'\n') (using a bytes literal).
Finally, we need to be aware that processing the file this way means we don't get universal newline translation by default any more (for example, a file using old Mac-style '\r' line endings would contain no b'\n' bytes at all, and would be counted as having no lines).
Open the file as binary.
As long as you can guarantee it's UTF-8 or a single-byte code-page encoding, every b'\n' byte will necessarily be a newline:
with open(inputFilePatheName, "rb") as f:
    n = f.readinto(buffer)
    while n > 0:
        lines += buffer.count(b'\n', 0, n)  # count only within the bytes just read
        n = f.readinto(buffer)
That way you also save the time of decoding the file, and use your buffer in the most efficient way possible.
A better approach to what you're trying to achieve is to use memory-mapped files.
In case of Windows:
import mmap
import os

file_handle = os.open(r"yourpath", os.O_RDONLY | os.O_BINARY | os.O_SEQUENTIAL)
try:
    with mmap.mmap(file_handle, 0, access=mmap.ACCESS_READ) as f:
        pos = -1
        total = 0
        while (pos := f.find(b"\n", pos + 1)) != -1:
            total += 1
finally:
    os.close(file_handle)
Again, make sure the text is not encoded as UTF-16 (a common Windows default); in UTF-16, a b'\n' byte is not necessarily a line ending.
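On Linux or macOS the flags are simpler (there is no os.O_BINARY there); a sketch of the same count using the regular open():

import mmap

# POSIX sketch: count newlines through a read-only memory map.
with open(r"yourpath", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total = 0
        pos = -1
        while (pos := mm.find(b"\n", pos + 1)) != -1:
            total += 1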
I'm working in an IPython notebook on OS X. My source code consists entirely of ASCII characters, or so I thought, but the compiler reports that I'm using a non-ASCII character. The source code looks like:
%%file Sierpinski_triangle.py
from turtle import *

reset()
tracer(False)

s = 'A+B+A−B−A−B−A+B+A'
l_angle = 60
r_angle = 60

for c in s:
    if c == 'A' or c == 'B':
        forward(10)
    elif c == '+':
        left(l_angle)
        #l_angle = l_angle * -1
    elif c == '-':
        right(r_angle)
        #r_angle = r_angle * -1

done()
File "Sierpinski_triangle.py", line 7
SyntaxError: Non-ASCII character '\xe2' in file Sierpinski_triangle.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Thank you in advance!
Your minuses are wrong.
Try:
s = 'A+B+A-B-A-B-A+B+A'
[update]
Somehow SO uses a font that makes the minuses look the same. They are not. Try copying my version and your version into a text editor and look at the differences.
When I run your code I get the same error as you. When I replace the minuses in your code with ASCII minuses, the code runs fine.
What text editor are you using to create this Python code? Does it have some sort of autocompletion (MS Word?)? If so, use a real text editor, or IDLE, to avoid these problems.
To prove they are different, try printing the strings as hex (copy-paste into a new .py file):
# -*- coding: utf-8 -*-
your_s = 'A+B+A−B−A−B−A+B+A'
my_s = 'A+B+A-B-A-B-A+B+A'
print(":".join("{:02x}".format(ord(c)) for c in your_s))
print(":".join("{:02x}".format(ord(c)) for c in my_s))
gives you:
>>41:2b:42:2b:41:e2:88:92:42:e2:88:92:41:e2:88:92:42:e2:88:92:41:2b:42:2b:41
>>41:2b:42:2b:41:2d:42:2d:41:2d:42:2d:41:2b:42:2b:41
This is caused by standard characters, like the ASCII hyphen-minus (-), being silently replaced with non-standard look-alikes, like the Unicode minus sign (−), during copying; the same thing happens with the apostrophe (') and typographic quotes (' '). It happens quite often when you copy text from a PDF file. The difference is visually subtle, but it is a huge difference as far as Python is concerned: the plain ASCII character is completely legal in source code, while the look-alike is not.
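A quick way to spot such look-alikes in a suspect line is to dump every non-ASCII character with its code point; a small sketch (Python 3):

# Print position, character, and code point of anything non-ASCII.
s = 'A+B+A−B−A−B−A+B+A'
for i, c in enumerate(s):
    if ord(c) > 127:
        print(i, repr(c), hex(ord(c)))  # e.g. 5 '−' 0x2212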