why does python write to a file in gibberish characters - python

I attempted Problem 10 at Project Euler and passed, but then I decided to try writing all the prime numbers below 2 million to a text (.txt) file. I made some small adjustments to the main function that solved the problem: instead of adding each prime to a variable (tot), I wrote each prime number produced by a generator to a text file. At first it worked, but I forgot to add a space after each prime number, so the output was sort of gibberish:
357111317192329313741434753
So I changed my txt.write(str(next_prime)) to txt.write(str(next_prime) + ' ').
After that slight modification, the output was complete gibberish:
″‵‷ㄱㄠ″㜱ㄠ‹㌲㈠‹ㄳ㌠‷ㄴ㐠″
here's my complete code for the function:
def solve_number_10():
    total = 2
    txt = open("output.txt","w")
    for next_prime in get_primes(3):
        if next_prime < 2000000:
            txt.write(str(next_prime) + ' ')
            #total += next_prime
        else:
            print "Data written to txt file"
            #print total
            txt.close()
            return
Why does this happen and how could I make the output like
3 5 7 11 13 17 19

This is a bug in Microsoft's Notepad program, not in your code.
>>> a = '‵‷ㄱㄠ″㜱ㄠ‹㌲㈠‹ㄳ㌠‷ㄴ㐠'
>>> a.decode('UTF-8').encode('UTF-16LE')
'5 7 11 13 17 19 23 29 31 37 41 4'
Oh hey, look, they're prime numbers (I assume 4 is just a truncated 43).
You can work around the bug in Notepad in one of two ways.
Either use a different file viewer that doesn't have the bug,
or write a ZWNBSP (U+FEFF), once, to the beginning of the file, encoded in UTF-8:
txt.write(u'\uFEFF'.encode('UTF-8'))
This is incorrectly called a BOM. It would be a BOM in UTF-16, but UTF-8 is not technically supposed to have a BOM. Most programs ignore it, and in other programs it will be harmless.
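As a rough illustration of both points (a sketch in Python 3, not the code from the question): decoding space-separated ASCII digits as UTF-16LE reproduces the gibberish, and the built-in utf-8-sig codec writes the U+FEFF marker for you. The helper name write_numbers and the temporary file path are made up for this example.

```python
import os
import tempfile

# Misreading ASCII bytes as UTF-16LE fuses each pair of bytes into one
# code unit, which is the kind of wrong guess described above.
ascii_bytes = b"3 5 7 11 13 "
misread = ascii_bytes.decode("utf-16-le")
# Compare against escapes so the check does not depend on console encoding.
print(misread == "\u2033\u2035\u2037\u3131\u3120\u2033")

def write_numbers(path, numbers):
    """Hypothetical helper: write numbers separated by spaces,
    preceded by a UTF-8 signature."""
    # "utf-8-sig" emits the bytes EF BB BF (U+FEFF) once at the start.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(" ".join(str(n) for n in numbers))

path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_numbers(path, [3, 5, 7, 11, 13, 17, 19])
with open(path, "rb") as f:
    data = f.read()
print(data[:3] == b"\xef\xbb\xbf")
```

Letting the codec add the signature avoids writing it by hand and keeps it from being written more than once.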

Try this:
txt.write('%i ' % next_prime)
Looks like str() is converting your number to a character that matches it in some encoding, and not to its string representation.

Related

How can I add a space after every two characters in .txt file? [duplicate]

The input I have is a long string of characters in a .TXT file (over 18,000 characters) and I need to add a space after every two characters. How can I write code that writes the output to a .TXT file again?
Like so;
Input:
123456789
Output:
12 34 56 78 9
The enumerate() function does the heavy lifting here: iterate over the string of characters and use modulo to add a space after every two characters. A worked example is below.
string_of_chars = "123456789101213141516171819"
spaced_chars = ""
for i, c in enumerate(string_of_chars):
    if i % 2 == 1:
        spaced_chars += c + " "
    else:
        spaced_chars += c
print(spaced_chars)
This will produce 12 34 56 78 91 01 21 31 41 51 61 71 81 9 (and 12 34 56 78 9 for the question's input, 123456789).
t = '123456789'
' '.join(t[i:i+2] for i in range(0, len(t), 2))
If the file is very large, you won't want to read it all into memory; instead, read a block of characters at a time, write the result to an output handle, and loop.
To include read/write (this version works line by line, so each line's trailing newline is stripped before pairing and written back afterwards):
with open('./input.txt') as read_handle, open('./output.txt', 'w') as write_handle:
    for line in read_handle:
        line = line.rstrip('\n')
        write_handle.write(' '.join(line[i:i+2] for i in range(0, len(line), 2)) + '\n')
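For input that is one long line, the block-by-block idea can be sketched as follows. This is only an illustration: the function name space_pairs and the default block size are assumptions, and it relies on each read() returning a full block until the end of the input, which holds for regular text files and io.StringIO.

```python
import io

def space_pairs(src, dst, block_size=4096):
    """Copy text from src to dst, inserting a space after every two characters."""
    if block_size % 2:
        # An even block size keeps pairs from straddling two blocks.
        raise ValueError("block_size must be even")
    first = True
    while True:
        block = src.read(block_size)
        if not block:
            break
        if not first:
            dst.write(" ")  # separator between consecutive blocks
        dst.write(" ".join(block[i:i + 2] for i in range(0, len(block), 2)))
        first = False

src = io.StringIO("123456789")
dst = io.StringIO()
space_pairs(src, dst, block_size=4)
print(dst.getvalue())
```

With real files, src and dst would be the handles returned by open() in text mode.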
Try the following:
txt = '123456789'
print(*[txt[x:x+2] for x in range(0, len(txt), 2)])
Output:
12 34 56 78 9

How to print text file content with line breaks in python?

The content of my text file is:
5 7 6 6 15
4 3
When I do
fs.open('path',mode='rb').read()
I get
b'5 7 6 6 15\r\n4 3'
But I want to compare it to the string output
5 7 6 6 15
4 3
I want to do the comparison like this:
if fs.open('path',mode='rb').read() == output:
    print("yes")
How should I convert it so that line breaks, spaces, and everything else are preserved?
PS: output is just the string that I am getting through json.
Using Python 3, fs.open('path',mode='rb').read() yields a bytes object, moreover containing a carriage return (windows text file)
(and using Python 2 doesn't help, because of this extra \r which isn't removed because of binary mode)
You're comparing a bytes object with a str object: that is always false.
Moreover, it's unclear whether the output string has a line terminator on its last line. I would open the file in text mode and strip blanks/newlines at the end (the file doesn't seem to contain one, but better safe than sorry):
with open('path') as f:
    if f.read().rstrip() == output.rstrip():
        print("yes")
Change the read mode from rb to r: rb returns bytes, r returns text.
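The difference between the two modes can be checked with a small Python 3 sketch. The temporary file and the value of output are stand-ins for the asker's file and JSON string, not taken from the original post.

```python
import os
import tempfile

raw = b"5 7 6 6 15\r\n4 3"   # what read() returns in 'rb' mode
output = "5 7 6 6 15\n4 3"   # assumed string obtained from JSON

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(raw)

# A bytes object never compares equal to a str in Python 3.
bytes_equal = (raw == output)

# Text mode decodes the bytes and translates \r\n to \n while reading.
with open(path, "r") as f:
    text = f.read()
text_equal = (text.rstrip() == output.rstrip())

print(bytes_equal, text_equal)  # False True
```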

Data reading - csv

I have some data in a .dfx file and I am trying to read it as a CSV with pandas. But it has some special characters which are not read by pandas. They are separators as well. I attached one line from it.
The "DC4" is being removed when I print the file. The SI is read as a space, correctly. I tried some encodings (utf-8, latin1, etc.), but with no success.
I attached the printed first line as well. I marked the place where the characters should be.
My code is simple:
import pandas
file_log = pandas.read_csv("file_log.DFX", header=None)
print(file_log)
I hope I was clear and someone has an idea.
Thanks in advance!
EDIT:
The input. LINK: drive.google.com/open?id=0BxMDhep-LHOIVGcybmsya2JVM28
The expected output:
88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839
30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033
Examining example.DFX in hex (with xxd) shows that the two separators are 0x14 and 0x0f, respectively.
Read the csv with multiple separators using the python engine:
import pandas
sep1 = chr(0x14) # the one shown as DC4
sep2 = chr(0x0f) # the one shown as SI
file_log = pandas.read_csv('example.DFX', header=None, sep='{}|{}'.format(sep1, sep2), engine='python')
print(file_log)
And you get:
0 1 2 3 4 5 6 7
0 88.4373 0 12.07.2014/17:05:22 38.0366 38.5179 1.3448 31.9839 NaN
1 30.0070 0 12.07.2014/17:14:27 38.0084 38.5091 0.0056 0.0033 NaN
It seems it has an empty column at the end. But I'm sure you can handle that.
The encoding seems to be ASCII here. DC4 stands for "Device Control 4" and SI for "Shift In". These are non-printable control characters in an ASCII file. Thus you cannot see them when you issue a print(file_log), although your terminal might react to them in some way (just as \n produces a new line).
Try typing file_log in your interpreter to get the representation of that variable and check whether those special characters are included. Chances are that you'll see DC4 in the representation as '\x14', which is hexadecimal 14.
You may then further process these strings in your program by using string manipulation like replace.
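A minimal sketch of that kind of string handling, using only the standard library. The sample line is hypothetical: only the two separator bytes (0x14 and 0x0f) come from the answers above, and the field layout is assumed for illustration.

```python
import re

# Hypothetical line; the real layout of the .DFX file may differ.
line = "88.4373\x140\x0f12.07.2014/17:05:22\x1438.0366\x14"

# repr() makes the control characters visible as \x14 and \x0f.
print(repr(line))

# Split on either separator and drop the empty field left by the
# trailing separator.
fields = [f for f in re.split("[\x14\x0f]", line) if f]
print(fields)

# Alternatively, normalise both separators to commas with replace().
as_csv = line.replace("\x14", ",").replace("\x0f", ",").rstrip(",")
print(as_csv)
```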

Python - Parsing Conundrum

I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.
If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.
If I have correctly understood your requirement, you need a strip() on all lines and a split on whitespace on the even lines (counting lines from 1):
import re
fp = open("csv.txt", "r")
while True:
    line = fp.readline()
    if '' == line:
        break
    line = line.strip()
    fields = re.split(r"\s+", fp.readline().strip())
    print("\"%s\",%s,%s" % (line, fields[0], fields[1]))
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.
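If quotes or commas can occur in the names, the csv module can take care of the escaping. The following is a sketch under the assumption that the file strictly alternates between a name line and a line with two numbers. The function name pairs_to_csv is made up, and the one-row-per-entry layout follows this answer rather than the layout shown in the question.

```python
import csv
import io

def pairs_to_csv(src, dst):
    """Write one CSV row per (name line, numbers line) pair read from src."""
    writer = csv.writer(dst)
    lines = (line.strip() for line in src)
    # zip() over the same iterator twice yields consecutive pairs of lines.
    for name, numbers in zip(lines, lines):
        writer.writerow([name] + numbers.split())

src = io.StringIO("1&1 Internet\n97 606\n1-800-HOSTING\n0 8\n")
dst = io.StringIO(newline="")
pairs_to_csv(src, dst)
print(dst.getvalue())
```

With real files, open the output with newline="" as the csv module documentation recommends.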

Convert binary data to web-safe text and back - Python

I want to convert a binary file (such as a jpg, mp3, etc) to web-safe text and then back into binary data. I've researched a few modules and I think I'm really close but I keep getting data corruption.
After looking at the documentation for binascii I came up with this:
from binascii import *
raw_bytes = open('test.jpg','rb').read()
text = b2a_qp(raw_bytes,quotetabs=True,header=False)
bytesback = a2b_qp(text,header=False)
f = open('converted.jpg','wb')
f.write(bytesback)
f.close()
When I try to open the converted.jpg I get data corruption :-/
I also tried using b2a_base64 with 57-long blocks of binary data. I took each block, converted to a string, concatenated them all together, and then converted back in a2b_base64 and got corruption again.
Can anyone help? I'm not super knowledgeable on all the intricacies of bytes and file formats. I'm using Python on Windows if that makes a difference with the \r\n stuff
Your code looks quite complicated. Try this:
#!/usr/bin/env python
from binascii import a2b_base64, b2a_base64
raw_bytes = open('28.jpg','rb').read()
str_one = b2a_base64(raw_bytes) # 1
str_list = b2a_base64(raw_bytes).split(b"\n") #2
bytesBackAll = a2b_base64(b''.join(str_list)) #2
print(bytesBackAll == raw_bytes) #True #2
bytesBackAll = a2b_base64(str_one) #1
print(bytesBackAll == raw_bytes) #True #1
Lines tagged with #1 and #2 represent alternatives to each other. #1 seems most straightforward to me - just make it one string, process it and convert it back.
You should use base64 encoding instead of quoted printable. Use b2a_base64() and a2b_base64().
Quoted-printable output is much bigger for binary data like pictures. In this encoding each byte that is not a printable ASCII character is replaced by =HEX (an equals sign followed by two hex digits). It is suited to text that consists mainly of printable ASCII, such as email subjects.
Base64 is much better for mainly binary data. It takes the first 6 bits of the 1st byte, then the last 2 bits of the 1st byte together with the first 4 bits of the 2nd byte, and so on, mapping each 6-bit group to one printable character. It can often be recognized by the = padding at the end of the encoded text (sometimes another character is used).
As an example I took a .jpeg of 271 700 bytes. In qp it is 627 857 bytes, while in base64 it is 362 269 bytes. The size of qp depends on the data: text consisting only of letters does not grow. The size of base64 is roughly orig_size * 8 / 6.
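The base64 figure can be reproduced with a short calculation. This sketch uses a zero-filled buffer of the same length as a stand-in for the .jpeg, since only the length matters for the size of base64 output. The helper base64_length is a name invented for this example.

```python
import binascii

def base64_length(n_bytes):
    """Number of base64 characters for n_bytes of input, padding included."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((n_bytes + 2) // 3)

raw = bytes(271700)                  # stand-in for the 271 700-byte .jpeg
encoded = binascii.b2a_base64(raw)   # appends one trailing newline

print(base64_length(len(raw)))  # 362268
print(len(encoded))             # 362269
```

The one-byte difference between the two numbers is the newline that b2a_base64() appends.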
Your documentation reference is for Python 3.0.1. There is no good reason for using Python 3.0. You should be using 3.2 or 2.7. What exactly are you using?
Suggestion: (1) change bytes to raw_bytes to avoid confusion with the bytes built-in (2) check for raw_bytes == bytes_back in your test script (3) while your test should work with quoted-printable, it is very inefficient for binary data; use base64 instead.
Update: Base64 encoding produces 4 output bytes for every 3 input bytes. Your base64 code doesn't work with 56-byte chunks because 56 is not an integral multiple of 3; each chunk is padded out to a multiple of 3. Then you join the chunks and attempt to decode, which is guaranteed not to work.
Your chunking loop would be much better written as:
output_string = b''.join(
    b2a_base64(raw_bytes[i:i+57]) for i in range(0, len(raw_bytes), 57)
)
In any case, chunking is rather slow and pointless; just do b2a_base64(raw_bytes)
#PMC's answer copied from the question:
Here's what works:
from binascii import *
raw_bytes = open('28.jpg','rb').read()
str_list = []
i = 0
while i < len(raw_bytes):
    byteSegment = raw_bytes[i:i+57]
    str_list.append(b2a_base64(byteSegment))
    i += 57
bytesBackAll = a2b_base64(b''.join(str_list))
print(bytesBackAll == raw_bytes) #True
Thanks for the help guys. I'm not sure why this would fail with [0:56] instead of [0:57] but I'll leave that as an exercise for the reader :P
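The difference between 56 and 57 comes down to padding, which the following Python 3 sketch makes visible. The input bytes are arbitrary stand-in data and encode_in_chunks is a hypothetical helper, not code from the thread.

```python
from binascii import a2b_base64, b2a_base64

raw = bytes(range(256)) * 3  # 768 bytes of stand-in data

def encode_in_chunks(data, size):
    """Base64-encode data in separate chunks of the given size."""
    return [b2a_base64(data[i:i + size]) for i in range(0, len(data), size)]

chunks_57 = encode_in_chunks(raw, 57)
chunks_56 = encode_in_chunks(raw, 56)

# 57 is a multiple of 3, so no chunk before the last one needs '=' padding.
padded_57 = [c for c in chunks_57[:-1] if b"=" in c]
# 56 is not, so every full chunk ends in '=' padding, and that padding
# lands in the middle of the joined text.
padded_56 = [c for c in chunks_56[:-1] if b"=" in c]
print(len(padded_57), len(padded_56))

# Without padding in the middle, the joined chunks decode as one stream.
round_trip = a2b_base64(b"".join(chunks_57))
print(round_trip == raw)
```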
