Right, so I have a file with lots of numbers in it. It's a .txt file of binary numbers, so it has 0s, 1s, and a space every 8 digits (it's the lyrics to Telegraph Road in binary, but that isn't important). What I am trying to do is create a program that takes the file, reads a single character of the file, and depending on what it reads, writes either "one" or "zero" into a second .txt file.
As it stands, as a proof of concept, this works:
with open('binary.txt') as file:
    while 1:
        string = file.read(1)
        if not string:   # read(1) returns "" at end of file
            break
        if string == "1":
            print("one")
        elif string == "0":
            print("zero")
It prints out either a "one" or a "zero" per line, about 15000 lines in total:
Picture of the IDLE Shell after running the program
In the future I want it to print them in sets of eight (so that one line = one binary ASCII code), but that is pointless if I can't get it to do the following first.
Instead of printing to the IDLE shell, I want to write the words into a new .txt file. A few hours of searching, testing and outright guessing has got me here:
with open('binary.txt') as file:
    with open('words.txt') as paper:
        while 1:
            string = file.read(1)
            if string == "1":
                line = ['one']
                paper.writelines(line)
            elif string == "0":
                line = ['zero']
                paper.writelines(line)
This then prints to the IDLE:
paper.writelines(line),
io.UnsupportedOperation: not writable
It goes without saying that I'm not very good at this. It has been a long time since I tried programming, and I wasn't very good at it then either, so any help would be much appreciated.
Cheers.
So in Python, the open() function opens the file in read mode by default, but you can change this by passing a second argument specifying the mode you need. For your case, and to keep things simple, you need to open the file in write mode or append mode.
# this will clear all the contents of the file and put the cursor at the beginning
with open('file.txt', 'w') as f:
    # logic

# this will open the file and put the cursor at the end of the file, after its contents
with open('file.txt', 'a') as f:
    # logic
I think that's what you are trying to do:
with open('binary.txt', "r") as file:
    with open('words.txt', "r+") as paper:
        string = "x"
        while string:  # stop once read() returns "" at end of file
            for i in range(8):
                string = file.read(1)
                if string == "1":
                    line = 'one '
                    paper.write(line)
                elif string == "0":
                    line = 'zero '
                    paper.write(line)
            paper.write("\n")
This program will take the items in the binary file and put them in the paper file as you said (1 = one and 0 = zero, 8 per line).
Here it is:
with open('binary.txt', "r") as file:
    data = file.read()
characters = data.split()
print(len(characters))

with open('binary.txt', "r") as file:
    with open('words.txt', "w") as paper:
        count = 0
        while count < len(characters):
            for i in range(9):
                string = file.read(1)
                if string == "1":
                    line = 'one '
                    paper.write(line)
                    print("one")
                elif string == "0":
                    line = 'zero '
                    paper.write(line)
                    print("zero")
            paper.write("\n")
            count += 1
It's probably really inefficient in places, but it does the job.
Important stuff:
By not having the output file opened for writing but in the default (read) mode, the program was unable to write to it. Adding "w" solved this; I also added an "r" to the input file to be safe.
Opened the read file once beforehand to work out the number of lines, so that the program would stop instead of churning out returns and making the write file millions of lines long if you didn't stop it immediately.
Worked out the number of lines necessary (the print(len(characters)) line is, ironically, unnecessary):
data = file.read()
characters = data.split()
print(len(characters))
Limited the number of lines the program could write: first set a variable to 0, then changed the while loop so that it keeps looping while the variable is less than the number of lines, and added a bit of code to increment the variable after every new line is made:
count = 0
while count < len(characters):
    # ... write one line of words ...
    count += 1
for i in range(8) was changed to for i in range(9) so that it would actually produce words in blocks of 8 before starting a new line. Before, it would make a line of 8 and then a line of 7 every time until the last line, where it would only write the leftover words, which in my case was 6.
The read file must have a very specific layout, that is: only 0s and 1s, a space every 8 digits, and no returns at all:
Picture of the text file, notice how each "line" continues beyond the boundary, this is because each line is 1001 columns long but due to there being a limit per column of 1001, it returns back around to create a new line, while still counting as the same column on the counter at the bottom.
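Not part of the answer above, but a sketch of how the strict layout requirement could be relaxed: reading the whole file at once and splitting on any whitespace tolerates newlines and repeated spaces. The filenames in the commented usage are the ones from the question; the WORDS mapping and translate() helper are made up for illustration.

```python
WORDS = {"0": "zero", "1": "one"}

def translate(text):
    """Turn '01000001 01100010 ...' into lines of eight words each."""
    lines = []
    for byte in text.split():  # split() tolerates newlines and extra spaces
        lines.append(" ".join(WORDS[bit] for bit in byte))
    return "\n".join(lines)

# Usage (assuming binary.txt exists alongside the script):
# with open("binary.txt") as src, open("words.txt", "w") as dst:
#     dst.write(translate(src.read()) + "\n")
```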
Related
I have a set of three .fasta files of standardized format. Each one begins with a string that acts as a header on line 1, followed by a long string of nucleotides on line 2, where the header string denotes the animal that the nucleotide sequence came from. There are 14 of them altogether, for a total of 28 lines, and each of the three files has the headers in the same order. A snippet of one of the files is included below as an example, with the sequences shortened for clarity.
anas-crecca-crecca_KSW4951-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM021-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
anas-crecca-crecca_KGM020-mtDNA
ATGCAACCCCAGTCCTAGTCCTCAGTCTCGCATTAG...CATTAG
What I would like to do is write a script or program that cats each of the strings of nucleotides together, but keeps them in the same position. My knowledge, however, is limited to rudimentary python, and I'd appreciate any help or tips someone could give me.
Try this:
data = ""
with open('filename.fasta') as f:
    i = 0
    for line in f:
        i = i + 1
        if (i % 2 == 0):
            data = data + line[:-1]
# Copy and paste the above block for each file,
# replacing filename with the actual name.
print(data)
Remember to replace "filename.fasta" with your actual file name!
How it works
Variable i acts as a line counter; when it is even, i % 2 is zero and the line is concatenated to the "data" string. This way, the odd (header) lines are ignored.
The [:-1] at the end of the data line removes the line break, allowing you to add all sequences to the same line.
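The same even-line idea can be factored into a helper and applied to all three files in a loop instead of copy-pasting the block. This is only a sketch; the helper name and the placeholder filenames in the commented usage are made up.

```python
def sequence_of(lines):
    """Concatenate the even-numbered lines (the nucleotide lines),
    dropping their trailing newlines."""
    return "".join(line.rstrip("\n")
                   for i, line in enumerate(lines, start=1)
                   if i % 2 == 0)

# Usage (the three filenames are placeholders):
# parts = [sequence_of(open(name)) for name in
#          ["file1.fasta", "file2.fasta", "file3.fasta"]]
# data = "".join(parts)
```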
First of all I want to begin with the actual code I’m trying to make it work:
for Num in range(200,307):
    with open(work_path+"\RESULTS\RES"+str(Num)+"_power_total.res","r") as f:
        a=0
        for line in f.readlines():
             liste = line.split( )
            if liste[0] == '1' :
                F=float(liste[1])
                W=float(liste[2])
                a = a + 1
list.append(liste[1])
list.append(liste[2])
list=list[::-1]
np.savetxt('res.csv', (list), delimiter=' ')
What I'm trying to do is :
Inside the "Results" folder there's a big amount of files that share the name except a number, thus the str(Num).
Note: Those ".res" files are just text files like it can be opened in Wordpad for example.
Anyway, the .res files also share the same structure: about 35 lines that all start with the character "#", except one line that starts with the number "1". After this number there are two numbers.
For example:
# gibberish
# gibberish
# gibberish
1 2.0E+5 5.051511E+10
# gibberish
So what I'm trying to do is simply scan all the files, read them, and copy all of the lines that start with the number "1" into another file (the res.csv file), to be used later in Excel.
At the end I would like a res.csv file that look like this :
2.0E+5 5.051511E+10
2.5E+5 4.464868E+09
2.7E+5 5.126146E+09
etc...
Now for the errors: I keep getting "IndentationError: unexpected indent".
But I don't see anything wrong with the indentation I used?
Btw I'm kinda new at programming so please bear with me.
Thanks in advance and best regards.
After cleaning the indents and putting the appends into the loop, it seems to work. I made it a function as well:
def process_file(myfile):
    my_list = []  # create a list to hold filtered data
    with open(myfile, "r") as f:
        a = 0
        for line in f.readlines():
            liste = line.split()  # this line had a 5-space indent
            if liste:  # eliminate empty lines
                if liste[0] == '1':  # this line had a 5-space indent
                    F = float(liste[1])  # F and W can be omitted if not used elsewhere
                    W = float(liste[2])
                    my_list.append(F)
                    my_list.append(W)
                    a = a + 1
    my_list = my_list[::-1]  # is this list inversion intended?
    return my_list
# print(process_file("some_file.txt")) ===> [50515110000.0, 200000.0]
Code:
with open("filename.txt", 'r') as f:  # I'm not sure about reading it as 'r' because I would be removing lines.
    lines = f.readlines()  # stores each line of the txt in 'lines'.

invalid_line_count = 0
for line in lines:  # this iterates through each line of the txt file.
    if line is invalid:
        # something which removes the invalid lines.
        invalid_line_count += 1
print("There were " + str(invalid_line_count) + " amount of invalid lines.")
I have a text file like so:
1,2,3,0,0
2,3,0,1,0
0,0,0,1,2
1,0,3,0,0
3,2,1,0,0
The valid line structure is 5 values split by commas.
For a line to be valid, it must have a 1, 2, 3 and two 0's. It doesn't matter in what position these numbers are.
An example of a valid line is 1,2,3,0,0
An example of an invalid line is 1,0,3,0,0, as it does not contain a 2 and has 3 0's instead of 2.
I would like to be able to iterate through the text file and remove invalid lines.
and maybe a little message saying "There were x amount of invalid lines."
Or maybe as suggested:
As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
I thought the csv module might help, so I read its documentation, but it didn't help me.
Any ideas?
You can't remove lines from a file, per se. Rather, you have to rewrite the file, including only the valid lines. Either close the file after you've read all the data and reopen it in mode "w", or write to a new file as you process the lines (which takes less memory in the short term).
Your main problem with detecting line validity seems to be handling the input. You want to convert the input text to a list of values; this is a skill you should get from learning your tools. The ones you need here are split to divide the line, and int to convert the values. For instance:
line_vals = line.split(',')
Now iterate through line_vals, and convert each to integer with int.
Validity: you need to count the quantity of each value you have in this list. You should be able to count things by value; if not, go back to your prior lessons and review basic logic and data flow. If you want the advanced method for this, use collections.Counter, a convenient type of dictionary that accumulates counts from any sequence.
Does that get you moving? If you're still lost, I recommend some time with a local tutor.
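As a hedged sketch of the Counter approach mentioned above (the VALID name and is_valid helper are made up for illustration): a line is valid exactly when its comma-separated values occur with these counts, which covers both rules at once — five values total, one each of 1, 2, 3 and two 0s.

```python
from collections import Counter

# The multiset of values a valid line must contain.
VALID = Counter({"0": 2, "1": 1, "2": 1, "3": 1})

def is_valid(line):
    # Tally the comma-separated fields and compare the whole
    # multiset against the expected one.
    return Counter(line.strip().split(",")) == VALID
```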
One of the possible right approaches:
with open('filename.txt', 'r+') as f:  # opening file in read/write mode
    inv_lines_cnt = 0
    valid_list = [0, 0, 1, 2, 3]  # sorted list of valid values
    lines = f.read().splitlines()
    f.seek(0)
    f.truncate(0)  # truncating the initial file
    for l in lines:
        if sorted(map(int, l.split(','))) == valid_list:
            f.write(l + '\n')
        else:
            inv_lines_cnt += 1
    print("There were {} amount of invalid lines.".format(inv_lines_cnt))
The output:
There were 2 amount of invalid lines.
The final filename.txt contents:
1,2,3,0,0
2,3,0,1,0
3,2,1,0,0
This is a mostly language-independent problem. What you would do is open another file for writing. As you read each line from the original file, test it for validity. If it passes, write it out to the new file. When you're finished, rename the original file to something else, then rename the new file to the original file.
For a line to be valid, each line must have a 1, 2, 3 and 2 0's. It doesn't matter in what position these numbers are.
CHUNK_SIZE = 65536

def _is_valid(line):
    """Check if a line is valid.

    A line is valid if it contains exactly five comma-separated values:
    '1', '2', '3', in any order, as well as '0', twice.

    :param str line: The line to check.
    :return: True if the line is valid, else False.
    :rtype: bool
    """
    values = line.split(",")
    # Split "1,2,3,0,0" into the list ["1", "2", "3", "0", "0"].
    if len(values) != 5:
        # If there aren't exactly five elements in the line, return False.
        return False
    if all(x in values for x in {"1", "2", "3"}) and values.count("0") == 2:
        # Builtin `all` checks if a condition (in this case `x in values`)
        # applies to all elements of a certain iterator.
        # `list.count` returns the amount of times a specific
        # element appears in it. If "0" appears exactly twice in the line
        # and the `all` call returns True, the line is valid.
        return True
    # If the previous block doesn't execute, the line isn't valid.
    return False

def get_valid_lines(path):
    """Get the valid lines from a file.

    The valid lines will be written back to `path`.

    :param str path: The path to the file.
    :return: None
    :rtype: None
    """
    invalid_lines = 0
    contents = []
    valid_lines = []
    with open(path, "r") as f:
        # Open the `path` parameter in reading mode.
        while True:
            chunk = f.read(CHUNK_SIZE)
            # Read `CHUNK_SIZE` bytes (65536) from the file.
            if not chunk:
                # Reaching the end of the file, we get an EOF.
                break
            contents.append(chunk)
            # If the chunk is not empty, add it to the contents.
    contents = "".join(contents).split("\n")
    # `contents` was read in chunks of size 65536. We join them
    # using `str.join`, then split all of it by newlines to get
    # each individual line.
    for line in contents:
        if not _is_valid(line=line):
            invalid_lines += 1
        else:
            valid_lines.append(line)
    print("Found {} invalid lines".format(invalid_lines))
    with open(path, "w") as f:
        for line in valid_lines:
            f.write(line)
            f.write("\n")
I'm splitting this up into two functions, one to check if a line is valid according to your rules, and a second one to manipulate a file. If you want to return the valid lines instead, just remove the second with statement and replace it with return valid_lines.
I am doing another program with text file I/O, and I'm confused because my code seems perfectly reasonable but the result seems crazy. I want to count the number of words, characters, sentences, and unique words in a text file of political speeches. Here is my code, so it might clear things up a bit.
#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentences, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.

#CISC 101, Queen's University
#By Damian Connors; 10138187

def main():
    harper = readFile("Harper's Speech.txt")
    print(numCharacters(harper), "Characters.")
    obama1 = readFile("Obama's 2009 Speech.txt")
    print(numCharacters(obama1), "Characters.")
    obama2 = readFile("Obama's 2008 Speech.txt")
    print(numCharacters(obama1), "Characters.")

def readFile(filename):
    '''Function that reads a text file, then prints the name of the file without
    '.txt'. The function returns the read file for main() to call, and prints
    the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    print(filename.replace(".txt", "") + ":")  #this prints the filename
    return fileContentsList

def numCharacters(file):
    return len(file) - file.count(" ")
What I'm having trouble with at the moment is counting the characters. It keeps saying the count is 85, but it's a pretty big file and I know it should be 7792 characters. Any idea what I'm doing wrong? Here is my shell output; I'm using Python 3.3.3:
>>> ================================ RESTART ================================
>>>
Harper's Speech:
85 Characters.
Obama's 2009 Speech:
67 Characters.
Obama's 2008 Speech:
67 Characters.
>>>
So as you can see, I have 3 speech files, but there's no way they contain that few characters.
You should change this line: fileContentsList = inFile1.readlines()
Right now you are counting how many lines Obama has in his speech.
Change readlines() to read() and it will work.
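As a sketch, here are the question's two functions with that one change applied (the behavior of numCharacters is unchanged; it now counts characters in one string instead of elements of a list):

```python
def readFile(filename):
    with open(filename, "r") as inFile1:
        fileContents = inFile1.read()  # one string, not a list of lines
    print(filename.replace(".txt", "") + ":")
    return fileContents

def numCharacters(file):
    return len(file) - file.count(" ")  # characters, excluding spaces
```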
The readlines function returns a list containing the lines, so the length of it will be the number of lines in the file, not the number of characters.
You either have to find a way to read in all the characters (so that the length is correct), like using read().
Or go through each line tallying up the character in it, perhaps something like:
tot = 0
for line in file:
    tot = tot + len(line) - line.count(" ")
return tot
(assuming your actual method chosen for calculating characters is correct, of course).
As an aside, your third output statement is referencing obama1 rather than obama2, you may want to fix that as well.
You are instead counting lines. In more detail, you are effectively reading the file into a list of lines and then counting them. A cleaned up version of your code follows.
def count_lines(filename):
    with open(filename) as stream:
        return len(stream.readlines())
The easiest change to such code to count words is to instead read out the whole file and split it into words and then count them, see the following code.
def count_words(filename):
    with open(filename) as stream:
        return len(stream.read().split())
Notes:
The code may need to be updated to match your exact definition of words.
This method is not suitable for very large files as it reads the whole file into memory and the list of words is also stored there.
Therefore the above code is more a conceptual model than the best final solution.
What you are currently seeing is the number of lines in your file. As fileContentsList is a list, numCharacters returns the size of that list.
If you want to continue using readlines, you need to count the characters in each line and add them up to get the total number of characters in the file.
def main():
    print(readFile("Harper's Speech.txt"), "Characters.")
    print(readFile("Obama's 2009 Speech.txt"), "Characters.")
    print(readFile("Obama's 2008 Speech.txt"), "Characters.")

def readFile(filename):
    '''Function that reads a text file, then prints the name of the file without
    '.txt'. The function returns the character count for main() to print, and
    prints the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.readlines()
    inFile1.close()
    totalChar = 0  # variable to store the total number of characters
    for line in fileContentsList:  # reading all lines
        line = line.rstrip("\n")  # removing the line-end character '\n'
        totalChar = totalChar + len(line) - line.count(" ")
        # adding the number of characters in this line to the total,
        # minus the number of whitespaces in the current line
    print(filename.replace(".txt", "") + ":")  # this prints the filename
    return totalChar

main()  # calling the main function
I have a file like this below.
0 0 0
0.00254 0.00047 0.00089
0.54230 0.87300 0.74500
0 0 0
I want to modify this file: if a value is less than 0.05, it should become 1; otherwise, it should become 0.
After python script runs, the file should be like
1 1 1
1 1 1
0 0 0
1 1 1
Would you please help me?
OK, since you're new to StackOverflow (welcome!) I'll walk you through this. I'm assuming your file is called test.txt.
with open("test.txt") as infile, open("new.txt", "w") as outfile:
opens the files we need, our input file and a new output file. The with statement ensures that the files will be closed after the block is exited.
for line in infile:
loops through the file line by line.
values = [float(value) for value in line.split()]
Now this is more complicated. Every line contains space-separated values. These can be split into a list of strings using line.split(). But they are still strings, so they must be converted to floats first. All this is done with a list comprehension. The result is that, for example, after the second line has been processed this way, values is now the following list: [0.00254, 0.00047, 0.00089].
results = ["1" if value < 0.05 else "0" for value in values]
Now we're creating a new list called results. Each element corresponds to an element of values, and it's going to be a "1" if that value < 0.05, or a "0" if it isn't.
outfile.write(" ".join(results))
converts the list of "integer strings" back to a single string, with the values separated by spaces.
outfile.write("\n")
adds a newline. Done.
The two list comprehensions could be combined into one, if you don't mind the extra complexity:
results = ["1" if float(value) < 0.05 else "0" for value in line.split()]
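Putting the pieces of the walkthrough above together, the complete script reads as follows (with a small sample test.txt written first, so the sketch runs end-to-end; drop that first block if your test.txt already exists):

```python
# Create a sample input like the one in the question:
with open("test.txt", "w") as f:
    f.write("0 0 0\n0.00254 0.00047 0.00089\n0.54230 0.87300 0.74500\n")

with open("test.txt") as infile, open("new.txt", "w") as outfile:
    for line in infile:
        # split the line, convert to floats, then map each value
        # to "1" (below threshold) or "0" (at or above it)
        values = [float(value) for value in line.split()]
        results = ["1" if value < 0.05 else "0" for value in values]
        outfile.write(" ".join(results))
        outfile.write("\n")
```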
If you can use libraries, I'd suggest numpy:
import numpy as np

my_array = np.genfromtxt("my_path_to_text_file.txt")
out_array = np.where(my_array < 0.05, 1, 0)
np.savetxt("my_output_file.txt", out_array, fmt="%d")
You can add formatting as arguments to the savetxt function. The docstrings of the function are pretty self-explanatory.
If you are stuck with pure python :
with open("my_path_to_text_file") as my_file:
    list_of_lines = my_file.readlines()

list_of_lines = [[int(float(x) < 0.05) for x in line.split()] for line in list_of_lines]
then write that list to file as you see fit.
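For that last step, one hedged way to write such a nested list back out (file_out_demo.txt is a made-up name for illustration):

```python
rows = [[1, 1, 1], [0, 0, 0]]  # e.g. the nested list built above
with open("file_out_demo.txt", "w") as out:
    for row in rows:
        # join the 0/1 integers with spaces, one output line per input row
        out.write(" ".join(str(v) for v in row) + "\n")
```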
You can use this code
f_in = open("file_in.txt", "r")  # opens a file in reading mode
in_lines = f_in.readlines()  # reads it line by line
out = []
for line in in_lines:
    list_values = line.split()  # separate elements by the spaces, returning a list of the numbers as strings
    for i in range(len(list_values)):
        list_values[i] = float(list_values[i])  # converts them to floats
        if list_values[i] < 0.05:  # your condition
            list_values[i] = 1
        else:
            list_values[i] = 0
    out.append(list_values)  # stores the numbers in a list of lists, one per line of the file
f_in.close()  # closes the file

f_out = open("file_out.txt", "w")  # opens a new file in writing mode
for cur_list in out:
    for i in cur_list:
        f_out.write(str(i) + "\t")  # writes each number, plus a tab
    f_out.write("\n")  # writes a newline
f_out.close()  # closes the file
The following code performs the replacements in place: for that, the file is opened in 'rb+' mode. It's absolutely mandatory to open it in binary mode (the b). The + in 'rb+' means that it's possible both to read and to write the file. Note that the mode can also be written 'r+b'.
But using 'rb+' is awkward: if you read with for line in f, the file is read in chunks and several lines are kept in a buffer, where they are really read one after the other until another chunk of data is read and loaded into the buffer. That makes it harder to perform transformations, because you must follow the position of the file's pointer with tell() and move it with seek(), and in fact I've never completely understood how that must be done.
Happily, there's a solution with readline(): I don't know why, but I believe the facts — when readline() reads a line, the file's pointer doesn't go further on disk than the end of that line (that is to say, it stops at the newline). So it's easy to move the file's pointer and to know its position.
To write after reading, it's necessary to call seek() first, even if it's only seek(0, 1), meaning a move of 0 characters from the current position; that resets the state of the file's pointer.
Well, for your problem, the code is as follows:
import re
from os import fsync
from os.path import getsize

reg = re.compile(rb'[\d.]+')

def ripl(m):
    g = m.group().decode()
    # pad with spaces so the replacement has the same length as the
    # original number and the file's size doesn't change
    return ('1' if float(g) < 0.05 else '0').ljust(len(g)).encode()

path = ...........'
print('length of file before : %d' % getsize(path))

with open(path, 'rb+') as f:
    line = b'go'
    while line:
        line = f.readline()
        lg = len(line)
        f.seek(-lg, 1)
        f.write(reg.sub(ripl, line))
        f.flush()
        fsync(f.fileno())

print('length of file after : %d' % getsize(path))
flush() and fsync() must be executed to ensure that the instruction f.write(reg.sub(ripl, line)) effectively writes at the moment it is ordered to.
Note that I've never handled a file with a Unicode encoding this way. It's certainly more difficult, since a Unicode character can be encoded over several bytes (and, in the case of UTF-8, a variable number of bytes depending on the character).