Write a file ascii ordered - python

I'm trying to copy one file to another with each line sorted in ASCII order, but it's giving me some bugs. For example, on the first line it adds a \n for no reason; I'm trying to understand it but I don't get it. Also, if you think this approach is not a good one, please advise me how to do it better, thanks.
demo.txt (an ASCII file)
!=orIh^
-_hIdH2 !=orIh^
-_hIdH2
code.py
count = 0
try:
    fcopy = open("demo.txt", 'r')
    fdestination = open("demo2.txt", 'w')
    for line in fcopy.readlines():
        count = len(line) - 1
        list1 = ''.join(sorted(line))
        str1 = ''.join(str(e) for e in list1)
        fdestination.write(str(count)+str1)
    fcopy.close()
    fdestination.close()
except Exception as e:
    print(str(e))
Note: count is the number of letters on a line.
Output
7
!=I^hor15
!-2=HII^_dhhor6-2HI_dh
The problem: each output line should be the number of letters followed by that line's characters in ASCII order.

Each line in your code has a newline character at the end. When you sort all characters, the newline character is sorted, too, and moved to the appropriate position (which is in general not at the end of the string anymore). This causes line breaks to happen at almost random places.
What you need is to remove the line break before sorting and add it back after sorting. Also, the second join in your loop is not doing anything, and list1 is not a list but a string.
str1 = ''.join(sorted(line.strip('\n')))
fdestination.write(str(count)+str1+'\n')
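Putting the two fixes together, here is a minimal corrected sketch of the whole script (rewritten with a with statement so both files are closed even on error; file names are the ones from the question):
with open("demo.txt", 'r') as fcopy, open("demo2.txt", 'w') as fdestination:
    for line in fcopy:
        stripped = line.strip('\n')           # drop the newline before sorting
        count = len(stripped)                 # letters on this line (also correct for a last line with no newline)
        str1 = ''.join(sorted(stripped))      # characters in ASCII order
        fdestination.write(str(count) + str1 + '\n')  # put the newline back at the end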

Related

Python word counting program for .txt files keeps showing "string index out of range" as an error

I'm pretty new to this and I was trying to write a program which counts the words in .txt files. There is probably a better way of doing this, but this was the idea I came up with, so I wanted to go through with it. I just don't understand why i, or any variable, doesn't work as an index for the string of the page that I'm counting on...
Do you guys have a solution, or should I just take a different approach?
page = open("venv\harrry_potter.txt", "r")
alphabet = "qwertzuiopüasdfghjklöäyxcvbnmßQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM"

# Counting the characters
list_of_lines = page.readlines()
characternum = 0
textstr = ""  # to convert the .txt file to string
for line in list_of_lines:
    for character in line:
        characternum += 1
        textstr += character

# Counting the words
i = 0
wordnum = 1
while i <= characternum:
    if textstr[i] not in alphabet and textstr[i+1] in alphabet:
        wordnum += 1
    i += 1
print(wordnum)
page.close()
Counting the characters and converting the .txt file to a string is done a bit weirdly, because I thought the other way could be the source of the problem...
Can you help me, please?
Typically you want to use split for simplistically counting words. The way you are doing it, you would count right-minded as two words, and don't as two words. If you can rely on just spaces, then you can use split like this:
book = "Hello, my name is Inigo Montoya, you killed my father, prepare to die."
words = book.split()
print(f'word count = {len(words)}')
You can also pass arguments to split for more options if the default doesn't suit you.
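For example (with illustrative values), you can pass an explicit separator and an optional maxsplit:
text = "alpha,beta,gamma"
print(text.split(","))     # ['alpha', 'beta', 'gamma']
print(text.split(",", 1))  # ['alpha', 'beta,gamma'] - at most one split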
https://pythonexamples.org/python-count-number-of-words-in-text-file/
You want to get the word count of a text file
The shortest code is this (that I could come up with):
with open('lorem.txt', 'r') as file:
    print(len(file.read().split()))
First off, for smaller files this is fine, but it loads all of the data into memory, so it's not that great for large files. Also, use a context manager (with); it helps with error handling and other things. What happens is: file.read() reads the whole file and returns it as a string, .split() splits that string on whitespace and returns a list of the words in it, and len() gives you the length of that list, which you print.
A better approach would be this:
word_count = 0
with open('lorem.txt', 'r') as file:
    for line in file:
        word_count += len(line.split())
print(word_count)
This is better because the whole file is not kept in memory: you read each line separately, and each one overwrites the previous one in memory. Here again, for each line you split it on whitespace, take the length of the returned list, and add that to the total word count. At the end, simply print the total word count.
Useful sources:
about with
Context Managers - Efficiently Managing Resources (to learn how they work a bit in detail) by Corey Schafer
.split() "docs"

making a list from specific parts of a text file

Hi, I made this little exercise for myself. I want to pull out the last number in each line of this text file, which has 5 lines and 6 numbers per line separated by spaces. I made a loop to get all the remaining characters of the selected line starting from the 5th space. It works for every line (print(findtext(0)) to print(findtext(3))), except the last line if the last number has fewer than 3 characters... What is wrong? I can't figure it out.
text = open("text", "r")
lines = text.readlines()

def findtext(c):
    count = 0
    count2 = 0
    while count < len(lines[c]) and count2 < 5:
        if lines[c][count] == " ":
            count2 = count2 + 1
        count = count + 1
    return float(lines[c][count:len(lines[c])-1])

print(findtext(0))
Your proposed solution doesn't seem very Pythonic to me. (Incidentally, the actual bug: len(lines[c])-1 assumes every line ends with a newline, but the last line of a file often has none, so the last character of its final number gets cut off.)
with open('you_file') as lines:
    for line in lines:
        # Exhaust the iterator
        pass

# Split by whitespace and get the last element
*_, last = line.split()
print(last)
Several things:
Access files within context managers, as this guarantees resources are destroyed correctly
Don't keep track of indexes if you don't need to, it makes the code harder to read
Use split instead of counting the literal whitespace character
with open('file') as f:
    numbers = f.readlines()
last_nums = [line.split()[-1] for line in numbers]
line.split() will split the string into a list of elements, using whitespace as the separator (if you pass it no arguments),
[-1] will get the last element of that list for you
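A quick illustration with a made-up sample line:
line = "1.5 2 7 0.3 12 42"   # hypothetical line from the file
print(line.split())           # ['1.5', '2', '7', '0.3', '12', '42']
print(line.split()[-1])       # '42' - the last number, as a string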

Python - count key value pairs from text file

I have the following text file:
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Each key pair records how many times each string appears in a document: [docID]:[stringFq]
How could you calculate the number of key pairs in this text file?
Your regex approach works fine. Here is an iterative approach. If you uncomment the print statements, you will uncover some intermediate results.
Given
%%file foo.txt
abstract 233:1 253:1 329:2 1087:2 1272:1
game 64:1 99:1 206:1 595:1
direct 50:1 69:1 1100:1 1765:1 2147:1 3160:1
Code
import itertools as it

with open("foo.txt") as f:
    lines = f.readlines()
#print(lines)

pred = lambda x: x.isalpha()

count = 0
for line in lines:
    line = line.strip("\n")
    line = "".join(it.dropwhile(pred, line))
    pairs = line.strip().split(" ")
    #print(pairs)
    count += len(pairs)

count
# 15
Details
First we use a with statement, which is an idiom for safely opening and closing files. We then split the file into lines via readlines(). We define a conditional function (or predicate) that we will use later. The lambda expression is used for convenience and is equivalent to the following function:
def pred(x):
    return x.isalpha()
We initialize a count variable and start iterating over the lines. Every line may have a trailing newline character \n, so we first strip() it away before feeding the line to dropwhile.
dropwhile is a special itertools iterator. As it iterates over a line, it discards any leading characters that satisfy the predicate until it reaches the first character that fails the predicate. In other words, all letters at the start are dropped until the first non-letter is found (which happens to be a space). We strip() the leading space from this new string, and the remainder is split() into a list of pairs.
Finally, the length of each line's list of pairs is added to count. The final count is the sum of the lengths of all the pair lists.
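To make dropwhile concrete, here is what it does to a shortened version of one of the sample lines:
import itertools as it

line = "abstract 233:1 253:1"                 # shortened sample line
rest = "".join(it.dropwhile(str.isalpha, line))
print(repr(rest))                # ' 233:1 253:1' - the leading word is gone
print(rest.strip().split(" "))   # ['233:1', '253:1'] - the pairs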
Summary
The code above shows how to tackle basic file handling with simple, iterative steps:
open the file
split the file into lines
while iterating each line, clean and process data
output a result
import re

file = open('input.txt', 'r')
file = file.read()
numbers = re.findall(r"[-+]?\d*\.\d+|\d+", file)
# finds all numbers in the text file
numLen = len(numbers) / 2
# counts all numbers; I needed to count pairs, so I just divided by 2
print(numLen)
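For comparison, a split-based sketch that needs neither regex nor dropwhile, assuming each line is one word followed by its pairs:
count = 0
with open("foo.txt") as f:
    for line in f:
        tokens = line.split()
        if tokens:                    # skip any blank lines
            count += len(tokens) - 1  # every token after the word is one pair
print(count)                          # 15 for the sample file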

Weird Value Change without Changing a Text File Python

I am writing a program that is supposed to return the minimum sequence alignment score (smaller = better), and it worked with the Coursera sample inputs, but for the dataset we're given I can't manually input the sequences, so I have to resort to using a text file. There are a few things I found weird.
First things first,
pattern = 'AAA'
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.strip().strip('\n')
    empty.append(line)
print(empty)
print(smallest_distance(pattern, DNA))
If I run this, my program outputs 0. If I comment out the for loop, my program outputs 2. I didn't change DNA, so why should my program behave differently? Also, my strip('\n') is working (and for some reason, strip('n') works just as well), but my strip() is not working. Once I figure this out, I can test out empty in my smallest_distance function.
Here is what my data looks like:
ACTAG
CTTAGTATCACTCTGAAAAGAGATTCCGTATCGATGACCGCCAGTTAATACGTGCGAGAAGTGGACACGGCCGCCGACGGCTTCTACACGCTATTACGATG AACCAACAATTGCTCGAATCCTTCCTCAAAATCGCACACGTCTCTCTGGTCGTAGCACGGATCGGCGACCCACGCGTGACAGCCATCACCTATGATTGCCG
TTAAGGTACTGCTTCATTGATCAACACCCCTCAGCCGGCAATCACTCTGGGTGCGGGCTGGGTTTACAGGGGTATACGGAAACCGCTGCTTGCCCAATAAT
etc...
Solution:
pattern = 'AAA'
with open('practice_data.txt') as f_dna:
    dna_list = [sequence for line in f_dna for sequence in line.split()]
print(smallest_distance(pattern, dna_list))
Explanation:
You were close to the solution, but you needed to replace strip() with split().
-> strip() removes the extra characters, so your strip('\n') was a good guess.
But since \n is at the end of the line, split will automatically get rid of it, because it counts as a delimiter,
e.g
>>> 'test\ntest'.split()
['test', 'test']
>>> 'test\n'.split()
['test']
Now you have to replace .append() with a simple list concatenation, since split returns a list.
DNA = open('practice_data.txt')
empty = []
for lines in DNA:
    line = lines.split()
    empty += line
But there are still some problems in your code:
It is better to use the with statement when opening a file, because it automatically handles exceptions and closes the file descriptor at the end:
empty = []
with open('practice_data.txt') as DNA:
    for lines in DNA:
        line = lines.split()
        empty += line
Your code is now fine. You can still refactor it using a list comprehension (very common in Python):
with open('practice_data.txt') as DNA:
    empty = [sequence for line in DNA for sequence in line.split()]
If you struggle to understand this, try to decompose it back into for loops:
empty = []
with open('practice_data.txt') as DNA:
    for line in DNA:
        for sequence in line.split():
            empty.append(sequence)
Note: #MrGeek's solution works, but it has two major defects:
as it does not use a with statement, the file is never explicitly closed;
using .read().splitlines() loads ALL the content of the file into memory, which could lead to a MemoryError exception if the file is too big.
Go further, handle huge files:
Now imagine that you have a 1 GB file filled with DNA sequences. Even if you don't load the whole file into memory, you still build a huge list; a better practice is to create another file for the results and process your DNA on the fly:
e.g
pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for line in f_dna:  # the file object is f_dna here, not DNA
        for sequence in line.split():
            result = smallest_distance(pattern, sequence)
            f_result.write(str(result) + '\n')  # write() expects a string
Warning: You will have to make sure your function smallest_distance accepts a string rather than a list.
If that is not possible, you may need to process batches instead, but since that is a little complicated I will not talk about it here.
Now you can refactor a bit, for example using a generator function to improve readability:
def extract_sequence(file, pattern):
    for line in file:
        for sequence in line.split():
            yield smallest_distance(pattern, sequence)

pattern = 'AAA'
with open('practice_data.txt') as f_dna, open('result.txt', 'w') as f_result:
    for result in extract_sequence(f_dna, pattern):
        f_result.write(str(result) + '\n')  # write() expects a string
Potential errors:
print(smallest_distance(pattern, DNA))
DNA is a file object, not a list of strings, because DNA = open('practice_data.txt').
The for loop consumes DNA. So if you use a for loop for lines in DNA: again inside smallest_distance, it doesn't work.
Update:
In this case, the for loop goes from the beginning of the file to the end. It will not go back to the start again like a list, unless you call DNA.close() and re-initialize the file object with DNA = open('practice_data.txt').
A simple example you can try:
DNA = open('text.txt')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # prints everything in the file here
print('try again')
for lines in DNA:
    line = lines.strip().strip('\n')
    print(line)  # will not print anything at all
print('done')
Read For loop not working twice on the same file descriptor for more detail
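As a side note, you can also rewind the file object with seek(0) instead of closing and reopening it, for example:
DNA = open('text.txt')
for lines in DNA:
    print(lines.strip())   # prints every line
DNA.seek(0)                # move the read position back to the start
for lines in DNA:
    print(lines.strip())   # prints every line again
DNA.close()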
Write:
pattern = 'AAA'
DNA = open('practice_data.txt').read().splitlines()
newDNA = []
for line in DNA:
    newDNA += line.split()  # split the line into a list of strings, then concatenate it with the newDNA list
print(smallest_distance(pattern, newDNA))

Python Programming for .json Loggly files

I want to search for particular strings in a long .json Loggly file, including their line numbers, and also print the 5 lines above and below each matched line. Can you please help me?
It is always returning "NOT FOUND".
After this, I am now only getting some output with the program shown below.
import linecache  # needed for linecache.getline below

with open('logg.json', 'r') as f:
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            i = ln - 25
            for i in (ln-25, ln+25):  # note: a 2-tuple, not range(ln-25, ln+25)
                l = linecache.getline('logg.json', i)
                i += 1
                print(ln, l)
            print(" Next Error")
file.readlines() returns a list of lines. Each line keeps its trailing newline (\n).
You need to include the newline to match the line exactly:
ln = data.index("error CRASHLOG\n")
If you want to find a line that contains a target string, you need to iterate over the lines:
with open('logg.json', 'r') as f:
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            # You now know the line index (`ln`)
            # Print above/below 5 lines here.
            break
    else:
        print("Not Found")
BTW, this kind of work is easily done with grep(1):
grep -C 5 'error CRASHLOG' logg.json || echo 'Not Found'
UPDATE
The following is more complete code:
from collections import deque
from itertools import islice, chain
import sys

found = False
with open('logg.json', 'r') as f:
    last_lines = deque(maxlen=5)  # contains the last (up to) 5 lines.
    for ln, line in enumerate(f):
        if "error CRASHLOG" in line:
            found = True
            # 5 lines of context before, the matching line, then the next 5 lines
            sys.stdout.writelines(chain(last_lines, [line], islice(f, 5)))
        last_lines.append(line)
if not found:
    print("Not Found")
I'm sure it actually returns "Not found", but I put that down to anxiety-induced shoutiness.
data is a list. The documentation about the list type (http://docs.python.org/2/tutorial/datastructures.html) states that list.index(x) returns "the index in the list of the first item whose value is x. It is an error if there is no such item."
Therefore, the only lines that would be reported are those that contain ONLY the string you specify, with no other characters. As falsetru points out in her/his answer, if there is no other information on the log lines, then your comparison must include the newline that readlines() leaves at the end of every line in the list it returns (even on a Windows system, as long as you open the file in text mode, which is the default). Without that, there is no chance of a match.
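A tiny demonstration of that exact-match behaviour, with made-up lines:
lines = ["foo error CRASHLOG bar\n", "error CRASHLOG\n"]
print(lines.index("error CRASHLOG\n"))  # 1 - only the exact-match line is found
# lines.index("error CRASHLOG") would raise ValueError: not in list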
If the lines contain other information, then a better test might indeed be to use x in string, as she/he suggests, but I suspect you might be interested to see how much more processing it takes than a simple equality test. Come to that, so would I, but this isn't my problem ...
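If you do want to measure it, timeit makes the comparison easy (a sketch with a made-up log line; absolute numbers will vary by machine):
import timeit

line = "2021-01-01 12:00:00 error CRASHLOG something happened\n"
print(timeit.timeit(lambda: "error CRASHLOG" in line))    # substring scan
print(timeit.timeit(lambda: line == "error CRASHLOG\n"))  # simple equality check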
