I am working through tabular text files (.txt), and I am wondering what the best way is to delete lines from a text file that contain either negative numbers or blank entries.
This is my current code. The main question is what Python code will write only the valid lines from the file (lines that do not contain a negative number or a blank entry). When I run this script, it still writes all the text file entries despite the conditions set.
import os, sys
inFile = sys.argv[1]
baseN = os.path.basename(inFile)
outFile = 'c:/exampleSolution.txt'
#if path exists, read and write file
if os.path.exists(inFile):
    inf = open(inFile,'r')
    outf = open(outFile,'w')
    #reading and writing header
    header = inf.readline()
    outf.write(header)
    not_consider = []
    lines = inf.read().splitlines()
    for i in range(0,len(lines)):
        data = lines[i].split(' ')
        for j in range(0,len(data)):
            if (data[j] == '' or int(data[j]) < 0):
                #if line has a blank or negative value,
                #append i to the not_consider list
                not_consider.append(i)
    for i in range(0,len(lines)):
        #if i is in not_consider list, don't write to out file
        if i not in not_consider:
            outf.write(lines[i])
            print(lines[i])
            outf.write("\n")
    inf.close()
    outf.close()
This is an example of a file input I am working with:
[Screenshot of part of the chart I'm working with. In this example, I would delete site numbers 2 and 4 for having a negative number and/or missing items.]
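One possible way to approach this is to decide line by line whether a row is valid and keep only the rows that pass. The sketch below is not taken from the question: it assumes the columns are tab-separated (the real delimiter is not visible in the screenshot), that the first line is a header, and that non-numeric text fields should be kept. The names is_valid_row and filter_lines are made up for illustration.

```python
def is_valid_row(fields):
    """Return False if any field is blank or parses as a negative number."""
    for field in fields:
        value = field.strip()
        if value == "":
            return False
        try:
            if float(value) < 0:
                return False
        except ValueError:
            # Assumption: non-numeric text (e.g. a site name) is acceptable
            pass
    return True


def filter_lines(lines, delimiter="\t"):
    """Keep the header plus every data line that passes is_valid_row."""
    lines = iter(lines)
    header = next(lines, None)
    kept = [] if header is None else [header]
    for line in lines:
        row = line.rstrip("\n")
        # Completely empty lines are dropped as well
        if row and is_valid_row(row.split(delimiter)):
            kept.append(line)
    return kept
```

To use it on files, something like `outf.writelines(filter_lines(inf))` with both files opened via `with open(...)` should work, since a file object can be iterated line by line.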
I'm learning Python, and I've been trying to split this txt file into multiple files, grouped by a short string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 characters and is terminated by a space (as in WSON33, JHSF3, etc.).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so that it works:
This code, which I adapted from another post, more or less works and splits the input into multiple files, but it requires the lines to be sorted before I start writing files. I also need to copy the header into each file rather than isolating it in a file of its own.
import itertools
with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of five or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file once per identifier and creates the new files in the same folder as your text file and Python script.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)
# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])
# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # the .seek(0) 'rewinds' the file position, which is apparently
        # needed if looping through a file multiple times
        filehandle.seek(0)
        # makes new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts header, and goes to next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through old file, and adds relevant lines to new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        # close each output file so its contents are flushed to disk
        newfile.close()
filehandle.close()
print(Unique)
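For what it's worth, the same grouping can probably be done in a single pass by collecting lines in a dictionary keyed on the identifier, which avoids both the sorting that groupby needs and the repeated rewinding above. This is only a sketch: it assumes the first line is the header and that the identifier is the first whitespace-separated token. The names split_by_identifier and write_groups are made up for illustration.

```python
from collections import defaultdict


def split_by_identifier(lines):
    """Return (header, groups), where groups maps identifier -> list of lines."""
    lines = iter(lines)
    header = next(lines, "")
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines instead of failing on parts[0]
        groups[parts[0]].append(line)
    return header, dict(groups)


def write_groups(header, groups, suffix=".txt"):
    """Write one file per identifier, each starting with the shared header."""
    for key, group in groups.items():
        with open(key + suffix, "w") as out:
            out.write(header if header.endswith("\n") else header + "\n")
            out.writelines(group)
```

Usage would be along the lines of `with open('tordist.txt') as fin: header, groups = split_by_identifier(fin)` followed by `write_groups(header, groups)`. The trade-off is that all lines are held in memory until the files are written.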
I wrote a function that takes a text file and a ratio (e.g. 80%) and writes the first 80% of the lines to one file and the remaining 20% to another file. The first output is correct, but the second one is empty. Can someone take a look and let me know my mistake?
def splitFile(inputFilePatheName, outputFilePathNameFirst, outputFilePathNameRest, splitRatio):
    lines = 0
    buffer = bytearray(2048)
    with open(inputFilePatheName) as f:
        while f.readinto(buffer) > 0:
            lines += buffer.count('\n')
    print lines
    line80 = int(splitRatio * lines)
    print line80
    with open(inputFilePatheName) as originalFile:
        firstNlines = originalFile.readlines()[0:line80]
        restOfTheLines=originalFile.readlines()[(line80+1):lines]
    print len(firstNlines)
    print len(restOfTheLines)
    with open(outputFilePathNameFirst, 'w') as outputFileNLines:
        for item in firstNlines:
            outputFileNLines.write("{}".format(item))
    with open(outputFilePathNameRest,'w') as outputFileRest:
        for word in restOfTheLines:
            outputFileRest.write("{}".format(word))
I believe this is your problem:
firstNlines = originalFile.readlines()[0:line80]
restOfTheLines=originalFile.readlines()[(line80+1):lines]
When you call readlines() the second time, you don't get anything, because you've already read all the lines from the file. Try:
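allLines = originalFile.readlines()
firstNlines, restOfTheLines = allLines[:line80], allLines[line80:]
Note that the second slice should start at line80 rather than line80+1; otherwise the line at index line80 ends up in neither file. Of course, this approach reads the entire file into memory, which can be a problem for very large files.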
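If memory is a concern, one option is to count the lines in a first pass and then stream them to the two outputs in a second pass, so that only one line is held at a time. The following is a rough sketch, not the asker's function: split_stream is a made-up name, and it assumes the source is a seekable file object opened in text mode.

```python
import io


def split_stream(src, first_out, rest_out, ratio):
    """Stream the first `ratio` share of lines to first_out, the rest to rest_out.

    Returns a tuple (lines_in_first, lines_in_rest).
    """
    total = sum(1 for _ in src)   # first pass: count the lines
    cutoff = int(ratio * total)
    src.seek(0)                   # rewind for the second pass
    for index, line in enumerate(src):
        target = first_out if index < cutoff else rest_out
        target.write(line)
    return cutoff, total - cutoff
```

With real files this would be called inside nested `with open(...)` blocks, for example `split_stream(src, first, rest, 0.8)`.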
I need to split a large text file (over 10000 lines of numbers) into smaller files, each of which starts with a header of the format (number_of_lines, number_difference, "Sam").
number_difference is the difference between the first and last number in the file.
For example, if the file looks like this:
10
12
13.5
17
20
Then, the header should be:
5 10 Sam
The problem is that the flags I use to avoid writing a header more than once do not work, and the big file's header carries over to the first small file.
The header will be different for each file.
How do I add a changing header to each file?
def TextCropper():
    lines_per_file = 1000
    smallfile = None
    with open(inputFileName) as bigfile:
        for lineno, line in enumerate(bigfile):
            if lineno % lines_per_file == 0:
                if smallfile:
                    smallfile.close()
                small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
                smallfile = open(small_filename, "w")
                if (flags[counter] == False):
                    smallfile.write(lines_per_file)
                    flags[counter] = True
            smallfile.write(line)
        # close the last small file once the big file is exhausted
        if smallfile:
            smallfile.close()
TextCropper()
You're reading and writing the lines one at a time. By doing that, you don't know what the last line of a chunk will be at the moment you open the small file, so you can't write its header in advance.
Instead, read up to N lines at once, if available. islice() will do exactly that for you. If the list comes back empty, there were no lines left to read; otherwise you can proceed to write the current chunk into a file.
Since each line is read as a number with a trailing newline ('\n'), strip that, convert the first and last numbers into floats and calculate the difference. Writing the actual numbers to the file is straightforward by joining the elements of the list.
To make the function reusable, include the variables that are likely to change as arguments. That way you can name any big file, any output small file and any number of lines you want without changing hardcoded values.
from itertools import islice

def number_difference(iterable):
    return float(iterable[-1].strip('\n')) - float(iterable[0].strip('\n'))

def file_crop(big_fname, chunk_fname, no_lines):
    with open(big_fname, 'r') as big_file:
        ifile = 0
        while True:
            data = list(islice(big_file, no_lines))
            if not data:
                break
            with open('{}_{}.txt'.format(chunk_fname, ifile), 'w') as small_file:
                small_file.write('{} {} Sam\n'.format(len(data), number_difference(data)))
                small_file.write(''.join(data))
            ifile += 1
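One detail worth noting: because the difference is computed with floats, the header for the example data would come out as "5 10.0 Sam" rather than "5 10 Sam". If the shorter form matters, the general format specifier ("g") is one way to drop the trailing ".0". The snippet below is only an illustration of the chunking and header logic on in-memory data; chunk_header and iter_chunks are made-up names, not part of the answer above.

```python
from itertools import islice


def chunk_header(lines, label="Sam"):
    """Build '<count> <last - first> <label>' for one chunk of numeric lines."""
    first = float(lines[0].strip())
    last = float(lines[-1].strip())
    # '{:g}' prints 10.0 as '10' but keeps fractions such as 3.5
    return "{} {:g} {}".format(len(lines), last - first, label)


def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items, as islice does above."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Calling file_crop itself would look like `file_crop('big.txt', 'small_file', 1000)`, where both file names are placeholders.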
Hopefully this is an easy fix. I'm trying to edit one field of a file we use for import, however when I run the following code it leaves the file blank and 0kb. Could anyone advise what I'm doing wrong?
import re #import regex so we can use the commands
name = raw_input("Enter filename:") #prompt for file name, press enter to just open test.nhi
if len(name) < 1 : name = "test.nhi"
count = 0
fhand = open(name, 'w+')
for line in fhand:
    words = line.split(',') #obtain individual words by using split
    words[34] = re.sub(r'\D', "", words[34]) #remove non-numeric chars from string using regex
    if len(words[34]) < 1 : continue # If the 34th field is blank go to the next line
    elif len(words[34]) == 2 : "{0:0>3}".format([words[34]]) #Add leading zeroes depending on the length of the field
    elif len(words[34]) == 3 : "{0:0>2}".format([words[34]])
    elif len(words[34]) == 4 : "{0:0>1}".format([words[34]])
    fhand.write(words) #write the line
fhand.close() # Close the file after the loop ends
I used the text below, saved as 'a.txt', as input and modified your code. Please check whether it works for you.
#Initial content of a.txt
This,program,is,Java,program
This,program,is,12Python,programs
The modified code is as follows:
import re
#Reading from file and updating values
fhand = open('a.txt', 'r')
tmp_list=[]
for line in fhand:
    #Split line using ','
    words = line.split(',')
    #Remove non-numeric chars from the field at index 3 using regex
    #(index 34 in the original file; 3 is used for this small example)
    words[3] = re.sub(r'\D', "", words[3])
    #If the field is blank, leave it empty but still keep the line
    if len(words[3]) < 1 :
        #Removed continue from here: we need to reconstruct the original line and write it to file
        print("Field empty. Continue...")
    elif len(words[3]) >= 1 and len(words[3]) < 5 :
        #The format() calls in the question discard their result, so nothing is padded.
        #zfill(5) adds the required number of leading zeros depending on the length of words[3].
        words[3]=words[3].zfill(5)
    #After updating the field in the words list, rebuild the line from it.
    tmp_str = ",".join(words)
    tmp_list.append(tmp_str)
fhand.close()
#Writing to same file
whand = open("a.txt",'w')
for val in tmp_list:
    whand.write(val)
whand.close()
File content after running code
This,program,is,,program
This,program,is,00012,programs
The file mode 'w+' truncates your file to 0 bytes as soon as it is opened, so you'll only be able to read back lines that you've written.
Look at "Confused by python file mode 'w+'" for more information.
One idea would be to read the whole file first, close it, and then re-open it to write the lines back.
Not sure which OS you're on, but I think interleaving reads and writes on the same file handle without seeking in between gives unreliable results.
I guess internally the file object holds the position (try fhand.tell() to see where it is). You could probably adjust it back and forth as you went using fhand.seek(last_read_position), but really that's asking for trouble.
Also, I'm not sure how the script would ever end, as it would end up reading the stuff it had just written (in a sort of infinite loop).
Best bet is to read the entire file first:
with open(name, 'r') as f:
    lines = f.read().splitlines()
with open(name, 'w') as f:
    for l in lines:
        # ....
        f.write(something)
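To make the read-then-rewrite idea concrete, here is a sketch that fills in the elided part for the zero-padding case from the question. It is an illustration under assumptions rather than a drop-in fix: pad_numeric_field and rewrite_in_place are made-up names, the field index and width are parameters (the question uses index 34), and lines whose field ends up empty are kept rather than skipped.

```python
import re


def pad_numeric_field(line, index, width):
    """Strip non-digits from one comma-separated field and zero-pad the rest."""
    words = line.split(",")
    digits = re.sub(r"\D", "", words[index])
    if digits:
        digits = digits.zfill(width)  # e.g. '12' -> '00012' for width 5
    words[index] = digits
    return ",".join(words)


def rewrite_in_place(path, index, width):
    """Read every line first, then reopen the same file for writing."""
    with open(path, "r") as f:
        lines = f.read().splitlines()
    with open(path, "w") as f:
        for line in lines:
            # splitlines() removed the newline, so add it back
            f.write(pad_numeric_field(line, index, width) + "\n")
```

Because the file is fully read and closed before it is reopened with 'w', the truncation no longer destroys anything that still needs to be read.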
For 'Printing to a file via Python' you can use the following (Python 3; note that the file has to be opened for writing, not reading):
with open("test.txt", "w") as ofile:
    print("Some text...", file=ofile)
I need to create a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point where the first sentence of each file should be stored in the <firstsentence> element.
The files have one sentence per line, but some start with empty lines. For those files, the first non-empty line should be stored.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text=text.readline() #this line was before if!!
if text.readline() != '':
    print text.readline()
    first_sent.text = text.readline()
Currently it only stores some (seemingly random) sentence, and only for very few files.
You're calling text.readline() again instead of checking the value previously read. And you need a loop in order to skip all the blank lines.
Something resembling this should work:
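first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text = text.readline()
# readline() returns '\n' for a blank line and '' only at end of file,
# so keep reading while the line is non-empty but contains only whitespace
while first_sent.text and not first_sent.text.strip():
    first_sent.text = text.readline()
text.close()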
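The same idea can also be written as a small helper that iterates over the file instead of calling readline() repeatedly. This is just a sketch: first_nonblank_line is a made-up name, and it assumes that a line containing only whitespace counts as empty and that the surrounding whitespace should be stripped from the stored sentence.

```python
def first_nonblank_line(lines):
    """Return the first line with visible content, stripped, or None if there is none."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            return stripped
    return None
```

It could then be used as `with open(filename) as text: first_sent.text = first_nonblank_line(text)`, since a file object yields its lines when iterated.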