I need to create a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point where the first sentence of each file should be stored in the firstsentence element.
The files have one sentence per line, but some start with empty lines. For the files with empty lines, the first non-empty line should be stored instead.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text = text.readline() #this line was before if!!
if text.readline() != '':
    print text.readline()
    first_sent.text = text.readline()
Currently it only stores some (seemingly random) sentence, and only for very few files.
You're calling text.readline() again instead of checking the value previously read. And you need a loop in order to skip all the blank lines.
Something resembling this should work:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text = text.readline()
while first_sent.text.strip() == '':  # blank lines still contain '\n', so strip before testing
    first_sent.text = text.readline()
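A slightly more defensive sketch (assuming et is xml.etree.ElementTree, as in the question) that also stops at end of file, so a file consisting only of blank lines cannot loop forever:
first_sent = et.SubElement(file_corpus, 'firstsentence')
with open(filename) as text:
    for line in text:
        if line.strip():  # first non-empty line found
            first_sent.text = line.strip()
            break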
I am going through tabular text files (.txt), and I am wondering: what is the best way to delete lines in a text file whose entries contain either negative numbers or blanks?
This is my current code. The main question is what Python code will write only the valid lines from the file (lines that do not contain a negative number or a blank entry). When I run this script, it still writes all the entries despite the conditions set.
import os, sys

inFile = sys.argv[1]
baseN = os.path.basename(inFile)
outFile = 'c:/exampleSolution.txt'

#if path exists, read and write file
if os.path.exists(inFile):
    inf = open(inFile,'r')
    outf = open(outFile,'w')
    #reading and writing header
    header = inf.readline()
    outf.write(header)
    not_consider = []
    lines = inf.read().splitlines()
    for i in range(0,len(lines)):
        data = lines[i].split(' ')
        for j in range(0,len(data)):
            if (data[j] == '' or int(data[j]) < 0):
                #if the line has a blank or negative value,
                #append i to the not_consider list
                not_consider.append(i)
    for i in range(0,len(lines)):
        #if i is in the not_consider list, don't write to out file
        if i not in not_consider:
            outf.write(lines[i])
            print(lines[i])
            outf.write("\n")
    inf.close()
    outf.close()
This is an example of a file input I am working with:
[screenshot of part of the chart; in this example, site nos. 2 and 4 would be deleted for having a negative number and/or missing items]
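For reference, here is a minimal sketch of the filtering logic, assuming space-separated, purely numeric data columns (a non-numeric field would need an extra check before the float() call):
with open(inFile) as inf, open(outFile, 'w') as outf:
    outf.write(inf.readline())  # copy the header unchanged
    for line in inf:
        fields = line.rstrip('\n').split(' ')
        # keep the line only if every field is present and non-negative
        if all(f != '' and float(f) >= 0 for f in fields):
            outf.write(line)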
I'm learning Python, and I've been trying to split this txt file into multiple files grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 below).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so that this works:
I adapted this code from another post, and it kind of works at breaking the input into multiple files, but it requires sorting the lines before I start writing files; I also need to copy the header into each file rather than isolating it in its own file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in input file by first part of split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create file to write to suffixed with group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in group write it to file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of five or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times and creates the new files in the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes new files for each identifier group
filename = input('type filename')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # .seek(0) 'rewinds' the file, which is apparently needed
        # when looping through a file multiple times
        filehandle.seek(0)
        # makes a new file named after the identifier
        newfile = open(str(item) + ".txt", "w+")
        # inserts the header, and goes to the next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through the old file, and adds relevant lines to the new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        # close the new file so its contents are flushed to disk
        newfile.close()

print(Unique)
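For comparison, here is a more compact single-pass sketch that collects lines per identifier in a dict and writes each group once; it assumes the header is the first line and that every other line starts with its identifier:
groups = {}
with open(filename) as fh:
    header = fh.readline()
    for line in fh:
        key = line.split()[0]
        groups.setdefault(key, []).append(line)

for key, group_lines in groups.items():
    with open(key + ".txt", "w") as out:
        out.write(header)
        out.writelines(group_lines)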
I'm trying to read a plain-text file line by line, cherry-pick the lines that begin with any six digits, pass those to a list, and then write that list row by row to a .csv file.
Here's an example of a line I'm trying to match in the file:
000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20
And here is a link to two images, one showing the above line in context of the rest of the file and the expected result: https://imgur.com/a/XHjt9e1
import re
import csv

identifier = re.compile(r'^(\d\d\d\d\d\d)')

matched_line = []
with open('file.edl', 'r') as file:
    reader = csv.reader(file)
    for line in reader:
        line = str(line)
        if identifier.search(line) == True:
            matched_line.append(line)
        else:
            continue

with open('file.csv', 'w') as outputEDL:
    print('Copying EDL contents into .csv file for reformatting...')
    outputEDL.write(str(matched_line))
The expected result would be that the reader gets to a line, searches it with the regex, and if the search finds the series of six digits at the beginning, appends that entire line to the matched_line list.
What I'm actually getting, once I write what the reader has read to a .csv file, is only [], so the regex search obviously isn't functioning properly in the way I've written this code. Any tips on how to better form it to achieve what I'm trying to do would be greatly appreciated.
Thank you.
Some more examples of expected input/output would help with solving this problem, but from what I can see you are trying to write each line within a text file that contains a timestamp to a csv. In that case, here is some pseudo-code that might help you solve your problem, along with a separate regex match function to make your code more readable.
import re
def match_time(line):
    pattern = re.compile(r'(?:\d+[:]\d+[:]\d+[:]\d+)+')
    result = pattern.findall(line)
    return " ".join(result)
This will return a string of the entire timecode if a match is found
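For example, on a line shaped like the one in the question (shortened here for illustration), the function would behave like this:
match_time("000003 ANW2248_08 A9 C 00:55:25:17 00:55:47:12")
# returns '00:55:25:17 00:55:47:12'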
lines = []
with open('yourfile.txt', 'r') as txtfile:
    with open('yourfile.csv', 'w') as csvfile:
        for line in txtfile:
            res = match_time(line)
            #alternatively you can test if res in line which might be better
            if res != "":
                lines.append(line)
        for item in lines:
            csvfile.write(item)
This opens a text file for reading; if a line contains a timecode, it appends the line to a list, then iterates that list and writes each line to the csv.
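Since the question specifically asks for lines that begin with six digits, here is a sketch that tests the match object directly; re.search returns a match object or None, never True, which is why the == True comparison in the question never succeeds:
import re

line_start = re.compile(r'^\d{6}')

matched_lines = []
with open('file.edl', 'r') as edl:
    for line in edl:
        if line_start.search(line):  # a match object is truthy, None is falsy
            matched_lines.append(line)

with open('file.csv', 'w') as out:
    out.writelines(matched_lines)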
Using Python, I need to read a file and determine whether all its lines are the same length. If they are, I move the file into a "good" folder; if they aren't, I move it into a "bad" folder and write a Word doc that says which line was not the same as the rest. Any help or ways to start?
You should use all():
with open(filename) as read_file:
    length = len(read_file.readline())
    if all(len(line) == length for line in read_file):
        pass  # Move to good folder
    else:
        pass  # Move to bad folder
Since all() is short-circuiting, it will stop reading the file at the first non-match.
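A fuller sketch, assuming shutil.move and hypothetical good/ and bad/ destination folders that already exist:
import shutil

def sort_file(filename):
    with open(filename) as read_file:
        length = len(read_file.readline())
        # all() short-circuits at the first line whose length differs
        uniform = all(len(line) == length for line in read_file)
    # "good" and "bad" are hypothetical folder names
    shutil.move(filename, "good" if uniform else "bad")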
First off, you can read the file (here example.txt) and put all its lines in a list, content:
with open(filename) as f:
    content = f.readlines()
Next you need to trim the newline character from the end of each line and put the result in another list, result:
result = []
for line in content:
    line = line.strip()
    result.append(line)
Now it's not that hard to get the length of every line, and since you want the bad lines, you loop through the list:
lengths = []
for line in result:
    lengths.append(len(line))
So the i-th element of result has length lengths[i]. We can find which line length occurs most often in the list, and it is as simple as one line!
most_occuring = max(set(lengths), key=lengths.count)
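Equivalently, collections.Counter from the standard library computes the same thing:
from collections import Counter

most_occuring = Counter(lengths).most_common(1)[0][0]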
Now we can make another for-loop to check which lengths don't match the most common one, and add those lines to bad_lines:
for i in range(len(lengths)):
if (lengths[i] != most_occuring):
bad_lines.append([i, result[i]])
The next step is to check where the file needs to go: the good folder or the bad folder:
if len(bad_lines) == 0:
    #Good file, move it to the good folder; use the os or shutil module
    os.rename("path/to/current/file.foo", "path/to/good/folder/file.foo")
else:
    #Bad file, one or more lines are bad, so move it to the bad folder
    os.rename("path/to/current/file.foo", "path/to/bad/folder/file.foo")
The last step is writing the bad lines to another file, which is doable since we already have them in the list bad_lines:
with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
It's not a doc file, but I think this is a nice start. You can take a look at the docx module if you really want to write to a doc file.
EDIT: Here is an example python script.
with open("example.txt") as f:
content = f.readlines()
result = []
lengths = []
#Strip the file of \n
for line in content:
line = line.strip()
result.append(line)
lengths.append(len(line))
most_occuring = max(set(lengths), key=lengths.count)
bad_lines = []
for i in range(len(lengths)):
if (lengths[i] != most_occuring):
#Append the bad_line to bad_lines
bad_lines.append([i, result[i]])
#Check if it's a good, or a bad file
#if len(bad_lines == 0):
#Good File
#Move file to the good folder...
#else:
#Bad File
with open("bad_lines.txt", "wb") as f:
for bad_line in bad_lines:
f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
I have a question about reading in a .txt file and using the strings inside it later on in the code.
If I have a file called 'file0.txt' and it contains:
file1.txt
file2.txt
The rest of the files either contain more string file names or are empty.
How can I save both of these strings for later use? What I attempted to do was:
infile = open(file, 'r')
line = infile.readline()
line.split('\n')
But that returned the following:
['file1.txt', '']
I understand that readline only reads one line, but I thought that by splitting it on the newline character it would also grab the next file string.
I am attempting to simulate a file tree or to show which files are connected together, but as it stands now it is only going through the first file string in each .txt file.
Currently my output is:
File 1 crawled.
File 3 crawled.
Dead end reached.
My hope was that instead of just recursively crawling the first file it would go through the entire web, but that goes back to my issue of not giving the program the second file name in the first place.
I'm not asking for a specific answer, just a push in the right direction on how to better handle the strings from the files and be able to store both of them instead of 1.
My current code is pretty ugly, but hopefully it gets the idea across; I will post it for reference to what I'm trying to accomplish.
def crawl(file):
    infile = open(file, 'r')
    line = infile.readline()
    print(line.split('\n'))
    if 'file1.txt' in line:
        print('File 1 crawled.')
        return crawl('file1.txt')
    if 'file2.txt' in line:
        print('File 2 crawled.')
        return crawl('file2.txt')
    if 'file3.txt' in line:
        print('File 3 crawled.')
        return crawl('file3.txt')
    if 'file4.txt' in line:
        print('File 4 crawled.')
        return crawl('file4.txt')
    if 'file5.txt' in line:
        print('File 5 crawled.')
        return crawl('file5.txt')
    #etc...etc...
    else:
        print('Dead end reached.')
Outside the function:
file = 'file0.txt'
crawl(file)
Using read() or readlines() will help, e.g.:
infile = open(file, 'r')
lines = infile.readlines()
print(lines)
gives
['file1.txt\n', 'file2.txt\n']
or
infile = open(file, 'r')
lines = infile.read()
print(lines.splitlines())
gives
['file1.txt', 'file2.txt']
readline only gets one line from the file, so it has a newline at the end. What you want is file.read(), which will give you the whole file as a single string; split that on newlines and you should have what you need. Also remember that you need to save the result of line.split('\n') by assigning it to a new variable. You could also just use readlines, which will give you a list of the lines in the file.
Change readline to readlines, and there's no need to split on '\n'; it's already a list.
I prepared file0.txt with two filenames in it, file1.txt with one filename in it, plus file2.txt and file3.txt, which contained no data. Note that this won't add values that are already in the list.
def get_files(current_file, files=[]):
    # Initialize the file list with previous values, or the initial value
    new_files = []
    if not files:
        new_files = [current_file]
    else:
        new_files = files
    # Read files not already in the list, into the list
    with open(current_file, "r") as f_in:
        for new_file in f_in.read().splitlines():
            if new_file not in new_files:
                new_files.append(new_file.strip())
    # Do we need to recurse?
    cur_file_index = new_files.index(current_file)
    if cur_file_index < len(new_files) - 1:
        next_file = new_files[cur_file_index + 1]
        # Recurse
        get_files(next_file, new_files)
    # We're done
    return new_files
initial_file = "file0.txt"
files = get_files(initial_file)
print(files)
Returns: ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt']
file0.txt contained:
file1.txt
file2.txt
file1.txt contained:
file3.txt
file2.txt and file3.txt were blank.
Edits: Added .strip() for safety, and added the contents of the data files so this can be replicated.
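For deep chains of files, an iterative variant with an explicit worklist avoids Python's recursion limit; this is just a sketch against the same assumed file layout as above:
def get_files_iterative(initial_file):
    seen = [initial_file]
    queue = [initial_file]
    while queue:
        current = queue.pop(0)
        with open(current) as f_in:
            for name in f_in.read().splitlines():
                name = name.strip()
                if name and name not in seen:
                    seen.append(name)
                    queue.append(name)
    return seen

print(get_files_iterative("file0.txt"))  # ['file0.txt', 'file1.txt', 'file2.txt', 'file3.txt']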