I have a folder of 295 text files with each one containing a couple rows of data that I need to condense. I want to open one file, grab two lines from that file, and combine those two lines into another text file I've created, then close the data file and repeat for the next.
I currently have a for loop that does most of this, but the problem I run into is that it writes the same text from the first file 295 times. How do I get it to move on to the next item in the file list? This is my first program in Python, so I'm pretty new.
My code:
import os

filelist = os.listdir('/colors')
colorlines = []

for x in filelist:
    with open('/colors/' + x, 'rt') as colorfile:  # opens a single txt file for reading text
        for colorline in colorfile:  # reads each line of the txt file into the list
            colorlines.append(colorline.rstrip('\n'))  # removes the trailing newline
        tup = (colorlines[1], colorlines[3])  # combines the second and fourth lines into a tuple
        str = ''.join(tup)  # joins the tuple into a string with no space between the two
        print(str)
        newtext = open("colorcode_rework.txt", "a")  # opens the output file for the reworked data
        newtext.write(str + '\n')  # writes the string and inserts a new line
        newtext.close()
    colorfile.close()
You need to reset the colorlines list for each file. As you are indexing specific items in the list (1 and 3), you always read the same two lines even though more items have been appended.
To reset the colorlines list for each file, move the initialisation inside the loop:
for x in filelist:
    colorlines = []
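Putting the fix together, a minimal sketch of the corrected loop, wrapped in a function so it can be pointed at any folder (the function name and the sorted order are my additions):

```python
import os

def condense(src_dir, out_path):
    """For every text file in src_dir, join its second and fourth
    lines and append the result as one line to out_path."""
    with open(out_path, 'a') as newtext:
        for name in sorted(os.listdir(src_dir)):
            colorlines = []  # reset for each file, so indices 1 and 3 refer to THIS file
            with open(os.path.join(src_dir, name)) as colorfile:
                for colorline in colorfile:
                    colorlines.append(colorline.rstrip('\n'))
            newtext.write(colorlines[1] + colorlines[3] + '\n')
```

Called as `condense('/colors', 'colorcode_rework.txt')`, this produces one output line per input file.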
I'm learning Python. I've been trying to split this txt file into multiple files, grouped by a sliced string at the beginning of each line.
Currently I have two issues:
1 - The string can have 5 or 6 chars and is marked by a space at the end (as in WSON33 and JHSF3 etc...).
Here is an example of the file I would like to split (the first line is a header):
H24/06/202000003TORDISTD
BWSON33 0803805000000000016400000003250C000002980002415324C1 0000000000000000
BJHSF3 0804608800000000003500000000715V000020280000031810C1 0000000000000000
2 - I've come up with a lot of code, but I'm not able to put everything together so this can work:
This code here I adapted from another post, and it kind of works, breaking the input into multiple files, but it requires sorting the lines before I start writing files. I also need to copy the header into each file and not isolate it in its own file.
import itertools

with open('tordist.txt', 'r') as fin:
    # group each line in the input file by the first part of its split
    for i, (k, g) in enumerate(itertools.groupby(fin, lambda l: l.split()[0]), 1):
        # create a file to write to, suffixed with the group number - start = 1
        with open('{0} tordist.txt'.format(i), 'w') as fout:
            # for each line in the group, write it to the file
            for line in g:
                fout.write(line.strip() + '\n')
So from what I can gather, you have a text file with many lines, where every line begins with a short string of five or six characters. It sounds like you want all the lines that begin with the same string to go into the same file, so that after the code is run you have as many new files as there are unique starting strings. Is that accurate?
Like you, I'm fairly new to Python, so I'm sure there are more compact ways to do this. The code below loops through the file a number of times, and writes the new files to the same folder as your text and Python files.
# code which separates lines in a file by an identifier,
# and makes a new file for each identifier group
filename = input('type filename: ')
if len(filename) < 1:
    filename = "mk_newfiles.txt"
filehandle = open(filename)

# This chunk loops through the file, looking at the beginning of each line,
# and adding it to a list of identifiers if it is not on the list already.
Unique = list()
for line in filehandle:
    # like Lalit said, split is a simple way to separate a longer string
    line = line.split()
    if line[0] not in Unique:
        Unique.append(line[0])

# For each item in the list of identifiers, this code goes through
# the file, and if a line starts with that identifier then it is
# added to a new file.
for item in Unique:
    # this 'if' skips the header, which has a '/' in it
    if '/' not in item:
        # .seek(0) 'rewinds' the file, which is needed
        # when looping through a file multiple times
        filehandle.seek(0)
        # makes a new file
        newfile = open(str(item) + ".txt", "w+")
        # inserts the header, and goes to the next line
        newfile.write(Unique[0])
        newfile.write('\n')
        # goes through the old file, and adds relevant lines to the new file
        for line in filehandle:
            split_line = line.split()
            if item == split_line[0]:
                newfile.write(line)
        newfile.close()

filehandle.close()
print(Unique)
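If the multiple passes bother you, here is a single-pass sketch of the same idea using a dictionary of open files (the function name is my own; it assumes the first line is the header and the identifier is the first whitespace-separated token, as in your sample):

```python
def split_by_prefix(in_path):
    """Single pass: route each line to a file named after its first token,
    writing the header line at the top of every new file."""
    outfiles = {}
    with open(in_path) as fin:
        header = next(fin)  # first line is the header
        for line in fin:
            key = line.split()[0]
            if key not in outfiles:
                # first time we see this identifier: open its file, write header
                outfiles[key] = open(key + '.txt', 'w')
                outfiles[key].write(header)
            outfiles[key].write(line)
    for f in outfiles.values():
        f.close()
```

Because lines are routed as they are read, the input does not need to be sorted first.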
I'm working on a program that reads in lists from text files using nested loops. Right now it goes to a file via an inputted file path, then opens the file and converts everything in it into a list. This works fine.
However, before it loops again and does the same for the rest of the inputted files, I want it to automatically take the 6th-from-last element of each list it just made and put that into another list, without overwriting anything. This is what I have so far:
listpath = "/Users/myname/Documents/list.txt"
lstfull = []
lstreduced = []

with open(listpath, "r") as flp:
    file_list = flp.readlines()  # makes a list from the file paths
    fp = [x.strip() for x in file_list]  # comprehension that removes \n from the strings

for i in fp:  # list of file paths
    with open(i, "r") as f:  # opens each file from the file path
        for line in f:
            lstfull.append(line)  # takes each line and appends it to the list
        six = len(lstfull) - 6  # this is the element from each of the files I want
        lstreduced.append(lstfull[six])
list.txt is just a text file with a list of file paths, so that I can have the code work from anywhere.
The last line, lstreduced.append(lstfull[six]), is where I run into issues.
I only want lstreduced to be made up of the 6th-from-last element of each of the inputted lists, but I get the error IndexError: list index out of range. Does anyone know how to fix this?
The sixth element from the end of a list can be accessed with negative indexing:
lstfull[-6]
This avoids IndexErrors as long as every list has at least six elements.
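For example, on a hypothetical seven-element list:

```python
lst = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(lst[-6])  # prints 'b': the sixth element counting from the end
```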
Is it possible that your code has an indentation error? It would make more sense if you were to do the following:
with open(listpath, "r") as flp:
    file_list = flp.readlines()  # makes a list from the file paths
    fp = [x.strip() for x in file_list]  # comprehension that removes \n from the strings

for i in fp:  # list of file paths
    with open(i, "r") as f:  # opens each file from the file path
        for line in f:
            lstfull.append(line)  # takes each line and appends it to the list
        lstreduced.append(lstfull[-6])
I am a beginner in the programming world and would like some tips on how to solve a challenge.
Right now I have ~10 000 .dat files each with a single line following this structure:
Attribute1=Value&Attribute2=Value&Attribute3=Value...AttibuteN=Value
I have been trying to use python and the CSV library to convert these .dat files into a single .csv file.
So far I have been able to write something that reads all the files, stores the contents of each file on a new line, and replaces each "&" with ",". But since Attribute1, Attribute2...AttributeN are exactly the same for every file, I would like to turn them into column headers and remove them from every other line.
Any tips on how to go about that?
Thank you!
Since you are a beginner, I prepared some code that works, and is at the same time very easy to understand.
I assume that you have all the files in the folder called 'input'. The code beneath should be in a script file next to the folder.
Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.
You might want to check additionally what happens when a value is missing in some line, what happens when an attribute is missing, what happens with a corrupted input etc.. :)
Good luck!
import os

# this function splits the attribute=value pairs into two lists:
# the first list holds all the attributes,
# the second list holds all the values
def getAttributesAndValues(line):
    attributes = []
    values = []
    # first we split the input over the &
    attributeValues = line.split('&')
    for attrVal in attributeValues:
        # we split each attribute=value over the '=' sign:
        # the attribute goes to split[0], the value goes to split[1]
        split = attrVal.split('=')
        attributes.append(split[0])
        values.append(split[1])
    # return the attributes list and values list
    return attributes, values

# test the function using the line beneath so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttibuteN=Value"
# print(getAttributesAndValues(line))

# this function appends a single input file to the output file
def writeToCsv(inFile='', wfile="outFile.csv", delim=","):
    f_in = open(inFile, 'r')   # only reading the file
    f_out = open(wfile, 'a+')  # file is opened for reading and appending
    # read the whole file line by line
    lines = f_in.readlines()
    # loop through every line in the file and write its values
    for line in lines:
        # check whether the output file is empty; in append mode the read
        # position starts at the end, so rewind before peeking at it
        f_out.seek(0)
        first_char = f_out.read(1)
        header, values = getAttributesAndValues(line)
        # we write the header only if the file is empty
        if not first_char:
            f_out.write(delim.join(header))
            f_out.write("\n")
        # we write the values
        f_out.write(delim.join(values))
        f_out.write("\n")
    f_in.close()
    f_out.close()

# read all the files in the path
allInputFiles = os.listdir('input/')

# loop through all the files and write their values to the csv file
for singleFile in allInputFiles:
    writeToCsv('input/' + singleFile)
but since the Attribute1,Attribute2...AttributeN are exactly the same
for every file, I would like to make them into column headers and
remove them from every other line.
line = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'
Once, for the first file:
','.join(k for (k, v) in map(lambda s: s.split('='), line.split('&')))
For each file's content:
','.join(v for (k, v) in map(lambda s: s.split('='), line.split('&')))
Maybe you need to trim the strings additionally; I don't know how clean your input is.
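A sketch that combines the two expressions into a CSV writer (the function name is mine; it assumes every line carries the same attributes in the same order):

```python
import csv

def dat_lines_to_csv(lines, out_path):
    """Write one CSV: a header row from the first line's attribute names,
    then one row of values per input line."""
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        first = True
        for line in lines:
            # split 'A=1&B=2' into [['A', '1'], ['B', '2']]
            pairs = [s.split('=', 1) for s in line.strip().split('&')]
            if first:
                writer.writerow([k for k, v in pairs])  # header, once
                first = False
            writer.writerow([v for k, v in pairs])
```

You would feed it one line per .dat file, e.g. the first line of each file in the input folder.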
Put the .dat files in a folder called myDats. Put this script next to the myDats folder; it creates temp.txt and output.csv as it runs. [That is, you will end up with output.csv, temp.txt, myDats, and mergeDats.py in the same folder.]
mergeDats.py
import csv
import os

# dump every attribute=value pair, one per line, into temp.txt
g = open("temp.txt", "w")
for file in os.listdir('myDats'):
    f = open("myDats/" + file, "r")
    tempData = f.readlines()[0]
    tempData = tempData.replace("&", "\n")
    g.write(tempData + "\n")
    f.close()
g.close()

# rebuild a dict from the dumped pairs
h = open("temp.txt", "r")
arr = h.read().split("\n")
my_dict = {}
for x in arr:
    if "=" not in x:  # skip blank lines
        continue
    temp2 = x.split("=")
    my_dict[temp2[0]] = temp2[1]
h.close()

with open('output.csv', 'w', newline='') as output:  # use 'wb' in Python 2.x
    w = csv.DictWriter(output, my_dict.keys())
    w.writeheader()
    w.writerow(my_dict)
I should create a script that collects data about the files in a certain folder and stores it in an XML file, which is created during the process.
I am stuck at the point at which the first sentence of each file should be stored in .
The files have one sentence per line, but some start with empty lines. For the files with empty lines, it needs to store the first non-empty line.
This is my attempt:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text=text.readline() #this line was before if!!
if text.readline() != '':
print text.readline()
first_sent.text = text.readline()
Currently it only some (random) sentence for very few files.
You're calling text.readline() again instead of checking the value previously read. And you need a loop in order to skip all the blank lines.
Something resembling this should work:
first_sent = et.SubElement(file_corpus, 'firstsentence')
text = open(filename, 'rU')
first_sent.text = text.readline()
# a blank line still contains '\n', so strip before comparing;
# readline() returns '' only at end of file, which ends the loop
while first_sent.text != '' and first_sent.text.strip() == '':
    first_sent.text = text.readline()
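As a variant, the blank-line skipping can be wrapped in a small helper (the function name is my own) that returns the first non-empty line, or an empty string if there is none:

```python
def first_nonblank_line(path):
    """Return the first line of the file that is not empty after
    stripping whitespace; return '' if every line is blank."""
    with open(path) as f:
        for line in f:
            if line.strip():
                return line.strip()
    return ''
```

You would then assign `first_sent.text = first_nonblank_line(filename)`.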
Using pdftotext, a text file was created that includes footers from the source pdf. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:
This is important text.
9
Title 2012 and 2013
\fCompany
Important text begins again.
The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:
import re

report = open('file.txt').readlines()
data = range(len(report))
name = []
for line_i in data:
    line = report[line_i]
    if re.match('.*\\x0cCompany', line):
        name.append(report[line_i])
print(name)
This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.
Instead of iterating through the lines and collecting the indices of the ones you want to delete, iterate through the lines and append only the ones you want to keep.
It is also more efficient to iterate over the file object itself, rather than reading everything into one list first:
import re

keeplines = []
with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3]  # shave off the three preceding lines
        else:
            keeplines.append(line)

with open('file.txt', 'w') as f:
    for line in keeplines:
        f.write(line)
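The same idea as a self-contained function over a list of lines, which is easier to test (the function name is mine; the three-line footer layout is taken from your description):

```python
import re

def strip_footers(lines):
    """Drop each line matching the \x0c Company marker, plus the
    three lines before it (page number, title, blank line)."""
    kept = []
    for line in lines:
        if re.match('.*\x0cCompany', line):
            kept = kept[:-3]  # discard the three preceding footer lines
        else:
            kept.append(line)
    return kept
```

Because the match line itself is never appended, each footer removes four lines in total, matching the format in the question.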