I have many .txt files full of data that need to be read into Python and then combined into one Excel document. I have code working to read the data, remove the first 5 lines and last two lines, and export all the txt files into an Excel file. The problem I'm running into now is that the TXT files use whitespace inconsistently, so a simple whitespace delimiter is not working as it should.
Here is an example of the txt file.
2020MAY16 215015 2.0 1004.4 30.0 2.0 2.0 2.0 NO LIMIT OFF OFF OFF OFF -25.84 -32.50 CRVBND N/A -0.0 28.52 78.54 FCST GOES16 33.4
*This is all on one line in the text file
I'd like to be able to take this and make it look like this:
2020MAY16, 215015, 2.0, 1004.4, 30.0, 2.0, 2.0, 2.0, NO LIMIT, OFF, OFF, OFF, OFF, -25.84, -32.50, CRVBND, N/A, -0.0, 28.52, 78.54, FCST, GOES16, 33.4,
I have added the portion of code below that grabs the file from the URL the user enters and iterates through the number of storms to change the URL text. It also removes the top 5 lines and bottom 2. If anyone has any suggestions on adding commas in, that would be great, to allow for an easy conversion to a CSV file later.
import urllib.request

for i in range(1, 10, 1):
    url = mod_string + str(i) + "L-list.txt"
    storm_abrev = url[54:57:1]  # Grabs the designator from the URL to allow for simplistic naming of files
    File_Name = storm_abrev + ".txt"  # Names the file
    print(url)  # Prints URL to allow user to confirm URLs are correct
    urllib.request.urlretrieve(url, File_Name)  # Requests the file from the URL above and saves it as designator.txt

    Counter = 0
    with open(File_Name, "r") as file:
        Content = file.read()
    CoList = Content.split("\n")
    for j in CoList:
        if j:
            Counter += 1
    print("This is the number of lines in the file")
    print(Counter)
    Final_Count = Counter - 2
    print(Final_Count)

    with open(File_Name, 'r') as f:
        new_lines = []
        for idx, line in enumerate(f):
            if 5 <= idx < Final_Count:  # keep everything but the first 5 and last 2 lines
                new_lines.append(line)
    with open(File_Name, "w") as f:
        for line in new_lines:
            f.write(line)
Edit: fixed the issue caught by @DarryIG.
Create a list of phrases that need to remain intact. Let's call it phrase_list.
Identify a character or string that will never appear in the input file. For example, here I am assuming that an underscore will never be found in the input file and assign it to the variable forbidden_str. We could also use something like %$#%$#^%&%_#^; the chances of something like that occurring are very slim.
Replace multiple spaces with single spaces. Then replace spaces in phrases to _ (forbidden_str). Then replace all spaces with commas. Finally, replace _s back to spaces.
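Those four steps can be folded into one small helper. This is just a sketch (it assumes, as above, that 'NO LIMIT' is the only multi-word phrase and that underscores never occur in the data), using re so that runs of spaces of any length collapse to a single space:

```python
import re

phrase_list = ['NO LIMIT']   # multi-word phrases that must stay intact
forbidden_str = '_'          # assumed never to appear in the input

def line_to_csv(line):
    # Step 1: collapse runs of whitespace into single spaces
    line = re.sub(r'\s+', ' ', line.strip())
    # Step 2: protect phrases by swapping their spaces for the forbidden character
    for phrase in phrase_list:
        line = line.replace(phrase, phrase.replace(' ', forbidden_str))
    # Step 3: turn the remaining spaces into commas
    line = line.replace(' ', ',')
    # Step 4: restore the protected spaces
    return line.replace(forbidden_str, ' ')
```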
You could also simplify the reading lines part of your code using readlines().
...
import re

phrase_list = ['NO LIMIT']
forbidden_str = "_"

with open(File_Name, 'r') as f:
    new_lines = f.readlines()

new_lines = new_lines[5:Final_Count]

with open(File_Name, "w") as f:
    for line in new_lines:
        for phrase in phrase_list:
            if phrase in line:
                line = line.replace(phrase, phrase.replace(" ", forbidden_str))
        line = re.sub(" +", " ", line)  # collapses runs of spaces into single spaces
        line = line.replace(" ", ",")
        line = line.replace(forbidden_str, " ")
        f.write(line)
Output:
2020MAY16,215015,2.0,1004.4,30.0,2.0,2.0,2.0,NO LIMIT,OFF,OFF,OFF,OFF,-25.84,-32.50,CRVBND,N/A,-0.0,28.52,78.54,FCST,GOES16,33.4,
Also, a quick suggestion: it's good practice to name variables in lower case, for example final_count instead of Final_Count. Capitalized names are conventionally reserved for things like class names and constants. It just helps with readability and debugging.
If I understand your code correctly, you have a list of lines in which you want to replace each space with a comma followed by a space. In Python you can do this very simply, like this:
lines = [x.replace(" ", ", ") for x in lines]
I have a file called test.txt.
I need to convert one string in the file which matches a key in the dictionary.
test.txt:
abc
asd
ds
{{ PRODUCT CATEGORY }}
fdsavfacxvasdvvc
dfvssfzxvdfvzd
Code is below:
data = {'PRODUCT CATEGORY':'Customer'}
all_files = ['test.txt']
out_files = ['ut.txt']
read_dict = {}
for file in all_files:
    with open(file,'r') as read_file:
        lines = read_file.readlines()
        read_dict[file] = lines
for in_f, out_f in zip(all_files, out_files):
    with open(in_f,'r') as read_file:
        lines = read_file.readlines()
    with open(out_f,'w+') as write_file:
        for line in lines:
            updated_line = []
            for word in line.split():
                if word in data:
                    updated_line.append(data[word])
                else:
                    updated_line.append(word)
            write_file.writelines(" ".join(updated_line))
            print(" ".join(updated_line))
There is a space at the end and at the beginning of PRODUCT CATEGORY in the file (it appears as {{ PRODUCT CATEGORY }}).
Expected output:
abc
asd
ds
Customer
fdsavfacxvasdvvc
dfvssfzxvdfvzd
Try this
import re

data = {'PRODUCT CATEGORY':'Customer'}
all_files = ['test.txt']
out_files = ['ut.txt']
for in_f, out_f in zip(all_files, out_files):
    with open(in_f,'r') as read_file:
        text = read_file.read()
    for word, replace_with in data.items():
        text = re.sub(r'\{+ *' + word + r' *\}+', replace_with, text)
    with open(out_f,'w+') as write_file:
        write_file.write(text)
You are splitting by whitespace, and you have a space in "PRODUCT CATEGORY", so it never finds an exact match for the word. You can see this if you add a print(word) after the for word in line.split() line.
A way to solve this is by replacing PRODUCT CATEGORY with PRODUCT_CATEGORY in data and in your test.txt file.
Also, you are missing the newline after writing each line to the output file; you should replace:
write_file.writelines(" ".join(updated_line))
with
write_file.writelines(" ".join(updated_line)+"\n")
With both these issues solved you get the desired output.
You can't loop over individual words on the input line because that prevents you from finding a dictionary key which consists of more than one word, like the one you have in your example.
Here is a simple refactoring which prints to standard output just so you can see what you are doing.
data = {'PRODUCT CATEGORY':'Customer'}
all_files = ['test.txt']

for file in all_files:
    with open(file,'r') as read_file:
        for line in read_file:
            for keyword in data:
                token = '{{ %s }}' % keyword
                if token in line:
                    line = line.replace(token, data[keyword])
            print(line, end='')
The end='' is necessary because line already contains a newline, but print wants to supply one of its own; if you write to a file instead of print, you can avoid this particular quirk. (But often, a better design for reusability is to just print to standard output, and let the caller decide what to do with the output.)
It is unclear why you had a separate read_dict for the lines in the input file or why you read the file twice, so I removed those parts.
Looping over a file a line at a time avoids reading the entire file into memory, so if you don't care what's on the previous or the next lines, this is usually a good idea, and scales to much bigger files (you only need to keep one line in memory at a time -- let's hope that one line is not several gigabytes).
Here is a demo (I took the liberty to use just one space in {{ PRODUCT CATEGORY }} and fix the spelling of "customer"): https://repl.it/repls/DarkredMindlessPackagedsoftware#main.py
I am having a silly issue whereby I have a text file with user inputs structured as follows:
x = variable1
y = variable2
and so on. I want to grab the variables. To do this I was going to import the text file and then grab out UserInputs[2], UserInputs[5], etc. I have spent a lot of time reading through how to do this, and the closest I got was with the csv package, but that resulted in just getting the '=' signs when I printed it, so I went back to using the open command and readlines and then trying to iterate through the lines and splitting by ' '.
So far I have the following code:
Text_File_Import = open('USER_INPUTS.txt', 'r')
Text_lines = Text_File_Import.readlines()
for line in Text_lines:
    User_Inputs = line.split(' ')
    print User_Inputs
However, this only outputs the first line from my text file (i.e. I get 'x', '=', 'variable1') but it never enters the next line. How would I iterate this code through the imported text file?
I have bodged it a bit for the time being and rearranged the text file to be variable1 = x and so on. This way I can still import the variable, and the x has the \n after it if I just import them with the following code:
def ReadTextFile(textfilename):
    Text_File_Import = open(textfilename, 'r')
    Text_lines = Text_File_Import.readlines()
    User_Inputs = Text_lines[1].split(' ')
    User_Inputs_clength = User_Inputs[0]
    #print User_Inputs[2] + User_Inputs_clength
    User_Inputs = Text_lines[2].split(' ')
    User_Inputs_cradius = User_Inputs[0]
    #print User_Inputs[2], ' ', User_Inputs_cradius
    return User_Inputs_clength, User_Inputs_cradius
Thanks
I don't quite understand the question. If you want to store the variables:
As long as the variables in the text file are valid Python syntax (e.g. strings surrounded by quotes), here is an easy but very insecure method:
file=open('file.txt')
exec(file.read())
It will store all the variables, with their names.
If you want to split a text file between the spaces:
file=open('file.txt')
output=file.read().split(' ')
And if you want to replace newlines with spaces:
file=open('file.txt')
output=file.read().replace('\n', ' ')
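If the goal is just to look the variables up by name, a safer alternative to exec is to parse each name = value line into a dictionary. A minimal sketch (the function name is mine), assuming one assignment per line with = as the separator:

```python
def read_inputs(path):
    """Parse lines like 'x = variable1' into a dict, skipping non-assignment lines."""
    inputs = {}
    with open(path) as f:
        for line in f:
            if '=' not in line:
                continue  # not an assignment line
            name, _, value = line.partition('=')
            inputs[name.strip()] = value.strip()
    return inputs
```

Then inputs['x'] gives 'variable1' without executing anything from the file.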
You have a lot of indentation issues. To read lines and split by space, the snippet below should help.
Demo
with open('USER_INPUTS.txt', 'r') as infile:
    data = infile.readlines()
for i in data:
    print(i.split())
New to coding and trying to figure out how to fix a broken CSV file so that I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the CSV the tooling does not add quotation marks to mark the note as a single string within the field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ';', take off the newline and stuff the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, lots of ways to do this. If you would rather use the logic of 'starts with 3 letters and a ;' and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches on every line in the txt (each string contains a valid match for the pattern), so the if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried to avoid regex here to keep it a little clearer, if possible. But regex is a better option.
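For completeness, here is a regex version of the same starts-with-three-letters-and-a-semicolon idea, written as a function over a list of lines so it can be tested (a sketch; like the approaches above it assumes every real record starts that way, and it always keeps the first line as-is):

```python
import re

def mend_csv(lines):
    """Fold lines that don't start with a 3-letter code and ';' into the previous record."""
    records = []
    for line in lines:
        line = line.rstrip('\n')
        if re.match(r'[A-Za-z]{3};', line) or not records:
            records.append(line)          # start of a new record
        else:
            records[-1] += ' ' + line     # continuation of the previous note
    return records
```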
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) in its 4th column. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!
You could then use:
with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd))
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
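To see it end to end, here is the generator applied to the sample data from the question, with io.StringIO standing in for the open file (a self-contained sketch, repeating preprocess from above so it runs on its own):

```python
import csv
import io

def preprocess(fd):
    # Fold continuation lines (no ';' in the 4th column) into the previous record
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget the last line!

sample = (
    "user;case;hours;note;date;\n"
    "tnn;123;4;solved problem;2017-11-27;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
)
rows = list(csv.DictReader(preprocess(io.StringIO(sample)), delimiter=';'))
```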
But this is only a workaround and the correct way would be that the previous step directly produces a correct CSV file.
I am attempting to pull out multiple (50-100) sequences from a large .txt file, separated by newlines ('\n'). The sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing using python 3.3
This is what I have so far:
searchfile = open('filename.txt', 'r')
cache = []
for line in searchfile:
    cache.append(line)

for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is always 5 lines below the keyword line); however, it only pulls out this one line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data, but if you're searching for each 'keyword' and then the next " character, the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)

for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then again 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re

with open('yourfile.txt') as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    # first search for keyword
    key_match = re.search(r'\((keyword)', line)
    if key_match:
        # if successful, search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"', lines[i+5])
        if seq_match:
            print(key_match.group(1) + ' ' + seq_match.group(1))
This can be done rather simply with regex:
import re

lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
Eventually, to use this sample code, we would take lines = f.readlines() from the dataset. It's important to note that we catch only things between one " and another "; if there is no " mark at the end, we will miss that data, but accounting for that isn't too difficult.
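To account for a missing closing quote, the trailing quote in the pattern can be made optional. A sketch (the function name is just for illustration):

```python
import re

def extract_response(line):
    """Return the text after the first quote, up to a closing quote if one is present."""
    match = re.search(r'"([^"]*)"?', line)
    return match.group(1) if match else None
```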
Hopefully this is an easy fix. I'm trying to edit one field of a file we use for import; however, when I run the following code it leaves the file blank and 0 KB. Could anyone advise what I'm doing wrong?
import re  # import regex so we can use the commands

name = raw_input("Enter filename:")  # prompt for file name, press enter to just open test.nhi
if len(name) < 1: name = "test.nhi"
count = 0
fhand = open(name, 'w+')
for line in fhand:
    words = line.split(',')  # obtain individual words by using split
    words[34] = re.sub(r'\D', "", words[34])  # remove non-numeric chars from string using regex
    if len(words[34]) < 1: continue  # if the 34th field is blank, go to the next line
    elif len(words[34]) == 2: "{0:0>3}".format([words[34]])  # add leading zeroes depending on the length of the field
    elif len(words[34]) == 3: "{0:0>2}".format([words[34]])
    elif len(words[34]) == 4: "{0:0>1}".format([words[34]])
    fhand.write(words)  # write the line
fhand.close()  # close the file after the loop ends
I have taken the text below in 'a.txt' as input and modified your code. Please check whether it works for you.
#Initial content of a.txt
This,program,is,Java,program
This,program,is,12Python,programs
Modified code as follows:
import re

# Reading from file and updating values
fhand = open('a.txt', 'r')
tmp_list = []
for line in fhand:
    # Split line using ','
    words = line.split(',')
    # Remove non-numeric chars from the 3rd string using regex
    words[3] = re.sub(r'\D', "", words[3])
    # Update the 3rd string
    if len(words[3]) < 1:
        # Removed 'continue' from here; we still need to reconstruct the original line and write it to the file
        print "Field empty. Continue..."
    elif len(words[3]) >= 1 and len(words[3]) < 5:
        # format won't add leading zeros here; zfill(5) adds the required number of leading zeros depending on the length of words[3]
        words[3] = words[3].zfill(5)
    # After updating the 3rd value in the words list, create a line out of it again
    tmp_str = ",".join(words)
    tmp_list.append(tmp_str)
fhand.close()

# Writing to the same file
whand = open("a.txt", 'w')
for val in tmp_list:
    whand.write(val)
whand.close()
File content after running code
This,program,is,,program
This,program,is,00012,programs
The file mode 'w+' truncates your file to 0 bytes, so you'll only be able to read lines that you've written.
Look at Confused by python file mode "w+" for more information.
An idea would be to read the whole file first, close it, and re-open it to write files in it.
Not sure which OS you're on but I think reading and writing to the same file has undefined behaviour.
I guess internally the file object holds the position (try fhand.tell() to see where it is). You could probably adjust it back and forth as you went using fhand.seek(last_read_position) but really that's asking for trouble.
Also, I'm not sure how the script would ever end as it would end up reading the stuff it had just written (in a sort of infinite loop).
Best bet is to read the entire file first:
with open(name, 'r') as f:
    lines = f.read().splitlines()

with open(name, 'w') as f:
    for l in lines:
        # ....
        f.write(something)
For 'printing to a file via Python' you can use:
ofile = open("test.txt", "w")  # the file must be opened for writing; an "r"-mode file can't be printed to
print("Some text...", file=ofile)
ofile.close()