I have a file called test.txt.
I need to replace a string in the file that matches a key in my dictionary.
test.txt:
abc
asd
ds
{{ PRODUCT CATEGORY }}
fdsavfacxvasdvvc
dfvssfzxvdfvzd
Code is below:
data = {'PRODUCT CATEGORY': 'Customer'}
all_files = ['test.txt']
out_files = ['ut.txt']
read_dict = {}
for file in all_files:
    with open(file, 'r') as read_file:
        lines = read_file.readlines()
        read_dict[file] = lines
for in_f, out_f in zip(all_files, out_files):
    with open(in_f, 'r') as read_file:
        lines = read_file.readlines()
    with open(out_f, 'w+') as write_file:
        for line in lines:
            updated_line = []
            for word in line.split():
                if word in data:
                    updated_line.append(data[word])
                else:
                    updated_line.append(word)
            write_file.writelines(" ".join(updated_line))
            print(" ".join(updated_line))
Note there is a space at the beginning and at the end of PRODUCT CATEGORY, inside the braces.
Expected output:
abc
asd
ds
Customer
fdsavfacxvasdvvc
dfvssfzxvdfvzd
Try this
import re

data = {'PRODUCT CATEGORY': 'Customer'}
all_files = ['test.txt']
out_files = ['ut.txt']
for in_f, out_f in zip(all_files, out_files):
    with open(in_f, 'r') as read_file:
        text = read_file.read()
    for word, replace_with in data.items():
        text = re.sub(r'\{+ *' + word + r' *\}+', replace_with, text)
    with open(out_f, 'w') as write_file:
        write_file.write(text)
You are splitting by whitespace, and "PRODUCT CATEGORY" contains a space, so the split never yields an exact match for that key. You can see this if you add a print(word) after the for word in line.split() line.
One way to solve this is to replace PRODUCT CATEGORY with PRODUCT_CATEGORY, both in data and in your test.txt file.
Also, you are missing the newline after writing each line to the output file; you should replace:
write_file.writelines(" ".join(updated_line))
with
write_file.writelines(" ".join(updated_line)+"\n")
With both these issues solved you get the desired output.
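A minimal sketch in the spirit of both fixes, here stripping the braces around the key rather than renaming it, which avoids editing test.txt. The sample text is copied from the question; 'out.txt' is an assumed output name:

```python
# Sample input copied from the question; 'test.txt' and 'out.txt' are
# assumed file names for this sketch.
sample = "abc\nasd\nds\n{{ PRODUCT CATEGORY }}\nfdsavfacxvasdvvc\ndfvssfzxvdfvzd\n"
with open('test.txt', 'w') as f:
    f.write(sample)

data = {'PRODUCT CATEGORY': 'Customer'}

with open('test.txt') as read_file, open('out.txt', 'w') as write_file:
    for line in read_file:
        stripped = line.strip()
        # Compare the whole "{{ KEY }}" token against the dictionary,
        # instead of splitting the line on spaces.
        if stripped.startswith('{{') and stripped.endswith('}}'):
            key = stripped[2:-2].strip()
            line = data.get(key, stripped) + '\n'
        write_file.write(line)
```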
You can't loop over individual words on the input line because that prevents you from finding a dictionary key which consists of more than one word, like the one you have in your example.
Here is a simple refactoring which prints to standard output just so you can see what you are doing.
data = {'PRODUCT CATEGORY': 'Customer'}
all_files = ['test.txt']
for file in all_files:
    with open(file, 'r') as read_file:
        for line in read_file:
            for keyword in data:
                token = '{{ %s }}' % keyword
                if token in line:
                    line = line.replace(token, data[keyword])
            print(line, end='')
The end='' is necessary because line already contains a newline, but print wants to supply one of its own; if you write to a file instead of print, you can avoid this particular quirk. (But often, a better design for reusability is to just print to standard output, and let the caller decide what to do with the output.)
It is unclear why you had a separate read_dict for the lines in the input file or why you read the file twice, so I removed those parts.
Looping over a file a line at a time avoids reading the entire file into memory, so if you don't care what's on the previous or the next lines, this is usually a good idea, and scales to much bigger files (you only need to keep one line in memory at a time -- let's hope that one line is not several gigabytes).
Here is a demo (I took the liberty to use just one space in {{ PRODUCT CATEGORY }} and fix the spelling of "customer"): https://repl.it/repls/DarkredMindlessPackagedsoftware#main.py
Related
I have many .txt files full of data that need to be read into Python and then combined into one Excel document. I have code working to read the data, remove the first 5 lines and last two lines, and export all the txt files into an Excel file. The problem I'm running into now is that the TXT file uses whitespace inconsistently, so a simple whitespace delimiter is not working as it should.
Here is an example of the txt file.
2020MAY16 215015 2.0 1004.4 30.0 2.0 2.0 2.0 NO LIMIT OFF OFF OFF OFF -25.84 -32.50 CRVBND N/A -0.0 28.52 78.54 FCST GOES16 33.4
*This is all on one line in the text file
I'd like to be able to take this and make it look like this:
2020MAY16, 215015, 2.0, 1004.4, 30.0, 2.0, 2.0, 2.0, NO LIMIT, OFF, OFF, OFF, OFF, -25.84, -32.50, CRVBND, N/A, -0.0, 28.52, 78.54, FCST, GOES16, 33.4,
I have added the portion of code below that grabs the file from the URL the user enters and iterates through the storms to change the URL text. It also removes the top 5 lines and the bottom 2. If anyone has suggestions on adding commas to allow for an easy conversion to a CSV file later, that would be great.
for i in range(1, 10, 1):
    url = mod_string + str(i) + "L-list.txt"
    storm_abrev = url[54:57:1]  # Grabs the designator from the URL to allow for simplistic naming of files
    File_Name = storm_abrev + ".txt"  # Names the file
    print(url)  # Prints URL to allow user to confirm URLs are correct
    urllib.request.urlretrieve(url, File_Name)  # Sends a request to the URL above, grabs the file, and saves it as designator.txt
    file = open(File_Name, "r")
    Counter = 0
    Content = file.read()
    CoList = Content.split("\n")
    for j in CoList:
        if j:
            Counter += 1
    print("This is the number of lines in the file")
    print(Counter)
    Final_Count = Counter - 2
    print(Final_Count)
    with open(File_Name, 'r') as f:
        new_lines = []
        for idx, line in enumerate(f):
            if idx in [x for x in range(5, Final_Count)]:
                new_lines.append(line)
    with open(File_Name, "w") as f:
        for line in new_lines:
            f.write(line)
Edit: Fixed the issue caught by #DarryIG.
Create a list of phrases that need to remain intact. Let's call it phrase_list.
Identify a character or string that will never appear in the input file. For example, here I assume an underscore will never be found in the input and assign it to the variable forbidden_str. We could also use something like %$#%$#^%&%_#^; the chances of that occurring are vanishingly small.
Replace multiple spaces with single spaces. Then replace spaces in phrases to _ (forbidden_str). Then replace all spaces with commas. Finally, replace _s back to spaces.
You could also simplify the reading lines part of your code using readlines().
...
phrase_list = ['NO LIMIT']
forbidden_str = "_"

with open(File_Name, 'r') as f:
    new_lines = f.readlines()

new_lines = new_lines[5:Final_Count]

with open(File_Name, "w") as f:
    for line in new_lines:
        for phrase in phrase_list:
            if phrase in line:
                line = line.replace(phrase, phrase.replace(" ", forbidden_str))
        while "  " in line:
            line = line.replace("  ", " ")  # collapse runs of spaces to single spaces
        line = line.replace(" ", ",")
        line = line.replace(forbidden_str, " ")
        f.write(line)
Output:
2020MAY16,215015,2.0,1004.4,30.0,2.0,2.0,2.0,NO LIMIT,OFF,OFF,OFF,OFF,-25.84,-32.50,CRVBND,N/A,-0.0,28.52,78.54,FCST,GOES16,33.4,
Also, a quick suggestion: it's good practice to name variables in lower case, e.g. final_count instead of Final_Count; capitalized names are conventionally reserved for classes. It just helps with readability and debugging.
If I understand your code correctly, you have a list of lines where you want to replace each space with a comma followed by a space. In Python you can do this very simply:
lines = [x.replace(" ", ", ") for x in lines]
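A quick check of that one-liner on sample data; note it also inserts a comma inside multi-word fields such as "NO LIMIT", so it only fits rows whose fields contain no embedded spaces:

```python
# Demonstrating the list comprehension on two sample lines adapted from
# the question's data.
lines = ["2020MAY16 215015 2.0", "OFF OFF -25.84"]
lines = [x.replace(" ", ", ") for x in lines]
print(lines)
```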
I'm trying to read a plaintext file line by line, cherry-pick lines that begin with a pattern of any six digits, collect those into a list, and then write that list row by row to a .csv file.
Here's an example of a line I'm trying to match in the file:
000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20
And here is a link to two images, one showing the above line in context of the rest of the file and the expected result: https://imgur.com/a/XHjt9e1
import re
import csv

identifier = re.compile(r'^(\d\d\d\d\d\d)')
matched_line = []

with open('file.edl', 'r') as file:
    reader = csv.reader(file)
    for line in reader:
        line = str(line)
        if identifier.search(line) == True:
            matched_line.append(line)
        else:
            continue

with open('file.csv', 'w') as outputEDL:
    print('Copying EDL contents into .csv file for reformatting...')
    outputEDL.write(str(matched_line))
Expected result would be the reader gets to a line, searches using the regex, then if the result of the search finds the series of 6 numbers at the beginning, it appends that entire line to the matched_line list.
What I'm actually getting, once I write what reader has read to a .csv file, is just [], so the regex search obviously isn't functioning the way I've written this code. Any tips on how to better form it to achieve what I'm trying to do would be greatly appreciated.
Thank you.
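One likely culprit in the snippet above: re.search returns a Match object or None, never the literal True, so the comparison "== True" is always False. A minimal sketch of the corrected check, using a hypothetical sample_lines list in place of the file and skipping csv.reader:

```python
import re

# re.search returns a Match object (truthy) or None (falsy), so test the
# result directly instead of comparing it to True.
identifier = re.compile(r'^(\d\d\d\d\d\d)')

# Hypothetical stand-in for the lines of file.edl.
sample_lines = [
    "000003 ANW2248_08_DESOLATE-WASTELAND-3.  A9  C 00:55:25:17 00:55:47:12",
    "TITLE: something else",
]
matched_line = [line for line in sample_lines if identifier.search(line)]
print(matched_line)
```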
Some more examples of expected input/output would help with solving this problem, but from what I can see you are trying to write each line of a text file that contains a timestamp to a csv. Here is some pseudo code that might help, along with a separate regex match function to make your code more readable:
import re

def match_time(line):
    pattern = re.compile(r'(?:\d+[:]\d+[:]\d+[:]\d+)+')
    result = pattern.findall(line)
    return " ".join(result)
This will return a string of the entire timecode if a match is found
lines = []
with open('yourfile.txt', 'r') as txtfile, open('yourfile.csv', 'w') as csvfile:
    for line in txtfile:
        res = match_time(line)
        # alternatively you can test "if res in line", which might be better
        if res != "":
            lines.append(line)
    for item in lines:
        csvfile.write(item)
This opens a text file for reading; if a line contains a timecode, it appends the line to a list, then iterates over that list and writes each line to the csv.
New to coding and trying to figure out how to fix a broken CSV file so I can work with it properly.
So the file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the CSV the tooling does not add quotation marks to mark the note as a single field.
see below example:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re

with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line  # keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line):  # regex to match 3 letters
            print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line):
            line = re.sub("\n", "", line)
        all_the_data = "".join([all_the_data, line])
print(all_the_data)
There a several ways to do this each with pros and cons, but I think this keeps it simple.
Loop the file as you have done, and if the line doesn't end in a date and a semicolon, take off the carriage return and stuff the line into all_the_data. That way you don't have to play with looking back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re

all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
print("results:")
print(all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the txt file (the continuation lines also start with three letters), so the if condition is never true and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(';'.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    print(";".join(join_words))
I've tried not to use regex here to keep it a little clearer. But regex is a better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter would concatenate a line to the previous one if it does not have a semicolon (;) as its 4th character. Code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        if line[3] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous  # don't forget last line!
You could then use:
import csv

with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time next() is applied to it, so a generator is appropriate.
But this is only a workaround and the correct way would be that the previous step directly produces a correct CSV file.
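To illustrate that last point: the csv module's writer quotes fields containing newlines, so the export step could produce a file any CSV reader parses directly. The sample row below is adapted from the question, and 'fixed.csv' is an assumed file name:

```python
import csv

# The note field contains a newline; csv.writer quotes it so a standard
# CSV reader round-trips the row intact.
rows = [
    ["tnn", "125", "3", "I am writing a comment\nthat contains new lines", "2017-11-28"],
]
with open('fixed.csv', 'w', newline='') as f:
    csv.writer(f, delimiter=';').writerows(rows)

with open('fixed.csv', newline='') as f:
    parsed = list(csv.reader(f, delimiter=';'))
print(parsed)
```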
Hopefully this is an easy fix. I'm trying to edit one field of a file we use for import, but when I run the following code it leaves the file blank and 0 KB. Could anyone advise what I'm doing wrong?
import re  # import regex so we can use the commands

name = raw_input("Enter filename:")  # prompt for file name; press enter to just open test.nhi
if len(name) < 1:
    name = "test.nhi"
count = 0
fhand = open(name, 'w+')
for line in fhand:
    words = line.split(',')  # obtain individual words by using split
    words[34] = re.sub(r'\D', "", words[34])  # remove non-numeric chars from string using regex
    if len(words[34]) < 1:
        continue  # if the 34th field is blank go to the next line
    elif len(words[34]) == 2:
        "{0:0>3}".format([words[34]])  # add leading zeroes depending on the length of the field
    elif len(words[34]) == 3:
        "{0:0>2}".format([words[34]])
    elif len(words[34]) == 4:
        "{0:0>1}".format([words[34]])
    fhand.write(words)  # write the line
fhand.close()  # close the file after the loop ends
I have taken the text below in 'a.txt' as input and modified your code. Please check whether it works for you.
#Intial Content of a.txt
This,program,is,Java,program
This,program,is,12Python,programs
Modified code as follow:
import re

# Reading from file and updating values
fhand = open('a.txt', 'r')
tmp_list = []
for line in fhand:
    # Split line using ','
    words = line.split(',')
    # Remove non-numeric chars from the 4th field using regex
    words[3] = re.sub(r'\D', "", words[3])
    # If the 4th field is blank, we still reconstruct and write the line
    if len(words[3]) < 1:
        # Removed "continue" from here: we need to reconstruct the original line and write it to the file
        print "Field empty. Continue..."
    elif len(words[3]) >= 1 and len(words[3]) < 5:
        # format won't add leading zeros; zfill(5) adds the required number of leading zeros depending on the length of words[3]
        words[3] = words[3].zfill(5)
    # After updating the 4th value in the words list, recreate the line
    tmp_str = ",".join(words)
    tmp_list.append(tmp_str)
fhand.close()

# Writing to same file
whand = open("a.txt", 'w')
for val in tmp_list:
    whand.write(val)
whand.close()
File content after running code
This,program,is,,program
This,program,is,00012,programs
The file mode 'w+' truncates your file to 0 bytes, so you'll only be able to read lines that you've written after that point.
Look at Confused by python file mode "w+" for more information.
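A quick way to see the truncation in action, using a throwaway file name:

```python
# 'demo.txt' (a throwaway name) starts with content, but opening it in
# 'w+' empties it before the read even happens.
with open('demo.txt', 'w') as f:
    f.write("some existing content\n")

with open('demo.txt', 'w+') as f:
    contents = f.read()  # nothing left to read: 'w+' truncated the file

print(repr(contents))
```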
An idea would be to read the whole file first, close it, and re-open it to write files in it.
Not sure which OS you're on but I think reading and writing to the same file has undefined behaviour.
I guess internally the file object holds the position (try fhand.tell() to see where it is). You could probably adjust it back and forth as you went using fhand.seek(last_read_position) but really that's asking for trouble.
Also, I'm not sure how the script would ever end as it would end up reading the stuff it had just written (in a sort of infinite loop).
Best bet is to read the entire file first:
with open(name, 'r') as f:
    lines = f.read().splitlines()

with open(name, 'w') as f:
    for l in lines:
        # ....
        f.write(something)
For 'printing to a file via Python' you can use:
ofile = open("test.txt", "w")  # open for writing, not reading
print("Some text...", file=ofile)
ofile.close()
I have a .txt file in Cyrillic. Its structure is like this, but in Cyrillic:
city text text text.#1#N
river, text text.#3#Name (Name1, Name2, Name3)
lake text text text.#5#N (Name1)
mountain text text.#23#Na
What I need:
1) look at the first word in a line
2) if it is "river" then write all words after "#3#", i.e. Name (Name1, Name2, Name3) in a file 'river'.
I have to do the same for the other leading words: city, lake, mountain.
What I have done so far only checks whether the first word is "city" and saves the whole line to a file:
lines = f.readlines()
for line in lines:
    if line.startswith('city'):
        f2.write(line)
f.close()
f2.close()
I know I can use a regex like #[0-9]+#(\W+) to find the Names, but I don't know how to work it into my code.
I'd be glad for any help.
If all of your rivers have commas after them, like in the example you posted, I would do something like:
for line in f.readlines():
    items = line.split(",")
    if items[0] == "river":
        names = line.split("#")[2].strip().split("(")[1].split(")")[0].split(",")
        # names is now ['Name1', ' Name2', ' Name3']
        # .. now write each one
What you want to do here is avoid hard-coding the names of the files you need. Instead, glean them from the input file. Create a dictionary of the files you need to write to, opening each one as it's needed. Something like this (untested and probably in need of some adaptation):
outfiles = {}
try:
    with open("infile.txt") as infile:
        for line in infile:
            tag = line.split(" ", 1)[0].strip("*, ")  # e.g. "river"
            if tag not in outfiles:  # if it's the first time we've seen a tag
                outfiles[tag] = open(tag + ".txt", "w")  # open tag.txt for writing
            content = line.rsplit("#", 1)[-1].strip("* \n")
            outfiles[tag].write(content + "\n")
finally:
    for outfile in outfiles.itervalues():
        outfile.close()