I have a string within a text file that reads as one row, but I need to split the string into multiple rows based on a separator. If possible, I would like to separate the elements in the string based on the period (.) separating the different line elements listed here:
"Line 1: Element '{URL1}Decimal': 'x' is not a valid value of the atomic type 'xs:decimal'.Line 2: Element '{URL2}pos': 'y' is not a valid value of the atomic type 'xs:double'.Line 3: Element '{URL3}pos': 'y z' is not a valid value of the list type '{list1}doubleList'"
Here is my current script that is able to read the .txt file and convert it to a csv, but it does not separate each entry into its own row.
import glob
import csv
import os
path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"
with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
    stripped = (line.strip() for line in infile)
    lines = (line.split(",") for line in stripped if line)
    writer = csv.writer(outfile)
    writer.writerows(lines)
If possible, I would like to be able to just write to a .txt with multiple rows but a .csv would also work - Any help is most appreciated!
One way to make it work:
import glob
import csv
import os
path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"
with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
    stripped = (line.strip() for line in infile)
    lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
    writer = csv.writer(outfile)
    writer.writerows(lines)
Explanation below:
The output is one line because the last statement writes out a 2D array that contains only a single row: the whole paragraph. To visualise it, lines is stored as [[s1,s2,s3]], whereas writer.writerows() expects its rows in the form [[s1],[s2],[s3]].
Two improvements are needed:
(1) Use the period '.' as the separator: line.split(".")
(2) Iterate over the split list inside the generator expression.
lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
str.split() splits a string on a separator and stores the pieces in a list. In your case that list ended up nested inside the generator's output, making a 2D array: the whole paragraph was saved as [[s1,s2,s3]].
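A minimal standalone sketch of the same flattening idea (the sample string s1.s2.s3 stands in for a stripped line of the file):

```python
stripped = ["s1.s2.s3"]  # stands in for the stripped lines of the file

# Nested comprehension: the inner split produces ['s1', 's2', 's3'],
# and the outer loop wraps each sentence in its own one-element row.
lines = [[sent] for para in (line.split(".") for line in stripped if line)
         for sent in para]
print(lines)  # [['s1'], ['s2'], ['s3']]
```

Each inner one-element list becomes one row when passed to writer.writerows().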
I need to process console output which looks like this and make a csv from it:
ID,FLAG,ADDRESS,MAC-ADDRESS,HOST-NAME,SERVER,STATUS,LAST-SEEN
0 10.0.0.11 00:1D:72:29:F2:4F lan waiting never
;;; test comment
1 10.0.0.19 00:13:21:15:D4:00 lan waiting never
2 10.0.0.10 00:60:6E:05:0C:E0 lan waiting never
3 D 10.0.1.199 24:E9:B3:20:FA:C7 home server1 bound 4h54m52s
4 D 100.64.1.197 E6:17:AE:21:EA:00 Suzana-s-A51 dhcp1 bound 2h16m45s
I have managed to split lines but regex is not working for tabs and spaces. Can someone point me in the right direction?
The code I am using is this:
import csv
import re
# Open the input file in read-only mode
with open('output.txt', 'r') as input_file:
    # Open the output file in write-only mode
    with open('output.csv', 'w') as output_file:
        # Create a CSV writer that will write to the output file
        writer = csv.writer(output_file)
        # Read the first line of the input file (the header)
        # and write it to the output file as a single value
        # (i.e. do not split it on commas)
        header = input_file.readline()
        writer.writerow([header.strip()])
        # Iterate over the remaining lines of the input file
        for line in input_file:
            # Ignore lines that start with ";;;" (these are comments)
            if line.startswith(';;;'):
                continue
            # Split the line on newlines
            values = line.split('\n')
            line = re.sub(r'[\t ]+', ',', line)
            # Iterate over the resulting values
            for i, value in enumerate(values):
                # If the value contains a comma, split it on commas
                # and assign the resulting values to the `values` list
                if ',' in value:
                    values[i:i+1] = value.split(',')
            # Write the values to the output file
            writer.writerow(values)
A regular expression can be handy here: build a mask and capture each value from every line as it is read. Pasting the pattern into a regex visualiser (e.g. regex101) gives a helpful breakdown.
So for each line we apply the mask reg_format=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:\s*)(\w*)"
Please note that writer.writerow expects a list.
The following should work for you, and you can tweak it as needed. I tweaked your code and added comments.
Update:
Added masking for records
import csv
import re
#reg_format=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:\s*)(\w*)"
all_fields=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s{1,2})([\w-]{1,14})(?:\s*?)(\w+)(?:\s*)(\w+)(?:\s*)(\w*)(?:\s*)(\w*)"
all_fields_minus_host=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s{1,})([\w-]{1,14})(?:\s*?)(\w+)(?:\s*)(\w+)(?:\s*)(\w*)(?:\s*)(\w*)"
# Open the input file in read-only mode
with open('testreg.txt', 'r') as input_file:
    # Open the output file in write-only mode
    with open('output.csv', 'w') as output_file:
        # Create a CSV writer that will write to the output file
        writer = csv.writer(output_file)
        # Read the first line of the input file (the header)
        header = input_file.readline()
        writer.writerow(header.split(','))  # split on "," since writerow needs a list
        #writer.writerow([header.strip()])
        # Iterate over the remaining lines of the input file
        for line in input_file:
            # Ignore lines that start with ";;;" (these are comments)
            if line.startswith(';;;'):
                continue
            gps = re.findall(all_fields, line)
            if gps:
                # drop the quotes with: [gp for gp in list(gps[0])]
                line_write = ['"' + gp + '"' for gp in list(gps[0])]
                writer.writerow(line_write[:-1])
            else:
                # fall back to the mask without the HOST-NAME field
                gps = re.findall(all_fields_minus_host, line)
                line_write = ['"' + gp + '"' for gp in list(gps[0])]
                line_write.insert(4, '""')  # insert an empty HOST-NAME column
                writer.writerow(line_write[:-2])
        # The newline-splitting block from the original code is no longer needed.
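If the regex mask feels heavy, a token-count heuristic is another way to cope with the optional FLAG and HOST-NAME columns. This is only a sketch: it assumes HOST-NAME never contains spaces and that FLAG, when present, is a single capital letter; parse_row is a made-up helper name, not part of the code above.

```python
import re

def parse_row(line):
    # Hypothetical heuristic: FLAG, when present, is a single capital letter.
    tokens = line.split()
    flag = host = ""
    if re.fullmatch(r"[A-Z]", tokens[1]):
        flag = tokens.pop(1)
    # After removing FLAG we have either 6 tokens (no HOST-NAME) or 7.
    if len(tokens) == 7:
        host = tokens[3]
        rest = tokens[4:]
    else:
        rest = tokens[3:]
    # Column order: ID, FLAG, ADDRESS, MAC-ADDRESS, HOST-NAME, SERVER, STATUS, LAST-SEEN
    return [tokens[0], flag, tokens[1], tokens[2], host] + rest

print(parse_row("3 D 10.0.1.199 24:E9:B3:20:FA:C7 home server1 bound 4h54m52s"))
print(parse_row("0 10.0.0.11 00:1D:72:29:F2:4F lan waiting never"))
```

Each call returns an eight-element list that can be passed straight to writer.writerow(); a line with 7 tokens and no flag would be ambiguous under this heuristic, which is where the regex masks above are more robust.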
Many text files with .txt extensions are present in a directory (1620_10.asc_rsmp_1.0.txt, 132_10.asc_rsmp_1.0.txt, etc.), and only the first few digits of the file names change (for example 1620 in the first file and 132 in the second). I want to perform some operations on these text files.
The first line of every text file is a string, while the rest are floating-point numbers.
Step 1: Delete the first line from every text file.
Step 2: Convert rows to columns in every text file.
Step 3: Arrange the columns produced in Step 2 side by side, ordered by file name (132_10.asc_rsmp_1.0.txt, 1620_10.asc_rsmp_1.0.txt, ...), and save the result in a separate file.
cat 1620_10.asc_rsmp_1.0.txt
TIMESERIES ____, 5605 xxxxxxx, 1 yyy, 1969-11-31T22:52:10.000000, ZZZZZ, FLOAT,
+0.0000000000e+00 +5.8895751219e-02 +1.9720949872e-02 +4.7712552071e-02 +1.6255806150e-02 +5.0983512543e-02
+2.4151940813e-02 +4.3959767187e-02 +1.9066090517e-02 +4.8980189213e-02 +2.6237709462e-02 +4.1379166269e-02
cat 132_10.asc_rsmp_1.0.txt
TIMESERIES ____, 5605 xxxxxxx, 1 yyy, 1980-12-31T23:58:20.000000, ZZZZZ, FLOAT,
+2.0337053383e-02 +4.7575540537e-02 +2.7508078190e-02 +3.9923797852e-02 +2.1663353231e-02 +4.6368790709e-02
+2.8194768989e-02 +3.8577115641e-02 +2.1935380223e-02 +4.6024962357e-02 +2.9320681307e-02 +3.7630711188e-02
Expected output:
cat output.txt
+2.0337053383e-02 +0.0000000000e+00
+4.7575540537e-02 +5.8895751219e-02
+2.7508078190e-02 +1.9720949872e-02
+3.9923797852e-02 +4.7712552071e-02
+2.1663353231e-02 +1.6255806150e-02
+4.6368790709e-02 +5.0983512543e-02
+2.8194768989e-02 +2.4151940813e-02
+3.8577115641e-02 +4.3959767187e-02
+2.1935380223e-02 +1.9066090517e-02
+4.6024962357e-02 +4.8980189213e-02
+2.9320681307e-02 +2.6237709462e-02
+3.7630711188e-02 +4.1379166269e-02
My trial code:
with open("*.txt",'r') as f:
    with open("new_file.txt",'w') as f1:
        f.next() # skip header line
        for line in f:
            f1.write(line)
However, it does not produce the expected output (open() does not expand wildcards, and f.next() is Python 2 syntax). Any help from the experts is appreciated. Thanks.
It's unclear exactly what you want. This does what I think you want:
from glob import glob

# Returns a list of all relevant filenames
filenames = glob("*_10.asc_rsmp_1.0.txt")

# All the values will be stored in a dict where the key is the filename, and
# the value is a list of values
# It will be used later on to arrange the values side by side
values_by_filename = {}

# Read each filename
for filename in filenames:
    with open(filename) as f:
        with open(filename + "_processed.txt", "w") as f_new:
            # Skip the first line (header)
            next(f)
            # Add all the values on every line to a single list
            values = []
            for line in f:
                values.extend(line.split())
            # Write each value on a new line in a new file
            f_new.write("\n".join(values))
            # Store the original filename and values to a dict for later
            values_by_filename[filename] = values

# Order the filenames by the number before the first underscore
ordered_filenames = sorted(values_by_filename,
                           key=lambda filename: int(filename.split("_")[0]))

# Arrange the values side by side in a new file
# zip iterates over every list of values at once, yielding the next value
# from every list as a tuple each iteration
lines = []
for values in zip(*(values_by_filename[filename] for filename in ordered_filenames)):
    # Separate each column by a space, as per your expected output
    lines.append(" ".join(values))

# Write the concatenated values to file with a newline between each row, but
# not at the end of the file
with open("output.txt", "w") as f:
    f.write("\n".join(lines))
output.txt:
+2.0337053383e-02 +0.0000000000e+00
+4.7575540537e-02 +5.8895751219e-02
+2.7508078190e-02 +1.9720949872e-02
+3.9923797852e-02 +4.7712552071e-02
+2.1663353231e-02 +1.6255806150e-02
+4.6368790709e-02 +5.0983512543e-02
+2.8194768989e-02 +2.4151940813e-02
+3.8577115641e-02 +4.3959767187e-02
+2.1935380223e-02 +1.9066090517e-02
+4.6024962357e-02 +4.8980189213e-02
+2.9320681307e-02 +2.6237709462e-02
+3.7630711188e-02 +4.1379166269e-02
Be sure to read the documentation, in particular:
https://docs.python.org/3/library/glob.html#glob.glob
https://docs.python.org/3/library/functions.html#zip
https://docs.python.org/3/library/functions.html#sorted
https://docs.python.org/3/library/stdtypes.html#str.join
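As a quick standalone check of the numeric sort key used above (the 2_10... name is a made-up extra file, added because it shows where a plain string sort goes wrong):

```python
names = ["1620_10.asc_rsmp_1.0.txt", "2_10.asc_rsmp_1.0.txt",
         "132_10.asc_rsmp_1.0.txt"]

# Lexicographic order compares character by character, so "2..." sorts last.
print(sorted(names))

# The numeric key converts the leading digits to int, giving 2 < 132 < 1620.
print(sorted(names, key=lambda n: int(n.split("_")[0])))
```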
I want to use lines of strings of a .txt file as search queries in other .txt files. But before this, I need to slice those strings of the lines of my original text data. Is there a simple way to do this?
This is my original .txt data:
CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12
CHEMBL1957458|MUBD_HDAC2_ligandset|mol2|58|dock10
CHEMBL251144|MUBD_HDAC2_ligandset|mol2|41|dock98
CHEMBL269935|MUBD_HDAC2_ligandset|mol2|30|dock58
... (over thousands)
And I need to have a new file where the new lines contain only part of those strings, like:
CHEMBL2057820
CHEMBL1957458
CHEMBL251144
CHEMBL269935
Open the file, read in the lines, split each line at the | character, and take the first element:
with open("test.txt") as f:
    parts = (line.lstrip().split('|', 1)[0] for line in f)
    with open('dest.txt', 'w') as dest:
        dest.write("\n".join(parts))
Explanation:
lstrip removes leading whitespace from the line.
split("|") returns a list like ['CHEMBL2057820', 'MUBD_HDAC2_ligandset', 'mol2', '42', 'dock12'] for each line.
Since we're only concerned with the first section, it's redundant to split the rest of the line on the | character, so we can pass a maxsplit argument, which stops splitting once that many splits have been made.
So split("|", 1) gives ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12'].
Since we're only interested in the first part, split("|", 1)[0] returns the "CHEMBL..." section.
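A runnable illustration of maxsplit on one of the sample lines:

```python
line = "CHEMBL2057820|MUBD_HDAC2_ligandset|mol2|42|dock12"

# With maxsplit=1, only the first | is split on, leaving two pieces.
print(line.split("|", 1))     # ['CHEMBL2057820', 'MUBD_HDAC2_ligandset|mol2|42|dock12']
print(line.split("|", 1)[0])  # CHEMBL2057820
```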
Use split and readlines (note the output file must be opened in write mode, and a newline added after each entry):

with open('foo.txt') as f, open('bar.txt', 'w') as g:
    for line in f.readlines():
        g.write(line.strip().split('|')[0] + '\n')
I have a .csv file with many lines and with the structure:
YYY-MM-DD HH first_name quantity_number second_name first_number second_number third_number
I have a script in Python to convert the separator from space to comma, and that works fine.
import csv
with open('file.csv') as infile, open('newfile.dat', 'w') as outfile:
    for line in infile:
        outfile.write(" ".join(line.split()).replace(' ', ','))
I need change, in the newfile.dat, the position of each value, for example put the HH value in position 6, the second_name value in position 2, etc.
Thanks in advance for your help.
Since you're importing csv, you might as well use it:
import csv
with open('file.csv', newline='') as infile, open('newfile.dat', 'w+', newline='') as outfile:
    read = csv.reader(infile, delimiter=' ')
    write = csv.writer(outfile)  # defaults to excel format, i.e. commas
    for line in read:
        write.writerow(line)
Use newline='' when opening csv files, otherwise you get double spaced files.
This just writes the line as it is in the input. If you want to change it before writing, do it in the for line in read: loop. line is a list of strings, which you can change the order of in any number of ways.
One way to reorder the values is to use operator.itemgetter:
from operator import itemgetter
getter = itemgetter(5, 4, 3, 2, 1, 0)  # this will reverse a six-element list

for line in read:
    write.writerow(getter(line))
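A quick standalone illustration of how itemgetter behaves (note that with multiple indexes it returns a tuple, which writerow accepts just like a list):

```python
from operator import itemgetter

getter = itemgetter(5, 4, 3, 2, 1, 0)
print(getter(["a", "b", "c", "d", "e", "f"]))  # ('f', 'e', 'd', 'c', 'b', 'a')
```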
To reorder the items, a basic way could be as follows (inside the for line in infile: loop):

    split_line = line.split(" ")
    column_mapping = [0, 4, 2, 3, 5, 1, 6, 7]
    reordered = [split_line[c] for c in column_mapping]
    joined = ",".join(reordered)
    outfile.write(joined + "\n")

This splits up the string, reorders it according to column_mapping, and then combines it back into one comma-separated string. Every entry in column_mapping must be a valid index into split_line for your row length (eight fields here, so indexes 0 through 7). Define column_mapping outside the loop to avoid reinitialising it on every line.
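For example, with a made-up row in the question's field order (date, HH, first_name, quantity_number, second_name, first_number, second_number, third_number), a mapping that puts second_name in position 2 and HH in position 6 could look like this:

```python
row = "2020-01-02 13 anna 7 bela 1.1 2.2 3.3"  # made-up sample values
fields = row.split(" ")

# Output position -> input index: position 2 gets field 4 (second_name),
# position 6 gets field 1 (HH); everything else keeps its relative order.
column_mapping = [0, 4, 2, 3, 5, 1, 6, 7]
reordered = [fields[c] for c in column_mapping]
print(",".join(reordered))  # 2020-01-02,bela,anna,7,1.1,13,2.2,3.3
```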
A CSV returns the following values
"1,323104,564382"
"2,322889,564483"
"3,322888,564479"
"4,322920,564425"
"5,322942,564349"
"6,322983,564253"
"7,322954,564154"
"8,322978,564121"
How would I take the " marks off each end of the rows? It seems to create individual columns when I do this.
reader=[[i[0].replace('\'','')] for i in reader]
does not change the file at all
It seems strictly easier to peel the quotes off first, and then feed it to the csv reader, which simply takes any iterable over lines as input.
import csv
import sys

with open(sys.argv[1]) as f:
    contents = f.read().replace('"', '')

reader = csv.reader(contents.splitlines())
for x, y, z in reader:
    print(x, y, z)
Assuming every line is wrapped in double quotes, we can do this (the trailing newline must be stripped before slicing, otherwise [1:-1] removes the newline instead of the closing quote):

f = open("filename.csv", "r")
newlines = []
for line in f:  # we could use a list comprehension, but for simplicity, we won't
    newlines.append(line.strip()[1:-1])
f.close()

f2 = open("filename.csv", "w")
for index, line in enumerate(newlines):
    f2.write(line + "\n")
f2.close()
[1:-1] is a slicing operation that takes from the second character of the string through to the character before the last, the positions represented by the indexes 1 and -1.
enumerate() is a helper function that turns an iterable into (0, first_element), (1, second_element), ... pairs.
Iterating over a file yields its lines.
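A tiny demonstration of the slicing step on one raw line (newline included, as it would be when read from the file):

```python
s = '"1,323104,564382"\n'   # one raw line from the file

# strip() removes the trailing newline first; [1:-1] then drops both quotes.
print(s.strip()[1:-1])  # 1,323104,564382
```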