I have many .txt files in a directory (1620_10.asc_rsmp_1.0.txt, 132_10.asc_rsmp_1.0.txt, etc.); the leading digits of the file names are the only part that changes (for example, 1620 in the first file and 132 in the second). I want to perform some operations on these text files.
The first line of every text file is a string while the rest are floating point numbers.
Step 1: The first thing I want to do is delete the first line from all the existing text files.
Step 2: I want to convert rows to columns in all text files.
Step 3: Following that, I want to arrange the files produced in step 2 side by side, ordered by file name (132_10.asc_rsmp_1.0.txt, 1620_10.asc_rsmp_1.0.txt, ...), and save the result in a separate file.
cat 1620_10.asc_rsmp_1.0.txt
TIMESERIES ____, 5605 xxxxxxx, 1 yyy, 1969-11-31T22:52:10.000000, ZZZZZ, FLOAT,
+0.0000000000e+00 +5.8895751219e-02 +1.9720949872e-02 +4.7712552071e-02 +1.6255806150e-02 +5.0983512543e-02
+2.4151940813e-02 +4.3959767187e-02 +1.9066090517e-02 +4.8980189213e-02 +2.6237709462e-02 +4.1379166269e-02
cat 132_10.asc_rsmp_1.0.txt
TIMESERIES ____, 5605 xxxxxxx, 1 yyy, 1980-12-31T23:58:20.000000, ZZZZZ, FLOAT,
+2.0337053383e-02 +4.7575540537e-02 +2.7508078190e-02 +3.9923797852e-02 +2.1663353231e-02 +4.6368790709e-02
+2.8194768989e-02 +3.8577115641e-02 +2.1935380223e-02 +4.6024962357e-02 +2.9320681307e-02 +3.7630711188e-02
Expected output:
cat output.txt
+2.0337053383e-02 +0.0000000000e+00
+4.7575540537e-02 +5.8895751219e-02
+2.7508078190e-02 +1.9720949872e-02
+3.9923797852e-02 +4.7712552071e-02
+2.1663353231e-02 +1.6255806150e-02
+4.6368790709e-02 +5.0983512543e-02
+2.8194768989e-02 +2.4151940813e-02
+3.8577115641e-02 +4.3959767187e-02
+2.1935380223e-02 +1.9066090517e-02
+4.6024962357e-02 +4.8980189213e-02
+2.9320681307e-02 +2.6237709462e-02
+3.7630711188e-02 +4.1379166269e-02
My trial code:
with open("*.txt",'r') as f:
with open("new_file.txt",'w') as f1:
f.next() # skip header line
for line in f:
f1.write(line)
However, it does not produce the expected output. I hope the experts can help. Thanks.
It's unclear exactly what you want. This does what I think you want:
from glob import glob

# Returns a list of all relevant filenames
filenames = glob("*_10.asc_rsmp_1.0.txt")

# All the values will be stored in a dict where the key is the filename, and
# the value is a list of values
# It will be used later on to arrange the values side by side
values_by_filename = {}

# Read each filename
for filename in filenames:
    with open(filename) as f:
        with open(filename + "_processed.txt", "w") as f_new:
            # Skip the first line (header)
            next(f)
            # Add all the values on every line to a single list
            values = []
            for line in f:
                values.extend(line.split())
            # Write each value on a new line in a new file
            f_new.write("\n".join(values))
            # Store the original filename and values to a dict for later
            values_by_filename[filename] = values

# Order the filenames by the number before the first underscore
ordered_filenames = sorted(values_by_filename,
                           key=lambda filename: int(filename.split("_")[0]))

# Arrange the values side by side in a new file
# zip iterates over every list of values at once, yielding the next value
# from every list as a tuple each iteration
lines = []
for values in zip(*(values_by_filename[filename] for filename in ordered_filenames)):
    # Separate each column by 3 spaces, as per your expected output
    lines.append("   ".join(values))

# Write the concatenated values to file with a newline between each row, but
# not at the end of the file
with open("output.txt", "w") as f:
    f.write("\n".join(lines))
output.txt:
+2.0337053383e-02 +0.0000000000e+00
+4.7575540537e-02 +5.8895751219e-02
+2.7508078190e-02 +1.9720949872e-02
+3.9923797852e-02 +4.7712552071e-02
+2.1663353231e-02 +1.6255806150e-02
+4.6368790709e-02 +5.0983512543e-02
+2.8194768989e-02 +2.4151940813e-02
+3.8577115641e-02 +4.3959767187e-02
+2.1935380223e-02 +1.9066090517e-02
+4.6024962357e-02 +4.8980189213e-02
+2.9320681307e-02 +2.6237709462e-02
+3.7630711188e-02 +4.1379166269e-02
Be sure to read the documentation, in particular:
https://docs.python.org/3/library/glob.html#glob.glob
https://docs.python.org/3/library/functions.html#zip
https://docs.python.org/3/library/functions.html#sorted
https://docs.python.org/3/library/stdtypes.html#str.join
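As an aside, here is why your own attempt fails: open() does not expand wildcards, so open("*.txt", 'r') looks for a file literally named *.txt, and f.next() was removed in Python 3 (use next(f) instead). A minimal corrected sketch of your loop (the _noheader.txt output suffix is just an illustrative choice):

from glob import glob

# open() takes a literal path, so expand the wildcard with glob() first
for filename in glob("*.txt"):
    with open(filename) as f, open(filename + "_noheader.txt", "w") as f1:
        next(f)  # skip the header line (the Python 3 replacement for f.next())
        for line in f:
            f1.write(line)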
Related
I need to process console output which looks like this and make a csv from it:
ID,FLAG,ADDRESS,MAC-ADDRESS,HOST-NAME,SERVER,STATUS,LAST-SEEN
0 10.0.0.11 00:1D:72:29:F2:4F lan waiting never
;;; test comment
1 10.0.0.19 00:13:21:15:D4:00 lan waiting never
2 10.0.0.10 00:60:6E:05:0C:E0 lan waiting never
3 D 10.0.1.199 24:E9:B3:20:FA:C7 home server1 bound 4h54m52s
4 D 100.64.1.197 E6:17:AE:21:EA:00 Suzana-s-A51 dhcp1 bound 2h16m45s
I have managed to split the lines, but my regex is not working for tabs and spaces. Can someone point me in the right direction?
The code I am using is this:
import csv
import re

# Open the input file in read-only mode
with open('output.txt', 'r') as input_file:
    # Open the output file in write-only mode
    with open('output.csv', 'w') as output_file:
        # Create a CSV writer that will write to the output file
        writer = csv.writer(output_file)
        # Read the first line of the input file (the header)
        # and write it to the output file as a single value
        # (i.e. do not split it on commas)
        header = input_file.readline()
        writer.writerow([header.strip()])
        # Iterate over the remaining lines of the input file
        for line in input_file:
            # Ignore lines that start with ";;;" (these are comments)
            if line.startswith(';;;'):
                continue
            # Split the line on newlines
            values = line.split('\n')
            line = re.sub(r'[\t ]+', ',', line)
            # Iterate over the resulting values
            for i, value in enumerate(values):
                # If the value contains a comma, split it on commas
                # and assign the resulting values to the `values` list
                if ',' in value:
                    values[i:i+1] = value.split(',')
            # Write the values to the output file
            writer.writerow(values)
A regular expression can be handy here: build a mask, then capture each value from the line that was read. Pasting the regex into a regex visualiser will give you helpful visuals.
For each line we apply a regex such as:
reg_format=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:\s*)(\w*)"
Please note that when we write to csv using writer.writerow, it expects a list.
The following should work for you, and you can tweak it as needed. I tweaked your code and added comments.
Update: added masking for records.
import csv
import re

#reg_format=r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s*)(\w*)(?:\s*)(\w*)(?:\s*)(\w*)"
all_fields = r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s{1,2})([\w-]{1,14})(?:\s*?)(\w+)(?:\s*)(\w+)(?:\s*)(\w*)(?:\s*)(\w*)"
all_fields_minus_host = r"(\d*?)(?:\s+)(.*?)(?:\s)(?:\s*?)(\w*\.\w*\.\w*\.\w*)(?:\s*)(\w*?:\w*?:\w*?:\w*?:\w*?:\w*)(?:\s{1,})([\w-]{1,14})(?:\s*?)(\w+)(?:\s*)(\w+)(?:\s*)(\w*)(?:\s*)(\w*)"

# Open the input file in read-only mode
with open('testreg.txt', 'r') as input_file:
    # Open the output file in write-only mode
    with open('output.csv', 'w') as output_file:
        # Create a CSV writer that will write to the output file
        writer = csv.writer(output_file)
        # Read the first line of the input file (the header)
        # and write it to the output file as a single value
        # (i.e. do not split it on commas)
        header = input_file.readline()
        writer.writerow(header.split(','))  # split by "," as writerow needs a list
        #writer.writerow([header.strip()])
        # Iterate over the remaining lines of the input file
        for line in input_file:
            # Ignore lines that start with ";;;" (these are comments)
            if line.startswith(';;;'):
                continue
            #print(line)
            gps = re.findall(all_fields, line)
            if gps:
                line_write = ['"' + gp + '"' for gp in list(gps[0])]  # if you don't need quotes, use: gp for gp in list(gps[0])
                writer.writerow(line_write[:-1])
            else:
                gps = re.findall(all_fields_minus_host, line)
                line_write = ['"' + gp + '"' for gp in list(gps[0])]  # if you don't need quotes, use: gp for gp in list(gps[0])
                line_write.insert(4, '""')
                writer.writerow(line_write[:-2])
                #writer.writerow(line_write)
            # The original splitting code is commented out below:
            '''
            # Split the line on newlines
            values = line.split('\n')
            line = re.sub(r'[\t ]+', ',', line)
            # Iterate over the resulting values
            for i, value in enumerate(values):
                # If the value contains a comma, split it on commas
                # and assign the resulting values to the `values` list
                if ',' in value:
                    values[i:i+1] = value.split(',')
            # Write the values to the output file
            #writer.writerow(values)
            '''
I have a string within a text file that reads as one row, but I need to split the string into multiple rows based on a separator. If possible, I would like to separate the elements in the string based on the period (.) separating the different line elements listed here:
"Line 1: Element '{URL1}Decimal': 'x' is not a valid value of the atomic type 'xs:decimal'.Line 2: Element '{URL2}pos': 'y' is not a valid value of the atomic type 'xs:double'.Line 3: Element '{URL3}pos': 'y z' is not a valid value of the list type '{list1}doubleList'"
Here is my current script, which is able to read the .txt file and convert it to a csv, but does not separate each entry into its own row.
import glob
import csv
import os
path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"
with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
    stripped = (line.strip() for line in infile)
    lines = (line.split(",") for line in stripped if line)
    writer = csv.writer(outfile)
    writer.writerows(lines)
If possible, I would like to be able to just write to a .txt with multiple rows, but a .csv would also work. Any help is most appreciated!
One way to make it work:
import glob
import csv
import os
path = "C:\\Users\\mdl518\\Desktop\\txt_strip\\"
with open(os.path.join(path,"test.txt"), 'r') as infile, open(os.path.join(path,"test.csv"), 'w') as outfile:
    stripped = (line.strip() for line in infile)
    lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
    writer = csv.writer(outfile)
    writer.writerows(lines)
Explanation below:
The output is one line because the last line of your code writes out a 2D array that contains only one inner list: the entire paragraph. To visualise it, "lines" is stored as [[s1, s2, s3]], whereas writer.writerows() expects its rows as [[s1], [s2], [s3]].
There are two changes:
(1) Take the period '.' as the separator: line.split(".")
(2) Iterate over the split list inside the list comprehension:
lines = ([sent] for para in (line.split(".") for line in stripped if line) for sent in para)
str.split() splits a string at the separator and stores the pieces in a list. In your case, that list was stored inside the comprehension's output, which made it a 2D array: your paragraph was saved as [[s1, s2, s3]].
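A minimal illustration of the shape difference (writing to stdout just for demonstration):

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerows([["s1", "s2", "s3"]])      # one row with three columns: s1,s2,s3
writer.writerows([["s1"], ["s2"], ["s3"]])  # three rows with one column each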
What I'm trying to do is open two CSV files and print only the lines in which the content of a given column in file 1 matches a column in file 2. I already know that I should end up with 14 results, but instead the first line of the CSV file I'm working with gets printed 14 times. Where did I go wrong?
file1 = open("../dir/file1.csv", "r")
for line in file1:
    file1splitted = line.strip().split(",")

file2 = open("../dir/file2.csv", "r")
for line in file2:
    file2splitted = line.strip().split(",")

for line in file1:
    if file1splitted[0] == file2splitted[2]:
        print(file1splitted[0], file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])

file1.close()
file2.close()
You should be using the csv module for reading these files because splitting on commas is not reliable: a single CSV field may itself contain a comma (when the field is quoted).
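For example, a quoted field containing a comma is mangled by a naive split but parsed correctly by the csv module (a small demonstration with made-up data):

import csv
import io

row = 'id,"Doe, John",42'
print(row.split(","))                      # ['id', '"Doe', ' John"', '42'] - the quoted field is broken apart
print(next(csv.reader(io.StringIO(row))))  # ['id', 'Doe, John', '42']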
I've added a couple of things to try to make this cleaner and to help you move forward in your learning:
1. I've used the with context manager, which automatically closes a file once you're done reading it. No need for .close().
2. I've packaged the csv reading code into a function. Now we only need to write that part once, and we can call the function with any file.
3. I've used the csv module to read the file. This returns a nested list of rows, each inner list representing a single row.
4. I've used a list comprehension, which is a neater way of writing a for loop that creates a list. In this case, it's a list of all the items in the first column of file_1.
5. I've converted the list from point 4 into a set. When we iterate through file_2, we can very quickly check whether a row value has been seen in file_1 (set lookup is O(1), rather than having to iterate through file_1 every single time).
The indices I print are from my own test files; you will need to adapt them to your own use-case.
import csv

def read_csv(file_name):
    with open(file_name) as infile:  # Context manager to auto-close files at end
        reader = csv.reader(infile)
        # next(reader)  # remove the hash if you want to drop the headers
        return list(reader)

file_1 = read_csv('file_1.csv')
file_2 = read_csv('file_2.csv')

# Make a set of file_1 column 0 with a list comprehension
file_1_vals = set([item[0] for item in file_1])

# Now iterate through file_2
for row in file_2:
    if row[2] in file_1_vals:
        print(row[1])
file1 = open("../dir/file1.csv", "r")
file2 = open("../dir/file2.csv", "r")

for line in file1:
    file1splitted = line.strip().split(",")
    file2.seek(0)  # rewind file2 so it can be re-read for every line of file1
    for line in file2:
        file2splitted = line.strip().split(",")
        if file1splitted[0] == file2splitted[2]:
            print(file1splitted[0], file1splitted[1], file2splitted[6], file2splitted[10], file2splitted[12])

file1.close()
file2.close()
If you provide your csv files, then I can help you more.
I have a problem with an iteration process in Python. I've tried and searched for solutions, but I think this is more complex than my current ability (FYI, I've been writing code for one month).
The case:
Let's say I have 3 csv files (the actual count is 350): file_1.csv, file_2.csv, file_3.csv. I've already done the iteration that collects all of the filenames into a single list.
Each csv contains a single column with many rows.
i.e.
# the actual csv content looks more like this:
# for file_1.csv:
value_1
value_2
value_3
Below is not the actual csv content (I mean, I have already converted them into an array/series):
file_1.csv --> [['value_1'],['value_2'],['value_3']]
file_2.csv --> [['value_4'],['value_5']]
file_3.csv --> [['value_6']]
# first step is done: the csv file names are stored in a list, so they can be read and used in the csv function.
filename = ['file_1.csv', 'file_2.csv', 'file_3.csv']
I want the result as a list:
# assigning an empty list
result = []
Desired result
print (result)
out:
[{'keys': 'file_1', 'values': 'value_1, value_2, value_3'},
{'keys': 'file_2', 'values': 'value_4, value_5'}
{'keys': 'file_3', 'values': 'value_6'}]
See above that the result's keys no longer contain '.csv' at the end of the file name; it has been stripped from all of them. And note that the csv values (previously a list of lists / series) become one single string, separated by commas.
Any help is appreciated. Thank you very much!
I'd like to answer this to the best of my capacity (I'm a newbie too).
Step 1: Reading those 350 filenames
(If you've not figured it out already, you could use the glob module for this step.)
Define the directory where the files are placed, let's say 'C:\Test':
directory = "C:/Test"
import glob
files = sorted(glob.glob(directory + "/*.csv"))
This will collect all the CSV file names in the directory.
Step 2: Reading CSV files and mapping them to dictionaries

import os

result = []
for file in files:
    filename = str(os.path.basename(file).split('.')[0])  # removes the .csv extension from the filename
    with open(file, 'r') as infile:
        tempvalue = []
        tempdict = {}
        print(filename)
        for line in infile.readlines():
            tempvalue.append(line.strip())  # strips each line and adds it to a list of temporary values
        value = ",".join(tempvalue)  # converts the temp list to a single string
        tempdict[filename] = value  # assigns the filename as key and the contents as value in a temporary dictionary
        result.append(tempdict)  # adds the new temp dictionary for each file to the result list

print(result)
This piece of code should work (though there may be a smaller and more Pythonic version someone else can share).
Since the contents of the files are already pretty much in the format you need (bar the line endings), and you have the names of the 350 files in a list, there isn't a huge amount of processing you need to do. It's mainly a question of reading the contents of each file and stripping the newline characters.
For example:
import os

result = []
filenames = ['file_1.csv', 'file_2.csv', 'file_3.csv']

for name in filenames:
    # Set the filename minus extension as 'keys'
    file_data = {'keys': os.path.basename(name).split('.')[0]}
    with open(name) as f:
        # Read the entire file
        contents = f.read()
    # Strip the line endings (and trailing comma), and set as 'values'
    file_data['values'] = contents.replace(os.linesep, ' ').rstrip(',')
    result.append(file_data)

print(result)
I know there's a million questions on this but I couldn't find one that matches what I'm looking for. Let's say I have a text file like this:
1 34
2 65
3 106
And I want to scan this file and read only the second column, such that data=[34 65 106]. How might I go about this? Further, I'd like to make the program able to read a dataset of any length and any specific column input by the user. I can do most things in simple Python, but reading files eludes me.
pandas is a useful library for tasks such as this:
import pandas as pd
df = pd.read_csv('file.txt', header=None, delimiter=r"\s+")
lst = df.iloc[:, 1].tolist()
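Since you also mentioned letting the user pick the column, the column index can simply be a variable; a sketch along the same lines (the prompt text is made up):

import pandas as pd

col = int(input("Column index to read: "))  # e.g. 1 for the second column
df = pd.read_csv('file.txt', header=None, delimiter=r"\s+")
data = df.iloc[:, col].tolist()
print(data)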
Solution
Sounds like a case for a small helper function:
def read_col(fname, col=1, convert=int, sep=None):
    """Read text files with columns separated by `sep`.

    fname - file name
    col - index of column to read
    convert - function to convert column entry with
    sep - column separator
    If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.
    """
    with open(fname) as fobj:
        return [convert(line.split(sep=sep)[col]) for line in fobj]

res = read_col('mydata.txt')
print(res)
Output:
[34, 65, 106]
If you want the first column, i.e. at index 0:
read_col('mydata.txt', col=0)
If you want them to be floats:
read_col('mydata.txt', col=0, convert=float)
If the columns are separated by commas:
read_col('mydata.txt', sep=',')
You can use any combination of these optional arguments.
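For instance, to read the third column of a comma-separated file as floats (assuming such a file exists):

read_col('data.csv', col=2, convert=float, sep=',')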
Explanation
We define a new function with default parameters:
def read_col(fname, col=1, convert=int, sep=None):
This means you have to supply the file name fname. All other arguments are optional; the default values will be used if they are not provided when calling the function.
In the function, we open the file with:
with open(fname) as fobj:
Now fobj is an open file object. The file will be closed when we de-dent, i.e. here when we end the function.
This:
[convert(line.split(sep=sep)[col]) for line in fobj]
creates a list by going through all lines of the file. Each line is split at the separator sep. We take only the value of the column with index col. We also convert the value to the datatype given by convert, i.e. into an integer by default.
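Concretely, for a line like '1 34' from your example file:

line = '1 34'
print(line.split(sep=None))          # ['1', '34'] - split on any run of whitespace
print(int(line.split(sep=None)[1]))  # 34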
Edit
You can also skip the first line in the file:
with open(fname) as fobj:
    next(fobj)
    return [convert(line.split(sep=sep)[col]) for line in fobj]
Or more sophisticated as optional argument:
def read_col(fname, col=1, convert=int, sep=None, skip_lines=0):
    with open(fname) as fobj:
        # skip first `skip_lines` lines
        for _ in range(skip_lines):
            next(fobj)
        return [convert(line.split(sep=sep)[col]) for line in fobj]
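Called like this, assuming the first line is a header you want to skip:

res = read_col('mydata.txt', skip_lines=1)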
You can use a list comprehension:
data = [b for a, b in [i.strip('\n').split() for i in open('filename.txt')]]
You will first need to get a list of all the lines via
fileobj.readlines()
Then you can run a for loop to iterate through the lines one by one; for each line, split it on the space character (" ").
Then, in the same for loop, append the second element of the split result to an existing list, which will be your final result:
fil = open("file.txt")  # open the data file first (the file name here is illustrative)
a = fil.readlines()
t = []
for f in a:
    e = f.split(" ")
    t.append(e[1])
Is the file delimited?
You'll want to first open the file:
with open('file.txt', 'r') as f:
    filedata = f.readlines()
Create a list, loop through the lines and split each line into a list based on your delimiter, and then append the indexed item in the list to your original list.
data = []
for line in filedata:
    columns = line.split('*your delimiter*')
    data.append(columns[1])
Then the data list should contain what you want.