I have 10 tab-delimited .txt files in a folder. Each has three columns (numbers only) preceded by a 21-line header (text and numbers). In order to process them further, I would like to:
Choose the second column from all text files (starting after the 21-line header; I attached a figure with an arrow), convert the comma decimal separator into a point, and stack these columns from the 10 files into a new tab-delimited/CSV file.
I know very little scripting. I have RStudio and Python and have tried to fiddle around a bit, but I really have no clue what to do. Since I have to process multiple folders, my work would be really simplified if this could be automated.
Reference figure
From your requirements it sounds like this Python code should do the trick:
import os
import glob

DIR = "path/to/your/directory"
OUTPUT_FILE = "path/to/your/output.csv"
HEADER_SIZE = 21

input_files = glob.glob(os.path.join(DIR, "*.txt"))
for input_file in input_files:
    print("Now processing", input_file)
    # read the file
    with open(input_file, "r") as h:
        contents = h.readlines()
    # drop the header
    contents = contents[HEADER_SIZE:]
    # grab the 2nd column
    column = []
    for row in contents:
        # stop at the footer
        if "####" in row:
            break
        split = row.split("\t")
        if len(split) >= 2:
            column.append(split[1])
    # replace the comma decimal separator with a point
    column_replaced = [row.replace(",", ".") for row in column]
    # append to the output file
    with open(OUTPUT_FILE, "a") as h:
        h.write("\n".join(column_replaced))
        h.write("\n")  # end on a newline
Note that this will discard everything that wasn't part of the second column in the output file.
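Since you mention having to process multiple folders, here is a minimal sketch of how the same logic could be wrapped in a loop over folders. The folder list and the per-folder output name are hypothetical placeholders; the parsing itself mirrors the script above:
import os
import glob
from itertools import takewhile

FOLDERS = ["path/to/folder1", "path/to/folder2"]  # hypothetical list of folders
HEADER_SIZE = 21

for folder in FOLDERS:
    # one combined output file per folder, written next to the input files
    output_path = os.path.join(folder, "combined.csv")
    for input_file in glob.glob(os.path.join(folder, "*.txt")):
        with open(input_file, "r") as h:
            contents = h.readlines()[HEADER_SIZE:]
        # keep rows until the footer marker, as in the script above
        rows = takewhile(lambda r: "####" not in r, contents)
        column = [r.split("\t")[1] for r in rows if len(r.split("\t")) >= 2]
        with open(output_path, "a") as out:
            out.write("\n".join(v.replace(",", ".") for v in column))
            out.write("\n")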
The R code below is not an exact solution, but if you follow its general outline you will get close to what you need.
output <- "NewFileName.txt"
old_dir <- setwd("your/folder")
files <- list.files(pattern = "\\.txt$")  # the pattern must be passed as the 'pattern' argument
df_list <- lapply(files, read.table, skip = 21, sep = "\t")
x <- lapply(df_list, '[[', 2)    # extract the second column of each file
x <- gsub(",", ".", unlist(x))   # comma decimal separator -> point
write.table(x, output, row.names = FALSE, col.names = FALSE, quote = FALSE)
setwd(old_dir)
lines = []
filename = "my_text"
with open(filename, "r") as infile:
    for line in infile:
        res = line.replace(",", ".")
        lines.append(res)
        print(res)
# write the converted lines back to the same file
with open(filename, "w") as outfile:
    for item in lines:
        outfile.write(item)
I have an Excel file that contains values which I would like to write to text as shown on the right side of the image below. I have been doing this by hand, but it is very tedious. I have tried using Python, but I am getting frustrated with my limited knowledge of it so far. Thanks for the help.
For those who can't see it, I would like it output as this:
[wind#]
Height=
Direction=
Velocity=
You could export your Excel file to .csv (I hope you can figure out how to do this on your own) and get back something like this:
height,direction,speed
1,2,3
3,2,1
With the following .py script you can take the input file (which is in CSV format) and transform it to your output, where input.csv is your CSV file residing in the same folder as your script and output.txt is the file which will hold your result.
f = open('input.csv', 'r')
g = open('output.txt', 'w')

# The header line must be kept separately, since we will reuse it for every record
first_line = f.readline()
headers = first_line.split(',')
headers[-1] = headers[-1].strip()
length = len(headers)

# Capitalize each header word.
for i in range(length):
    headers[i] = headers[i].capitalize()

counter = 1
for line in f:
    values = line.split(',')
    values[-1] = values[-1].strip()  # remove the EOL character
    g.write('[Wind' + str(counter) + ']' + "\n")
    for i in range(length):
        g.write(headers[i] + "=" + values[i] + "\n")
    counter += 1

g.close()
f.close()
input:
height,direction,speed
1,2,3
3,2,1
output:
[Wind1]
Height=1
Direction=2
Speed=3
[Wind2]
Height=3
Direction=2
Speed=1
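As a side note, a slightly more robust variant would let the csv module do the splitting, which also copes with quoted fields. A minimal sketch along the same lines, using the same hypothetical input.csv/output.txt names:
import csv

with open('input.csv', 'r') as f, open('output.txt', 'w') as g:
    reader = csv.DictReader(f)  # takes the field names from the first row
    for counter, row in enumerate(reader, start=1):
        g.write('[Wind%d]\n' % counter)
        for name in reader.fieldnames:
            g.write('%s=%s\n' % (name.capitalize(), row[name]))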
I have been working on a Python script to parse a single delimited column in a CSV file. However, the column has multiple different delimiters and I can't figure out how to handle this.
I have another script that works on similar data, but can't get this one to work. The data below is in a single column on the row. I want the script to parse these out and add tabs in between each. Then I want to append this data into a list with only the unique items. Typically I am dealing with several hundred rows of this data and would like to parse the entire file and then return only the unique items in two columns (one for IPs and the other for URLs).
Data to parse: 123.123.123.123::url.com,url2.com,234.234.234.234::url3.com (note ":" and "," are used as delimiters on the same line)
Script I am working with:
import sys
import csv

csv_file = csv.DictReader(open(sys.argv[1], 'rb'), delimiter=':')
uniq_rows = []
for column in csv_file:
    X = column[' IP'].split(':')[-1]
    row = X + '\t'
    if row not in uniq_rows:
        uniq_rows.append(row)
for row in uniq_rows:
    print row
Does anyone know how to accomplish what I am trying to do?
Change the list (uniq_rows = []) to a set (uniq_rows = set()):
csv_file = csv.DictReader(open(sys.argv[1], 'rU'), delimiter=':')
uniq_rows = set()
for column in csv_file:
    X = column[' IP'].split(':')[-1]
    row = X + '\t'
    uniq_rows.add(row)
for row in list(uniq_rows):
    print row
If you need further help, leave a comment
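One trade-off worth noting: a set does not preserve the order in which rows first appeared. If order matters, a small variation on the loop above (sketched here with collections.OrderedDict, available since Python 2.7) keeps it:
from collections import OrderedDict

uniq_rows = OrderedDict()
for column in csv_file:
    X = column[' IP'].split(':')[-1]
    uniq_rows[X + '\t'] = None  # the keys act as an ordered set

for row in uniq_rows:
    print row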
You can also just use replace to change your input lines (not overly Pythonic, I guess, but it's a standard builtin):
>>> a = "123.123.123.123::url.com,url2.com,234.234.234.234::url3.com"
>>> a = a.replace(',','\t')
>>> a = a.replace(':','\t')
>>> print (a)
123.123.123.123 url.com url2.com 234.234.234.234 url3.com
>>>
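If you prefer a single pass over each line, here is a small sketch using re.split to treat both delimiters at once (this swaps in the re module, which the answers above did not use):
>>> import re
>>> a = "123.123.123.123::url.com,url2.com,234.234.234.234::url3.com"
>>> print ('\t'.join(re.split(r'[:,]+', a)))  # split on runs of ':' or ','
123.123.123.123 url.com url2.com 234.234.234.234 url3.com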
As mentioned in a comment, here is a simple text manipulation to get you (hopefully) the right output prior to removing duplicates:
read_raw_file = open('D:filename.csv')  # open the current file
read_raw_text = read_raw_file.read()
read_raw_file.close()

new_text = read_raw_text.strip()
new_text = new_text.replace(',', '\t')
# new_text = new_text.replace('::', '\t')  # optional, if you want a double ':' to yield only one column
new_text = new_text.replace(':', '\t')
text_list = new_text.split('\n')

unique_items = []
for row in text_list:
    if row not in unique_items:
        unique_items.append(row)

new_file = 'D:newfile.csv'
with open(new_file, 'w') as write_output_file:  # generate the new file
    for item in unique_items:
        write_output_file.write(item + '\n')
I work with spatial data that is output to text files with the following format:
COMPANY NAME
P.O. BOX 999999
ZIP CODE , CITY
+99 999 9999
23 April 2013 09:27:55
PROJECT: Link Ref
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Design DTM is 30MB 2.5X2.5
Stripping applied to design is 0.000
Point Number Easting Northing R.L. Design R.L. Difference Tol Name
3224808 422092.700 6096059.380 2.520 -19.066 -21.586 --
3224809 422092.200 6096059.030 2.510 -19.065 -21.575 --
<Remainder of lines>
3273093 422698.920 6096372.550 1.240 -20.057 -21.297 --
Average height difference is -21.390
RMS is 21.596
0.00 % above tolerance
98.37 % below tolerance
End of Report
As shown, the files have a header and a footer. The data is delimited by spaces, but not by the same number of spaces between the columns.
What I need is comma-delimited files with Easting, Northing and Difference.
I'd like to avoid having to modify several hundred large files by hand, so I am writing a small script to process the files. This is what I have so far:
#! /usr/bin/env python
import csv, glob, os
from itertools import islice

list_of_files = glob.glob('C:/test/*.txt')
for filename in list_of_files:
    (short_filename, extension) = os.path.splitext(filename)
    print short_filename
    file_out_name = short_filename + '_ed' + extension
    with open(filename, 'rb') as source:
        reader = csv.reader(source)
        for row in islice(reader, 10, None):
            file_out = open(file_out_name, 'wb')
            writer = csv.writer(file_out)
            writer.writerows(reader)
    print 'Created file: ' + file_out_name
    file_out.close()
print 'All done!'
Questions:
How can I let the line starting with 'Point number' become the header in the output file? I'm trying to put DictReader in place of the reader/writer bit but can't get it to work.
Writing the output file with delimiter ',' does work, but it writes a comma in place of each space, giving far too many empty columns in my output file. How do I circumvent this?
How do I remove the footer?
I can see a problem with your code: you are creating a new writer for each row, so you will end up with only the last one.
Your code could be something like this, without the need for CSV readers or writers, as the format is simple enough to be parsed as plain text (problems would arise if you had text columns, with escaped characters and so on).
def process_file(source, dest):
    header_found = False
    for line in source:
        line = line.strip()
        if not header_found:
            # ignore everything until we find this text
            header_found = line.startswith('Point Number')
        elif not line:
            return  # we are done when we find an empty line, I guess
        else:
            # write the needed columns
            columns = line.split()
            dest.write(','.join(columns[i] for i in (1, 2, 5)) + '\n')

for filename in list_of_files:
    short_filename, extension = os.path.splitext(filename)
    file_out_name = short_filename + '_ed' + extension
    with open(filename, 'r') as source:
        with open(file_out_name, 'w') as dest:
            process_file(source, dest)
This worked:
#! /usr/bin/env python
import glob, os

list_of_files = glob.glob('C:/test/*.txt')

def process_file(source, dest):
    header_found = False
    for line in source:
        line = line.strip()
        if not header_found:
            # ignore everything until we find this text
            header_found = line.startswith('Stripping applied')  # otherwise, the header line is lost
        elif not line:
            return  # we are done when we find an empty line
        else:
            # write the needed columns
            columns = line.split()
            dest.write(','.join(columns[i] for i in (1, 2, 5)) + "\n")  # adding the newline character was necessary

for filename in list_of_files:
    short_filename, extension = os.path.splitext(filename)
    file_out_name = short_filename + '_ed' + ".csv"
    with open(filename, 'r') as source:
        with open(file_out_name, 'wb') as dest:
            process_file(source, dest)
To answer your first and last questions: it is simply a matter of ignoring the corresponding lines, i.e. not writing them to the output. This corresponds to the if not header_found and elif not line: blocks of fortran's proposal.
The second point is that there is no single dedicated delimiter in your file: you have one or more spaces between the columns, which makes it hard to parse using the csv module. Calling split() without arguments parses each line into its non-blank fields, and will therefore only return the useful values.
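To illustrate the difference (a quick interpreter session, not specific to these files):
>>> line = "3224808   422092.700  6096059.380   2.520"
>>> line.split(' ')   # splits on every single space, keeping empty strings
['3224808', '', '', '422092.700', '', '6096059.380', '', '', '2.520']
>>> line.split()      # splits on runs of whitespace, no empty strings
['3224808', '422092.700', '6096059.380', '2.520']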
I have a large number of text files containing data arranged into a fixed number of rows and columns, the columns being separated by spaces (like a .csv, but using spaces as the delimiter). I want to extract a given column from each of these files and write it into a new text file.
So far I have tried:
results_combined = open('ResultsCombined.txt', 'wb')

def combine_results():
    for num in range(2, 10):
        f = open("result_0." + str(num) + "_.txt", 'rb')  # all the text files have similar filename styles
        lines = f.readlines()  # read in the data
        no_lines = len(lines)  # get the number of lines
        for i in range(0, no_lines):
            column = lines[i].strip().split(" ")
            results_combined.write(column[5] + " " + '\r\n')
        f.close()

if __name__ == "__main__":
    combine_results()
This produces a text file containing the data I want from the separate files, but as a single column. (i.e. I've managed to 'stack' the columns on top of each other, rather than have them all side by side as separate columns). I feel I've missed something obvious.
In another attempt, I manage to write all the separate files to a single file, but without picking out the columns that I want.
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'wb')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip())
        fout.write(' ')
    fout.write('\n')
fout.close()
What I basically want is to copy column 5 from each file (it is always the same column) and write them all to a single file.
If you don't know the maximum number of rows in the files and if the files can fit into memory, then the following solution would work:
import glob

files = [open(f) for f in glob.glob("*.txt")]

# Given a file, read the 6th column from each line
def readcol5(f):
    return [line.split(' ')[5] for line in f]

filecols = [readcol5(f) for f in files]
maxrows = len(max(filecols, key=len))

# Given an array, make sure it has maxrows number of elements.
def extendmin(arr):
    diff = maxrows - len(arr)
    arr.extend([''] * diff)
    return arr

filecols = map(extendmin, filecols)
lines = zip(*filecols)
lines = map(lambda x: ','.join(x), lines)
lines = '\n'.join(lines)

fout = open('output.csv', 'wb')
fout.write(lines)
fout.close()
Or this option (following your second approach):
import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'w')
for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip().split(' ')[5])
        fout.write(' ')
    fout.write('\n')
fout.close()
... which uses a fixed number of rows per file but will work for very large numbers of rows because it is not storing the intermediate values in memory. For moderate numbers of rows, I'd expect the first answer's solution to run more quickly.
Why not read all the entries from each file's 5th column into a list and, after reading in all the files, write them all to the output file? In pseudocode (a fleshed-out sketch follows below):
data = [
    [],  # entries from first file
    [],  # entries from second file
    ...
]

for i in range(number_of_rows):
    outputline = []
    for vals in data:
        outputline.append(vals[i])
    outfile.write(" ".join(outputline))
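A fleshed-out version of that idea might look like the following (a sketch, assuming the result_*.txt naming and the column at index 5, as in the question's code):
import glob

# collect the column at index 5 from every file
data = []
for fname in sorted(glob.glob("result_*.txt")):
    with open(fname) as f:
        data.append([line.split()[5] for line in f if line.strip()])

# write the collected columns side by side, one row per line;
# rows beyond the shortest file are dropped
number_of_rows = min(len(col) for col in data)
with open("ResultsCombined.txt", "w") as outfile:
    for i in range(number_of_rows):
        outfile.write(" ".join(col[i] for col in data) + "\n")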
I have 125 data files containing two columns and 21 rows of data (please see the image attached to the original post), and I'd like to import them into a single .csv file (as 250 columns and 21 rows).
I am fairly new to Python, but this is what I have been advised, code-wise:
import glob

Results = [open(f) for f in glob.glob("*.data")]
fout = open("res.csv", 'w')
for row in range(21):
    for f in Results:
        fout.write(f.readline().strip())
        fout.write(',')
    fout.write('\n')
fout.close()
However, there is a slight problem with the code, as I only get 125 columns (i.e. the force and displacement columns are written in one column; please refer to the image attached to the original post).
I'd very much appreciate it if anyone could help me with this!
import glob

results = [open(f) for f in glob.glob("*.data")]

sep = ","
# Uncomment if your Excel formats decimal numbers like 3,14 instead of 3.14
# sep = ";"

with open("res.csv", 'w') as fout:
    for row in range(21):
        iterator = (f.readline().strip().replace("\t", sep) for f in results)
        line = sep.join(iterator)
        fout.write("{0}\n".format(line))
So, to explain what went wrong with your code: your source files use a tab as the field separator, but your code uses a comma to join the lines it reads from those files. If your Excel uses a period as the decimal separator, it uses a comma as the default field separator; whitespace is ignored unless enclosed in quotes, and you see the result.
If you use the text import feature of Excel (Data ribbon => From Text) you can ask it to consider both comma and tab as valid field separators, and then I'm pretty sure your original output would work too.
In contrast, the above code should produce a file that will open correctly when double clicked.
You don't need to write your own program to do this, in Python or otherwise. You can use an existing Unix command (if you are in that environment):
paste *.data > res.csv
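Note that paste joins the files with a tab by default; if you want actual commas in res.csv, paste -d, *.data > res.csv should do it.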
Try this:
import glob, csv
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

Results = [open(f).readlines() for f in glob.glob("*.data")]
outfile = open("res.csv", 'wb')
fout = csv.writer(outfile, dialect="excel")

row = []
for line, c in zip(roundrobin(*Results), cycle(range(len(Results)))):
    row.extend(line.split())
    if c == len(Results) - 1:  # one line consumed from every file: flush the row
        fout.writerow(row)
        row = []
outfile.close()
It should loop over each line of your input files and stitch them together as one row, which the csv library will write in the given dialect.
I suggest getting used to the csv module. The reason is that if the data is not that simple (simple strings in headings, and then numbers only), it is difficult to implement everything again yourself. Try the following:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files into the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])  # add another row
            rows[n].extend(row)  # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
The with construct for opening a file can be replaced by the pair:
f = open(fname, 'wb')
...
f.close()
The csv.reader and csv.writer are simply wrappers that parse or compose the lines of the file. The docs say that the file needs to be opened in binary mode.
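As a quick illustration of what these wrappers buy you over a plain split (a minimal interpreter session, not tied to the files above):
>>> import csv
>>> line = '1,"hello, world",3'
>>> line.split(',')            # a naive split breaks the quoted field
['1', '"hello', ' world"', '3']
>>> next(csv.reader([line]))   # csv.reader respects the quoting
['1', 'hello, world', '3']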