Extract columns from multiple text files with Python

I have a folder with 5 text files in it pertaining to various sites --
the filenames are formatted in this way:
Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt
Rockspring_18_SW.417712.WRFc36.RAIN.2000-2050.txt
WICA.399347.WRFc36.ET.2000-2050.txt
WICA.399347.WRFc36.RAIN.2000-2050.txt
so, basically the file name follows the format of:
(site name).(site number).(WRFc36).(some variable).(2000-2050).txt
Each of these text files has a similar format to it with no header row: Year Month Day Value (consisting of ~18500 rows in each text file)
I want Python to search for similar filenames (where site name and site number match), pick out the first through third columns of data from one of the files, and paste them to a new txt file. I also want to copy the 4th column from each variable file for a site (rain, et, etc.) and have them pasted in a particular order in the new file.
I know how to grab data using the csv module (and defining a new dialect for a space delimiter) from ALL files and print to a new text file, but I'm not sure how to automate the creation of a new file for each site name/number and make sure my variables come out in the right order --
The output I want to use is one text file (not 5) for each site with the following format (year, month, day, variable1, variable2, variable3, variable4, variable5) for ~18500 rows...
I'm sure I'm looking over something really simple here... this seems like it would be pretty rudimentary... but any help would be greatly appreciated!
Update
========
I have updated the code to reflect the comments below.
http://codepad.org/3mQEM75e
from collections import defaultdict
import glob
import csv

# Create dictionary of lists-- [A] = [Afilename1, Afilename2, Afilename3...]
#                              [B] = [Bfilename1, Bfilename2, Bfilename3...]
def get_site_files():
    sites = defaultdict(list)
    # To start, I have a bunch of files in this format ---
    # "site name(unique)"."site num(unique)"."WRFc36"."Variable(5 for each site name)"."2000-2050"
    for fname in glob.glob("*.txt"):
        # split name at every instance of "."
        parts = fname.split(".")
        # check to make sure I only use the proper files -- 6 parts to the name and WRFc36 as the 3rd part
        if len(parts) == 6 and parts[2] == 'WRFc36':
            # make sure the site name is the full unique identifier, the first and second "parts"
            sites[parts[0] + "." + parts[1]].append(fname)
    return sites

# hardcode the variables for method 2, below
Var = ["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"]

def main():
    for site_name, files in get_site_files().iteritems():
        print "Working on *****" + site_name + "*****"
        #### Method 1 -- I'd like to not hardcode in my variables (as in method 2),
        #### so I can use this script in other applications.
        for filename in files:
            reader = csv.reader(open(filename, "rb"))
            WriteFile = csv.writer(open("XX_" + site_name + "_combined.txt", "wb"))
            for row in reader:
                row = reader.next()
        #### Method 2 works (mostly), but skips a LOT of random lines of the first file,
        #### and doesn't utilize the functionality built into my dictionary of lists...
##        reader0 = csv.reader(open(site_name+".WRFc36."+Var[0]+".2000-2050.txt", "rb"))  # I'd like to copy ALL columns from the first file
##        reader1 = csv.reader(open(site_name+".WRFc36."+Var[1]+".2000-2050.txt", "rb"))  # and just the fourth column from all the rest of the files
##        reader2 = csv.reader(open(site_name+".WRFc36."+Var[2]+".2000-2050.txt", "rb"))  # (columns 1-3 are the same for all files)
##        reader3 = csv.reader(open(site_name+".WRFc36."+Var[3]+".2000-2050.txt", "rb"))
##        reader4 = csv.reader(open(site_name+".WRFc36."+Var[4]+".2000-2050.txt", "rb"))
##        WriteFile = csv.writer(open("XX_"+site_name+"_COMBINED.txt", "wb"))  # creates a writer for the combined text file
##
##        for row in reader0:
##            row = reader0.next()
##            row1 = reader1.next()
##            row2 = reader2.next()
##            row3 = reader3.next()
##            row4 = reader4.next()
##            WriteFile.writerow(row + row1 + row2 + row3 + row4)
##        print "***finished with site***"

if __name__ == "__main__":
    main()

Here's an easier way to iterate through your files, grouped by site.
from collections import defaultdict
import glob

def get_site_files():
    sites = defaultdict(list)
    for fname in glob.glob('*.txt'):
        parts = fname.split('.')
        if len(parts) == 6 and parts[2] == 'WRFc36':
            sites[parts[0]].append(fname)
    return sites

def main():
    for site, files in get_site_files().iteritems():
        # you need to better explain what you are trying to do here!
        print site, files

if __name__ == "__main__":
    main()
I still don't understand your cutting and pasting columns - you need to more clearly explain what you are trying to accomplish.
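If I've guessed your intent right, though, the column merge itself could be sketched like this (a hypothetical in-memory example; in your real code the two lists below would be csv readers over the per-variable files of one site):

```python
# Hypothetical in-memory stand-ins for one site's per-variable files;
# each row is [year, month, day, value], as in the real space-delimited files.
site_files = {
    "ET":   [["2000", "1", "1", "0.5"], ["2000", "1", "2", "0.6"]],
    "RAIN": [["2000", "1", "1", "1.2"], ["2000", "1", "2", "0.0"]],
}
var_order = ["ET", "RAIN"]  # the order the value columns should appear in

def combine(site_files, var_order):
    # Walk all variable files in lockstep with zip(); keep year/month/day
    # from the first one and append the 4th column (value) from every file.
    readers = [iter(site_files[v]) for v in var_order]
    combined = []
    for rows in zip(*readers):
        combined.append(rows[0][:3] + [r[3] for r in rows])
    return combined

result = combine(site_files, var_order)
# result[0] == ["2000", "1", "1", "0.5", "1.2"]
```

Using zip() over the readers (instead of calling next() inside a for loop over one of them) is what avoids the skipped-lines problem in Method 2.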

As far as getting the filenames goes I would use something like the following:
import os
# Gets a list of all file names that end in .txt
# ON *nix
file_names = os.popen('ls *.txt').read().split('\n')
# ON Windows
file_names = os.popen('dir /b *.txt').read().split('\n')
Then to get the elements normally separated by periods, use:
# For some file_name in file_names
file_name.split('.')
Then you can proceed to comparisons and extract the desired columns (by using open(file_name, 'r') or your CSV parser)
Michael G.

Related

Writing csv file from list but not all elements follow up

I have to format data from a text book into a csv file. In the text book my data are already separated by spaces, so I make a list of strings (some contain multiple values separated by spaces).
When I try to write my list into a text file it works well, but when I try to write it into a CSV file, in the middle of a string the writing stops and goes to the next element in my list. I don't know why more than half of my data doesn't follow. There is no end-of-line character or anything.
Here is my simple code
# importing libraries
import os

# defining location of parent folder
BASE_DIRECTORY = r'C:\Users\CAVC071777\Documents\1_Projet\Riverstart\Intrant EDPR\6-Background Harmonics Data at POI\test'
output_file = open('output.csv', 'w')
output = []
outputString = ""
file_list = []
i = 0

# scanning through sub folders
for (dirpath, dirnames, filenames) in os.walk(BASE_DIRECTORY):
    for f in filenames:
        if 'txt' in str(f):
            e = os.path.join(str(dirpath), str(f))
            file_list.append(e)

for f in file_list:
    txtfile = open(f, 'r')
    i = 0
    for line in txtfile:
        if i == 3:
            outputString = "=Date(""{0}"",""{1}"",""{2}"")+TEMPS(""{3}"",""{4}"",""{5}"")".format(line[46:48], line[40:42], line[43:45], line[58:60], line[61:63], line[64:66])
        if i > 8 and i < 71:
            outputString += line[9:71]
        i = i + 1
    output.append(outputString)
    outputString = ""

for row in output:
    print(row)
    output_file.write(row + "\n")
When I open the csv file, all the data after 0.830% didn't follow (screenshot not shown).
When I print my list of strings containing my data in the terminal, it's well formatted and all my data is there.
The text files that I try to read look like this:
ET H
WHM1 SEL-735 Date: 09/17/19 Time: 11:46:03.726
HDW Time Source: ext
Fundamental Frequency = 60.0
Harmonic IA IB IC IN VA VB VC
2 0.166% 0.137% 0.166% 0.000% 0.000% 0.020% 0.010%
3 ...... ......
And so forth till 60
You have two problems here:
You are building a space-separated file
You are using Excel
Excel is known to have very poor support for csv files. Long story made short: if you read a csv file built from Excel on the same system, it will work smoothly. If you read a csv file specifically built for your system, it should work. In any other use case it may or may not load correctly...
Here Excel expects the delimiter to be a ; as that is the default delimiter for a French locale, or , if you managed to tell it that. As there are none in the rows, it just tries to put everything in the first cell, and visibly limits the length of a single field.
How to fix:
Use LibreOffice or OpenOffice. Both are far ahead of Excel for almost all features except csv. You could declare at load time that the separator is a space and check that the lines are correctly parsed.
Change the rows in the csv file to use the separator that your version of Excel expects.
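For instance, writing and reading with an explicit `;` delimiter round-trips cleanly. A sketch using in-memory buffers instead of real files, with made-up harmonics rows:

```python
import csv
import io

# Hypothetical rows resembling the harmonics data in the question
rows = [["Harmonic", "IA", "IB"],
        ["2", "0.166%", "0.137%"],
        ["3", "0.830%", "0.101%"]]

# Write with an explicit delimiter; a French-locale Excel expects ';'
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerows(rows)

# Reading back with the same delimiter restores every field intact
read_back = list(csv.reader(io.StringIO(buf.getvalue()), delimiter=";"))
# read_back == rows
```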

Trying to search a CSV for a barcode value, to then rename files based on cells relating to that barcode

I have a bunch of files in a directory named by a barcode, then image number. (Barcode_01.jpg, Barcode_02.jpg) etc.
I'm looking to search a CSV based on the Barcode portion of the filename, then when the barcode is found, to rename the file based on information in the row for that barcode.
Example:
files in directory:
123456789_01.jpg
123456789_02.jpg
012345_01.jpg
012345_02.jpg
98765_01.jpg
98765_02.jpg
The barcodes relate to a range of products, which, in the CSV also have a product code and a colour code that I need to rename the files to in the format {product code in CSV}{colour code in CSV}{image number of original filename}.jpg
Effectively looking to replace the barcode value, with the product and colour code, retaining the image number.
So I've used splitext to parse the filenames and simplify them from their originals which works just fine.
I've also looked at xlwings and csv to search for values within the spreadsheet or CSV But I can't seem to get them to work together.
The code I have for renaming the files was modified after watching some Corey Schafer tutorials:
import os

os.chdir('[FILE DIRECTORY I NEED TO RENAME]')
for f in os.listdir('[FILE DIRECTORY I NEED TO RENAME]'):
    f_name, f_ext = os.path.splitext(f)
    f_sku = f_name.split('_')[0]
    f_num = f_name[-2:]
    n_name = '{}_{}{}'.format(f_sku, f_num, f_ext)
    os.rename(f, n_name)
I have also looked at opening the CSV with DictReader and searching for the barcode in column 8; however, I'm very new to how this all works and can't seem to get any results worth writing home about.
I tried to do the following to see if I could get it to at least return some values on a search:
import csv
import os

for f in os.listdir('[file directory]'):
    f_name, f_ext = os.path.splitext(f)
    f_sku = f_name.split('_')[0]
    f_num = f_name[-2:]
    with open('[file directory]', 'r') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in csv_file:
            if row[8] == f_sku:
                print(row)
I have defined the search string as 'f_sku' from above - which would be the part of the filename that i'm looking for in the CSV. Which is about as far as I can get.
EDIT: I was getting bugs with the directories, and issues where, once the script had run, it wouldn't run again with a new image set in the directory because the split was looking for the wrong things.
So I peeled back and used the file parsing I had in a previous script. I also defined the directories to make things clearer.
import os
import csv

ImageDir = "IMAGE DIR"
RenDir = "RENAMED FILES DIR"

for f in os.listdir(ImageDir):
    f_name, f_ext = os.path.splitext(f)
    f_bcode = f_name.split('_')[0]
    f_num = f_name[-2:]
    ## FINDING THE NEW FILENAME FROM THE BARCODES CSV
    with open('CSV DIR', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            if row[8] == f_bcode:
                n_name = row[0] + "_" + row[5] + "_" + f_num + f_ext
                os.rename(ImageDir + f, ImageDir + n_name)
Let's assume given_name is the name you are given from f in your code. Your code works for certain files, but not all. ALL files have their extension after the "." and all YOUR files are separated by an underscore. Hence, you can split the given name on "." and on the underscore to recover each part of the filename.
import os
import csv

for f in os.listdir('[FILE DIRECTORY I NEED TO RENAME]'):
    given_name = f
    f_name, f_ext = given_name.split(".")
    f_sku, f_num = f_name.split("_")
    with open('CSV File', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            if row[8] == f_sku:
                n_name = row[0] + "_" + row[5] + "_" + f_num + "." + f_ext
                os.rename(f, n_name)  # rename once the matching barcode row is found
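Another option is to build a barcode lookup from the CSV once, instead of rescanning the file for every image. A sketch with hypothetical inline CSV data, using the same column positions as above (0 = product code, 5 = colour code, 8 = barcode):

```python
import csv
import io

# Hypothetical CSV contents; "x" marks columns we don't use here
csv_text = (
    "PROD1,x,x,x,x,RED,x,x,123456789\n"
    "PROD2,x,x,x,x,BLU,x,x,012345\n"
)

# Build a barcode -> (product, colour) lookup in one pass over the CSV
lookup = {}
for row in csv.reader(io.StringIO(csv_text)):
    lookup[row[8]] = (row[0], row[5])

def new_name(filename):
    # Split off the extension, then split barcode from image number
    stem, ext = filename.rsplit(".", 1)
    barcode, num = stem.rsplit("_", 1)
    product, colour = lookup[barcode]
    return "{}_{}_{}.{}".format(product, colour, num, ext)

# new_name("123456789_01.jpg") == "PROD1_RED_01.jpg"
```

The real script would then call os.rename(old, new_name(old)) for each file in the directory.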

Combining csv headers with corresponding file paths into new file

I am not sure how to "crack" the following Python-nut. So I was hoping that some of you more experienced Python'ers could push me in the right direction.
What I got:
Several directories containing many csv files
For instance:
/home/Date/Data1
/home/Date/Data2
/home/Date/Data3/sub1
/home/Date/Data3/sub2
What I want:
A file containing the "splitted" path for each file, followed by the variables (=row/headers) of the corresponding file. Like this:
home /t Date /t Data1 /t "variable1" "variable2" "variable3" ...
home /t Date /t Data2 /t "variable1" "variable2" "variable3" ...
home /t Date /t Data3 /t sub1 /t "variable1" "variable2" "variable3" ...
home /t Date /t Data3 /t sub2 /t "variable1" "variable2" "variable3" ...
Where am I right now?: The first step was to figure out how to print out the first row (the variables) of a single csv file (I used a test.txt file for testing)
# print out variables of a single file:
import csv

with open("test.txt") as f:
    reader = csv.reader(f)
    i = next(reader)
print(i)
The second step was to figure out how to print the paths of the csv files in directories inclusive subfolders. This is what I ended with:
import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if ".csv" in file:
            print(os.path.join(r, file))
Prints:
/home/Date/Data1/file1.csv
/home/Date/Data1/file2.csv
/home/Date/Data2/file1.csv
/home/Date/Data2/file2.csv
/home/Date/Data2/file3.csv
/home/Date/Data3/sub1/file1.csv
/home/Date/Data3/sub2/file1.csv
/home/Date/Data3/sub2/file2.csv
Where am I stuck?: I am struggling to figure out how to get further from here; any ideas, approaches, etc. pointing in the right direction are greatly appreciated!
Cheers, B
##### UPDATE #####
Inspired by Tim Pietzcker's useful comments, I have gotten a long way (thanks Tim!).
But I could not get the output.write & join part to work, therefore the code is slightly different. The new issue is now to "merge" the two lists as two separate columns with a comma as delimiter (I want to create a csv file). Since I am stuck yet again, I wanted to see if there are any good suggestions from the experienced Python'ers in here.
#!/usr/bin/python
import os
import csv

thisdir = os.getcwd()
csvfiles = []  # the list must exist before the loop appends to it

# Extract file-paths and append them to "csvfiles"
for r, d, f in os.walk(thisdir):  # r=root, d=directories, f = files
    for file in f:
        if ".csv" in file:
            csvfiles.append(os.path.join(r, file))

# get each file-path on a new line + convert to list of str
filepath = "\n".join(["".join(sub) for sub in csvfiles])
filepath = filepath.replace(".csv", "")  # remove .csv
filepath = filepath.replace("/", ",")    # replace / with ,
Results in:
,home,Date,Data1,file1
,home,Date,Data1,file2
,home,Date,Data1,file3
... and so on
Then on to the headers:
# Create header-extraction function:
def get_csv_headers(filename):
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        return next(reader)

# Create empty list for headers
headers = []

# Extract headers with the function and append them to the "headers" list
for l in csvfiles:
    headers.append(get_csv_headers(l))

# Create file with headers
headers = "\n".join(["".join(sublst) for sublst in headers])  # new lines + str conversion
headers = headers.replace(";", ",")  # replace ; with ,
Results in:
variable1,variable2,variable3
variable1,variable2,variable3,variable4,variable5,variable6
variable1,variable2,variable3,variable4
and so on..
What I want now: a csv like this:
home,Date,Data1,file1,variable1,variable2,variable3
home,Date,Data1,file2,variable1,variable2,variable3,variable4,variable5,variable6
home,Date,Data1,file3, variable1,variable2,variable3,variable4
For instance:
with open('text.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(zip(filepath, headers))
resulted in:
",",v
h,a
o,r
m,i,
e,a
and so on..
Any ideas and pushes in the right direction are very welcome!
About your edit: I would recommend against transforming everything into strings that early in the process. It makes much more sense keeping the data in a structured format and allow the modules designed to handle structured data to do the rest. So your program might look something like this:
#!/usr/bin/python
import os
import csv

thisdir = os.getcwd()
csvfiles = []  # note: the list must be initialised first

# Extract file-paths and append them to "csvfiles"
for r, d, f in os.walk(thisdir):  # r=root, d=directories, f = files
    for file in f:
        if ".csv" in file:
            csvfiles.append(os.path.join(r, file))
This (taken directly from your question) leaves you with a list of CSV filenames.
Now let's read those files. From the script in your question it seems that your CSV files are actually semicolon-separated, not comma-separated. This is common in Europe (because the comma is needed as a decimal point), but Python needs to be told that:
# Create header-extraction function:
def get_csv_headers(filename):
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter=";")  # semicolon-separated files!
        return next(reader)

# Create empty list for headers
headers = []

# Extract headers with the function and append them to the "headers" list
for l in csvfiles:
    headers.append(get_csv_headers(l))
Now headers is a list containing many sub-lists (which contain all the headers as separate items, just as we need them).
Let's not try to put everything on a single line; better keep it readable:
with open('text.csv', 'w', newline="") as f:
    writer = csv.writer(f, delimiter=',')  # maybe use semicolon again??
    for path, header in zip(csvfiles, headers):
        writer.writerow(list(path.split("\\")) + header)
If all your paths start with \, you could also use
writer.writerow(list(path.split("\\")[1:]) + header)
to avoid the empty field at the start of each line.
This looks promising; you've already done most of the work.
What I would do is
Collect all your CSV filenames in a list. So instead of printing the filenames, create an empty list (csvfiles=[]) before the os.walk() loop and do something like csvfiles.append(os.path.join(r, file)).
Then, iterate over those filenames, passing each to the routine that's currently used to read test.txt. If you place that in a function, it could look like this:
def get_csv_headers(filename):
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        return next(reader)
Now, you can write the split filename to a new file and add the headers. I'm questioning your file format a bit - why separate part of the line by tabs and the rest by spaces (and quotes)? If you insist on doing it like this, you could use something like
output.write("\t".join(filename.split("\\")))
output.write("\t")
output.write(" ".join(['"{}"'.format(header) for header in get_csv_headers(filename)]))
but you might want to rethink this approach. A standard format like JSON might be more readable and portable.
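For what it's worth, the zip-and-write step can be sketched with in-memory stand-ins (hypothetical paths and headers; `io.StringIO` replaces the real output file):

```python
import csv
import io

# Hypothetical stand-ins: collected file paths and the header row of each file
csvfiles = ["/home/Date/Data1/file1.csv", "/home/Date/Data2/file2.csv"]
headers = [["variable1", "variable2"],
           ["variable1", "variable2", "variable3"]]

buf = io.StringIO()
writer = csv.writer(buf)
for path, header in zip(csvfiles, headers):
    # strip the extension, split on "/", drop the empty leading field
    parts = path[:-len(".csv")].split("/")[1:]
    writer.writerow(parts + header)

lines = buf.getvalue().splitlines()
# lines[0] == "home,Date,Data1,file1,variable1,variable2"
```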

How to read in multiple files separately from multiple directories in python

I have x directories named Star_{v} with v = 0 to x.
I have 2 csv files in each directory, one with the word "epoch" in it, one without.
If one of the csv files has the word "epoch" in it, it needs to be sent through one set of code, else through another.
I think dictionaries are probably the way to go, but this section of the code is a bit of a mess:
directory_dict = {}
for var in range(0, len(subdirectory)):
    # var refers to the number by which the subdirectories are labelled: Star_0, Star_1 etc.
    directory_dict['Star_{v}'.format(v=var)] = directory\\Star_{var}
    # directory_dict['Star_0'], directory_dict['Star_1'] etc.
    read_csv(f) for f in os.listdir('directory_dict[Star_{var}') if f.endswith(".csv")
    # reads in all the files in the directories (Star_{v}) ending in .csv
    if 'epoch' in open(read_csv[0]).read():
        # if the word epoch is in the csv file then it is
        directory_dict[Star_{var}][read] = csv.reader(read_csv[0])
        directory_dict[Star_{var}][read1] = csv.reader(read_csv[1])
    else:
        directory_dict[Star_{var}][read] = csv.reader(read_csv[1])
        directory_dict[Star_{var}][read1] = csv.reader(read_csv[0])
When dealing with CSVs, you should use the csv module, and for your particular case you can use a DictReader and parse the headers to check for the column you're looking for:
import csv
import os

directory = os.path.abspath(os.path.dirname(__file__))  # change this to your directory
csv_list = [os.path.join(directory, c) for c in os.listdir(directory)
            if os.path.splitext(c)[1] == '.csv']  # splitext returns (root, ext)

def parse_csv_file():
    " open each CSV and check the headers "
    for c in csv_list:
        with open(c, mode='r') as open_csv:
            reader = csv.DictReader(open_csv)
            if 'epoch' in reader.fieldnames:
                pass  # do whatever you want here
            else:
                pass  # do whatever else
then you can extract it from the DictReader's CSV header and do whatever you want
Also your python looks invalid
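As a minimal, self-contained illustration of the header check (inline text standing in for real files):

```python
import csv
import io

# Hypothetical file contents for two CSVs, one with an "epoch" column
with_epoch = "epoch,flux\n1,0.5\n"
without_epoch = "time,flux\n1,0.5\n"

def has_epoch_column(text):
    # Read only the header row and report whether 'epoch' is a column
    reader = csv.DictReader(io.StringIO(text))
    return 'epoch' in reader.fieldnames

# has_epoch_column(with_epoch) is True; has_epoch_column(without_epoch) is False
```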

csv row and column fetch

So I'm working on a program in Python 3.3.2. I'm new to it all, but I've been getting through it. I have an app that I made that will take 5 inputs. 3 of those inputs are comboboxes, two are entry widgets. I have then created a button event that will save those 5 inputs into a text file and a csv file. Opening each file, everything looks proper. For example, saved info would look like this:
Brad M.,Mike K.,Danny,Iconnoshper,Strong Wolf Lodge
I then followed a csv demo and copied this...
import csv

ifile = open('myTestfile.csv', "r")
reader = csv.reader(ifile)
rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print('%-15s: %s' % (header[colnum], col))
            colnum += 1
    rownum += 1
ifile.close()
and that ends up printing beautifully as:
rTech: Brad M.
pTech: Mike K.
cTech: Danny
proNam: ohhh
jobNam: Yeah
rTech: Damien
pTech: Aaron
so on and so on. What I'm trying to figure out is if I've named my headers via
if rownum == 0:
    header = row
is there a way to pull a specific row / col combo and print what is held there??
I have figured out that after the program ran I could do
print(col)
or
print(col[0:10])
and I am able to print the last col printed, or the letters from the last printed col. But I can't go any farther back than that last printed col.
My ultimate goal is to be able to assign variables so I could in turn have a label in another program get its information from the csv file:
rTech for job is???
look in the Jobs csv at row 1, column 1, and return the value for rTech
Do I need to create a dictionary that is loaded with the information and then call the dictionary?? Thanks for any guidance.
Thanks for the direction. So I've been trying a few different things, one of which I'm really liking is the following...
import csv

labels = ['rTech', 'pTech', 'cTech', 'productionName', 'jobName']
fn = 'my file.csv'
cameraTech = 'Danny'

f = open(fn, 'r')
reader = csv.DictReader(f, labels)
jobInformation = [(item["productionName"],
                   item["jobName"],
                   item["pTech"],
                   item["rTech"]) for item in reader if item['cTech'] == cameraTech]
f.close()

print("Camera Tech: %s\n" % (cameraTech))
print("\n".join(["Production Name: %s \nJob Name: %s \nPrep Tech: %s \nRental Agent: %s\n" % (item) for item in jobInformation]))
That shows me that I could create a variable through cameraTech, and as long as that matched what was loaded into the reader that holds the csv file, and the cTech column had a match for cameraTech, it would fill in the proper information. 95% there, WOOOOOO...
So now what I'm curious about is calling each item. The plan is: in a window I have a listbox that is populated with items from a .txt file with "productionName" and "jobName". When I click on one of those items in the listbox, a new window opens up and the matching information from the .csv file is filled into the appropriate labels.
Thoughts??? Thanks again :)
I think that reading the CSV file into a dictionary might be a working solution for your problem.
The Python CSV package has built-in support for reading CSV files into a Python dictionary using DictReader, have a look at the documentation here: http://docs.python.org/2/library/csv.html#csv.DictReader
Here is an example using DictReader that reads the CSV file and prints the contents of the first row:
import csv

csv_data = csv.DictReader(open("myTestfile.csv"))
print(next(csv_data))  # DictReader is an iterator, so use next() to get the first row
Okay, so I was able to put this together after seeing the following (https://gist.github.com/zstumgoren/911615).
That showed me how to give each header a variable I could call. From there I could create a function that allows certain variables to be called and compared, and if they matched, I would be able to see certain data I needed. So the example I made to show myself it could be done is as follows:
import csv

source_file = open('jobList.csv', 'r')
for line in csv.DictReader(source_file, delimiter=','):
    pTech = line['pTech']
    cTech = line['cTech']
    rAgent = line['rTech']
    prodName = line['productionName']
    jobName = line['jobName']
    if prodName == 'another':
        print(pTech, cTech, rAgent, jobName)
However, I just noticed something: while my .csv file has one line, this works great!!!! But with my proper .csv file, I am only able to print information from the last line read. Grrrrr... Getting closer though... I'm still searching, but if someone understands my issue, I would love some light.
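One way around the last-line-only problem is to materialise the reader into a list, so every row stays addressable instead of the loop variables being overwritten on each iteration. A sketch with hypothetical inline data matching the headers above:

```python
import csv
import io

# Hypothetical job list using the same headers as above
csv_text = (
    "rTech,pTech,cTech,productionName,jobName\n"
    "Brad M.,Mike K.,Danny,Iconnoshper,Strong Wolf Lodge\n"
    "Damien,Aaron,Danny,another,Job2\n"
)

# list() consumes the whole reader, keeping one dict per row
rows = list(csv.DictReader(io.StringIO(csv_text)))

# rows[0]['rTech'] == 'Brad M.'
# rows[1]['jobName'] == 'Job2'
```

Any row/column combo can then be pulled as rows[row_index][header_name].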
