Extracting certain columns from multiple files simultaneously by Python

Extracting certain columns from multiple files simultaneously by Python - python

My purpose is to extract one certain column from the multiple data files.
So, I tried to use glob module to read files and tried to extract one column from each file with for statements like below:
filin = diri + '*_7.txt'
FileList=sorted(glob.glob(filin))
for INPUT in FileList:
a = []
b = []
c = []
T = []
f = open(INPUT,'r')
f.seek(0,0)
for columns in ( raw.strip().split() for raw in f):
b.append(columns[11])
t = np.array(b, float)
print t
t = list(t)
T = T + [t]
f.close()
print T
The number of data files which I used is 32. So, I expected the second 'for' statement ran only 32 times while generating only 32 arrays of t. However, the result doesn't look like what I expected.
I assume that it may be due to the influence from the first 'for' statement, but I am not sure.
Any idea or help would be really appreciated.
Thank you,
Isaac

You clear T = [] for every file. Move T = [] line before first loop.

Related

How to store different values in different variables in python?

I have a list as follows: a = ['abc', 'def'] and I have separate files with these file names which consists of some text inside.
Now I have to extract and save this text from those 2 files inside 2 separate variables in python using a for loop as follows: c[abc] = text1 & c[def] = text2
Here is my code:
for b in a:
x = open('/Users/xyz/'+b+'.txt', 'r')
c[b] = x.read()
But I am getting a name error that c is not defined. Can anyone please help me out with this?

1.You have a list a which contains text files.
2.You want to create separate variables to store values of each file.
But this approach which you expect would take lot of unnecessary memory.
3.As per my suggestions you store the values of your text files in to a list and then iterate the list to print the data
a=["reverse","abc","def"]
li=[]
for b in a:
x = open(r"C:\Users\akash\Desktop\dailyChallaneges\\"+b+".txt", "r")
c = x.read()
li.append(c)
for ele in li:
print(ele)
OUTPUT:
afzdvxdvdxxrfc
saddfgf
cmskjfnzsk.zdnc

python structured array composition and transformation

I created a script that collects a huge data from a .txt file into an array in the format I want [3: 4: n] and the information is recorded as follows (I think). The .txt file is in this format
1.000000e-01 1.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 3.000000e-01
3.000000e-01 2.000000e-01 1.000000e-01
1.000000e-01 2.000000e-01 4.000000e-01
and repeats for N times and I store basically from 4 lines into for lines (like a block) because I'm working with ASCII files from STL parts.
In this sense, I have this code:
f = open("camaSTLfinalmente.txt","r")
b_line = 0
Coord = []
Normal = []
Vertice_coord = []
Tri = []
blook = []
for line in f:
line = line.rstrip()
if(line):
split = line.split()
for axis in range(0,3):
if(b_line == 0): #normal
Normal.append(split[axis])
else: #triangulo
Vertice_coord.append(split[axis])
if(b_line > 0):
Tri.append(Vertice_coord)
Vertice_coord = []
if(b_line == 3):
block.append(Normal)
block.append(Tri)
Coord.append(block)
block = []
Normal = []
Tri = []
b_line = 0
else:
b_line+=1
print(Coord[0]) #prints the follow line that I wrote after the code
the information is store in the way:
[['1.000000e-01', '1.000000e-01', '1.000000e-01'], [['1.000000e-01', '2.000000e-01', '3.000000e-01'], ['3.000000e-01', '2.000000e-01', '1.000000e-01'], ['1.000000e-01', '2.000000e-01', '-4.000000e-01']]]
Is there any way to simplify it?
I would like to take this opportunity to ask: I wanted to convert this information into numbers, and the ideal would be to read the number after the exponential (e) and change the numbers accordingly, that is, 1.000000e-01 goes to 0,1 (in order to make operations with a similar array where I store information from another .txt file with the same format)
Thanks for the attention,
Pedro

You can try changing the line split = line.split() to:
split = [float(x) for x in line.split()]
if you need the result to be in string and not float datatype:
split = [str(float(x)) for x in line.split()]

I'm not 100% sure if I fully understand what you want but the following code produces the same Coord:
coord = []
with open('camaSTLfinalmente.txt','r') as f:
content = [line.strip().split() for line in f]
for i in range(len(content)//4):
coord.append([content[4*i], content[(4*i+1):(4*i+4)]])
Regarding the second question, as remarked in another answer, the easiest way to handle strings containing a number is to convert them to a number and then format them as string.
s = '1.000000e-01'
n = float(s)
m = '{:.1f}'.format(n)
See the section about string formatting in the Python doc.
A couple remarks:
Generally Stackoverflow doesn't like questions of the form "how do I improve this piece of code", try to ask more specific questions.
The above assumes your file contains 4k lines, change the integer division ...//4 accordingly if you have some lines left at the end that do not form a pack of 4.
don't use capital letters for your variables. While style guides are not mandatory, it is good practice to follow them (Look up PEP-8, pylint, ...)

Python: Writing peoples scores to individual lines

I have a task where I need to record peoples scores in a text file. My Idea was to set it out like this:
Jon: 4, 1, 3
Simon: 1, 3, 6
This has the name they inputted along with their 3 last scores (Only 3 should be recorded).
Now for my question; Can anyone point me in the right direction to do this? Im not asking for you to write my code for me, Im simply asking for some tips.
Thanks.
Edit: Im guessing it would look something like this: I dont know how I'd add scores after their first though like above.
def File():
score = str(Name) + ": " + str(correct)
File = open('Test.txt', 'w+')
File.write(score)
File.close()
Name = input("Name: ")
correct = input("Number: ")
File()

You could use pandas to_csv() function and store your data in a dictionary. It will be much easier than creating your own format.
from pandas import DataFrame, read_csv
import pandas as pd
def tfile(names):
df = DataFrame(data = names, columns = names.keys())
with open('directory','w') as f:
f.write(df.to_string(index=False, header=True))
names = {}
for i in xrange(num_people):
name = input('Name: ')
if name not in names:
names[name] = []
for j in xrange(3):
score = input('Score: ')
names[name].append(score)
tfile(names)
Simon Jon
1 4
3 1
6 3
This should meet your text requirement now. It converts it to a string and then writes the string to the .txt file. If you need to read it back in you can use pandas read_table(). Here's a link if you want to read about it.

Since you are not asking for the exact code, here is an idea and some pointers
Collect the last three scores per person in a list variable called last_three
do something like:
",".join(last_three) #this gives you the format 4,1,3 etc
write to file an entry such as
name + ":" + ",".join(last_three)
You'll need to do this for each "line" you process
I'd recommend using with clause to open the file in write mode and process your data (as opposed to just an "open" clause) since with handles try/except/finally problems of opening/closing file handles...So...
with open(my_file_path, "w") as f:
for x in my_formatted_data:
#assuming x is a list of two elements name and last_three elems (example: [Harry, [1,4,5]])
name, last_three = x
f.write(name + ":" + ",".join(last_three))
f.write("\n")# a new line
In this way you don't really need to open/close file as with clause takes care of it for you

How do i format the ouput of a list of list into a textfile properly?

I am really new to python and now I am struggeling with some problems while working on a student project. Basically I try to read data from a text file which is formatted in columns. I store the data in a list of list and sort and manipulate the data and write them into a file again. My problem is to align the written data in proper columns. I found some approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is how the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of list:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
line = f.readline()
if line == '':
break
splitted = line.split()
lines.append(splitted)
f.close()
To write the data I first put all the row elements of the list of list into one string with a free fixed space between the elements. But instead i need a fixed total space including the element. But also I don't know the number of columns in the file.
for k in xrange(len(lines)):
stringlist=""
for i in lines[k]:
stringlist = stringlist+str(i)+' '
lines[k] = stringlist+'\n'
f = open(workdir2, 'w')
for i in range(len(lines)):
f.write(lines[i])
f.close()
This code works basically, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!

You are absolutely right about begin able to format widths as you have above using string formatting. But as you correctly point out, the tricky bit is doing this for a variable sized output list. Instead, you could use the join() function:
output = ['a', 'b', 'c', 'd', 'e',]
# format each column (len(a)) with a width of 10 spaces
width = [10]*len(a)
# write it out, using the join() function
with open('output_example', 'w') as f:
f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
' a b c d e'
As you can see, the length of the format array width is determined by the length of the output, len(a). This is flexible enough that you can generate it on the fly.
Hope this helps!

String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
test1 test2
Though you might want to first create strings from those numbers and then format them as I showed above.
I cannot fully comprehend your writing code, but try working on it somehow like that:
from itertools import enumerate
with open(workdir2, 'w') as datei:
for key, item in enumerate(zeilen):
line = "%4i %6.6" % key, item
datei.write(item)

reusable function: substituting the values returned by another function

Below is the snippet: I'm parsing job log and the output is the formatted result.
def job_history(f):
def get_value(j,n):
return j[n].split('=')[1]
lines = read_file(f)
for line in lines:
if line.find('Exit_status=') != -1:
nLine = line.split(';')
jobID = '.'.join(nLine[2].split('.',2)[:-1]
jData = nLine[3].split(' ')
jUsr = get_value(jData,0)
jHst = get_value(jData,9)
jQue = get_value(jData,3)
eDate = job_value(jData,14)
global LJ,LU,LH,LQ,LE
LJ = max(LJ, len(jobID))
LU = max(LU, len(jUsr))
LH = max(LH, len(jHst))
LQ = max(LQ, len(jQue))
LE = max(LE, len(eDate))
print "%-14s%-12s%-14s%-12s%-10s" % (jobID,jUsr,eDate,jHst,jQue)
return LJ,LU,LE,LH,LQ
In principle, I should have another function like this:
def fmt_print(a,b,c,d,e):
print "%-14s%-12s%-14s%-12s%-10s\n" % (a,b,c,d,e)
to print the header and call the functions like this to print the complete result:
fmt_print('JOB ID','OWNER','E_DATE','R_HOST','QUEUE')
job_history(inFile)
My question is: how can I make fmt_print() to print both the header and the result using the values LJ,LU,LE,LH,LQ for the format spacing. the job_history() will parse a number of log files from the log-directory. The length of the field of similar type will differ from file to file and I don't wanna go static with the spacing (assuming the max length per field) for this as there gonna be lot more columns to print (than the example). Thanks in advance for your help. Cheers!!
PS. For those who know my posts: I don't have to use python v2.3 anymore. I can use even v2.6 but I want my code to be v2.4 compatible to go with RHEL5 default.
Update: 1
I had a fundamental problem in my original script. As I mentioned above that the job_history() will read the multiple files in a directory in a loop, the max_len were being calculated per file and not for the entire result. After modifying
unutbu's script a little bit and following xtofl's (if this is what it meant) suggestion, I came up with this, which seems to be working.
def job_history(f):
result=[]
for line in lines:
if line.find('Exit_status=') != -1:
....
....
global LJ,LU,LH,LQ,LE
LJ = max(LJ, len(jobID))
LU = max(LU, len(jUsr))
LH = max(LH, len(jHst))
LQ = max(LQ, len(jQue))
LE = max(LE, len(eDate))
result.append((jobID,jUsr,eDate,jHst,jQue))
return LJ,LU,LH,LQ,LE,result
# list of log files
inFiles = [ m for m in os.listdir(logDir) ]
saved_ary = []
for inFile in sorted(inFiles):
LJ,LU,LE,LH,LQ,result = job_history(inFile)
saved_ary += result
# format printing
fmt_print = "%%-%ds %%-%ds %%-%ds %%-%ds %%-%ds" % (LJ,LU,LE,LH,LQ)
print_head = fmt_print % ('Job Id','User','End Date','Exec Host','Queue')
print '%s\n%s' % (print_head, len(print_head)*'-')
for lines in saved_ary:
print fmt_print % lines
I'm sure there are lot other better ways of doing this, so suggestion(s) are welcomed. cheers!!
Update: 2
Sorry for brining up this "solved" post again. Later discovered, I was even wrong with my updated script, so I thought I'd post another update for future reference. Even though it appeared to be working, actually length_data were overwritten with the new one for every file in the loop. This works correctly now.
def job_history(f):
def get_value(j,n):
return j[n].split('=')[1]
lines = read_file(f)
for line in lines:
if "Exit_status=" in line:
nLine = line.split(';')
jobID = '.'.join(nLine[2].split('.',2)[:-1]
jData = nLine[3].split(' ')
jUsr = get_value(jData,0)
....
result.append((jobID,jUsr,...,....,...))
return result
# list of log files
inFiles = [ m for m in os.listdir(logDir) ]
saved_ary = []
LJ = 0; LU = 0; LE = 0; LH = 0; LQ = 0
for inFile in sorted(inFiles):
j_data = job_history(inFile)
saved_ary += j_data
for ix in range(len(saved_ary)):
LJ = max(LJ, len(saved_ary[ix][0]))
LU = max(LU, len(saved_ary[ix][1]))
....
# format printing
fmt_print = "%%-%ds %%-%ds %%-%ds %%-%ds %%-%ds" % (LJ,LU,LE,LH,LQ)
print_head = fmt_print % ('Job Id','User','End Date','Exec Host','Queue')
print '%s\n%s' % (print_head, len(print_head)*'-')
for lines in saved_ary:
print fmt_print % lines
The only problem is it's taking a bit of time to start printing the info on the screen, just because, I think, as it's putting all the in the array first and then printing. Is there any why can it be improved? Cheers!!

Since you don't know LJ, LU, LH, LQ, LE until the for-loop ends, you have to complete this for-loop before you print.
result=[]
for line in lines:
if line.find('Exit_status=') != -1:
...
LJ = max(LJ, len(jobID))
LU = max(LU, len(jUsr))
LH = max(LH, len(jHst))
LQ = max(LQ, len(jQue))
LE = max(LE, len(eDate))
result.append((jobID,jUsr,eDate,jHst,jQue))
fmt="%%-%ss%%-%ss%%-%ss%%-%ss%%-%ss"%(LJ,LU,LE,LH,LQ)
for jobID,jUsr,eDate,jHst,jQue in result:
print fmt % (jobID,jUsr,eDate,jHst,jQue)
The fmt line is a bit tricky. When you use string interpolation, each %s gets replaced by a number, and %% gets replaced by a single %. This prepares the correct format for the subsequent print statements.

Since the column header and column content are so closely related, why not couple them into one structure, and return an array of 'columns' from your job_history function? The task of that function would be to
output the header for each colum
create the output for each line, into the corresponding column
remember the maximum width for each column, and store it in the column struct
Then, the prinf_fmt function can 'just'
iterate over the column headers, and print them using the respective width
iterate over the 'rest' of the output, printing each cell with 'the respective width'
This design will separate output definition from actual formatting.
This is the general idea. My python is not that good; but I may think up some example code later...

Depending on how many lines are there, you could:
read everything first to figure out the maximum field lengths, then go through the lines again to actually print out the results (if you have only a handful of lines)
read one page of results at a time and figure out maximum length for the next 30 or so results (if you can handle the delay and have many lines)
don't care about the format and output in a csv or some database format instead - let the final person / actual report generator worry about importing it

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting certain columns from multiple files simultaneously by Python - python

You clear T = [] for every file. Move T = [] line before first loop.

Related

How to store different values in different variables in python?

python structured array composition and transformation

Python: Writing peoples scores to individual lines

How do i format the ouput of a list of list into a textfile properly?

reusable function: substituting the values returned by another function

Categories

Resources