reading variable number of columns with Python

I need to read a variable number of columns from my input file (the number of columns is defined by the user, with no upper limit). For every column I have multiple variables to read, three in my case, also set by the user.
The file to read looks like this:
2 3 5
6 7 9
3 6 8
In Fortran this is really easy to do:
      DO 180 I=1,NMOD
        READ(10,*) QARR(I),RARR(I),WARR(I)
  180 CONTINUE
NMOD is defined by the user, as are all the values in the example. All of them are input parameters to be stored in memory. This way I can store every variable I need and recall any of them later by changing the index I. How can I obtain the same result in Python?

Example file 'text'
2 3 5
6 7 9
3 6 8
Python code:
data = []
with open('text') as file:
    columns_to_read = 1  # how many columns to read per line
    for line in file:
        data.append(list(map(int, line.split()[:columns_to_read])))
print(data)  # prints: [[2], [6], [3]]
data holds a list of lists, one inner list per line of the file.
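To mirror the Fortran arrays from the question (QARR, RARR, WARR), one option is to read the rows and then transpose them into one list per variable. The following is only a minimal sketch: the helper name read_columns and the temporary demo file are made up for illustration, and it assumes whitespace-separated integers with the same number of values on every line.

```python
import os
import tempfile


def read_columns(path, n_rows=None):
    """Return one list per column; n_rows plays the role of NMOD (None = all rows)."""
    rows = []
    with open(path) as handle:
        for line in handle:
            if n_rows is not None and len(rows) >= n_rows:
                break
            if line.strip():  # skip blank lines
                rows.append([int(value) for value in line.split()])
    # zip(*rows) transposes the list of rows into columns
    return [list(column) for column in zip(*rows)]


# Demo with the sample data from the question, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("2 3 5\n6 7 9\n3 6 8\n")

qarr, rarr, warr = read_columns(tmp.name)
os.remove(tmp.name)

print(qarr, rarr, warr)
```

Each list can then be indexed like the Fortran arrays, keeping in mind that Python indices start at 0, so qarr[0] corresponds to QARR(1).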

from itertools import islice
with open('file.txt', 'rt') as f:
    # default slice: from row 0 to the end with step 1
    # example: islice(f, 10, 20, 2) takes only rows 10, 12, 14, 16, 18
    dat = islice(f, 0, None, 1)
    column = None  # number of columns to keep; None keeps all of them
    # this keeps the list values as strings
    # mylist = [i.split() for i in dat]
    # this converts the list values to int
    mylist = [[int(j) for j in i.split()[:column]] for i in dat]
The code above constructs a 2D list.
Access an element with mylist[row][column]; both indices are zero-based.
Example: mylist[2][1] accesses the third row, second column.
Edit: improved code efficiency following suggestions from Guillaume and Javier.

Related

How to pre-process a very large data in python

I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(Note: the actual file does not contain the alignment spaces; only a single space separates each element. The alignment was added for readability.)
The first element in each row is its binary classification, and the rest of the row are the indices of the features whose value is 1. For instance, the third row says that the row's second, first, fifth and sixth features are 1 and the rest are zeros.
I tried to read each line from each file and use sparse.coo_matrix to create a sparse matrix like this:
for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row+[index]*(len(record)-4)
            col = col+record[4:]
        row = np.array(row)
        col = np.array(col)
        data = np.array([1]*len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train+'trans',mtx)
but this took forever to finish. I started reading the data at night and let the computer run while I slept, and when I woke up it still hadn't finished the first file!
What are better ways to process this kind of data?

I think this would be a bit faster than your method because it does not read the file line by line. You can try this code on a small portion of one file and compare it with your code.
This code also requires knowing the number of features in advance. If it is not known, it can be computed with one more line of code, which is commented out below.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial
def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1;
    # cast to int because the NaN padding makes the values floats
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1
def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0, col_n + 2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below
    # But it would not be the same across different files
    # col_n = df.max().max()
    # Number of rows
    row_n = len(label)
    # Generate feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save features in matrix
    # DataFrame.apply() is usually faster than normal looping
    # axis=1 passes each row of the DataFrame to writeMx
    df.apply(partial(writeMx, result), axis=1)
    return result
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of matrix and number of nonzero values
    # ((420, 136), 15)
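A likely cause of the slowness in the question's loop is the repeated list concatenation: row = row + [...] builds a new list on every line, so the total work grows roughly quadratically with file size. The sketch below stays close to that loop but grows the lists in place with extend. The helper name parse_sparse_lines is hypothetical, and it assumes the format described in the question (label first, then 1-based feature indices); building the matrix itself is left to the caller.

```python
def parse_sparse_lines(lines):
    """Collect labels and (row, col) coordinates of the features set to 1."""
    labels, rows, cols = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        row_index = len(labels)  # index of the record being added
        labels.append(int(fields[0]))
        features = [int(field) - 1 for field in fields[1:]]  # to 0-based
        rows.extend([row_index] * len(features))  # grows in place, no copy
        cols.extend(features)
    return labels, rows, cols


# The three sample rows from the question; a file object works the same way.
sample = ["0 1 2 5 8 67 9 122", "1 4 5 2 5 8", "0 2 1 5 6"]
labels, rows, cols = parse_sparse_lines(sample)
print(labels, len(rows), cols[-4:])
```

The resulting rows and cols lists, together with an array of ones, can then be passed to sparse.coo_matrix as in the question.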

Python - average of unique values

I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years so the same dates occur multiple times. I would like to average the temperature so that I get a new table where each date is only occurring once and has the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.

You can use pandas and its groupby method, where df is your data frame:
df.groupby('DATE').mean()
Here is a toy example to illustrate the behaviour:
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
This results in
     b
a
1  2.5
2  3.5
3  4.5
when the original dataframe was
   a  b
0  1  1
1  2  2
2  3  3
3  1  4
4  2  5
5  3  6

If you can use defaultdict from the collections module, it makes this type of thing pretty easy.
Assuming your list is in the same directory as the Python script and looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
# test.py
# usage: python test.py list.csv
import sys
from collections import defaultdict
# Open the file named by the first command-line argument
with open(sys.argv[1]) as File:
    # Skip the first line of the file (the "DATE,TEMP" header)
    next(File)
    # Create a dictionary of lists
    ourDict = defaultdict(list)
    # Parse the file, line by line
    for each in File:
        # Split the line by a comma,
        # or whatever separates the values (Comma Separated Values = CSV)
        each = each.split(',')
        # Now each[0] is a date, and each[1] is a value.
        # We use each[0] as the key, and append values to the list
        ourDict[each[0]].append(float(each[1]))
print("Date\tValue")
for key, value in ourDict.items():
    # Average is the sum of the values of all members of the list
    # divided by the list's length
    print(key, '\t', sum(value) / len(value))
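Since the question asks for a new table rather than printed output, the same grouping idea can also be sketched with the standard csv and statistics modules. This is only an illustration: the function name average_by_date and the AVG_TEMP column name are made up, the sample is a shortened version of the data in the question, and the file is assumed to start with a DATE,TEMP header.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_by_date(csv_file):
    """Return {date: mean temperature} for a file-like object with DATE,TEMP columns."""
    temps = defaultdict(list)
    for record in csv.DictReader(csv_file):
        temps[record["DATE"]].append(float(record["TEMP"]))
    # Sort by date so the output table has a stable order
    return {date: mean(values) for date, values in sorted(temps.items())}


# Shortened sample; io.StringIO stands in for a file opened with open().
sample = io.StringIO(
    "DATE,TEMP\n0101,39.0\n0102,40.9\n0103,44.4\n"
    "0101,41.0\n0102,39.9\n0103,44.6\n"
)
averages = average_by_date(sample)

# Write the result as a new table, one row per date.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["DATE", "AVG_TEMP"])
for date, value in averages.items():
    writer.writerow([date, f"{value:.2f}"])
print(output.getvalue())
```

Reading DATE as a string keeps the leading zeros of values such as 0101 intact.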

Specifying Range of Columns in Python

Specifically, I am asking how one would take several columns from a text file and input those into an array without specifying each column individually.
1 2 1 4.151E-12 4.553E-12 4.600E-12 4.852E-12 6.173E-12 7.756E-12 9.383E-12 1.096E-11 1.243E-11 1.379E-11 1.504E-11 1.619E-11 1.724E-11 2.139E-11 2.426E-11 2.791E-11 3.009E-11 3.152E-11 3.252E-11 3.326E-11 3.382E-11 3.426E-11 3.462E-11 3.572E-11 3.640E-11 3.698E-11 3.752E-11
2 3 1 1.433E-12 1.655E-12 1.907E-12 2.014E-12 2.282E-12 2.682E-12 3.159E-12 3.685E-12 4.246E-12 4.833E-12 5.440E-12 6.059E-12 6.688E-12 9.845E-12 1.285E-11 1.810E-11 2.238E-11 2.590E-11 2.886E-11 3.139E-11 3.359E-11 3.552E-11 3.724E-11 4.375E-11 4.832E-11 5.192E-11 5.486E-11
For example, I want the second column of this data set in an array by itself, and I want the third column in an array by itself. However, I want column four through the last column in arrays that are separated by column. I don't know how to do this without specifying each individual column.

Since you mentioned a text file, I'll assume the content is read from the file line by line:
with open("data.txt") as f:
    for line in f:
        data = line.split()
        # I want the second column of this data set in an array by itself
        second_column = data[1]
        # I want the third column in an array by itself
        third_column = data[2]
        # I want column four through the last column in arrays that are separated by column
        fourth_to_last_column = data[3:]
If you then print second_column, third_column and fourth_to_last_column for the first line of input, the output looks like this:
2
1
['4.151E-12', '4.553E-12', '4.600E-12', '4.852E-12', '6.173E-12', '7.756E-12', '9.383E-12', '1.096E-11', '1.243E-11', '1.379E-11', '1.504E-11', '1.619E-11', '1.724E-11', '2.139E-11', '2.426E-11', '2.791E-11', '3.009E-11', '3.152E-11', '3.252E-11', '3.326E-11', '3.382E-11', '3.426E-11', '3.462E-11', '3.572E-11', '3.640E-11', '3.698E-11', '3.752E-11']
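The snippet above handles one line at a time and keeps the values as strings. If the goal is one array per column across all rows, the rows can be transposed once everything has been read. The sketch below is one possible way to do that, assuming every row has the same number of fields; the helper name split_columns is hypothetical and the sample rows are shortened versions of the data in the question.

```python
def split_columns(lines):
    """Return (second column, third column, list of the remaining columns)."""
    rows = [line.split() for line in lines if line.strip()]
    columns = list(zip(*rows))  # transpose: one tuple per column
    second = [int(value) for value in columns[1]]
    third = [int(value) for value in columns[2]]
    # Columns four through the last, each as its own list of floats
    rest = [[float(value) for value in column] for column in columns[3:]]
    return second, third, rest


# Shortened sample rows; a file object can be passed in the same way.
sample = [
    "1 2 1 4.151E-12 4.553E-12 4.600E-12",
    "2 3 1 1.433E-12 1.655E-12 1.907E-12",
]
second, third, rest = split_columns(sample)
print(second, third, rest[0])
```

Here rest[0] holds the fourth column, rest[1] the fifth, and so on, so no column has to be named individually.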

How to iterate and print the list of dictionaries in xls using python

I want to read data from a database and convert it into a list of dictionaries to put into an XLS file for reporting.
I used Python for the report since it is easier for me to write with minimal programming knowledge.
I want to write a list of dictionaries nested within a list of dictionaries to an XLS file.
I tried to generate the XLS file, but I am not getting the correct result.
data1 = [{'a':1,'b':2,'c':[{'d':4,'e':5},{'d':8,'e':9}]},{'a':5,'b':3,'c':[{'d':8,'e':7},{'d':1,'e':3}]}]
# The output needs to look like this in Excel
A B D E
1 2
    4 5
    8 9
5 3
    8 7
    1 3
Here is the code I tried:
try:
    import xlwt
except Exception, e:
    raise osv.except_osv(_('User Error'), _('Please Install xlwt Library.!'))
filename = 'Report.xls'
string = 'enquiry'
worksheet = wb.add_sheet(string)
data1 = [{'a':1,'b':2,'c':[{'d':4,'e':5},{'d':8,'e':9}]},{'a':5,'b':3,'c':[{'d':8,'e':7},{'d':1,'e':3}]}]
i=0;j=0;m=0;
if data1:
    columns = sorted(list(data1[0].keys()))
    worksheet.write_merge(0, 0, 0, 9, 'Report')
    worksheet.write(2,0,"A")
    worksheet.write(2,1,"B")
    worksheet.write(2,2,"D")
    worksheet.write(2,3,"E")
    for i, row in enumerate(data1,3):
        for j, col in enumerate(columns):
            if type(row[col]) != list:
                worksheet.write(i+m, j, row[col], other_tstyle1)
            else:
                #if list then loop and group it in new cell
                if row[col] != []:
                    row_columns = sorted(list(row[col][0].keys()))
                    for k, row1 in enumerate(row[col],1):
                        for l, col1 in enumerate(row_columns):
                            worksheet.write(k+m+1, l+3, row1[col1])
                            #iteration of m for new row
                            m+=1
                        #m+=1
I got output like this
A B D E
1 2
4
5
8
9
5 3
8
7
1
3

I think it's because you have
m += 1
inside your inner for loop. So, for every element in c, you are putting it down one more row. (Your commented out line at the end was right.)
By the way, it's better to use meaningful variable names than just letters for variables (e.g. row_offset).
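One way to make the row bookkeeping explicit is to flatten the nested structure into a plain grid of rows first and write the cells afterwards, so that no running offset is needed. The sketch below is only an illustration under assumptions: flatten_rows is a hypothetical helper, it relies on the keys 'a', 'b', 'c', 'd' and 'e' from data1, and it prints the cell coordinates instead of calling xlwt.

```python
def flatten_rows(records):
    """Turn each record into one parent row followed by one row per child."""
    grid = []
    for record in records:
        grid.append([record["a"], record["b"], None, None])
        for child in record.get("c", []):
            grid.append([None, None, child["d"], child["e"]])
    return grid


data1 = [
    {"a": 1, "b": 2, "c": [{"d": 4, "e": 5}, {"d": 8, "e": 9}]},
    {"a": 5, "b": 3, "c": [{"d": 8, "e": 7}, {"d": 1, "e": 3}]},
]
grid = flatten_rows(data1)

# Data rows start at sheet row 3, directly below the header row.
for row_index, row in enumerate(grid, start=3):
    for col_index, value in enumerate(row):
        if value is not None:
            # Stand-in for worksheet.write(row_index, col_index, value)
            print(row_index, col_index, value)
```

Every grid row maps to exactly one sheet row, which matches the layout requested in the question.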

how to save a column vector in an iteration process to a text file

I need to save a column vector obtained in an iteration to a text file using python.
This is what I have been using till now
savetxt('displacement{0}.out'.format(globdat.cycle), a, delimiter=',', fmt='%10.4e',)
globdat.cycle is used as a count so that in each iteration a separate file is made.
Requirement: I do not need separate files, but a single file that contains the vectors from all iterations.
For example, with iteration 1 values = [ 1 2 3 4 5 6 ]' and iteration 2 values = [ a c v b f h ]',
my text file should look something like this:
1,a
2,c
3,v
4,b
5,f
6,h
I would much appreciate some help.
Thanks
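One possible approach, sketched here with the standard library only, is to collect each iteration's vector in a list and write the file once at the end, with one column per iteration. The helper name write_columns and the placeholder vectors are made up for illustration; the '%10.4e' format is taken from the savetxt call in the question, and all vectors are assumed to have the same length.

```python
import csv
import io


def write_columns(vectors, out_file, fmt="%10.4e"):
    """Write equally long vectors side by side, one column per iteration."""
    writer = csv.writer(out_file)
    for values in zip(*vectors):  # i-th entry of every vector forms one line
        writer.writerow([fmt % value for value in values])


vectors = []
for cycle in range(2):  # stand-in for the real iteration loop
    a = [float(i + 10 * cycle) for i in range(1, 7)]  # placeholder result
    vectors.append(a)

# io.StringIO stands in for a file opened with open(name, "w", newline="").
buffer = io.StringIO()
write_columns(vectors, buffer)
print(buffer.getvalue())
```

If the vectors are already NumPy arrays, stacking them with numpy.column_stack and calling savetxt once after the loop should give a similar result.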
