Split txt file into tables - input in non-uniform format - Python

I'm trying to read a file into tables I can work with. I have one input file that contains 4 tables of coefficients. Each table begins with a line describing its contents, and contains 25 numbers for each latitude (from -85 to 85) and each month. I would like to split the input file into a matrix like TAB(4,12,18,25): 4 tables, 12 months, 18 latitudes and 25 levels. It's pretty messy, as there is no fixed separator: sometimes I could split on spaces, but later on there are negative values and the space is taken up by the minus sign.
O2 CLIMATOLOGY : k*=11, 12, ..., 35
JAN -85 O2 cli 2.452E-07-8.040E-07 8.850E-07 7.970E-07 7.875E-06 8.494E-06\n
5.082E-06 4.159E-06-5.252E-06 5.892E-06 7.188E-06-7.641E-06 5.082E-06 5.350E-06\n
5.380E-06 5.079E-06 4.229E-06-3.367E-06-2.600E-06 2.043E-06-1.706E-06 7.413E-06\n
1.158E-06 9.480E-07 7.570E-07\n
JAN -75 O2 cli 2.300E-07 3.020E-07 4.760E-07 9.210E-07 1.729E-06 2.486E-06\n
3.163E-06 3.668E-06 3.838E-06 3.993E-06 4.401E-06 4.911E-06 5.304E-06 5.506E-06\n
.
.
.
TEMPERATURE CLIMATOLOGY : Z*=11, 12, ..., 35
JAN -85 T clim 2.278E+02 2.303E+02 2.323E+02 2.334E+02 2.340E+02 2.344E+02\n
It is Fortran model output. I tried readlines and split. In tables with a space delimiter this worked well, but in the others, where the space is taken by the minus sign, it does not work.
I am not used to working with such data and have no more ideas on how to proceed.

It looks to me like you have a fixed-length format for your data.
Each section of your file is identified with a title. Within a section, you have a number of subsections. Each subsection conforms to the following format:
Seven characters for your month: JAN, FEB, JUL, etc.
Five characters for your latitude
I don't know what this third field is, or if you want it, but it requires 8 characters.
Each of your 25 values is given a width of 10 characters.
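For example, slicing a line at those offsets separates the fields cleanly even where a minus sign abuts the previous value. Here is a minimal check; note the spacing in this sample line is reconstructed to match the widths above, since the question's pasted text has its whitespace collapsed:
line = "JAN      -85 O2 cli  2.452E-07-8.040E-07 8.850E-07"

month    = line[:7].strip()    # 'JAN'
latitude = int(line[7:12])     # -85
# values start at column 20 and are 10 characters wide each
values = [float(line[k:k + 10]) for k in range(20, len(line), 10)]
print(month, latitude, values) # JAN -85 [2.452e-07, -8.04e-07, 8.85e-07]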
Before we construct an algorithm, let's create a class to hold our data. You might want to use a pandas dataframe for this but I'll use standard Python because that's easier to work with for people who don't already know pandas:
class TableData:
    def __init__(self, name, month, latitude, levels):
        self.name = name
        self.month = month
        self.latitude = latitude
        self.levels = levels
From this we can construct an algorithm:
# First, set up some constants for the data file
num_tables = 4
num_months = 12
num_latitudes = 18
num_levels = 25
month_stop = 7
latitude_stop = 12
col3_stop = 20
level_width = 10
lines_per_entry = 4                      # each entry spans 4 physical lines
table_size = num_months * num_latitudes  # entries per table

# Helper function to extract levels data from a single line of the table
def get_levels(line, start):
    return [float(line[k:k + level_width]) for k in range(start, len(line), level_width)]

# Next, read in the raw data, separated by newlines
with open('data', 'r') as f:
    lines = f.readlines()

# Remove lines with no data in them
lines = [l for l in lines if l.strip()]

# Now, iterate over the entire file and convert it to the proper format
tables = {}
index = 0
for table in range(num_tables):
    # First, get the name of the table
    name = lines[index].strip()
    # Next, iterate over all the entries in the table; each entry starts
    # lines_per_entry lines after the previous one
    data = []
    for i in range(index + 1, index + 1 + lines_per_entry * table_size, lines_per_entry):
        # First, get the month associated with the data
        month = lines[i][:month_stop].strip()
        # Next, get the latitude associated with the data
        latitude = int(lines[i][month_stop:latitude_stop])
        # Now, get the levels by splitting the remainder of this line, and
        # each of the following three lines, into chunks equivalent to the
        # width of the level column, converting them to floating-point
        # numbers, and combining everything into a single list
        levels = []
        for j in range(lines_per_entry):
            levels.extend(get_levels(lines[i + j].rstrip(), col3_stop if j == 0 else 0))
        # Finally, create an instance of our table data and add it to the
        # list of data for the table
        data.append(TableData(name, month, latitude, levels))
    # Finally, associate the data with the name, and advance the index to
    # the title line of the next table
    tables[name] = data
    index += 1 + lines_per_entry * table_size
This algorithm is pretty brittle as it relies on your file being structured exactly as you described, but it'll do the job if that's the case.
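For instance, once the file is parsed, the first entry of the O2 table could be inspected like this (a sketch, assuming the title line shown in the question):
o2 = tables['O2 CLIMATOLOGY : k*=11, 12, ..., 35']
first = o2[0]
print(first.month, first.latitude, first.levels[0])  # JAN -85 2.452e-07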

Related

Working with Two Different Input Files --- example: Hourly Data and Daily Data (with different lengths)

I'm working on some code to manipulate hourly and daily data for a year, and am a little confused about how to combine data from the two files. What I am doing is using the hourly pattern of Data Set B but scaling it using Daily Set A. So, in essence (using the example below), I take the daily average from Data Set A of 93 cfs and multiply it by 24 hours in a day, which equals 2232. I then sum the hourly cfs values for all 24 hours of each day in Data Set B, which for 1/1/2021 equals 2596. Normally manipulating a rate in these ways doesn't make sense, but in this case it doesn't matter because the units cancel out. I then need to divide these values by each other, 2232/2596 = 0.8597, and apply that factor to the hourly cfs values for all 24 hours of each day in Data Set B, producing a new "scaled" dataset (to be Data Set C).
My problem is that I have never coded in Python using two different input datasets (I am a complete newbie). I started experimenting with the code, but the problem is I can't seem to integrate the two datasets. If anyone can point me in the direction of how to integrate two separate input files, I'd be most appreciative. Beneath the datasets are my attempts at the code (please note the reverse order of the code: working first with the hourly data, Data Set B, and then the daily data, Data Set A). My printout of the final scaling factor (SF) is only giving me one result, not all 8,760, because I'm not in the loop... but how can I be in the loop of both input files at the same time?
Data Set A (Daily) -- 365 lines of data:
1/1/2021 93 cfs
1/2/2021 0 cfs
1/3/2021 70 cfs
1/4/2021 70 cfs
Data Set B (Hourly) -- 8,760 lines of data:
1/1/2021 0:00 150 cfs
1/1/2021 1:00 0 cfs
1/1/2021 2:00 255 cfs
(where summation of all 24 hrs of 1/1/2021 = 2596 cfs)
etc.
Sorry if this is a ridiculously easy question... I am very new to coding.
Here is the code that I've written so far. What I need is 8,760 lines of SF that I can then use to multiply the original Data Set B. The final product, Data Set C, will be Date - Time - rescaled hourly data. I actually have to do this for three pumping units in total, to give me a matrix of 5 columns by 8,760 rows, but I think I'll be able to figure the unit thing out. My problem now is how to integrate the two data sets. Thank you for reading!
print('Solving the Temperature Model programming problem')

fhand1 = open('Interpolate_CY21_short.txt')
fhand2 = open('WSE_Daily_CY21_short.txt')

#Hourly Interpolated Pardee PowerHouse Data
for line1 in fhand1:
    line1 = line1.rstrip()
    words1 = line1.split()
    #Hourly interpolated data - parsed down (cfs)
    x = float(words1[7])
    if x<100:
        x = 0
    #print(x)

#WSE Daily Average PowerHouse Data
for line2 in fhand2:
    line2 = line2.rstrip()
    words2 = line2.split()
    #Daily cfs average x 24 hrs
    aa = float(words2[2])*24
    #print(aa)

SF = x * aa
print(SF)
This is how you would get the data into two lists:
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
daily = fhand1.readlines()          # the hourly file
daily_average = fhand2.readlines()  # the daily file

# this is roughly what the two lists would look like;
# each line would be a separate string
daily_average = ["1/1/2021 93 cfs", "1/2/2021 0 cfs"]
daily = ["1/1/2021 0:00 150 cfs", "1/1/2021 1:00 0 cfs", "1/2/2021 1:00 0 cfs"]
Then, to process the lists, you could use a nested for loop (quadratic in the number of lines, but fine at this size):
for average_line in daily_average:
    average_line = average_line.rstrip()
    average_date, average_count, average_symbol = average_line.split()
    for daily_line in daily:
        daily_line = daily_line.rstrip()
        date, hour, count, symbol = daily_line.split()
        if average_date == date:
            print(f"date={date}, average_count={average_count} count={count}")
Or a dictionary
# populate data into dictionaries
daily_average_data = dict()
for line in daily_average:
    line = line.rstrip()
    day, count, symbol = line.split()
    daily_average_data[day] = (day, count, symbol)

daily_data = dict()
for line in daily:
    line = line.rstrip()
    day, hour, count, symbol = line.split()
    if day not in daily_data:
        daily_data[day] = list()
    daily_data[day].append((day, hour, count, symbol))

# now you can access daily_average_data and daily_data as
# dictionaries instead of files

# process data
result = list()
for date in daily_data.keys():
    print(date)
    print(daily_average_data[date])
    print(daily_data[date])
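From there, one way to get the scaling factor the question describes (daily average x 24, divided by that day's hourly sum, then applied to each hour) might look like this. This is only a sketch; it assumes every date in daily_data also appears in daily_average_data, and that no day's hourly total is zero:
scaled = dict()
for day, hours in daily_data.items():
    _, daily_count, _ = daily_average_data[day]
    # sum the hourly cfs values for this day
    day_total = sum(float(count) for _, _, count, _ in hours)
    # e.g. 2232 / 2596 = 0.8597... (assumes day_total is nonzero)
    factor = float(daily_count) * 24 / day_total
    # apply the factor to every hour of the day
    scaled[day] = [(hour, float(count) * factor) for _, hour, count, _ in hours]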
If the data items corresponded with one another line by line, you could use zip (see https://realpython.com/python-zip-function/). Here is an example:
for data1, data2 in zip(daily_average, daily):
    print(f"{data1} {data2}")
Similar to what #oasispolo described, the solution is to make a single loop and process both lists in it. I'm personally not fond of the zip function. (It's a purely stylistic objection; lots of other people like it, and that's fine.)
Here's a solution with syntax that I find more intuitive:
print('Solving the Temperature Model programming problem')

fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')

# Convert each file into a list of lines. You're doing this
# implicitly, but I like to be explicit about it.
lines1 = fhand1.readlines()
lines2 = fhand2.readlines()
if len(lines1) != len(lines2):
    raise ValueError("The two files have different lengths!")

# Initialize an output array. You could also construct it
# one item at a time, but that can be slow for large arrays.
# It is more efficient to initialize the entire array at
# once if possible.
sf_list = [0]*len(lines1)

for position in range(len(lines1)):
    # range(L) generates numbers 0...L-1
    line1 = lines1[position].rstrip()
    words1 = line1.split()
    x = float(words1[7])
    if x<100:
        x = 0
    line2 = lines2[position].rstrip()
    words2 = line2.split()
    aa = float(words2[2])*24
    sf_list[position] = x * aa

print(sf_list)

Selecting specific data from a file based on a comparison, and writing it to a list

Basically I am reading different files that contain two columns of values. What I'm trying to do is get the data pairs (data located in the same row) at every increase of a certain value in the second column. The point is that there are repeated numbers in the second column that I would like to avoid. The data is in a .txt file and the columns are separated by a space ' '.
The data pairs look like: 46.68 255 ; 46.81 226 ; 46.94 225 ; 47.07 225 ; 47.22 225 (but every pair is on its own row).
So basically, if the value of the second column equals 10 (column2[i] = 10), I want to get the data pair where the value of the second column equals 10+increase (column2[j] = 10+increase), and then append column1[j] and column2[j].
for filename in txtfiles:
    X, y_pixel = np.loadtxt(filename, unpack=True, usecols=(0,1))
    var = y_pixel[0]
    goodx, goody = [], []
    for k in range(len(X)):
        if y_pixel[k] <= var+25.:
            var = y_pixel[k]
            goody.append(var)
            goodx.append(X[k])
    X, y = np.asarray(goodx), np.asarray(goody)
So if it was a list of numbers from 0 to 100, what I would like to get is a list with the numbers: 0 ; 25 ; 50 ; 75 ; 100
And what I am getting now is the whole dataset.
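For what it's worth, a minimal sketch of that selection: keep a pair only once the second column has moved by at least the chosen step since the last kept pair. Using abs() covers both a rising column (the 0 to 100 example) and a falling one (the 255, 226, 225... sample); np.loadtxt is used as in the question, and txtfiles is assumed to come from glob:
import glob
import numpy as np

increase = 25.0
txtfiles = glob.glob('*.txt')  # as in the question

for filename in txtfiles:
    X, y_pixel = np.loadtxt(filename, unpack=True, usecols=(0, 1))
    goodx, goody = [X[0]], [y_pixel[0]]  # always keep the first pair
    last = y_pixel[0]
    for k in range(1, len(X)):
        # keep the pair only when the column has moved by at least `increase`
        if abs(y_pixel[k] - last) >= increase:
            goodx.append(X[k])
            goody.append(y_pixel[k])
            last = y_pixel[k]
    X, y = np.asarray(goodx), np.asarray(goody)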

How to pre-process very large data in Python

I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(note the actual file does not have the alignment spaces added in; only one space separates each element. Alignment was added for aesthetic effect)
The first element in each row is its binary classification, and the rest of the row are the indices of the features whose value is 1. For instance, the third row says the row's second, first, fifth and sixth features are 1; the rest are zeros.
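For illustration, one row can be expanded into a dense 0/1 vector like this (a sketch; the feature indices appear to be 1-based, and the width of 8 is chosen arbitrarily):
line = "0 2 1 5 6"                 # third row from the sample above
fields = line.split()
label = int(fields[0])
dense = [0] * 8                    # assumed number of features
for idx in fields[1:]:
    dense[int(idx) - 1] = 1        # 1-based index -> 0-based position
print(label, dense)                # 0 [1, 1, 0, 0, 1, 1, 0, 0]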
I tried to read each line from each file, and use sparse.coo_matrix to create a sparse matrix like this:
for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row+[index]*(len(record)-4)
            col = col+record[4:]
        row = np.array(row)
        col = np.array(col)
        data = np.array([1]*len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train+'trans',mtx)
but this took forever to finish. I started reading the data at night and let the computer run after I went to sleep, and when I woke up, it still hadn't finished the first file!
What are the better ways to process this kind of data?
I think this would be a bit faster than your method because it does not read the file line by line. You can try this code with a small portion of one file and compare it with your code.
This code also requires knowing the number of features in advance. If we don't know the feature number, we would need the alternative line of code that is commented out below.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial

def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1
    col_ind = row.dropna().values - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1

def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0,col_n+2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below,
    # but it would not be the same across different files
    # col_n = df.max().max()
    # Number of rows
    row_n = len(label)
    # Generate the feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save the features in the matrix, one row at a time
    # (DataFrame.apply() is usually faster than normal looping;
    # axis=1 passes each row, which is what writeMx expects)
    df.apply(partial(writeMx, result), axis=1)
    return(result)

for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of the matrix and the number of nonzero values
    # ((420, 136), 15)
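To mirror the saving step from the question, each matrix can also be written out inside the same loop; scipy.io.mmwrite accepts the sparse matrix directly:
from scipy.io import mmwrite

for train in train_files:
    result = fileToMx(train)
    # Matrix Market output, matching the question's mmwrite(train+'trans', mtx)
    mmwrite(train + 'trans', result)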

Python image file manipulation

Python beginner here. I am trying to make use of some data stored in a dictionary.
I have some .npy files in a folder. My intention is to build a dictionary that encapsulates the following: the map read with np.load; the year, month, and day of the current map (as integers); the fractional time in years (given that a month has 30 days - it does not affect my calculations afterwards); the number of pixels; and the number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0': (array from np.load, year, month, day, fractional_time, pixels),
 'map1': ...}
What I managed until now is the following:
import glob

file_list = glob.glob('*.npy')

def only_numbers(seq): # for getting rid of any '.npy' or any other string
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

maps = {}
for i in range(0, len(file_list)-1):
    maps[i] = np.load(file_list[i])
    numbers[i] = list(only_numbers(file_list[i]))
I have no idea how to get a dictionary to hold more values inside the for loop; I can only manage to generate a new dictionary or list (e.g. numbers) for every task. For the numbers dictionary, I have no idea how to manipulate the date in the YYYYMMDD format to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: I managed to split the dates and make them integers using:
date = list(numbers.values())
year = int(date[i][0:4])
month = int(date[i][4:6])
day = int(date[i][6:8])
print(year, month, day)
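A standard-library alternative for the date split, in case it's useful (assuming file names like 20100620.npy, as in the snippet above):
from datetime import datetime

stem = only_numbers('20100620.npy')   # -> '20100620'
d = datetime.strptime(stem, '%Y%m%d')
print(d.year, d.month, d.day)         # 2010 6 20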
If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed to manipulate things further more easily. I did the following:
import glob
import numpy as np
import matplotlib.pyplot as plt

file_list = glob.glob('data/...')  # files named YYYYMMDD.npy
file_list.sort()

def only_numbers(seq): # remove all characters and symbols from the file name
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))

numbers = {}
time = []
np_above_value = []

for i in range(len(file_list)):  # note: range(0, len(file_list) - 1) would skip the last file
    maps = np.load(file_list[i])
    maps[np.isnan(maps)] = 0  # had some NaNs and was getting some errors
    # dictionary of file names reduced to their dates, via the function defined earlier
    numbers[i] = only_numbers(file_list[i])
    # the file names (only the numbers) as a list
    date = list(numbers.values())
    year = int(date[i][0:4])   # first 4 characters (YYYY) as an integer
    month = int(date[i][4:6])  # next 2 characters (MM)
    day = int(date[i][6:8])    # next 2 characters (DD)
    time.append(year + ((month - 1) * 30 + day) / 360)  # fractional time
    # total number of pixels for the current map
    print('Total pixel count for map ' + str(i) + ':', maps.size)
    c = (maps > value).astype(int)  # `value` is the chosen threshold
    # number of pixels above the threshold for the current map
    np_above_value.append(np.count_nonzero(c))
    print('Pixels with concentration >value% for map ' + str(i) + ':', np.count_nonzero(c))

plt.plot(time, np_above_value)  # pixels above the threshold as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)

How to plot an output of a function in Python?

These three functions give me the progression of the number of customers and their orders from state 0 through the next 365 states (days). In the function state_evolution, I want to plot the output of the line
custA = float(custA*1.09**(1.0/365))
against the output of the line
A = sum(80 + random.random() * 50 for i in range(ordsA))
and do the same for custB, so I can compare their outputs graphically.
def get_state0():
    """ function gets four columns from base data and finds their state 0"""
    statetype0 = {'custt':{'typeA':100,'typeB':200}}
    orderstype0 = {'orders':{'typeA':1095, 'typeB':4380}}
    return {'custtypeA' : int(statetype0['custt']['typeA']),
            'custtypeB' : int(statetype0['custt']['typeB']),
            'ordstypeA': orderstype0['orders']['typeA'], 'A':1095, 'B':4380,
            'ordstypeB': orderstype0['orders']['typeB'],
            'day':0 }

def state_evolution(state):
    """function takes state 0 and predicts state evolution """
    custA = state['custtypeA']
    custB = state['custtypeB']
    ordsA = state['ordstypeA']
    ordsB = state['ordstypeB']
    A = state['A']
    B = state['B']
    day = state['day']
    # evolve day
    day += 1
    # evolve cust type A
    custA = float(custA*1.09**(1.0/365))
    # evolve cust type B
    custB = float(custB*1.063**(1.0/365))
    # evolve orders cust type A
    ordsA += int(custA * order_rateA(day))
    A = sum(80 + random.random() * 50 for i in range(ordsA))
    # evolve orders cust type B
    ordsB += int(custB * order_rateB(day))
    B = sum(70 + random.random() * 40 for i in range(ordsB))
    return {'custtypeA':custA, 'ordstypeA':ordsA, 'A':A, 'B':B,
            'custtypeB':custB, 'ordstypeB':ordsB, 'day': day}

def show_all_states():
    """ function runs state evolution function to find other states"""
    s = get_state0()
    for day in range(365):
        s = state_evolution(s)
        print(day, s)
You should do the following:
Modify your code so that the custA values are collected into a sequence (list, tuple) with, say, 365 items; for instance, use custA inside a list comprehension or for loop to get the sequence of 365 results.
Do the same for ordsA, to get the other sequence (a sketch follows below).
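Concretely, collecting the two sequences from the existing loop might look like this (a sketch; the names follow the question's code, which must define order_rateA and import random):
custA_result_list = []
ordsA_result_list = []
s = get_state0()
for day in range(365):
    s = state_evolution(s)
    custA_result_list.append(s['custtypeA'])
    ordsA_result_list.append(s['ordstypeA'])  # or s['A'] for the A = sum(...) output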
From now on, you can do different things depending on what you want.
If you want two plots in parallel (superimposed), then:
from matplotlib import pyplot

pyplot.plot(custA_result_list)
pyplot.plot(ordsA_result_list)
pyplot.show()
If you want to CORRELATE the data, you can do a scatter plot (faster), or a regular plot with dot markers (slower but more customizable IMO):
pyplot.scatter(custA_result_list, ordsA_result_list)
# or
pyplot.plot(custA_result_list, ordsA_result_list, 'o')
## THIS WILL ONLY WORK IF BOTH SEQUENCES HAVE SAME LENGTH! (e.g. 365 elements each)
Lastly, if the data were irregularly sampled, you could also provide a sequence for the horizontal axis values. That would allow, for example, plotting only weekdays' results without "collapsing" the weekends (otherwise the gap between Friday and Monday would look like a single day):
weekdays = [1,2,3,4,5, 8,9,10,11,12, 15,16,17,18,19, 22, ...] # len(weekdays) ~ 260
pyplot.plot(weekdays, custA_result_list)
pyplot.plot(weekdays, ordsA_result_list)
pyplot.show()
Hope this helps!
EDIT: About Excel: if I understand right, you ALREADY have a csv file. Then you could use the csv Python module, or read it yourself like this:
with open('file.csv') as csv_in:
    content = [line.strip().split(',') for line in csv_in]
Now, if you have an actual .xls or .xlsx file, use the xlrd module, which you can install by running pip install xlrd in a command prompt.
