These three functions give me the progression of number of customers and their orders from state 0 to next 365 states (or days). In function state_evolution, I want to plot the output from line
custA = float(custA*1.09**(1.0/365))
against the output from line
A = sum(80 + random.random() * 50 for i in range(ordsA))
and do the same for custB so I can compare their outputs graphically.
def get_state0():
""" functions gets four columns from base data and finds their state 0"""
statetype0 = {'custt':{'typeA':100,'typeB':200}}
orderstype0 = {'orders':{'typeA':1095, 'typeB':4380}}
return {'custtypeA' : int(statetype0['custt']['typeA']),
'custtypeB' : int(statetype0['custt']['typeB']),
'ordstypeA': orderstype0['orders']['typeA'],'A':1095, 'B':4380,
'ordstypeB':orderstype0['orders']['typeB'],
'day':0 }
def state_evolution(state):
"""function takes state 0 and predicts state evolution """
custA = state['custtypeA']
custB = state['custtypeB']
ordsA = state['ordstypeA']
ordsB = state['ordstypeB']
A = state['A']
B = state['B']
day = state['day']
# evolve day
day += 1
#evolve cust typea
custA = float(custA*1.09**(1.0/365))
#evolve cust typeb
custB = float (custB*1.063**(1.0/365))
# evolve orders cust type A
ordsA += int(custA * order_rateA(day))
A = sum(80 + random.random() * 50 for i in range(ordsA))
# evolve orders cust type B
ordsB += int(custB * order_rateB(day))
B = sum(70 + random.random() * 40 for i in range(ordsB))
return {'custtypeA':custA ,'ordstypeA':ordsA, 'A':A, 'B':B,
'custtypeB':custB, 'ordstypeB':ordsB, 'day': day}
def show_all_states():
""" function runs state evolution function to find other states"""
s = get_state0()
for day in range(365):
s = state_evolution(s)
print day, s
You should do the following:
Modify your custA function so that it returns a sequence (list, tuple) with, say, 365 items. Alternatively, use custA inside a list comprehension or for loop to get the sequence of 365 results;
Do the same for ordsA function, to get the other sequence.
From now on, you can do different things depending on what you want.
If you want two plots in parallel (superimposed), then:
pyplot.plot(custA_result_list);
pyplot.plot(ordsA_result_list);
pyplot.show()
If you want to CORRELATE the data, you can do a scatter plot (faster), or a regular plot with dot markers (slower but more customizeable IMO):
pyplot.scatter(custA_result_list, ordsA_result_list)
# or
pyplot.plot(custA_result_list, ordsA_result_list, 'o')
## THIS WILL ONLY WORK IF BOTH SEQUENCES HAVE SAME LENGTH! (e.g. 365 elements each)
At last, if the data were irregularly sampled, you could also provide a sequence for the horizontal axis values. That would allow, for example, to plot only weekdays' results without "collapsing" the weekends (otherwise the gap between friday and monday would look like a single day):
weekdays = [1,2,3,4,5, 8,9,10,11,12, 15,16,17,18,19, 22, ...] # len(weekdays) ~ 260
pyplot.plot(weekdays, custA_result_list);
pyplot.plot(weekdays, ordsA_result_list);
pyplot.show()
Hope this helps!
EDIT: About excel, if I understand right you ALREADY have a csv file. Then, you could use the csv python module, or read it yourself like this:
with open('file.csv') as csv_in:
content = [line.strip().split(',') for line in csv_in]
Now if you have an actual .xls or .xlsx file, use the xlrd module that you can download here or by running pip install xlrd in a command prompt.
Related
I'm trying to read a file into tables I can work with. I have one input file that contains 4 tables with coefficients. Each table begins with a line which describes its contents. Each table contains 25 numbers for each latitude, from -85 to 85, and each month. I would like to split the input file into a matrix like TAB(4,12,18,25) - 4 tables, 12 months, 18 latitudes and 25 levels. It's pretty messy as I don't have fixed separator - sometimes I could use space but then later on there are negative values and the space is used for that.
O2 CLIMATOLOGY : k*=11, 12, ..., 35
JAN -85 O2 cli 2.452E-07-8.040E-07 8.850E-07 7.970E-07 7.875E-06 8.494E-06\n
5.082E-06 4.159E-06-5.252E-06 5.892E-06 7.188E-06-7.641E-06 5.082E-06 5.350E-06\n
5.380E-06 5.079E-06 4.229E-06-3.367E-06-2.600E-06 2.043E-06-1.706E-06 7.413E-06\n
1.158E-06 9.480E-07 7.570E-07\n
JAN -75 O2 cli 2.300E-07 3.020E-07 4.760E-07 9.210E-07 1.729E-06 2.486E-06\n
3.163E-06 3.668E-06 3.838E-06 3.993E-06 4.401E-06 4.911E-06 5.304E-06 5.506E-06\n
.
.
.
TEMPERATURE CLIMATOLOGY : Z*=11, 12, ..., 35
JAN -85 T clim 2.278E+02 2.303E+02 2.323E+02 2.334E+02 2.340E+02 2.344E+02\n
It is a fortran model output. I tried readlines and split. In tables with " " delimiter it worked well but in the other where the space is taken for the minus character, it is not working.
I am not used to work with such a data and have no more idea how to proceed.
It looks to me like you have a fixed-length format for your data.
Each section of your file is identified with a title. Within a section, you have a number of subsections. Each subsection conforms to the following format:
Seven characters for your month: JAN, FEB, JUL, etc.
Five characters for your latitude
I don't know what this third field is, or if you want it, but it requires 8 characters.
Each of your 25 values is given a width of 10 characters.
Before we construct an algorithm, let's create a class to hold our data. You might want to use a pandas dataframe for this but I'll use standard Python because that's easier to work with for people who don't already know pandas:
class TableData:
def __init__(self, name, month, latitude, levels):
self.name = name
self.month = month
self.latitude = latitude
self.levels = levels
From this we can construct an algorithm:
# First, setup some constants for the data file
num_tables = 4
num_months = 12
num_latitudes = 18
num_levels = 25
month_stop = 7
latitude_stop = 12
col3_stop = 20
level_width = 10
table_size = num_months * num_latitudes
# Next, read in the raw data, separated by newlines
lines = []
with open('data', 'r') as f:
lines = f.readlines()
# Remove lines with no data in them
lines = [l in lines if l]
# Now, iterate over the entire file and convert it to the proper format
tables = {}
index = 0
for table in range(4):
# First, get the name of the table
name = lines[index]
# Next, iterate over all the data in the table
data = []
for i in range(index + 1, 4 * table_size, 4):
# First, get the month associated with the data
month = lines[i][:3]
# Next, get the latitude associated with the data
latitude = int(lines[i][3:8])
# Now, get the levels by splitting the remainder of this line, and
# each of the following three lines, into chunks equivalent to the
# width of the level column and convert them to a list of floating-
# point numbers and combine all the lists into a single list
levels = []
map(levels.extend, [get_levels(lines[i + j], col3_stop if j == 0 else 0) for j in range(4)])
# Finally, create an instance of our table data and add it to the
# list of data for the table
data.append(TableData(name, month, latitude, levels))
# Finally, associate the data with the name and continue on to the next
# table and update the index
tables[name] = data
index += 4 * table_size
# Helper function to extract levels data from a single line of the table
def get_levels(line, start):
return [float(line[k:k + level_width]) for k in range(start, len(line), level_width)]
This algorithm is pretty brittle as it relies on your file being structured exactly as you described, but it'll do the job if that's the case.
I'm working on some code to manipulate hourly and daily data for a year and am a little confused about how to combine data from the two files. What I am doing is using the hourly pattern of Data Set B but scaling it using Daily Set A. ... so in essence (using the example below) I will take the daily average (Data Set A) of 93 cfs and multiple it by 24 hrs in a day which would equal 2232 . I'll then sum the hourly cfs values for all 24hrs of each day (Data Set B)... which in this case for 1/1/2021 would equal 2596. Normally manipulating a rate in these manners doesn't make sense but in this case it doesn't matter because the units cancel out. I'd then need to take these values and divide them by each other 2232/2596 = 0.8597 and apply that to the hourly cfs values for all 24hrs of each day (Data Set B) for a new "scaled" dataset (to be Data Set C).
My problem is that I have never coded in Python using two different input datasets (I am a complete newbie). I started experimenting with the code but the problem is - is I can't seem to integrate the two datasets. If anyone can point me in the direction of how to integrate two separate input files I'd be most appreciative. Beneath the datasets is my attempts at the code (please note the reverse order of code - working first with hourly data (Data Set B) and then the daily data (Data Set A). My print out of the final scaling factor (SF) is only giving me one print out... not all 8,760 because I'm not in the loop... but how can I be in the loop of both input files at the same time???
Data Set A (Daily) -- 365 lines of data:
1/1/2021 93 cfs
1/2/2021 0 cfs
1/3/2021 70 cfs
1/4/2021 70 cfs
Data Set B (Hourly) -- 8,760 lines of data:
1/1/2021 0:00 150 cfs
1/1/2021 1:00 0 cfs
1/1/2021 2:00 255 cfs
(where summation of all 24 hrs of 1/1/2021 = 2596 cfs)
etc.
Sorry if this is a ridiculously easy question... I am very new to coding.
Here is the code that I've written so far... what I need is 8,760 lines of SF... that I can then use to multiple by the original Data Set B. The final product of Data Set C will be Date - Time - rescaled hourly data. I actually have to do this for three pumping units total... to give me a matrix of 5 columns by 8,760 rows but I think I'll be able to figure the unit thing out. My problem now is how to integrate the two data sets. Thank you for reading!
print('Solving the Temperature Model programming problem')
fhand1 = open('Interpolate_CY21_short.txt')
fhand2 = open('WSE_Daily_CY21_short.txt')
#Hourly Interpolated Pardee PowerHouse Data
for line1 in fhand1:
line1 = line1.rstrip()
words1 = line1.split()
#Hourly interpolated data - parsed down (cfs)
x = float(words1[7])
if x<100:
x = 0
#print(x)
#WSE Daily Average PowerHouse Data
for line2 in fhand2:
line2 = line2.rstrip()
words2 = line2.split()
#Daily cfs average x 24 hrs
aa = float(words2[2])*24
#print(a)
SF = x * aa
print(SF)
This is how you would get the data into two lists,
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
daily_average = fhand1.readlines()
daily = fhand2.readlines()
# this is what the to lists would look like, roughly
# each line would be a separate string
daily_average = ["1/1/2021 93 cfs","1/2/2021 0 cfs"]
daily = ["1/1/2021 0:00 150 cfs", "1/1/2021 1:00 0 cfs", "1/2/2021 1:00 0 cfs"]
Then, to process the lists could probably use a double nested for loop
for average_line in daily_average:
average_line = average_line.rstrip()
average_date, average_count, average_symbol = average_line.split()
for daily_line in daily:
daily_line = daily_line.rstrip()
date, hour, count, symbol = daily_line.split()
if average_date == date:
print(f"date={date}, average_count={average_count} count={count}")
Or a dictionary
# populate data into dictionaries
daily_average_data = dict()
for line in daily_average:
line = line.rstrip()
day, count, symbol = line.split()
daily_average_data[day] = (day, count, symbol)
daily_data = dict()
for line in daily:
line = line.rstrip()
day, hour, count, symbol = line.split()
if day not in daily_data:
daily_data[day] = list()
daily_data[day].append((day, hour, count, symbol))
# now you can access daily_average_data and daily_data as
# dictionaries instead of files
# process data
result = list()
for date in daily_data.keys():
print(date)
print(daily_average_data[date])
print(daily_data[date])
If the data items corresponded with one another line by line, you could use https://realpython.com/python-zip-function/
here is an example:
for data1, data2 in zip(daiy_average, daily):
print(f"{data1} {data2}")
Similar to what #oasispolo decribed, the solution is to make a single loop and process both lists in it. I'm personally not fond of the "zip" function. (It's a purely stylistic objection; lots of other people like it and that's fine.)
Here's a solution with syntax that I find more intuitive:
print('Solving the Temperature Model programming problem')
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
# Convert each file into a list of lines. You're doing this
# implicitly, but I like to be explicit about it.
lines1 = fhand1.readlines()
lines2 = fhand2.readlines()
if len(lines1) != len(lines2):
raise ValueError("The two files have different length!")
# Initialize an output array. You cold also construct it
# one item at a time, but that can be slow for large arrays.
# It is more efficient to initialize the entire array at
# once if possible.
sf_list = [0]*len(lines1)
for position in range(len(lines1)):
# range(L) generates numbers 0...L-1
line1 = lines1[position].rstrip()
words1 = line1.split()
x = float(words1[7])
if x<100:
x = 0
line2 = lines2[position].rstrip()
words2 = line2.split()
aa = float(words2[2])*24
sf_list[position] = x * aa
print(sf_list)
I am using the this dataset for a project.
I am trying to find the total yield for each inverter for the 34 day duration of the dataset (basically use the final and initial value available for each inverter). I have been able to get the list of inverters using pd.unique()(there are 22 inverters for each solar power plant.
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
delta = np.zeros(len(arr))
index =0
for i in arr:
initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
initial = initial.loc[initial["INVERTER_ID"]==i]
initial.reset_index(inplace=True,drop=True)
initial = initial.at[0,"TOTAL_YIELD"]
final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
final = final.loc[final["INVERTER_ID"]==i]
final.reset_index(inplace=True, drop=True)
final = final.at[0,"TOTAL_YIELD"]
delta[index] = final - initial
index = index + 1
return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error- 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarry:
delta = np.zeros(len(arr))
index = 0
for i in arr:
initial = df.loc(df["INVERTER_ID"] == i)
index += 1
break
return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this. Getting "To May Requests" Error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
So if I'm understanding this right, what you want is the TOTAL_YIELD for each inverter for the beginning of the time period starting 5-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
# to filter the info to between the two dates, but not necessarily assuming that
# each inverter's data starts and ends at each date
inverter_df = df.loc[df['DATE_TIME'] >= pd.to_datetime('15-05-2020 02:00:00')]
inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= pd.to_datetime('17-06-2020
23:45:00')]
inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"]==code]]
# sort by date
inverter_df.sort_values(by='DATE_TIME', inplace= True)
# grab TOTAL_YIELD at the first available date
initial = inverter_df['TOTAL_YIELD'].iloc[0]
# grab TOTAL_YIELD at the last available date
final = inverter_df['TOTAL_YIELD'].iloc[-1]
delta[index] = final - initial
UPDATE My question has been fully answered, I have applied it to my program using jarmod's answer, and although the code looks neater, it has not effected the speed of (when my graph appears( i plot this data using matplotlib) I am a a little confused on why my program runs slowly and how I can increase the speed ( takes about 30 seconds and I know this portion of the code is slowing it down) I have shown my real code in the second block of code. Also, the speed is strongly determined by the Range I set, with a short range it is quiet fast
I have this sample code here that shows my calculation needed to conduct forecasting and extracting values. I use the for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the forecasting average for a forecast for a given amount of month.
My full code includes 12 functions for a full year forecast but I feel the code is inefficient because the functions are very similar except for one number and reading the csv file so many times slows the program.
Is there a way I can combine these functions and perhaps add another parameter to make it run so. The biggest concern I had was that it would be hard to return separate numbers and categorize them. In other words, I would like to ideally only have one function for all 12 month accuracy predictions and the way I can possibly see how to do that would to add another parameter and another loop series, but have no idea how to go about that or if it is possible. Essentially, I would like to store all the values of onemonthaccuracy ( which goes into the file before the current file and compares the predicted value for the date associated with the currentfile) and then store all the values of twomonthaccurary and so on... so I can later use these variables for graphing and other purposes
import csv
import pandas as pd
def onemonthaccuracy(basefilenumber):
basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False),'Jun-16\nQty']
onetotal = int(onemonthvalue)/int(basefilevalue)
return onetotal
def twomonthaccuracy(basefilenumber):
basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding = 'Latin-1')
twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
twototal = int(twomonthvalue)/int(basefilevalue)
return twototal
onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24,36):
onetotal += onemonthaccuracy(basefilenumber)
twototal +=twomonthaccuracy(basefilenumber)
onetotallist.append(onemonthaccuracy(i))
twototallist.append(twomonthaccuracy(i))
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12
x = [1,2]
y = [onetotalpermonth, twototalpermonth]
z = [1,2]
w = [(onetotallist),(twototallist)]
for ze, we in zip(z, w):
plt.scatter([ze] * len(we), we, marker='D', s=5)
plt.scatter(x,y)
plt.show()
This is the real block of code I am using in my program, perhaps something is slowing it down that I am unaware of?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
baseheader = getfileheader(basefilenumber)
basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)
N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange),int(EndRange)):
for n in range(N):
total[n] += nmonthaccuracy(basefilenumber, n)
total_by_month_list[n].append(nmonthaccuracy(basefilenumber,n))
onetotal=total[1]/ Range
twototal=total[2]/ Range
threetotal=total[3]/ Range
fourtotal=total[4]/ Range
fivetotal=total[5]/ Range #... all the way to 12
onetotallist=total_by_month_list[1]
twototallist=total_by_month_list[2]
threetotallist=total_by_month_list[3]
fourtotallist=total_by_month_list[4]
fivetotallist=total_by_month_list[5] #... all the way to 12
# alot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
return int(nmonthvalue)/int(basefilevalue)
N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20,30):
for n in range(N):
a = nmonthaccuracy(basefilenumber, n)
total_by_month[n] += a
total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2, but you could make it 12 or another value) and it then creates a total_by_month array that can store N results, one per month. It then initializes total_by_month to all zeroes (N zeroes) so that each of the N monthly totals starts at zero.
Python beginner here. I am trying to make us of some data stored in a dictionary.
I have some .npy files in a folder. It is my intention to build a dictionary that encapsulates the following: reading of the map, done with np.load, the year, month, and date of the current map (as integers), the fractional time in years (given that a month has 30 days - it does not affect my calculations afterwards), and the number of pixels, and number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0':'array(from np.load)', 'year', 'month', 'day', 'fractional_time', 'pixels'
'map1':'....}
What I managed until now is the following:
import glob
file_list = glob.glob('*.npy')
def only_numbers(seq): #for getting rid of any '.npy' or any other string
seq_type= type(seq)
return seq_type().join(filter(seq_type.isdigit, seq))
maps = {}
for i in range(0, len(file_list)-1):
maps[i] = np.load(file_list[i])
numbers[i]=list(only_numbers(file_list[i]))
I have no idea how to to get a dictionary to have more values that are under the for loop. I can only manage to generate a new dictionary, or a list (e.g. numbers) for every task. For the numbers dictionary, I have no idea how to manipulate the date in the format YYYYMMDD to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: Managed to split the dates and make them integers using
date=list(only_numbers.values())
year=int(date[i][0:4])
month=int(date[i][4:6])
day=int(date[i][6:8])
print (year, month, day)
If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed to manipulate further easier. I did the following:
file_list = glob.glob('data/...') # files named YYYYMMDD.npy
file_list.sort()
def only_numbers(seq): # i make sure that i remove all characters and symbols from the name of the file
seq_type = type(seq)
return seq_type().join(filter(seq_type.isdigit, seq))
numbers = {}
time = []
np_above_value = []
for i in range(0, len(file_list) - 1):
maps = np.load(file_list[i])
maps[np.isnan(maps)] = 0 # had some NANs and getting some errors
numbers[i] = only_numbers(file_list[i]) # getting a dictionary with the name of the files that contain only the dates - calling the function I defined earlier
date = list(numbers.values()) # registering the name of the files (only the numbers) as a list
year = int(date[i][0:4]) # selecting first 4 values (YYYY) and transform them as integers, as required
month = int(date[i][4:6]) # selecting next 2 values (MM)
day = int(date[i][6:8]) # selecting next 2 values (DD)
time.append(year + ((month - 1) * 30 + day) / 360) # fractional time
print('Total pixel count for map '+ str(i) +':', maps.size) # total number of pixels for the current map in iteration
c = (maps > value).astype(int)
np_above_value.append (np.count_nonzero(c)) # list of the pixels with a value bigger than value
print('Pixels with concentration >value% for map '+ str(i) +':', np.count_nonzero(c)) # total number of pixels with a value bigger than value for the current map in iteration
plt.plot(time, np_above_value) # pixels with concentration above value as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)