Python image file manipulation - python

Python beginner here. I am trying to make use of some data stored in a dictionary.
I have some .npy files in a folder. It is my intention to build a dictionary that encapsulates the following: reading of the map, done with np.load, the year, month, and date of the current map (as integers), the fractional time in years (given that a month has 30 days - it does not affect my calculations afterwards), and the number of pixels, and number of pixels above a certain value. At the end I expect to get a dictionary like:
{'map0':'array(from np.load)', 'year', 'month', 'day', 'fractional_time', 'pixels'
'map1':'....}
What I managed until now is the following:
import glob
import numpy as np
file_list = glob.glob('*.npy')
def only_numbers(seq): #for getting rid of any '.npy' or any other string
    seq_type= type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
maps = {}
numbers = {}
for i in range(len(file_list)):
    maps[i] = np.load(file_list[i])
    numbers[i]=list(only_numbers(file_list[i]))
I have no idea how to get a dictionary to hold more than one value per key inside the for loop. I can only manage to generate a new dictionary, or a list (e.g. numbers), for every task. For the numbers dictionary, I have no idea how to manipulate the date in the format YYYYMMDD to get the integers I am looking for.
For the pixels, I managed to get it for a single map, using:
data = np.load('20100620.npy')
print('Total pixel count: ', data.size)
c = (data > 50).astype(int)
print('Pixel >50%: ',np.count_nonzero(c))
Any hints? Until now, image processing seems to be quite a challenge.
Edit: Managed to split the dates and make them integers using
date=list(numbers.values()) # numbers[i] holds the digits of each file name as a string
year=int(date[i][0:4])
month=int(date[i][4:6])
day=int(date[i][6:8])
print (year, month, day)

If anyone is interested, I managed to do something else. I dropped the idea of a dictionary containing everything, as I needed to manipulate further easier. I did the following:
import glob
import numpy as np
import matplotlib.pyplot as plt
file_list = glob.glob('data/...') # files named YYYYMMDD.npy
file_list.sort()
def only_numbers(seq): # remove all characters and symbols from the file name, keeping only the digits
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
value = 50 # threshold; 50 is the value used in the single-map example above
numbers = {}
time = []
np_above_value = []
for i in range(len(file_list)): # loop over every file, including the last one
    maps = np.load(file_list[i])
    maps[np.isnan(maps)] = 0 # had some NaNs and was getting errors
    numbers[i] = only_numbers(file_list[i]) # dictionary of file names reduced to their dates, using the function defined earlier
    date = list(numbers.values()) # the date strings (digits only) as a list
    year = int(date[i][0:4]) # first 4 characters (YYYY) as an integer
    month = int(date[i][4:6]) # next 2 characters (MM)
    day = int(date[i][6:8]) # next 2 characters (DD)
    time.append(year + ((month - 1) * 30 + day) / 360) # fractional time
    print('Total pixel count for map '+ str(i) +':', maps.size) # total number of pixels in the current map
    c = (maps > value).astype(int)
    np_above_value.append(np.count_nonzero(c)) # number of pixels with a value bigger than value
    print('Pixels with concentration >value% for map '+ str(i) +':', np.count_nonzero(c)) # same count, printed for the current map
plt.plot(time, np_above_value) # pixels with concentration above value as a function of time
I know it might be very clumsy. Second week of python, so please overlook that. It does the trick :)
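For the original goal of one dictionary holding everything, a nested dictionary (one inner dictionary per map) is one option. The following is only a sketch: the helper name describe_map and the key names are made up for illustration, and it assumes every file is named YYYYMMDD.npy as in the question.

```python
from datetime import datetime
from pathlib import Path

import numpy as np


def describe_map(path, data, threshold=50):
    """Build one record for a map whose file name is YYYYMMDD.npy (assumed)."""
    date = datetime.strptime(Path(path).stem, "%Y%m%d")
    return {
        "array": data,
        "year": date.year,
        "month": date.month,
        "day": date.day,
        # 30-day months and 360-day years, as in the question
        "fractional_time": date.year + ((date.month - 1) * 30 + date.day) / 360,
        "pixels": data.size,
        "pixels_above": int(np.count_nonzero(data > threshold)),
    }


# With real files this could become:
# maps = {p: describe_map(p, np.load(p)) for p in sorted(glob.glob('*.npy'))}

# Small in-memory demonstration instead of reading a file.
demo = describe_map("20100620.npy", np.array([[10, 60], [70, 20]]))
print(demo["year"], demo["month"], demo["day"], demo["pixels"], demo["pixels_above"])
```

Each map then carries its own fields, so a value is reached with something like maps[name]["fractional_time"] instead of keeping several parallel lists in step.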

Related

How to iterate to find days? in smth["smth"]["1.555"]["2022-04-05T08:34:39+02:00"]

I'm trying to iterate over this and can't figure out how. There's a .csv file.
QUESTION: I find LOW_num's data[0] and need to get the TOP_num whose data[0] is earlier than LOW_num's data[0]. What could the formula be? Example:
for line in file:
    data = line.split(sep)
A line looks like this:
2022-04-05T08:34:39+02:00, 1.2024, 1.2024, 1.2024, 1.2024, 1.2185, 1.2059028833000065, 1.2024784243912705, 1.2004400559932131, 1.198116316019428
So data[0] means Column A, data[1] is Column B, data[2] is Column C, (...)
memory["high"] = {}
memory["low"] = {}
for line in file:
    data = line.split(sep)
    if data[5] < data[9]:
        memory["high"][float(data[2])] = str(data[0])
        memory["low"][float(data[3])] = str(data[0])
        # these collect data[2] and data[3] only between events where
        # the values change from column F > J to F < J in that .csv file
then in the same "for line in file:", but different if:
LOW_num = min(memory["low"]) # it gets lowest number of all collected data[3] (Column D)
TOP_num = max(memory["high"]) # it gets biggest number of all collected data[2] (Column C)
#so TOP_num is for example: "1.555"
#but that TOP_num got day, month, and year attached to it as well:
#ex: memory["high"]["1.555"]["2022-04-05T08:34:39+02:00"]
TOP_data0 = str(memory["high"][TOP_num])
LOW_data0 = str(memory["low"][LOW_num])
I tried some things but can't get it right, for example:
for i in memory["high"][i][j]:
if memory["high"][i][memory["high"][TOP_num][TOP_data0] < memory["low"][LOW_num]LOW_data0]:
print(memory["high"][i][TOP_num])
The .csv file holds price data for a coin (e.g. ADAUSDT) from an exchange:
(time, open, high, low, close, something, something, something, something, something)
I find the lowest price of a given time period (LOW_num), starting from some earlier start price.
I then need the highest price between that start point and the LOW_num point.
That highest price is the stop loss that had to be set in order to reach the lowest point; in this example the trade was a short.
Figured it out!
memory["SL"] = {}
for number in memory["high"]:
    if number > LOW_num: # so it's only numbers higher than the lowest, obviously
        x = number
        if memory["high"][x] < LOW_data0: # and among them, only those dated earlier than LOW_data0
            memory["SL"][x] = str(memory["high"][x]) # saved to a new memory set for a later max() or min()
Wow python can compare dates!
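Strictly speaking, the comparison above is between strings rather than dates. ISO 8601 timestamps that share the same UTC offset happen to sort in chronological order as text, so it works for this file. If the offset ever changes (for example across a daylight-saving switch), parsing the values first is safer. A small sketch with made-up timestamps:

```python
from datetime import datetime

# Two hypothetical timestamps with different UTC offsets.
low_time = "2022-04-05T08:34:39+02:00"   # 06:34:39 in UTC
high_time = "2022-04-05T07:00:00+00:00"  # 07:00:00 in UTC

# Text comparison looks only at the characters, so "08" sorts after "07".
text_says_low_is_earlier = low_time < high_time

# Parsed comparison takes the offset into account.
parsed_says_low_is_earlier = (
    datetime.fromisoformat(low_time) < datetime.fromisoformat(high_time)
)

print(text_says_low_is_earlier, parsed_says_low_is_earlier)
```

Here the two comparisons disagree, which is the case the string approach would get wrong.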

Calculate average and normalize GPS data in Python

I have a dataset in json with gps coordinates:
"utc_date_and_time":"2021-06-05 13:54:34", # timestamp
"hdg":"018.0", # heading
"sog":"000.0", # speed
"lat":"5905.3262N", # latitude
"lon":"00554.2433E" # longitude
This data will be imported into a database, with one entry every second for every "vessel".
As you can imagine this is a huge amount of data that provides a level of accuracy I do not need.
My goal:
Create a new entry in the database for every X seconds
If I set X to 60 (a minute) and there are missing 10 entries within this period, 50 entries should be used. Data can be missing for certain periods, and I do not want this to create bogus positions.
Use timestamp from last entry in period.
Use the heading (hdg) that is appearing the most times within this period.
Calculate average speed within this period.
Latitude and longitude could use the last entry, but I have seen "spikes" that need to be filtered out; alternatively, use the average and remove values that differ too much.
My script is now pushing all the data to the database via a for loop with different data-checks inside it, and this is working.
I am new to python and still learning every day through reading and youtube videos, but it would be great if anyone could point me in the right direction for how to achieve the above goal.
As of now the data is imported into a dictionary. And I am wondering if creating a dictionary where the timestamp is the key is the way to go, but I am a little lost.
Code:
import os
import json
from pathlib import Path
from datetime import datetime, timedelta, date
def generator(data):
    for entry in data:
        yield entry
data = json.load(open("5_gps_2021-06-05T141524.1397180000.json"))["gps_data"]
gps_count = len(data)
start_time = None
new_gps = list()
tempdata = list()
seconds = 60
i = 0
for entry in generator(data):
    i = i+1
    if start_time == None:
        start_time = datetime.fromisoformat(entry['utc_date_and_time'])
    # TODO: Filter out values with too much deviation
    tempdata.append(entry)
    elapsed = (datetime.fromisoformat(entry['utc_date_and_time']) - start_time).total_seconds()
    if (elapsed >= seconds) or (i == gps_count):
        # TODO: Calculate average values etc. instead of using last
        new_gps.append(tempdata)
        tempdata = []
        start_time = None
print("GPS count before:" + str(gps_count))
print("GPS count after:" + str(len(new_gps)))
Output:
GPS count before:1186
GPS count after:20
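One way to fill in the "calculate average values" TODO is a small function that collapses each window into a single record. This is a sketch, not a tested solution for the real data: summarize and to_degrees are hypothetical helper names, the coordinates are assumed to follow the NMEA-style ddmm.mmmm layout plus a hemisphere letter, and the median is used as a simple way to resist position spikes.

```python
from collections import Counter
from statistics import mean, median


def to_degrees(text):
    """Convert 'ddmm.mmmmH' (assumed NMEA-style layout) to signed decimal degrees."""
    value, hemisphere = float(text[:-1]), text[-1]
    degrees = int(value // 100)
    minutes = value - degrees * 100
    decimal = degrees + minutes / 60
    return -decimal if hemisphere in "SW" else decimal


def summarize(entries):
    """Collapse one window of GPS entries into a single record."""
    headings = Counter(e["hdg"] for e in entries)
    return {
        "utc_date_and_time": entries[-1]["utc_date_and_time"],  # last timestamp
        "hdg": headings.most_common(1)[0][0],                   # most frequent heading
        "sog": mean(float(e["sog"]) for e in entries),          # average speed
        "lat": median(to_degrees(e["lat"]) for e in entries),   # median resists spikes
        "lon": median(to_degrees(e["lon"]) for e in entries),
    }


# Made-up window of three entries; the third latitude is a deliberate spike.
window = [
    {"utc_date_and_time": "2021-06-05 13:54:34", "hdg": "018.0", "sog": "000.0",
     "lat": "5905.3262N", "lon": "00554.2433E"},
    {"utc_date_and_time": "2021-06-05 13:54:35", "hdg": "018.0", "sog": "002.0",
     "lat": "5905.3262N", "lon": "00554.2433E"},
    {"utc_date_and_time": "2021-06-05 13:54:36", "hdg": "020.0", "sog": "004.0",
     "lat": "5955.0000N", "lon": "00554.2433E"},
]
summary = summarize(window)
print(summary)
```

In the loop from the question, new_gps.append(tempdata) could then become new_gps.append(summarize(tempdata)). Because only the entries actually present in a window are used, missing seconds do not produce bogus positions.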

Pandas- locate a value based on logical statements

I am using this dataset for a project.
I am trying to find the total yield for each inverter for the 34 day duration of the dataset (basically use the final and initial value available for each inverter). I have been able to get the list of inverters using pd.unique() (there are 22 inverters for each solar power plant).
I am having trouble querying the total_yield data for each inverter.
Here is what I have tried:
def get_yields(arr: np.ndarray, df:pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index =0
    for i in arr:
        initial = df.loc[df["DATE_TIME"]=="15-05-2020 02:00"]
        initial = initial.loc[initial["INVERTER_ID"]==i]
        initial.reset_index(inplace=True,drop=True)
        initial = initial.at[0,"TOTAL_YIELD"]
        final = df.loc[(df["DATE_TIME"]=="17-06-2020 23:45")]
        final = final.loc[final["INVERTER_ID"]==i]
        final.reset_index(inplace=True, drop=True)
        final = final.at[0,"TOTAL_YIELD"]
        delta[index] = final - initial
        index = index + 1
    return delta
Reference: arr is the array of inverters, listed below. df is the generation dataframe for each plant.
The problem is that not every inverter has a data point for each interval. This makes this function only work for the inverters at the first plant, not the second one.
My second approach was to filter by the inverter first, then take the first and last data points. But I get an error- 'Series' objects are mutable, thus they cannot be hashed
Here is the code for that so far:
def get_yields2(arr: np.ndarray, df: pd.core.frame.DataFrame) -> np.ndarray:
    delta = np.zeros(len(arr))
    index = 0
    for i in arr:
        initial = df.loc(df["INVERTER_ID"] == i)
        index += 1
        break
    return delta
List of inverters at plant 1 for reference(labeled as SOURCE_KEY):
['1BY6WEcLGh8j5v7' '1IF53ai7Xc0U56Y' '3PZuoBAID5Wc2HD' '7JYdWkrLSPkdwr4'
'McdE0feGgRqW7Ca' 'VHMLBKoKgIrUVDU' 'WRmjgnKYAwPKWDb' 'ZnxXDlPa8U1GXgE'
'ZoEaEvLYb1n2sOq' 'adLQvlD726eNBSB' 'bvBOhCH3iADSZry' 'iCRJl6heRkivqQ3'
'ih0vzX44oOqAx2f' 'pkci93gMrogZuBj' 'rGa61gmuvPhdLxV' 'sjndEbLyjtCKgGv'
'uHbuxQJl8lW7ozc' 'wCURE6d3bPkepu2' 'z9Y9gH1T5YWrNuG' 'zBIq5rxdHJRwDNY'
'zVJPv84UY57bAof' 'YxYtjZvoooNbGkE']
List of inverters at plant 2:
['4UPUqMRk7TRMgml' '81aHJ1q11NBPMrL' '9kRcWv60rDACzjR' 'Et9kgGMDl729KT4'
'IQ2d7wF4YD8zU1Q' 'LYwnQax7tkwH5Cb' 'LlT2YUhhzqhg5Sw' 'Mx2yZCDsyf6DPfv'
'NgDl19wMapZy17u' 'PeE6FRyGXUgsRhN' 'Qf4GUc1pJu5T6c6' 'Quc1TzYxW2pYoWX'
'V94E5Ben1TlhnDV' 'WcxssY2VbP4hApt' 'mqwcsP2rE7J0TFp' 'oZ35aAeoifZaQzV'
'oZZkBaNadn6DNKz' 'q49J1IKaHRwDQnt' 'rrq4fwE8jgrTyWY' 'vOuJvMaM2sgwLmb'
'xMbIugepa2P7lBB' 'xoJJ8DcxJEcupym']
Thank you very much.
I can't download the dataset to test this. Getting a "Too Many Requests" error.
However, you should be able to do this with a groupby.
import pandas as pd
result = df.groupby('INVERTER_ID')['TOTAL_YIELD'].agg(['max','min'])
result['delta'] = result['max']-result['min']
print(result[['delta']])
So if I'm understanding this right, what you want is the TOTAL_YIELD for each inverter at the beginning and at the end of the time period starting 15-05-2020 02:00 and ending 17-06-2020 23:45. Try this:
# assumes df['DATE_TIME'] has already been converted to datetimes with pd.to_datetime
delta = np.zeros(len(arr))
# enumerate lets you have an index value along with iterating through the array
for i, code in enumerate(arr):
    # to filter the info to between the two dates, but not necessarily assuming that
    # each inverter's data starts and ends at each date
    inverter_df = df.loc[df['DATE_TIME'] >= pd.to_datetime('15-05-2020 02:00:00', dayfirst=True)]
    inverter_df = inverter_df.loc[inverter_df['DATE_TIME'] <= pd.to_datetime('17-06-2020 23:45:00', dayfirst=True)]
    inverter_df = inverter_df.loc[inverter_df["INVERTER_ID"]==code]
    # sort by date
    inverter_df = inverter_df.sort_values(by='DATE_TIME')
    # grab TOTAL_YIELD at the first available date
    initial = inverter_df['TOTAL_YIELD'].iloc[0]
    # grab TOTAL_YIELD at the last available date
    final = inverter_df['TOTAL_YIELD'].iloc[-1]
    delta[i] = final - initial
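The same "first and last available reading per inverter" idea can also be written without an explicit loop. The sketch below uses a tiny made-up frame rather than the real dataset, and it assumes the column names from the question (INVERTER_ID, DATE_TIME, TOTAL_YIELD); in the actual files the inverter column may be called SOURCE_KEY instead.

```python
import pandas as pd

# Made-up sample: inverter B has no reading at the first or the last timestamp.
df = pd.DataFrame({
    "DATE_TIME": pd.to_datetime(
        ["15-05-2020 02:00", "15-05-2020 02:15", "17-06-2020 23:45",
         "15-05-2020 02:15", "17-06-2020 23:30"],
        format="%d-%m-%Y %H:%M",
    ),
    "INVERTER_ID": ["A", "A", "A", "B", "B"],
    "TOTAL_YIELD": [100.0, 105.0, 250.0, 40.0, 90.0],
})

# Sort chronologically so "first" and "last" mean earliest and latest reading.
ordered = df.sort_values("DATE_TIME")
per_inverter = ordered.groupby("INVERTER_ID")["TOTAL_YIELD"].agg(["first", "last"])
per_inverter["delta"] = per_inverter["last"] - per_inverter["first"]
print(per_inverter)
```

Unlike max minus min, this does not rely on TOTAL_YIELD only ever increasing. As for the error in the second attempt: it most likely comes from calling df.loc with parentheses; .loc is indexed with square brackets, as in df.loc[df["INVERTER_ID"] == i].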

How to make categories out of my text file and calculate average out of the numbers?

I am working on an assignment, but I am stuck and I do not know how to proceed.
I need to group the rows by the categories named in the first line of the txt file and calculate averages over every numerical value. The program has to keep working flawlessly when I add new lines to the txt file.
Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No
This is the text file. I tried to make different categories out of them, but I do not know if I did it correctly and how to let Python know that he has to calculate all the numbers from 1 group.
with open('bijlage2.txt') as bestand:
    maak_er_lists_van = [(line.strip()).split(';') for line in bestand]
    keys = maak_er_lists_van[0]
    lijst = list(zip([keys]*len(maak_er_lists_van[1:]),
                     maak_er_lists_van[1:]))
    x = [zip(i[0], i[1]) for i in lijst]
    maak_dict = [dict(i) for i in x]
    for i in maak_dict:
        categorieen =[i['Category'], i['currency'], i['sellerRating'],
                      i['Duration'], i['endDay'], i['ClosePrice'], i['OpenPrice'],
                      i['Competitive?']]
        categorieen = list(map(int, categorieen))
This is what I have so far. I am a Python beginner so the whole text file thing is new to me. Can somebody help me or explain what I have to do so that I can work further on this project? Many thanks in advance!
Here's how I would do it. I had to add using locale.atof() because where I am . is used as the decimal point, not commas. You may have to change this as indicated.
The csv module is used to read the file, and the averages are computed in a two-step process. First the values for each category are summed, and then afterwards, the average value of each one is calculated based on the number of values read.
import csv
import locale
from pprint import pprint, pformat
#locale.setlocale(locale.LC_ALL, '') # empty string for platform's default settings
# Following used for testing to force ',' to be considered as a decimal point.
locale.setlocale(locale.LC_ALL, 'French_France.1252')
avg_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
averages = {avg_name: 0 for avg_name in avg_names} # Initialize.
# Find total of each category of interest.
num_values = 0
with open('bijlage2.txt', newline='') as bestand:
    csvreader = csv.DictReader(bestand, delimiter=';')
    for row in csvreader:
        num_values += 1
        for avg_name in avg_names:
            averages[avg_name] += locale.atof(row[avg_name])
# Calculate average of each summed value.
for avg_name, total in averages.items():
    averages[avg_name] = total / num_values
print('raw results:')
pprint(averages)
print() # Formatted output
print('Averages:')
for avg_name in avg_names:
    rounded = locale.format_string('%.2f', round(averages[avg_name], 2),
                                   grouping=True)
    print(' {:<13} {:>10}'.format(avg_name, rounded))
Output:
raw results:
{'ClosePrice': 0.01, 'Duration': 5.0, 'OpenPrice': 0.01, 'sellerRating': 3249.0}
Averages:
sellerRating 3 249,00
Duration 5,00
ClosePrice 0,01
OpenPrice 0,01
Everything is fine with your way to read the file and creating a dictionary with the categories and values, imo. Your list maak_dict contains one dictionary for every line. To calculate an average for one category, you could do something like this:
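def calc_average(categ):
    # the values were read as strings with a decimal comma, so convert them first
    values = [float(i[categ].replace(',', '.')) for i in maak_dict]
    average = sum(values)/len(values)
    return average
assuming that you want to calculate the arithmetic mean. categ has to be a string naming a numeric column.
After that, you can create a new dictionary that contains all the averages:
new_dict = {}
for category in ['sellerRating', 'Duration', 'ClosePrice', 'OpenPrice']: # numeric columns only
    avg = calc_average(category)
    new_dict[category] = avg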
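If the averages are wanted separately for each value of the Category column (which is how the question can also be read), the sums and counts can be kept per category. A minimal sketch, assuming the four numeric column names from the sample file and decimal commas; the function name averages_per_category is made up for illustration.

```python
import csv
import io
from collections import defaultdict

NUMERIC = ("sellerRating", "Duration", "ClosePrice", "OpenPrice")


def averages_per_category(lines):
    """Return {category: {column: average}} for the numeric columns."""
    sums = defaultdict(lambda: dict.fromkeys(NUMERIC, 0.0))
    counts = defaultdict(int)
    for row in csv.DictReader(lines, delimiter=";"):
        category = row["Category"]
        counts[category] += 1
        for name in NUMERIC:
            # Decimal commas are turned into points before converting.
            sums[category][name] += float(row[name].replace(",", "."))
    return {
        category: {name: total / counts[category] for name, total in totals.items()}
        for category, totals in sums.items()
    }


# In-memory stand-in for the file; open('bijlage2.txt', newline='') would work the same way.
sample = io.StringIO(
    "Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?\n"
    "Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No\n"
    "Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No\n"
    "Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No\n"
)
result = averages_per_category(sample)
print(result)
```

Because nothing depends on the number of rows or on which categories appear, new lines added to the file are picked up without code changes.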

How to plot an output of a function in Python?

These three functions give me the progression of number of customers and their orders from state 0 to next 365 states (or days). In function state_evolution, I want to plot the output from line
custA = float(custA*1.09**(1.0/365))
against the output from line
A = sum(80 + random.random() * 50 for i in range(ordsA))
and do the same for custB so I can compare their outputs graphically.
import random
def get_state0():
    """ functions gets four columns from base data and finds their state 0"""
    statetype0 = {'custt':{'typeA':100,'typeB':200}}
    orderstype0 = {'orders':{'typeA':1095, 'typeB':4380}}
    return {'custtypeA' : int(statetype0['custt']['typeA']),
            'custtypeB' : int(statetype0['custt']['typeB']),
            'ordstypeA': orderstype0['orders']['typeA'],'A':1095, 'B':4380,
            'ordstypeB':orderstype0['orders']['typeB'],
            'day':0 }
def state_evolution(state):
    """function takes state 0 and predicts state evolution """
    custA = state['custtypeA']
    custB = state['custtypeB']
    ordsA = state['ordstypeA']
    ordsB = state['ordstypeB']
    A = state['A']
    B = state['B']
    day = state['day']
    # evolve day
    day += 1
    #evolve cust typea
    custA = float(custA*1.09**(1.0/365))
    #evolve cust typeb
    custB = float (custB*1.063**(1.0/365))
    # evolve orders cust type A
    ordsA += int(custA * order_rateA(day))
    A = sum(80 + random.random() * 50 for i in range(ordsA))
    # evolve orders cust type B
    ordsB += int(custB * order_rateB(day))
    B = sum(70 + random.random() * 40 for i in range(ordsB))
    return {'custtypeA':custA ,'ordstypeA':ordsA, 'A':A, 'B':B,
            'custtypeB':custB, 'ordstypeB':ordsB, 'day': day}
def show_all_states():
    """ function runs state evolution function to find other states"""
    s = get_state0()
    for day in range(365):
        s = state_evolution(s)
        print(day, s)
You should do the following:
Modify your custA function so that it returns a sequence (list, tuple) with, say, 365 items. Alternatively, use custA inside a list comprehension or for loop to get the sequence of 365 results;
Do the same for ordsA function, to get the other sequence.
From now on, you can do different things depending on what you want.
If you want two plots in parallel (superimposed), then:
pyplot.plot(custA_result_list);
pyplot.plot(ordsA_result_list);
pyplot.show()
If you want to CORRELATE the data, you can do a scatter plot (faster), or a regular plot with dot markers (slower but more customizable IMO):
pyplot.scatter(custA_result_list, ordsA_result_list)
# or
pyplot.plot(custA_result_list, ordsA_result_list, 'o')
## THIS WILL ONLY WORK IF BOTH SEQUENCES HAVE SAME LENGTH! (e.g. 365 elements each)
At last, if the data were irregularly sampled, you could also provide a sequence for the horizontal axis values. That would allow, for example, to plot only weekdays' results without "collapsing" the weekends (otherwise the gap between friday and monday would look like a single day):
weekdays = [1,2,3,4,5, 8,9,10,11,12, 15,16,17,18,19, 22, ...] # len(weekdays) ~ 260
pyplot.plot(weekdays, custA_result_list);
pyplot.plot(weekdays, ordsA_result_list);
pyplot.show()
Hope this helps!
EDIT: About excel, if I understand right you ALREADY have a csv file. Then, you could use the csv python module, or read it yourself like this:
with open('file.csv') as csv_in:
content = [line.strip().split(',') for line in csv_in]
Now if you have an actual .xls or .xlsx file, use the xlrd module, which you can install by running pip install xlrd in a command prompt. Note that xlrd 2.0 and later reads only .xls files; for .xlsx, openpyxl is a common alternative.
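Returning to the plotting question itself: collecting one value per day into lists could look like the sketch below. It is a simplified stand-in for state_evolution, not the asker's function. The order_rate function is hypothetical (order_rateA and order_rateB are not shown in the question), orders are counted per day rather than accumulated, and only customer type A is modelled.

```python
import random


def order_rate(day):
    """Hypothetical stand-in for order_rateA, which the question does not show."""
    return 0.03


def evolve(days=365, cust=100.0, growth=1.09, seed=0):
    """Return two equally long lists: customers per day and order value per day."""
    rng = random.Random(seed)  # seeded so repeated runs give the same series
    cust_series, value_series = [], []
    for day in range(1, days + 1):
        cust = cust * growth ** (1.0 / 365)
        orders = int(cust * order_rate(day))
        value = sum(80 + rng.random() * 50 for _ in range(orders))
        cust_series.append(cust)
        value_series.append(value)
    return cust_series, value_series


def plot_series(cust_series, value_series):
    """Scatter the two series against each other (needs matplotlib installed)."""
    import matplotlib.pyplot as plt  # imported here so evolve() has no plotting dependency
    plt.plot(cust_series, value_series, 'o')
    plt.xlabel('customers (type A)')
    plt.ylabel('order value (type A)')
    plt.show()


cust_series, value_series = evolve()
# plot_series(cust_series, value_series)  # uncomment to display the figure
```

Running evolve a second time with the type B parameters (200 customers, growth 1.063) and plotting both results on the same axes would give the graphical comparison asked for.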
