I've written the following code, which plots BPM against the hours contained in the input CSV for a given date.
dfMonday['date'] = pd.to_datetime(dfMonday['date'])
df_temp = dfMonday.loc[dfMonday['date'] == '2021/04/26']
bpmMon = df_temp.tempo
hMon = df_temp.time
x = '26/04/2021'
meanBpmMon = bpmMon.mean()
second = plt.figure(figsize=(10,5))
plt.title('HOURS - BPM (MONDAY- 26/04/2021)')
plt.scatter(hMon, bpmMon, c=bpmMon)
plt.xticks(rotation=45)
The CSV contains other dates as well, each referring to a different day of listening to music. What I would like to do is create more charts based on the dates contained in the CSV: if there are n distinct dates, I would like to output n graphs, one per date.
The CSV file has the following structure:
artist_name;ms_played;track_name;...date;time;week_day
Taylor Swift;35260;Wildest Dreams;...;2021-01-25;07:55;0
Edward Sharpe & The Magnetic Zeros;...2021-01-25;15:34;0
Kanye West; 127964; ...; 2021-02-21;08:08;0
Billie Eilish; 125412; ...; 2021-15-2; 15:02; 0
......
As you can see from the date column, there are several dates inside the CSV.
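A minimal sketch of one way to do this, assuming the whole CSV is loaded into a single dataframe first (the filename 'listening.csv' is hypothetical; the column names and semicolon separator follow the question):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('listening.csv', sep=';')  # hypothetical filename
df['date'] = pd.to_datetime(df['date'])

# One scatter plot per distinct date in the CSV
for date, group in df.groupby('date'):
    plt.figure(figsize=(10, 5))
    plt.title(f'HOURS - BPM ({date.strftime("%d/%m/%Y")})')
    plt.scatter(group['time'], group['tempo'], c=group['tempo'])
    plt.xticks(rotation=45)
    plt.show()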
I'm working on some code to manipulate hourly and daily data for a year and am a little confused about how to combine data from the two files. What I am doing is using the hourly pattern of Data Set B but scaling it using Daily Set A. ... so in essence (using the example below) I will take the daily average (Data Set A) of 93 cfs and multiply it by 24 hrs in a day, which equals 2232. I'll then sum the hourly cfs values for all 24 hrs of each day (Data Set B), which in this case for 1/1/2021 equals 2596. Normally manipulating a rate in these ways doesn't make sense, but in this case it doesn't matter because the units cancel out. I'd then need to divide these values by each other, 2232/2596 = 0.8597, and apply that to the hourly cfs values for all 24 hrs of each day (Data Set B) for a new "scaled" dataset (to be Data Set C).
My problem is that I have never coded in Python using two different input datasets (I am a complete newbie). I started experimenting with the code, but the problem is I can't seem to integrate the two datasets. If anyone can point me in the direction of how to integrate two separate input files, I'd be most appreciative. Beneath the datasets are my attempts at the code (please note the reverse order of the code: working first with the hourly data (Data Set B) and then the daily data (Data Set A)). My printout of the final scaling factor (SF) only gives me one value, not all 8,760, because I'm not inside the loop... but how can I be inside the loop over both input files at the same time?
Data Set A (Daily) -- 365 lines of data:
1/1/2021 93 cfs
1/2/2021 0 cfs
1/3/2021 70 cfs
1/4/2021 70 cfs
Data Set B (Hourly) -- 8,760 lines of data:
1/1/2021 0:00 150 cfs
1/1/2021 1:00 0 cfs
1/1/2021 2:00 255 cfs
(where summation of all 24 hrs of 1/1/2021 = 2596 cfs)
etc.
Sorry if this is a ridiculously easy question... I am very new to coding.
Here is the code that I've written so far... what I need is 8,760 lines of SF that I can then use to multiply the original Data Set B by. The final product, Data Set C, will be Date - Time - rescaled hourly data. I actually have to do this for three pumping units total, to give me a matrix of 5 columns by 8,760 rows, but I think I'll be able to figure the unit thing out. My problem now is how to integrate the two data sets. Thank you for reading!
print('Solving the Temperature Model programming problem')

fhand1 = open('Interpolate_CY21_short.txt')
fhand2 = open('WSE_Daily_CY21_short.txt')

# Hourly Interpolated Pardee PowerHouse Data
for line1 in fhand1:
    line1 = line1.rstrip()
    words1 = line1.split()
    # Hourly interpolated data - parsed down (cfs)
    x = float(words1[7])
    if x < 100:
        x = 0
    #print(x)

# WSE Daily Average PowerHouse Data
for line2 in fhand2:
    line2 = line2.rstrip()
    words2 = line2.split()
    # Daily cfs average x 24 hrs
    aa = float(words2[2])*24
    #print(a)

SF = x * aa
print(SF)
This is how you would get the data into two lists:
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
daily_average = fhand2.readlines()  # the daily file (Data Set A)
daily = fhand1.readlines()          # the hourly file (Data Set B)
# this is what the two lists would look like, roughly
# each line would be a separate string
daily_average = ["1/1/2021 93 cfs","1/2/2021 0 cfs"]
daily = ["1/1/2021 0:00 150 cfs", "1/1/2021 1:00 0 cfs", "1/2/2021 1:00 0 cfs"]
Then, to process the lists, you could use a double nested for loop:
for average_line in daily_average:
    average_line = average_line.rstrip()
    average_date, average_count, average_symbol = average_line.split()
    for daily_line in daily:
        daily_line = daily_line.rstrip()
        date, hour, count, symbol = daily_line.split()
        if average_date == date:
            print(f"date={date}, average_count={average_count} count={count}")
Or a dictionary
# populate data into dictionaries
daily_average_data = dict()
for line in daily_average:
    line = line.rstrip()
    day, count, symbol = line.split()
    daily_average_data[day] = (day, count, symbol)

daily_data = dict()
for line in daily:
    line = line.rstrip()
    day, hour, count, symbol = line.split()
    if day not in daily_data:
        daily_data[day] = list()
    daily_data[day].append((day, hour, count, symbol))

# now you can access daily_average_data and daily_data as
# dictionaries instead of files

# process data
result = list()
for date in daily_data.keys():
    print(date)
    print(daily_average_data[date])
    print(daily_data[date])
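Building on the dictionaries above, here is a hedged sketch of the scaling-factor computation the question describes (daily average x 24, divided by that day's summed hourly values, applied back to each hourly row); the names scaled, sf, and so on are illustrative only:
# Sketch: one scaling factor per day, then rescale every hourly value
# (this produces Data Set C). Counts were parsed as strings above, so
# convert them with float() first.
scaled = []
for day, hourly_rows in daily_data.items():
    _, avg_count, _ = daily_average_data[day]
    daily_total = float(avg_count) * 24                       # e.g. 93 * 24 = 2232
    hourly_sum = sum(float(c) for _, _, c, _ in hourly_rows)  # e.g. 2596 on 1/1/2021
    sf = daily_total / hourly_sum if hourly_sum else 0.0      # e.g. 2232 / 2596 = 0.8597
    for d, hour, count, symbol in hourly_rows:
        scaled.append((d, hour, float(count) * sf))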
If the data items corresponded with one another line by line, you could use https://realpython.com/python-zip-function/
Here is an example:
for data1, data2 in zip(daily_average, daily):
    print(f"{data1} {data2}")
Similar to what @oasispolo described, the solution is to make a single loop and process both lists in it. I'm personally not fond of the "zip" function. (It's a purely stylistic objection; lots of other people like it and that's fine.)
Here's a solution with syntax that I find more intuitive:
print('Solving the Temperature Model programming problem')

fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')

# Convert each file into a list of lines. You're doing this
# implicitly, but I like to be explicit about it.
lines1 = fhand1.readlines()
lines2 = fhand2.readlines()

if len(lines1) != len(lines2):
    raise ValueError("The two files have different length!")

# Initialize an output array. You could also construct it
# one item at a time, but that can be slow for large arrays.
# It is more efficient to initialize the entire array at
# once if possible.
sf_list = [0]*len(lines1)

for position in range(len(lines1)):
    # range(L) generates numbers 0...L-1
    line1 = lines1[position].rstrip()
    words1 = line1.split()
    x = float(words1[7])
    if x < 100:
        x = 0

    line2 = lines2[position].rstrip()
    words2 = line2.split()
    aa = float(words2[2])*24

    sf_list[position] = x * aa

print(sf_list)
I have downloaded a NetCDF4 file of total hourly precipitation across Sierra Leone from 1974 to the present, and have started writing code to analyze it.
I'm trying to form a table in Python that will display my average annual rainfall for different rainfall durations, rather like this one below:
I'm wondering if anyone has done anything similar to this before and could possibly help me out, as I'm very new to programming.
Here is the script I've written so far, which records the hourly data for each year. From here I need to find a way to store this information in a table, then change the duration to, say, 2 hours, and repeat until I have a complete table:
import glob
import numpy as np
from netCDF4 import Dataset
import pandas as pd
import xarray as xr

all_years = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[11:16]
    all_years.append(year)

year_start = '01-01-1979'
year_end = '31-12-2021'
date_range = pd.date_range(start=str(year_start),
                           end=str(year_end),
                           freq='H')
df = pd.DataFrame(0.0, columns=['tp'], index=date_range)

lat_freetown = 8.4657
lon_freetown = 13.2317

all_years.sort()
for yr in range(1979, 2021):
    data = Dataset('era5_year' + str(yr) + '.nc', 'r')
    lat = data.variables['latitude'][:]
    lon = data.variables['longitude'][:]

    sq_diff_lat = (lat - lat_freetown)**2
    sq_diff_lon = (lon - lon_freetown)**2

    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()

    tp = data.variables['tp']

    start = str(yr) + '-01-01'
    end = str(yr) + '-12-31'
    d_range = pd.date_range(start=start,
                            end=end,
                            freq='H')

    for t_index in np.arange(0, len(d_range)):
        print('Recording the value for: ' + str(d_range[t_index]) + str(tp[t_index, min_index_lat, min_index_lon]))
        df.loc[d_range[t_index]]['tp'] = tp[t_index, min_index_lat, min_index_lon]
I gave this a try, I hope it helps.
I downloaded two years of coarse US precip data here:
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2000.nc
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2001.nc
import xarray as xr
import pandas as pd

# Read two datasets and append them so there are multiple years of hourly data
precip_full1 = xr.open_dataset('precip.hour.2000.nc') * 25.4
precip_full2 = xr.open_dataset('precip.hour.2001.nc') * 25.4
precip_full = xr.concat([precip_full1, precip_full2], dim='time')

# Select only the Western half of the US
precip = precip_full.where(precip_full.lon < 257, drop=True)

# Initialize output
output = []

# Select number of hours to sum
# This assumes that the data is hourly
intervals = [1, 2, 6, 12, 24]

# Loop through each desired interval
for interval in intervals:
    # Take rolling sum
    # This means the value at any time is the sum of the preceding times
    # So when interval is 6, it's the sum of the previous six values
    roll = precip.rolling(time=interval, center=False).sum()

    # Take the annual mean and average over all space
    annual = roll.groupby('time.year').mean('time').mean(['lat', 'lon'])

    # Convert output to a pandas dataframe
    # and rename the column to correspond to the interval length
    tab = annual.to_dataframe().rename(columns={'precip': str(interval)})

    # Keep track of the output by appending it to the output list
    output.append(tab)

# Combine the dataframes into one, by columns
output = pd.concat(output, axis=1)
The output looks like this:
             1         2         6        12        24
year
2000  0.014972  0.029947  0.089856  0.179747  0.359576
2001  0.015610  0.031219  0.093653  0.187290  0.374229
Again, this assumes that the data is already hourly. It also takes the average over any (for example) 6-hour period, so it's not just 00:00-06:00, 06:00-12:00, etc.; it's 00:00-06:00, 01:00-07:00, etc., and then the annual mean. If you wanted the former, you could use xarray's resample function after taking the rolling sum.
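For instance, a sketch of that non-overlapping variant; summing with resample over 6-hour bins gives the same result as keeping every sixth value of the rolling sum:
# Non-overlapping 6-hour totals, then the same annual/spatial mean as above
six_hourly = precip.resample(time='6H').sum()
annual_6h = six_hourly.groupby('time.year').mean('time').mean(['lat', 'lon'])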
I am trying to extract from multi-temporal files (netCDF) the value of the pixel at a specific location and time.
Each file is named: T2011, T2012, and so on until T2017.
Each file contains 365 layers; every layer corresponds to one day of the year, and each layer expresses the temperature of that day.
My goal is to extract information according to my input dataset.
I have a csv (locd.csv) with my targets and it looks like this:
id lat lon DateFin DateCount
1 46.63174271 7.405986324 02-02-18 43,131
2 46.64972969 7.484352537 25-01-18 43,123
3 47.27028727 7.603811832 20-01-18 43,118
4 46.99994455 7.063905466 05-02-18 43,134
5 47.08125481 7.19501811 20-01-18 43,118
6 47.37833814 7.432005368 11-12-18 43,443
7 47.43230354 7.445253182 30-12-18 43,462
8 46.73777711 6.777871255 09-04-18 43,197
69 47.42285191 7.113934735 09-04-18 43,197
The id is the location I am interested in, lat and lon are the latitude and longitude, DateFin corresponds to the date at which I want to know the temperature at that particular location, and DateCount corresponds to the number of days from 01-01-1900 to the date I am interested in (that's how the layers are indexed in the file).
To do that I have something like this:
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
from datetime import date
import os

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']  # that's how the days are stored
    year = file[0:4]
    all_years.append(year)

# define my input data
cities = pd.read_csv('locd.csv')

# extracting the data
for index, row in cities.iterrows():
    id_row = row['id']  # id from the database
    location_latitude = row['lat']
    location_longitude = row['lon']
    location_date = row['DateCount']  # number of days counting since 1900-01-01

    # Sorting the all_years python list
    all_years.sort()

    for yr in all_years:
        # Reading-in the data
        data = Dataset(str(yr) + '.nc', 'r')

        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]

        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2

        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()

        # Accessing the precipitation data
        prec = data.variables['precipi']  # that's how the variable is called

        for p_index in np.arange(0, len(location_date)):
            print('Recording the value for ' + id_row + ': ' + str(location_date[p_index]))
            df.loc[id_row[location_date]]['Precipitation'] = prec[location_date, min_index_lat, min_index_lon]

# to record it in a new archive
df.to_csv('locationNew.csv')
My issues:
I can't manage to make it work. Every time a new error comes up; now it says that "id_row" must be a string.
Does anybody have a hint or experience working with these type of files?
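A hedged sketch of the fixes those errors point at, keeping the question's variable names; the netCDF lookup (prec, min_index_lat, min_index_lon) is assumed to be set up exactly as in the loop above:
import pandas as pd

cities = pd.read_csv('locd.csv')
records = []
for _, row in cities.iterrows():
    id_row = row['id']
    # DateCount is one scalar per row with a thousands separator
    # ("43,131"), so strip the comma instead of looping over it with len()
    day_index = int(str(row['DateCount']).replace(',', ''))
    # id_row is an integer, so an f-string avoids the str + int TypeError
    print(f'Recording the value for {id_row}: {day_index}')
    value = float(prec[day_index, min_index_lat, min_index_lon])
    records.append({'id': id_row, 'Precipitation': value})

pd.DataFrame(records).to_csv('locationNew.csv', index=False)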
I am working with a CSV sheet which contains data from a brewery, e.g. date required, quantity ordered, etc.
I want to write a module to read the CSV file structure and load the data into a suitable data structure
in Python. I have to interpret the data by calculating the average growth rate and the ratio of sales for
different beers, and use these values to predict sales for a given week or month in the future.
I have no idea where to start. The only lines of code I have so far are:
df = pd.read_csv(r'file location')
print(df)
To illustrate, I have downloaded data on the US employment level (https://fred.stlouisfed.org/series/CE16OV) and population (https://fred.stlouisfed.org/series/POP).
import pandas as pd
employ = pd.read_csv('/home/brb/bugs/data/CE16OV.csv')
employ = employ.rename(columns={'DATE':'date'})
employ = employ.rename(columns={'CE16OV':'employ'})
employ = employ[employ['date']>='1952-01-01']
pop = pd.read_csv('/home/brb/bugs/data/POP.csv')
pop = pop.rename(columns={'DATE':'date'})
pop = pop.rename(columns={'POP':'pop'})
pop = pop[pop['date']<='2019-10-01']
df = pd.merge(employ,pop)
df['employ_monthly'] = df['employ'].pct_change()
df['employ_yoy'] = df['employ'].pct_change(periods=12)
df['employ_pop'] = df['employ']/df['pop']
df.head()
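From here, a naive prediction can be as simple as compounding the mean growth rate forward; a sketch using the employ_monthly column computed above (the three-period horizon is arbitrary):
# Naive constant-growth forecast from the last observed value
mean_growth = df['employ_monthly'].mean()
last_value = df['employ'].iloc[-1]
forecast = [last_value * (1 + mean_growth) ** k for k in range(1, 4)]
print(forecast)  # next three months, assuming constant growth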
I have created a charging simulation program that simulates different electric cars arriving at different stations to charge.
When the simulation is finished, the program creates CSV files for the charging stations, with stats per hour and stats per day; for now, the stats-per-hour CSV is the important one for me.
I want to plot queue_length_per_hour (how many cars are waiting in the queue, every hour from 0 to 24) for the different stations.
But I do NOT want to include all the stations, because there are too many of them; I think 3 stations are more than enough.
Which 3 stations should I pick? I chose to pick the 3 stations that had the most visiting cars during the day (which I can see at hour 24).
As you can see in the code, I have used the filter method from pandas to pick the top 3 stations based on which had the most visited cars at hour 24 in the CSV file.
Now that I have the top three stations, I want to plot the entire cars_in_queue_per_hour column, not only for hour 24 but all the way from hour 0.
from time import sleep
import pandas as pd
import csv
import matplotlib.pyplot as plt

file_to_read = pd.read_csv('results_per_hour/hotspot_districts_results_from_simulation.csv', sep=";", encoding="ISO-8859-1")

read_columns_of_file = file_to_read.columns
read_description = file_to_read.describe()

visited_cars_at_hour_24 = file_to_read["hour"] == 24
filtered = file_to_read.where(visited_cars_at_hour_24, inplace=True, axis=0)
top_three = file_to_read.nlargest(3, 'visited_cars')
# This picks the top 3 stations based on how many visited cars they had during the day
#print("Top Three stations based on amount of visited cars:\n{}".format(top_three))
#print(type(top_three))

top_one_station = top_three.iloc[0]    # HOW CAN I PLOT QUEUE_LENGTH_PER_HOUR COLUMN FROM THIS STATION TO A GRAPH?
top_two_station = top_three.iloc[1]    # HOW CAN I ALSO PLOT QUEUE_LENGTH_PER_HOUR COLUMN FROM THIS STATION TO A GRAPH?
top_three_station = top_three.iloc[2]  # AND ALSO THIS?
#print(top_one_station)
#print(file_to_read.where(file_to_read["name"] == "Vushtrri"))

#for row_index, row in top_three.iterrows():
#    print(row)
#    print(row_index)
#    print(file_to_read.where(file_to_read["name"] == row["name"]))
#    print(file_to_read.where(file_to_read["name"] == row["name"]).columns)

xlabel = []
for hour in range(0, 25):
    xlabel.append(hour)

ylabel = [0] * 25  # how to append queue length per hour for the top 3 stations here?
plt.plot(xlabel, ylabel)
plt.show()
The code is also available at this repl.it link together with the CSV files: https://repl.it/#raxor2k/almost-done
I really like the seaborn package for this type of plot, so I would use:
import seaborn as sns
df_2 = file_to_read[file_to_read['name'].isin(top_three['name'])]
sns.factorplot(x='hour', y='cars_in_queue_per_hour', data=df_2, hue='name')
You already selected the top three names, so the only relevant part is to use isin to select the rows of the dataframe whose name matches one of the top three, and let seaborn make the plot.
For this to work, make sure you change one line of code by removing the inplace:
filtered = file_to_read.where(visited_cars_at_hour_24, axis=0)
top_three = (filtered.nlargest(3, 'visited_cars'))
This leaves your original dataframe intact, so you can still use all the data from it. If you use inplace, you cannot assign the result back: the operation acts in place and returns None.
I cleaned out the lines of code you don't need for the plot, so your complete code to reproduce would be:
import pandas as pd
import seaborn as sns

file_to_read = pd.read_csv('results_per_hour/hotspot_districts_results_from_simulation.csv', sep=";", encoding="ISO-8859-1")
top_three = file_to_read[file_to_read['hour'] == 24].nlargest(3, 'visited_cars')
df_2 = file_to_read[file_to_read['name'].isin(top_three['name'])]
sns.factorplot(x='hour', y='cars_in_queue_per_hour', data=df_2, hue='name')
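If you would rather stay with plain matplotlib, as in the question's own code, an equivalent sketch using the same df_2:
import matplotlib.pyplot as plt

# One line per station: group the filtered frame by station name
for name, group in df_2.groupby('name'):
    plt.plot(group['hour'], group['cars_in_queue_per_hour'], label=name)
plt.xlabel('hour')
plt.ylabel('cars in queue')
plt.legend()
plt.show()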