Python: daily average function

I am trying to make a function that returns a list/array with the daily averages for a variable from any one of 3 CSV files.
Each CSV file is similar to this:
date, time, variable1, variable2, variable3
2021-01-01,01:00:00,1.43738,25.838,22.453
2021-01-01,02:00:00,2.08652,21.028,19.099
2021-01-01,03:00:00,1.39101,23.18,20.925
2021-01-01,04:00:00,0.76506,22.053,19.974
The dates cover the entire year of 2021 in 1-hour increments.
def daily_average(data, station, variable):
The function has 3 parameters:
data
station: one of the 3 CSV files
variable: variable1, variable2, or variable3
Libraries such as datetime, calendar, and numpy can be used; pandas can also be used.

Well. First of all, try to do it yourself before asking a question - it will help you learn. But now to your question.
csv_lines_test = [
    "date, time, variable1, variable2, variable3\n",
    "2021-01-01,01:00:00,1.43738,25.838,22.453\n",
    "2021-01-01,02:00:00,2.08652,21.028,19.099\n",
    "2021-01-01,03:00:00,1.39101,23.18,20.925\n",
    "2021-01-01,04:00:00,0.76506,22.053,19.974\n",
]

import datetime as dt

def daily_average(csv_lines, date, variable_num):
    # variable_num should be 1-3
    avg_arr = []
    # Read the csv content line by line.
    for i, line in enumerate(csv_lines):
        if i == 0:
            # Skip the header row
            continue
        line = line.rstrip()
        values = line.split(',')
        date_csv = dt.datetime.strptime(values[0], "%Y-%m-%d").date()
        val_arr = [float(val) for val in values[2:]]
        if date == date_csv:
            avg_arr.append(val_arr[variable_num - 1])
    return sum(avg_arr) / len(avg_arr)

avg = daily_average(csv_lines_test, dt.date(2021, 1, 1), 1)
print(avg)
If you want to read the data directly from a csv file:

with open("csv_file_path.csv", 'r') as f:
    data = [line for line in f]

avg = daily_average(data, dt.date(2021, 1, 1), 1)
print(avg)
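Since the question says pandas is allowed, here is a minimal pandas sketch that computes the average for every day at once; daily_average_pd and "station.csv" are placeholder names, and skipinitialspace handles the spaces after the commas in the header row:

import pandas as pd

def daily_average_pd(path, variable):
    # variable is a column name: "variable1", "variable2" or "variable3"
    df = pd.read_csv(path, skipinitialspace=True)
    # group the hourly rows by their date and average the chosen column
    return df.groupby("date")[variable].mean()

print(daily_average_pd("station.csv", "variable1"))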

Related

How to use python to find the maximum value in a csv file list

I am very new to Python and need some help finding the maximum/highest value in a column of data (time) that is imported from a CSV file. This is the code I have tried:
file = open("results.csv")
unneeded = file.readline()
for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    maxtime = 0
    for x in hours:
        if x > maxtime:
            maxtime = x
print(maxtime)
Any help is appreciated.
Edit: I tried this code but it gives me the wrong answer :(
file = open("results.csv")
unneeded = file.readline()
maxtime = 0
for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours
print(maxtime)
Edit: first few lines of results.csv: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it, but this should work. The csv library makes parsing CSV files easy.

import csv

maxtime = 0
with open("results.csv") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        hours = float(row[4])  # assumes the column holds numeric values
        if hours > maxtime:
            maxtime = hours
print(maxtime)
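For reference, once the rows parse cleanly, the same maximum can be computed in a single pass with the built-in max() and a generator expression (a sketch, assuming column index 4 holds numeric values and the first row is a header):

import csv

with open("results.csv") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    maxtime = max(float(row[4]) for row in reader)
print(maxtime)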
My recommendation is to use the pandas module for anything CSV-related.
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (in no specific order), like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle

dates = list(rrule(
    DAILY,
    dtstart=parse("19970902T090000"),
    until=parse("19971224T000000")
))
shuffle(dates)

with open('out.csv', mode='w', encoding='utf-8') as out:
    for i, date in enumerate(dates):
        out.write(f'{i},{date}\n')
Thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. Then, to extract the maximum date, we can do it via string comparison, since it is formatted in the ISO 8601 date/time format:
from pandas import read_csv

df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
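Note that the string comparison only works because ISO 8601 timestamps sort lexicographically; for any other date format you would parse the column first, e.g. with pandas' to_datetime (a sketch against the same out.csv):

from pandas import read_csv, to_datetime

df = read_csv('out.csv', names=['index', 'time'], header=None)
print(to_datetime(df['time']).max())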

Code Optimization: Parsing Columns and Creating DateTime

The following code parses a text file, extracts column values, and creates a datetime column by combining a few of the columns. What can I do to improve this code? More specifically, writing float(eq_params[1]), etc. was time-consuming. Is there any way to more quickly parse the column data in Python, without using Pandas? Thanks.
#!/usr/bin/env python
import datetime

file = 'input.txt'
fopen = open(file)
data = fopen.readlines()
fout = open('output.txt', 'w')

for line in data:
    eq_params = line.split()
    lat = float(eq_params[1])
    lon = float(eq_params[2])
    dep = float(eq_params[3])
    yr = int(eq_params[10])
    mo = int(eq_params[11])
    day = int(eq_params[12])
    hr = int(eq_params[13])
    minute = int(eq_params[14])
    sec = int(float(eq_params[15]))
    dt = datetime.datetime(yr, mo, day)
    tm = datetime.time(hr, minute, sec)
    time = dt.combine(dt, tm)
    fout.write('%s %s %s %s\n' % (lat, lon, dep, time))

fout.close()
fopen.close()
You could reduce the amount of typing by using a combination of map, slicing, and variable unpacking to convert the parsed parameters and assign them in a single line:

lat, lon, dep = map(float, eq_params[1:4])
yr, mo, day, hr, minute = map(int, eq_params[10:15])
sec = int(float(eq_params[15]))

Using a list comprehension would work similarly:

lat, lon, dep = [float(x) for x in eq_params[1:4]]
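Putting the unpacking together, a minimal sketch of the whole rewritten loop (assuming the same column layout as the original) could look like this:

import datetime

with open('input.txt') as fin, open('output.txt', 'w') as fout:
    for line in fin:
        eq_params = line.split()
        lat, lon, dep = map(float, eq_params[1:4])
        yr, mo, day, hr, minute = map(int, eq_params[10:15])
        sec = int(float(eq_params[15]))
        time = datetime.datetime(yr, mo, day, hr, minute, sec)
        fout.write('%s %s %s %s\n' % (lat, lon, dep, time))

Building the datetime in one call also removes the separate datetime.time and combine step, and the with statement replaces the explicit close() calls.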

How to count lines in a text file with specified values?

I'm working with a .csv file that lists Timestamps in one column and Wind Speeds in the second column. I need to read through this .csv file and calculate the percentage of time the wind speed was above 2 m/s. Here's what I have so far.
txtFile = r"C:\Data.csv"
line = o_txtFile.readline()[:-1]
while line:
    line = oTextfile.readline()
for line in txtFile:
    line = line.split(",")[:-1]
How do I get a count of the lines where the 2nd element in the line is greater than 2?
CSV File Sample
You actually have three options. Depending on the one you choose, you will probably have to slightly update your CSV (for options 1 and 2 you will want to remove all header rows, whereas for option 3 you will keep only the middle one, i.e. the one that starts with TIMESTAMP).
Option 1: Vanilla Python

count = 0
with open('data.csv', 'r') as file:
    for line in file:
        value = float(line.split(',')[1])
        if value > 2:
            count += 1
# Now you have the value in the ``count`` variable
Option 2: CSV module

Here I use Python's csv module (you could as well use the DictReader, but I'll let you do that search yourself).

import csv

count = 0
with open('data.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if float(row[1]) > 2:
            count += 1
# Now you have the value in the ``count`` variable
Option 3: Pandas
Pandas is a really cool, awesome library used by a lot of people to do data analysis. Doing what you want to do would look like:
import pandas as pd
df = pd.read_csv('data.csv')
# Here you are
count = len(df[df['WindSpd_ms'] > 2])
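Since the question asks for a percentage of time rather than a raw count, the pandas option extends naturally: comparing the column to 2 gives a boolean Series, and its mean() is the fraction of True rows (the WindSpd_ms column name is assumed from the sample file):

import pandas as pd

df = pd.read_csv('data.csv')
percent = (df['WindSpd_ms'] > 2).mean() * 100
print(f"{percent:.1f}% of the time above 2 m/s")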
You can read the file line by line and, if a line is not empty, split it. You count the lines read and how many are above 10 m/s, then calculate the percentage:
# create data file for processing with random data
import random

random.seed(42)
with open("data.txt", "w") as f:
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    f.write("header\n")
    for sp in random.choices(range(10), k=200):
        f.write(f"some date,{sp+3.5}, data,data,data\n")

# open/read/calculate the percentage of datapoints with speeds above 10 m/s
days = 0
speedGreater10 = 0
with open("data.txt", "r") as f:
    for _ in range(4):
        next(f)  # ignore the first 4 rows containing headers
    for line in f:
        if line:  # not empty
            _, speed, *p = line.split(",")
            # _ and *p are ignored (they take 'some date' + [data,data,data])
            days += 1
            if float(speed) > 10:
                speedGreater10 += 1

print(f"{days} datapoints, of which {speedGreater10} "
      f"got more than 10m/s: {speedGreater10/days:.1%}")
Output:

200 datapoints, of which 55 got more than 10m/s: 27.5%
Datafile:
header
header
header
header
some date,9.5, data,data,data
some date,3.5, data,data,data
some date,5.5, data,data,data
some date,5.5, data,data,data
some date,10.5, data,data,data
[... some more ...]
some date,8.5, data,data,data
some date,3.5, data,data,data
some date,12.5, data,data,data
some date,11.5, data,data,data

Finding maximum values within a data set given restrictions

I have a task where I have been given a set of data as follows:
Station1.txt sample #different sets of data exist for different stations
Date Temperature
19600101 46.1
19600102 46.7
19600103 99999.9 #99999.9 = not recorded
19600104 43.3
19600105 38.3
19600106 40.0
19600107 42.8
I am trying to create a function display_maxs(stations, dates, data, start_date, end_date) which displays a table of maximum temperatures for the given station(s) and the given date range. For example:

stations = load_stations('stations2.txt')
data = load_all_stations_data(stations)
dates = load_dates(stations)
display_maxs(stations, dates, data, '20021224', '20021228')  # dates are in yyyymmdd format
I have created a function for the data:
def load_all_stations_data(stations):
    data = {}
    file_list = ("Brisbane.txt", "Rockhampton.txt", "Cairns.txt",
                 "Melbourne.txt", "Birdsville.txt", "Charleville.txt")
    for file_name in file_list:
        file = open(stations(), 'r')
        station = file_name.split()[0]
        data[station] = []
        for line in file:
            values = line.strip().strip(' ')
            if len(values) == 2:
                data[station] = values[1]
        file.close()
    return data
a function for the stations:
def load_all_stations_data(stations):
    stations = []
    f = open(stations[0] + '.txt', 'r')
    stations = []
    for line in f:
        x = (line.split()[1])
        x = x.strip()
        temp.append(x)
    f.close()
    return stations
and a function for the dates:
def load_dates(stations):
    f = open(stations[0] + '.txt', 'r')
    dates = []
    for line in f:
        dates.append(line.split()[0])
    f.close()
    return dates
Now I just need help creating the table which displays the max temp for any given date restrictions and calls the above functions with the data, dates, and stations.
It's not really clear what those functions are supposed to do, particularly as two of them have the same name. Also, there are many errors in your code:
file = open(stations(), 'r') here, you try to call stations as a function, but it seems to be a list.
station = file_name.split()[0] the file names have no spaces, so this has no effect. Did you mean split('.')?
values = line.strip().strip(' ') probably one of those strip calls should be a split?
data[station] = values[1] overwrites data[station] in each iteration. You probably wanted to append the value?
temp.append(x) the variable temp is not defined; did you mean stations?
Also, instead of reading the dates and the values into two separate lists, I suggest you create a list of tuples. This way you will only need a single function:
def get_data(filename):
    with open(filename) as f:
        data = []
        for line in f:
            try:
                date, value = line.split()
                data.append((int(date), float(value)))
            except ValueError:
                pass  # skip header, empty lines, etc.
    return data
If this is not an option, you might create the list of tuples by zipping the lists of dates and values, i.e. data = zip(dates, values). Then, you can use the max builtin together with a generator expression for filtering the values between the dates, and a key function for comparing by value:
def display_maxs(data, start_date, end_date):
    return max(((d, v) for (d, v) in data
                if start_date <= d <= end_date and v < 99999),
               key=lambda x: x[1])

print(display_maxs(get_data("Station1.txt"), 19600103, 19600106))
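To get a table covering several stations, a small sketch can loop over the station files (the station names here are just the ones mentioned in the question):

for station in ("Brisbane", "Rockhampton", "Cairns"):
    date, temp = display_maxs(get_data(station + ".txt"), 19600103, 19600106)
    print(station, date, temp)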
Use pandas. Reading in each text file takes just a single function call, with comment handling, missing-data (99999.9) handling, and date handling built in. The code below reads in the files from a sequence of file names stations, handling comments and converting 99999.9 to a "missing" value. It then selects the date range from start to stop and the given sequence of station names (the file names minus the extensions), and takes the maximum of each (in maxdf).
import pandas as pd
import os

def load_all_stations_data(stations):
    """Load the stations defined in the sequence of file names."""
    sers = []
    for fname in stations:
        ser = pd.read_csv(fname, sep=r'\s+', header=0, index_col=0,
                          comment='#', engine='python', parse_dates=True,
                          squeeze=True, na_values=['99999.9'])
        ser.name = os.path.splitext(fname)[0]
        sers.append(ser)
    return pd.concat(sers, axis=1)

def get_maxs(df, start, stop, stations):
    """Get the stations and date range given, then get the max for each."""
    return df.loc[start:stop, list(stations)].max(skipna=True)
Usage of the second function would be like so:
maxdf = get_maxs(df, '20021224', '20021228', ("Rockhampton", "Cairns"))
If the #99999 = not recorded comment is not actually in your files, you can get rid of the engine='python' and comment='#' arguments:
def load_all_stations_data(stations):
    """Load the stations defined in the sequence of file names."""
    sers = []
    for fname in stations:
        ser = pd.read_csv(fname, sep=r'\s+', header=0, index_col=0,
                          parse_dates=True, squeeze=True,
                          na_values=['99999.9'])
        ser.name = os.path.splitext(fname)[0]
        sers.append(ser)
    return pd.concat(sers, axis=1)
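One caveat: the squeeze keyword was removed from read_csv in pandas 2.0, so on recent versions the equivalent is to call .squeeze('columns') on the one-column result:

ser = pd.read_csv(fname, sep=r'\s+', header=0, index_col=0,
                  parse_dates=True, na_values=['99999.9']).squeeze('columns')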

Making various groupings

My data set is a list of people either working together or alone.
I have a row for each project, and columns with names of all the people who worked on that project. If column 2 is the first empty column in a row, it was a solo job. If column 4 is the first empty column in a row, there were 3 people working together.
I have the code to find all pairs. In the output data set, a square N x N matrix is created, with every actor labelling a column and a row. Cells (A,B) and (B,A) contain how many times that pair worked together; A working with B is treated the same as B working with A.
An example of the input data, in a comma delimited fashion:
A,.,.
A,B,.
B,C,E
B,F,.
D,F,.
A,B,C
D,B,.
E,C,B
X,D,A
F,D,.
B,.,.
F,.,.
F,X,C
C,F,D
I am using Python 3.2. The code that does this:
import csv
import collections
import itertools

grid = collections.Counter()

with open("connect.csv", "r") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

actors = sorted(set(pair[0] for pair in grid))

with open("connection_grid.csv", "w") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor,] + [grid[actor, other] for other in actors]
        writer.writerow(line)
My questions are:
If I had a column with months and years, would it be possible to make a matrix spreadsheet for each month-year (i.e., for 2011, I would have 12 matrices)?
For whatever breakdown I use, is it possible to make a variable whose name is a combination of all the people who worked together? E.g. 'ABD' would mean a project that Person A, Person B, and Person D worked on together, and it would equal how many times ABD worked as a group of three, in whatever order. Projects can hold up to 20 people, so it would have to handle groups of 2 to 20. It would also be easiest if the variable names were in alphabetical order.
1) Sort your projects by month & year, then create a new 'grid' for every month. E.g.: pull the month & year from every row, remove them from the row, then add the remaining data to a dictionary. In the end you get something like {(month, year): [line, line, ...]}. From there, it's easy to loop through each month/year and create a grid, output a spreadsheet, etc.
2) ''.join(sorted(line)).replace('.', '') gives you the persons who worked together, listed alphabetically.
import csv
import collections
import itertools

grids = dict()
groups = dict()

with open("connect.csv", "r") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # extract month/year from the last column
        date = line.pop(-1)
        month, year = date.split('/')
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # generate group name
        group = ''.join(sorted(line)).replace('.', '')
        # increment group count
        if group in groups:
            groups[group] += 1
        else:
            groups[group] = 1
        # if a grid exists for the month, update it, else create it
        if (month, year) in grids:
            grid = grids[(month, year)]
        else:
            grid = collections.Counter()
            grids[(month, year)] = grid
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

for date, grid in grids.items():
    actors = sorted(set(pair[0] for pair in grid))
    # filename from date
    filename = "connection_grid_%s_%s.csv" % date
    with open(filename, "w") as fp:
        writer = csv.writer(fp)
        writer.writerow([''] + actors)
        for actor in actors:
            line = [actor,] + [grid[actor, other] for other in actors]
            writer.writerow(line)

with open('groups.csv', 'w') as fp:
    writer = csv.writer(fp)
    for item in sorted(groups.items()):
        writer.writerow(item)
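As a side note, the manual if/else bookkeeping for groups can be dropped by making groups a collections.Counter, since a Counter returns 0 for missing keys. A tiny self-contained demonstration:

import collections

groups = collections.Counter()
for group in ("AB", "AB", "CF"):
    groups[group] += 1  # no membership check needed
print(groups)  # Counter({'AB': 2, 'CF': 1})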
