The following code parses a text file, extracts column values, and creates a datetime column by combining a few of the columns. What can I do to improve this code? More specifically, writing float(eq_params[1]), etc. was time-consuming. Is there any way to more quickly parse the column data in Python, without using Pandas? Thanks.
#!/usr/bin/env python
import datetime

file = 'input.txt'
fopen = open(file)
data = fopen.readlines()
fout = open('output.txt', 'w')
for line in data:
    eq_params = line.split()
    lat = float(eq_params[1])
    lon = float(eq_params[2])
    dep = float(eq_params[3])
    yr = int(eq_params[10])
    mo = int(eq_params[11])
    day = int(eq_params[12])
    hr = int(eq_params[13])
    minute = int(eq_params[14])
    sec = int(float(eq_params[15]))
    dt = datetime.datetime(yr, mo, day)
    tm = datetime.time(hr, minute, sec)
    time = dt.combine(dt, tm)
    fout.write('%s %s %s %s\n' % (lat, lon, dep, time))
fout.close()
fopen.close()
You could reduce the amount of typing by using a combination of map, slicing and variable unpacking to convert the parsed parameters and assign them in a single line:
lat, lon, dep = map(float, eq_params[1:4])
yr, mo, day, hr, minute = map(int, eq_params[10:15])
sec = int(float(eq_params[15]))
Using a list comprehension would work similarly:
lat, lon, dep = [float(x) for x in eq_params[1:4]]
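Applied to a single input line, the parsing step could look like the sketch below (the sample line is hypothetical, laid out with the question's columns: 1-3 hold lat/lon/depth, 10-15 hold year through seconds):

```python
import datetime

# Hypothetical sample line with the question's column layout
line = "ev1 35.1 -120.5 7.2 x x x x x x 2004 8 19 13 45 12.6"
eq_params = line.split()

# Unpack and convert the columns in bulk instead of one at a time
lat, lon, dep = map(float, eq_params[1:4])
yr, mo, day, hr, minute = map(int, eq_params[10:15])
sec = int(float(eq_params[15]))  # seconds may carry a fractional part

time = datetime.datetime(yr, mo, day, hr, minute, sec)
print('%s %s %s %s' % (lat, lon, dep, time))
```

The same unpacking works inside the original for loop, one line per converted group.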
I am trying to write a function that returns a list/array with the daily averages of a variable from one of three CSV files.
Each csv file is similar to this:
date, time, variable1, variable2, variable3
2021-01-01,01:00:00,1.43738,25.838,22.453
2021-01-01,02:00:00,2.08652,21.028,19.099
2021-01-01,03:00:00,1.39101,23.18,20.925
2021-01-01,04:00:00,0.76506,22.053,19.974
The dates cover the entire year of 2021 in increments of 1 hour.
def daily_average(data, station, variable):
The function has 3 parameters:
data
station: One of the 3 csv files
variable: Either variable 1 or 2 or 3
Libraries such as datetime, calendar and numpy can be used
Pandas can also be used
First of all, try to do it yourself before asking a question - it will help you learn. But now to your question.
csv_lines_test = [
    "date, time, variable1, variable2, variable3\n",
    "2021-01-01,01:00:00,1.43738,25.838,22.453\n",
    "2021-01-01,02:00:00,2.08652,21.028,19.099\n",
    "2021-01-01,03:00:00,1.39101,23.18,20.925\n",
    "2021-01-01,04:00:00,0.76506,22.053,19.974\n",
]
import datetime as dt

def daily_average(csv_lines, date, variable_num):
    # variable_num should be 1-3
    avg_arr = []
    # Read the csv file line by line.
    for i, line in enumerate(csv_lines):
        if i == 0:
            # Skip the header
            continue
        line = line.rstrip()
        values = line.split(',')
        date_csv = dt.datetime.strptime(values[0], "%Y-%m-%d").date()
        val_arr = [float(val) for val in values[2:]]
        if date == date_csv:
            avg_arr.append(val_arr[variable_num - 1])
    return sum(avg_arr) / len(avg_arr)

avg = daily_average(csv_lines_test, dt.date(2021, 1, 1), 1)
print(avg)
If you want to read data directly from csv file:
with open("csv_file_path.csv", 'r') as f:
    data = [line for line in f]
avg = daily_average(data, dt.date(2021, 1, 1), 1)
print(avg)
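Since the question allows Pandas, the same daily average can also be sketched with groupby (the in-memory CSV here is hypothetical sample data in the question's layout; skipinitialspace handles the spaces after the commas in the header row):

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the CSV files
csv_text = (
    "date, time, variable1, variable2, variable3\n"
    "2021-01-01,01:00:00,1.0,25.0,22.0\n"
    "2021-01-01,02:00:00,3.0,21.0,19.0\n"
    "2021-01-02,01:00:00,5.0,23.0,20.0\n"
)
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Group the rows by date and average the chosen variable per day
daily = df.groupby("date")["variable1"].mean()
print(daily["2021-01-01"])
```

Replacing io.StringIO(csv_text) with a file path reads a real CSV the same way.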
I am very new to Python and need some help finding the maximum/highest value in a column of data (time) imported from a CSV file. This is the code I have tried:
file = open("results.csv")
unneeded = file.readline()
for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    maxtime = 0
    for x in hours:
        if x > maxtime:
            maxtime = x
print(maxtime)
Any help is appreciated.
Edit: I tried this code but it gives me the wrong answer :(
file = open("results.csv")
unneeded = file.readline()
maxtime = 0
for line in file:
    data = file.readline()
    linelist = line.split(",")
    hours = linelist[4]
    if hours > str(maxtime):
        maxtime = hours
print(maxtime)
Edit: [first few lines of results.csv][1]
[1]: https://i.stack.imgur.com/z3pEJ.png
I haven't tested it, but this should work. Using the csv library makes parsing CSV files easy.
import csv

maxtime = 0
with open("results.csv") as file:
    csv_reader = csv.reader(file, delimiter=',')
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        hours = float(row[4])  # convert so the comparison is numeric, not string-based
        if hours > maxtime:
            maxtime = hours
print(maxtime)
Note that the with block closes the file for you, so no file.close() is needed.
My recommendation is using the pandas module for anything CSV-related.
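For instance, a minimal pandas sketch of the same max-of-a-column task (the data here is a hypothetical stand-in for results.csv, with the time value in the fifth column):

```python
import io
import pandas as pd

# Hypothetical stand-in for results.csv; "hours" is the fifth column
csv_text = "name,a,b,c,hours\nalice,1,2,3,4.5\nbob,4,5,6,9.25\ncarol,7,8,9,2.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv parses the column as numbers, so max() compares values, not strings
maxtime = df["hours"].max()
print(maxtime)
```

With a real file you would pass the path to read_csv instead of the StringIO buffer.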
Using dateutil, I create a dataset of dates and identification numbers whose date values are shuffled (in no specific order), like so:
from dateutil.parser import *
from dateutil.rrule import *
from random import shuffle
dates = list(rrule(
    DAILY,
    dtstart=parse("19970902T090000"),
    until=parse("19971224T000000")
))
shuffle(dates)

with open('out.csv', mode='w', encoding='utf-8') as out:
    for i, date in enumerate(dates):
        out.write(f'{i},{date}\n')
Thus, in this particular dataset, 1997-12-23 09:00:00 would be the "largest" date. To extract the maximum date, we can do it via string comparisons if it is formatted in the ISO 8601 date/time format:
from pandas import read_csv

# header=None, since out.csv was written without a header row
df = read_csv('out.csv', names=['index', 'time'], header=None)
print(max(df['time']))
After running it, we indeed get 1997-12-23 09:00:00 printed in the terminal!
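The string-comparison trick works because ISO 8601 timestamps sort lexicographically in the same order as chronologically; a tiny sketch with made-up values:

```python
# ISO 8601 timestamps compare correctly as plain strings
times = ["1997-09-02 09:00:00", "1997-12-23 09:00:00", "1997-10-15 09:00:00"]
print(max(times))  # latest timestamp wins the string comparison
```

This is exactly why no datetime parsing is needed before taking the max above.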
I need to remove the day from the date, and I tried to use datetime.strftime and datetime.strptime but couldn't get it to work. I need to create a tuple of 2 items (date, price) from a nested list, but I need to change the date format first.
here's part of the code:
def get_data(my_csv):
    with open("my_csv.csv", "r") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        next(csv_reader)
        data = []
        for line in csv_reader:
            data.append(line)
    return data

def get_monthly_avg(data):
    oldformat = '20040819'
    datetimeobject = datetime.strptime(oldformat, '%y%m%d')
    newformat = datetime.strftime('%y%m ')
You have a typo in your date formats: 'Y' has to be capitalized.
from datetime import datetime

# use datetime to convert
def strip_date(data):
    d = datetime.strptime(data, '%Y%m%d')
    return datetime.strftime(d, '%Y%m')

data = '20110513'
print(strip_date(data))

# or just cut off the day (last 2 characters) from the date string
print(data[:6])
The first variant is better because you can verify that string is in proper date format.
Output:
201105
201105
You didn't specify any code, but this might work:
date = functionThatGetsDate()
date = date[0:6]
Below is some code written to open a CSV file. Its values are stored like this:
03/05/2017 09:40:19,21.2,35.0
03/05/2017 09:40:27,21.2,35.0
03/05/2017 09:40:38,21.1,35.0
03/05/2017 09:40:48,21.1,35.0
This is just a snippet from a real-time plotting program, which fully works, but the fact that the arrays get so big is unclean. Normally new values are added to the CSV while the program is running, and the arrays become very long. Is there a way to avoid exploding arrays like this?
If you run the program (you will have to make a CSV with those values too), you will see my problem.
from datetime import datetime
import time

y = []  # temperature
t = []  # time object
h = []  # humidity

def readfile():
    readFile = open('document.csv', 'r')
    sepFile = readFile.read().split('\n')
    readFile.close()
    for idx, plotPair in enumerate(sepFile):
        if plotPair in '. ':
            # skip '.' or space
            continue
        if idx > 1:  # to skip the first line
            xAndY = plotPair.split(',')
            time_string = xAndY[0]
            time_string1 = datetime.strptime(time_string, '%d/%m/%Y %H:%M:%S')
            t.append(time_string1)
            y.append(float(xAndY[1]))
            h.append(float(xAndY[2]))
    print([y])

while True:
    readfile()
    time.sleep(2)
This is the output I get:
[[21.1]]
[[21.1, 21.1]]
[[21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1, 21.1]]
Any help is appreciated.
You can use Python's deque if you also want to limit the total number of entries you wish to keep. It produces a list with a maximum length: once the list is full, any new entry pushes the oldest entry off the start.
The reason your list keeps growing is that you need to re-read your file up to the point of your last entry before continuing to add new entries. Assuming your timestamps are unique, you could use takewhile() to help you do this, which reads entries until a condition is met.
from itertools import takewhile
from collections import deque
from datetime import datetime
import csv
import time

max_length = 1000  # keep this many entries

t = deque(maxlen=max_length)  # time object
y = deque(maxlen=max_length)  # temperature
h = deque(maxlen=max_length)  # humidity

def read_file():
    with open('document.csv', newline='') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input)  # skip over the header line
        # If there are existing entries, read until the last read item is found again
        if len(t):
            list(takewhile(lambda row: datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S') != t[-1], csv_input))
        for row in csv_input:
            print(row)
            t.append(datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S'))
            y.append(float(row[1]))
            h.append(float(row[2]))

while True:
    read_file()
    print(t)
    time.sleep(1)
Also, it is easier to work with the entries using Python's built-in csv library, which reads the values of each row into a list for you. As you have a header row, read it in using next() before starting the loop.
I have a task where I have been given a set of data as follows.
Station1.txt sample  # different sets of data for different stations
Date Temperature
19600101 46.1
19600102 46.7
19600103 99999.9 #99999 = not recorded
19600104 43.3
19600105 38.3
19600106 40.0
19600107 42.8
I am trying to create a function
display_maxs(stations, dates, data, start_date, end_date) which displays
a table of maximum temperatures for the given station/s and the given date
range. For example:
stations = load_stations('stations2.txt')
5
data = load_all_stations_data(stations)
dates = load_dates(stations)
display_maxs(stations, dates, data, '20021224', '20021228')  # these are dates, yyyymmdd
I have created functions for data
def load_all_stations_data(stations):
    data = {}
    file_list = ("Brisbane.txt", "Rockhampton.txt", "Cairns.txt", "Melbourne.txt", "Birdsville.txt", "Charleville.txt")
    for file_name in file_list:
        file = open(stations(), 'r')
        station = file_name.split()[0]
        data[station] = []
        for line in file:
            values = line.strip().strip(' ')
            if len(values) == 2:
                data[station] = values[1]
        file.close()
    return data
functions for stations
def load_all_stations_data(stations):
    stations = []
    f = open(stations[0] + '.txt', 'r')
    stations = []
    for line in f:
        x = (line.split()[1])
        x = x.strip()
        temp.append(x)
    f.close()
    return stations
and functions for dates
def load_dates(stations):
    f = open(stations[0] + '.txt', 'r')
    dates = []
    for line in f:
        dates.append(line.split()[0])
    f.close()
    return dates
Now I just need help with creating the table which displays the max temp for any given date restrictions and calls the above functions with data, dates and station.
Not really sure what those functions are supposed to do, particularly as two of them seem to have the same name. Also there are many errors in your code.
file = open(stations(), 'r') here, you try to call stations as a function, but it seems to be a list.
station = file_name.split()[0] the files names have no space, so this has no effect. Did you mean split('.')?
values = line.strip().strip(' ') probably one of those strip should be split?
data[station] = values[1] overwrites data[station] in each iteration. You probably wanted to append the value?
temp.append(x) the variable temp is not defined; did you mean stations?
Also, instead of reading the dates and the values into two separate lists, I suggest you create a list of tuples. This way you will only need a single function:
def get_data(filename):
    with open(filename) as f:
        data = []
        for line in f:
            try:
                date, value = line.split()
                data.append((int(date), float(value)))
            except:
                pass  # pass on header, empty lines, etc.
    return data
If this is not an option, you might create a list of tuples by zipping the lists of dates and values, i.e. data = zip(dates, values). Then, you can use the max builtin function together with a list comprehension or generator expression for filtering the values between the dates and a special key function for sorting by the value.
def display_maxs(data, start_date, end_date):
    return max(((d, v) for (d, v) in data
                if start_date <= d <= end_date and v < 99999),
               key=lambda x: x[1])

print(display_maxs(get_data("Station1.txt"), 19600103, 19600106))
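The zip-based alternative mentioned above could look like this sketch, with made-up lists standing in for the parsed dates and temperatures:

```python
# Hypothetical parsed columns (dates as yyyymmdd ints, temps as floats)
dates = [19600101, 19600102, 19600103, 19600104]
values = [46.1, 46.7, 99999.9, 43.3]
data = list(zip(dates, values))

# Filter by date range, drop the 99999.9 "not recorded" sentinel,
# then take the max by value
best = max(((d, v) for d, v in data
            if 19600101 <= d <= 19600104 and v < 99999),
           key=lambda x: x[1])
print(best)
```

The key function makes max() compare by temperature while still returning the (date, value) pair.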
Use pandas. Reading in each text file is just a single function call, with comment handling, missing-data (99999.9) handling, and date handling built in. The code below reads in the files from a sequence of file names stations, converting the comment lines and the 99999.9 values to "missing"; the file names minus their extensions become the station names. It then selects the date range from start to stop and the given sequence of station names, and gets the maximum of each (in maxdf).
import pandas as pd
import os

def load_all_stations_data(stations):
    """Load the stations defined in the sequence of file names."""
    sers = []
    for fname in stations:
        ser = pd.read_csv(fname, sep=r'\s+', header=0, index_col=0,
                          comment='#', engine='python', parse_dates=True,
                          squeeze=True, na_values=['99999.9'])
        ser.name = os.path.splitext(fname)[0]
        sers.append(ser)
    return pd.concat(sers, axis=1)

def get_maxs(df, start, stop, sites):
    """Select the given stations and date range, then get the max for each."""
    return df.loc[start:stop, sites].max(skipna=True)
Usage of the second function would be like so:
maxdf = get_maxs(df, '20021224', '20021228', ("Rockhampton", "Cairns"))
If the #99999 = not recorded comment is not actually in your files, you can get rid of the engine='python' and comment='#' arguments:
def load_all_stations_data(stations):
    """Load the stations defined in the sequence of file names."""
    sers = []
    for fname in stations:
        ser = pd.read_csv(fname, sep=r'\s+', header=0, index_col=0,
                          parse_dates=True, squeeze=True,
                          na_values=['99999.9'])
        ser.name = os.path.splitext(fname)[0]
        sers.append(ser)
    return pd.concat(sers, axis=1)