Finding maximum values within a data set given restrictions - python

I have a task where I have been given a set of data as follows:
Station1.txt sample  # different sets of data for different stations
Date Temperature
19600101 46.1
19600102 46.7
19600103 99999.9 #99999 = not recorded
19600104 43.3
19600105 38.3
19600106 40.0
19600107 42.8
I am trying to create a function
display_maxs(stations, dates, data, start_date, end_date) which displays
a table of maximum temperatures for the given station/s and the given date
range. For example:
stations = load_stations('stations2.txt')
5
data = load_all_stations_data(stations)
dates = load_dates(stations)
display_maxs(stations, dates, data, '20021224', '20021228')  # dates are in yyyymmdd format
I have created a function for data:
def load_all_stations_data(stations):
    data = {}
    file_list = ("Brisbane.txt", "Rockhampton.txt", "Cairns.txt",
                 "Melbourne.txt", "Birdsville.txt", "Charleville.txt")
    for file_name in file_list:
        file = open(stations(), 'r')
        station = file_name.split()[0]
        data[station] = []
        for line in file:
            values = line.strip().strip(' ')
            if len(values) == 2:
                data[station] = values[1]
        file.close()
    return data
a function for stations:
def load_all_stations_data(stations):
    stations = []
    f = open(stations[0] + '.txt', 'r')
    for line in f:
        x = line.split()[1]
        x = x.strip()
        temp.append(x)
    f.close()
    return stations
and a function for dates:
def load_dates(stations):
    f = open(stations[0] + '.txt', 'r')
    dates = []
    for line in f:
        dates.append(line.split()[0])
    f.close()
    return dates
Now I just need help creating the table that displays the maximum temperature for any given date restrictions, calling the above functions with stations, dates and data.

Not really sure what those functions are supposed to do, particularly as two of them seem to have the same name. Also there are many errors in your code.
file = open(stations(), 'r') here, you try to call stations as a function, but it seems to be a list.
station = file_name.split()[0] the files names have no space, so this has no effect. Did you mean split('.')?
values = line.strip().strip(' ') probably one of those strip should be split?
data[station] = values[1] overwrites data[station] in each iteration. You probably wanted to append the value?
temp.append(x) the variable temp is not defined; did you mean stations?
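To illustrate the overwrite-vs-append point from the list above, a minimal sketch (the station name and values are made up):

```python
# Appending to the per-station list instead of overwriting it on every
# iteration; "Brisbane" and the readings here are purely illustrative.
data = {}
data["Brisbane"] = []
for value in ["46.1", "46.7", "43.3"]:
    data["Brisbane"].append(value)  # keeps every reading
# data["Brisbane"] is now ["46.1", "46.7", "43.3"]
```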
Also, instead of reading the dates and the values into two separate list, I suggest you create a list of tuples. This way you will only need a single function:
def get_data(filename):
    with open(filename) as f:
        data = []
        for line in f:
            try:
                date, value = line.split()
                data.append((int(date), float(value)))
            except ValueError:
                pass  # skip header, empty lines, etc.
    return data
If this is not an option, you might create a list of tuples by zipping the lists of dates and values, i.e. data = zip(dates, values). Then, you can use the max builtin function together with a list comprehension or generator expression for filtering the values between the dates and a special key function for sorting by the value.
def display_maxs(data, start_date, end_date):
    return max(((d, v) for (d, v) in data
                if start_date <= d <= end_date and v < 99999),
               key=lambda x: x[1])

print display_maxs(get_data("Station1.txt"), 19600103, 19600106)
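To make the zip suggestion concrete, a small self-contained sketch (the sample dates and values mirror the question's file, not real loader output):

```python
# Pair the separate date and value lists into (date, value) tuples, then
# pick the maximum reading in a range, skipping the 99999.9 sentinel.
# Sample data only; real lists would come from the asker's loaders.
dates = [19600101, 19600102, 19600103, 19600104]
values = [46.1, 46.7, 99999.9, 43.3]
data = list(zip(dates, values))

best = max(((d, v) for (d, v) in data
            if 19600101 <= d <= 19600104 and v < 99999),
           key=lambda x: x[1])
# best is (19600102, 46.7)
```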

Use pandas. Reading in each text file is just a single function call, with comment handling, missing-data (99999.9) handling, and date handling. The code below reads in the files from a sequence of file names stations, handling comments and converting 99999.9 to a "missing" value. It then selects the date range from start to stop and the sequence of station names (the file names minus the extensions), and takes the maximum of each (in maxdf).
import pandas as pd
import os
def load_all_stations_data(stations):
"""Load the stations defined in the sequence of file names."""
sers = []
for fname in stations:
ser = pd.read_csv(fname, sep='\s+', header=0, index_col=0,
comment='#', engine='python', parse_dates=True,
squeeze=True, na_values=['99999.9'])
ser.name = os.path.splitext(fname)[0]
sers.append(ser)
return pd.concat(sers, axis=1)
def get_maxs(df, startdate, stopdate, stations):
    """Get the stations and date range given, then get the max for each."""
    return df.loc[startdate:stopdate, list(stations)].max(skipna=True)
Usage of the second function would be like so:
maxdf = get_maxs(df, '20021224','20021228', ("Rockhampton", "Cairns"))
If the #99999 = not recorded comment is not actually in your files, you can get rid of the engine='python' and comment='#' arguments:
def load_all_stations_data(stations):
    """Load the stations defined in the sequence of file names."""
    sers = []
    for fname in stations:
        ser = pd.read_csv(fname, sep='\s+', header=0, index_col=0,
                          parse_dates=True, squeeze=True,
                          na_values=['99999.9'])
        ser.name = os.path.splitext(fname)[0]
        sers.append(ser)
    return pd.concat(sers, axis=1)

Related

Code Optimization Parsing Columns and Creating DateTime

The following code parses a text file, extracts column values, and creates a datetime column by combining a few of the columns. What can I do to improve this code? More specifically, writing float(eq_params[1]), etc. was time-consuming. Is there any way to more quickly parse the column data in Python, without using Pandas? Thanks.
#!/usr/bin/env python
import datetime

file = 'input.txt'
fopen = open(file)
data = fopen.readlines()
fout = open('output.txt', 'w')
for line in data:
    eq_params = line.split()
    lat = float(eq_params[1])
    lon = float(eq_params[2])
    dep = float(eq_params[3])
    yr = int(eq_params[10])
    mo = int(eq_params[11])
    day = int(eq_params[12])
    hr = int(eq_params[13])
    minute = int(eq_params[14])
    sec = int(float(eq_params[15]))
    dt = datetime.datetime(yr, mo, day)
    tm = datetime.time(hr, minute, sec)
    time = dt.combine(dt, tm)
    fout.write('%s %s %s %s\n' % (lat, lon, dep, time))
fout.close()
fopen.close()
You could reduce the amount of typing by using a combination of map, slicing and variable unpacking to convert the parsed parameters and assign them in a single line each:
lat, lon, dep = map(float, eq_params[1:4])
yr, mo, day, hr, minute = map(int, eq_params[10:15])
sec = int(float(eq_params[15]))
Using a list comprehension would work similarly:
lat, lon, dep = [float(x) for x in eq_params[1:4]]
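As a worked example of the unpacking, with a made-up input line laid out according to the question's column indices:

```python
import datetime

# One hypothetical whitespace-separated line; columns 1-3 are floats and
# columns 10-15 are the date/time fields (layout assumed from the question).
line = "id 12.5 -45.2 3.3 a b c d e f 1999 12 31 23 59 58.7"
eq_params = line.split()

lat, lon, dep = map(float, eq_params[1:4])
yr, mo, day, hr, minute = map(int, eq_params[10:15])
sec = int(float(eq_params[15]))  # seconds may carry a fraction

# combine the date and time parts into a single datetime
time = datetime.datetime.combine(datetime.datetime(yr, mo, day),
                                 datetime.time(hr, minute, sec))
# time is datetime.datetime(1999, 12, 31, 23, 59, 58)
```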

Add column from one .csv to another .csv file using Python

I am currently writing a script where I am creating a csv file ('tableau_input.csv') composed both of columns from other csv files and columns created by myself. I tried the following code:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS',
                                'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile:
            scores = csv.reader(readfile)
            scores.next()
            for score in scores:
                tableau_content = score[1::]
                # Append True_Result
                if int(tableau_content[3]) > int(tableau_content[1]):
                    tableau_content.append(1)
                else:
                    tableau_content.append(0)
                # Append 'Predicted_Result' and 'Confidence'
                prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                tableau_content += list(prediction_results)
                tableau_write.writerow(tableau_content)
        with open(datetime_filename, 'rb') as readfile2:
            days = csv.reader(readfile2)
            days.next()
            for day in days:
                tableau_write.writerow(day)
'tableau_input.csv' is the file I am creating. The columns 'Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS' come from 'game_data_filename'(e.g tableau_content = score[1::]). The columns 'True_Result', 'Predicted_Results', 'Confidence' are columns created in the first for loop.
So far, everything works but finally I tried to add to the 'Date' column data from the 'datetime_filename' using the same structure as above but when I open my 'tableau_input' file, there is no data in my 'Date' column. Can someone solve this problem?
For info, below are screenshots of csv files respectively for 'game_data_filename' and 'datetime_filename' (nb: datetime values are in datetime format)
It's hard to test this as I don't really know what the input should look like, but try something like this:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(
            ['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS',
             'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile, open(datetime_filename, 'rb') as readfile2:
            scoreReader = csv.reader(readfile)
            scores = [row for row in scoreReader]
            scores = scores[1::]
            daysReader = csv.reader(readfile2)
            days = [day for day in daysReader]
            days = days[1::]  # drop the header row here too
            if len(scores) != len(days):
                print("File lengths do not match")
            else:
                for i in range(len(days)):
                    tableau_content = scores[i][1::]
                    tableau_date = days[i]
                    # Append True_Result
                    if int(tableau_content[3]) > int(tableau_content[1]):
                        tableau_content.append(1)
                    else:
                        tableau_content.append(0)
                    # Append 'Predicted_Result' and 'Confidence'
                    prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                    tableau_content += list(prediction_results)
                    tableau_content += tableau_date
                    tableau_write.writerow(tableau_content)
This combines both of the file reading parts into one.
As per your questions below:
scoreReader = csv.reader(readfile)
scores = [row for row in scoreReader]
scores = scores[1::]
This uses a list comprehension to create a list called scores, with every element being one of the rows from scoreReader. As scoreReader is an iterator, every time we ask it for a row, it gives one out to us, until there are no more.
The second line, scores = scores[1::], just chops off the first element of the list, as you don't want the header.
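A minimal illustration of those two lines, with made-up rows standing in for the csv reader:

```python
# `rows` plays the role of csv.reader output: an iterator of row lists.
rows = iter([["header_a", "header_b"], ["a", "1"], ["b", "2"]])
scores = [row for row in rows]  # pull every row out of the iterator
scores = scores[1::]            # chop off the header row
# scores is [["a", "1"], ["b", "2"]]
```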
For more info try these:
Generators on Wiki
List Comprehensions
Good luck!

Remove specific rows from CSV file in python

I am trying to remove rows with a specific ID within particular dates from a large CSV file.
The CSV file contains a column [3] with dates formatted like "1962-05-23" and a column with identifiers [2]: "ddd:011232700:mpeg21:a00191".
Within the following date range:
01-01-1951 to 12-31-1951
07-01-1962 to 12-31-1962
01-01 to 09-30-1963
7-01 to 07-31-1965
10-01 to 10-31-1965
04-01-1966 to 11-30-1966
01-01-1969 to 12-31-1969
01-01-1970 to 12-31-1989
I want to remove rows that contain the ID ddd:11*
I think I have to create a variable that contains both the date range and the ID. And look for these in every row, but I'm very new to python so I'm not sure what would be an eloquent way to do this.
This is what I have now. -CODE UPDATED
import csv
import collections
import sys
import re
from datetime import datetime

csv.field_size_limit(sys.maxsize)

dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))

def datefilter(x):
    x = datetime.strptime(x,"%Y-%m-%d")
    for r in dateranges:
        if r[0]<=x and r[1]>=x: return True
    return False

writer = csv.writer(open('filtered.csv', 'wb'))
for row in csv.reader('my_file.csv', delimiter='\t'):
    if datefilter(row[3]):
        if not row[2].startswith("dd:111"):
            writer.writerow(row)
    else:
        writer.writerow(row)
writer.close()
I'd recommend using pandas: it's great for filtering tables. Nice and readable.
import pandas as pd

# assumes the csv contains a header, and the 2 columns of interest are labeled "mydate" and "identifier"
# Note that "date" is a pandas keyword so not wise to use for column names
df = pd.read_csv(inputFilename, parse_dates=[2])  # assumes mydate column is the 3rd column (0-based)
df = df[~df.identifier.str.contains('ddd:11')]  # filters out all rows with 'ddd:11' in the 'identifier' column
# then filter out anything not inside the specified date ranges:
df = df[((pd.to_datetime("1951-01-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1951-12-31"))) |
        ((pd.to_datetime("1962-07-01") <= df.mydate) & (df.mydate <= pd.to_datetime("1962-12-31")))]
df.to_csv(outputFilename)
See Pandas Boolean Indexing
Here is how I would approach that, but it may not be the best method.
from datetime import datetime

dateranges = [("01-01-1951","12-31-1951"),("07-01-1962","12-31-1962")]
dateranges = list(map(lambda dr: tuple(map(lambda x: datetime.strptime(x,"%m-%d-%Y"),dr)),dateranges))

def datefilter(x):
    # The date format is different here to match the format of the csv
    x = datetime.strptime(x,"%Y-%m-%d")
    for r in dateranges:
        if r[0]<=x and r[1]>=x: return True
    return False

with open(main_file, "rb") as fp:
    root = csv.reader(fp, delimiter='\t')
    result = collections.defaultdict(list)
    for row in root:
        if datefilter(row[3]):
            # use a regular expression or any other means to filter on id here
            if row[2].startswith("dd:111"):
                pass  # code to remove item
What I have done is create a list of tuples of your date ranges (for brevity, I only put 2 ranges in it), and then I convert those into datetime objects.
I have used maps for doing that in one line: first loop over all tuples in that list, applying a function which loops over all entries in that tuple and converts to a date time, using the tuple and list functions to get back to the original structure. Doing it the long way would look like:
dateranges2 = []
for dr in dateranges:
    dateranges2.append((datetime.strptime(dr[0],"%m-%d-%Y"), datetime.strptime(dr[1],"%m-%d-%Y")))
dateranges = dateranges2
Notice that I just convert each item in the tuple into a datetime, and add the tuples to the new list, replacing the original (which I don't need anymore).
Next, I create a datefilter function which takes a datestring, converts it to a datetime, and then loops over all the ranges, checking if the value is in the range. If it is, we return True (indicating this item should be filtered), otherwise return False if we have checking all ranges with no match (indicating that we don't filter this item).
Now you can check out the id using any method that you want once the date has matched, and remove the item if desired. As your example is constant in the first few characters, we can just use the string startswith function to check the id. If it is more complex, we could use a regex.
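For the sample identifier from the question, the prefix check behaves like this (the second ID is invented for contrast):

```python
# The question's sample ID starts with "ddd:0", so a "ddd:11" prefix test
# leaves it alone; an ID beginning "ddd:11" would be flagged for removal.
row_id = "ddd:011232700:mpeg21:a00191"       # from the question
keep = not row_id.startswith("ddd:11")       # True: this row survives
drop = "ddd:112005700:mpeg21:a00001".startswith("ddd:11")  # hypothetical ID
```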
My kind of approach works like this:
import csv
import re
import datetime

field_id = 'ddd:11'
d1 = datetime.date(1951, 1, 1)    # change the start date
d2 = datetime.date(1951, 12, 31)  # change the end date
diff = d2 - d1
date_list = []
for i in range(diff.days + 1):
    date_list.append((d1 + datetime.timedelta(i)).isoformat())
with open('mwevers_example_2016.01.02-07.25.55.csv', 'rb') as csv_file:
    reader = csv.reader(csv_file)
    for row in reader:
        for date in date_list:
            if row[3] == date:
                print row
                var = re.search('\\b' + field_id, row[2])
                if bool(var) == True:
                    print 'olalala'  # here you can make a function to copy those rows into another file or any list
import csv
import sys
import re
from datetime import datetime

csv.field_size_limit(sys.maxsize)

field_id = 'ddd:11'
dateranges = [("1951-01-01", "1951-12-31"),
              ("1962-07-01", "1962-12-31"),
              ("1963-01-01", "1963-09-30"),
              ("1965-07-01", "1965-07-31"),
              ("1965-10-01", "1965-10-31"),
              ("1966-04-01", "1966-11-30"),
              ("1969-01-01", "1989-12-31")]
dateranges = list(map(lambda dr:
                      tuple(map(lambda x:
                                datetime.strptime(x, "%Y-%m-%d"), dr)),
                      dateranges))

def datefilter(x):
    x = datetime.strptime(x, "%Y-%m-%d")
    for r in dateranges:
        if r[0] <= x and r[1] >= x:
            return True
    return False

output = []
with open('my_file.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t', quotechar='"')
    next(reader)
    for row in reader:
        if datefilter(row[4]):
            var = re.search('\\b' + field_id, row[3])
            if bool(var) == False:
                output.append(row)
        else:
            output.append(row)
with open('output.csv', 'w') as outputfile:
    writer = csv.writer(outputfile, delimiter='\t', quotechar='"')
    writer.writerows(output)

Iterating over a csv file

I have to produce a function that will return a tuple containing the maximum temperature in the country and a list of the cities in which that temperature has been recorded. The data was input as a CSV file in the format:
Month, Jan, Feb, March....etc.
So far I have managed to get the maximum temperature but when it tries to find the cities in which that temperature was found, it comes up as an empty list.
def hottest_city(csv_filename):
    import csv
    file = open(csv_filename)
    header = next(file)
    data = csv.reader(file)
    city_maxlist = []
    for line in data:
        maxtemp = 0.0
        for temp in line[1:]:
            if float(temp) >= maxtemp:
                maxtemp = float(temp)
        city_maxlist.append([maxtemp, line[0]])
    maxtempcity = 0.0
    for city in city_maxlist:
        if city[0] >= maxtempcity:
            maxtempcity = city[0]
    citylist = []
    for line in data:
        for temp in line:
            if temp == maxtempcity:
                citylist.append(line[0])
    return (maxtempcity, citylist)
Your main problem is that "data" is an iterator, so you can iterate over it only once (your second loop over data does not loop at all).
data = list(csv.reader(file))
should help. Then, as Hans Then noticed, you did not provide the code you execute, so I'll let you correct the tests in your second loop over data.
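The one-pass behaviour can be seen with a csv reader over an in-memory file:

```python
import csv
import io

# A csv.reader behaves the same way as `data` in the question:
# once the first pass consumes it, the second pass finds nothing.
reader = csv.reader(io.StringIO("a,1\nb,2\n"))
first_pass = [row for row in reader]   # consumes every row
second_pass = [row for row in reader]  # already exhausted -> empty
# first_pass is [['a', '1'], ['b', '2']]; second_pass is []
```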

Matching line number with string in a table.

I have a file with a list of columns describing particular parameters:
size magnitude luminosity
I need only particular data (in particular lines and columns) from this file. So far I have code in Python where I have appended the necessary line numbers. I just need to know how I can match them to get the right rows in the text file, along with just the variables in the columns (magnitude) and (luminosity). Any suggestions on how I could approach this?
Here is a sample of my code (#comments describe what I have done and what I want to do):
temp_ListMatch = (point[5]).strip()
if temp_ListMatch:
    ListMatchaddress = (point[5]).strip()
    ListMatchaddress = re.sub(r'\s', '_', ListMatchaddress)
    ListMatch_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + ListMatchaddress
    # print ListMatch_dirname + "\n"
    try:
        file5 = open(ListMatch_dirname, 'r')
    except IOError:
        print 'Cannot open: ' + ListMatch_dirname
    Optparline = []
    for line in file5:
        point5 = line.split()
        j = int(point5[1])
        Optparline.append(j)
        # Basically file5 contains the line numbers I need,
        # and I have appended these numbers to the variable j.

temp_others = (point[4]).strip()
if temp_others:
    othersaddress = (point[4]).strip()
    othersaddress = re.sub(r'\s', '_', othersaddress)
    othersbase_dirname = '/projects/XRB_Web/apmanuel/499/Lists/' + othersaddress
    try:
        file6 = open(othersbase_dirname, 'r')
    except IOError:
        print 'Cannot open: ' + othersbase_dirname
    gmag = []
    z = []
    rh = []
    gz = []
    for line in file6:
        point6 = line.split()
        f = float(point6[2])
        g = float(point6[4])
        h = float(point6[6])
        i = float(point6[9])
        # So now I have opened file 6 where this list of data is, and have
        # identified the columns of elements that I need.
        # I only need the particular rows (provided by line number)
        # with these elements chosen. That is where I'm stuck!
Load the whole data file into a pandas DataFrame (assuming that the data file has a header from which we can get the column names):
import pandas as pd
df = pd.read_csv('/path/to/file')
Load the file of line numbers into a pandas Series (assuming there's one per line):
# squeeze = True makes the function return a series
row_numbers = pd.read_csv('/path/to/rows_file', squeeze = True)
Return only those lines which are in the row number file, and the columns magnitude and luminosity (this assumes that the first row is numbered 0):
relevant_rows = df.ix[row_numbers][['magnitude', 'luminosity']]
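A self-contained sketch of that selection with toy data (the column names come from the question; .loc is shown because newer pandas versions have removed .ix):

```python
import pandas as pd

# Toy frame using the question's column names; row_numbers stands in for
# the line-number file. .loc is the modern replacement for .ix.
df = pd.DataFrame({"size": [1.0, 2.0, 3.0],
                   "magnitude": [10.0, 20.0, 30.0],
                   "luminosity": [0.1, 0.2, 0.3]})
row_numbers = pd.Series([0, 2])
relevant_rows = df.loc[row_numbers, ["magnitude", "luminosity"]]
# relevant_rows holds magnitude and luminosity for rows 0 and 2 only
```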
