Python: Reading files and editing its content - python

I got the following problem: I would like to read a data textfile which consists of two columns, year and temperature, and be able to calculate the minimum temperature etc. for each year. The whole file starts like this:
1995.0012 -1.34231
1995.3030 -3.52533
1995.4030 -7.54334
and so on, until year 2013. I had the following idea:
f=open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = float(columns[0])
    temperature=columns[1]
    if year-1995<1 and year-1995>0:
        print 1995, min(temperature)
With this I get only the year 1995 data which is what I want in a first step. In a second step I would like to calculate the minimal temperature of the whole dataset in year 1995. By using the script above, I however get the minimum temperature for every line in the datafile. I tried building a list and then appending the temperature but I run into trouble if I want to transform the year into an integer or the temperature into a float etc.
I feel like I am missing the right idea how to calculate the minimum value of a set of values in a column (but not of the whole column).
Any ideas how I could approach said problem? I am trying to learn Python but still at a beginners stage so if there is a way to do the whole thing without using "advanced" commands, I'd be ecstatic!

You could do this with a regular expression:
import re
from collections import defaultdict
REGEX = re.compile(r"(\d{4})\.\d+ ([0-9\-\.\+]+)")
f = open('munich_temperatures_average.txt', 'r')
data = defaultdict(list)
for line in f:
    # findall returns a list of (year, temperature) tuples; take the first match
    year, temperature = REGEX.findall(line)[0]
    temperature = float(temperature)
    data[year].append(temperature)
f.close()
print(min(data["1995"]))

You could use the csv module, which makes it easy to read and manipulate each row of your file:
import csv
with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        print("year", row[0], "temp", row[1])
Afterwards it is just a matter of finding the minimum temperature in the rows. See the csv module documentation.

If you just want the years and the temps:
years, temp = [], []
with open("f.txt") as f:
    for line in f:
        spl = line.rstrip().split()
        years.append(int(spl[0].split(".")[0]))
        temp.append(float(spl[1]))
print(years, temp)
[1995, 1995, 1995] [-1.34231, -3.52533, -7.54334]

I previously submitted another approach using the numpy library, which could be confusing given that you are new to Python. Sorry for that. As you mentioned yourself, you need to keep some kind of record for the year 1995, but you don't need a list for that:
mintemp1995 = None
f = open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = int(float(columns[0]))
    temp = float(columns[1])
    if year == 1995 and (mintemp1995 is None or temp < mintemp1995):
        mintemp1995 = temp
f.close()
print("1995:", mintemp1995)
Note the cast of the year to int, so you can compare it directly to 1995, and the condition after it: if the variable mintemp1995 has never been set (it is still None, which means this is the first 1995 entry in the dataset), or the current temperature is lower than the stored one, the current temperature replaces it. That way you keep a record of only the lowest temperature.
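To get the minimum for every year at once while staying with basic commands, one option is a plain dictionary keyed by the integer year. This is only a minimal Python 3 sketch: the function name min_temp_per_year and the inline sample lines are made up for illustration, and the sample stands in for the real data file.

```python
def min_temp_per_year(lines):
    """Return a dict mapping each integer year to its lowest temperature."""
    minima = {}
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines
        year = int(float(columns[0]))  # e.g. "1995.0012" becomes 1995
        temp = float(columns[1])
        # Keep the temperature if the year is new or the value is lower
        if year not in minima or temp < minima[year]:
            minima[year] = temp
    return minima


# Hypothetical sample standing in for munich_temperatures_average.txt
sample = [
    "1995.0012 -1.34231",
    "1995.3030 -3.52533",
    "1995.4030 -7.54334",
    "1996.0010 2.5",
]
result = min_temp_per_year(sample)
for year in sorted(result):
    print(year, result[year])
```

With the real file you would pass the open file object instead of the list, since iterating over a file also yields one line at a time.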

Related

How can I print a specific set of data in Python 3.9, from a column of a CSV file based on two sets of descriptors from two other columns?

I have an assignment to print specific data within a CSV file. The data to be printed are the registration numbers of vehicles caught by the camera at a location whose descriptor is stored in the variable search_Descriptor, during an hour specified in the variable search_HH.
The CSV file is called: Carscaught.csv
All the registration numbers of the vehicles are under the column titled: Plates.
The descriptors are locations where the vehicles were caught, under the column titled: Descriptor.
And the hour at which each vehicle was caught is under the column titled: HH.
This is the file, it's quite big so I have shared it from google drive:
https://drive.google.com/file/d/1zhIxg5s_nVGzk_5JUXkujbetSIuUcNRU/view?usp=sharing
This is an image of a few lines of the CSV file from the top, the actual data on the file fills 3170 rows and goes all the way from 0-23 hours on the "HH" column:
[Image: the first few rows of Carscaught.csv]
In my code I have defined two variables as I want to print only the registration plates of vehicles that were caught at the Location of "Kilburn Bldg" specifically at "17" hours:
search_Descriptor = "Kilburn Bldg"
search_HH = "17"
This is the code I have used, but have no clue how to go further by using the variables defined to print the specific data I need. And I HAVE to use those specific variables as they are shown, by the way.
search_Descriptor = "Kilburn Bldg"
search_HH = "17"
fo = open('Carscaught.csv', 'r')
counter = 0
line = fo.readline()
while line:
    print(line, end="")
    line = fo.readline()
    counter = counter + 1
fo.close()
All that code does is read the entire file, print it, and close it. I have no idea how to get the desired output, which should be these three specific registration numbers:
JOHNZS
KEENAS
KR8IVE
Hopefully you can help me with this. Thank you.
You possibly want to look at csv.DictReader:
import csv
with open('Carscaught.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        plate = row['Plates']
        hour = row['HH']
        minute = row['MM']
        descriptor = row['Descriptor']
        # Use == for an exact match; "in" would also accept substrings
        if descriptor == search_Descriptor:
            print("Found a row, now to check time")
Then you can use simple logic to search for the data you need.
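One way to finish that logic, as a rough sketch rather than the definitive solution, is to test both columns in the same condition. The helper name find_plates and the sample rows below are invented for illustration; only the column names Plates, Descriptor and HH come from the question. It also assumes HH is stored without leading zeros, so that comparing it as text against "17" works.

```python
import csv
import io

search_Descriptor = "Kilburn Bldg"
search_HH = "17"


def find_plates(csv_file, descriptor, hour):
    """Collect plates whose row matches both the descriptor and the hour."""
    plates = []
    for row in csv.DictReader(csv_file):
        # DictReader yields strings, so HH is compared as text
        if row["Descriptor"].strip() == descriptor and row["HH"].strip() == hour:
            plates.append(row["Plates"])
    return plates


# Made-up rows standing in for Carscaught.csv
sample = io.StringIO(
    "Plates,Descriptor,HH\n"
    "JOHNZS,Kilburn Bldg,17\n"
    "ABC123,Kilburn Bldg,16\n"
    "XYZ789,Oxford Rd,17\n"
)
matches = find_plates(sample, search_Descriptor, search_HH)
for plate in matches:
    print(plate)
```

For the real data, open Carscaught.csv with open() and pass the file object in place of the StringIO sample.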
Try:
import pandas as pd
search_Descriptor = "Kilburn Bldg"
search_HH = 17 # based on the csv that you have posted above, HH is int and not str so I removed the quotation marks
df = pd.read_csv("Carscaught.csv")
df2 = df[df["Descriptor"].eq(search_Descriptor) & df["HH"].eq(search_HH)]
df3 = df2["Plates"]
print(df3)
Output (the numbers 1636, 1648, and 1660 are their row numbers):
1636 JOHNZS
1648 KEENAS
1660 KR8IVE
If you don't have pandas yet, there are various tutorials on how to install and use it, depending on where you are writing your code.

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (a CSV file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value in the first column of the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I could use them.
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)
newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)
I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the third-column values in it:
import csv
with open("loanData.csv", "r") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Skips the header row; comment out if the file has no headers
    next(loanCsvReader, None)
    # Convert to float so the comparison is numeric, not alphabetical
    # (this assumes the third column holds plain numbers)
    loan_data = [float(row[2]) for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know the details of your file or whether it has a header row, but you can comment out
next(loanCsvReader, None)
if there are no headers present.
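Something like this might work. The index starts at zero, so the third column is index 2. Convert the values with float() (assuming the column is numeric) so they are not compared as strings, and avoid naming the results min and max, which would shadow the built-in functions. If mylist still contains a header line, skip it with mylist[1:].
min_val = min([float(row.split(',')[2]) for row in mylist])
max_val = max([float(row.split(',')[2]) for row in mylist])
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
    data = f.read()
mylist = data.splitlines()
splitlines() handles both \n and \r\n line endings and does not leave an empty string at the end, so it does not depend on the OS that produced the file.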
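Neither snippet above returns the first-column value that belongs to the max or min, which is what the question asks for. A possible standard-library sketch is to keep (first column, third column) pairs and let max() and min() compare on the numeric part via key=. The function name extremes_by_third_column and the sample data are assumptions for illustration; the real loan file's columns are not shown in the question. Note that comparing the raw strings would go wrong, because "900" sorts after "12000" alphabetically.

```python
import csv
import io


def extremes_by_third_column(csv_file, has_header=True):
    """Return ((first_col, max_value), (first_col, min_value)) for column 3."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row
    pairs = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or short rows
        pairs.append((row[0], float(row[2])))
    highest = max(pairs, key=lambda pair: pair[1])
    lowest = min(pairs, key=lambda pair: pair[1])
    return highest, lowest


# Made-up sample; the real loanData.csv layout is an assumption here
sample = io.StringIO(
    "id,term,amount\n"
    "A1,36,5000\n"
    "B2,60,12000\n"
    "C3,36,900\n"
)
highest, lowest = extremes_by_third_column(sample)
print("Max:", highest[1], "in row", highest[0])
print("Min:", lowest[1], "in row", lowest[0])
```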

Breaking up columns in a simple csv file

My csv file looks like this:
Test Number,Score
1,100 2,40 3,80 4,90.
I have been trying to figure out how to write code that ignores the header and the first column and focuses on the scores, because the assignment was to find the average of the test scores and print it out as a float (for those particular numbers the output should be 77.5). I've looked online and found pieces that I think would work, but I'm getting errors every time. We're learning about read, readlines, split, rstrip and \n, if that helps! I'm sure the answer is so simple, but I'm new to coding and I have no idea what I'm doing. Thank you!
def calculateTestAverage(fileName):
    myFile = open(fileName, "r")
    column = myFile.readline().rstrip("\n")
    for column in myFile:
        scoreColumn = column.split(",")
        (scoreColumn[1])
This is my code so far my professor wanted us to define a function and go from there using the stuff we learned in lecture. I'm stuck because it's printing out all the scores I need on separate returned lines, yet I am not able to sum those without getting an error. Thanks for all your help, I don't think I would be able to use any of the suggestions because we never went over them. If anyone has an idea of how to take those test scores that printed out vertically as a column and sum them that would help me a ton!
You can use the csv library. This code should do the work:
import csv
reader = csv.reader(open('csvfile.txt', 'r'), delimiter=' ')
next(reader)  # this line lets you skip the header line
for row_number, row in enumerate(reader, start=1):
    total_score = 0
    for element in row:
        test_number, score = element.split(',')
        total_score += float(score)  # score is read as a string, so convert it
    average_score = total_score / float(len(row))
    print("Average score for row #%d is: %.1f" % (row_number, average_score))
The output should look like this:
Average score for row #1 is: 77.5
I always approach this with a pandas DataFrame, specifically the read_csv() function. You don't need to skip the header yourself; just state that it is in row 0 (for example), and do the same for the row labels.
So for example:
import pandas as pd
import numpy as np
df = pd.read_csv("filename", header=0, index_col=0)
scores = df.values
print(np.average(scores))
I will break it down for you.
Since you're dealing with .csv files, I recommend using the csv library. You can import it with:
import csv
Now we need to open() the file. One common way is to use with:
with open('test.csv') as file:
Which is a context manager that avoids having to close the file at the end. The other option is to open and close normally:
file = open('test.csv')
# Do your stuff here
file.close()
Now you need to wrap the opened file with csv.reader(), which allows you to read .csv files and do things with them:
csv_reader = csv.reader(file)
To skip the headers, you can use next():
next(csv_reader)
Now for the average calculation part. One simple way is to keep two variables, score_sum and total: add each score to score_sum and add 1 to total for every row. Here is an example snippet:
score_sum = 0
total = 0
for number, score in csv_reader:
    score_sum += int(score)
    total += 1
Here's how to do it with indexing instead:
score_sum = 0
total = 0
for line in csv_reader:
    score_sum += int(line[1])
    total += 1
Now that we have our score sum and total calculated, getting the average is simply:
score_sum / total
The above code combined will then result in an average of 77.5.
Of course, this all assumes that your .csv file is actually in this format:
Test Number,Score
1,100
2,40
3,80
4,90
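If the csv module has not been covered in your lectures, the same steps can be done with just iteration, rstrip and split. The following is a small sketch under the assumption that the file has one "number,score" pair per line as shown above; the name average_score_from_lines is hypothetical, and it takes a sequence of lines (an open file works too) rather than a file name so it can be tried without a file on disk.

```python
def average_score_from_lines(lines):
    """Average the second column, skipping the header line."""
    score_sum = 0.0
    total = 0
    is_header = True
    for line in lines:
        if is_header:
            is_header = False  # the first line is "Test Number,Score"
            continue
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines at the end of the file
        parts = line.split(",")
        score_sum += float(parts[1])  # convert the text to a number before adding
        total += 1
    if total == 0:
        return 0.0  # avoid dividing by zero when there are no scores
    return score_sum / total


sample = ["Test Number,Score\n", "1,100\n", "2,40\n", "3,80\n", "4,90\n"]
average = average_score_from_lines(sample)
print(average)
```

The error when summing usually comes from adding strings to a number; the float() call is what makes the addition work.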

Import CSV and create one list for each column in Python

I am processing a CSV file in python thats delimited by a comma (,).
Each column is a sampled parameter, for instance column 0 is time, sampled at once a second, column 1 is altitude sampled at 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has substantial number of rows)
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv
time = []
alt = []
dct = {}
with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new in python. Is this a good way to tackle this, if not what is better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame with one column per CSV column. It's roughly a dictionary of lists.
You can also use the .tolist() method of a pandas column to convert it to a list if you want a list specifically.
import pandas as pd
data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension is more compact and might run faster.
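If you would rather not depend on pandas, a similar dictionary of lists can be built with the csv module alone, for any number of columns. This is a sketch under assumptions: the function name columns_as_lists and the sample content are made up, values stay as strings (empty cells become ""), and duplicate column names would overwrite each other.

```python
import csv
import io


def columns_as_lists(csv_file):
    """Return {column name: list of that column's values as strings}."""
    reader = csv.reader(csv_file)
    header = next(reader, None)  # first row holds the column names
    columns = {}
    if header is None:
        return columns  # empty file
    for name in header:
        columns[name] = []
    for row in reader:
        # zip stops at the shorter of header and row, so short rows are tolerated
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Made-up sample: time sampled once per five altitude samples
sample = io.StringIO("Time,Altitude\n0,100\n,200\n,300\n,400\n1,500\n")
cols = columns_as_lists(sample)
print(cols["Time"])
print(cols["Altitude"])
```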
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficient. I used A LOT of searching on this site to help build this. Again I am new at Python (about 2-3 weeks but some former programming experience)
import csv
header = []
# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0
with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A
for x in range(0, len(header)-1):  # go through entire column
    if header[x].isdigit() and header[x+1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x+1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x+1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i/float((loc_fin+1)-loc_int)
            for y in range(loc_int, loc_fin+1):
                header[y] = temp_i + interp*(y-loc_int)
print(header)
with open("output.csv", "w", newline="") as g:  # write to new file (text mode for Python 3)
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header" with its interpolated values and replace it with column A of my old file , test2.csv.
Anywho thank you very much for looking...

select certain dates inside loop for .csv file

Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590101,0200,4,SAO,068,1,N,1.0,1,
OXNARD,723927,93110,19590101,0300,4,SAO,068,1,N,2.1,1,
OXNARD,723927,93110,19590101,0400,4,SAO,315,1,N,1.0,1,
OXNARD,723927,93110,19590101,0500,4,SAO,999,1,C,0.0,1,
....
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
OXNARD,723927,93110,19590102,0100,4,SAO,248,1,N,2.1,1,
OXNARD,723927,93110,19590102,0200,4,SAO,999,1,C,0.0,1,
OXNARD,723927,93110,19590102,0300,4,SAO,068,1,N,2.1,1,
Here is a snippet of a csv file storing hourly wind speeds (Spd) in each row. What I'd like to do is select all hourly winds for each day in the csv file and store them into a temporary daily list storing all of that day's hourly values (24 if no missing values). Then I'll output the current day's list, create new empty list for the next day, locate hourly speeds in the next day, output that daily list, and so forth until the end of the file.
I'm struggling to find a good method to do this. One thought I have is to read in line i, determine the date (YYYY-MM-DD), then read in line i+1 and see if that date matches date i. If they match, then we're in the same day. If they don't, then we are onto the next day. But I can't even figure out how to read in the next line in the file...
Any suggestions to execute this method or a completely new (and better?!) method are most welcome. Thank you in advance!
obs_in = open(csv_file).readlines()
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,i,type,dir,q,i2,spd,q2,blank = obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # Read in next line's date: is it in the same day?
        # If in the same day, then append spd into tmp daily list
        # If not, then start a new list for the next day
You can take advantage of the well-ordered nature of the data file and use csv.DictReader. Then you can build up a dictionary of the wind speeds organized by date quite simply, which you can process as you like. Note that the csv reader returns strings, so you might want to convert to other types as appropriate while you assemble the list.
import csv
from collections import defaultdict
bydate = defaultdict(list)
rdr = csv.DictReader(open('winds.csv','rt'))
for k in rdr:
    bydate[k['Date']].append(float(k['Spd']))
print(bydate)
defaultdict(<type 'list'>, {'19590101': [3.1000000000000001, 1.0, 1.0, 2.1000000000000001, 1.0, 0.0], '19590102': [2.1000000000000001, 2.1000000000000001, 0.0, 2.1000000000000001]})
You can obviously change the argument to the append call to a tuple, for instance append((float(k['Spd']), datetime.datetime.strptime(k['Date']+k['HrMn'], '%Y%m%d%H%M'))), so that you can also collect the times.
If the file has extraneous spaces, you can use the skipinitialspace parameter: rdr = csv.DictReader(open('winds.csv','rt'), skipinitialspace=True). If this still doesn't work, you can pre-process the header line:
bydate = defaultdict(list)
with open('winds.csv', 'rt') as f:
    # Read the header line ourselves and strip stray whitespace from each name
    fieldnames = [k.strip() for k in f.readline().split(',')]
    rdr = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)
    for k in rdr:
        bydate[k['Date']].append(k['Spd'])
bydate is accessed like a regular dictionary. To access a specific day's data, do bydate['19590101']. To get the list of dates that were processed, you can do bydate.keys().
If you want to convert them to Python datetime objects at the time of reading the file, you can import datetime, then replace the assignment line with bydate[datetime.datetime.strptime(k['Date'], '%Y%m%d')].append(k['Spd']).
It can be something like this.
import datetime
def dump(buf, date):
    """dumps buffered line into file 'spdYYYYMMDD.csv'"""
    if len(buf) == 0: return
    with open('spd%s.csv' % date, 'w') as f:
        for line in buf:
            f.write(line)
obs_in = open(csv_file).readlines()
# buf stores one day record
buf = []
# date0 is meant for time stamp for the buffer
date0 = None
for i in range(1,len(obs_in)):
    # Skip over the header lines
    if not str(obs_in[i]).startswith("Identification") and \
            not str(obs_in[i]).startswith("Name"):
        name,usaf,ncdc,date,hrmn,ii,type,dir,q,i2,spd,q2,blank = \
            obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]),int(date[4:6]),int(date[6:8]))
        current_spd = spd
        # see if the time stamp of current record is different. if it is different
        # dump the buffer, and also set the time stamp of buffer
        if date != date0:
            dump(buf, date0)
            buf = []
            date0 = date
        # you change this. i am simply writing entire line
        buf.append(obs_in[i])
# when you get out the buffer should be filled with the last day's record.
# so flush that too.
dump(buf, date0)
I also found that I have to use ii instead of i for the field "I" of the data, since you used i as the loop counter.
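Because the rows for one day are consecutive, the same "start a new list when the date changes" idea can also be written with itertools.groupby from the standard library, which handles the comparison with the previous line for you. This is a sketch, not a drop-in replacement: the function name daily_speeds is made up, the column positions (Date at index 3, Spd at index 10) are read off the header in the question, and the sample holds only three of the rows.

```python
import csv
import io
from itertools import groupby

DATE_COLUMN = 3   # position of "Date" in the header shown in the question
SPD_COLUMN = 10   # position of "Spd" in the same header


def daily_speeds(csv_file):
    """Return a list of (date string, [hourly speeds]) in file order."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the single header line
    # Ignore short or filler lines that do not reach the Spd column
    data_rows = (row for row in reader if len(row) > SPD_COLUMN)
    days = []
    # groupby starts a new group every time the date value changes
    for date, rows in groupby(data_rows, key=lambda row: row[DATE_COLUMN]):
        days.append((date, [float(row[SPD_COLUMN]) for row in rows]))
    return days


sample = io.StringIO(
    "Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q\n"
    "OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,\n"
    "OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,\n"
    "OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,\n"
)
days = daily_speeds(sample)
for date, speeds in days:
    print(date, speeds)
```

groupby only merges neighbouring rows, so this relies on the file being sorted by date, as it is in the snippet from the question.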
I know this question is from years ago but just wanted to point out that a small bash script can neatly perform this task. I copied your example into a file called data.txt and this is the script:
#!/bin/bash
date=19590101
date_end=19590102
while [[ $date -le $date_end ]] ; do
    grep ",${date}," data.txt > file_${date}.txt
    date=`date +%Y%m%d -d ${date}+1day` # NOTE: macOS date differs
done
Note that this won't work on macOS, where the date command is a different implementation, so on a Mac you either need to use gdate (from coreutils) or change the options to match those of the macOS date command.
If there are dates missing from the file, the grep command produces an empty file. See "how to stop grep creating empty file if no results" for ways to avoid this.
