Name,USAF,NCDC,Date,HrMn,I,Type,Dir,Q,I,Spd,Q
OXNARD,723927,93110,19590101,0000,4,SAO,270,1,N,3.1,1,
OXNARD,723927,93110,19590101,0100,4,SAO,338,1,N,1.0,1,
OXNARD,723927,93110,19590101,0200,4,SAO,068,1,N,1.0,1,
OXNARD,723927,93110,19590101,0300,4,SAO,068,1,N,2.1,1,
OXNARD,723927,93110,19590101,0400,4,SAO,315,1,N,1.0,1,
OXNARD,723927,93110,19590101,0500,4,SAO,999,1,C,0.0,1,
....
OXNARD,723927,93110,19590102,0000,4,SAO,225,1,N,2.1,1,
OXNARD,723927,93110,19590102,0100,4,SAO,248,1,N,2.1,1,
OXNARD,723927,93110,19590102,0200,4,SAO,999,1,C,0.0,1,
OXNARD,723927,93110,19590102,0300,4,SAO,068,1,N,2.1,1,
Here is a snippet of a CSV file storing hourly wind speeds (Spd), one per row. What I'd like to do is select all hourly winds for each day in the CSV file and store them in a temporary daily list holding that day's hourly values (24 if none are missing). Then I'll output the current day's list, create a new empty list for the next day, collect that day's hourly speeds, output that daily list, and so forth until the end of the file.
I'm struggling to find a good method to do this. One thought I have is to read in line i, determine the date (YYYY-MM-DD), then read in line i+1 and see if that date matches date i. If they match, then we're in the same day. If they don't, then we are onto the next day. But I can't even figure out how to read in the next line in the file...
Any suggestions to execute this method, or a completely new (and better?!) method, are most welcome. Thank you in advance!
import datetime

obs_in = open(csv_file).readlines()
for i in range(1, len(obs_in)):
    # Skip over the header lines
    if not obs_in[i].startswith("Identification") and not obs_in[i].startswith("Name"):
        name,usaf,ncdc,date,hrmn,i,type,dir,q,i2,spd,q2,blank = obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]), int(date[4:6]), int(date[6:8]))
        current_spd = spd
        # Read in the next line's date: is it in the same day?
        # If in the same day, then append spd to a temporary daily list
        # If not, then start a new list for the next day
You can take advantage of the well-ordered nature of the data file and use csv.DictReader. Then you can build up a dictionary of the wind speeds organized by date quite simply, which you can process as you like. Note that the csv reader returns strings, so you might want to convert to other types as appropriate while you assemble the list.
import csv
from collections import defaultdict

bydate = defaultdict(list)
rdr = csv.DictReader(open('winds.csv', 'rt'))
for k in rdr:
    bydate[k['Date']].append(float(k['Spd']))

print(bydate)
defaultdict(<class 'list'>, {'19590101': [3.1, 1.0, 1.0, 2.1, 1.0, 0.0], '19590102': [2.1, 2.1, 0.0, 2.1]})
You can obviously change the argument to the append call to a tuple, for instance append((float(k['Spd']), datetime.datetime.strptime(k['Date']+k['HrMn'], '%Y%m%d%H%M'))), so that you can also collect the times.
If the file has extraneous spaces, you can use the skipinitialspace parameter: rdr = csv.DictReader(open('winds.csv','rt'), skipinitialspace=True). If this still doesn't work, you can pre-process the header line:
bydate = defaultdict(list)
with open('winds.csv', 'rt') as f:
    fieldnames = [k.strip() for k in f.readline().split(',')]
    rdr = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)
    for k in rdr:
        bydate[k['Date']].append(k['Spd'])
bydate is accessed like a regular dictionary. To access a specific day's data, do bydate['19590101']. To get the list of dates that were processed, you can do bydate.keys().
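To illustrate, here is a minimal, self-contained sketch of that access pattern, with a hand-built stand-in for bydate (the file itself isn't needed for this):

```python
from collections import defaultdict

# Hand-built stand-in for the 'bydate' mapping produced above
bydate = defaultdict(list)
for date, spd in [('19590101', 3.1), ('19590101', 1.0), ('19590102', 2.1)]:
    bydate[date].append(spd)

day = bydate['19590101']       # that day's speeds
dates = sorted(bydate.keys())  # all dates seen in the file

print(day)    # → [3.1, 1.0]
print(dates)  # → ['19590101', '19590102']
```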
If you want to convert them to Python datetime objects at the time of reading the file, you can import datetime, then replace the assignment line with bydate[datetime.datetime.strptime(k['Date'], '%Y%m%d')].append(k['Spd']).
It can be something like this.
import datetime

def dump(buf, date):
    """Dump buffered lines into file 'spdYYYYMMDD.csv'."""
    if len(buf) == 0:
        return
    with open('spd%s.csv' % date, 'w') as f:
        for line in buf:
            f.write(line)

obs_in = open(csv_file).readlines()
# buf stores one day's records
buf = []
# date0 is the time stamp for the buffer
date0 = None
for i in range(1, len(obs_in)):
    # Skip over the header lines
    if not obs_in[i].startswith("Identification") and \
       not obs_in[i].startswith("Name"):
        name,usaf,ncdc,date,hrmn,ii,type,dir,q,i2,spd,q2,blank = \
            obs_in[i].split(',')
        current_dt = datetime.date(int(date[0:4]), int(date[4:6]), int(date[6:8]))
        current_spd = spd
        # If the time stamp of the current record differs, dump the buffer
        # and reset the buffer's time stamp
        if date != date0:
            dump(buf, date0)
            buf = []
            date0 = date
        # you may change this; I am simply writing the entire line
        buf.append(obs_in[i])

# On exit the buffer holds the last day's records, so flush that too.
dump(buf, date0)
I also found that I had to use ii instead of i for the field "I" of the data, as you used i for the loop counter.
I know this question is from years ago but just wanted to point out that a small bash script can neatly perform this task. I copied your example into a file called data.txt and this is the script:
#!/bin/bash
date=19590101
date_end=19590102
while [[ $date -le $date_end ]] ; do
    grep ",${date}," data.txt > file_${date}.txt
    date=`date +%Y%m%d -d ${date}+1day` # NOTE: macOS date differs
done
Note that this won't work on macOS, where the date command implementation differs, so you either need to use gdate (from coreutils) or change the options to match those for date on macOS.
If there are dates missing from the file the grep command produces an empty file - this link shows ways to avoid this:
how to stop grep creating empty file if no results
I have an assignment to print specific data within a CSV file. The data to be printed are the registration numbers of vehicles caught by the camera at a location whose descriptor is stored in the variable search_Descriptor, during an hour specified in the variable search_HH.
The CSV file is called: Carscaught.csv
All the registration numbers of the vehicles are under the column titled: Plates.
The descriptors are locations where the vehicles were caught, under the column titled: Descriptor.
And the hours of when each vehicles were caught are under the column titled: HH.
This is the file, it's quite big so I have shared it from google drive:
https://drive.google.com/file/d/1zhIxg5s_nVGzk_5JUXkujbetSIuUcNRU/view?usp=sharing
This is an image of a few lines of the CSV file from the top, the actual data on the file fills 3170 rows and goes all the way from 0-23 hours on the "HH" column:
In my code I have defined two variables as I want to print only the registration plates of vehicles that were caught at the Location of "Kilburn Bldg" specifically at "17" hours:
search_Descriptor = "Kilburn Bldg"
search_HH = "17"
This is the code I have used, but have no clue how to go further by using the variables defined to print the specific data I need. And I HAVE to use those specific variables as they are shown, by the way.
search_Descriptor = "Kilburn Bldg"
search_HH = "17"

fo = open('Carscaught.csv', 'r')
counter = 0
line = fo.readline()
while line:
    print(line, end="")
    line = fo.readline()
    counter = counter + 1
fo.close()
All that code does is read the entire file and close it. I have no idea how to get the desired output, which should be these three specific registration numbers:
JOHNZS
KEENAS
KR8IVE
Hopefully you can help me with this. Thank you.
You possibly want to look at DictReader:
import csv

with open('Carscaught.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        plate = row['Plates']
        hour = row['HH']
        minute = row['MM']
        descriptor = row['Descriptor']
        if descriptor in search_Descriptor:
            print("Found a row, now to check time")
Then you can use simple logic to search for the data you need.
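For instance, a minimal sketch of that logic; the inline sample rows here are made up, since only the column names Plates, Descriptor, HH, and MM are known from the question:

```python
import csv
import io

# Tiny made-up stand-in for Carscaught.csv with the question's column names
sample = io.StringIO(
    "Plates,Descriptor,HH,MM\n"
    "JOHNZS,Kilburn Bldg,17,05\n"
    "ABC123,Other Place,17,10\n"
    "KEENAS,Kilburn Bldg,17,20\n"
    "XYZ789,Kilburn Bldg,09,30\n"
)

search_Descriptor = "Kilburn Bldg"
search_HH = "17"

matches = []
for row in csv.DictReader(sample):
    # Both conditions must hold: right location and right hour
    if row['Descriptor'] == search_Descriptor and row['HH'] == search_HH:
        matches.append(row['Plates'])

print(matches)  # → ['JOHNZS', 'KEENAS']
```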
Try:
import pandas as pd
search_Descriptor = "Kilburn Bldg"
search_HH = 17 # based on the csv that you have posted above, HH is int and not str so I removed the quotation marks
df = pd.read_csv("Carscaught.csv")
df2 = df[df["Descriptor"].eq(search_Descriptor) & df["HH"].eq(search_HH)]
df3 = df2["Plates"]
print(df3)
Output (the numbers 1636, 1648, and 1660 are their row numbers):
1636 JOHNZS
1648 KEENAS
1660 KR8IVE
If you don't have pandas yet, there are different tutorials on how to download/use it depending on where are you writing your code.
I have generated a CSV file in the format shown in the image below.
In this image, the data is mostly arranged week-wise, but in some places I couldn't arrange it week-wise. If you look at the image, you will see a red mark and a blue mark. I want to separate the rows at both marks. How do I do that?
Note: if there is a holiday on Friday, then the week should run from Monday to Thursday.
Currently, I'm using the logic below.
Image: please click here to see the image
Current logic:
import csv

blank_fields = [' ']
fields = [' ', 'Weekly Avg']

# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []

# Read data row by row and store into new list
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)
file1.close()
Overall you are going in the right direction, but your condition is a little too error-prone; things can easily go wrong (e.g., only one day of a week appears in your list), so testing for the weekday string isn't the best choice here.
I would suggest "understanding" the date/time in your table and solving this using weekdays, like this:
from datetime import datetime as dt, timedelta as td

# remember the last date seen
last = None
# each item in 'weeks' represents one week; _cur_week is a temp var
weeks = []
_cur_week = []

for row in reader:
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # now if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of more than 7 days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for the next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant: append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date
Pretty verbose and extensive, but I hope it conveys the pattern used here:
parse the existing date and use the weekday to distinguish one week from another (the weekday increases monotonically within a week, so any decrease, or equality, tells you the current date belongs to the next week)
store rows in a temporary list during one week
append _cur_week to weeks once the condition for the next week is triggered
empty _cur_week for the next rows, i.e., the next week
Finally the last thing to do is to "concat" the data e.g. like this:
new_rows_list = [[fields] + week for week in weeks]
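A minimal, self-contained demo of this weekday-decrease pattern, using made-up dates (the date is assumed to sit in row[1], as above):

```python
from datetime import datetime as dt

# Made-up rows; 2018-01-01 was a Monday
rows = [
    ["r1", "2018-01-01"],  # Monday
    ["r2", "2018-01-03"],  # Wednesday
    ["r3", "2018-01-05"],  # Friday
    ["r4", "2018-01-08"],  # Monday again -> weekday decreased, new week
    ["r5", "2018-01-09"],  # Tuesday
]

last = None
weeks = []
_cur_week = []
for row in rows:
    d = dt(*map(int, row[1].split("-")))
    # weekday() drops (or repeats) when a new week starts
    if last and d.weekday() <= last.weekday():
        weeks.append(_cur_week)
        _cur_week = []
    _cur_week.append(row)
    last = d
weeks.append(_cur_week)  # flush the final week

print(len(weeks))  # → 2
```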
I have another approach for the same thing; it worked successfully and is an easy solution.
import csv
import datetime

fields = {'': ' ', 'Date': 'Weekly Avg'}

# Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None

# Read data row by row and store into new list
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)
file1.close()
I have one CSV file, and I want to extract the first column of it. My CSV file is like this:
Device ID;SysName;Entry address(es);IPv4 address;Platform;Interface;Port ID (outgoing port);Holdtime
PE1-PCS-RANCAGUA;;;192.168.203.153;cisco CISCO7606 Capabilities Router Switch IGMP;TenGigE0/5/0/1;TenGigabitEthernet3/3;128 sec
P2-CORE-VALPO.cisco.com;P2-CORE-VALPO.cisco.com;;200.72.146.220;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/5/0/4;128 sec
PE2-CONCE;;;172.31.232.42;Cisco 7204VXR Capabilities Router;GigabitEthernet0/0/0/14;GigabitEthernet0/3;153 sec
P1-CORE-CRS-CNT.entel.cl;P1-CORE-CRS-CNT.entel.cl;;200.72.146.49;cisco CRS Capabilities Router;TenGigE0/5/0/0;TenGigE0/1/0/6;164 sec
For that purpose I use the following code that I saw here:
import csv

makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    # next(reader)  # Ignore first row
    for row in reader:
        makes.append(row[0])

print makes
Then I want to replace into a textfile a particular value for each one of the values of the first column and save it as a new file.
Original textfile:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE '%P1-CORE-CHILLAN%';
Expected output:
PLANNED.IMPACTO_ID = IMPACTO.ID AND
PLANNED.ESTADOS_ID = ESTADOS_PLANNED.ID AND
TP_CLASIFICACION.ID = TP_DATA.ID_TP_CLASIFICACION AND
TP_DATA.PLANNED_ID = PLANNED.ID AND
PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND
PLANNED.DESCRIPCION LIKE 'FIRST_COLUMN_VALUE';
And so on for every value in the first column, and save it as a separate file.
How can I do this? Thank you very much for your help.
You could just read the file, apply changes, and write the file back again. There is no efficient way to edit a file (inserting characters is not efficiently possible), you can only rewrite it.
If your file is going to be big, you should not keep the whole table in memory.
import csv

makes = []
with open('csvoutput/topologia.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        makes.append(row)

# Apply changes in makes

with open('csvoutput/topologia.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(makes)
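For the question's actual goal (substituting each first-column value into the SQL template and saving one file per value), the "Apply changes" step could be sketched like this; the inline stand-in rows, template string, and output filenames are assumptions, not from the answer:

```python
# Stand-in for the rows collected by csv.reader above
rows = [["PE1-PCS-RANCAGUA"], ["PE2-CONCE"]]

# The last lines of the question's SQL template
template = (
    "PLANNED.FECHA_FIN >= CURDATE() - INTERVAL 1 DAY AND\n"
    "PLANNED.DESCRIPCION LIKE '%P1-CORE-CHILLAN%';\n"
)

for row in rows:
    device = row[0]
    # Swap the hard-coded value for this row's first-column value
    out = template.replace("%P1-CORE-CHILLAN%", device)
    # Write one file per first-column value (hypothetical naming scheme)
    with open('query_%s.txt' % device, 'w') as f:
        f.write(out)
```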
I got the following problem: I would like to read a data textfile which consists of two columns, year and temperature, and be able to calculate the minimum temperature etc. for each year. The whole file starts like this:
1995.0012 -1.34231
1995.3030 -3.52533
1995.4030 -7.54334
and so on, until year 2013. I had the following idea:
f = open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = float(columns[0])
    temperature = columns[1]
    if year - 1995 < 1 and year - 1995 > 0:
        print 1995, min(temperature)
With this I get only the year 1995 data which is what I want in a first step. In a second step I would like to calculate the minimal temperature of the whole dataset in year 1995. By using the script above, I however get the minimum temperature for every line in the datafile. I tried building a list and then appending the temperature but I run into trouble if I want to transform the year into an integer or the temperature into a float etc.
I feel like I am missing the right idea how to calculate the minimum value of a set of values in a column (but not of the whole column).
Any ideas how I could approach said problem? I am trying to learn Python but still at a beginners stage so if there is a way to do the whole thing without using "advanced" commands, I'd be ecstatic!
I could do this using a regexp:
import re
from collections import defaultdict

REGEX = re.compile(ur"(\d{4})\.\d+ ([0-9\-\.\+]+)")

f = open('munich_temperatures_average.txt', 'r')
data = defaultdict(list)
for line in f:
    year, temperature = REGEX.findall(line)[0]
    temperature = float(temperature)
    data[year].append(temperature)

print min(data["1995"])
You could use the csv module which would make it very easy to read and manipulate each row of your file:
import csv

with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        print "year", row[0], "temp", row[1]
Afterwards it is just a matter of finding the min temperature in the rows. See
csv module documentation
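Continuing that idea, a sketch of collecting the 1995 minimum while reading, using an inline stand-in for the file so the example is self-contained:

```python
import csv
import io

# Inline stand-in for munich_temperatures_average.txt
data = io.StringIO(
    "1995.0012 -1.34231\n"
    "1995.3030 -3.52533\n"
    "1995.4030 -7.54334\n"
)

temps_1995 = []
for row in csv.reader(data, delimiter=' '):
    year = int(float(row[0]))  # 1995.0012 -> 1995
    if year == 1995:
        temps_1995.append(float(row[1]))

print(min(temps_1995))  # → -7.54334
```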
If you just want the years and the temps:
years, temp = [], []
with open("f.txt") as f:
    for line in f:
        spl = line.rstrip().split()
        years.append(int(spl[0].split(".")[0]))
        temp.append(float(spl[1]))

print years, temp
[1995, 1995, 1995] [-1.34231, -3.52533, -7.54334]
I previously submitted another approach using the numpy library, which could be confusing considering that you are new to Python. Sorry for that. As you mentioned yourself, you need some kind of record of the year 1995, but you don't need a list for that:
mintemp1995 = None
for line in f:
    line = line.strip()
    columns = line.split()
    year = int(float(columns[0]))
    temp = float(columns[1])
    if year == 1995 and (mintemp1995 is None or temp < mintemp1995):
        mintemp1995 = temp

print "1995:", mintemp1995
Note the cast of the year to int, so you can compare it directly to 1995, and the condition after it:
If the variable mintemp1995 has never been set before (it is None, and therefore this is the first 1995 entry of the dataset), or the current temperature is lower than it, it is replaced, so you keep a record of only the lowest temperature.
First off, I apologize for the terrible title; I didn't know how to summarize my problem. Okay, so here are the first few lines of my .csv file. The first column is the timestamp. The program I'm getting this data from samples 24 times per second, so there are 24 rows that start with 15:40:15, 24 that start with 15:40:16, and so on. Instead of 24 rows with the same timestamp, I want the timestamp to increase in increments of 1/24 seconds, or .042 seconds. So 15:40:15.042, 15:40:15.084, etc.
Another problem is that there aren't 24 rows for the first second, because we start in the middle of the second. For example, there are only 13 15:40:14 rows. For those it would preferably count backwards from 15:40:15.000 and subtract .042 seconds for every row.
How can I do this in Python? Thanks in advance!
CPUtime,Displacement Into Surface,Load On Sample,Time On Sample,Raw Load,Raw Displacement
15:40:14,-990.210561,-0.000025,1.7977E+308,-115.999137,-989.210000
15:40:14,-989.810561,-0.000025,1.7977E+308,-115.999105,-988.810000
15:40:14,-989.410561,-0.000025,1.7977E+308,-115.999073,-988.410000
15:40:14,-989.010561,-0.000025,1.7977E+308,-115.999041,-988.010000
15:40:14,-988.590561,-0.000025,1.7977E+308,-115.999007,-987.590000
15:40:14,-988.170561,-0.000025,1.7977E+308,-115.998974,-987.170000
15:40:14,-987.770561,-0.000025,1.7977E+308,-115.998942,-986.770000
15:40:14,-987.310561,-0.000025,1.7977E+308,-115.998905,-986.310000
15:40:14,-986.870561,-0.000025,1.7977E+308,-115.998870,-985.870000
15:40:14,-986.430561,-0.000025,1.7977E+308,-115.998834,-985.430000
15:40:14,-985.990561,-0.000025,1.7977E+308,-115.998799,-984.990000
15:40:14,-985.570561,-0.000025,1.7977E+308,-115.998766,-984.570000
15:40:14,-985.170561,-0.000025,1.7977E+308,-115.998734,-984.170000
15:40:15,-984.730561,-0.000025,1.7977E+308,-115.998698,-983.730000
15:40:15,-984.310561,-0.000025,1.7977E+308,-115.998665,-983.310000
15:40:15,-983.890561,-0.000025,1.7977E+308,-115.998631,-982.890000
15:40:15,-983.490561,-0.000025,1.7977E+308,-115.998599,-982.490000
15:40:15,-983.090561,-0.000025,1.7977E+308,-115.998567,-982.090000
I'd add to #robert king's answer that you could use itertools.groupby() to group rows with the same timestamp:
import csv
import shutil
from itertools import groupby

n = 24
time_increment = 1. / n
fractions = [("%.3f" % (i * time_increment,)).lstrip('0') for i in xrange(n)]

with open('input.csv', 'rb') as f, open('output.csv', 'wb') as fout:
    writer = csv.writer(fout)
    # assume the file is sorted by timestamp
    for timestamp, group in groupby(csv.reader(f), key=lambda row: row[0]):
        sametime = list(group)  # all rows that have the same timestamp
        assert n >= len(sametime)
        for i, row in enumerate(sametime, start=n - len(sametime)):
            row[0] += fractions[i]  # append fractions of a second
        writer.writerows(sametime)

shutil.move('output.csv', 'input.csv')  # update the input file
The 'b' file mode is mandatory for csv in Python 2; otherwise entries that span several physical lines won't work.
If there are fewer than n entries with the same timestamp, the code assumes they are consecutive values from the end of a second.
Open the csv file and create a csv reader as per http://docs.python.org/library/csv.html
Also create a csv writer as per http://docs.python.org/library/csv.html
Now loop through each row of the file. On each row, modify the timestamp and then write it to your new csv file.
If you want the new csv file to replace the old csv file, at the end use shutil (http://docs.python.org/library/shutil.html) to replace it.
I recommend inside your loop you have a variable called "current_timestamp" and a variable called "current_increment". If the timestamp in the row is equal to the current_timestamp, simply add the increment, otherwise change them both appropriately.
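That loop can be sketched like this; a hedged, self-contained sketch with inline stand-in rows, which counts fractions forward from .000 rather than backward from the end of the second:

```python
n = 24  # samples per second

# Stand-in for rows read from the csv (timestamp in column 0)
rows = [["15:40:15", "-984.73"], ["15:40:15", "-984.31"], ["15:40:16", "-983.89"]]

current_timestamp = None
current_increment = 0
out = []
for row in rows:
    if row[0] == current_timestamp:
        current_increment += 1        # same second: advance the fraction
    else:
        current_timestamp = row[0]    # new second: reset the counter
        current_increment = 0
    # append the fractional part, e.g. 15:40:15 -> 15:40:15.042
    row[0] += ".%03d" % round(current_increment * 1000.0 / n)
    out.append(row)

print([r[0] for r in out])  # → ['15:40:15.000', '15:40:15.042', '15:40:16.000']
```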