I have two CSV files with timestamp data in str format.
The first file, CSV_1, has data resampled from a pandas time series into 15-minute blocks and looks like:
time ave_speed
1/13/15 4:30 34.12318398
1/13/15 4:45 0.83396195
1/13/15 5:00 1.466816057
CSV_2 has regular times from GPS points, e.g.:
id time lat lng
513620 1/13/15 4:31 -8.15949 118.26005
513667 1/13/15 4:36 -8.15215 118.25847
513668 1/13/15 5:01 -8.15211 118.25847
I'm trying to iterate through both files to find instances where the time in CSV_2 falls within a 15-minute time group in CSV_1 and then do something, in this case append ave_speed to every entry for which the condition is true.
Desired result using the above examples:
id time lat lng ave_speed
513620 1/13/15 4:31 -8.15949 118.26005 0.83396195
513667 1/13/15 4:36 -8.15215 118.25847 0.83396195
513668 1/13/15 5:01 -8.15211 118.25847 something else
I tried doing it solely in pandas DataFrames but ran into some trouble, so I thought this might be a workaround to achieve what I'm after.
This is the code I've written so far. I feel like it's close, but I can't seem to nail the logic to get my for loop returning the entries within each 15-minute time group.
import csv
import datetime

with open('path/CSV_2.csv', mode="rU") as infile:
    with open('path/CSV_1.csv', mode="rU") as newinfile:
        reader = csv.reader(infile)
        nreader = csv.reader(newinfile)
        next(nreader, None)  # skip the headers
        next(reader, None)  # skip the headers
        for row in nreader:
            for dfrow in reader:
                if (datetime.datetime.strptime(dfrow[2], '%Y-%m-%d %H:%M:%S') < datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S') and
                        datetime.datetime.strptime(dfrow[2], '%Y-%m-%d %H:%M:%S') > datetime.datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S') - datetime.timedelta(minutes=15)):
                    print dfrow[2]
Link to the pandas question I posted with the same problem: Pandas, check if timestamp value exists in resampled 30 min time bin of datetimeindex
EDIT:
Creating two lists of times, listOne with all the times from CSV_1 and listTwo with all the times from CSV_2, I'm able to find instances in the time groups. So something is weird about using the CSV values directly. Any help would be appreciated.
I feel like the code below is pretty close to what I want, if anyone is curious how to do the same thing. It's not massively efficient, and the current script takes roughly a day because the double loop iterates over all the rows multiple times.
If anyone has any thoughts on how to make this easier or quicker, I'd be very interested.
import csv

#OPEN THE CSV FILES
with open('/GPS_Timepoints.csv', mode="rU") as infile:
    with open('/Resampled.csv', mode="rU") as newinfile:
        reader = csv.reader(infile)
        nreader = csv.reader(newinfile)
        next(nreader, None)  # skip the headers
        next(reader, None)  # skip the headers

        #DICT COMPREHENSION TO GET ONLY THE DESIRED DATA FROM CSV
        checkDates = {row[0]: row[7] for row in nreader}
        x = checkDates.items()

        # READ CSV INTO LIST (SEEMED TO BE EASIER THAN READING DIRECT FROM CSV FILE, I DON'T KNOW IF IT'S FASTER)
        csvDates = []
        for row in reader:
            csvDates.append(row)

        #LOOP 1 TO ITERATE OVER FULL RANGE OF DATES IN RESAMPLED DATA AND A PRINT STATEMENT TO GIVE ME HOPE THE PROGRAM IS RUNNING
        for i in range(0, len(x)):
            print 'checking', i
            #TEST TO SEE IF THE TIME IS IN THE TIME RANGE, THEN IF TRUE INSERT THE DESIRED ATTRIBUTE, IN THIS CASE SPEED TO THE ROW
            for row in csvDates:
                if row[2] > x[i-1][0] and row[2] < x[i][0]:
                    row.insert(9, x[i][1])

# GET THE RESULT TO CSV TO UPLOAD INTO GIS
with open('/result.csv', mode="w") as outfile:
    wr = csv.writer(outfile)
    wr.writerow(['id', 'boat_id', 'time', 'state', 'lat', 'lng', 'activity', 'speed', 'state_reason'])
    for row in csvDates:
        wr.writerow(row)
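One way to avoid re-reading the resampled data for every GPS row is to parse the bin times once, sort them, and look each GPS time up with bisect. The following is only a sketch, not a drop-in replacement: it assumes the timestamps really are in the '%Y-%m-%d %H:%M:%S' format the strptime calls above use, that the 15-minute bins are contiguous, and the same column positions as the script above (time in column 0 and speed in column 7 of Resampled.csv, time in column 2 of GPS_Timepoints.csv).

import bisect
import csv
import datetime

def parse(ts):
    # same format string as in the strptime calls above
    return datetime.datetime.strptime(ts, '%Y-%m-%d %H:%M:%S')

# read the resampled bins once: (bin end time, ave_speed), sorted by time
with open('/Resampled.csv', mode="rU") as newinfile:
    nreader = csv.reader(newinfile)
    next(nreader, None)  # skip the headers
    bins = sorted((parse(row[0]), row[7]) for row in nreader)
bin_times = [t for t, speed in bins]

with open('/GPS_Timepoints.csv', mode="rU") as infile, \
     open('/result.csv', mode="w") as outfile:
    reader = csv.reader(infile)
    wr = csv.writer(outfile)
    wr.writerow(next(reader) + ['ave_speed'])
    for row in reader:
        t = parse(row[2])
        # index of the first bin label strictly greater than t,
        # i.e. the end of the 15-minute block that t falls into
        i = bisect.bisect_right(bin_times, t)
        speed = bins[i][1] if i < len(bins) else ''
        wr.writerow(row + [speed])

This turns the double loop into one pass over each file plus a binary search per GPS row, which should cut the runtime from about a day to minutes.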
Related
I have generated a CSV file whose format is shown in the image below.
In this image the data is mostly arranged week by week, but in some places I couldn't arrange it week-wise. If you look at the image you will see a red mark and a blue mark; I want to separate the data at both of these marks. How can I do that?
Note: if there is a holiday on Friday, then the week should run from Monday to Thursday.
Currently, I'm using the logic below:
import csv

blank_fields = [' ']
fields = [' ', 'Weekly Avg']

# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []

# Read data row by row and store into new list
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)
file1.close()
Overall you are going in the right direction; your condition is just a little too error-prone, and things can go wrong (e.g., if just one day of a week appears in your list). So testing for the weekday string isn't the best choice here.
I would suggest "understanding" the date/time in your table to solve this using weekdays, like this:
from datetime import datetime as dt, timedelta as td

# remember the last date seen
last = None
# each item in 'weeks' represents one week, while _cur_week is a temp-var
weeks = []
_cur_week = []

for row in reader:  # 'reader' is the csv.reader from your snippet
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # now if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of more than 7 days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for the next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant: append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date

# don't forget the final, still-open week
if _cur_week:
    weeks.append(_cur_week)
Pretty verbose and extensive, but I hope it conveys the pattern used here:
parse the existing date and use the weekday to distinguish one week from another (i.e., the weekday increases monotonically within a week, so any decrease (or equality) tells you the current date belongs to the next week)
store rows in a temporary list during one week
append _cur_week to weeks once the condition for the next week is triggered
empty _cur_week for the next rows, i.e., the next week
Finally the last thing to do is to "concat" the data e.g. like this:
new_rows_list = [[fields] + week for week in weeks]
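If you then want a flat CSV file again rather than a nested list, a small sketch (assuming weeks and fields from above; the output file name here is just a placeholder) could write it out directly:

import csv

with open('test_weekly.csv', 'wb') as out:
    writer = csv.writer(out)
    for week in weeks:
        writer.writerow(fields)   # the 'Weekly Avg' separator row for this week
        writer.writerows(week)    # all rows belonging to this week

Swap the two calls inside the loop if you prefer the separator row after each week instead of before it.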
I have another approach for the same thing; it works successfully and is an easy solution.
import csv
import datetime

fields = {'': ' ', 'Date': 'Weekly Avg'}

# Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None

# Read data row by row and store into new list
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)
file1.close()
I am processing a CSV file in Python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv

time = []
alt = []
dct = {}

with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1])  # etc. for all columns
I am pretty new to Python. Is this a good way to tackle this, and if not, what is a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame based on the columns; it's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert it to a list if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print dict_of_lists
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
# This dict comprehension might work faster.
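If you would rather not depend on pandas, a rough equivalent using only the csv module is sketched below. It assumes the first row of the file holds the column names and that shorter columns are padded with blank cells, as in your example; the file name is the same placeholder used above.

import csv

columns = {}  # column name -> list of values
with open('test.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    names = next(reader)          # header row gives the column names
    for name in names:
        columns[name] = []
    for row in reader:
        for name, value in zip(names, row):
            columns[name].append(value)

print columns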
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was interpolate to the highest sampling rate. So here is what I came up with; please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new at Python (about 2-3 weeks, but with some former programming experience).
import csv

header = []
# initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile:  # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0])  # make a list that consists of all content in column A

for x in range(0, len(header) - 1):  # go through entire column
    if header[x].isdigit() and header[x+1] == "":  # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x+1].isdigit() and header[x] == "":  # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x+1])
        if temp_f > temp_i:  # calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print header

with open("output.csv", 'wb') as g:  # write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header" with its interpolated values back out so that it replaces column A of my old file, test2.csv.
Anywho thank you very much for looking...
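For that last step (putting the interpolated values back in place of column A while keeping the other columns), here is a minimal sketch. It assumes header from the script above holds one value per row of test2.csv; the output file name is just a placeholder.

import csv

# read the whole original file back in
with open('test2.csv', 'r') as f:
    rows = list(csv.reader(f))

# replace column A of each row with the interpolated value;
# 'header' is the list built by the script above
for row, new_value in zip(rows, header):
    row[0] = new_value

# write everything (all columns, with the new column A) to a new file
with open('test2_interpolated.csv', 'wb') as g:
    csv.writer(g).writerows(rows)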
I am writing a script (i.e. one that runs every once in a while) in which I read data from an Excel file. For that data I create an id based on the date and time. I have one missing variable, which is contained in a txt file. The txt file also has a date and time from which to create an id.
Now I would like to link the data from the Excel file and the txt file based on that id. Right now I am building two lists from the txt file, one containing the id and the other containing the value I need. Then I get the index in the id list where the id is the same in both data sets, using the enumerate function, and I use that index to get the value from the value list. The code looks something like this:
import os

datelist = []
valuelist = []

# 'folder' is defined earlier in the script
txtfile = open(folder + os.sep + "Textfile.txt", "r")
ILines = txtfile.readlines()
for i, row in enumerate(ILines):
    datelist.append(row.split(",")[1])
    valuelist.append(row.split(",")[2])

rows = myexceldata  # the rows already read from the Excel file
for row in rows:
    x = row[id]     # 'id' is the column index of the id in the Excel data
    row = row + valuelist[[i for i, e in enumerate(datelist) if e == x][0]]
However, that takes ages, and I wonder if there is a better way to do that.
The files look like that:
Excelfile:
Date Time Var1 Var2
03.02.2016 12:53:24 10 27
03.02.2016 12:53:25 10 27
03.02.2016 12:53:26 10 27
Textfile:
Date Time Var3
03.02.2016 12:53:24 16
03.02.2016 12:53:25 20
Result:
Date Time Var1 Var2 Var3
03.02.2016 12:53:24 10 27 16
03.02.2016 12:53:25 10 27 20
03.02.2016 12:53:26 10 27 *)
*) It would be perfect if this were the same value as above, but empty would be OK, too.
OK, I forgot one important thing, sorry about that: not all times in the Excel file are in the text file. The best option would be to take Var3 from the last time in the text file just before the time in the Excel file, but leaving it blank would also be an option.
If both of your files are sorted in time order then the following kind of approach would be fast:
from heapq import merge
from itertools import groupby, chain
import csv

with open('excel.txt', 'rb') as f_excel, open('textfile.txt', 'rb') as f_text, open('output.txt', 'wb') as f_output:
    csv_excel = csv.reader(f_excel)
    csv_text = csv.reader(f_text)
    csv_output = csv.writer(f_output)

    header_excel = next(csv_excel)
    header_text = next(csv_text)
    csv_output.writerow(header_excel + [header_text[-1]])

    for k, g in groupby(merge(csv_text, csv_excel), key=lambda x: x[0:2]):
        csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))
This assumes your two input files are both in csv format, and works as follows:
Create csv readers/writers for all of the files. This allows the files to automatically be read in as lists of columns without requiring each line to be split.
Extract the headers from both of the files and write a combined form to the output.
Take the two input files and pass them to merge. This returns a row at a time from either input file in order.
Pass this to groupby to group rows with the same date and time together. This returns a key and a group, where the key is the date and time that matched, and the group is an iterable of the matching rows.
For each grouped entry, write the key and columns 2 onwards from each row to the output file. chain is used to produce a flat list.
This would give you an output file as follows:
Date,Time,Var1,Var2,Var3
03.02.2016,12:53:24,10,27,16
03.02.2016,12:53:25,10,27,20
As you already have the excel data, this would need to be passed to merge instead of csv_excel as a list of rows/cols.
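For example, a sketch of that substitution might look like the following. It assumes myexceldata is your already-loaded Excel data as a list of [Date, Time, Var1, Var2] rows, sorted by date and time, and that the hard-coded header names match your file.

from heapq import merge
from itertools import groupby, chain
import csv

# 'myexceldata' is the in-memory Excel data: a sorted list of
# [Date, Time, Var1, Var2] rows
with open('textfile.txt', 'rb') as f_text, open('output.txt', 'wb') as f_output:
    csv_text = csv.reader(f_text)
    csv_output = csv.writer(f_output)

    header_text = next(csv_text)
    csv_output.writerow(['Date', 'Time', 'Var1', 'Var2', header_text[-1]])

    for k, g in groupby(merge(csv_text, myexceldata), key=lambda x: x[0:2]):
        csv_output.writerow(k + list(chain.from_iterable(cols[2:] for cols in g)))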
I have the following problem: I would like to read a text file of data consisting of two columns, year and temperature, and be able to calculate the minimum temperature, etc., for each year. The whole file starts like this:
1995.0012 -1.34231
1995.3030 -3.52533
1995.4030 -7.54334
and so on, until year 2013. I had the following idea:
f = open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = float(columns[0])
    temperature = columns[1]
    if year - 1995 < 1 and year - 1995 > 0:
        print 1995, min(temperature)
With this I get only the year-1995 data, which is what I want as a first step. In a second step I would like to calculate the minimum temperature of the whole dataset for the year 1995. Using the script above, however, I get a minimum for every line in the data file. I tried building a list and then appending the temperature, but I ran into trouble when I wanted to transform the year into an integer or the temperature into a float, etc.
I feel like I am missing the right idea of how to calculate the minimum of a set of values in a column (but not of the whole column).
Any ideas how I could approach this problem? I am trying to learn Python but am still at a beginner's stage, so if there is a way to do the whole thing without using "advanced" commands, I'd be ecstatic!
I could do this using the regexp
import re
from collections import defaultdict

REGEX = re.compile(ur"(\d{4})\.\d+ ([0-9\-\.\+]+)")

f = open('munich_temperatures_average.txt', 'r')
data = defaultdict(list)
for line in f:
    year, temperature = REGEX.findall(line)[0]
    temperature = float(temperature)
    data[year].append(temperature)

print min(data["1995"])
You could use the csv module which would make it very easy to read and manipulate each row of your file:
import csv

with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        print "year", row[0], "temp", row[1]
Afterwards it is just a matter of finding the min temperature in the rows. See the csv module documentation.
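A minimal sketch of that last step, assuming the same single-space-delimited file as above and keeping a running minimum per year in a dict, could be:

import csv

min_temps = {}  # year -> lowest temperature seen so far
with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        year = int(float(row[0]))
        temp = float(row[1])
        if year not in min_temps or temp < min_temps[year]:
            min_temps[year] = temp

print min_temps[1995]  # minimum temperature recorded in 1995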
If you just want the years and the temps:
years, temp = [], []
with open("f.txt") as f:
    for line in f:
        spl = line.rstrip().split()
        years.append(int(spl[0].split(".")[0]))
        temp.append(float(spl[1]))
print years, temp
[1995, 1995, 1995] [-1.34231, -3.52533, -7.54334]
I previously submitted another approach, using the numpy library, which could be confusing considering that you are new to Python. Sorry for that. As you mentioned yourself, you need to keep some kind of record for the year 1995, but you don't need a list for that:
mintemp1995 = None
for line in f:
    line = line.strip()
    columns = line.split()
    year = int(float(columns[0]))
    temp = float(columns[1])
    if year == 1995 and (mintemp1995 is None or temp < mintemp1995):
        mintemp1995 = temp
print "1995:", mintemp1995
Note the cast of the year to int, so you can compare it directly to 1995, and the condition after it:
If the variable mintemp1995 has never been set before (it is None, and therefore this is the first matching entry of the dataset), or the current temperature is lower than it, it is replaced, so you keep a record of only the lowest temperature.
First off, I apologize for the terrible title; I didn't know how to summarize my problem. Okay, so here are the first few lines of my .csv file. The first column is the timestamp. The program I'm getting this data from samples 24 times per second, so there are 24 rows that start with 15:40:15, 24 that start with 15:40:16, and so on. Instead of 24 rows with the same timestamp, I want the timestamp to increase in increments of 1/24 second, or .042 seconds: 15:40:15.042, 15:40:15.084, etc.
Another problem is that there aren't 24 rows for the first second, because the data starts in the middle of a second. For example, there are only 13 rows starting with 15:40:14. For those it would preferably count backwards from 15:40:15.000 and subtract .042 seconds for every row.
How can I do this in Python? Thanks in advance!
CPUtime,Displacement Into Surface,Load On Sample,Time On Sample,Raw Load,Raw Displacement
15:40:14,-990.210561,-0.000025,1.7977E+308,-115.999137,-989.210000
15:40:14,-989.810561,-0.000025,1.7977E+308,-115.999105,-988.810000
15:40:14,-989.410561,-0.000025,1.7977E+308,-115.999073,-988.410000
15:40:14,-989.010561,-0.000025,1.7977E+308,-115.999041,-988.010000
15:40:14,-988.590561,-0.000025,1.7977E+308,-115.999007,-987.590000
15:40:14,-988.170561,-0.000025,1.7977E+308,-115.998974,-987.170000
15:40:14,-987.770561,-0.000025,1.7977E+308,-115.998942,-986.770000
15:40:14,-987.310561,-0.000025,1.7977E+308,-115.998905,-986.310000
15:40:14,-986.870561,-0.000025,1.7977E+308,-115.998870,-985.870000
15:40:14,-986.430561,-0.000025,1.7977E+308,-115.998834,-985.430000
15:40:14,-985.990561,-0.000025,1.7977E+308,-115.998799,-984.990000
15:40:14,-985.570561,-0.000025,1.7977E+308,-115.998766,-984.570000
15:40:14,-985.170561,-0.000025,1.7977E+308,-115.998734,-984.170000
15:40:15,-984.730561,-0.000025,1.7977E+308,-115.998698,-983.730000
15:40:15,-984.310561,-0.000025,1.7977E+308,-115.998665,-983.310000
15:40:15,-983.890561,-0.000025,1.7977E+308,-115.998631,-982.890000
15:40:15,-983.490561,-0.000025,1.7977E+308,-115.998599,-982.490000
15:40:15,-983.090561,-0.000025,1.7977E+308,-115.998567,-982.090000
I'd add to #robert king's answer that you could use itertools.groupby() to group rows with the same timestamp:
import csv
import shutil
from itertools import groupby

n = 24
time_increment = 1. / n
fractions = [("%.3f" % (i * time_increment,)).lstrip('0') for i in xrange(n)]

with open('input.csv', 'rb') as f, open('output.csv', 'wb') as fout:
    writer = csv.writer(fout)
    # assume the file is sorted by timestamp
    for timestamp, group in groupby(csv.reader(f), key=lambda row: row[0]):
        sametime = list(group)  # all rows that have the same timestamp
        assert n >= len(sametime)
        for i, row in enumerate(sametime, start=n - len(sametime)):
            row[0] += fractions[i]  # append fractions of a second
        writer.writerows(sametime)

shutil.move('output.csv', 'input.csv')  # update input file
The 'b' file mode is mandatory for the csv module in Python 2, otherwise entries that span several physical lines won't be handled correctly.
If there are fewer than n entries with the same timestamp, the code assumes they are consecutive values from the end of a second.
open the csv file and create a csv reader as per http://docs.python.org/library/csv.html
Also create a csv writer as per http://docs.python.org/library/csv.html
Now loop through each row of the file. On each row, modify the timestamp and then write it to your new csv file.
If you want the new csv file to replace the old csv file, at the end use shutil http://docs.python.org/library/shutil.html to replace it.
I recommend inside your loop you have a variable called "current_timestamp" and a variable called "current_increment". If the timestamp in the row is equal to the current_timestamp, simply add the increment, otherwise change them both appropriately.
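A rough sketch of that approach is below. It assumes 24 samples per second, uses placeholder file names, and borrows the fraction formatting from the other answer; note it counts forward within each second rather than backwards for the first partial second.

import csv
import shutil

n = 24                 # samples per second
increment = 1.0 / n

with open('input.csv', 'rb') as f, open('new.csv', 'wb') as out:
    reader = csv.reader(f)
    writer = csv.writer(out)
    writer.writerow(next(reader))           # copy the header row unchanged

    current_timestamp = None
    current_increment = 0.0
    for row in reader:
        if row[0] == current_timestamp:
            current_increment += increment  # same second: step the fraction
        else:
            current_timestamp = row[0]      # new second: reset the fraction
            current_increment = 0.0
        # append the fractional part, e.g. 15:40:15 -> 15:40:15.042
        row[0] = "%s%s" % (current_timestamp, ("%.3f" % current_increment).lstrip('0'))
        writer.writerow(row)

# replace the old csv file with the new one, as suggested above
shutil.move('new.csv', 'input.csv')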