Reading and calculation using csv - python

I'm new to Python, so pardon me if this question sounds silly.
I have a CSV file with 2 columns, Value and Timestamp. I'm trying to write code that takes 2 parameters, start_date and end_date, traverses the CSV file to collect all the values between those 2 dates, and prints the sum of Value.
Below is my code. I'm trying to read and store the values in a list.
f_in = open('Users2.csv').readlines()
Value1 = []
Created = []
for i in range(1, len(f_in)):
    Value, created_date = f_in[i].split(',')
    Value1.append(Value)
    Created.append(created_date)
print Value1
print Created
My csv has the following format
10 2010-02-12 23:31:40
20 2010-10-02 23:28:11
40 2011-03-12 23:39:40
10 2013-09-10 23:29:34
420 2013-11-19 23:26:17
122 2014-01-01 23:41:51
When I run my code - File1.py as below
File1.py 2010-01-01 2011-03-31
The output should be 70
I'm running into the following issues:
1. The data in the CSV is a timestamp (created_date), but the parameters passed are dates, so I need to convert and get the data between those 2 dates regardless of time.
2. Once I have the data in lists, as described above, how do I do my calculation given the condition in point 1?

You can try this:
import csv
data = csv.reader(open('filename.csv'))
start_date = 10
end_date = 30
times = [' '.join(i) for i in data if int(i[0]) in range(start_date, end_date)]

Depending on your file size, you may consider loading the values from the CSV file into a database and then querying it for your results.
The csv module has DictReader, which lets you predefine your column names. This greatly improves readability, especially when working with really big files.
import csv
COLUMN_NAMES = ['value', 'timestamp']
def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv', mode='r') as csvfile:
        table = csv.DictReader(csvfile, fieldnames=COLUMN_NAMES)
        for row in table:
            # Timestamps in 'YYYY-MM-DD HH:MM:SS' form compare correctly as strings
            if start_date <= row['timestamp'] <= end_date:
                total += int(row['value'])
    return total

If you are open to using pandas, try this:
>>> import pandas as pd
>>> data = 'Users2.csv'
>>>
>>> # parse_dates converts the 'date' column to datetime64 while reading
>>> df = pd.read_csv(data, names=['value', 'date'], parse_dates=['date'])
>>> result = df['value'][(df['date'] > '2010-01-01') &
... (df['date'] < '2011-03-31')
... ].sum()
>>> result
70

Since you said the dates are timestamps in a fixed format, you can compare them as strings. With that in mind, what you want to achieve (sum the values if created is between start_date and end_date) can be done like this:
def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv') as f:
        for line in f:
            value, created = line.split(' ', 1)
            if created > start_date and created < end_date:
                total += int(value)
    return total
str.split(' ', 1) splits on ' ' but stops after the first split. start_date and end_date must be in the format yyyy-MM-dd hh:mm:ss, which I assume they are since they are timestamps. Just keep that in mind.
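If the parameters are plain dates and the time of day should be ignored, one option is to parse each timestamp and compare only its date part. The following is a minimal sketch, not a definitive implementation: it assumes the rows are comma-separated with no header row, and the names sum_between and SAMPLE_ROWS are made up for illustration.

```python
import csv
from datetime import date, datetime

# Sample rows from the question, assumed here to be comma-separated
SAMPLE_ROWS = [
    "10,2010-02-12 23:31:40",
    "20,2010-10-02 23:28:11",
    "40,2011-03-12 23:39:40",
    "10,2013-09-10 23:29:34",
    "420,2013-11-19 23:26:17",
    "122,2014-01-01 23:41:51",
]

def sum_between(lines, start_date, end_date):
    """Sum the first column for rows dated on or between two YYYY-MM-DD dates."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    total = 0
    for value, created in csv.reader(lines):
        # Drop the time of day so the comparison is date-to-date
        created_day = datetime.strptime(created.strip(), "%Y-%m-%d %H:%M:%S").date()
        if start <= created_day <= end:
            total += int(value)
    return total

print(sum_between(SAMPLE_ROWS, "2010-01-01", "2011-03-31"))  # 70
```

To read from a file instead, pass the open file object as lines; both bounds are inclusive in this sketch.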

Related

How to find whose birthdate is next given a CSV file of names and birthdays?

We are given a CSV file containing names and birthdays, and we have to write a function that outputs whose birthday is next.
I keep getting an unbound local variable error and am not sure how to fix it. Basically I'm trying to read the file, check the dates, find which date is next, then return the name connected with that date.
birthdates.csv:
Draven Brock,01/21/1952
Easton Mclean,09/02/1954
Destiny Pacheco,10/10/1958
Ariella Wood,12/20/1961
Keely Sanders,08/03/1985
Bryan Sloan,04/06/1986
Shannon Brewer,05/11/1986
Julianne Farrell,01/29/2000
Makhi Weeks,03/20/2000
Lucian Fields,08/02/2010
Function Call:
nextBirthdate("birthdates.csv", "01/01/2022")
Output:
Draven Brock
def nextBirthdate(filename, date):
    f = open(filename, 'r')
    lines = f.readlines()
    f.close()
    for i in range(len(lines)):
        lines[i] = lines[i].strip()
    # split for the target date
    date = line.split('/')
    month = date[0]
    day = date[1]
    diff = 365
    diffDays = 0
    bName = None
    bDays = []
    for line in lines:
        items = line.split(",")
        names = items[0]
        # split the date apart between month, day, and year
        bDay = items[1].split("/")
        bDays.append(bDay)
    for d in bDays:
        if bDay[0] == month:
            if bDay[1] > day:
                diffDays = int(bDay[0]) - int(day)
                if diffdays < diff:
                    diff = diffDays
                    bName = name
        elif bDay[0] > month:
            diffDays = ((int(bDay[0]) - 1) * 31) + int(day)
            if diffDays < diff:
                diff = diffDays
                bName = name
    if bName == None:
        return nextBirthdate(filename, "01/01/2022")
    return bName
if __name__ == "__main__":
    filename = "birthdates.csv"
    date = "12/31/2022"
    print(nextBirthdate(filename, date))
Welcome to Stack Overflow.
The way to do this is:
1. Read in your CSV data.
2. Convert the data to e.g. a pandas DataFrame.
2a. I suggest you convert the dates to e.g. pd.Timestamp rather than keeping them as strings, to avoid bad things happening when comparing.
3. Sort the DataFrame by date, or find the row with the min value; it's up to you.
How to do each of those things is its own question that already has at least one answer on Stack Overflow and elsewhere, so you probably do not need to ask a new one; just search for it.
So here is what I understood: you just want to know whose birthday is next and get that person's name from the CSV file. The code below should work:
import pandas as pd
def nextBirthdate(file, inp_date):
    # The dates are in month/day/year order
    target = pd.to_datetime(inp_date, format='%m/%d/%Y')
    df = pd.read_csv(file, header=None)
    df[1] = pd.to_datetime(df[1], format='%m/%d/%Y')
    # Move every birthday into the year of the input date
    # (a Feb 29 birthday would need special handling here)
    df['next'] = df[1].apply(lambda x: x.replace(year=target.year))
    # Birthdays that have already passed roll over to the following year
    passed = df['next'] < target
    df.loc[passed, 'next'] = df.loc[passed, 'next'] + pd.DateOffset(years=1)
    return df.sort_values(by='next')[0].values[0]
nextBirthdate("birthdates.csv", "01/01/2022")
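For comparison, the same idea can be sketched with only the standard library. This is an illustrative sketch under a few assumptions: rows are "name,MM/DD/YYYY" strings, a birthday falling on the query date counts as "next", and Feb 29 birthdays are moved to Feb 28 in non-leap years. The names next_birthday_name, _in_year and ROWS are hypothetical.

```python
import csv
from datetime import datetime

ROWS = [
    "Draven Brock,01/21/1952",
    "Easton Mclean,09/02/1954",
    "Destiny Pacheco,10/10/1958",
    "Ariella Wood,12/20/1961",
    "Keely Sanders,08/03/1985",
    "Bryan Sloan,04/06/1986",
    "Shannon Brewer,05/11/1986",
    "Julianne Farrell,01/29/2000",
    "Makhi Weeks,03/20/2000",
    "Lucian Fields,08/02/2010",
]

def _in_year(born, year):
    """Return the birthday moved into the given year (Feb 29 becomes Feb 28 if needed)."""
    try:
        return born.replace(year=year)
    except ValueError:
        return born.replace(year=year, day=28)

def next_birthday_name(rows, today_str):
    """Return the name whose birthday comes first on or after today_str (MM/DD/YYYY)."""
    today = datetime.strptime(today_str, "%m/%d/%Y").date()
    best_name, best_date = None, None
    for name, born_str in csv.reader(rows):
        born = datetime.strptime(born_str.strip(), "%m/%d/%Y").date()
        candidate = _in_year(born, today.year)
        if candidate < today:
            # Already passed this year, so the next occurrence is next year
            candidate = _in_year(born, today.year + 1)
        if best_date is None or candidate < best_date:
            best_name, best_date = name.strip(), candidate
    return best_name

print(next_birthday_name(ROWS, "01/01/2022"))  # Draven Brock
```

Comparing real date objects avoids the string comparisons and the 31-days-per-month arithmetic in the question's attempt, and the year rollover handles a query date late in December.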

split the date range into multiple ranges

I have data in CSV like this:
1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
I want to separate out the dates from 1 October of each year to 31 March of the following year, for all the data. So for the data above the output will be:
1940/1941:
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941/1942:
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
1942-10-01,somevalue
My code attempts are:
import csv
from datetime import datetime
with open('data.csv','r') as f:
    data = list(csv.reader(f))
quaters = []
year = datetime.strptime(data[0][0], '%Y-%m-%d').year
for each in data:
    date = datetime.strptime(each[0], '%Y-%m-%d')
    print(each)
    if (date>=datetime(year=date.year,month=10,day=1) and date<=datetime(year=date.year+1,month=3,day=31)):
        middle_quaters[-1].append(each)
    if year != date.year:
        quaters.append([])
But I am not getting the expected output. I want to store each range of dates in a separate list.
I would use a pandas DataFrame to do this; it would be easier. See:
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
So for your case:
import pandas as pd
# Parse the first column as dates and use it as the index
df = pd.read_csv("data.csv", header=None, names=["date", "value"], parse_dates=["date"], index_col="date")
df.loc[startDate : endDate]
# you can walk through a bunch of ranges like so..
listOfDateRanges = [(), (), ()]  # fill in (start, end) pairs
for date_range in listOfDateRanges:
    df.loc[date_range[0] : date_range[1]]
Without external packages: create a lookup keyed on the field of choice, convert each key to an int, and do a less-than / greater-than comparison to establish the range.
import re
data = '''1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue'''
lookup = {}
lines = data.split('\n')
for line in lines:
    # Turn '1940-10-01' into '19401001' so it can be compared as an int
    d = re.sub(r'-', '', line.split(',')[0])
    lookup[d] = line
dates = sorted(lookup.keys())
_in = 19401201
out = 19411004
outfile = []
for date in dates:
    if int(date) > _in and int(date) < out:
        outfile.append(lookup[date])
for l in outfile:
    print(l)
For this purpose you can use pandas library. Here is the sample code for the same:
import pandas as pd
df = pd.read_csv('so.csv', parse_dates=['timestamp']) #timestamp is your time column
current_year, next_year = 1940, 1941
df = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
print (df)
This gives following result on your data:
   timestamp      value
0 1940-10-01  somevalue
1 1940-11-02  somevalue
2 1940-11-03  somevalue
3 1940-11-04  somevalue
4 1940-12-05  somevalue
5 1940-12-06  somevalue
6 1941-01-07  somevalue
7 1941-02-08  somevalue
8 1941-03-09  somevalue
Hope this helps!
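To store every October-to-March range in its own list in a single pass, each row can be assigned to the season it belongs to. The sketch below uses only the standard library and is one possible approach, not the method of any answer above; split_into_seasons and the "YYYY/YYYY" key format are illustrative choices, and 1 October is treated as inside the season.

```python
from collections import defaultdict
from datetime import datetime

def split_into_seasons(rows):
    """Group (date_str, value) rows into Oct 1 - Mar 31 seasons keyed by 'YYYY/YYYY'."""
    seasons = defaultdict(list)
    for date_str, value in rows:
        d = datetime.strptime(date_str, "%Y-%m-%d")
        if d.month >= 10:
            start_year = d.year        # October to December open a season
        elif d.month <= 3:
            start_year = d.year - 1    # January to March close the previous year's season
        else:
            continue                   # April to September fall outside every season
        seasons[f"{start_year}/{start_year + 1}"].append((date_str, value))
    return dict(seasons)

rows = [
    ("1940-10-01", "somevalue"), ("1940-12-05", "somevalue"),
    ("1941-01-07", "somevalue"), ("1941-05-01", "somevalue"),
    ("1941-10-04", "somevalue"), ("1942-03-09", "somevalue"),
]
for season, items in split_into_seasons(rows).items():
    print(season + ":")
    for date_str, value in items:
        print(f"{date_str},{value}")
```

Because the season is derived from each row's own month and year, no running "current year" variable is needed, which is where the original loop went wrong.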

Iterate over dates, calculate averages for every 24-hour period

I have a CSV file with data every ~minute over 2 years, and I want to run code to calculate 24-hour averages. Ideally I'd like the code to iterate over the data, calculate averages, standard deviations, and R^2 between dataA and dataB for every 24-hour period, and then output this new data into a new CSV file (with a datestamp and the calculated data for each 24-hour period).
The data has an unusual timestamp, which I think might be tripping me up slightly. I've been trying different for loops to iterate over the data, but I'm not sure how to specify that I want the averages, etc. for each 24-hour period.
This is the code I have so far, but I'm not sure how to complete the for loop to achieve what I want. If anyone can help, that would be great!
import math
import pandas as pd
import os
import numpy as np
from datetime import timedelta, date
# read the file in csv
data = pd.read_csv("Jacaranda_data_HST.csv")
# Extract the data columns from the csv
data_date = data.iloc[:,1]
dataA = data.iloc[:,2]
dataB = data.iloc[:,3]
# set the start and end dates of the data
start_date = data_date.iloc[0]
end_date = data_date.iloc[-1:]
# for loop to run over every 24 hours of data
day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in
        range(day_count)) if d <= end_date]:
    print np.mean(dataA), np.mean(dataB), np.std(dataA), np.std(dataB)
# output new csv file - **unsure how to call the data**
csvfile = "Jacaranda_new.csv"
outdf = pd.DataFrame()
#outdf['dataA_mean'] = ??
#outdf['dataB_mean'] = ??
#outdf['dataA_stdev'] = ??
#outdf['dataB_stdev'] = ??
outdf.to_csv(csvfile, index=False)
A simplified approach could be to group by calendar day in a dict. I don't have much experience with time handling in pandas DataFrames, so this could be an alternative.
You could create a dict where the keys are the dates of the data (without the time part), so you can later calculate the mean of all the data points that are under each key.
data_date = data.iloc[:,1]
data_a = data.iloc[:,2]
data_b = data.iloc[:,3]
import collections
dd_a = collections.defaultdict(list)
dd_b = collections.defaultdict(list)
for date_str, data_point_a, data_point_b in zip(data_date, data_a, data_b):
    # we split the string by the first space, so we get only the date part
    date_part, _ = date_str.split(' ', maxsplit=1)
    dd_a[date_part].append(data_point_a)
    dd_b[date_part].append(data_point_b)
Now you can calculate the averages:
for date, v_list in dd_a.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
for date, v_list in dd_b.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
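The same grouping can be extended to the standard deviations and the R^2 the question asks for. The sketch below is a hedged illustration using only the standard library: it assumes each timestamp starts with a sortable date followed by a space (the real "unusual" format is not shown in the question), it uses the population standard deviation to match np.std's default, and daily_stats is a made-up name.

```python
from collections import defaultdict
from statistics import mean, pstdev

def daily_stats(rows):
    """Return (day, mean_a, mean_b, std_a, std_b, r_squared) per calendar day.

    rows is an iterable of (timestamp_str, a, b) tuples.
    """
    by_day = defaultdict(list)
    for stamp, a, b in rows:
        day = stamp.split(' ', 1)[0]  # keep only the date part
        by_day[day].append((float(a), float(b)))
    out = []
    for day in sorted(by_day):
        a_vals = [a for a, _ in by_day[day]]
        b_vals = [b for _, b in by_day[day]]
        mean_a, mean_b = mean(a_vals), mean(b_vals)
        # Squared Pearson correlation; left as None when either series is constant
        cov = sum((a - mean_a) * (b - mean_b) for a, b in by_day[day])
        var_a = sum((a - mean_a) ** 2 for a in a_vals)
        var_b = sum((b - mean_b) ** 2 for b in b_vals)
        r2 = cov * cov / (var_a * var_b) if var_a and var_b else None
        out.append((day, mean_a, mean_b, pstdev(a_vals), pstdev(b_vals), r2))
    return out

sample = [
    ("2020-01-01 00:00", 1, 2),
    ("2020-01-01 00:01", 3, 6),
    ("2020-01-02 00:00", 5, 5),
]
for row in daily_stats(sample):
    print(row)
```

Each returned tuple can be written as one line of the new CSV file with csv.writer, which fills in the columns left as ?? in the question.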

Convert DataFrame column date from 2/3/2007 format to 20070223 with python

I have a dataframe with 'Date' and 'Value', where the Date is in format m/d/yyyy. I need to convert to yyyymmdd.
df2= df[["Date", "Transaction"]]
I know datetime can do this for me, but I can't get it to accept my format.
example data files:
6/15/2006,-4.27,
6/16/2006,-2.27,
6/19/2006,-6.35,
You first need to convert to datetime, using pd.to_datetime, then you can format it as you wish using strftime:
>>> df
        Date  Transaction
0  6/15/2006        -4.27
1  6/16/2006        -2.27
2  6/19/2006        -6.35
>>> df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y').dt.strftime('%Y%m%d')
>>> df
       Date  Transaction
0  20060615        -4.27
1  20060616        -2.27
2  20060619        -6.35
You can say:
df['Date'] = df['Date'].dt.strftime('%Y%m%d')
The dt accessor's strftime method is your friend here.
Note: if you haven't converted the column to pandas datetime yet, do:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m%d')
Output:
       Date  Transaction
0  20060615        -4.27
1  20060616        -2.27
2  20060619        -6.35
For a raw Python solution, you could try something along the following lines (assuming datafile is a string).
datafile="6/15/2006,-4.27,\n6/16/2006,-2.27,\n6/19/2006,-6.35"
def zeroPad(s, desiredLen):
    # Left-pad s with zeros until it reaches desiredLen characters
    while (len(s) < desiredLen):
        s = "0" + s
    return s
def convToYYYYMMDD(datafile):
    datafile = ''.join(datafile.split('\n')) # remove \n's, they're unreliable and not needed
    datafile = datafile.split(',') # split by comma so '1,2' becomes ['1','2']
    out = []
    for i in range(0, len(datafile)):
        if (i % 2 == 0):
            # even positions hold the dates
            tmp = datafile[i].split('/')
            yyyymmdd = zeroPad(tmp[2], 4) + zeroPad(tmp[0], 2) + zeroPad(tmp[1], 2)
            out.append(yyyymmdd)
        else:
            out.append(datafile[i])
    return out
print(convToYYYYMMDD(datafile))
This outputs: ['20060615', '-4.27', '20060616', '-2.27', '20060619', '-6.35'].
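Without pandas, the standard library's datetime can do the same conversion for a single string. This is a small sketch; to_yyyymmdd is a hypothetical helper name.

```python
from datetime import datetime

def to_yyyymmdd(date_str):
    """Convert an 'm/d/yyyy' string (zero-padded or not) to 'yyyymmdd'."""
    return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y%m%d")

print(to_yyyymmdd("6/15/2006"))  # 20060615
```

strptime accepts single-digit months and days for %m and %d, so no manual zero padding is needed, and an invalid date raises ValueError instead of producing a malformed string.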

How can I subtract a fixed date from a columns of date in excel file using Python?

I have a file with the below format:
name date
sam 21/1/2003
bil 5/4/2006
sam 4/7/2009
Mali 24/7/2009
bil 13/2/2008
etc...
I want to set a fixed date, for instance 1/1/2003, subtract it from each of the dates, and divide the differences into weeks to find out which names are registered in which weeks, then put them in a set. So I would like to get the final result below:
Sam=[week3,week12]
bil=[week25,week13] etc..
I have written the Python script below, but it is not working. I get this error:
val=set(start_date-'date(data.files.datetime)')
TypeError: unsupported operand type(s) for -: 'int' and 'str'
Does anyone have any idea of the best way to write the code for this?
import pprint
import csv
with open('d:/Results/names_info.csv', 'r') as csvfile:
    start_date= 1/1/2003
    filereader=csv.reader(csvfile,'excel')
    for row in filereader:
        for name in row:
            key=name
            val=set(start_date-'date(data.files.datetime)')
            datedict[key]=val
pprint.pprint (datedict)
You have several errors in your code:
1. Not ignoring the first line of your CSV file, which contains 'name' and 'date'.
2. Using strings to store dates instead of the date type.
3. Attempting to subtract one string from another.
4. Modifying items in datedict without first checking that they exist.
5. The slashes in 1/1/2003 are treated as division operators, so the result is 0 (integer division in Python 2) rather than a date.
Here is what your code would look like with these errors fixed:
import csv
from collections import defaultdict
import datetime
from datetime import date
import math
def weeks(filename, start_date):
    # The defaultdict class will create items when a key is accessed that does
    # not exist
    datedict = defaultdict(set)
    with open(filename, 'r') as csvfile:
        filereader = csv.reader(csvfile, 'excel')
        read_header = False
        for row in filereader:
            # Ignore the first row of the file
            if not read_header:
                read_header = True
                continue
            # Strip out any whitespace
            cells = [col.strip() for col in row]
            name = cells[0]
            date_str = cells[1]
            # Parse the date string into a date
            row_date = datetime.datetime.strptime(date_str, '%d/%m/%Y').date()
            # Calculate the difference between dates
            delta = start_date-row_date
            # Convert from days to weeks, you could use math.floor() here if
            # needed
            delta_weeks = int(math.ceil(delta.days / 7.0))
            datedict[name].add(delta_weeks)
    return datedict
date_dict = weeks('a.csv', start_date=date(year=2013, month=1, day=1))
for name, dates in date_dict.items():
    print(name, list(dates))
This prints out:
bil [351, 254]
sam [519, 182]
Mali [179]
You should be able to figure out how to get it to print 'weeks'.
You definitely want to make use of the datetime module in the standard library. A quick and dirty method to calculate the week difference could be the following:
import datetime
start_date = datetime.date(2003,1,1) # (YYYY,MM,DD)
another_date = datetime.date(2003,10,20)
difference = another_date - start_date # a timedelta object
weeks_between = difference.days // 7 + 1 # integer division, first week = 1
Also, if you want a dict of lists, replace datedict[key]=val with
try:
    datedict[key] += [val] # add the element val to your existing list
except KeyError: # catch error if key not in dict yet
    datedict[key] = [val] # add key to dict with val as one element list
Also, if you'd prefer the lists to have strings of the form week1, week12 etc., then simply use
val = 'week%d' % val
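Putting these pieces together, a compact sketch might look like the following. It rests on several assumptions: the file is comma-separated with a "name,date" header row, dates are day/month/year, week 1 covers the first seven days starting at the fixed date, and names are compared exactly as written (so "Sam" and "sam" would be different keys). The names weeks_by_name and SAMPLE are illustrative only.

```python
import csv
from collections import defaultdict
from datetime import date, datetime

SAMPLE = [
    "name,date",
    "sam,21/1/2003",
    "bil,5/4/2006",
    "sam,4/7/2009",
    "Mali,24/7/2009",
    "bil,13/2/2008",
]

def weeks_by_name(rows, start_date):
    """Map each name to its sorted week labels, counted from start_date (week 1 first)."""
    weeks = defaultdict(set)
    reader = csv.reader(rows)
    next(reader)  # skip the header row
    for name, date_str in reader:
        row_date = datetime.strptime(date_str.strip(), "%d/%m/%Y").date()
        days = (row_date - start_date).days
        weeks[name.strip()].add(days // 7 + 1)
    return {name: ["week%d" % w for w in sorted(ws)] for name, ws in weeks.items()}

result = weeks_by_name(SAMPLE, date(2003, 1, 1))
print(result["sam"][0])  # week3
```

Subtracting the fixed date from the row date keeps the week numbers positive for dates after the start, and the set removes duplicates when a name registers twice in the same week.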
