Split a date range into multiple ranges - Python

I have data in CSV like this:
1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
I want to separate the dates from 1 October of each year to 31 March of the next year, for all the data. So for the data above the output will be:
1940/1941:
1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941/1942:
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
My code attempt is:
import csv
from datetime import datetime

with open('data.csv', 'r') as f:
    data = list(csv.reader(f))

quaters = []
year = datetime.strptime(data[0][0], '%Y-%m-%d').year
for each in data:
    date = datetime.strptime(each[0], '%Y-%m-%d')
    print(each)
    if date >= datetime(year=date.year, month=10, day=1) and date <= datetime(year=date.year + 1, month=3, day=31):
        middle_quaters[-1].append(each)
    if year != date.year:
        quaters.append([])
But I am not getting the expected output. I want to store each range of dates in a separate list.

I would use a pandas DataFrame for this; it makes selecting rows between dates much easier. Follow this:
Pandas: Selecting DataFrame rows between two dates (Datetime Index)
So for your case:
import pandas as pd

# parse the first column as dates and make it the index
df = pd.read_csv("data.csv", header=None, names=["date", "value"],
                 parse_dates=["date"], index_col="date")

df.loc[startDate:endDate]
# you can walk through a bunch of ranges like so:
listOfDateRanges = [(), (), ()]  # fill in your (start, end) tuples
for date_range in listOfDateRanges:
    print(df.loc[date_range[0]:date_range[1]])
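Filling in those ranges for this question's 1 Oct - 31 Mar seasons, a minimal sketch (reusing df from above; the loop bounds are just the years present in the file):
df = df.sort_index()  # date-string slicing needs a sorted DatetimeIndex
for y in range(df.index.min().year, df.index.max().year + 1):
    season = df.loc[f"{y}-10-01":f"{y + 1}-03-31"]  # both endpoints inclusive
    if not season.empty:
        print(f"{y}/{y + 1}:")
        print(season)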

Without external packages: create a lookup keyed on the field of choice, then turn the key into an int and use less-than / greater-than comparisons to establish the range.
import re
data = '''1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue'''
lookup = {}
lines = data.split('\n')
for line in lines:
    d = re.sub(r'-', '', line.split(',')[0])
    lookup[d] = line
dates = sorted(lookup.keys())
_in = 19401201
out = 19411004
outfile = []
for date in dates:
    if int(date) > _in and int(date) < out:
        outfile.append(lookup[date])
for l in outfile:
    print(l)
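To go beyond a single hard-coded range, here is a sketch that groups every October-to-March season at once (it reuses the lookup and dates built above; the season labels are my own choice):
import collections

seasons = collections.defaultdict(list)
for date in dates:
    year, month = int(date[:4]), int(date[4:6])
    if month >= 10:
        # Oct-Dec rows open the year/year+1 season
        seasons['%d/%d' % (year, year + 1)].append(lookup[date])
    elif month <= 3:
        # Jan-Mar rows close the year-1/year season
        seasons['%d/%d' % (year - 1, year)].append(lookup[date])

for season in sorted(seasons):
    print(season + ':')
    for line in seasons[season]:
        print(line)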

For this purpose you can use the pandas library. Here is sample code:
import pandas as pd
df = pd.read_csv('so.csv', parse_dates=['timestamp'])  # timestamp is your time column
current_year, next_year = 1940, 1941
df = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
print(df)
This gives the following result on your data:
   timestamp      value
0 1940-10-01  somevalue
1 1940-11-02  somevalue
2 1940-11-03  somevalue
3 1940-11-04  somevalue
4 1940-12-05  somevalue
5 1940-12-06  somevalue
6 1941-01-07  somevalue
7 1941-02-08  somevalue
8 1941-03-09  somevalue
Hope this helps!
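If you need every season rather than one fixed pair of years, one possible variation (my sketch, not part of the answer above) labels each row with the year its season starts in and groups on that:
import pandas as pd

df = pd.read_csv('so.csv', parse_dates=['timestamp'])
# a season starts in October; Jan-Mar rows belong to the previous year's season
df['season_start'] = df.timestamp.dt.year.where(df.timestamp.dt.month >= 10,
                                                df.timestamp.dt.year - 1)
df = df[(df.timestamp.dt.month >= 10) | (df.timestamp.dt.month <= 3)]
for start, group in df.groupby('season_start'):
    print(f'{start}/{start + 1}:')
    print(group.drop(columns='season_start'))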

Related

Iterate over dates, calculate averages for every 24-hour period

I have a csv file with data every ~minute over 2 years, and I want to calculate 24-hour averages. Ideally I'd like the code to iterate over the data, calculate averages and standard deviations, and R^2 between dataA and dataB, for every 24-hour period, and then output this new data into a new csv file (with a datestamp and the calculated data for each 24-hour period).
The data has an unusual timestamp which I think might be tripping me up slightly. I've been trying different for loops to iterate over the data, but I'm not sure how to specify that I want the averages, etc. for each 24-hour period.
This is the code I have so far, but I'm not sure how to complete the for loop to achieve what I want. If anyone can help that would be great!
import math
import pandas as pd
import os
import numpy as np
from datetime import timedelta, date

# read the file in csv
data = pd.read_csv("Jacaranda_data_HST.csv")
# Extract the data columns from the csv
data_date = data.iloc[:, 1]
dataA = data.iloc[:, 2]
dataB = data.iloc[:, 3]
# set the start and end dates of the data
start_date = data_date.iloc[0]
end_date = data_date.iloc[-1]
# for loop to run over every 24 hours of data
day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count))
                    if d <= end_date]:
    print(np.mean(dataA), np.mean(dataB), np.std(dataA), np.std(dataB))
# output new csv file - **unsure how to call the data**
csvfile = "Jacaranda_new.csv"
outdf = pd.DataFrame()
# outdf['dataA_mean'] = ??
# outdf['dataB_mean'] = ??
# outdf['dataA_stdev'] = ??
# outdf['dataB_stdev'] = ??
outdf.to_csv(csvfile, index=False)
A simplified approach could be to group by calendar day in a dict. I don't have much experience with pandas time handling in DataFrames, so this could be an alternative.
You could create a dict where the keys are the dates of the data (without the time part), so you can later calculate the mean of all the data points that fall under each key.
data_date = data.iloc[:, 1]
data_a = data.iloc[:, 2]
data_b = data.iloc[:, 3]

import collections
dd_a = collections.defaultdict(list)
dd_b = collections.defaultdict(list)

for date_str, data_point_a, data_point_b in zip(data_date, data_a, data_b):
    # we split the string on the first space, so we keep only the date part
    date_part, _ = date_str.split(' ', maxsplit=1)
    dd_a[date_part].append(data_point_a)
    dd_b[date_part].append(data_point_b)
Now you can calculate the averages:
for date, v_list in dd_a.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))

for date, v_list in dd_b.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
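If you do end up using pandas for the whole job, a sketch of the equivalent groupby route (the file name and column positions come from the question; that pd.to_datetime can parse the unusual timestamp is an assumption):
import pandas as pd

data = pd.read_csv("Jacaranda_data_HST.csv")
# strip the time part so every row maps to its calendar day
data['day'] = pd.to_datetime(data.iloc[:, 1]).dt.normalize()
colA, colB = data.columns[2], data.columns[3]

# mean and standard deviation of both columns per 24-hour period
daily = data.groupby('day').agg(
    dataA_mean=(colA, 'mean'), dataA_stdev=(colA, 'std'),
    dataB_mean=(colB, 'mean'), dataB_stdev=(colB, 'std'))
daily.to_csv("Jacaranda_new.csv")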

Data Formatting pandas

I am trying to enter a line of code that creates a row for the index 31st January 1995. I am unable to get the row to look like 31/01/1995; instead the output is 1995-01-31 00:00:00.
My original data is in a dataframe called MainData.
I am trying to add a row at the top for 31st January 1995 in the same format as the data below.
My code is
MainData.loc[pd.to_datetime('31/01/1995',format='%d/%m/%Y'),:] = [100 for number in range(7)]
MainData
Please let me know if there is a way to reformat this to 31/01/1995.
Thanks in advance.
import datetime
import pandas as pd

# Making the data look more normal by removing the first column index level
MainData = MainData.rename(columns=MainData.iloc[0])
MainData = MainData.iloc[1:]
# Re-adjusting the index to a datetime format
MainData['DateAdjusted'] = MainData.index
MainData = MainData.reset_index(drop=True)
MainData['DateAdjusted'] = pd.to_datetime(MainData['DateAdjusted'], dayfirst=True)
# Just renaming the column and converting the index back to Date
MainData.rename(columns={'DateAdjusted': 'Date'}, inplace=True)
MainData.index = MainData['Date']
del MainData['Date']
# Defining the date for the row I want to add
InitialDate = "31/01/1995"
format_str = '%d/%m/%Y'
datetime_obj = datetime.datetime.strptime(InitialDate, format_str)
print(datetime_obj.date())
MainData.loc[datetime_obj, :] = [100 for number in range(7)]
MainData = MainData.sort_index(ascending=True)
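One caveat: a pandas DatetimeIndex always prints in ISO form (1995-01-31), no matter how the dates were entered. If the goal is literally to display 31/01/1995, one option (my addition, not part of the code above) is to format the finished index as strings:
# after all datetime work is done, render the index as dd/mm/yyyy strings
MainData.index = MainData.index.strftime('%d/%m/%Y')
The index becomes plain strings at this point, so do any sorting or date arithmetic before this step.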

Convert DataFrame column date from 2/3/2007 format to 20070203 with python

I have a dataframe with 'Date' and 'Value', where the Date is in format m/d/yyyy. I need to convert to yyyymmdd.
df2= df[["Date", "Transaction"]]
I know datetime can do this for me, but I can't get it to accept my format.
example data files:
6/15/2006,-4.27,
6/16/2006,-2.27,
6/19/2006,-6.35,
You first need to convert to datetime using pd.to_datetime, then you can format it as you wish using strftime:
>>> df
        Date  Transaction
0  6/15/2006        -4.27
1  6/16/2006        -2.27
2  6/19/2006        -6.35
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y').dt.strftime('%Y%m%d')
>>> df
       Date  Transaction
0  20060615        -4.27
1  20060616        -2.27
2  20060619        -6.35
You can say:
df['Date'] = df['Date'].dt.strftime('%Y%m%d')
The dt accessor's strftime method is your clear friend now.
Note: if you didn't convert to pandas datetime yet, do:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m%d')
Output:
       Date  Transaction
0  20060615        -4.27
1  20060616        -2.27
2  20060619        -6.35
For a raw Python solution, you could try something along the following lines (assuming datafile is a string).
datafile = "6/15/2006,-4.27,\n6/16/2006,-2.27,\n6/19/2006,-6.35"

def zeroPad(s, desiredLen):
    # left-pad with zeros until the string reaches the desired length
    while len(s) < desiredLen:
        s = "0" + s
    return s

def convToYYYYMMDD(datafile):
    datafile = ''.join(datafile.split('\n'))  # remove \n's, they're unreliable and not needed
    datafile = datafile.split(',')  # split by comma so '1,2' becomes ['1','2']
    out = []
    for i in range(0, len(datafile)):
        if i % 2 == 0:  # even positions hold the dates, odd ones the values
            tmp = datafile[i].split('/')
            yyyymmdd = zeroPad(tmp[2], 4) + zeroPad(tmp[0], 2) + zeroPad(tmp[1], 2)
            out.append(yyyymmdd)
        else:
            out.append(datafile[i])
    return out

print(convToYYYYMMDD(datafile))
This outputs: ['20060615', '-4.27', '20060616', '-2.27', '20060619', '-6.35'].
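For comparison, the same conversion can also be done with the stdlib datetime module; a short sketch (my addition, separate from the answer above):
from datetime import datetime

def conv_date(mdy):
    # '6/15/2006' -> '20060615'; %m and %d accept non-zero-padded values
    return datetime.strptime(mdy, '%m/%d/%Y').strftime('%Y%m%d')

print(conv_date('6/15/2006'))  # 20060615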

Reading and calculation using csv

I'm new to Python, and pardon me if this question sounds silly.
I have a csv file that has 2 columns - Value and Timestamp. I'm trying to write code that takes 2 parameters - start_date and end_date - and traverses the csv file to obtain all the values between those 2 dates, then prints the sum of Value.
Below is my code. I'm trying to read and store the values in a list.
f_in = open('Users2.csv').readlines()
Value1 = []
Created = []
for i in range(1, len(f_in)):
    Value, created_date = f_in[i].split(',')
    Value1.append(Value)
    Created.append(created_date)
print(Value1)
print(Created)
My csv has the following format
10 2010-02-12 23:31:40
20 2010-10-02 23:28:11
40 2011-03-12 23:39:40
10 2013-09-10 23:29:34
420 2013-11-19 23:26:17
122 2014-01-01 23:41:51
When I run my code (File1.py) as below:
File1.py 2010-01-01 2011-03-31
the output should be 70.
I'm running into the following issues:
1. The data in the csv is a timestamp (created_date), but the parameter passed is a date, so I need to convert and get the data between those 2 dates regardless of the time of day.
2. Once I have it in a list, as described above, how do I proceed with the calculation, considering the condition in point 1?
You can try this:
import csv
data = csv.reader(open('filename.csv'))
start_date = 10
end_date = 30
times = [' '.join(i) for i in data if int(i[0]) in range(start_date, end_date)]
Depending on your file size, you may consider putting the values from the csv file into a database and then querying for your results.
The csv module has DictReader, which allows you to predefine your column names; it greatly improves readability, especially while working on really big files.
import csv
from datetime import datetime

COLUMN_NAMES = ['value', 'timestamp']

def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv', mode='r') as csvfile:
        table = csv.DictReader(csvfile, fieldnames=COLUMN_NAMES)
        for row in table:
            if row['timestamp'] >= start_date and row['timestamp'] <= end_date:
                total += int(row['value'])
    return total
If you are open to using pandas, try this:
>>> import pandas as pd
>>> data = 'Users2.csv'
>>>
>>> dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
>>> df = pd.read_csv(data, names=['value', 'date'], parse_dates=['date'], date_parser=dateparse)
>>> result = df['value'][(df['date'] > '2010-01-01') &
... (df['date'] < '2011-03-31')
... ].sum()
>>> result
70
Since you said the dates are timestamps, you can compare them as strings. Realizing that, what you want to achieve (sum the values if created is between start_date and end_date) can be done like this:
def sum_values(start_date, end_date):
    total = 0
    with open('Users2.csv') as f:
        for line in f:
            value, created = line.split(' ', 1)
            if start_date < created < end_date:
                total += int(value)
    return total
str.split(' ', 1) will split on ' ' but will stop after the first split. start_date and end_date must be in the format yyyy-MM-dd hh:mm:ss, which I assume they are, since they are in timestamp format. Just keep that in mind.
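Since the question actually passes plain dates (yyyy-MM-dd) rather than full timestamps, a small variation of the same idea (my addition) compares only the first 10 characters of each timestamp, so the time of day is ignored:
def sum_values_by_date(start_date, end_date):
    # start_date / end_date are 'YYYY-MM-DD' strings
    total = 0
    with open('Users2.csv') as f:
        for line in f:
            value, created = line.split(' ', 1)
            if start_date <= created[:10] <= end_date:
                total += int(value)
    return total

print(sum_values_by_date('2010-01-01', '2011-03-31'))  # 70 for the sample data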

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as characters (e.g. 'NOV') with their numerical counterparts (e.g. '-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to handle all twelve months at once.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas TimeStamps like this:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
#    User_ID     ltouch_datetime conversion_datetime
# 0        1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1        2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2        3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as TimeStamps is the ideal
form for the data in Pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
   User_ID    ltouch_datetime conversion_datetime
0        1  11-11-14:13:12:56   11-11-14:16:12:00
1        2  07-11-14:17:46:14   08-11-14:13:10:00
2        3  04-12-14:17:46:14   04-12-15:13:12:00
To answer your question literally, in order to use Series.str.replace you need a column with the month string abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar

month_map = {calendar.month_abbr[m].upper(): '-{:02d}-'.format(m)
             for m in range(1, 13)}

df = pd.read_table('data', sep=r'\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
yields
   User_ID    ltouch_datetime conversion_datetime
0        1  11-11-14:13:12:56   11-11-14:16:12:00
1        2  07-11-14:17:46:14   08-11-14:13:10:00
2        3  04-12-14:17:46:14   04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware
that if your data is in a file, you can parse the datestring columns into
TimeStamps directly using
import pandas as pd
import datetime as DT
df = pd.read_table(
    'data', sep=r'\s+', parse_dates=[1, 2],
    date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want TimeStamps).
