Python: Writing and printing to CSV a dataframe

I have a vector of dates of size 10 and type numpy.ndarray. I also have an array of temperatures at each hour of size 10x24.
I want to print the dates in column A and the corresponding temperatures in columns B through Y, for rows 1 through 10, in a CSV file.
My arrays look as follows:
print(AllDays)
[datetime.date(2008, 12, 31) datetime.date(2009, 1, 1)
datetime.date(2009, 1, 2) ..., datetime.date(2015, 11, 28)
datetime.date(2015, 11, 29) datetime.date(2015, 11, 30)]
So far I have been trying to implement this using dataframes, as below:
TempDay = pd.DataFrame()
TempDay['Dates'] = AllDays #of size 10
TempDay['Temperature'] = TemperatureArray #of size 10x24
If the previous step had worked, I was aiming for:
TempDay.to_csv('C:\MyFile.csv')
But the above method has not been working.

It's not working because you're trying to assign a whole 10x24 array to a single column. You could instead construct the pandas DataFrame from your TemperatureArray and then add the Dates column:
TempDay = pd.DataFrame(TemperatureArray)
TempDay['Dates'] = AllDays
TempDay.to_csv('C:\MyFile.csv')
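Note that this appends Dates as the last column. If the dates really need to land in column A of the CSV, a minimal sketch (assuming the same AllDays and TemperatureArray arrays from the question) is to insert the column at position 0 before writing; index=False drops the row numbers pandas would otherwise add:
import pandas as pd

TempDay = pd.DataFrame(TemperatureArray)       # columns 0-23, one per hour
TempDay.insert(0, 'Dates', AllDays)            # put the dates in the first column
TempDay.to_csv(r'C:\MyFile.csv', index=False)  # raw string avoids backslash escapes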

Related

Take 2 random dates from list of dates based on days range condition

I would like to randomly choose 2 dates from a list of dates, with the condition that if the range between the 2 randomly chosen dates is lower than 100 days we take them; if not, we look for another 2 random dates from the list.
Code below will not work but this is more or less what I would like to achieve:
dates = [date(2020, 1, 25), date(2020, 2, 23), date(2020, 3, 27)]

def get_2_random_dates_with_days_range_100():
    choosen_dates = random.choice([a, b for a, b in dates if a.days - b.days <= 100])
The code above also shows more or less what I've tried, which is to use a list comprehension with the condition, but it would require me to unpack 2 values from the list.
Thanks
Maybe first get one random date, next create a list of dates which match the condition(s), and then get the second random date from this list. If it can't find any matching dates, then start from the beginning and select the first random date again, etc.
from datetime import date
import random

dates = [date(2020, 1, 25), date(2020, 2, 23), date(2020, 3, 27),
         date(1920, 1, 25), date(1920, 2, 23), date(1920, 3, 27),
         date(2025, 1, 25), date(2025, 2, 23), date(2025, 3, 27)]

while True:
    first = random.choice(dates)
    print('   first:', first)

    # dates other than `first` that lie within 100 days of it
    matching = [x for x in dates if 0 < abs(x - first).days <= 100]
    print('matching:', matching)

    if matching:
        second = random.choice(matching)
        print('  second:', second)
        break
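An alternative sketch that avoids the retry loop: enumerate every valid pair up front with itertools.combinations and pick one at random. This assumes the list is small enough to enumerate all pairs:
from datetime import date
from itertools import combinations
import random

dates = [date(2020, 1, 25), date(2020, 2, 23), date(2020, 3, 27)]

# all pairs whose gap is at most 100 days
valid_pairs = [(a, b) for a, b in combinations(dates, 2)
               if abs((a - b).days) <= 100]

if valid_pairs:
    first, second = random.choice(valid_pairs)
    print(first, second)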

How to sum a slice from a pandas dataframe

I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous weeks.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df = pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week = df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions = ['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
Assuming that df['Sessions'] holds one value per day, and you are comparing the current and previous week only, you can use reshape to split the last 14 values into weeks.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then, you can sum each row and get the weekly sum, most recent will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D-array which is accessed by the values attribute of the pandas Series. It contains the last 14 days, which is ordered from most recent to the oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum can take an axis argument, which controls how the value is computed:
- If axis=None, all elements are added in a grand total.
- If axis=0, all rows in each column are added. In the case of weekly_matrix, this results in a 7-element 1D-array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as we are actually adding equivalent days of each week.
- If axis=1 (as in the solution), all columns in each row are added, producing a 2-element 1D-array in the case of weekly_matrix. The order of this result array follows the order of the rows in the matrix (i.e., element 0 is the total of the first row, and element 1 is the total of the second row). Since we know that the first row is the current week, and the second row is the previous week, we can extract the information using these indexes:
weekly_sum = np.sum(weekly_matrix, axis=1)
# weekly_sum = array([77, 28])
current_week = weekly_sum[0]    # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1]   # sum of [ 7,  6,  5,  4,  3,  2, 1] = 28
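To see the slice and the reshape working together, here is a compact, runnable sketch using a plain 1-to-14 range as a stand-in for the 'Sessions' column (day 1 oldest, day 14 most recent):
import numpy as np
import pandas as pd

sessions = pd.Series(np.arange(1, 15))   # synthetic stand-in for df['Sessions']

x = sessions[:-15:-1].values             # last 14 values, most recent first
# x = [14 13 12 11 10  9  8  7  6  5  4  3  2  1]

weekly_matrix = x.reshape((2, 7))        # row 0: current week, row 1: previous week
print(weekly_matrix.sum(axis=1))         # [77 28]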
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32
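For completeness, the slice in the question comes out empty because iloc[-1:-7] walks forward from position -1 with the default step of +1 and stops immediately. With the rows ordered oldest to newest, the two weekly totals can also be taken directly with iloc slices, which is likely what the question was reaching for (a sketch, reusing the df from the question):
current_week = df['Sessions'].iloc[-7:].sum()      # last 7 days
previous_week = df['Sessions'].iloc[-14:-7].sum()  # the 7 days before that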

Python bar chart with duplicate datetimes in index

I would like to make a bar chart in Python using matplotlib.pyplot. The data consists of an index, which is a list of datetimes, and a number corresponding to each datetime. I have various samples that all belong to the same day. However, when making the bar chart, it only shows the first sample corresponding to a certain datetime, instead of all of them. How can I make a bar chart showing every entry?
The index has the following structure:
ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
       datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
and the values are just integers:
values = [10, 20, 30, 40]
So when plotting, it only shows the bars 2017-3-1 with value 10, and 2017-3-15 with value 30. How can I show all of them?
You can group by the dates, sum the values, and then plot the bar chart from the resulting dataframe:
df = pd.DataFrame(data=values, index=ind)
df = df.groupby(df.index).sum()
df.plot(kind='bar')
If what you want is for all values to appear in the plot, regardless of the date, you can simply use:
df.plot(kind='bar')
Entries with duplicate dates will then be plotted independently.
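Putting it together, a minimal runnable sketch with the data from the question (the matplotlib import is assumed):
import datetime
import pandas as pd
import matplotlib.pyplot as plt

ind = [datetime.datetime(2017, 3, 1, 0, 0), datetime.datetime(2017, 3, 1, 0, 0),
       datetime.datetime(2017, 3, 15, 0, 0), datetime.datetime(2017, 3, 15, 0, 0)]
values = [10, 20, 30, 40]

df = pd.DataFrame(data=values, index=ind)

# one bar per date, duplicates summed: 2017-03-01 -> 30, 2017-03-15 -> 70
df.groupby(df.index).sum().plot(kind='bar')

# or one bar per entry, keeping duplicate dates side by side
df.plot(kind='bar')
plt.show()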

Python: averaging across repeat dates

I have a numpy array A of size 100 with datetimes stored as:
>>>A[0]
datetime.datetime(2011, 1, 1, 0, 0)
The other 99 elements are also datetime.datetime objects, but a few of them repeat, e.g.
A[55]
datetime.datetime(2011, 11, 2, 0, 0)
A[56]
datetime.datetime(2011, 11, 2, 0, 0)
I have another array, Temperature, of the same size as A, with values corresponding to the rows of A:
Temperature[0] = 55
Temperature[55] = 40
Temperature[56] = 50
I am trying to obtain a new array A2 which only has the unique datetimes from A, averaging the temperatures of the corresponding repeats.
So in this case A2 will have only one datetime.datetime(2011, 11, 2, 0, 0), and its temperature will be 0.5*(40+50) = 45.
I am trying to use pandas pivot table as:
DayLightSavCure = pd.pivot_table(pd.DataFrame({'DateByHour': A, 'Temp': Temperature}), index=['DateByHour'], values=['Temp'], aggfunc=[np.mean])
But the error is:
ValueError: If using all scalar values, you must pass an index
I do actually concur that this could be achieved without digging into pandas; itertools is really nice for this. Written for Python 3.5+ (because of the statistics module):
from itertools import groupby
from operator import itemgetter
from random import randint
import datetime
from statistics import mean

# Generate test data
dates = [datetime.datetime(2005, i % 12 + 1, 5, 5, 5, 5) for i in range(100)]
temperatures = [randint(0, 100) for _ in range(100)]

# Calculate averages
# Group data points by unique dates using `groupby`, `sorted` and `zip`
grouped = groupby(sorted(zip(dates, temperatures)), key=itemgetter(0))
# Calculate mean per unique date
averaged = [(key, mean(temperature[1] for temperature in values)) for key, values in grouped]

print(averaged)  # List of tuples
# [(datetime.datetime(2005, 1, 5, 5, 5, 5), 65.22222222222223), (datetime.datetime(2005, 2, 5, 5, 5, 5), 60.0), ...
print(dict(averaged))  # Nicer as a dict
# {datetime.datetime(2005, 3, 5, 5, 5, 5): 48.111111111111114, datetime.datetime(2005, 12, 5, 5, 5, 5): 43.75, ...
If you have to have two separate lists/iterators at the end of the calculation, just apply zip(*averaged) to unpack the pairs.
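For reference, the pandas route from the question also works once the arrays are wrapped in a DataFrame; a plain groupby is simpler than pivot_table here. A minimal sketch, assuming the A and Temperature arrays from the question (the name T2 is ours):
import pandas as pd

df = pd.DataFrame({'DateByHour': A, 'Temp': Temperature})

averaged = df.groupby('DateByHour')['Temp'].mean()  # mean per unique datetime

A2 = averaged.index.values   # unique datetimes
T2 = averaged.values         # averaged temperatures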

Group together arbitrary date objects that are within a time range of each other

I want to split the calendar into two-week intervals starting at 2008-May-5, or any arbitrary starting point.
So I start with several date objects:
import datetime as DT
raw = ("2010-08-01",
"2010-06-25",
"2010-07-01",
"2010-07-08")
transactions = [(DT.datetime.strptime(datestring, "%Y-%m-%d").date(),
"Some data here") for datestring in raw]
transactions.sort()
By manually analyzing the dates, I am quite able to figure out which dates fall within the same fortnight interval. I want to get a grouping similar to this one:
# Fortnight interval 1
(datetime.date(2010, 6, 25), 'Some data here')
(datetime.date(2010, 7, 1), 'Some data here')
(datetime.date(2010, 7, 8), 'Some data here')
# Fortnight interval 2
(datetime.date(2010, 8, 1), 'Some data here')
import datetime as DT
import itertools

start_date = DT.date(2008, 5, 5)

def mkdate(datestring):
    return DT.datetime.strptime(datestring, "%Y-%m-%d").date()

def fortnight(date):
    return (date - start_date).days // 14

raw = ("2010-08-01",
       "2010-06-25",
       "2010-07-01",
       "2010-07-08")

transactions = [(date, "Some data") for date in map(mkdate, raw)]
transactions.sort(key=lambda transaction: transaction[0])

for key, grp in itertools.groupby(transactions, key=lambda transaction: fortnight(transaction[0])):
    print(key, list(grp))
yields
# (55, [(datetime.date(2010, 6, 25), 'Some data')])
# (56, [(datetime.date(2010, 7, 1), 'Some data'), (datetime.date(2010, 7, 8), 'Some data')])
# (58, [(datetime.date(2010, 8, 1), 'Some data')])
Note that 2010-6-25 is in the 55th fortnight from 2008-5-5, while 2010-7-1 is in the 56th. If you want them grouped together, simply change start_date (to something like 2008-5-16).
PS. The key tool used above is itertools.groupby, which is explained in detail here.
Edit: The lambdas are simply a way to make "anonymous" functions. (They are anonymous in the sense that they are not given names, unlike functions defined by def.) Anywhere you see a lambda, it is also possible to use def to create an equivalent function. For example, you could do this:
import operator

transactions.sort(key=operator.itemgetter(0))

def transaction_fortnight(transaction):
    date, data = transaction
    return fortnight(date)

for key, grp in itertools.groupby(transactions, key=transaction_fortnight):
    print(key, list(grp))
Use itertools.groupby with a lambda function that divides the distance from the starting point by the length of the period.
>>> for i, group in groupby(range(30), lambda x: x // 7):
print list(group)
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]
So with dates:
import itertools as it

start = DT.date(2008, 5, 5)
lenperiod = 14

for fnight, info in it.groupby(transactions, lambda data: (data[0] - start).days // lenperiod):
    print(list(info))
You can also use week numbers from strftime, with lenperiod given as a number of weeks:
for fnight, info in it.groupby(transactions, lambda data: int(data[0].strftime('%W')) // lenperiod):
    print(list(info))
Using a pandas DataFrame with resample works too. Given OP's data, but changing "some data here" to 'abcd':
>>> import datetime as DT
>>> raw = ("2010-08-01",
... "2010-06-25",
... "2010-07-01",
... "2010-07-08")
>>> transactions = [(DT.datetime.strptime(datestring, "%Y-%m-%d"), data) for
... datestring, data in zip(raw,'abcd')]
[(datetime.datetime(2010, 8, 1, 0, 0), 'a'),
(datetime.datetime(2010, 6, 25, 0, 0), 'b'),
(datetime.datetime(2010, 7, 1, 0, 0), 'c'),
(datetime.datetime(2010, 7, 8, 0, 0), 'd')]
Now try using pandas. First create a DataFrame, naming the columns and setting the indices to the dates.
>>> import pandas as pd
>>> df = pd.DataFrame(transactions,
... columns=['date','data']).set_index('date')
data
date
2010-08-01 a
2010-06-25 b
2010-07-01 c
2010-07-08 d
Now use the series offset aliases to resample every 2 weeks starting on Sundays, and concatenate the results.
>>> fortnight = df.resample('2W-SUN').sum()
data
date
2010-06-27 b
2010-07-11 cd
2010-07-25 0
2010-08-08 a
Now drill into the data as needed, by week start
>>> fortnight.loc['2010-06-27']['data']
b
or index
>>> fortnight.iloc[0]['data']
b
or indices
>>> data = fortnight.iloc[:2]['data']
b
date
2010-06-27 b
2010-07-11 cd
Freq: 2W-SUN, Name: data, dtype: object
>>> data[0]
b
>>> data[1]
cd
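A sketch tying the two answers together: label each row with its fortnight number relative to the arbitrary start date, then group on that label (this reuses the df built above; the fortnight column name is ours):
import datetime as DT

start = DT.date(2008, 5, 5)
df['fortnight'] = [(d.date() - start).days // 14 for d in df.index]
print(df.groupby('fortnight')['data'].apply(''.join))
# fortnight
# 55     b
# 56    cd
# 58     a
# Name: data, dtype: object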
