I have a numpy array A of size 100 storing datetime objects:
>>>A[0]
datetime.datetime(2011, 1, 1, 0, 0)
The other 99 elements are also datetime.datetime objects, but a few of them repeat, e.g.
A[55]
datetime.datetime(2011, 11, 2, 0, 0)
A[56]
datetime.datetime(2011, 11, 2, 0, 0)
I have another array, Temperature, of the same size as A, with values corresponding to the rows of A:
Temperature[0] = 55
Temperature[55] = 40
Temperature[56] = 50
I am trying to obtain a new array A2 which contains only the unique datetimes from A, averaging the corresponding temperatures where they repeat.
So in this case A2 will contain only one datetime.datetime(2011, 11, 2, 0, 0), and its temperature will be 0.5*(40+50) = 45.
I am trying to use pandas pivot table as:
DayLightSavCure = pd.pivot_table(pd.DataFrame({'DateByHour': A, 'Temp': Temperature}), index=['DateByHour'], values=['Temp'], aggfunc=[np.mean])
But the error is:
ValueError: If using all scalar values, you must pass an index
I actually concur with the other answer: this could be achieved without digging into pandas. itertools is really nice for this. Written for Python 3.4+ (because of the statistics module):
from itertools import groupby
from operator import itemgetter
from random import randint
import datetime
from statistics import mean
# Generate test data
dates = [datetime.datetime(2005, i % 12 + 1, 5, 5, 5, 5) for i in range(100)]
temperatures = [randint(0, 100) for _ in range(100)]
# Calculate averages
## Group data points by unique dates using `groupby`, `sorted` and `zip`
grouped = groupby(sorted(zip(dates, temperatures)), key=itemgetter(0))
## Calculate mean per unique date
averaged = [(key, mean(temperature[1] for temperature in values)) for key, values in grouped]
print(averaged)  # List of tuples
# [(datetime.datetime(2005, 1, 5, 5, 5, 5), 65.22222222222223), (datetime.datetime(2005, 2, 5, 5, 5, 5), 60.0), ...]
print(dict(averaged))  # Nicer as a dict
# {datetime.datetime(2005, 3, 5, 5, 5, 5): 48.111111111111114, datetime.datetime(2005, 12, 5, 5, 5, 5): 43.75, ...}
If you need two separate lists/iterators at the end of the calculation, just apply zip(*averaged).
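For reference, the pandas route attempted in the question also works once the arrays are wrapped in a DataFrame; a plain groupby is a simpler fit than pivot_table here. A minimal sketch with stand-in data (the real A and Temperature arrays would drop in directly):
import datetime
import pandas as pd
# stand-ins for the question's arrays
A = [datetime.datetime(2011, 11, 2), datetime.datetime(2011, 11, 2),
     datetime.datetime(2011, 1, 1)]
Temperature = [40, 50, 55]
df = pd.DataFrame({'DateByHour': A, 'Temp': Temperature})
print(df.groupby('DateByHour')['Temp'].mean())
# DateByHour
# 2011-01-01    55.0
# 2011-11-02    45.0
# Name: Temp, dtype: float64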
Related
I have a dataframe which stores different variables. I'm using OLS linear regression and using all of the variables to predict the 'price' column.
import pandas as pd
import statsmodels.api as sm
data = {'accommodates':[2, 2, 3, 2, 2, 6, 8, 4, 3, 2],
'bedrooms':[1, 2, 1, 1, 3, 4, 2, 2, 2, 3],
'instant_bookable':[1, 0, 1, 1, 1, 1, 0, 0, 0, 1],
'availability_365':[123, 3, 33, 14, 15, 16, 3, 41, 61, 74],
'minimum_nights':[3, 12, 1, 4, 6, 7, 2, 3, 6, 10],
'beds':[2, 2, 3, 4, 1, 5, 6, 2, 3, 2],
'price':[59, 234, 15, 162, 56, 42, 28, 52, 22, 31]}
df = pd.DataFrame(data, columns = ['accommodates', 'bedrooms', 'instant_bookable', 'availability_365',
'minimum_nights', 'beds', 'price'])
I have a for loop which calculates the Adjusted R squared value for each variable:
fit_d = {}
for columns in [x for x in df.columns if x != 'price']:
    Y = df['price']
    X = df[columns]
    X = sm.add_constant(X)
    model = sm.OLS(Y, X, missing='drop').fit()
    fit_d[columns] = model.rsquared
fit_d
How can I modify my code to find the combination of variables that gives the largest adjusted R squared value? Ideally the function would find the variable with the largest adj. R squared value first, then, keeping that variable, iterate over the remaining ones to find the best pair, then three variables, and so on, until the value cannot be increased further. I'd like the output to be something like
Best variables: {'accommodates', 'availability', 'bedrooms'}
Here is a "brute force" way to try all possible combinations (from itertools) of different lengths and find the variables giving the highest R value. The idea is to use two loops: one over the number of variables to try, and one over all combinations of that length.
from itertools import combinations
# all possible columns for X
cols = [x for x in df.columns if x != 'price']
# define Y once, the same across all loops
Y = df['price']
# define result dictionary
fit_d = {}
# loop over every combination length
for i in range(1, len(cols)+1):
    # loop over all combinations of length i
    for comb in combinations(cols, i):
        # define X from the combination
        X = df[list(comb)]
        X = sm.add_constant(X)
        # perform the OLS operation
        model = sm.OLS(Y, X, missing='drop').fit()
        # save the rsquared in a dictionary
        fit_d[comb] = model.rsquared
# extract the key with the max R value
key_max = max(fit_d, key=fit_d.get)
print(f'Best variables {key_max} for a R-value of {round(fit_d[key_max], 5)}')
# Best variables ('accommodates', 'bedrooms', 'instant_bookable', 'availability_365', 'minimum_nights', 'beds') for a R-value of 0.78506
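The stepwise procedure the question describes (greedily adding one variable at a time) can be sketched as below; this reuses df, cols, and Y from above. Note it scores model.rsquared_adj rather than model.rsquared, since plain R-squared never decreases when a variable is added, and that greedy selection is not guaranteed to find the global optimum:
# greedy forward selection on adjusted R-squared (a sketch, reusing df, cols, Y)
selected = []
best_adj = -float('inf')
remaining = list(cols)
while remaining:
    # score each remaining candidate when added to the current selection
    scores = {}
    for col in remaining:
        X = sm.add_constant(df[selected + [col]])
        scores[col] = sm.OLS(Y, X, missing='drop').fit().rsquared_adj
    best_col = max(scores, key=scores.get)
    if scores[best_col] <= best_adj:
        break  # no candidate improves adjusted R-squared; stop
    best_adj = scores[best_col]
    selected.append(best_col)
    remaining.remove(best_col)
print(f'Best variables {selected} with adjusted R-squared {round(best_adj, 5)}')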
I'm trying to sum a portion of the sessions in my dictionary so I can get totals for the current and previous week.
I've converted the JSON into a pandas dataframe in one test. I'm summing the total of the sessions using the .sum() function in pandas. However, I also need to know the total sessions from this week and the week prior. I've tried a few methods to sum values (-1:-7) and (-8:-15), but I'm pretty sure I need to use .iloc.
IN:
response = requests.get("url")
data = response.json()
df=pd.DataFrame(data['DailyUsage'])
total_sessions = df['Sessions'].sum()
current_week= df['Sessions'].iloc[-1:-7]
print(current_week)
total_sessions =['current_week'].sum
OUT:
Series([], Name: Sessions, dtype: int64)
AttributeError 'list' object has no attribute 'sum'
Note: I've tried this with and without pd.to_numeric and also with variations on the syntax of the slice and sum methods. Pandas doesn't feel very Pythonic and I'm out of ideas as to what to try next.
Assuming that df['Sessions'] holds one value per day, and you are comparing the current and previous week only, you can use reshape to create weekly sums from the last 14 values.
weekly_matrix = df['Sessions'][:-15:-1].values.reshape((2, 7))
Then, you can sum each row to get the weekly sums; the most recent week will be the first element.
import numpy as np
weekly_sum = np.sum(weekly_matrix, axis=1)
current_week = weekly_sum[0]
previous_week = weekly_sum[1]
EDIT: how the code works
Let's take the 1D-array accessed via the values attribute of the pandas Series. It contains the last 14 days, ordered from most recent to oldest. I will call it x.
x = array([14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
The array's reshape function is then called on x to split this data into a 2D-array (matrix) with 2 rows and 7 columns.
The default behavior of the reshape function is to first fill all columns in a row before moving to the next row. Therefore, x[0] will be the element (1,1) in the reshaped array, x[1] will be the element (1,2), and so on. After the element (1,7) is filled with x[6] (ending the current week), the next element x[7] will then be placed in (2,1). This continues until finishing the reshape operation, with the placement of x[13] in (2,7).
This results in placing the first 7 elements of x (current week) in the first row, and the last 7 elements of x (previous week) in the second row. This was called weekly_matrix.
weekly_matrix = x.reshape((2, 7))
# weekly_matrix = array([[14, 13, 12, 11, 10, 9, 8],
# [ 7, 6, 5, 4, 3, 2, 1]])
Since we now have the values of each week organized in a matrix, we can use the numpy.sum function to finish our operation. numpy.sum takes an axis argument, which controls how the values are combined:
if axis=None, all elements are added into a grand total.
if axis=0, the rows in each column are summed. For weekly_matrix this gives a 7-element 1D-array ([21, 19, 17, 15, 13, 11, 9]), which is not the result we want, as it adds the equivalent days of each week.
if axis=1 (as in the solution), the columns in each row are summed, producing a 2-element 1D-array. The order of the result follows the order of the rows in the matrix: element 0 is the total of the first row (the current week) and element 1 is the total of the second row (the previous week), so we can extract the information by index:
weekly_sum = np.sum(weekly_matrix, axis=1)
# weekly_sum = array([77, 28])
current_week = weekly_sum[0] # sum of [14, 13, 12, 11, 10, 9, 8] = 77
previous_week = weekly_sum[1] # sum of [ 7, 6, 5, 4, 3, 2, 1] = 28
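For comparison, the slicing idea from the question also works once the slice bounds are fixed; assuming the series is ordered oldest to newest with one row per day:
current_week = df['Sessions'].iloc[-7:].sum()      # last 7 days
previous_week = df['Sessions'].iloc[-14:-7].sum()  # the 7 days before that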
To group and sum by a fixed number of values, for instance with daily data and weekly aggregation, consider groupby. You can do this forwards or backwards by slicing your series as appropriate:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 21)})
print(df['col'].values)
# array([5, 0, 3, 3, 7, 9, 3, 5, 2, 4, 7, 6, 8, 8, 1, 6, 7, 7, 8, 1, 5])
# forwards groupby
res = df['col'].groupby(df.index // 7).sum()
# 0 30
# 1 40
# 2 35
# Name: col, dtype: int32
# backwards groupby
df['col'].iloc[::-1].reset_index(drop=True).groupby(df.index // 7).sum()
# 0 35
# 1 40
# 2 30
# Name: col, dtype: int32
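If the frame has a DatetimeIndex, the same block totals can also come from resample, which avoids the index arithmetic; a small sketch with made-up dates:
# assumes a daily DatetimeIndex (hypothetical dates, for illustration only)
idx = pd.date_range('2019-01-01', periods=21, freq='D')
s = pd.Series(np.random.randint(0, 10, 21), index=idx)
print(s.resample('7D').sum())  # one row per 7-day block, oldest first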
I want to calculate the absolute difference between all pairs of elements in a set of integers. I am trying to do abs(x-y) where x and y are two elements in the set. I want to do that for all combinations and save the resulting list in a new set.
I want to calculate absolute difference between all elements in a set of integers (...) and save the resulting list in a new set.
You can use itertools.combinations:
from itertools import combinations
s = {1, 4, 7, 9}
{abs(i - j) for i, j in combinations(s, 2)}
=>
{2, 3, 5, 6, 8}
combinations returns the r-length tuples of all combinations in s without replacement, i.e.:
list(combinations(s, 2))
=>
[(9, 4), (9, 1), (9, 7), (4, 1), (4, 7), (1, 7)]
Note that sets do not maintain order, so if you need the pairs in a particular order you can use something like an ordered set and iterate up to the last-but-one element.
For completeness, here's a solution based on Numpy ndarray's and pdist() (the [:, None] adds a second axis, since pdist expects a 2D array of observations):
In [69]: import numpy as np
In [70]: from scipy.spatial.distance import pdist
In [71]: s = {1, 4, 7, 9}
In [72]: set(pdist(np.array(list(s))[:, None], 'cityblock'))
Out[72]: {2.0, 3.0, 5.0, 6.0, 8.0}
Here is another solution based on numpy; it finds the minimum absolute difference rather than all of them:
import numpy as np
data = np.array([33, 22, 21, 1, 44, 54])
minn = np.inf
index = np.arange(data.shape[0])
for i in range(data.shape[0]):
    # compare data[i] against every other element
    to_sub = (index[:i], index[i+1:])
    temp = np.abs(data[i] - data[np.hstack(to_sub)])
    min_temp = np.min(temp)
    if min_temp < minn:
        minn = min_temp
print('Min difference is', minn)
Output: "Min difference is 1"
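As an aside, the minimum pairwise difference can also be found without an explicit loop: after sorting, the minimum absolute difference must occur between adjacent elements.
import numpy as np
data = np.array([33, 22, 21, 1, 44, 54])
print('Min difference is', np.diff(np.sort(data)).min())  # Min difference is 1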
Here is another way using combinations:
from itertools import combinations
def find_differences(lst):
    """Find all pairwise differences, plus the min & max difference."""
    d = [abs(i - j) for i, j in combinations(set(lst), 2)]
    return min(d), max(d), d
Test:
list_of_nums = [1, 9, 7, 13, 56, 5]
min_, max_, diff_ = find_differences(list_of_nums)
print(f'All differences: {diff_}\nMaximum difference: {max_}\nMinimum difference: {min_}')
Result:
All differences: [4, 6, 8, 12, 55, 2, 4, 8, 51, 2, 6, 49, 4, 47, 43]
Maximum difference: 55
Minimum difference: 2
I am trying to sort values in a numpy array so that I can store all of the values that fall within a certain range (that could probably be phrased better). Anyway, I'll give an example of what I am trying to do. I have an array called bins that looks like this:
bins = array([11,11.5,12,12.5,13,13.5,14])
I also have another array called avgs:
avgs = array([11.02, 13.67, 11.78, 12.34, 13.24, 12.98, 11.3, 12.56, 13.95, 13.56,
              11.64, 12.45, 13.23, 13.64, 12.46, 11.01, 11.87, 12.34, 13.87, 13.04,
              12.49, 12.5])
What I am trying to do is to find the index values of the avgs array that are in the ranges between the values of the bins array. For example I was trying to make a while loop that would create new variables for each bin. The first bin would be everything that is between bins[0] and bins[1] and would look like:
bin1 = array([0, 6, 15])
Those index values correspond to the values 11.02, 11.3, and 11.01 in avgs, i.e. the values of avgs that fall between bins[0] and bins[1]. I also need the other bins, so another example would be:
bin2 = array([2, 10, 16])
However, the challenging part for me is that the sizes of bins and avgs change based on other parameters, so I am trying to build something that can expand to larger or smaller bins and avgs arrays.
Numpy has some pretty powerful bin counting functions.
>>> binplace = np.digitize(avgs, bins) # Returns which bin each average belongs to
>>> binplace
array([1, 6, 2, 3, 5, 4, 1, 4, 6, 6, 2, 3, 5, 6, 3, 1, 2, 3, 6, 5, 3, 4])
>>> np.where(binplace == 1)
(array([ 0, 6, 15]),)
>>> np.where(binplace == 2)
(array([ 2, 10, 16]),)
>>> avgs[np.where(binplace == 1)]
array([ 11.02, 11.3 , 11.01])
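Since the number of bins varies with your other parameters, you can collect the indices for every bin at once instead of hard-coding bin1, bin2, and so on; a small sketch building on the results above:
>>> bin_indices = {b: np.where(binplace == b)[0] for b in np.unique(binplace)}
>>> bin_indices[1]
array([ 0,  6, 15])
>>> bin_indices[2]
array([ 2, 10, 16])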
I want to split the calendar into two-week intervals starting at 2008-May-5, or any arbitrary starting point.
So I start with several date objects:
import datetime as DT
raw = ("2010-08-01",
"2010-06-25",
"2010-07-01",
"2010-07-08")
transactions = [(DT.datetime.strptime(datestring, "%Y-%m-%d").date(),
"Some data here") for datestring in raw]
transactions.sort()
By manually analyzing the dates, I am quite able to figure out which dates fall within the same fortnight interval. I want to get a grouping similar to this one:
# Fortnight interval 1
(datetime.date(2010, 6, 25), 'Some data here')
(datetime.date(2010, 7, 1), 'Some data here')
(datetime.date(2010, 7, 8), 'Some data here')
# Fortnight interval 2
(datetime.date(2010, 8, 1), 'Some data here')
import datetime as DT
import itertools
start_date=DT.date(2008,5,5)
def mkdate(datestring):
    return DT.datetime.strptime(datestring, "%Y-%m-%d").date()
def fortnight(date):
    return (date - start_date).days // 14
raw = ("2010-08-01",
"2010-06-25",
"2010-07-01",
"2010-07-08")
transactions=[(date,"Some data") for date in map(mkdate,raw)]
transactions.sort(key=lambda item: item[0])
for key, grp in itertools.groupby(transactions, key=lambda item: fortnight(item[0])):
    print(key, list(grp))
yields
# (55, [(datetime.date(2010, 6, 25), 'Some data')])
# (56, [(datetime.date(2010, 7, 1), 'Some data'), (datetime.date(2010, 7, 8), 'Some data')])
# (58, [(datetime.date(2010, 8, 1), 'Some data')])
Note that 2010-6-25 is in the 55th fortnight from 2008-5-5, while 2010-7-1 is in the 56th. If you want them grouped together, simply change start_date (to something like 2008-5-16).
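A quick check of that arithmetic:
print((DT.date(2010, 6, 25) - start_date).days)        # 781
print((DT.date(2010, 6, 25) - start_date).days // 14)  # 55
print((DT.date(2010, 7, 1) - start_date).days // 14)   # 56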
PS. The key tool used above is itertools.groupby, which is explained in detail here.
Edit: The lambdas are simply a way to make "anonymous" functions. (They are anonymous in the sense that they are not given names like functions defined by def). Anywhere you see a lambda, it is also possible to use a def to create an equivalent function. For example, you could do this:
import operator
transactions.sort(key=operator.itemgetter(0))
def transaction_fortnight(transaction):
    date, data = transaction
    return fortnight(date)
for key, grp in itertools.groupby(transactions, key=transaction_fortnight):
    print(key, list(grp))
Use itertools.groupby with a lambda function that divides the distance from the starting point by the length of the period.
>>> from itertools import groupby
>>> for i, group in groupby(range(30), lambda x: x // 7):
...     print(list(group))
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
[28, 29]
So with dates:
import itertools as it
start = DT.date(2008, 5, 5)
lenperiod = 14
for fnight, info in it.groupby(transactions, lambda data: (data[0] - start).days // lenperiod):
    print(list(info))
You can also use week numbers from strftime, with lenperiod expressed as a number of weeks (note that %W week numbers restart at each new year):
lenperiod = 2
for fnight, info in it.groupby(transactions, lambda data: int(data[0].strftime('%W')) // lenperiod):
    print(list(info))
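For example, 2010-06-25 falls in week 25 of its year (%W counts Monday-started weeks, with days before the first Monday in week 0), so with lenperiod = 2 it lands in fortnight 12 of that year:
>>> DT.date(2010, 6, 25).strftime('%W')
'25'
>>> int(DT.date(2010, 6, 25).strftime('%W')) // 2
12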
Using a pandas DataFrame with resample works too. Given the OP's data, but with "Some data here" changed to 'a', 'b', 'c', 'd':
>>> import datetime as DT
>>> raw = ("2010-08-01",
... "2010-06-25",
... "2010-07-01",
... "2010-07-08")
>>> transactions = [(DT.datetime.strptime(datestring, "%Y-%m-%d"), data) for
... datestring, data in zip(raw,'abcd')]
>>> transactions
[(datetime.datetime(2010, 8, 1, 0, 0), 'a'),
(datetime.datetime(2010, 6, 25, 0, 0), 'b'),
(datetime.datetime(2010, 7, 1, 0, 0), 'c'),
(datetime.datetime(2010, 7, 8, 0, 0), 'd')]
Now try using pandas. First create a DataFrame, naming the columns and setting the index to the dates.
>>> import pandas as pd
>>> df = pd.DataFrame(transactions,
... columns=['date','data']).set_index('date')
>>> df
data
date
2010-08-01 a
2010-06-25 b
2010-07-01 c
2010-07-08 d
Now use the Series offset alias '2W-SUN' to resample every two weeks anchored on Sundays and sum each bin; with string data, summing concatenates the values that fall in the same fortnight.
>>> fortnight = df.resample('2W-SUN').sum()
>>> fortnight
data
date
2010-06-27 b
2010-07-11 cd
2010-07-25 0
2010-08-08 a
Now drill into the data as needed, by week start
>>> fortnight.loc['2010-06-27']['data']
b
or index
>>> fortnight.iloc[0]['data']
b
or indices
>>> data = fortnight.iloc[:2]['data']
>>> data
date
2010-06-27     b
2010-07-11    cd
Freq: 2W-SUN, Name: data, dtype: object
>>> data[0]
b
>>> data[1]
cd