I'm trying to create weekly groups with xarray groupby, based on a datetime64 time dimension. For some reason it is creating extra groups and putting some dates in the wrong groups. I'm using the S coordinate to group by week. For each year there should be five weekly groups, but seven groups are being created.
Groups being created:
In [38]: em.groupby('S.week').groups
Out[38]:
{1: [1, 6, 10, 15, 20, 25, 31, 35, 40, 45, 50, 56, 61, 65, 70, 75, 80, 86, 90],
2: [2, 7, 11, 16, 21, 26, 32, 36, 41, 46, 51, 57, 62, 66, 71, 76, 81, 87, 91],
3: [3, 8, 12, 17, 22, 27, 33, 37, 42, 47, 52, 58, 63, 67, 72, 77, 82, 88, 92],
4: [4, 9, 13, 18, 23, 28, 34, 38, 43, 48, 53, 59, 64, 68, 73, 78, 83, 89, 93],
5: [14, 19, 24, 29, 39, 44, 49, 54, 69, 74, 79, 84, 94],
52: [5, 60],
53: [0, 30, 55, 85]}
Information on em:
In [39]: em
Out[39]:
<xarray.Dataset>
Dimensions: (S: 95, latitude: 181, lead: 32, longitude: 360)
Coordinates:
* latitude (latitude) float64 -90.0 -89.0 -88.0 -87.0 ... 88.0 89.0 90.0
* longitude (longitude) float64 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
* lead (lead) timedelta64[ns] 0 days 12:00:00 ... 31 days 12:00:00
* S (S) datetime64[ns] 1999-01-02 1999-01-09 ... 2017-01-30
Data variables:
eto (S, lead, latitude, longitude) float64 dask.array<shape=(95, 32, 181, 360), chunksize=(1, 32, 181, 360)>
Values for S:
In [35]: em.S
Out[35]:
<xarray.DataArray 'S' (S: 95)>
array(['1999-01-02T00:00:00.000000000', '1999-01-09T00:00:00.000000000',
'1999-01-16T00:00:00.000000000', '1999-01-23T00:00:00.000000000',
'1999-01-30T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-09T00:00:00.000000000', '2000-01-16T00:00:00.000000000',
'2000-01-23T00:00:00.000000000', '2000-01-30T00:00:00.000000000',
'2001-01-02T00:00:00.000000000', '2001-01-09T00:00:00.000000000',
'2001-01-16T00:00:00.000000000', '2001-01-23T00:00:00.000000000',
'2001-01-30T00:00:00.000000000', '2002-01-02T00:00:00.000000000',
'2002-01-09T00:00:00.000000000', '2002-01-16T00:00:00.000000000',
'2002-01-23T00:00:00.000000000', '2002-01-30T00:00:00.000000000',
'2003-01-02T00:00:00.000000000', '2003-01-09T00:00:00.000000000',
'2003-01-16T00:00:00.000000000', '2003-01-23T00:00:00.000000000',
'2003-01-30T00:00:00.000000000', '2004-01-02T00:00:00.000000000',
'2004-01-09T00:00:00.000000000', '2004-01-16T00:00:00.000000000',
'2004-01-23T00:00:00.000000000', '2004-01-30T00:00:00.000000000',
'2005-01-02T00:00:00.000000000', '2005-01-09T00:00:00.000000000',
'2005-01-16T00:00:00.000000000', '2005-01-23T00:00:00.000000000',
'2005-01-30T00:00:00.000000000', '2006-01-02T00:00:00.000000000',
'2006-01-09T00:00:00.000000000', '2006-01-16T00:00:00.000000000',
'2006-01-23T00:00:00.000000000', '2006-01-30T00:00:00.000000000',
'2007-01-02T00:00:00.000000000', '2007-01-09T00:00:00.000000000',
'2007-01-16T00:00:00.000000000', '2007-01-23T00:00:00.000000000',
'2007-01-30T00:00:00.000000000', '2008-01-02T00:00:00.000000000',
'2008-01-09T00:00:00.000000000', '2008-01-16T00:00:00.000000000',
'2008-01-23T00:00:00.000000000', '2008-01-30T00:00:00.000000000',
'2009-01-02T00:00:00.000000000', '2009-01-09T00:00:00.000000000',
'2009-01-16T00:00:00.000000000', '2009-01-23T00:00:00.000000000',
'2009-01-30T00:00:00.000000000', '2010-01-02T00:00:00.000000000',
'2010-01-09T00:00:00.000000000', '2010-01-16T00:00:00.000000000',
'2010-01-23T00:00:00.000000000', '2010-01-30T00:00:00.000000000',
'2011-01-02T00:00:00.000000000', '2011-01-09T00:00:00.000000000',
'2011-01-16T00:00:00.000000000', '2011-01-23T00:00:00.000000000',
'2011-01-30T00:00:00.000000000', '2012-01-02T00:00:00.000000000',
'2012-01-09T00:00:00.000000000', '2012-01-16T00:00:00.000000000',
'2012-01-23T00:00:00.000000000', '2012-01-30T00:00:00.000000000',
'2013-01-02T00:00:00.000000000', '2013-01-09T00:00:00.000000000',
'2013-01-16T00:00:00.000000000', '2013-01-23T00:00:00.000000000',
'2013-01-30T00:00:00.000000000', '2014-01-02T00:00:00.000000000',
'2014-01-09T00:00:00.000000000', '2014-01-16T00:00:00.000000000',
'2014-01-23T00:00:00.000000000', '2014-01-30T00:00:00.000000000',
'2015-01-02T00:00:00.000000000', '2015-01-09T00:00:00.000000000',
'2015-01-16T00:00:00.000000000', '2015-01-23T00:00:00.000000000',
'2015-01-30T00:00:00.000000000', '2016-01-02T00:00:00.000000000',
'2016-01-09T00:00:00.000000000', '2016-01-16T00:00:00.000000000',
'2016-01-23T00:00:00.000000000', '2016-01-30T00:00:00.000000000',
'2017-01-02T00:00:00.000000000', '2017-01-09T00:00:00.000000000',
'2017-01-16T00:00:00.000000000', '2017-01-23T00:00:00.000000000',
'2017-01-30T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* S (S) datetime64[ns] 1999-01-02 1999-01-09 ... 2017-01-23 2017-01-30
So, for example, everything in group 53 should actually be in group 1, and other dates are likewise in the wrong groups. All of the group 53 dates:
In [40]: em.S[0].values
Out[40]: numpy.datetime64('1999-01-02T00:00:00.000000000')
In [41]: em.S[5].values
Out[41]: numpy.datetime64('2000-01-02T00:00:00.000000000')
In [42]: em.S[10].values
Out[42]: numpy.datetime64('2001-01-02T00:00:00.000000000')
In [43]: em.S[55].values
Out[43]: numpy.datetime64('2010-01-02T00:00:00.000000000')
In [44]: em.S[85].values
Out[44]: numpy.datetime64('2016-01-02T00:00:00.000000000')
Any suggestions?
The groups being created are not actually wrong, as several have already pointed out. I was expecting each weekly group to contain the same month-days, but that is not the case, since the groups are based on ISO weeks. So January 2 can actually fall in week 1, 52, or 53, depending on the year's ISO calendar.
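If the goal is one group per January start date rather than one per ISO week, a simple workaround (a sketch, assuming every year's start dates fall on the same days of the month, as they do above) is to group on the day-of-month component instead:
# group on day-of-month instead of ISO week; each year's start dates here
# are January 2, 9, 16, 23 and 30, so this yields exactly five groups
weekly = em.groupby('S.day')
weekly.groups   # {2: [...], 9: [...], 16: [...], 23: [...], 30: [...]}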
I have a data set, made with random numbers, containing the sales of each sales representative for all previous months, and I want to know if there is a way to predict what the sales would look like for each representative in the upcoming month. I'm not sure whether machine learning methods are something that can be used here.
I am mostly asking for the best way to solve this, not necessarily code, but perhaps a method that is well suited to these kinds of questions. This is something I am interested in and would like to apply to bigger data sets in the future.
import pandas as pd

data = [[1, 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
        [2, 35, 28, 43, 25, 54, 76, 92, 34, 12, 14, 35, 63],
        [3, 13, 31, 15, 75, 4, 14, 54, 23, 15, 72, 12, 51],
        [4, 42, 94, 22, 34, 32, 45, 31, 34, 65, 10, 15, 18],
        [5, 7, 51, 29, 14, 92, 28, 64, 100, 69, 89, 4, 95],
        [6, 34, 20, 59, 49, 94, 92, 45, 91, 28, 22, 43, 30],
        [7, 50, 4, 5, 45, 62, 71, 87, 8, 74, 30, 3, 46],
        [8, 12, 54, 35, 25, 52, 97, 67, 56, 62, 99, 83, 9],
        [9, 50, 75, 92, 57, 45, 91, 83, 13, 31, 89, 33, 58],
        [10, 5, 89, 90, 14, 72, 99, 51, 29, 91, 34, 25, 2]]

df = pd.DataFrame(data, columns=['sales representative ID#',
                                 'January Sales Quantity',
                                 'February Sales Quantity',
                                 'March Sales Quantity',
                                 'April Sales Quantity',
                                 'May Sales Quantity',
                                 'June Sales Quantity',
                                 'July Sales Quantity',
                                 'August Sales Quantity',
                                 'September Sales Quantity',
                                 'October Sales Quantity',
                                 'November Sales Quantity',
                                 'December Sales Quantity'])
Your case with multiple sales representatives is more complex: since they are responsible for the same product, there may be a complex correlation between their performance, on top of seasonality, autocorrelation, etc. Your data is not even a pure time series; it rather belongs to the class of so-called "panel" datasets.
I've recently written a Python micro-package, salesplansuccess, which deals with predicting the current (or next) year's annual sales from historic monthly sales data. A major assumption of that model, however, is quarterly seasonality (more specifically, a repeating drift from the 2nd to the 3rd month of each quarter), which is more characteristic of wholesalers.
The package is installed as usual with pip install salesplansuccess.
You can modify its source code for it to better fit your needs.
The minimalistic use case is below:
import pandas as pd
from salesplansuccess.api import SalesPlanSuccess
myHistoricalData = pd.read_excel('myfile.xlsx')
myAnnualPlan = 1000
sps = SalesPlanSuccess(data=myHistoricalData, plan=myAnnualPlan)
sps.fit()
sps.simulate()
sps.plot()
For more detailed illustration of its use, you may want to refer to a Jupyter Notebook illustration file at its GitHub repository.
Choose a method of prediction and iterate over the reps, calculating parameters for each. Below is a simple linear regression loop in Python you can use; over time you can add something smarter.
#!/usr/bin/python
data = [[1 , 55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45],
(...)

months = []
for m in range(len(data[0]) - 1):   # 12 months; the first column is the rep ID
    months.append(m + 1)

for rep in range(len(data)):
    linear_regression(months, data[rep][1:])   # fit one rep's monthly sales
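The linear_regression call above is left undefined; as one possible reading of it (a sketch, not the answerer's actual function), a least-squares line fit with numpy.polyfit could look like this:
import numpy as np

def linear_regression(x, y):
    # fit y = slope * x + intercept by least squares and
    # return the prediction for the month after the last x
    slope, intercept = np.polyfit(x, y, 1)
    return slope * (x[-1] + 1) + intercept

months = list(range(1, 13))
rep1_sales = [55, 12, 25, 42, 66, 89, 75, 32, 43, 15, 32, 45]   # rep 1 from the data above
print(linear_regression(months, rep1_sales))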
My Code
import numpy as np
import pandas as pd
print('-' * 50)
filename = r'''C:\Users\Computer\Documents\Python Scripts\weather.txt'''
df = pd.read_csv(filename)
pd.set_option('display.max_columns', None)
print(df.describe())
print(df.record_high)
My data
month, avg_high, avg_low, record_high, record_low, avg_percipitation
Jan, 58, 42, 74, 22, 2.95
Feb, 61, 45, 78, 26, 3.02
Mar, 65, 48, 84, 25, 2.34
Apr, 67, 50, 92, 28, 1.02
May, 71, 53, 98, 35, 0.48
Jun, 75, 56, 107, 41, 0.11
Jul, 77, 58, 105, 44, 0.0
Aug, 77, 59, 102, 43, 0.03
Sep, 77, 57, 103, 40, 0.17
Oct, 73, 54, 96, 34, 0.81
Nov, 64, 48, 84, 30, 1.7
Dec, 58, 42, 73, 21, 2.56
When I run it, it gives me an error saying AttributeError: 'DataFrame' object has no attribute 'record_high', yet the data clearly has that column. Does anyone have a solution?
There may be a spacing error in your data: the header is read in as ' record_high', with a leading space. Try accessing the column with df[' record_high'].
If that is the case, run
df.columns = df.columns.str.strip()
after you read in df. You should then be able to access
df['record_high']
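Alternatively (a sketch, assuming the stray spaces come only from the ", " separators in the file), pandas can strip them while reading:
import pandas as pd

# skipinitialspace drops the whitespace that follows each delimiter,
# so the column comes out as 'record_high' rather than ' record_high'
df = pd.read_csv(filename, skipinitialspace=True)
print(df.record_high)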
I have two arrays, and I want to loop through the second array and return only the sub-arrays whose first element is equal to an element of the first array.
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54], [10, 8, 52, 30, 15, 47, 109],
     [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78],
     [18, 182, 25, 63, 96, 104, 74]]
I have two different arrays, a and b. I would like to look through each of the sub-arrays within b and keep those whose first value is equal to one of the values in array a, creating a new array, c.
The result I am looking for is:
c = [[10, 8, 52, 30, 15, 47, 109],[11, 81, 152, 54, 112, 78, 167],[13, 82, 84, 63, 24, 26, 78]]
Does Python have a tool to do this, the way Excel has MATCH()?
I tried looping in a manner such as:
for i in a:
    if i in b:
        print(b)
But because there are other elements within the array, this way is not working. Any help would be greatly appreciated.
Further explanation of the problem:
a = [5, 6, 7, 9, 12]
I read in an Excel file using xlrd (b_csv_data):
Start Count Error Constant Result1 Result2 Result3 Result4
5 41 0 45 23 54 66 19
5.4 44 1 21 52 35 6 50
6 16 1 42 95 39 1 13
6.9 50 1 22 71 86 59 97
7 38 1 43 50 47 83 67
8 26 1 29 100 63 15 40
9 46 0 28 85 9 27 81
12 43 0 21 74 78 20 85
Next, I created a loop to read in a select number of rows. For simplicity, the file above only has a few rows; my actual file has about 100 rows.
for r in range(1, 7):   # skipping the header and only wanting the first few rows to start
    b_raw = b_csv_data.row_values(r)
    b = np.array(b_raw)  # the b numpy array is created from the line above
Use np.isin -
In [8]: b[np.isin(b[:,0],a)]
Out[8]:
array([[ 10, 8, 52, 30, 15],
[ 11, 81, 152, 54, 112],
[ 13, 82, 84, 63, 24]])
With a sorted a (as a NumPy array), we can also use np.searchsorted -
idx = np.searchsorted(a,b[:,0])
idx[idx==len(a)] = 0
out = b[a[idx] == b[:,0]]
If you have an array with a different number of elements per row, which is essentially an array of lists, you need to modify the slicing part. In that case, get the first element off each row -
b0 = [bi[0] for bi in b]
Then use b0 to replace all instances of b[:,0] in the methods posted earlier.
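For example, with a ragged b (a sketch; the b0 and mask names are just illustrative):
import numpy as np

a = [10, 11, 12, 13, 14]
b = [[9, 23, 45], [10, 8, 52, 30], [11, 81, 152, 54, 112],
     [13, 82, 84], [18, 182, 25, 63]]

b0 = [bi[0] for bi in b]                  # first element of each row
mask = np.isin(b0, a)                     # same membership test as before
c = [row for row, keep in zip(b, mask) if keep]
# c -> [[10, 8, 52, 30], [11, 81, 152, 54, 112], [13, 82, 84]]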
Use a list comprehension:
c = [l for l in b if l[0] in a]
Output:
[[10, 8, 52, 30, 15], [11, 81, 152, 54, 112], [13, 82, 84, 63, 24]]
If your lists or arrays are considerably large, using numpy.isin can be significantly faster:
b[np.isin(b[:, 0], a), :]
Benchmark:
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56], [10, 8, 52, 30, 15], [11, 81, 152, 54, 112],
[13, 82, 84, 63, 24], [18, 182, 25, 63, 96]]
import timeit
import numpy as np

list_comp, np_isin = [], []
for i in range(1, 100):
    a_test = a * i
    b_test = b * i
    list_comp.append(timeit.timeit('[l for l in b_test if l[0] in a_test]',
                                   number=10, globals=globals()))
    a_arr = np.array(a_test)
    b_arr = np.array(b_test)
    np_isin.append(timeit.timeit('b_arr[np.isin(b_arr[:, 0], a_arr), :]',
                                 number=10, globals=globals()))
While the cutoff is not clear-cut, I would recommend the list comprehension if b has fewer than about 100 rows; otherwise, NumPy is the way to go.
You are doing it in reverse. It is better to loop through the elements of b and check whether they are present in a; if so, print that element of b. See the answer below.
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54], [10, 8, 52, 30, 15, 47, 109], [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78], [18, 182, 25, 63, 96, 104, 74]]
for bb in b:   # if you only want to check whether the first element of bb is in a
    if bb[0] in a:
        print(bb)

for bb in b:   # if you want to check whether any element of bb is in a
    for bbb in bb:
        if bbb in a:
            print(bb)
            break   # stop after the first match so each row prints once
Output:
[10, 8, 52, 30, 15, 47, 109]
[11, 81, 152, 54, 112, 78, 167]
[13, 82, 84, 63, 24, 26, 78]
I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios.
Including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First: if you use apply on a Series, the value passed to the lambda is a scalar, not a Series. Because contains is a pandas string method and not a method of plain Python strings, you get your error message.
Second: brackets have a special meaning in a regex (they delimit a capture group). If you want the brackets treated as literal characters, you have to escape them.
test_doi = r'0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, the pandas filter function filters on the labels of the index, not the values.
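If you would rather not escape the pattern by hand, re.escape can do it for you (a small sketch; the matches name is just illustrative):
import re

test_doi = re.escape('0002-3568(2004)450')
matches = df.index[df['ScID.xml'].str.contains(test_doi)]
# matches holds the index labels of the rows whose ScID.xml contains the DOI (here, 2)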
I have a simple DataFrame, AR, with 83 columns and 1428 rows:
In [128]:
AR.index
Out[128]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [129]:
AR.columns
Out[129]:
Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')
When I do for example:
In [121]:
AR[AR.ANSNR==10042]
I get
AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [ANSNR, PNR, MEDB, SLUMP, ANTAL, ANTBT, ANTSM, ANTAE, ANTFU, ANTZL, ANTYL, ATB], 12 x 1, dtype: int64] with duplicate items [Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')] does not have _ref_locs set
Thank you for any suggestions
Edit: sorry, here is the Pandas version:
In [136]:
pd.__version__
Out[136]:
'0.13.1'
Jeff's question:
In [139]:
AR.index.is_unique
Out[139]:
True
In [140]:
AR.columns.is_unique
Out[140]:
False
Is it the last one causing the problem?
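That duplicated label (FILLER2 appears twice in the column list above) is the likely culprit. A quick way to confirm which labels repeat, and one possible workaround (a sketch, assuming a reasonably recent pandas; dropping the duplicate column is an assumption, not something confirmed here):
# which column labels occur more than once?
print(AR.columns[AR.columns.duplicated()])

# keep only the first occurrence of each duplicated column,
# then the boolean row selection works as usual
AR = AR.loc[:, ~AR.columns.duplicated()]
AR[AR.ANSNR == 10042]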