I'm trying to create weekly groups with xarray's groupby based on a datetime64 time dimension. For some reason it is creating extra groups and putting some dates in the wrong groups. I'm using the S coordinate to group by week. For each year there should be five weekly groups, but it is creating seven groups.
Groups being created:
In [38]: em.groupby('S.week').groups
Out[38]:
{1: [1, 6, 10, 15, 20, 25, 31, 35, 40, 45, 50, 56, 61, 65, 70, 75, 80, 86, 90],
2: [2, 7, 11, 16, 21, 26, 32, 36, 41, 46, 51, 57, 62, 66, 71, 76, 81, 87, 91],
3: [3, 8, 12, 17, 22, 27, 33, 37, 42, 47, 52, 58, 63, 67, 72, 77, 82, 88, 92],
4: [4, 9, 13, 18, 23, 28, 34, 38, 43, 48, 53, 59, 64, 68, 73, 78, 83, 89, 93],
5: [14, 19, 24, 29, 39, 44, 49, 54, 69, 74, 79, 84, 94],
52: [5, 60],
53: [0, 30, 55, 85]}
Information on em:
In [39]: em
Out[39]:
<xarray.Dataset>
Dimensions: (S: 95, latitude: 181, lead: 32, longitude: 360)
Coordinates:
* latitude (latitude) float64 -90.0 -89.0 -88.0 -87.0 ... 88.0 89.0 90.0
* longitude (longitude) float64 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
* lead (lead) timedelta64[ns] 0 days 12:00:00 ... 31 days 12:00:00
* S (S) datetime64[ns] 1999-01-02 1999-01-09 ... 2017-01-30
Data variables:
eto (S, lead, latitude, longitude) float64 dask.array<shape=(95, 32, 181, 360), chunksize=(1, 32, 181, 360)>
Values for S:
In [35]: em.S
Out[35]:
<xarray.DataArray 'S' (S: 95)>
array(['1999-01-02T00:00:00.000000000', '1999-01-09T00:00:00.000000000',
'1999-01-16T00:00:00.000000000', '1999-01-23T00:00:00.000000000',
'1999-01-30T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-09T00:00:00.000000000', '2000-01-16T00:00:00.000000000',
'2000-01-23T00:00:00.000000000', '2000-01-30T00:00:00.000000000',
'2001-01-02T00:00:00.000000000', '2001-01-09T00:00:00.000000000',
'2001-01-16T00:00:00.000000000', '2001-01-23T00:00:00.000000000',
'2001-01-30T00:00:00.000000000', '2002-01-02T00:00:00.000000000',
'2002-01-09T00:00:00.000000000', '2002-01-16T00:00:00.000000000',
'2002-01-23T00:00:00.000000000', '2002-01-30T00:00:00.000000000',
'2003-01-02T00:00:00.000000000', '2003-01-09T00:00:00.000000000',
'2003-01-16T00:00:00.000000000', '2003-01-23T00:00:00.000000000',
'2003-01-30T00:00:00.000000000', '2004-01-02T00:00:00.000000000',
'2004-01-09T00:00:00.000000000', '2004-01-16T00:00:00.000000000',
'2004-01-23T00:00:00.000000000', '2004-01-30T00:00:00.000000000',
'2005-01-02T00:00:00.000000000', '2005-01-09T00:00:00.000000000',
'2005-01-16T00:00:00.000000000', '2005-01-23T00:00:00.000000000',
'2005-01-30T00:00:00.000000000', '2006-01-02T00:00:00.000000000',
'2006-01-09T00:00:00.000000000', '2006-01-16T00:00:00.000000000',
'2006-01-23T00:00:00.000000000', '2006-01-30T00:00:00.000000000',
'2007-01-02T00:00:00.000000000', '2007-01-09T00:00:00.000000000',
'2007-01-16T00:00:00.000000000', '2007-01-23T00:00:00.000000000',
'2007-01-30T00:00:00.000000000', '2008-01-02T00:00:00.000000000',
'2008-01-09T00:00:00.000000000', '2008-01-16T00:00:00.000000000',
'2008-01-23T00:00:00.000000000', '2008-01-30T00:00:00.000000000',
'2009-01-02T00:00:00.000000000', '2009-01-09T00:00:00.000000000',
'2009-01-16T00:00:00.000000000', '2009-01-23T00:00:00.000000000',
'2009-01-30T00:00:00.000000000', '2010-01-02T00:00:00.000000000',
'2010-01-09T00:00:00.000000000', '2010-01-16T00:00:00.000000000',
'2010-01-23T00:00:00.000000000', '2010-01-30T00:00:00.000000000',
'2011-01-02T00:00:00.000000000', '2011-01-09T00:00:00.000000000',
'2011-01-16T00:00:00.000000000', '2011-01-23T00:00:00.000000000',
'2011-01-30T00:00:00.000000000', '2012-01-02T00:00:00.000000000',
'2012-01-09T00:00:00.000000000', '2012-01-16T00:00:00.000000000',
'2012-01-23T00:00:00.000000000', '2012-01-30T00:00:00.000000000',
'2013-01-02T00:00:00.000000000', '2013-01-09T00:00:00.000000000',
'2013-01-16T00:00:00.000000000', '2013-01-23T00:00:00.000000000',
'2013-01-30T00:00:00.000000000', '2014-01-02T00:00:00.000000000',
'2014-01-09T00:00:00.000000000', '2014-01-16T00:00:00.000000000',
'2014-01-23T00:00:00.000000000', '2014-01-30T00:00:00.000000000',
'2015-01-02T00:00:00.000000000', '2015-01-09T00:00:00.000000000',
'2015-01-16T00:00:00.000000000', '2015-01-23T00:00:00.000000000',
'2015-01-30T00:00:00.000000000', '2016-01-02T00:00:00.000000000',
'2016-01-09T00:00:00.000000000', '2016-01-16T00:00:00.000000000',
'2016-01-23T00:00:00.000000000', '2016-01-30T00:00:00.000000000',
'2017-01-02T00:00:00.000000000', '2017-01-09T00:00:00.000000000',
'2017-01-16T00:00:00.000000000', '2017-01-23T00:00:00.000000000',
'2017-01-30T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* S (S) datetime64[ns] 1999-01-02 1999-01-09 ... 2017-01-23 2017-01-30
So, for example, everything in group 53 should actually be in group 1, and other dates are likewise misplaced. Here are some of the January 2 dates:
In [40]: em.S[0].values
Out[40]: numpy.datetime64('1999-01-02T00:00:00.000000000')
In [41]: em.S[5].values
Out[41]: numpy.datetime64('2000-01-02T00:00:00.000000000')
In [42]: em.S[10].values
Out[42]: numpy.datetime64('2001-01-02T00:00:00.000000000')
In [43]: em.S[55].values
Out[43]: numpy.datetime64('2010-01-02T00:00:00.000000000')
In [44]: em.S[85].values
Out[44]: numpy.datetime64('2016-01-02T00:00:00.000000000')
Any suggestions?
As several people have pointed out, the groups being created are not actually wrong. I was expecting each weekly group to contain the same month-days, but that is not the case, since the groups are based on ISO weeks. So January 2 can actually fall in ISO week 1, 52, or 53 depending on the year.
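For anyone who wants to verify this, here is a minimal sketch (plain pandas, independent of the dataset above) that prints the ISO calendar entry for January 2 in several of these years. And since every start date in this dataset falls in January, grouping on the day of the month, e.g. em.groupby('S.day'), should give the five groups I was originally expecting.
import pandas as pd

# January 2 lands in a different ISO week depending on the year:
for year in (1999, 2000, 2001, 2010, 2017):
    ts = pd.Timestamp(year, 1, 2)
    # isocalendar() returns (ISO year, ISO week, ISO weekday)
    print(year, ts.isocalendar())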
Hi guys, I am now working on a Python algorithm and I am new to Python. I'd like to generate a list of numbers like 4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25... using two for loops.
I've done some work to find some of the numbers and I am close to the result I want, which is to generate a list containing these numbers.
My code is here:
for x in range(0, 6, 1):
    start_ind = int(((x+3) * (x+2)) / 2 + 1)
    print("start index is ", [start_ind], x)
    start_node = node[start_ind]
    for y in range(0, x):
        ind = start_ind + y + 1
        ind_list = node[ind]
        index = [ind_list]
        print(index)
Here node is a list:
node = ['n%d' % i for i in range(0, 36, 1)]
What I received from this code is:
start index is [7] 1
['n8']
start index is [11] 2
['n12']
['n13']
start index is [16] 3
['n17']
['n18']
['n19']
start index is [22] 4
['n23']
['n24']
['n25']
['n26']
start index is [29] 5
['n30']
['n31']
['n32']
['n33']
['n34']
This seems to give the same list, and I think it makes it much clearer what's happening:
val = 4
result = []
for i in range(1, 7):
    for j in range(val, val+i):
        val = val+1
        result.append(j)
    val = j+3
print(result)
I do not think you need a loop for this, let alone two:
import numpy as np
dif = np.ones(100, dtype = np.int32)
dif[np.cumsum(np.arange(14))] = 3
(1+np.cumsum(dif)).tolist()
Output:
[4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 37, 38, 39, 40, 41, 42, 43, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 62, 63, 64, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 121, 122, 123, 124, 125, 126, 127, 128, 129]
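For intuition behind the vectorized answer above: starting from 1, the steps to successive terms of the target sequence are mostly 1, with a 3 at the start of each block, and those 3s sit at the triangular-number positions 0, 1, 3, 6, 10, ... in the step array, which is exactly what np.cumsum(np.arange(14)) produces. The code then recovers the sequence as 1 plus the running total of the steps. A quick check of the block boundaries:
import numpy as np

seq = np.array([4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22])
print(np.diff(seq))  # [3 1 3 1 1 3 1 1 1 3] - a 3 at every block boundary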
ind_list = []
start_ind = 4
for x in range(0, 6):
    ind_list.append(start_ind)
    for y in range(1, x+1):
        ind_list.append(start_ind + y)
    start_ind = ind_list[len(ind_list)-1]+3
print(ind_list)
You could probably use this. The print call works fine, and the list works for the numbers provided. It appends each block's starting number at the top of the outer loop, with a continually longer inner loop for each x. I'm assuming the number sequence is 4, 4+3, 4+3+1, 4+3+1+3, 4+3+1+3+1, 4+3+1+3+1+1, 4+3+1+3+1+1+3, ....
I am trying to use a regex to identify particular rows of a large pandas DataFrame. Specifically, I intend to match the DOI of a paper to an XML ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios.
Including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First: if you use apply on a Series, the values passed to the lambda function are scalars, not Series. And because contains is a pandas string method rather than a method of Python's built-in str, you get your AttributeError.
Second: parentheses have a special meaning in a regex (they delimit a capture group). If you want them matched as literal characters, you have to escape them.
test_doi = '0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, pandas' filter function filters on the labels of the index, not the values.
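As a side note, rather than escaping the parentheses by hand you can let re.escape from the standard library do it; a minimal sketch against the sample frame above:
import re

test_doi = re.escape('0002-3568(2004)450')  # escapes '(' and ')' and any other regex metacharacters
matches = df.loc[df['ScID.xml'].str.contains(test_doi)]
print(matches.index.tolist())  # [2] for the sample data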
This is likely a really simple question, but it's one I've been confused about and stuck on for a while, so I'm hoping I might get some help.
I'm using cross validation to test my data set, but I'm finding that indexing the pandas df is not working as I'm expecting. Specifically, when I print out x_test, I find that there are no data points for x_test. In fact, there are indexes but no columns.
k = 10
N = len(df)
n = N/k + 1
for i in range(k):
    print i*n, i*n+n
    x_train = df.iloc[i*n: i*n+n]
    y_train = df.iloc[i*n: i*n+n]
    x_test = df.iloc[0:i*n, i*n+n:-1]
    print x_test
Typical output:
0 751
Empty DataFrame
Columns: []
Index: []
751 1502
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
I'm trying to work out how to get the data to show up. Any thoughts?
Why don't you use sklearn.cross_validation.KFold? There is a clear example on this site...
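A minimal sketch of the idea (sklearn.cross_validation was the module path at the time; current scikit-learn exposes the same class as sklearn.model_selection.KFold, which is what this sketch assumes):
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
for train_index, test_index in kf.split(df):
    # integer positions for each fold; split into features/target as in the update below
    x_train = df.iloc[train_index]
    x_test = df.iloc[test_index]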
UPDATE:
For all subsets you have to specify the columns as well: for x_train and x_test you have to exclude the target column, and for y_train only the target column should be present. See slicing and indexing for more details.
target = 'target'  # name of the target column
list_features = df.columns.tolist()  # use all columns for model training
list_features.remove(target)  # excluding the "target" column
k = 10
N = len(df)
n = int(N/k) + 1  # 'int()' is necessary in Python 3
for i in range(k):
    print i*n, i*n+n
    x_train = df.loc[i*n: i*n+n-1, list_features]  # '.loc[]' is inclusive, hence the "-1"
    y_train = df.loc[i*n: i*n+n-1, target]  # specify columns after ","
    x_test = df.loc[~df.index.isin(range(int(i*n), int(i*n+n))), list_features]
    print x_test
I have a simple DataFrame, AR, with 83 columns and 1428 rows:
In [128]:
AR.index
Out[128]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [129]:
AR.columns
Out[129]:
Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')
When I do for example:
In [121]:
AR[AR.ANSNR==10042]
I get
AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [ANSNR, PNR, MEDB, SLUMP, ANTAL, ANTBT, ANTSM, ANTAE, ANTFU, ANTZL, ANTYL, ATB], 12 x 1, dtype: int64] with duplicate items [Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')] does not have _ref_locs set
Thank you for any suggestions.
Edit: sorry, here is the Pandas version:
In [136]:
pd.__version__
Out[136]:
'0.13.1'
Jeff's question:
In [139]:
AR.index.is_unique
Out[139]:
True
In [140]:
AR.columns.is_unique
Out[140]:
False
Is it the last one that is causing the problem?
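Most likely yes: FILLER2 appears twice in AR.columns, and duplicate column labels are a known source of errors like this in older pandas versions. A minimal sketch of one way to find and drop the duplicates (assuming a pandas version where Index.duplicated is available):
# Labels that occur more than once:
print(AR.columns[AR.columns.duplicated()].tolist())  # expect ['FILLER2']

# Keep only the first occurrence of each label:
AR = AR.loc[:, ~AR.columns.duplicated()]
AR[AR.ANSNR == 10042]  # boolean row selection should now work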