slicing data in pandas - python

This is likely a really simple question, but it's one I've been confused about and stuck on for a while, so I'm hoping I might get some help.
I'm using cross validation to test my data set, but I'm finding that indexing the pandas df is not working as I'm expecting. Specifically, when I print out x_test, I find that there are no data points for x_test. In fact, there are indexes but no columns.
k = 10
N = len(df)
n = N/k + 1
for i in range(k):
    print i*n, i*n+n
    x_train = df.iloc[i*n: i*n+n]
    y_train = df.iloc[i*n: i*n+n]
    x_test = df.iloc[0:i*n, i*n+n:-1]
    print x_test
Typical output:
0 751
Empty DataFrame
Columns: []
Index: []
751 1502
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
I'm trying to work out how to get the data to show up. Any thoughts?

Why don't you use sklearn.cross_validation.KFold? There is a clear example on this site...
UPDATE:
In every subset you have to specify the columns as well: in x_train and x_test you have to exclude the target column, and in y_train only the target column should be present. See slicing and indexing for more details.
target = 'target' # name of target column
list_features = df.columns.tolist() # use all columns at model training
list_features.remove(target) # excluding "target" column
k = 10
N = len(df)
n = int(N/k) + 1 # int() is necessary in Python 3
for i in range(k):
    print(i*n, i*n+n)
    x_train = df.loc[i*n: i*n+n-1, list_features] # '.loc[]' is inclusive, hence the "-1"
    y_train = df.loc[i*n: i*n+n-1, target] # specify columns after the ","
    x_test = df.loc[~df.index.isin(range(int(i*n), int(i*n+n))), list_features]
    print(x_test)
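For reference, here is a minimal pure-Python sketch of what a KFold-style split does (this is not sklearn's actual implementation, and `kfold_indices` is a name I made up): each fold serves as the test set exactly once, and all remaining rows form the training set. With a real DataFrame you would then take `df.iloc[train]` and `df.iloc[test]`.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    # Distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

for train, test in kfold_indices(10, 3):
    print(test)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```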

Related

Given 100 random numbers with duplicates, I want to count how many numbers fall inside each interval in Python

For example:
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
Data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65, 65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
x = first_interval[0] <= data <= second_interval[0]
y = first_interval[1] <= data <= second_interval[1] # and so on
I want to know how many numbers from data are between 40-49, 50-59, 60-69 and so on.
frequency = [4, 6] # 4 is x and 6 is y
Iterate on the bounds using zip, then with a list comprehension you can filter the correct values
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65,
65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
result = {}
for start, end in zip(first_interval, second_interval):
    result[(start, end)] = len([v for v in data if start <= v <= end])
print(result)
# {(40, 49): 4, (50, 59): 6, (60, 69): 10, (70, 79): 4, (80, 89): 4, (90, 99): 2}
print(result[(40, 49)])
# 4
The version with a list and len is easier to understand
result[(start, end)] = len([v for v in data if start <= v <= end])
But the following version would be more performant for bigger sizes: as it's a generator, it won't have to build the whole list just to throw it away afterwards.
result[(start, end)] = sum((1 for v in data if start <= v <= end))
Another version, which doesn't use the predefined bounds and so is more performant: its complexity is O(n) rather than O(n*m), because you iterate once over the values instead of once over the values for each pair of bounds.
from collections import defaultdict

result = defaultdict(int)
for value in data:
    start = 10 * (value // 10)
    result[(start, start + 9)] += 1
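Putting that bucketing idea together with the question's data gives the flat frequency list the question asked for (note this approach assumes the intervals are always aligned to decades, i.e. 40-49, 50-59, ...):

```python
from collections import defaultdict

data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65,
        65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]

counts = defaultdict(int)
for value in data:
    start = 10 * (value // 10)  # floor each value to its decade
    counts[(start, start + 9)] += 1

# The flat frequency list in the format the question asked for
frequency = [counts[(s, s + 9)] for s in sorted(c[0] for c in counts)]
print(frequency)  # [4, 6, 10, 4, 4, 2]
```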
This may help you:
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
Data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65, 65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
def find_occurence(start, end, data):
    counter = 0
    for i in data:
        if start <= i <= end:
            counter += 1
    return counter

print(find_occurence(first_interval[0], second_interval[0], Data)) # this gives you the answer for x, and the same thing works for y
Note: start means from where you want to start; end means where you want to stop.
We can use numpy.histogram with bins defined by first_interval (each bin open on the right), plus max(second_interval) to close the rightmost bin.
Code
import numpy as np

# Generate counts and bins (rightmost edge given by max(second_interval))
frequency, bins = np.histogram(data, bins = first_interval + [max(second_interval)])
# Show results
for i in range(len(frequency)):
    if i < len(frequency) - 1:
        print(f'{bins[i]}-{bins[i+1]-1} : {frequency[i]}') # frequency doesn't include the right edge
    else:
        print(f'{bins[i]}-{bins[i+1]} : {frequency[i]}') # the last bin includes its right edge
Output
40-49 : 4
50-59 : 6
60-69 : 10
70-79 : 4
80-89 : 4
90-99 : 2

How to generate a list of numbers in python

Hi guys. I am now working on a Python algorithm and I am new to Python. I'd like to generate a list of numbers like 4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25... with two for loops.
I've done some work to find some of the numbers and I am close to the result I want, which is to generate a list containing these numbers.
My code is here:
for x in range(0, 6, 1):
    start_ind = int(((x+3) * (x+2)) / 2 + 1)
    print("start index is ", [start_ind], x)
    start_node = node[start_ind]
    for y in range(0, x):
        ind = start_ind + y + 1
        ind_list = node[ind]
        index = [ind_list]
        print(index)
node is a list:
node = ['n%d' % i for i in range(0, 36, 1)]
What I received from this code is:
start index is [7] 1
['n8']
start index is [11] 2
['n12']
['n13']
start index is [16] 3
['n17']
['n18']
['n19']
start index is [22] 4
['n23']
['n24']
['n25']
['n26']
start index is [29] 5
['n30']
['n31']
['n32']
['n33']
['n34']
This seems to give the same list, and I think it's much clearer what's happening:
val = 4
result = []
for i in range(1, 7):
    for j in range(val, val + i):
        val = val + 1
        result.append(j)
    val = j + 3
print(result)
I do not think you need a loop for this, let alone two:
import numpy as np
dif = np.ones(100, dtype = np.int32)
dif[np.cumsum(np.arange(14))] = 3
(1+np.cumsum(dif)).tolist()
output
[4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 37, 38, 39, 40, 41, 42, 43, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 62, 63, 64, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 121, 122, 123, 124, 125, 126, 127, 128, 129]
ind_list = []
start_ind = 4
for x in range(0, 6):
    ind_list.append(start_ind)
    for y in range(1, x+1):
        ind_list.append(start_ind + y)
    start_ind = ind_list[len(ind_list)-1] + 3
print(ind_list)
You could probably use this. The print function works fine, and the list, I assume, works fairly well for the numbers provided. It appends the new number at the beginning of the loop, with a continually longer inner loop for each x. I'm assuming the number sequence is 4, 4+3, 4+3+1, 4+3+1+3, 4+3+1+3+1, 4+3+1+3+1+1, 4+3+1+3+1+1+3, ....
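The pattern described above (row x contains x consecutive numbers, and each new row starts 3 past the end of the previous one) can also be generated directly; a small sketch, where `rows_sequence` is a name I made up:

```python
def rows_sequence(n_rows, start=4, gap=3):
    """Row x (1-based) has x consecutive numbers; each row starts `gap` past the previous row's end."""
    out = []
    for length in range(1, n_rows + 1):
        out.extend(range(start, start + length))
        start = out[-1] + gap
    return out

print(rows_sequence(6))
# [4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34]
```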

How to get all possible values from a list of numbers without itertools

I want to create an algorithm that finds all values that can be created with the 4 basic operations + - * / from a list of numbers l, where 2 <= len(l) <= 6 and every number n >= 1.
All numbers must be integers.
I have seen a lot of similar topics, but I don't want to use itertools; I want to understand why my recursive program doesn't work.
I tried to make a costly recursive program that does an exhaustive search of all the possible combinations, like a tree with n = len(l) roots where each branch has depth n.
L: the list of starting numbers
C: the current value
M: the list of all possible values
My code:
from copy import deepcopy

def result(L, C, M):
    if len(L) > 0:
        for i in range(len(L)):
            a = L[i]
            if C >= a:
                l = deepcopy(L)
                l.remove(a)
                m = []  # new current values
                # +
                m.append(C + a)
                # * (multiplying by 1 is useless)
                if C != 1 or a != 1:
                    m.append(C * a)
                # / (must be an integer; a can't be 0)
                if C % a == 0 and a <= C:
                    m.append(C // a)
                # - (0 is useless)
                if C != a:
                    m.append(C - a)
                for r in m:  # update all possible values
                    if r not in M:
                        M.append(r)
                for r in m:  # call the function again with the new current values and the updated list of remaining numbers
                    result(l, r, M)

def values_possible(L):
    m = []
    for i in L:
        l = deepcopy(L)
        l.remove(i)
        result(l, i, m)
    m.sort()
    return m
For small lists without duplicate numbers, my algorithm seems to work but with lists like [1,1,2,2,4,5] it misses some values.
It returns:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101,
102, 104, 105, 110, 112, 115, 116, 118, 119, 120, 121, 122, 124, 125, 128, 130,
140, 160]
but it misses 93,108,114,117,123,126,132,135,150,180.
Let's take an even simpler example: [1, 1, 2, 2].
One of the numbers your algorithm can't find is 9 = (1 + 2) * (1 + 2).
Your algorithm simply cannot come up with this computation because it always deals with a "current" value C. You can start with C = 1 + 2, but you cannot find the next 1 + 2 because it has to be constructed separately.
So your recursion will have to do at least some kind of partitioning into two groups, finding all the answers for those, and then combining them.
Something like this could work:
def partitions(L):
    if not L:
        yield ([], [])
    else:
        for l, r in partitions(L[1:]):
            yield [L[0]] + l, r
            yield l, [L[0]] + r

def values_possible(L):
    if len(L) == 1:
        return L
    results = set()
    for a, b in partitions(L):
        if not a or not b:
            continue
        for va in values_possible(a):
            for vb in values_possible(b):
                results.add(va + vb)
                results.add(va * vb)
                if va > vb:
                    results.add(va - vb)
                if va % vb == 0:
                    results.add(va // vb)
    return results
Not too efficient though.
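As a quick sanity check of the claim above (copying the two functions verbatim), the partition-based version does find 9 for [1, 1, 2, 2], i.e. (1 + 2) * (1 + 2):

```python
def partitions(L):
    if not L:
        yield ([], [])
    else:
        for l, r in partitions(L[1:]):
            yield [L[0]] + l, r
            yield l, [L[0]] + r

def values_possible(L):
    if len(L) == 1:
        return L
    results = set()
    for a, b in partitions(L):
        if not a or not b:
            continue
        for va in values_possible(a):
            for vb in values_possible(b):
                results.add(va + vb)
                results.add(va * vb)
                if va > vb:
                    results.add(va - vb)
                if va % vb == 0:
                    results.add(va // vb)
    return results

print(9 in values_possible([1, 1, 2, 2]))  # True: (1 + 2) * (1 + 2)
```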

Using regex to obtain the row index of a partial match within a pandas dataframe column

I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios.
Including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First: if you use apply on a Series, the value passed to the lambda is a scalar, not a Series. Because contains is a pandas string method rather than a method of Python's built-in str, you get the AttributeError.
Second: brackets have a special meaning in a regex (they delimit a capture group). If you want the brackets matched as literal characters, you have to escape them.
test_doi = r'0002-3568\(2004\)450' # raw string so the backslashes survive
df.loc[df.iloc[:, 0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, pandas' filter function filters on the labels of the index, not on the values.
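Instead of escaping the brackets by hand, Python's re.escape can do it automatically. A small self-contained sketch using the example IDs from the question (with a DataFrame you could equivalently pass the escaped pattern to .str.contains, or use .str.contains(test_doi, regex=False) for a plain substring match):

```python
import re

test_doi = '0002-3568(2004)450'
xml_ids = [
    '0009-3570(2017)050[0199:omfg]2.3.co.xml',
    '0001-3568(2001)750[0199:smdhmf]2.3.co.xml',
    '0002-3568(2004)450[0199:gissaf]2.3.co.xml',
]

pattern = re.escape(test_doi)  # escapes ( ) [ ] and other regex metacharacters
matches = [i for i, s in enumerate(xml_ids) if re.search(pattern, s)]
print(matches)  # [2]
```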

Simple boolean indexing gives AssertionError: Cannot create BlockManager

I have a simple DataFrame, AR, with 83 columns and 1428 rows:
In [128]:
AR.index
Out[128]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [129]:
AR.columns
Out[129]:
Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')
When I do for example:
In [121]:
AR[AR.ANSNR==10042]
I get
AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [ANSNR, PNR, MEDB, SLUMP, ANTAL, ANTBT, ANTSM, ANTAE, ANTFU, ANTZL, ANTYL, ATB], 12 x 1, dtype: int64] with duplicate items [Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')] does not have _ref_locs set
Thank you for any suggestions
Edit: sorry, here is the Pandas version:
In [136]:
pd.__version__
Out[136]:
'0.13.1'
Jeff's question:
In [139]:
AR.index.is_unique
Out[139]:
True
In [140]:
AR.columns.is_unique
Out[140]:
False
Is it the last one that is causing problems?
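For what it's worth, the non-unique columns are very likely the culprit here ('FILLER2' appears twice in the column list above, and older pandas versions struggled with duplicate labels in boolean indexing). A small sketch, with made-up data, of how duplicate labels can be found and dropped:

```python
import pandas as pd

# Made-up frame with a duplicated column label, mimicking 'FILLER2' above
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['ANSNR', 'FILLER2', 'FILLER2'])

print(df.columns.is_unique)        # False
dupes = df.columns[df.columns.duplicated()]
print(list(dupes))                 # ['FILLER2']

# One way to fix it: keep only the first occurrence of each label
df_unique = df.loc[:, ~df.columns.duplicated()]
print(df_unique.columns.is_unique) # True
```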
