For example:
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65, 65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
x = first_interval[0] <= data <= second_interval[0]
y = first_interval[1] <= data <= second_interval[1]  # and so on
I want to know how many numbers from data fall between 40-49, 50-59, 60-69 and so on, i.e.:
frequency = [4, 6]  # 4 is x and 6 is y
Iterate over the bounds using zip, then filter the matching values with a list comprehension:
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65,
65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
result = {}
for start, end in zip(first_interval, second_interval):
    result[(start, end)] = len([v for v in data if start <= v <= end])
print(result)
# {(40, 49): 4, (50, 59): 6, (60, 69): 10, (70, 79): 4, (80, 89): 4, (90, 99): 2}
print(result[(40, 49)])
# 4
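If you want the flat frequency list from the question rather than a dict, the same counting works inside a single list comprehension (a small variation on the code above):
frequency = [len([v for v in data if start <= v <= end])
             for start, end in zip(first_interval, second_interval)]
print(frequency)
# [4, 6, 10, 4, 4, 2]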
The version with a list and len is easier to understand
result[(start, end)] = len([v for v in data if start <= v <= end])
But the following version is more performant for larger inputs: since it is a generator expression, it doesn't have to build the whole list just to throw it away afterwards.
result[(start, end)] = sum(1 for v in data if start <= v <= end)
Another version that doesn't use the predefined bounds, and so is much more performant: its complexity is O(n) rather than O(n*m) like the first one, because you iterate once over the values instead of once over the values for each pair of bounds.
from collections import defaultdict

result = defaultdict(int)
for value in data:
    start = 10 * (value // 10)
    result[(start, start + 9)] += 1
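Running this on the same data should reproduce the counts from above (a quick sanity check; dict() is only used for display):
print(dict(result))
# {(40, 49): 4, (50, 59): 6, (60, 69): 10, (70, 79): 4, (80, 89): 4, (90, 99): 2}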
This may help you:
first_interval = [40, 50, 60, 70, 80, 90]
second_interval = [49, 59, 69, 79, 89, 99]
Data = [40, 42, 47, 49, 50, 52, 55, 56, 57, 59, 60, 61, 63, 65, 65, 65, 66, 68, 68, 69, 72, 74, 78, 79, 81, 85, 87, 88, 90, 98]
def find_occurence(start, end, data):
    counter = 0
    for i in data:
        if start <= i <= end:
            counter += 1
    return counter

print(find_occurence(first_interval[0], second_interval[0], Data))  # this gives you the answer for x, and the same thing works for y
Note: start means the value from which you want to start, and end means the value at which you want to stop.
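To get the full frequency list rather than a single count, you could call the function on every pair of bounds, reusing zip as in the earlier answer (a small extension, not part of the original code):
frequency = [find_occurence(start, end, Data)
             for start, end in zip(first_interval, second_interval)]
print(frequency)
# [4, 6, 10, 4, 4, 2]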
We can use numpy.histogram with bins defined by:
the values of first_interval as bin edges, with each bin open on the right, and
max(second_interval) as the closing edge of the rightmost bin.
Code
import numpy as np

# Generate counts and bins (rightmost edge given by max(second_interval))
frequency, bins = np.histogram(data, bins=first_interval + [max(second_interval)])
# Show results
for i in range(len(frequency)):
    if i < len(frequency) - 1:
        print(f'{bins[i]}-{bins[i+1]-1} : {frequency[i]}')  # bins don't include their right edge
    else:
        print(f'{bins[i]}-{bins[i+1]} : {frequency[i]}')  # except the last bin, which does
Output
40-49 : 4
50-59 : 6
60-69 : 10
70-79 : 4
80-89 : 4
90-99 : 2
I am working on a Python algorithm and I am new to Python. I'd like to generate a list of numbers like 4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25... with 2 for loops.
I've done some work and I am close to the result I want, which is to generate a list containing these numbers.
My code is here:
for x in range(0, 6, 1):
    start_ind = int(((x+3) * (x+2)) / 2 + 1)
    print("start index is ", [start_ind], x)
    start_node = node[start_ind]
    for y in range(0, x):
        ind = start_ind + y + 1
        ind_list = node[ind]
        index = [ind_list]
        print(index)
node is a list:
node = ['n%d' % i for i in range(0, 36, 1)]
What I received from this code is:
start index is [7] 1
['n8']
start index is [11] 2
['n12']
['n13']
start index is [16] 3
['n17']
['n18']
['n19']
start index is [22] 4
['n23']
['n24']
['n25']
['n26']
start index is [29] 5
['n30']
['n31']
['n32']
['n33']
['n34']
This seems to give the same list, and I think it's much clearer what's happening!
val = 4
result = []
for i in range(1, 7):
    for j in range(val, val + i):
        val = val + 1
        result.append(j)
    val = j + 3
print(result)
I do not think you need a loop for this, let alone two:
import numpy as np

# The gaps between consecutive terms are 1, except at positions given by
# the triangular numbers (cumsum of 0, 1, 2, ...), where the gap is 3
dif = np.ones(100, dtype=np.int32)
dif[np.cumsum(np.arange(14))] = 3
# Cumulative sum of the gaps, shifted so the sequence starts at 4
(1 + np.cumsum(dif)).tolist()
Output
[4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34, 37, 38, 39, 40, 41, 42, 43, 46, 47, 48, 49, 50, 51, 52, 53, 56, 57, 58, 59, 60, 61, 62, 63, 64, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 121, 122, 123, 124, 125, 126, 127, 128, 129]
ind_list = []
start_ind = 4
for x in range(0, 6):
    ind_list.append(start_ind)
    for y in range(1, x + 1):
        ind_list.append(start_ind + y)
    start_ind = ind_list[-1] + 3
print(ind_list)
You could probably use this. The print function works fine, and the list works for the numbers provided. It appends the new number at the beginning of the loop, with a continually longer inner loop for each x. I'm assuming the number sequence is 4, 4+3, 4+3+1, 4+3+1+3, 4+3+1+3+1, 4+3+1+3+1+1, 4+3+1+3+1+1+3, ....
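For reference, running the code above prints the first 21 terms of the target sequence:
# [4, 7, 8, 11, 12, 13, 16, 17, 18, 19, 22, 23, 24, 25, 26, 29, 30, 31, 32, 33, 34]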
I want to create an algorithm that finds all values that can be created with the 4 basic operations + - * / from a list of numbers l, where 2 <= len(l) <= 6 and every number n >= 1.
All numbers must be integers.
I have seen a lot of similar topics, but I don't want to use the itertools method; I want to understand why my recursive program doesn't work.
I tried to make a costly recursive program that does an exhaustive search of all the possible combinations, like a tree with n = len(l) starting points where each tree has depth n.
L: the list of starting numbers
C: the current value
M: the list of all possible values
My code:
from copy import deepcopy

def result(L, C, M):
    if len(L) > 0:
        for i in range(len(L)):
            a = L[i]
            if C >= a:
                l = deepcopy(L)
                l.remove(a)
                m = []  # new current values
                # +
                m.append(C + a)
                # *: multiplying by 1 is useless
                if C != 1 or a != 1:
                    m.append(C * a)
                # //: must be an integer (a can't be == 0)
                if C % a == 0 and a <= C:
                    m.append(C // a)
                # -: 0 is useless
                if C != a:
                    m.append(C - a)
                for r in m:  # update all possible values
                    if r not in M:
                        M.append(r)
                for r in m:  # call the function again with the new current values and the updated list of remaining numbers
                    result(l, r, M)

def values_possible(L):
    m = []
    for i in L:
        l = deepcopy(L)
        l.remove(i)
        result(l, i, m)
    m.sort()
    return m
For small lists without duplicate numbers, my algorithm seems to work but with lists like [1,1,2,2,4,5] it misses some values.
It returns:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 94, 95, 96, 97, 98, 99, 100, 101,
102, 104, 105, 110, 112, 115, 116, 118, 119, 120, 121, 122, 124, 125, 128, 130,
140, 160]
but it misses 93,108,114,117,123,126,132,135,150,180.
Let's take an even simpler example: [1, 1, 2, 2].
One of the numbers your algorithm can't find is 9 = (1 + 2) * (1 + 2).
Your algorithm simply cannot come up with this computation because it always deals with a "current" value C. You can start with C = 1 + 2, but you cannot find the next 1 + 2 because it has to be constructed separately.
So your recursion will have to do at least some kind of partitioning into two groups, finding all the answers for those, and then combining them.
Something like this could work:
def partitions(L):
    if not L:
        yield ([], [])
    else:
        for l, r in partitions(L[1:]):
            yield [L[0]] + l, r
            yield l, [L[0]] + r

def values_possible(L):
    if len(L) == 1:
        return L
    results = set()
    for a, b in partitions(L):
        if not a or not b:
            continue
        for va in values_possible(a):
            for vb in values_possible(b):
                results.add(va + vb)
                results.add(va * vb)
                if va > vb:
                    results.add(va - vb)
                if va % vb == 0:
                    results.add(va // vb)
    return results
Not too efficient though.
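As a quick sanity check, this version does find the 9 = (1 + 2) * (1 + 2) from the [1, 1, 2, 2] example above:
print(9 in values_possible([1, 1, 2, 2]))
# True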
I am trying to use regex to identify particular rows of a large pandas dataframe. Specifically, I intend to match the DOI of a paper to an xml ID that contains the DOI number.
# An example of the dataframe and a test doi:
ScID.xml journal year topic1
0 0009-3570(2017)050[0199:omfg]2.3.co.xml Journal_1 2017 0.000007
1 0001-3568(2001)750[0199:smdhmf]2.3.co.xml Journal_3 2001 0.000648
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
3 0003-3568(2011)150[0299:easayy]2.3.co.xml Journal_1 2011 0.000003
# A dummy doi:
test_doi = '0002-3568(2004)450'
In this example case I would like to be able to return the index of the third row (2) by finding the partial match in the ScID.xml column. The DOI is not always at the beginning of the ScID.xml string.
I have searched this site and applied the methods described for similar scenarios.
Including:
df.iloc[:,0].apply(lambda x: x.contains(test_doi)).values.nonzero()
This returns:
AttributeError: 'str' object has no attribute 'contains'
and:
df.filter(regex=test_doi)
gives:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[287459 rows x 0 columns]
and finally:
df.loc[:, df.columns.to_series().str.contains(test_doi).tolist()]
which also returns the Empty DataFrame as above.
All help is appreciated. Thank you.
There are two reasons why your first approach does not work:
First, if you use apply on a Series, the value passed to the lambda function is not a Series but a scalar. And because contains is a pandas string method, not a method of plain Python strings, you get your error message.
Second, brackets have a special meaning in a regex (they delimit a capture group). If you want the brackets as literal characters, you have to escape them.
test_doi = r'0002-3568\(2004\)450'
df.loc[df.iloc[:,0].str.contains(test_doi)]
ScID.xml journal year topic1
2 0002-3568(2004)450[0199:gissaf]2.3.co.xml Journal_1 2004 0.000003
By the way, pandas' filter function filters on the labels of the index, not the values.
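Instead of escaping by hand, you can let the standard library do it with re.escape, or skip regex matching entirely since you only need a literal substring match (a sketch against the example frame; both calls are standard APIs):
import re

test_doi = '0002-3568(2004)450'
df.loc[df.iloc[:, 0].str.contains(re.escape(test_doi))]
# or, since no regex features are needed:
df.loc[df.iloc[:, 0].str.contains(test_doi, regex=False)]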
I have a simple DataFrame, AR, with 83 columns and 1428 rows:
In [128]:
AR.index
Out[128]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [129]:
AR.columns
Out[129]:
Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')
When I do for example:
In [121]:
AR[AR.ANSNR==10042]
I get
AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [ANSNR, PNR, MEDB, SLUMP, ANTAL, ANTBT, ANTSM, ANTAE, ANTFU, ANTZL, ANTYL, ATB], 12 x 1, dtype: int64] with duplicate items [Index([u'ARKOD', u'ANSNR', u'PNR', u'NAMN', u'MEDB', u'LAEN', u'GATA', u'ORT1', u'ORT2', u'LAND', u'TFNA', u'TFNB', u'BEH_BEHAR', u'BEH_BEHS1', u'BEH_BEHS2', u'BEH_BEHKV', u'BEH_BEHAR2', u'BEH_BEHS1_2', u'BEH_BEHS2_2', u'BEH_BEHKV2', u'BEH_BEHAR3', u'BEH_BEHS1_3', u'BEH_BEHS2_3', u'BEH_BEHKV_3', u'BEH_BEHAR_4', u'BEH_BEHS1_4', u'BEH_BEHS2_4', u'BEH_BEHKV_4', u'BEH25', u'FILLER1', u'BEHFT', u'SLP_SPLAR', u'SLP_SLPP', u'MOTSV', u'FILLER2', u'ATG_ATG25', u'ATG_ATG9', u'ATG_ATGFT', u'ATG_ATGOB', u'ATG_ATGUT', u'ATG_ATGSI', u'ATG_ATGDI', u'ATG_ATGFO', u'ATG_ATGUG', u'ATG_ATGAL ', u'ATG_ATGUL1', u'ATG_ATGUL2', u'ATG_ATGUL3', u'ATG_ATGUL4', u'ATG_ATGUL5', u'ATG_ATGUL6', u'ATG_ATGUL7', u'ATG_ATGUL8', u'ATG_ATGUL9', u'ATG_ATGUL10', u'ATG_ATGUL11', u'ATG_ATGUL12', u'ATG_ATGFU1', u'ATG_ATGFU2', u'ATG_ATGFU3', u'ATG_ATGFU4', u'ATG_ATGB1', u'ATG_ATGB2', u'SLUMP', u'STAT_STATF', u'STAT_STATO', u'STAT_STATA', u'STAT_STATK', u'STAT_STATU', u'STAT_STATH', u'STAT_STATR', u'ANTAL', u'ANTBT', u'ANTSM', u'ANTAE', u'ANTFU', u'ANTZL', u'ANTYL', u'STATL', u'ATB', u'ANTB ', u'FILLER2'], dtype='object')] does not have _ref_locs set
Thank you for any suggestions
Edit: sorry, here is the Pandas version:
In [136]:
pd.__version__
Out[136]:
'0.13.1'
Jeff's question:
In [139]:
AR.index.is_unique
Out[139]:
True
In [140]:
AR.columns.is_unique
Out[140]:
False
Is it the last one that is causing the problem?
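Most likely, yes: note that u'FILLER2' appears twice in the column list above. As a hedged sketch (assuming dropping the repeated column is acceptable for your data, and using Index.duplicated, which exists in current pandas), you could de-duplicate the columns before filtering:
# Show which column labels occur more than once
print(AR.columns[AR.columns.duplicated()])

# Keep only the first occurrence of each label, then retry the selection
AR = AR.loc[:, ~AR.columns.duplicated()]
AR[AR.ANSNR == 10042]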