How to group data in Python

I have a file with data like:
Entry Freq.
2 4.5
3 3.4
5 4.9
8 9.1
12 11.1
16 13.1
18 12.2
22 11.2
The problem I am trying to solve: I want to group the data by Entry (in ranges of 10) and add up the frequencies that fall within each range.
For example, grouping the table above should give:
Range SumFreq.
0-10 21.9 (i.e. 4.5 + 3.4 + 4.9 + 9.1)
11-20 36.4
I got as far as separating the columns with the following code, but I can't work out how to do the range separation:
inp = open("c:/usr/ovisek/desktop/file.txt", 'r').read().strip().split('\n')
for line in map(str.split, inp):
    k = int(line[0])
    l = float(line[-1])
So far this works, but how can I group the data into ranges of 10?

One way would be to [ab]use the fact that integer division will give you the right bins:
import collections
bin_size = 10
d = collections.defaultdict(float)
for line in map(str.split, inp):
    k = int(line[0])
    l = float(line[-1])
    d[bin_size * (k // bin_size)] += l
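As a self-contained illustration of the integer-division idea, here is a minimal sketch that bins the sample rows from the question. The function name group_by_range and the inline rows list are assumptions made for this example, not part of the original answer. Note that k // bin_size puts an entry of exactly 10 into the 10-19 bin, which differs slightly from the 0-10 / 11-20 ranges in the question; the sample data contains no multiple of 10, so the sums agree.

```python
from collections import defaultdict

def group_by_range(rows, bin_size=10):
    """Sum frequencies per bin; each key is the lower bound of its bin."""
    sums = defaultdict(float)
    for entry, freq in rows:
        # Integer division maps every entry in [n*bin_size, (n+1)*bin_size) to n.
        sums[bin_size * (entry // bin_size)] += freq
    return dict(sums)

# Sample rows copied from the question (hypothetical in-memory stand-in for the file).
rows = [(2, 4.5), (3, 3.4), (5, 4.9), (8, 9.1),
        (12, 11.1), (16, 13.1), (18, 12.2), (22, 11.2)]

for low, total in sorted(group_by_range(rows).items()):
    print("%d-%d\t%.1f" % (low, low + bin_size_default if False else low + 9, total))
```

Keying the dictionary by the lower bound keeps the bins sortable as numbers; a label such as "0-9" can be built at print time.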

How about just adding to your code:
def group_data(bin_size):
    grouped_data = {}
    inp = open("c:/usr/ovisek/desktop/file.txt", 'r').read().strip().split('\n')
    for line in map(str.split, inp):
        k = int(line[0])
        l = float(line[-1])
        range_value = k // bin_size
        if range_value in grouped_data:
            grouped_data[range_value]['freq'] += l
        else:
            grouped_data[range_value] = {'freq': l, 'value': str(range_value * bin_size) + '-' + str((range_value + 1) * bin_size)}
    return grouped_data
This should give you a dictionary like:
{0: {'value': '0-10', 'freq': 21.9}, ...}

This should get you started, tested fine:
inp = open("/tmp/input.txt", 'r').read().strip().split('\n')
interval = 10
resultDict = {}
for line in map(str.split, inp):
    k = int(line[0])
    l = float(line[-1])
    rangeNum = (k - 1) // interval
    rangeKeyName = str(rangeNum * interval + 1) + "-" + str((rangeNum + 1) * interval)
    if rangeKeyName in resultDict:
        resultDict[rangeKeyName] += l
    else:
        resultDict[rangeKeyName] = l
print(str(resultDict))
Would output:
{'21-30': 11.199999999999999, '11-20': 36.399999999999999, '1-10': 21.899999999999999}

You can do something like this:
fr = {}
inp = open("file.txt", 'r').read().strip().split('\n')
for line in map(str.split, inp):
    k = int(line[0])
    l = float(line[-1])
    key = abs(k - 1) // 10 * 10
    if key in fr:
        fr[key] += l
    else:
        fr[key] = l
for k in sorted(fr):
    total = fr[k]
    print('%d-%d\t%f' % (k + 1 if k else 0, k + 10, total))
Output:
0-10 21.900000
11-20 36.400000
21-30 11.200000

Related

Binary search: Not getting upper & lower bound for very large values

I'm trying to solve a competitive programming problem, UVA - The Playboy Chimp, using Python, but for some reason the answer is wrong for very large values. For example, with this input:
5
3949 45969 294854 9848573 2147483647
5
10000 6 2147483647 4959 5949583
Accepted output:
3949 45969
X 3949
9848573 X
3949 45969
294854 9848573
My output:
X 294854
X 294854
9848573 X
X 294854
45969 9848573
My code:
def bs(target, search_space):
    l, r = 0, len(search_space) - 1
    while l <= r:
        m = (l + r) >> 1
        if target == search_space[m]:
            return m - 1, m + 1
        elif target > search_space[m]:
            l = m + 1
        else:
            r = m - 1
    return r, l
n = int(input())
f_heights = list(set([int(a) for a in input().split()]))
q = int(input())
heights = [int(b) for b in input().split()]
for h in heights:
    a, b = bs(h, f_heights)
    print(f_heights[a] if a >= 0 else 'X', f_heights[b] if b < len(f_heights) else 'X')
Any help would be appreciated!

This is because you are putting the first input line into a set, which does not preserve the order of the numbers, so the resulting list is no longer sorted and the binary search breaks. In Python 3.7 and newer (and in CPython 3.6 as an implementation detail), dict maintains insertion order, so you can use dict.fromkeys to remove duplicates while keeping the order:
f_heights = list(dict.fromkeys(int(a) for a in input().split()))
Example:
f_heights = list(set([int(a) for a in input().split()]))
print(f_heights) # [294854, 3949, 45969, 9848573, 2147483647]
f_heights = list(dict.fromkeys(int(a) for a in input().split()))
print(f_heights) # [3949, 45969, 294854, 9848573, 2147483647]
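As an alternative to a hand-written binary search, the standard library's bisect module can find both neighbours directly. The sketch below is not from the original answer; the helper name neighbours is hypothetical, and it assumes the heights are deduplicated with sorted(set(...)) so that the list is guaranteed to be in ascending order.

```python
from bisect import bisect_left, bisect_right

def neighbours(target, heights):
    """Return (tallest height strictly below target, shortest strictly above).

    `heights` must be sorted in ascending order; None marks a missing side.
    """
    lo = bisect_left(heights, target) - 1   # last index holding a value < target
    hi = bisect_right(heights, target)      # first index holding a value > target
    lower = heights[lo] if lo >= 0 else None
    upper = heights[hi] if hi < len(heights) else None
    return lower, upper

# Sample input from the question; sorted(set(...)) removes duplicates and sorts.
heights = sorted(set([3949, 45969, 294854, 9848573, 2147483647]))
for q in [10000, 6, 2147483647, 4959, 5949583]:
    lower, upper = neighbours(q, heights)
    print(lower if lower is not None else 'X',
          upper if upper is not None else 'X')
```

Using bisect_left for the lower side and bisect_right for the upper side skips over a height equal to the query, which is what the problem asks for.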

Knapsack problem(optimized doesn't work correctly)

I am working on Python code to solve the knapsack problem.
Here is my code:
import time
start_time = time.time()
#reading the data:
values = []
weights = []
test = []
with open("test.txt") as file:
    W, size = map(int, next(file).strip().split())
    for line in file:
        value, weight = map(int, line.strip().split())
        values.append(int(value))
        weights.append(int(weight))
weights = [0] + weights
values = [0] + values
#Knapsack Algorithm:
hash_table = {}
for x in range(0,W +1):
    hash_table[(0,x)] = 0
for i in range(1,size + 1):
    for x in range(0,W +1):
        if weights[i] > x:
            hash_table[(i,x)] = hash_table[i - 1,x]
        else:
            hash_table[(i,x)] = max(hash_table[i - 1,x],hash_table[i - 1,x - weights[i]] + values[i])
print("--- %s seconds ---" % (time.time() - start_time))
This code works correctly, but on big files my program crashes due to RAM issues.
So I decided to change the following part:
for i in range(1,size + 1):
    for x in range(0,W +1):
        if weights[i] > x:
            hash_table[(1,x)] = hash_table[0,x]
            #hash_table[(0,x)] = hash_table[1,x]
        else:
            hash_table[(1,x)] = max(hash_table[0,x],hash_table[0,x - weights[i]] + values[i])
        hash_table[(0,x)] = hash_table[(1,x)]
As you can see, instead of using n rows I am using only two (copying the second row into the first one in order to recreate the line hash_table[(i,x)] = hash_table[i - 1,x]), which should solve the RAM issues.
Unfortunately, it gives me a wrong result.
I have used the following test case:
190 6
50 56
50 59
64 80
46 64
50 75
5 17
Should get a total value of 150 and total weight of 190 using 3 items:
item with value 50 and weight 75,
item with value 50 and weight 59,
item with value 50 and weight 56,
More test cases: https://people.sc.fsu.edu/~jburkardt/datasets/knapsack_01/knapsack_01.html

The problem is that row 0 must stay unchanged while you are processing item i. Writing hash_table[(0,x)] inside the x loop overwrites values that later, larger x still need to read (hash_table[0,x - weights[i]]), so an item can be counted more than once. Copy row 1 back into row 0 in a separate loop, after the x loop has finished:
for i in range(1,size + 1):
    for x in range(0,W +1):
        if weights[i] > x:
            hash_table[(1,x)] = hash_table[0,x]
        else:
            hash_table[(1,x)] = max(hash_table[0,x],hash_table[0,x - weights[i]] + values[i])
    for x in range(0, W+1): # Make sure to reset after working on item i
        hash_table[(0,x)] = hash_table[(1,x)]
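A different but common way to save memory, not taken from the answer above, is to keep a single row and walk the capacities from high to low, so that the cells read for smaller capacities still hold the values from before the current item. The following is a minimal sketch under that assumption; the function name knapsack_max_value is hypothetical and the data is the test case from the question.

```python
def knapsack_max_value(capacity, values, weights):
    """0/1 knapsack using a single list of size capacity + 1."""
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Go from high to low capacity so best[x - weight] still refers to
        # the state before this item was considered (each item used at most once).
        for x in range(capacity, weight - 1, -1):
            best[x] = max(best[x], best[x - weight] + value)
    return best[capacity]

# Test case from the question: capacity 190, six (value, weight) pairs.
values = [50, 50, 64, 46, 50, 5]
weights = [56, 59, 80, 64, 75, 17]
print(knapsack_max_value(190, values, weights))  # 150
```

A plain list indexed by capacity also avoids the per-key overhead of a dictionary keyed by tuples.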

Summing results from a monte carlo

I am trying to sum the values in the 'Callpayoff' list but am unable to do so. print(Callpayoff) returns a vertical list:
0
4.081687878300656
1.6000410648454846
0.5024316862043037
0
so I wonder if it's a special sublist? sum(Callpayoff) does not work, unfortunately. Any help would be greatly appreciated.
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
def Generate_asset_price(S,v,r,dt):
    return (1 + r * dt + v * sqrt(dt) * np.random.normal(0,1))
def Call_Poff(S,T):
    return max(stream[-1] - S,0)
# initial values
S = 100
v = 0.2
r = 0.05
T = 1
N = 2 # number of steps
dt = 0.00396825
simulations = 5
for x in range(simulations):
    stream = [100]
    Callpayoffs = []
    t = 0
    for n in range(N):
        s = stream[t] * Generate_asset_price(S,v,r,dt)
        stream.append(s)
        t += 1
    Callpayoff = Call_Poff(S,T)
    print(Callpayoff)
    plt.plot(stream)

Right now you're not appending values to a list, you're just replacing the value of Callpayoff at each iteration and printing it. At each iteration, it's printed on a new line so it looks like a "vertical list".
What you need to do is use Callpayoffs.append(Call_Poff(S,T)) instead of Callpayoff = Call_Poff(S,T).
Now a new element will be added to Callpayoffs at every iteration of the for loop.
Then you can print the list with print(Callpayoffs) or the sum with print(sum(Callpayoffs))
All in all the for loop should look like this:
for x in range(simulations):
    stream = [100]
    Callpayoffs = []
    t = 0
    for n in range(N):
        s = stream[t] * Generate_asset_price(S,v,r,dt)
        stream.append(s)
        t += 1
        Callpayoffs.append(Call_Poff(S,T))
    print(Callpayoffs,"sum:",sum(Callpayoffs))
Output:
[2.125034975231003, 0] sum: 2.125034975231003
[0, 0] sum: 0
[0, 0] sum: 0
[0, 0] sum: 0
[3.2142923036024342, 4.1390018820809615] sum: 7.353294185683396
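If the goal is one payoff per simulated path, summed across all simulations, the list has to be created once, before the outer loop, and appended to after each path is complete. Here is a minimal sketch under that assumption. The helper simulate_payoffs and its seed parameter are hypothetical, and random.gauss from the standard library stands in for np.random.normal so that the example has no external dependencies.

```python
import random
from math import sqrt

def simulate_payoffs(S, v, r, dt, steps, simulations, seed=None):
    """Return one call payoff per simulated path (strike = initial price S)."""
    rng = random.Random(seed)
    payoffs = []                      # created once, outside both loops
    for _ in range(simulations):
        price = S
        for _ in range(steps):
            # Same one-step update as Generate_asset_price in the question.
            price *= 1 + r * dt + v * sqrt(dt) * rng.gauss(0, 1)
        payoffs.append(max(price - S, 0))   # payoff at the end of the path
    return payoffs

payoffs = simulate_payoffs(S=100, v=0.2, r=0.05, dt=0.00396825,
                           steps=2, simulations=5, seed=0)
print(payoffs, "sum:", sum(payoffs))
```

Passing a seed makes a run repeatable, which helps when checking the sum by hand.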

sequence of repeated values in a list

I have a problem with a program and hope someone can help me fix it. Basically, I have a randomly generated list of 20 values, and I want to place brackets around the values that are repeated (for example, if the list is [1,2,2,4,5] it should display 1 ( 2 2 ) 4 5).
Here is my code, which works only if there is no repeated value at the end, because otherwise the list index goes out of range. How can I fix this?
from random import randint
lanci = []
for i in range(20):
    x = randint(1,6)
    lanci.append(x)
print(lanci)
i=0
while i < len(lanci)-1:
    if lanci[i] == lanci[i+1]:
        print("(",end=" ")
        print(lanci[i],end=" ")
        while lanci[i]==lanci[i+1]:
            i = i + 1
            print(lanci[i],end=" ")
        print(")",end=" ")
    else:
        print(lanci[i],end=" ")
    i = i + 1

Alternatively to your more manual approach, you could use itertools.groupby to group equal values in the list and then enclose those in parens:
>>> import random, itertools
>>> lst = [random.randint(1, 5) for _ in range(20)]
>>> tmp = [list(map(str, g)) for k, g in itertools.groupby(lst)]
>>> ' '.join(g[0] if len(g) == 1 else "(" + " ".join(g) + ")" for g in tmp)
'5 4 1 2 1 4 (5 5) 4 5 1 5 4 3 (5 5) 3 (5 5 5)'

Not the prettiest, but it will do:
from random import randint
from itertools import groupby
lanci = [randint(1,6) for _ in range(20)]
result = [tuple(v) for _, v in groupby(lanci)]
print(*[i[0] if len(i) == 1 else '('+' '.join(map(str, i))+')' for i in result], sep=' ')
#(2 2) 3 5 3 1 5 4 6 2 1 4 6 4 (5 5) 3 6 3 4

Just check for the "last element" before comparing with the next one, both in the if and in your inner while loop. The outer loop also has to run up to len(lanci), otherwise a final value that is not part of a repeated run is never printed.
from random import randint
lanci = []
for i in range(20):
    x = randint(1,6)
    lanci.append(x)
print(lanci)
i=0
while i < len(lanci):
    if (i+1 < len(lanci)) and (lanci[i] == lanci[i+1]):
        print("(",end=" ")
        print(lanci[i],end=" ")
        while (i+1 < len(lanci)) and (lanci[i]==lanci[i+1]):
            i = i + 1
            print(lanci[i],end=" ")
        print(")",end=" ")
    else:
        print(lanci[i],end=" ")
    i = i + 1

Convert the list of numbers to a string, then you can use this function (it compares character by character, so it assumes single-digit values).
Split the result if you need the list back again.
def add_brackets(string):
    _character, _index = None, 0
    _return_string = ''
    for i, c in enumerate(string+ ' '):
        if _character is None or _character != c :
            if len(string[_index:i])>1:
                _return_string+='(' + string[_index: i] + ')'
            else:
                _return_string+=string[_index: i]
            _character, _index = c, i
    return _return_string

This is another option using just basic lists:
def group_consecutives(lst):
    res, sub, memo = [None], [], None
    lst.append(memo)
    for x in lst:
        if memo == x:
            sub.append(memo)
            if res[-1] != sub: res.append(sub)
        else:
            sub.append(memo)
            if memo and not len(sub) > 1: res.append(memo)
            memo, sub = x, []
    return res[1:]
print(group_consecutives(lanci))

Efficient algorithm for counting unique elements in "suffixes" of an array

I was doing 368B on CodeForces with Python 3, which basically asks you to print the numbers of unique elements in a series of "suffixes" of a given array. Here's my solution (with some additional redirection code for testing):
import sys
if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in
    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    l = [None] * m
    for i in range(m):
        l[i] = int(sys.stdin.readline().rstrip())
    l_sorted = sorted(l)
    l_order = sorted(range(m), key=lambda k: l[k])
    # the ranks of elements in l
    l_rank = sorted(range(m), key=lambda k: l_order[k])
    # unique_elem[i] = non-duplicated elements between l_sorted[i] and l_sorted[i+1]
    unique_elem = [None] * m
    for i in range(m):
        unique_elem[i] = set(a[(l_sorted[i] - 1): (l_sorted[i + 1] - 1)]) if i < m - 1 else set(a[(l_sorted[i] - 1): n])
    # unique_elem_cumulative[i] = non-duplicated elements between l_sorted[i] and a's end
    unique_elem_cumulative = unique_elem[-1]
    # unique_elem_cumulative_count[i] = #unique_elem_cumulative[i]
    unique_elem_cumulative_count = [None] * m
    unique_elem_cumulative_count[-1] = len(unique_elem[-1])
    for i in range(m - 1):
        i_rev = m - i - 2
        unique_elem_cumulative = unique_elem[i_rev] | unique_elem_cumulative
        unique_elem_cumulative_count[i_rev] = len(unique_elem_cumulative)
    with open('b.out', 'w') as f_out:
        for i in range(m):
            idx = l_rank[i]
            f_out.write('%d\n' % unique_elem_cumulative_count[idx])
    sys.stdin = original_stdin
    f_in.close()
The code shows correct results except for the possibly last big test, with n = 81220 and m = 48576 (a simulated input file is here, and an expected output created by a naive solution is here). The time limit is 1 sec, within which I can't solve the problem. So is it possible to solve it within 1 sec with Python 3? Thank you.
UPDATE: an "expected" output file is added, which is created by the following code:
import sys
if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in
    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    with open('b_naive.out', 'w') as f_out:
        for i in range(m):
            l_i = int(sys.stdin.readline().rstrip())
            f_out.write('%d\n' % len(set(a[l_i - 1:])))
    sys.stdin = original_stdin
    f_in.close()

You'll be cutting it close, I think. On my admittedly rather old machine, the I/O alone takes 0.9 seconds per run.
An efficient algorithm, I think, will be to iterate backwards through the array, keeping track of which distinct elements you've found. When you find a new element, add its index to a list. This will therefore be a descending sorted list.
Then for each li, the index of li in this list will be the answer.
For the small sample dataset
10 10
1 2 3 4 1 2 3 4 100000 99999
1
2
3
4
5
6
7
8
9
10
The list would contain [10, 9, 8, 7, 6, 5] since when reading from the right, the first distinct value occurs at index 10, the second at index 9, and so on.
So then if li = 5, it has index 6 in the generated list, so 6 distinct values are found at indices >= li. Answer is 6
If li = 8, it has index 3 in the generated list, so 3 distinct values are found at indices >= li. Answer is 3
It's a little fiddly that the exercise numbers things 1-indexed while Python counts 0-indexed.
To find this index quickly using existing library functions, I've reversed the list and then used bisect.
import timeit
from bisect import bisect_left
def doit():
    f_in = open('b.in', 'r')
    n, m = [int(i) for i in f_in.readline().rstrip().split(' ')]
    a = [int(i) for i in f_in.readline().rstrip().split(' ')]
    found = {}
    indices = []
    # walk backwards down to index 0 inclusive, so the first element is not skipped
    for i in range(n - 1, -1, -1):
        if not a[i] in found:
            indices.append(i+1)
            found[a[i]] = True
    indices.reverse()
    length = len(indices)
    for i in range(m):
        l = int(f_in.readline().rstrip())
        index = bisect_left(indices, l)
        print(length - index)
if __name__ == "__main__":
    print(timeit.timeit('doit()', setup="from bisect import bisect_left;from __main__ import doit", number=10))
On my machine outputs 12 seconds for 10 runs. Still too slow.
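To check the idea separately from the file I/O, the backward scan plus bisect can be sketched as a pure function and run on the small sample dataset above. The function name distinct_in_suffixes is hypothetical, and the sketch makes no claim about meeting the 1 second limit.

```python
from bisect import bisect_left

def distinct_in_suffixes(a, queries):
    """For each 1-based start l in queries, count distinct values in a[l-1:]."""
    seen = set()
    first_positions = []              # 1-based positions where a new value appears, scanning right to left
    for i in range(len(a) - 1, -1, -1):
        if a[i] not in seen:
            seen.add(a[i])
            first_positions.append(i + 1)
    first_positions.reverse()         # ascending order, so bisect can be used
    total = len(first_positions)
    # Every recorded position >= l contributes one distinct value to the suffix.
    return [total - bisect_left(first_positions, l) for l in queries]

a = [1, 2, 3, 4, 1, 2, 3, 4, 100000, 99999]
print(distinct_in_suffixes(a, range(1, 11)))
# [6, 6, 6, 6, 6, 5, 4, 3, 2, 1]
```

For li = 5 this gives 6 and for li = 8 it gives 3, matching the worked example.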
