How to save a large dict with tuples as keys? - python

I have a large dict which has 3-tuples of integers as keys. I would like to save it to disk so I can read it back in quickly. Sadly, it seems I can't save it as a JSON file (which would let me use a fast JSON module such as orjson), because JSON object keys must be strings. What are my options other than pickle?
A tiny example would be:
my_dict = {
    (1, 2, 3): [4, 5, 6],
    (4, 5, 6): [7, 8, 9],
    (7, 8, 9): [10, 11, 12]
}
I have about 500,000 keys and each value list is of length 500.
I will create this data once and it will not be modified afterwards. my_dict will only ever be used as a lookup table.

You can try the pprint module. The code below saves the dictionary as a Python module, which you can then import either as a whole module or as just the dictionary object.
import pprint
my_dict = {
    (1, 2, 3): [4, 5, 6],
    (4, 5, 6): [7, 8, 9],
    (7, 8, 9): [10, 11, 12]
}
obj_str = pprint.pformat(my_dict, indent=4, compact=False)
message = f'my_dict = {obj_str}\n'
with open('data.py', 'w') as f:
    f.write(message)
Of course, you don't have to save it as a Python module; you can just save it as text or binary data and read it back into your program as an object. If you save it as text, you can parse it with eval, or with the safer ast.literal_eval, which only accepts Python literals.
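As a minimal sketch of the text round trip, assuming the dictionary contains only literals (tuples, lists, integers): the file name data.txt and the temporary directory are hypothetical choices for the demo, and ast.literal_eval stands in for eval. I have not measured how this performs at the size mentioned in the question; parsing a very large literal may be slow.

```python
import ast
import os
import pprint
import tempfile

my_dict = {
    (1, 2, 3): [4, 5, 6],
    (4, 5, 6): [7, 8, 9],
    (7, 8, 9): [10, 11, 12],
}

# Hypothetical file name, placed in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "data.txt")

# Write the dict as a plain Python literal.
with open(path, "w") as f:
    f.write(pprint.pformat(my_dict))

# Read it back; literal_eval only evaluates literals, never arbitrary code.
with open(path) as f:
    loaded = ast.literal_eval(f.read())

print(loaded[(4, 5, 6)])
```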
EDIT
Just saw you edited the question. This might be enough for 500,000 keys with 500 items each.
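Since the question mentions JSON, one possible workaround (not part of the answer above) is to encode each tuple key as a string before dumping and decode it after loading. The helper names encode_key and decode_key are hypothetical, and the sketch assumes every key is a tuple of integers. It uses the standard json module; the same string-keyed dict could presumably be handed to a faster JSON library instead.

```python
import json

def encode_key(key):
    """Turn a tuple of ints such as (1, 2, 3) into the string '1,2,3'."""
    return ",".join(map(str, key))

def decode_key(text):
    """Turn the string '1,2,3' back into the tuple (1, 2, 3)."""
    return tuple(int(part) for part in text.split(","))

my_dict = {
    (1, 2, 3): [4, 5, 6],
    (4, 5, 6): [7, 8, 9],
    (7, 8, 9): [10, 11, 12],
}

# JSON object keys must be strings, so encode the tuples first.
payload = json.dumps({encode_key(k): v for k, v in my_dict.items()})

# After loading, rebuild the tuple keys.
restored = {decode_key(k): v for k, v in json.loads(payload).items()}
print(restored[(1, 2, 3)])
```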

Related

Generic function for consecutive element pairing by n given to a function with zip

I have created a generic function that takes consecutive groups of n elements from a given list of integers and passes each group to a function. It works, but I very much dislike the eval in the function and don't know how to remove it while still using zip.
def consecutive_element_pairing(data: list[int], consecutive_element=3, map_to_func=sum) -> list[int]:
    """
    Return a list with consecutively paired items given to a function that can handle an iterable
    :param data: the list of integers to process
    :param consecutive_element: how many to group consecutively
    :param map_to_func: the function to give the groups to
    :return: the new list of consecutive grouped functioned items
    """
    if len(data) < consecutive_element:
        return []
    return list(map(map_to_func, eval("zip(%s)" % "".join((map(lambda x: "data[%d:], " % x, range(consecutive_element)))))))
given a list of e.g.:
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
and I call it like this:
print("result:", consecutive_element_pairing(values))
[6, 9, 12, 15, 18, 21, 24]
This is correct: it groups the values into consecutive groups of 3, ((1,2,3),(2,3,4),(3,4,5)...), and then sums each group.
The trouble I have with my code is the eval statement on the generated string of zip(data[0:], data[1:], data[2:], ).
I have no idea how to do this a different way as zip with a list inside does something completely different.
Can this be done differently while still using zip?
Any help is appreciated.
I know how to do this in many different ways but the challenge for myself was the usage of zip here :-) and making it a "generic" function.
You can simply use zip(*(values[i:] for i in range(N))):
Example
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
N = 3
list(zip(*(values[i:] for i in range(N))))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]
A slightly improved variant for long lists and large N might be:
zip(*(values[i:len(values)-(N-i)+1] for i in range(N)))
As a function:
def consecutive_element_pairing(data: list[int], consecutive_element=3, map_to_func=sum) -> list[int]:
    N = consecutive_element
    return list(map(map_to_func, zip(*(data[i:len(data)-(N-i)+1] for i in range(N)))))
consecutive_element_pairing(values)
# [6, 9, 12, 15, 18, 21, 24]
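If copying N slices of a long list is a concern, a possible variation is to feed zip with itertools.islice iterators instead of slices. This is only a sketch of that idea, not the code from the answer above; the names sliding_windows and pair_and_map are hypothetical, and it assumes data is a list (or another re-iterable sequence).

```python
from itertools import islice

def sliding_windows(data, n):
    """Return an iterator of consecutive n-tuples, without copying slices."""
    # Each islice starts one element later; zip stops at the shortest one.
    return zip(*(islice(data, i, None) for i in range(n)))

def pair_and_map(data, n=3, map_to_func=sum):
    """Apply map_to_func to every consecutive group of n elements."""
    return [map_to_func(window) for window in sliding_windows(data, n)]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(sliding_windows(values, 3)))
print(pair_and_map(values))
```

A list shorter than n simply produces no groups, so the explicit length check from the question is not needed here.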

Picking unique random numbers from a range [duplicate]

I need to pick out "x" number of non-repeating, random numbers out of a list. For example:
all_data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 13, 14, 15, 15]
How do I pick out a list like [2, 11, 15] and not [3, 8, 8]?
That's exactly what random.sample() does.
>>> import random
>>> random.sample(range(1, 16), 3)
[11, 10, 2]
Edit: I'm almost certain this is not what you asked, but I was pushed to include this comment: If the population you want to take samples from contains duplicates, you have to remove them first:
population = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
population = list(set(population))
samples = random.sample(population, 3)
Something like this:
from random import shuffle
all_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
shuffle(all_data)
res = all_data[:3]  # or any other number of items
Or:
from random import sample
number_of_items = 4
sample(all_data, number_of_items)
If all_data could contain duplicate entries, then remove the duplicates first and use shuffle or sample afterwards:
all_data = list(set(all_data))
shuffle(all_data)
res = all_data[:3]  # or any other number of items
Others have suggested that you use random.sample. While this is a valid suggestion, there is one subtlety that everyone has ignored:
If the population contains repeats, then each occurrence is a possible selection in the sample.
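Thus, you need to remove the repeated values first, for example by passing the list through a set:
import random
L = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
x = 3  # the number of samples that you want
random.sample(list(set(L)), x)  # random.sample needs a sequence; passing a set raises TypeError since Python 3.11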
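The deduplicate-then-sample idea from these answers can be wrapped in a small helper. This is a sketch only; the name pick_unique is hypothetical, and it uses dict.fromkeys to drop repeats while keeping the original order, which is one of several ways to do it.

```python
import random

def pick_unique(data, k):
    """Return k distinct values drawn at random from data."""
    unique_values = list(dict.fromkeys(data))  # drop repeats, keep order
    if k > len(unique_values):
        raise ValueError("not enough unique values to sample from")
    return random.sample(unique_values, k)

all_data = [1, 2, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 11, 12, 13, 14, 15, 15]
picked = pick_unique(all_data, 3)
print(picked)
```

Because repeats are removed before sampling, every distinct value is equally likely to be picked, no matter how often it appeared in the input.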
Another way. Of course, as with all the solutions, you have to be sure that there are at least 3 unique values in the original list, otherwise this loop never terminates.
import random
all_data = [1,2,2,3,4,5,6,7,8,8,9,10,11,11,12,13,14,15,15]
choices = []
while len(choices) < 3:
    selection = random.choice(all_data)
    if selection not in choices:
        choices.append(selection)
print(choices)
You can also generate a list of random choices using itertools.combinations and random.shuffle. Note that this builds every possible combination, so it is only practical for small inputs.
import itertools
import random
all_data = [1,2,2,3,4,5,6,7,8,8,9,10,11,11,12,13,14,15,15]
# Remove duplicates
unique_data = set(all_data)
# Generate a list of all combinations of three elements
list_of_three = list(itertools.combinations(unique_data, 3))
# Shuffle the list of combinations of three elements
random.shuffle(list_of_three)
print(list_of_three)
Output:
[(2, 5, 15), (11, 13, 15), (3, 10, 15), (1, 6, 9), (1, 7, 8), ...]
import random
fruits_in_store = ['apple','mango','orange','pineapple','fig','grapes','guava','litchi','almond']
print('items available in store :')
print(fruits_in_store)
my_cart = []
for i in range(4):
    # selecting a random index
    temp = int(random.random()*len(fruits_in_store))
    # adding the element at the random index to the new list
    my_cart.append(fruits_in_store[temp])
    # removing the added element from the original list
    fruits_in_store.pop(temp)
print('items successfully added to cart:')
print(my_cart)
Output:
items available in store :
['apple', 'mango', 'orange', 'pineapple', 'fig', 'grapes', 'guava', 'litchi', 'almond']
items successfully added to cart:
['orange', 'pineapple', 'mango', 'almond']
If repeated data implies that we are more likely to draw that particular value, we can't turn the list into a set right away (we would lose that information by doing so). Instead, we pick samples one by one until the set of picked values reaches size x (the number of samples that we want). Something like:
import numpy as np
data = [0, 1, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6]
x = 3
res = set()
while len(res) < x:
    res.add(int(np.random.choice(data)))  # int() keeps NumPy scalar types out of the printed set
print(res)
some outputs :
{3, 4, 5}
{3, 5, 6}
{0, 4, 5}
{2, 4, 5}
As we can see, 4 and 5 appear more frequently (I know 4 examples are not good enough statistics).
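To get a slightly better feel for the frequencies than four runs can give, the same draw-until-x-unique loop can be repeated many times and tallied. This is a rough sketch using the standard random module rather than NumPy; the function name draw_unique_weighted, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random
from collections import Counter

def draw_unique_weighted(data, x, rng):
    """Draw from data (repeats count as weight) until x distinct values are held."""
    res = set()
    while len(res) < x:
        res.add(rng.choice(data))
    return res

data = [0, 1, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6]
rng = random.Random(0)  # fixed seed so the tally is reproducible

# Count how often each value ends up in a sample of 3 unique values.
counts = Counter()
for _ in range(5000):
    counts.update(draw_unique_weighted(data, 3, rng))

for value, count in sorted(counts.items()):
    print(value, count)
```

Values that are repeated in data, such as 4, should show up in noticeably more samples than values that occur only once.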

Dataframe with fixed length (overwriting)

I wrote code that generates a large amount of data in each round, so I only need to store the data for the last 10 rounds. How can I create a dataframe that erases the oldest object when I add a new object (overwriting)? The order of observations, from old to new, should be maintained. Is there any simple function or data format to do this?
Thanks in advance!
You could use this function:
def ins(arr, item):
    if len(arr) < 10:
        arr.insert(0, item)
    else:
        arr.pop()
        arr.insert(0, item)
ex = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'a')
print(ex)
# ['a', 1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'b')
print(ex)
# ['b', 'a', 1, 2, 3, 4, 5, 6, 7, 8]
In order for this to work you MUST pass a list as argument to the function ins(), so that the new item is inserted and the 10th is removed (if there is one).
(I considered that the question is not pandas specific, but rather a way to store a maximum amount of items in an array)
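If a plain container is acceptable, the standard library's collections.deque with maxlen gives the same bounded behaviour and keeps the items ordered from old to new. This is a sketch of an alternative, not the function above; the integers stand in for whatever each round produces, and the name last_rounds is hypothetical.

```python
from collections import deque

# Keep only the 10 most recent items; older ones are dropped automatically.
last_rounds = deque(maxlen=10)

for round_number in range(1, 13):  # 12 rounds, so rounds 1 and 2 get discarded
    last_rounds.append(round_number)

print(list(last_rounds))
# [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Appending on the right keeps the oldest item at index 0 and the newest at the end, which matches the old-to-new order asked for in the question.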

Piping a pipe-delimited flat file into python for use in Pandas and Stats

I have searched a lot, but haven't found an answer to this.
I am trying to pipe in a flat file with data and put it into something Python can read and that I can do analysis with (for instance, perform a t-test).
First, I created a simple pipe delimited flat file:
1|2
3|4
4|5
1|6
2|7
3|8
8|9
and saved it as "simpledata".
Then I created a bash script in nano as
#!/usr/bin/env python
import sys
from scipy import stats
A = sys.stdin.read()
print A
paired_sample = stats.ttest_rel(A[:,0],A[:,1])
print "The t-statistic is %.3f and the p-value is %.3f." % paired_sample
Then I save the script as pairedttest.sh and run it as
cat simpledata | pairedttest.sh
The error I get is
TypeError: string indices must be integers, not tuple
Thanks for your help in advance
Are you trying to call this?:
paired_sample = stats.ttest_rel([1,3,4,1,2,3,8], [2,4,5,6,7,8,9])
If so, you can't do it the way you're trying. A is just a string when you read it from stdin, so you can't index it the way you're trying. You need to build the two lists from the string. The most obvious way is like this:
left = []
right = []
for line in A.splitlines():
    l, r = line.split("|")
    left.append(int(l))
    right.append(int(r))
print(left)
print(right)
This will output:
[1, 3, 4, 1, 2, 3, 8]
[2, 4, 5, 6, 7, 8, 9]
So you can call stats.ttest_rel(left, right)
Or to be really clever and make a (nearly impossible to read) one-liner out of it:
z = list(zip(*[map(int, line.split("|")) for line in A.splitlines()]))  # list() so this also works on Python 3, where zip is lazy
This will output:
[(1, 3, 4, 1, 2, 3, 8), (2, 4, 5, 6, 7, 8, 9)]
So you can call stats.ttest_rel(*z)
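The parsing step can also be kept in a small function so it is easy to check on its own. This is a sketch under the assumption that every non-blank line holds exactly two integers separated by the delimiter; the name parse_pairs is hypothetical, and the sample string stands in for what sys.stdin.read() would return.

```python
def parse_pairs(text, delimiter="|"):
    """Split delimited two-column text into two lists of ints."""
    left, right = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. one left by a trailing newline
        l, r = line.split(delimiter)
        left.append(int(l))
        right.append(int(r))
    return left, right

# Stand-in for the contents of the "simpledata" file.
sample = "1|2\n3|4\n4|5\n1|6\n2|7\n3|8\n8|9\n"
left, right = parse_pairs(sample)
print(left)
print(right)
```

In the actual script, the two lists would then be passed to stats.ttest_rel(left, right) as described above.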

How to define column headers when reading a csv file in Python

I have a comma separated value table that I want to read in Python. What I need to do is first tell Python not to skip the first row because that contains the headers. Then I need to tell it to read in the data as a list and not a string because I need to build an array out of the data and the first column is non-integer (row headers).
There are a total of 11 columns and 5 rows.
Here is the format of the table (except there are no row spaces):
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
w0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
w3, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Is there a way to do this? Any help is greatly appreciated!
You can use the csv module for this sort of thing. It will read in each row as a list of strings representing the different fields.
How exactly you'd want to use it depends on how you're going to process the data afterwards, but you might consider making a Reader object (from the csv.reader() function), calling next() on it once to get the first row, i.e. the headers, and then iterating over the remaining lines in a for loop.
import csv
r = csv.reader(...)
headers = next(r)  # the first row, i.e. the headers
for fields in r:
    pass  # do stuff
If you're going to wind up putting the fields into a dict, you'd use DictReader instead (and that class will automatically take the field names from the first row, so you can just construct it and use it in a loop).
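Both approaches can be sketched on a small in-memory sample. The three-column sample string below is a hypothetical stand-in for the real 11-column file, and skipinitialspace=True is an assumption made because the table in the question has a space after each comma.

```python
import csv
import io

# Hypothetical three-column sample standing in for the real file.
sample = "col1,col2,col3\nw0, 1, 2\nw1, 3, 4\n"

# csv.reader: take the header row first, then convert the numeric columns.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
headers = next(reader)
rows = [[fields[0]] + [int(value) for value in fields[1:]] for fields in reader]
print(headers)
print(rows)

# csv.DictReader: field names are taken from the first row automatically.
dict_rows = list(csv.DictReader(io.StringIO(sample), skipinitialspace=True))
print(dict_rows[0]["col2"])
```

With a real file, io.StringIO(sample) would be replaced by the opened file object.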
