h5py: order datasets by dataset name - Python

I am creating an h5 file with 5 datasets whose names look like ['a160'], ['a1214'].
How can I make it so that the datasets are sorted by the dataset name?
For example when I do h5dump on my file I get:
HDF5 "jjjj.h5" {
GROUP "/" {
DATASET "a1214" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 1, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 160, 0, 165, 4, 2.29761, 264, 4, 1.74368, 1, 0, 17, 193, 0, 0,
(0,14): 0, 0, 0, 0, 0
}
}
DATASET "a160" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 3, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 263, 0, 262, 7, 4.90241, 201, 34, 0.348432, 1, 0, 29, 11, 0, 0,
(0,14): 0, 0, 0, 0, 0,
}
}
}
}
But I want it to be ordered by the dataset name; I need h5dump to output:
HDF5 "jjjj.h5" {
GROUP "/" {
DATASET "a160" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 3, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 263, 0, 262, 7, 4.90241, 201, 34, 0.348432, 1, 0, 29, 11, 0, 0,
(0,14): 0, 0, 0, 0, 0,
}
}
DATASET "a1214" {
DATATYPE H5T_IEEE_F32BE
DATASPACE SIMPLE { ( 1, 19 ) / ( H5S_UNLIMITED, 19 ) }
DATA {
(0,0): 160, 0, 165, 4, 2.29761, 264, 4, 1.74368, 1, 0, 17, 193, 0, 0,
(0,14): 0, 0, 0, 0, 0
}
}
}
}

By default h5dump sorts HDF5 files' groups and attributes by their names in ascending order:
-q Q, --sort_by=Q Sort groups and attributes by index Q
-z Z, --sort_order=Z Sort groups and attributes by order Z
Q - is the sort index type. It can be "creation_order" or "name" (default)
Z - is the sort order type. It can be "descending" or "ascending" (default)
The problem in this case is that "a160" is considered greater than "a1214" because that's how dictionary sorting works ('a12' < 'a16').
There's no change you can make to the internal structure of the HDF5 file to force h5dump to sort these data structures in a different order. However, you could zero-pad your names like so:
a0040
a0160
a1214
and then the standard dictionary sort will output the file the way you want.
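For example, a minimal sketch of zero-padding the names at creation time with h5py (the IDs, shapes and fill values below are placeholders):
import h5py
import numpy as np

ids = [40, 160, 1214]                          # hypothetical numeric suffixes
width = max(len(str(i)) for i in ids)          # pad to the widest suffix

with h5py.File("jjjj.h5", "w") as f:
    for i in ids:
        name = f"a{i:0{width}d}"               # "a0040", "a0160", "a1214"
        f.create_dataset(name,
                         data=np.zeros((1, 19), dtype="f4"),
                         maxshape=(None, 19))  # H5S_UNLIMITED first dimension
With the padded names, h5dump's default name sort lists the datasets in the numeric order you expect.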

Related

End of File Expected

This is the JSON file I am working with. I am new to JSON, and after doing some basic research I was able to dump a dictionary into it with some sample data as placeholders. When I try to use the file, though, it says "End of file expected json[9,1]", and I have no idea how to fix this, as most of the results I have found on this topic go way over my head. Thanks
{
"923390702359048212": [
0,
0,
0
]
}
{
"462291477964259329": [
0,
0,
0
]
}
{
"803390252265242634": [
0,
0,
0
]
}
{
"832041337968263178": [
0,
0,
0
]
}
{
"824114065445486592": [
0,
0,
0
]
}
You cannot have several separate top-level objects in a JSON file. You need to wrap them in an array.
[{
"923390702359048212": [
0,
0,
0
]
},
{
"462291477964259329": [
0,
0,
0
]
}]
You are missing commas between the bracketed sections, and the entries need an enclosing level of brackets. Since every entry is an object with a single ID key, you can also merge them into one top-level object:
{
"923390702359048212": [
0,
0,
0
],
"462291477964259329": [
0,
0,
0
]
}
Complete all the JSON like that and it will be okay.
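If you are generating the file from Python, a minimal sketch (assuming the data lives in dicts keyed by those IDs; the path "data.json" is a placeholder) is to merge everything into one dictionary and dump it once:
import json

records = {
    "923390702359048212": [0, 0, 0],
    "462291477964259329": [0, 0, 0],
}

# One top-level object -> valid JSON
with open("data.json", "w") as f:
    json.dump(records, f, indent=2)

# Parses without the "End of file expected" error
with open("data.json") as f:
    loaded = json.load(f)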

read_pickle failing stochastically

I have a dataframe that I saved to a pickle file. When I load it with read_pickle it fails with the following error on roughly 1/10th of runs:
ValueError: Level values must be unique: [Timestamp('2020-06-03 15:59:59.999999+0000', tz='UTC'), datetime.date(2020, 6, 3), datetime.date(2020, 6, 4), datetime.date(2020, 6, 5)] on level 0
What is causing this stochastic behaviour?
The issue can be reproduced with the following:
from datetime import timedelta, date
import pandas as pd
import pytz
from pandas import Timestamp
utc = pytz.UTC
data = {
"date": [
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).replace(minute=59, second=59, microsecond=999999),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date(),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=1),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
Timestamp("2020-06-03 15:00:00").replace(tzinfo=utc).date() + timedelta(days=2),
],
"status": ["in_progress", "in_progress", "done", "in_progress", "done", "in_progress", "done"],
"issue_count": [20, 18, 2, 14, 6, 10, 10],
"points": [100, 90, 10, 70, 30, 50, 50],
"stories": [0, 0, 0, 0, 0, 0, 0],
"tasks": [100, 100, 100, 100, 100, 100, 100],
"bugs": [0, 0, 0, 0, 0, 0, 0],
"subtasks": [0, 0, 0, 0, 0, 0, 0],
"assignee": ["Name", "Name", "Name", "Name", "Name", "Name", "Name"],
}
df = pd.DataFrame(data).groupby(["date", "status"]).sum()
df.to_pickle("~/failing_df.pkl")
pd.read_pickle("~/failing_df.pkl")
Try to_csv() or to_dict() instead:
# write it to csv
df.to_csv('temp.csv')
# read it from csv
df2 = pd.read_csv('temp.csv')
df2.set_index(['date', 'status'], inplace=True)
or, optionally, convert to a plain dict and pickle that:
import pickle
df_dict = df.to_dict()
# pickle the dict
with open('temp.pkl', 'wb') as f:
    pickle.dump(df_dict, f)
# unpickle it and rebuild the frame
with open('temp.pkl', 'rb') as f:
    df2 = pd.DataFrame(pickle.load(f))
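Another possible workaround, as a sketch rather than a definitive fix: the 'date' column in the repro mixes a tz-aware Timestamp with plain datetime.date objects, so the first index level holds mixed types. If that mix is what makes the round-trip flaky, coercing the column to a single datetime dtype before grouping sidesteps it (the file name below is a placeholder):
import pandas as pd

df_raw = pd.DataFrame(data)   # `data` as defined in the repro above
# coerce Timestamps and datetime.date objects to one tz-aware datetime64 dtype
df_raw["date"] = pd.to_datetime(df_raw["date"], utc=True)

df = df_raw.groupby(["date", "status"]).sum()
df.to_pickle("uniform_df.pkl")
df2 = pd.read_pickle("uniform_df.pkl")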

How to add padding in a dataset to fill up to 50 items in a list and replace NaN with 0?

I have the following encoded text column in my dataset:
[182, 4]
[14, 2, 31, 42, 72]
[362, 685, 2, 399, 21, 16, 684, 682, 35, 7, 12]
I want this column to be padded to 50 items on each row, assuming no row is larger than 50 items, and wherever there is no numeric value I want a 0 to be placed.
In the example the wanted outcome would be:
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,182, 4]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14, 2, 31, 42, 72]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,362, 685, 2, 399, 21, 16, 684, 682, 35, 7, 12]
Try this:
>>> y=[182,4]
>>> ([0]*(50-len(y))+y)
Assuming you parsed the lists from the string columns already, a very basic approach could be as follows:
a = [182, 4]
b = [182, 4, 'q']

def check_numeric(element):
    # assuming only integers are valid numeric values
    try:
        element = int(element)
    except ValueError:
        element = 0
    return element

def replace_nonnumeric(your_list):
    return [check_numeric(element) for element in your_list]

# change the desired length to your needs (change 15 to 50)
def fill_zeros(your_list, desired_length=15):
    prepend = (desired_length - len(your_list)) * [0]
    result = prepend + your_list
    return result

aa = replace_nonnumeric(a)
print(fill_zeros(aa))

bb = replace_nonnumeric(b)
print(fill_zeros(bb))
This code outputs:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 182, 4] # <-- aa
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 182, 4, 0] # <-- bb
However, I suggest using this code as a basis and adapting it to your needs.
Especially when parsing a lot of entries from the "list as strings" column, writing a parsing function and calling it via pandas' .apply() would be a nice approach, as sketched below.
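A small sketch of that pandas route (the column name "encoded_text" is an assumption about your data; the rows are the examples from the question):
import ast
import pandas as pd

df = pd.DataFrame({"encoded_text": ["[182, 4]",
                                    "[14, 2, 31, 42, 72]"]})

def pad_row(text, desired_length=50):
    values = ast.literal_eval(text)                     # "[182, 4]" -> [182, 4]
    values = [v if isinstance(v, (int, float)) else 0   # non-numeric -> 0
              for v in values]
    return [0] * (desired_length - len(values)) + values

df["padded"] = df["encoded_text"].apply(pad_row)
print(df["padded"].iloc[0])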

How can I multiply list items in a dict with another list in Python

I have a dictionary with player names and their points, and what I need to do is multiply each list item by the corresponding coefficient from another list, resulting in a new list of multiplied points:
points = {"mark": [650, 400, 221, 0, 3], "bob": ([240, 300, 5, 0, 0], [590, 333, 20, 30, 0]), "james": [789, 201, 0, 0, 1]}
coefficients = [5, 4, 3, 2, 1]
So for example for Mark:
player_points = [650*5, 400*4, 221*3, 0*2, 3*1]
And for Bob:
player_points = [240*5, 300*4, 5*3, 0*2, 0*1], [590*5, 333*4, 20*3, 30*2, 0*1]
What I tried was the following but it didn't work whatsoever:
def calculate_points(points, coefficients):
    i = 0
    for coefficient in coefficients:
        player_points = coefficient * points[i]
        i += 1
    return player_points

def main():
    points = {"mark": [650, 400, 221, 0, 3],
              "bob": ([240, 300, 5, 0, 0], [590, 333, 20, 30, 0]),
              "james": [789, 201, 0, 0, 1]}
    coefficients = [5, 4, 3, 2, 1]
    player_points = calculate_points(points, coefficients)
    print(player_points)

main()
For list multiplication you can do:
player_point = [i*j for i, j in zip(points['mark'], coefficients)]
So if you want a player_points dictionary:
player_points = {}
for name in points.keys():
    player_points[name] = [i*j for i, j in zip(points[name], coefficients)]
Here is code that works using a for loop:
points = {"mark" : [650, 400, 221, 0, 3], "bob" : [240, 300, 5, 0, 0],"joe" : [590, 333, 20, 30, 0], "james" : [789, 201, 0, 0, 1]}
coefficients = [5, 4, 3, 2, 1]
for element in points:
    player_points = []
    for i in range(len(points.get(element))):
        player_points.append(points.get(element)[i] * coefficients[i])
    print(player_points)
This will give the output of
[3250,1600,663,0,3]
[1200,1200,15,0,0]
[2950,1332,60,60,0]
[3945,804,0,0,1]
Your data structure is irregular, which makes processing it much harder than it needs to be. If all the dictionary values were tuples, a simple dictionary comprehension could be used. As it is, you sometimes have a list and sometimes a tuple, which forces the code to detect the type.
Here's how it would work if the structure was consistent (i.e. tuples for all values)
points = { "mark" : ([650, 400, 221, 0, 3],),
"bob" : ([240, 300, 5, 0, 0], [590, 333, 20, 30, 0]),
"james" : ([789, 201, 0, 0, 1],)
}
coefficients = [5, 4, 3, 2, 1]
player_points = { pl:tuple([p*c for p,c in zip(pt,coefficients)] for pt in pts)
for pl,pts in points.items() }
print(player_points)
{
'mark' : ([3250, 1600, 663, 0, 3],),
'bob' : ([1200, 1200, 15, 0, 0], [2950, 1332, 60, 60, 0]),
'james': ([3945, 804, 0, 0, 1],)
}
If you don't want to adjust your structure, you'll need a function that handles the inconsistency:
points = { "mark" : [650, 400, 221, 0, 3],
"bob" : ([240, 300, 5, 0, 0], [590, 333, 20, 30, 0]),
"james" : [789, 201, 0, 0, 1]
}
coefficients = [5, 4, 3, 2, 1]
def applyCoeffs(pts, coeffs):
    if isinstance(pts, list):
        return [p*c for p, c in zip(pts, coeffs)]
    else:
        return tuple(applyCoeffs(pt, coeffs) for pt in pts)
player_points = { pl: applyCoeffs(pts,coefficients) for pl,pts in points.items() }
print(player_points)
{
'mark' : [3250, 1600, 663, 0, 3],
'bob' : ([1200, 1200, 15, 0, 0], [2950, 1332, 60, 60, 0]),
'james': [3945, 804, 0, 0, 1]
}

Deleted documents when using Elasticsearch API from Python

I'm relatively new to Elasticsearch and am having a problem determining why the number of records in a Python dataframe is different from the index's document count in Elasticsearch.
I start by creating a dataframe with 62932 records and then index it into Elasticsearch from Python.
When I check the index in Kibana under Management/Index Management there are only 62630 documents. According to the Stats window there is a deleted count of 302, and I don't know what this means.
Below is the output from the STATS window
{
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"stats": {
"uuid": "egOx_6EwTFysBr0WkJyR1Q",
"primaries": {
"docs": {
"count": 62630,
"deleted": 302
},
"store": {
"size_in_bytes": 4433722
},
"indexing": {
"index_total": 62932,
"index_time_in_millis": 3235,
"index_current": 0,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 140,
"query_time_in_millis": 1178,
"query_current": 0,
"fetch_total": 140,
"fetch_time_in_millis": 1233,
"fetch_current": 0,
"scroll_total": 1,
"scroll_time_in_millis": 6262,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 2,
"total_time_in_millis": 417,
"total_docs": 62932,
"total_size_in_bytes": 4882755,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 0,
"total_auto_throttle_in_bytes": 20971520
},
"refresh": {
"total": 26,
"total_time_in_millis": 597,
"external_total": 24,
"external_total_time_in_millis": 632,
"listeners": 0
},
"flush": {
"total": 1,
"periodic": 0,
"total_time_in_millis": 10
},
"warmer": {
"current": 0,
"total": 23,
"total_time_in_millis": 0
},
"query_cache": {
"memory_size_in_bytes": 17338,
"total_count": 283,
"hit_count": 267,
"miss_count": 16,
"cache_size": 4,
"cache_count": 4,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 2,
"memory_in_bytes": 22729,
"terms_memory_in_bytes": 17585,
"stored_fields_memory_in_bytes": 2024,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 512,
"points_memory_in_bytes": 2112,
"doc_values_memory_in_bytes": 496,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {}
},
"translog": {
"operations": 62932,
"size_in_bytes": 17585006,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 0
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
},
"total": {
"docs": {
"count": 62630,
"deleted": 302
},
"store": {
"size_in_bytes": 4433722
},
"indexing": {
"index_total": 62932,
"index_time_in_millis": 3235,
"index_current": 0,
"index_failed": 0,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 140,
"query_time_in_millis": 1178,
"query_current": 0,
"fetch_total": 140,
"fetch_time_in_millis": 1233,
"fetch_current": 0,
"scroll_total": 1,
"scroll_time_in_millis": 6262,
"scroll_current": 0,
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 2,
"total_time_in_millis": 417,
"total_docs": 62932,
"total_size_in_bytes": 4882755,
"total_stopped_time_in_millis": 0,
"total_throttled_time_in_millis": 0,
"total_auto_throttle_in_bytes": 20971520
},
"refresh": {
"total": 26,
"total_time_in_millis": 597,
"external_total": 24,
"external_total_time_in_millis": 632,
"listeners": 0
},
"flush": {
"total": 1,
"periodic": 0,
"total_time_in_millis": 10
},
"warmer": {
"current": 0,
"total": 23,
"total_time_in_millis": 0
},
"query_cache": {
"memory_size_in_bytes": 17338,
"total_count": 283,
"hit_count": 267,
"miss_count": 16,
"cache_size": 4,
"cache_count": 4,
"evictions": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 2,
"memory_in_bytes": 22729,
"terms_memory_in_bytes": 17585,
"stored_fields_memory_in_bytes": 2024,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 512,
"points_memory_in_bytes": 2112,
"doc_values_memory_in_bytes": 496,
"index_writer_memory_in_bytes": 0,
"version_map_memory_in_bytes": 0,
"fixed_bit_set_memory_in_bytes": 0,
"max_unsafe_auto_id_timestamp": -1,
"file_sizes": {}
},
"translog": {
"operations": 62932,
"size_in_bytes": 17585006,
"uncommitted_operations": 0,
"uncommitted_size_in_bytes": 55,
"earliest_last_modified_age": 0
},
"request_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
},
"recovery": {
"current_as_source": 0,
"current_as_target": 0,
"throttle_time_in_millis": 0
}
}
}
}
Why does the doc count differ from the index total? I've exported the data, and the number of records matches the doc count. How can I find out why documents were deleted and make sure they are not deleted in the future?
Deleted documents have the following consequences (from the links below):
Deleted documents tie up disk space in the index.
In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
Search throughput is lower, since each search must check the deleted bitset for every potential hit (more on this in the linked blog post).
Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
Fuzzy queries can have slightly different results, because they may match ghost terms.
https://www.elastic.co/guide/en/elasticsearch/reference/current//cat-indices.html
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
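To keep an eye on the gap between indexed and live documents from Python, here is a small sketch using the official elasticsearch client (the index name "my_index" and the connection URL are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.indices.stats(index="my_index")
docs = stats["indices"]["my_index"]["primaries"]["docs"]
print(docs["count"], "live documents,", docs["deleted"], "flagged as deleted")
A docs.deleted count combined with delete_total = 0, as in the stats above, is typically what you see when some documents were indexed more than once with the same _id: the newer copy replaces the older one, and the older one stays flagged as deleted until a segment merge reclaims it.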
