I am currently working on building a dictionary with a tuple of names as keys and a float as the value, of the form {(nameA, nameB): datavalue, (nameB, nameC): datavalue, ...}.
The values come from a matrix I have loaded into a pandas DataFrame with the names as both the index and the column labels. I have created an ordered list of the keys for my final dictionary, called keys, with the function createDictionaryKeys(). The issue is that not all the names from this list appear in my data matrix, and I want my final dictionary to include only the names that do appear.
How can I do this search while avoiding a slow linear for loop? I have also created a dictionary, called allow_dict, that maps each name to 1 if it should be included and 0 otherwise, of the form {nameA: 1, nameB: 0, ...}. I was hoping to use this to do some sort of hash search.
def createDictionary(keynamefile, seperator, datamatrix, matrixsep):
    import pandas as pd
    keys = createDictionaryKeys(keynamefile, seperator)
    final_dict = {}
    data_df = pd.read_csv(datamatrix, sep=matrixsep)
    pd.set_option("display.max_rows", len(data_df))
    df_indices = list(data_df.index.values)
    df_cols = list(data_df.columns.values)[1:]
    for i in df_indices:
        data_df = data_df.rename(index={i: df_cols[i]})
    data_df = data_df.drop(columns="Unnamed: 0")  # drop(label, 1) with a positional axis is deprecated
    allow_dict = descriminatePromoters(HARDCODEDFILENAME, SEP, THRESHOLD)
    #print(next(item for item in df_cols if allow_dict[item] == 0))
    present = [x for x in keys if x[0] in df_cols and x[1] in df_cols]
    for i in present:
        final_dict[i] = data_df.loc[i[0], i[1]]  # was final_df, an undefined name
    return final_dict
Membership testing in Python sets is O(1), so building the set once and filtering against it:
colset = set(df_cols)
present = [x for x in keys if x[0] in colset and x[1] in colset]
...should give you some speedup (note the set is built once, outside the comprehension, so it is not rebuilt for every key). Since you're iterating through keys in O(n) anyway (and have to, to construct your final_dict), a dict comprehension like:
colset = set(df_cols)
final_dict = {k: data_df.loc[k[0], k[1]]
              for k in keys
              if k[0] in colset and k[1] in colset}
would be cleaner still.
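A minimal, self-contained sketch of the set-filtering idea (the names and matrix values here are toy stand-ins, not the asker's data):

```python
import pandas as pd

# Toy stand-in for the asker's data: names label both rows and columns
names = ["A", "B", "C"]
df = pd.DataFrame([[0.0, 0.5, 0.2],
                   [0.5, 0.0, 0.9],
                   [0.2, 0.9, 0.0]], index=names, columns=names)

keys = [("A", "B"), ("B", "C"), ("A", "Z")]  # ("A", "Z") is absent from the matrix

colset = set(df.columns)  # build the set once, outside the comprehension
final_dict = {k: df.loc[k[0], k[1]]
              for k in keys if k[0] in colset and k[1] in colset}
# ("A", "Z") is dropped; the remaining pairs map to their matrix values
```

Each membership test against colset is O(1), so the whole filter is linear in the number of keys rather than quadratic.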
I'm importing a CSV to a dictionary, where there are a number of houses labelled (i.e. 1A, 1B, ...).
Rows are labelled with items such as 'coffee', etc. The table contains data indicating how much of each item each household needs.
Excel screenshot
What I am trying to do is check the values of the key-value pairs in the dictionary for anything that isn't blank (containing either 1 or 2), and then take the key-value pair and the 'PRODUCT NUMBER' (from the csv) and append those to a new list.
I want to create a shopping list that will contain which item I need, in what quantity, for which household.
The column containing 'week' is not important for this.
I import the CSV into python as a dictionary like this:
import csv
import pprint
from typing import List, Dict

input_file_1 = csv.DictReader(open("DATA CWK SHOPPING DATA WEEK 1 FILE B.xlsb.csv"))
table: List[Dict[str, int]] = []  # list
for row in input_file_1:
    string_row: Dict[str, int] = {}  # dictionary
    for column in row:
        string_row[column] = row[column]
    table.append(string_row)
I found on 'geeksforgeeks' how to access a pair by its value. However, when I try this on my dictionary, it only seems to be able to find the last row.
# creating a new dictionary
my_dict ={"java":100, "python":112, "c":11}
# list out keys and values separately
key_list = list(my_dict.keys())
val_list = list(my_dict.values())
# print key with val 100
position = val_list.index(100)
print(key_list[position])
I also tried a for ... in range loop, but that didn't seem to work either:
for row in table:
    if row["PRODUCT NUMBER"] == '1' and row["Week"] == '1':
        for i in range(8):
            if string_row.values() != ' ':
                print(row[i])
If I am unclear anywhere, please let me know and I will clear it up!
Here is a loop that should do what you want. Note that table is a list of row dicts, so work on one row at a time rather than calling .values() on the list itself (shown here for the first row):
row = table[0]              # table is a list of dicts; pick one row
values = list(row.values())
keys = list(row.keys())
new_table = {}
index = -1
for i in range(values.count("")):
    index = values.index("", index + 1)
    new_table[keys[index]] = values[index]
If you want to remove those blank values from the original row dict, you can add
row.pop(keys[index]) inside the loop.
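For the shopping list the question actually asks for, a sketch along these lines may be closer. It assumes (based on the question, not the actual file) that each row dict has a 'PRODUCT NUMBER' key plus one key per household whose value is blank when nothing is needed:

```python
# Toy stand-in for the asker's table: column names '1A', '1B' and the
# 'PRODUCT NUMBER' / 'Week' keys are assumptions based on the question text.
table = [
    {"PRODUCT NUMBER": "1", "Week": "1", "1A": "2", "1B": ""},
    {"PRODUCT NUMBER": "2", "Week": "1", "1A": "", "1B": "1"},
]

shopping_list = []
for row in table:
    for house, qty in row.items():
        if house in ("PRODUCT NUMBER", "Week"):
            continue  # skip the non-household columns
        if qty.strip():  # keep only non-blank quantities
            shopping_list.append((row["PRODUCT NUMBER"], house, qty))
# shopping_list now holds (product, household, quantity) triples
```

Each entry says which product, which household, and how much, which is the (item, quantity, household) triple the question describes.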
Create the pandas DataFrame
My data frame is like this:
data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])
Expected output JSON
I want to group by the CONSTRAINT_ID column, with natural numbers (an index) as the keys. The LEFT_SIDE_TYPE column values should be collected into a list:
{
    "1": {"CONSTRAINT_NODE_ID": [1],
          "product_grain": False,
          "left_side_type": ["Variable_1", "Variable_2"]
         },
    "2": {"CONSTRAINT_NODE_ID": [2],
          "product_grain": False,
          "left_side_type": ["Variable_3"]
         }
}
It is likely not the most efficient solution; however, provided a df in the format specified in your original question, the function below returns a str containing a valid JSON string with the desired structure and values.
It filters the df by CONSTRAINT_ID, iterating over each unique value and creating a JSON object with keys 1...n and the desired values in the response variable. The implementation uses sets to collect values during iteration, avoiding duplicates, and converts them to lists before adding them to the response.
import json

def generate_response(df):
    response = dict()
    constraints = df['CONSTRAINT_ID'].unique()
    for i, c in enumerate(constraints):
        temp = {'CONSTRAINT_NODE_ID': set(), 'PRODUCT_GRAIN': None, 'LEFT_SIDE_TYPE': set()}
        for _, row in df[df['CONSTRAINT_ID'] == c].iterrows():
            # cast to a plain int so json.dumps can serialize it
            temp['CONSTRAINT_NODE_ID'].add(int(row['CONSTRAINT_NODE_ID']))
            temp['PRODUCT_GRAIN'] = row['PRODUCT_GRAIN']
            temp['LEFT_SIDE_TYPE'].add(row['LEFT_SIDE_TYPE'])
        temp['CONSTRAINT_NODE_ID'] = list(temp['CONSTRAINT_NODE_ID'])
        temp['LEFT_SIDE_TYPE'] = list(temp['LEFT_SIDE_TYPE'])
        response[str(i + 1)] = temp
    return json.dumps(response, indent=4)
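For comparison, a short self-contained sketch of the same grouping using pandas groupby (toy data taken from the question; this is an alternative approach, not the answer's function):

```python
import json
import pandas as pd

data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID',
                                 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])

response = {}
# groupby yields one (key, sub-frame) pair per unique CONSTRAINT_ID, in sorted order
for i, (_, group) in enumerate(df.groupby('CONSTRAINT_ID'), start=1):
    response[str(i)] = {
        'CONSTRAINT_NODE_ID': group['CONSTRAINT_NODE_ID'].unique().tolist(),
        'PRODUCT_GRAIN': group['PRODUCT_GRAIN'].iloc[0],
        'LEFT_SIDE_TYPE': group['LEFT_SIDE_TYPE'].tolist(),
    }
print(json.dumps(response, indent=4))
```

The .tolist() calls convert numpy scalars to plain Python values, which keeps json.dumps happy.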
So I was trying to split a list of values into a dataframe in Python.
Here is a sample of my list:
ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range=low'"
templist = []
for i in range(5):
    templist.append(ini_string1)
Now I am trying to create a dataframe with Time, Strangeness, P-Values, Deviation, D_Range as the variables.
I was able to get a data frame when I had only a single value of ini_string, but could not make it work with a list of values.
Below is a sample of the code I tried with a single ini_string value:
lst_dict = []
cols = ['Time', 'Strangeness', 'P-Values', 'Deviation', 'Is_Deviation']

# Initialising string
for i in range(5):
    ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range=low'"
    tempstr = ini_string1
    res = dict(item.split("=") for item in tempstr.split(","))
    lst_dict.append({'Time': res['Time'],
                     'Strangeness': res['strangeness'],
                     'P-Values': res['p-value'],
                     'Deviation': res['deviation'],
                     'Is_Deviation': res['D_Range']})
print(lst_dict)
strdf = pd.DataFrame(lst_dict, columns=cols)
I could not figure out the implementation for a list of values.
The below code will do the job.
from collections import defaultdict
import pandas as pd

ini_string1 = "Time=2014-11-07 00:00:00,strangeness=0.0001,p-value=0.19,deviation=0.78,D_Range='low'"
ini_string2 = "Time=2015-12-07 00:00:00,strangeness=0.0005,p-value=0.31,deviation=0.01,D_Range='high'"
ini_strings = [ini_string1, ini_string2]

dd = defaultdict(list)
for ini_str in ini_strings:
    for key_val in ini_str.split(','):
        k, v = key_val.split('=')
        dd[k].append(v)

df = pd.DataFrame(dd)
Read more about defaultdict - How does collections.defaultdict work?
Python has other interesting data structures - https://docs.python.org/2/library/collections.html
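For reference, the defaultdict(list) pattern used above, shown in isolation with toy key-value pairs:

```python
from collections import defaultdict

dd = defaultdict(list)  # missing keys start out as empty lists
for k, v in [('a', 1), ('b', 2), ('a', 3)]:
    dd[k].append(v)     # no KeyError on the first access of a new key
# dict(dd) == {'a': [1, 3], 'b': [2]}
```

This is why the answer above can append column values without first checking whether the column name is already a key.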
Hi, I wrote some code that builds a default dictionary:
def makedata(filename):
    with open(filename, "r") as file:
        for x in features:
            previous = []
            count = 0
            for line in file:
                var_name = x
                regexp = re.compile(var_name + r'.*?([0-9.-]+)')
                match = regexp.search(line)
                if match and (match.group(1)) != previous:
                    previous = match.group(1)
                    count += 1
                    if count > wlength:
                        count = 1
                    target = str(str(count) + x)
                    dict.setdefault(target, []).append(match.group(1))
            file.seek(0)
    df = pd.DataFrame.from_dict(dict)
The dictionary looks good, but when I try to convert it to a dataframe, the dataframe is empty. I can't figure out why.
dict:
{'1meanSignalLenght': ['0.5305184', '0.48961428', '0.47203177', '0.5177274'], '1amplCor': ['0.8780955002105448', '0.8634431017504487', '0.9381169983046714', '0.9407036427333355'], '1metr10.angle1': ['0.6439386643584522', '0.6555194964997434', '0.9512436169922103', '0.23789348400794422'], '1syncVar': ['0.1344131181025432', '0.08194580887223515', '0.15922251165913678', '0.28795644612520327'], '1linVelMagn': ['0.07062673289287498', '0.08792496681784517', '0.12603999663935528', '0.14791253129369603'], '1metr6.velSum': ['0.17850601560734558', '0.15855169971072014', '0.21396496345720045', '0.2739525279330513']}
df:
Empty DataFrame
Columns: []
Index: []
{}
I think part of your issue is that you are using the built-in name dict as if it were a variable.
Make a dictionary in your function and call it something other than dict. Have your function return that dictionary, and build the dataframe from the return value. Right now you are creating a data frame from the dict type itself rather than from a dictionary object holding your data.
df = pd.DataFrame(d)
This should make a dataframe from a dictionary d.
You can either pass a list of dicts using pd.DataFrame(list_of_dicts) (use pd.DataFrame([d]) if your variable is a single dict rather than a list), or a dict of lists using pd.DataFrame.from_dict(d). In the latter case d should be something like d = {"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ...}.
see: Pandas Dataframe from dict with empty list value
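A quick sketch of both constructors with toy data, showing that they produce the same frame:

```python
import pandas as pd

# From a list of dicts: each dict becomes one row
df1 = pd.DataFrame([{'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}])

# From a dict of lists: each list becomes one column
df2 = pd.DataFrame.from_dict({'a': [1, 2], 'b': ['x', 'y']})
```

Both give a two-row frame with columns 'a' and 'b'; the difference is only the orientation of the input.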
I have a list of dictionaries read in from csv DictReader that represent rows of a csv file:
rows = [{"id":"123","date":"1/1/18","foo":"bar"},
{"id":"123","date":"2/2/18", "foo":"baz"}]
I would like to create a new dictionary where only unique IDs are stored, keeping only the row entry with the most recent date. Based on the above example, it would keep the row with date 2/2/18.
I was thinking of doing something like the following, but am having trouble translating the pseudocode in the else branch into actual Python.
I can figure out how to check which of two dates is more recent, but I'm stuck on how to find the dictionary in the new list with the same id and then retrieve the date from that row.
Note: Unfortunately, due to resource constraints on our platform I am unable to use pandas for this project.
new_data = []
for row in rows:
    if row['id'] not in new_data:
        new_data.append(row)
    else:
        # check the element in new_data with the same id as row['id']
        # if that element's date value is less recent:
        #     replace it with the current row
        # else:
        #     continue to the next row in rows
You'll need a function to convert your date (as string) to a date (as date).
import datetime

def to_date(date_str):
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    return datetime.date(y1, m1, d1)
I assumed your date format is d/m/yy. Consider using datetime.strptime to parse your dates, as illustrated by Alex Hall's answer.
Then, the idea is to loop over your rows and store them in a new structure (here, a dict whose keys are the IDs). If a key already exists, compare its date with the current row, and take the right one. Following your pseudo-code, this leads to:
rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

new_data = dict()
for row in rows:
    existing = new_data.get(row['id'], None)
    if existing is None or to_date(existing['date']) < to_date(row['date']):
        new_data[row['id']] = row
If you want your new_data variable to be a list, use new_data = list(new_data.values()).
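Putting the two snippets above together as a single runnable check (same toy rows as the question; note the d/m/yy strings parse to the literal year 18, which is still fine for comparing these dates against each other):

```python
import datetime

def to_date(date_str):
    # d/m/yy parsing as above; two-digit years become the literal year number
    d1, m1, y1 = [int(s) for s in date_str.split('/')]
    return datetime.date(y1, m1, d1)

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

new_data = {}
for row in rows:
    existing = new_data.get(row['id'])
    if existing is None or to_date(existing['date']) < to_date(row['date']):
        new_data[row['id']] = row
# new_data keeps exactly one row per id: the one with the most recent date
```

Only the 2/2/18 row survives for id "123", as the question requires.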
import datetime

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

def parse_date(d):
    return datetime.datetime.strptime(d, "%d/%m/%y").date()

tmp_dict = {}
for row in rows:
    if row['id'] not in tmp_dict:
        tmp_dict[row['id']] = row  # key by the actual id, not the literal string 'id'
    else:
        if parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']):
            tmp_dict[row['id']] = row

print(list(tmp_dict.values()))
output
[{'date': '2/2/18', 'foo': 'baz', 'id': '123'}]
Note: you can merge the two ifs into if row['id'] not in tmp_dict or parse_date(row['date']) > parse_date(tmp_dict[row['id']]['date']) for cleaner and shorter code.
Firstly, work with proper date objects, not strings. Here is how to parse them:
from datetime import datetime, date

rows = [{"id": "123", "date": "1/1/18", "foo": "bar"},
        {"id": "123", "date": "2/2/18", "foo": "baz"}]

for row in rows:
    row['date'] = datetime.strptime(row['date'], '%d/%m/%y').date()
(check if the format is correct)
Then for the actual task:
new_data = {}
for row in rows:
    # keeps the most recent date seen for each id
    new_data[row['id']] = max(new_data.get(row['id'], date.min),
                              row['date'])

print(list(new_data.values()))
Alternatively:
Here are some generic utility functions that work well here which I use in many places:
from collections import defaultdict

def group_by_key_func(iterable, key_func):
    """
    Create a dictionary from an iterable such that the keys are the result of
    evaluating a key function on elements of the iterable and the values are
    lists of elements all of which correspond to the key.
    """
    result = defaultdict(list)
    for item in iterable:
        result[key_func(item)].append(item)
    return result

def group_by_key(iterable, key):
    return group_by_key_func(iterable, lambda x: x[key])
Then the solution can be written as:
by_id = group_by_key(rows, 'id')
for id_num, group in list(by_id.items()):
    by_id[id_num] = max(group, key=lambda r: r['date'])

print(list(by_id.values()))
This is less efficient than the first solution because it creates lists along the way that are discarded, but I use the general principles in many places and I thought of it first, so here it is.
If you like to utilize classes as much as I do, then you could make your own class to do this:
from datetime import date

rows = [
    {"id": "123", "date": "1/1/18", "foo": "bar"},
    {"id": "123", "date": "2/2/18", "foo": "baz"},
    {"id": "456", "date": "3/3/18", "foo": "bar"},
    {"id": "456", "date": "1/1/18", "foo": "bar"}
]

class unique(dict):
    def __setitem__(self, key, value):
        # Add key if missing, or replace the stored row if this date is newer
        if key not in self or self[key]["date"] < value["date"]:
            dict.__setitem__(self, key, value)

data = unique()  # Initialize new class based on dict
for row in rows:
    d, m, y = map(int, row["date"].split('/'))  # Split date into parts
    row["date"] = date(y, m, d)                 # Replace date string with a date object
    data[row["id"]] = row  # Will overwrite same ids with more recent dates

print(list(data.values()))
Outputs:
[
{'date': datetime.date(18, 2, 2), 'foo': 'baz', 'id': '123'},
{'date': datetime.date(18, 3, 3), 'foo': 'bar', 'id': '456'}
]
Keep in mind that data is a dict subclass whose overridden __setitem__ method uses IDs as keys and keeps only the newest row per ID. And the dates are date objects, so they can be compared easily.