Clustering keys based on list of values - python

If we have an input dictionary in the format {key: [list]} like below
Key1: [value01, value02, value03]
Key2: [value02, value04, value05]
Key3: [value01, value03, value07]
Key4: [value01, value03, value04]
The values are strings.
Is there a way to group/cluster the keys based on the similarity between their value lists in Python?
Consider this as a huge data set of 1M keys, where I want to group keys that share two or more common values. In the above example, I can group the keys into two sets: (Key1, Key3, Key4) and (Key2). Is there a graph-based method to do that, instead of using for loops?
I tried for loops, but they took hours to complete on the large data set.
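One graph-style way to avoid an all-pairs loop is an inverted index (value → keys) plus union-find: only keys that co-occur under some value are ever compared. A minimal sketch, where the `group_keys` name and the `min_common=2` threshold are illustrative choices (very frequent values still make the pair counting expensive, so a popularity cap may be needed at real scale):

```python
from collections import defaultdict
from itertools import combinations

def group_keys(data, min_common=2):
    """Group keys whose value lists share at least `min_common` items,
    using an inverted index plus union-find instead of an all-pairs loop."""
    # Union-find over the keys
    parent = {k: k for k in data}
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Inverted index: value -> keys containing it
    index = defaultdict(list)
    for key, values in data.items():
        for v in set(values):
            index[v].append(key)

    # Count shared values only for key pairs that co-occur under some value
    overlap = defaultdict(int)
    for keys in index.values():
        for a, b in combinations(sorted(keys), 2):
            overlap[(a, b)] += 1

    for (a, b), n in overlap.items():
        if n >= min_common:
            union(a, b)

    groups = defaultdict(set)
    for k in data:
        groups[find(k)].add(k)
    return list(groups.values())

data = {'Key1': ['value01', 'value02', 'value03'],
        'Key2': ['value02', 'value04', 'value05'],
        'Key3': ['value01', 'value03', 'value07'],
        'Key4': ['value01', 'value03', 'value04']}
groups = group_keys(data)  # -> {Key1, Key3, Key4} and {Key2}
```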

Related

What is the best way to create a dictionary using unique column values, and the corresponding range?

I am trying to create a dictionary whose keys are the alphabetized, unique values of a column, and whose values are the corresponding positions in that range.
So for example, if I have a column called "States" that we expect to have 50 unique values, there would be 50 keys, each containing a state. I would want the corresponding value to be 1 for the first key and 50 for the last key.
The dictionary would look like this: {'AL': 1, 'AK': 2, ..., 'WV': 49, 'WY': 50}
I've tried something as follows -
mapper = {df.Statee.unique().tolist()[i]:i for i in range(1, len(df.State.unique().tolist()+1))}
but that doesn't work.
Something like this strikes me as most readable:
uniqs = df.State.unique()
mapper = {k: v + 1 for v, k in enumerate(uniqs)}
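For instance, with a toy frame (sorting first so the keys come out alphabetized, as the question asks; `unique()` alone keeps first-appearance order):

```python
import pandas as pd

df = pd.DataFrame({'State': ['WY', 'AL', 'AK', 'AL']})
uniqs = sorted(df.State.unique())          # alphabetize the unique values
mapper = {k: v + 1 for v, k in enumerate(uniqs)}
# mapper -> {'AK': 1, 'AL': 2, 'WY': 3}
```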

Is it possible to get data by value in Redis?

I already know how to find a row by key in Redis, but I wonder if it's possible to find a row by one of its values. For example, my row's data is {"1", "A", "B"} and I want to find the row by "A" or "B", not by "1" (the first column is the key in this case), using Python.
Redis has nothing out of the box for this. You can create a secondary index in Redis on the value. It comes at a cost though - you need more memory to store the index.
You can build your own 'value index'.
For example, you can add a second key of type sorted set under the key A, with members 1, 2, 3, 4 and scores 1, 2, 3, 4; you can use your own score, such as a timestamp.
Then, when you want the first ten primary keys with value A, use this:
zrangebyscore A -inf +inf limit 0 10
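The same pattern can be sketched from Python with the redis-py client. The function names below are illustrative, and `client` is assumed to be a `redis.Redis` instance (any object exposing `rpush`, `zadd`, and `zrangebyscore` works):

```python
import time

def index_row(client, row_key, values):
    """Store a row as a list and index each value in a per-value sorted set,
    scored by timestamp (a common secondary-index pattern in Redis)."""
    client.rpush(f"row:{row_key}", *values)
    now = time.time()
    for v in values:
        client.zadd(f"idx:{v}", {row_key: now})

def rows_with_value(client, value, limit=10):
    """First `limit` primary keys whose rows contain `value`,
    i.e. ZRANGEBYSCORE idx:<value> -inf +inf LIMIT 0 <limit>."""
    return client.zrangebyscore(f"idx:{value}", "-inf", "+inf",
                                start=0, num=limit)
```

The extra memory for the sorted sets is the cost of the index, as noted above.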

Fastest pythonic way to loop over dictionary to create new Pandas column

I have a dictionary "c" with 30000 keys and around 600000 unique values (around 20 unique values per key)
I want to create a new pandas column 'DOC_PORTL_ID' by taking the value from each row of the column 'image_keys', looking it up among my dictionary's values, and returning the matching key. So I wrote a function like this:
def find_match(row, c):
    for key, val in c.items():
        for item in val:
            if item == row['image_keys']:
                return key
and then I use .apply to create my new column like:
df_image_keys['DOC_PORTL_ID'] = df_image_keys.apply(lambda x: find_match(x, c), axis=1)
This takes a long time. I am wondering if I can improve my snippet code to make it faster.
I googled a lot and was not able to find the best way of doing this. Any help would be appreciated.
You're using your dictionary as a reverse lookup. And frankly, you haven't given us enough information about the dictionary. Are the 600,000 values unique? If not, you're only returning the first one you find. Is that expected?
Assuming they are unique:
reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
This is as good as you've done yourself. If those values are not unique, you'll have to provide a better explanation of what you expect to happen.
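Put together with toy data (the column and variable names follow the question; the values here are made up), the reverse dictionary is built once and the lookup becomes a single vectorized `map`:

```python
import pandas as pd

# Toy version of the setup: c maps keys to lists of unique values
c = {'doc1': ['img_a', 'img_b'], 'doc2': ['img_c']}
df_image_keys = pd.DataFrame({'image_keys': ['img_c', 'img_a']})

# Invert once (O(total values)), then map each row in one pass
reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
# -> ['doc2', 'doc1']
```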

Efficiently selecting rows from pandas dataframe using sorted column

I have a large-ish pandas dataframe with multiple columns (c1 ... c8) and ~32 mil rows. The dataframe is already sorted by c1. I want to grab other column values from rows that share a particular value of c1.
something like
keys = big_df['c1'].unique()
red = np.zeros(len(keys))
for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)
    v1 = np.array(big_df.loc[inds]['c2'])
    v2 = np.array(big_df.loc[inds]['c6'])
    red[i] = reduce_fun(v1, v2)
However, this turns out to be very slow, I think because each comparison scans the entire column (even though there might be only 10 relevant rows out of 32 mil). Since big_df is sorted by c1 and keys is just the list of all unique c1 values, is there a fast way to build the red[] array? (I know the first row with the next key is the row right after the last row of the previous key, and the last row for a key is the last row that matches it, since all subsequent rows are guaranteed not to match.)
Thanks,
Ilya
Edit: I am not sure what order the unique() method produces, but I basically want a reduce_fun() value for every key in keys; I don't particularly care about the order (presumably the easiest order is the one c1 is already sorted in).
Edit 2: I slightly restructured the code. Basically, is there an efficient way of constructing inds? big_df['c1'] == key takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%, according to line_profiler.
Rather than a list, I chose a dictionary to hold the reduced values keyed on each item in c1.
red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
How about a groupby statement in a list comprehension? This should be especially efficient given the DataFrame is already sorted by c1:
Edit: Forgot that groupby returns a tuple. Oops!
red = [reduce_fun(g['c2'].values, g['c6'].values) for i, g in big_df.groupby('c1', sort=False)]
Seems to chug through pretty quickly for me (~2 seconds for 30 million random rows and a trivial reduce_fun).
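A self-contained version of the groupby approach, with a stand-in `reduce_fun` (a simple dot product, since the question doesn't show the real one) and a tiny frame in place of the 32M rows:

```python
import numpy as np
import pandas as pd

def reduce_fun(v1, v2):
    # Stand-in reducer; the question's actual reduce_fun is not shown
    return float(np.dot(v1, v2))

big_df = pd.DataFrame({
    'c1': ['a', 'a', 'b'],   # already sorted by c1, as in the question
    'c2': [1, 2, 3],
    'c6': [4, 5, 6],
})

# sort=False keeps groups in the order they appear in the sorted column
red = [reduce_fun(g['c2'].values, g['c6'].values)
       for _, g in big_df.groupby('c1', sort=False)]
# red -> [14.0, 18.0]
```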

pandas value to find in a dictionary and return key - python

In my pandas dataframe column, I need to check whether the value matches any of the words in the dictionary's value lists; if it does, I should return the corresponding key.
my_dict = {'woodhill': ["woodhill"],
           'woodcocks': ["woodcocks"],
           'whangateau': ["whangateau", "whangate"],
           'whangaripo': ["whangaripo", "whangari", "whangar"],
           'westmere': ["westmere"],
           'western springs': ["western springs", "western spring", "western sprin",
                               "western spri", "western spr", "western sp", "western s"]}
I can write a for loop for this; however, I have nearly 1.5 million records in my dataframe, the dictionary has more than 100 items, and each may have up to 20 values in some cases. How do I do this efficiently? Can I reverse the dictionary (values as keys and keys as values) to make it fast? Thanks.
You can reverse your dictionary:
reversed_dict = {val: key for key in my_dict for val in my_dict[key]}
and then map it onto your dataframe:
df = pd.DataFrame({'col1': ['western springs', 'westerns', 'whangateau', 'whangate']})
df['col1'] = df['col1'].map(reversed_dict)
Try this approach; it may help you.
1st: reverse the dictionary items (there are a limited number of items, so this is fast).
2nd: create a dataframe from the reversed dictionary (instead of searching all keys for each comparison against the dataframe, it is better to do a join, so build a dataframe for it).
3rd: make a left join from the big dataframe to the small dataframe (in this case, the dictionary).
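The three steps above can be sketched like this (dictionary trimmed to two entries for brevity; unmatched rows come back as NaN from the left join):

```python
import pandas as pd

my_dict = {'whangateau': ['whangateau', 'whangate'],
           'westmere': ['westmere']}

# 1st: reverse the dictionary items
reversed_items = [(val, key) for key, vals in my_dict.items() for val in vals]

# 2nd: build a small lookup DataFrame from the reversed items
lookup = pd.DataFrame(reversed_items, columns=['col1', 'key'])

# 3rd: left join the big frame onto the small lookup frame
df = pd.DataFrame({'col1': ['westmere', 'whangate', 'unknown']})
out = df.merge(lookup, on='col1', how='left')
# out['key'] -> 'westmere', 'whangateau', NaN
```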
