How to append a key before a value in a Python dict?

I have a dict
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
Now I want to append the key before values, so my expected output is:
{'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
I can do it using a for loop like below:
for k, v in x.items():
    x.update({k: k + v})
I am looking for an efficient way of doing this, or should I stick to my current approach?

Your approach seems fine. You could also use a dictionary comprehension, for a more concise solution:
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
{k: k+v for k,v in x.items()}
# {'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}

Another way:
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
dict(((key, key + x[key]) for key in x))
>>>{'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}

For small dictionaries, the dictionary comprehension solution by @yatu is the best.
Since you mentioned that the data set is large and you would like to avoid a for loop, pandas may be worth trying.
Create a pandas dataframe from the dict 'x'.
Transform the dataframe and write the result to a new dictionary.
Code:
import pandas as pd
# Read the dictionary into a dataframe
df = pd.DataFrame(list(x.items()), columns=['Key', 'Value'])
Out[317]:
   Key    Value
0  [a]   (1234)
1  [b]   (2345)
2  [c]  (xyzad)
# The transformation only concatenates key and value, so it can be done in a single step while building the new dictionary.
y = dict(zip(df.Key, df.Key + df.Value))
Out[324]: {'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
This may be faster for large data sets, but I have not measured it and I'm not sure how to compare the timings.
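One rough way to compare timings is the standard-library timeit module. The sketch below is only an illustration: the helper names with_loop and with_comprehension and the synthetic 10,000-entry dict are made up here, and the numbers will differ from machine to machine.

```python
import timeit


def with_loop(d):
    """Build the result with an explicit for loop."""
    out = dict(d)  # copy, so the input is unchanged between timing runs
    for k, v in d.items():
        out[k] = k + v
    return out


def with_comprehension(d):
    """Build the result with a dictionary comprehension."""
    return {k: k + v for k, v in d.items()}


# Synthetic test data (an assumption, not the asker's real data)
x = {f'[{i}]': f'({i})' for i in range(10_000)}

candidates = {'loop': with_loop, 'comprehension': with_comprehension}

# Take the best of 3 repeats, each repeat calling the function 5 times
timings = {
    name: min(timeit.repeat(lambda f=func: f(x), number=5, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s for 5 calls')
```

The pandas variant could be wrapped in a function of the same shape and added to candidates to see whether it actually pays off at your data size.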

Related

Efficient way to conditionally "replace" elements in list of lists based on corresponding dataframe rows

I have a pandas dataframe with a structure like:
df =
repl_str     normal_str
1_labelled   1_text
2_labelled   2_text
4_labelled   4_text
5_labelled   5_text
7_labelled   7_text
8_labelled   8_text
And a list of lists where some of the strings in df["normal_str"] are present, but not necessarily all, like:
A = [[1_text, 3_text, 4_text], [5_text], [6_text, 8_text]]
I want to create a new list of lists B, where the string elements present in both df["normal_str"] and A are exchanged for the corresponding string in the "repl_str" column of df. The strings in A which are not present in df["normal_str"] should be left as is.
So in this case:
B = [[1_labelled, 3_text, 4_labelled], [5_labelled], [6_text, 8_labelled]].
In the actual list of lists (instead of this mock example), the inner lists greatly vary in length. I have a working solution using list comprehension, but it takes a long time to run:
[[[str_val for str_val in df['repl_str'].where(df['normal_str']==y).tolist() if str_val==str_val][0]
if [str_val for str_val in df['repl_str'].where(df['normal_str']==y).tolist() if str_val == str_val]
else y for y in x] for x in A]
Does anyone know a quicker way?
If the values in the normal_str column are all unique, you can create a dictionary that maps the normal_str column to the repl_str column and look each string up in it:
A = [['1_text', '3_text', '4_text'], ['5_text'], ['6_text', '8_text']]
d = df.set_index(['normal_str'])['repl_str'].to_dict()
B = [[d.get(text, text) for text in lst] for lst in A]
print(B)
[['1_labelled', '3_text', '4_labelled'], ['5_labelled'], ['6_text', '8_labelled']]

Fast conversion of multicolumn dataframe into dictionary

I have the following problem. I have a pandas dataframe with columns A to D, with columns A and B acting as the identifier. My ultimate goal is to create a dictionary where the tuple (A,B) denotes the keys and the values of C and D are stored under each key as a numpy array. I can write this in one line if I only want to store C or D, but I struggle to get both in. That's what I have:
output_dict = df.groupby(['A','B'])['C'].apply(np.array).to_dict()
works as expected, i.e. the data per each key is a dim(N,1) array. But if I try the following:
output_dict = df.groupby(['A','B'])['C','D'].apply(np.array).to_dict()
I receive the error that
TypeError: Series.name must be a hashable type
How can I include the 2nd column such that the data in the dict per key is an array of dim(N,2)?
Thanks!
You can create a new column (e.g. C_D) containing lists of the corresponding values in the columns C and D. Select columns C and D from the dataframe and use the tolist() method:
df['C_D'] = df[['C','D']].values.tolist()
Then run your code line on that new column:
output_dict = df.groupby(['A','B'])['C_D'].apply(np.array).to_dict()
I played around a bit more and, in addition to Gerd's already helpful answer, found the following lambda-based solution that matches my needs.
output_dict = df.groupby(['A','B']).apply(lambda df: np.array( [ df['C'],df['D'] ] ).T).to_dict()
Time comparison with Gerd's solution in my particular case:
Gerd's: roughly 0.055s
This one: roughly 0.035s
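For comparison, here is a minimal sketch of a third variant that iterates over the groups directly and selects both columns with a list (double brackets). The small frame is made-up sample data, and this is just one possible way to do it, not a measured improvement over the two answers above.

```python
import pandas as pd

# Made-up sample data with the column layout described in the question
df = pd.DataFrame({
    'A': ['x', 'x', 'y'],
    'B': [1, 1, 2],
    'C': [10.0, 20.0, 30.0],
    'D': [0.1, 0.2, 0.3],
})

# Iterating over a groupby yields (key, sub-frame) pairs; with two grouping
# columns the key is an (A, B) tuple. Selecting [['C', 'D']] keeps both
# columns, so each value is an array of shape (N, 2).
output_dict = {
    key: group[['C', 'D']].to_numpy()
    for key, group in df.groupby(['A', 'B'])
}

print(output_dict[('x', 1)].shape)
```

Whether this is faster than the lambda version depends on the number and size of the groups, so it is worth timing on the real data.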

Advice on data structure for scalar operation between dictionary and dataframe

I have a dictionary of constants which needs to be multiplied with a data frame. Can anyone provide guidance on how to handle this situation or suggest an efficient data structure?
For example, the constant dictionary looks like:
dct = {'a': [0.1, 0.22, 0.13], 'b': [0.544, 0.65, 0.17], 'c': [0.13, 0.544, 0.65]}
and then I have a dataframe.
d = {'ID': ['A1','A2','A3'],'AAA':[0,0.4,0.8],'AA':[0,0.6,0.1],'A':[0,0.72,0.32],'BBB':[0,0.55,0.66]}
df2 = pd.DataFrame(data=d)
What I want to do is pick each constant from the array and apply some complex function to the data frame based on a condition. Could you advise me on which data structure I should use?
I also thought about converting the data frame to a dictionary and then performing the scalar operation using zip, but that doesn't seem right.
i.e.
df2_dct = df2.set_index('ID').T.to_dict(orient='list')
for k, v in dct.items():
    for k1, v1 in df2_dct.items():
        # here I have two lists, v and v1, to perform the operation on, but this is not efficient
        pass
The operations are like: if the value is 0, ignore it; if it is less than 0.5, apply one formula; if it is greater than 0.5, apply another formula.
I'd appreciate any advice.
EDIT1:
Another idea I have is to iterate through each key/value pair of the dictionary 'dct', add the value list as a dataframe column, and then perform the operation. It is totally doable and fast, but then how do I store all 3 dataframes?
EDIT2:
Scalar operations are like:
temp_list = [None] * len(v1)
for i in range(len(v1)):
    if v1[i] == 0:
        temp_list[i] = v1[i]
    elif v1[i] > 0.5:
        temp_list[i] = v1[i] * 4 + v[i] ** 2  # v[i] squared
    elif v1[i] < 0.5:
        temp_list[i] = v1[i] * 0.24 + v[i]
EDIT3:
The expected output would be either a dictionary or a dataframe. In the case of a dictionary, it will be a nested dictionary like:
op_dct = {'a': {'A1': [values], 'A2': [values], 'A3': [values]},
          'b': {'A1': [values], 'A2': [values], 'A3': [values]},
          'c': {'A1': [values], 'A2': [values], 'A3': [values]}}
so that I can access a vector like op_dct[constant_type][ID].
Multiple dataframes don't seem like the right option.
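One possible direction is to keep the frame as a NumPy array and let broadcasting apply the formulas to all cells at once, then build the nested dictionary from the result. The sketch below rests on several assumptions that the question does not settle: the j-th constant in each list is paired with the j-th ID row, the `^2` in the question means squaring, and a value of exactly 0.5 falls into the "less than" branch.

```python
import numpy as np
import pandas as pd

dct = {'a': [0.1, 0.22, 0.13],
       'b': [0.544, 0.65, 0.17],
       'c': [0.13, 0.544, 0.65]}
df2 = pd.DataFrame({'ID': ['A1', 'A2', 'A3'],
                    'AAA': [0, 0.4, 0.8],
                    'AA': [0, 0.6, 0.1],
                    'A': [0, 0.72, 0.32],
                    'BBB': [0, 0.55, 0.66]})

values = df2.set_index('ID')        # rows are IDs, columns are AAA, AA, A, BBB
v1 = values.to_numpy(dtype=float)   # shape (3, 4)

op_dct = {}
for name, consts in dct.items():
    # Column vector of shape (3, 1): one constant per ID row (assumption)
    v = np.asarray(consts, dtype=float)[:, None]
    result = np.where(
        v1 == 0,
        v1,                                # zeros are left untouched
        np.where(v1 > 0.5,
                 v1 * 4 + v ** 2,          # "greater than 0.5" formula
                 v1 * 0.24 + v),           # remaining values (assumption: includes 0.5)
    )
    # Inner dict: ID -> list of transformed values for that row
    op_dct[name] = dict(zip(values.index, result.tolist()))

print(op_dct['a']['A2'])
```

This gives the op_dct[constant_type][ID] access pattern from EDIT3 without nested Python loops over individual values; if the constants are meant to pair with columns instead of rows, the reshape of v would need to change accordingly.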

Avoid using two nested for loops in order to increase performance

I'm trying to replace all the elements in a particular column of a dataFrame with other elements which are stored in a dictionary.
The dataFrame I have is schematically built like :
dict_main = {'Elektro': [1, 2, 3],
             'Nucleo': [88, 22, 23]}
df = pd.DataFrame(dict_main)
and the dictionary where I want to get the elements from is basically
dict_elements = {'1': 'ABC',
'2': 'CDE',
'3': 'EFG'}
What I tried is to use two for loops to replace the elements in the 'Elektro' column where they match the keys of dict_elements. The code looks like:
for index in df.index:
    for key in dict_elements.keys():
        if df.loc[index, 'Elektro'] == key:
            df.loc[index, 'Elektro'] = dict_elements[key]
But as you can imagine, if you have several thousand elements in the dataFrame and the dictionary, this will take a lot of time. How can I improve the performance? Or is there a better and faster alternative to my approach?
You can use a built-in pandas function to accomplish this task:
dict_main = {'Elektro':[1,2,3],
'Nucleo':[88,22,23]}
df = pd.DataFrame(dict_main)
and the dictionary where you want to get the elements from:
dict_elements = {1: 'ABC',
2: 'CDE',
3: 'EFG'}
I edited your dict_elements to have int instead of str for the keys, and this can become a one-liner with a built-in pandas function:
df['Elektro'] = df['Elektro'].replace(dict_elements)
  Elektro  Nucleo
0     ABC      88
1     CDE      22
2     EFG      23
One thing to note is that values that do not match a key in the dictionary will remain as-is in the original dataframe. You can use a map function instead to have them appear as NAs if you'd rather have that behavior. Hope this helps.
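To make the difference concrete, here is a small sketch with one extra row (the value 4, added here purely for illustration) that has no matching key. It shows map producing a missing value for that row, and a map with a dict.get fallback that keeps the original value instead.

```python
import pandas as pd

# Same layout as above, plus a made-up row whose value (4) has no key
df = pd.DataFrame({'Elektro': [1, 2, 3, 4],
                   'Nucleo': [88, 22, 23, 24]})
dict_elements = {1: 'ABC', 2: 'CDE', 3: 'EFG'}

# map with a dict: values without a matching key become missing (NaN)
mapped = df['Elektro'].map(dict_elements)

# map with a lookup function: unmatched values fall back to themselves
kept = df['Elektro'].map(lambda value: dict_elements.get(value, value))

print(mapped.isna().tolist())
print(kept.tolist())
```

Which behaviour is preferable depends on whether unmatched values should be flagged or passed through unchanged.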

Using dictionary keys in pandas dataframe columns

I wrote the following code in which I create a dictionary of pandas dataframes:
import pandas as pd
import numpy as np
classification = pd.read_csv('classification.csv')
thresholdRange = np.arange(0, 70, 0.5).tolist()
classificationDict = {}
for t in thresholdRange:
    classificationDict[t] = classification
for k, v in classificationDict.iteritems():
    v['Threshold'] = k
In this case, I want to create a column called 'Threshold' in all the pandas dataframes in which the keys of the dictionary are the values. However, what I get with the code above is the same value in all dataframes. What am I missing here? Perhaps I am complicating things for myself with this approach, but I'd greatly appreciate your help.
Sorry, I got your question wrong. Now this is the issue:
classification (a pandas dataframe, I suppose) is a mutable object, and adding a mutable object to a list or a dict stores a reference to it, not a copy, which can surprise Python beginners. The same object is added every time, so if you modify it through one of the entries, all entries show the change. Try this:
a = [1]
b = [a, a]
b[0][0] = 2
print(b[1])  # [2]
This is what happens to your dict.
You have to add different objects to the dict. A pandas dataframe has a .copy() method for this. Alternatively, I found this post for you, with (in essence) the same problem; there are further solutions there:
https://stackoverflow.com/a/2612815/6053327
Of course you get the same value. You are doing the same assignment over and over again in
for k, v in classificationDict.iteritems():
because all your v values are the same object; you assigned it to every key in the first for loop.
Did you try debugging it yourself and printing classification? I assume that it is only the first line?
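A minimal sketch of one way to give every key its own dataframe: DataFrame.assign returns a new object with the extra column, so the frames no longer share state. The small frame stands in for the CSV from the question (its column name 'score' is made up), and the threshold range is shortened to keep the example small.

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_csv('classification.csv'); contents are made up
classification = pd.DataFrame({'score': [0.2, 0.7]})

# Shortened range for illustration (the question uses np.arange(0, 70, 0.5))
threshold_range = np.arange(0, 2, 0.5).tolist()

# assign() returns a new dataframe each time, so every key gets its own
# object with its own 'Threshold' column; the original frame is untouched
classification_dict = {
    t: classification.assign(Threshold=t)
    for t in threshold_range
}

for t, frame in classification_dict.items():
    print(t, frame['Threshold'].tolist())
```

Using classification.copy() inside the first loop and then setting the column would work the same way; the key point is that each dictionary value must be a separate object.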
