Fast conversion of multicolumn dataframe into dictionary - python

I have the following problem. I have a pandas dataframe with columns A to D, where columns A and B act as the identifier. My ultimate goal is to create a dictionary where the tuple (A,B) denotes the keys and the values of C and D are stored under each key as a numpy array. I can write this in one line if I only want to store C or D, but I struggle to get both in at once. This is what I have:
output_dict = df.groupby(['A','B'])['C'].apply(np.array).to_dict()
This works as expected, i.e. the data for each key is a dim(N,1) array. But if I try the following:
output_dict = df.groupby(['A','B'])['C','D'].apply(np.array).to_dict()
I receive the error that
TypeError: Series.name must be a hashable type
How can I include the second column such that the data in the dict for each key is an array of dim(N,2)?
Thanks!

You can create a new column (e.g. C_D) containing lists of the corresponding values in the columns C and D. Select columns C and D from the dataframe and use the tolist() method:
df['C_D'] = df[['C','D']].values.tolist()
Then run your code line on that new column:
output_dict = df.groupby(['A','B'])['C_D'].apply(np.array).to_dict()
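A minimal end-to-end sketch of this approach on made-up data (the frame below is illustrative, not the asker's):
import numpy as np
import pandas as pd

# small example frame; the real data has many more rows
df = pd.DataFrame({'A': [1, 1, 2],
                   'B': ['x', 'x', 'y'],
                   'C': [10, 11, 12],
                   'D': [0.1, 0.2, 0.3]})

# pack C and D into one list-valued column, then group and collect
df['C_D'] = df[['C', 'D']].values.tolist()
output_dict = df.groupby(['A', 'B'])['C_D'].apply(np.array).to_dict()
# {(1, 'x'): array([[10. , 0.1], [11. , 0.2]]),
#  (2, 'y'): array([[12. , 0.3]])}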

I played around a bit more and, in addition to Gerd's already helpful answer, found the following, which matches my needs by using a lambda:
output_dict = df.groupby(['A','B']).apply(lambda df: np.array( [ df['C'],df['D'] ] ).T).to_dict()
Time comparison with Gerd's solution in my particular case:
Gerd's: roughly 0.055s
This one: roughly 0.035s
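Those numbers are from the asker's data; a sketch of how such a comparison could be reproduced with timeit, assuming df and np are already defined as in the snippets above:
import timeit

gerd = ("df['C_D'] = df[['C', 'D']].values.tolist(); "
        "df.groupby(['A', 'B'])['C_D'].apply(np.array).to_dict()")
lam = "df.groupby(['A', 'B']).apply(lambda g: np.array([g['C'], g['D']]).T).to_dict()"

print(timeit.timeit(gerd, globals=globals(), number=10))
print(timeit.timeit(lam, globals=globals(), number=10))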

Related

Advice on data structure for scaler operation between dictionary and dataframe

I have a dictionary of constants which needs to be multiplied with a data frame. Can anyone provide guidance on how to handle this situation or suggest an efficient data structure?
For example, the constant dictionary is like
dct = {'a': [0.1, 0.22, 0.13], 'b': [0.544, 0.65, 0.17], 'c': [0.13, 0.544, 0.65]}
and then I have a dataframe.
d = {'ID': ['A1','A2','A3'],'AAA':[0,0.4,0.8],'AA':[0,0.6,0.1],'A':[0,0.72,0.32],'BBB':[0,0.55,0.66]}
df2 = pd.DataFrame(data=d)
What I want to do is pick each constant list from the dictionary and apply some complex function to the data frame based on conditions. Could you advise me on the data structure I should use?
I also thought about converting the data frame to a dictionary and then performing the scalar operation using zip, but that doesn't seem right.
i.e.
df2_dct = df2.set_index('ID').T.to_dict(orient='list')
for k, v in dct.items():
    for k1, v1 in df2_dct.items():
        # here I have two lists, v and v1, to perform the operation on, but this is not efficient
The operations are like: if the value is 0, ignore it; if it is less than 0.5, apply one formula; if it is greater than 0.5, apply another formula.
I'd appreciate any advice.
EDIT1:
Another idea I have is to iterate through each key/value pair of the dictionary 'dct', add the value list as a dataframe column and then perform the operation. It is totally doable and fast, but then how do I store all 3 dataframes?
EDIT2:
The scalar operations are like:
temp_list = [0.0] * len(v1)
for i in range(len(v1)):
    if v1[i] == 0:
        temp_list[i] = v1[i]
    elif v1[i] > 0.5:
        temp_list[i] = v1[i] * 4 + v[i] ** 2
    elif v1[i] < 0.5:
        temp_list[i] = v1[i] * 0.24 + v[i]
EDIT3:
The expected output would be either a dictionary or a dataframe. In the case of a dictionary, it will be a nested dictionary like:
op_dct = {'a':{'A1':[values],'A2':[values],'A3':[values]},
'b':{'A1':[values],'A2':[values],'A3':[values]},
'c':{'A1':[values],'A2':[values],'A3':[values]} }
so that I can access a vector like op_dct[constant_type][ID].
Multiple dataframes don't seem like the right option.
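For reference, a minimal vectorized sketch of the EDIT2 rule (assuming v and v1 are numeric sequences of equal length; how the constant lists pair up with the dataframe's rows is left open, as in the question):
import numpy as np

def scale(v1, v):
    # vectorized form of the per-element rule in EDIT2
    v1 = np.asarray(v1, dtype=float)
    v = np.asarray(v, dtype=float)
    out = np.where(v1 > 0.5, v1 * 4 + v ** 2, v1 * 0.24 + v)
    return np.where(v1 == 0, 0.0, out)   # zeros pass through unchanged

print(scale([0, 0.6, 0.1], [0.1, 0.22, 0.13]))   # [0.  2.4484  0.154]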

new dataframe within if statement. Python

Here is the part of the code I am having issues with:
for x in range(len(df['Days'])):
    if df['Days'][x] > 0 and df['Days'][x] <= 30:
        b = df['Days'][x]
b
The output I get is b = 14, which is the last value in the column for which the if statement holds. I am trying to get ALL the values of the column for which the if statement holds stored in "b", rather than just the last value alone.
What you want to do is make a list instead and append b to it.
my_vals = []
for x in range(len(df['Days'])):
    if df['Days'][x] > 0 and df['Days'][x] <= 30:
        b = df['Days'][x]
        my_vals.append(b)
my_vals
In your code, you overwrite b in every iteration, so it only stores the most recent value. When you need to keep multiple values, store them in a container such as a list.
You can also use the filtering functionality of pandas and use
values = df.loc[(df['Days'] > 0) & (df['Days'] <= 30)]
If you want the values as a Series instead of a DataFrame use
values_series = values['Days']
If you want the values as a list instead of a Series use
values_list = list(values_series)
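The same filtering collapsed into a single line, using the question's column name:
values_list = df.loc[(df['Days'] > 0) & (df['Days'] <= 30), 'Days'].tolist()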

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave the right answer; I'll extend from there. To convert into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist().
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what its name says.
And the last step: tolist converts to a plain list.
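A small sketch on the example data above, showing both the sorted and the order-preserving variants:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': ['adr1', 'adr1', 'adr3', 'adr4'],
                   'column2': ['adr2', 'adr2', 'adr4', 'adr5']})

print(np.unique(df).tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']  (sorted order)

print(pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist())
# ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']  (order of appearance)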

Using dictionary keys in pandas dataframe columns

I wrote the following code in which I create a dictionary of pandas dataframes:
import pandas as pd
import numpy as np
classification = pd.read_csv('classification.csv')
thresholdRange = np.arange(0, 70, 0.5).tolist()
classificationDict = {}
for t in thresholdRange:
    classificationDict[t] = classification
for k, v in classificationDict.iteritems():
    v['Threshold'] = k
In this case, I want to create a column called 'Threshold' in each of the pandas dataframes, with the dictionary keys as its values. However, what I get with the code above is the same value in all dataframes. What am I missing here? Perhaps I am overcomplicating things with this approach, but I'd greatly appreciate your help.
Sorry, I got your question wrong. Now this is the issue:
Obviously, classification (a pandas dataframe, I suppose) is a mutable object, and adding a mutable object to a list or a dict produces behaviour that looks strange to Python beginners: the same object is added each time, so if you change it through one of the entries, all of them appear to change. Try this:
a = [1]
b = [a, a]
b[0][0] = 2
print(b[1])  # prints [2] - both entries point to the same list
This is what happens to your dict.
You have to add distinct objects to the dict. The dataframe has a .copy() method for this. Alternatively, I found this post for you with (in essence) the same problem; there are further solutions there:
https://stackoverflow.com/a/2612815/6053327
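A minimal sketch of the fix using .copy(), building on the question's loop:
classificationDict = {}
for t in thresholdRange:
    frame = classification.copy()   # a new, independent DataFrame for each key
    frame['Threshold'] = t
    classificationDict[t] = frame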
Of course you get the same value: you are doing the same assignment over and over again in
for k, v in classificationDict.iteritems():
because all your v's are identical; you assigned them all in the first for loop.
Did you try debugging yourself and printing classification? I assume that it is only the first line?

Pandas: More Efficient .map() function or method?

I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories: country, productcode, year. The country variable (which is the country name) is rather messy data consisting of items such as 'Austral', which represents 'Australia'. I have built a simple guess_country() that matches letters to words and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of the hierarchy, it is very inefficient to use .map() on the Series country. [The guess_country function takes ~2 ms per request.]
My question is: Is there a more efficient .map() which takes the Series and performs the map on only unique values? (Given there are a LOT of repeated country names.)
There isn't, but if you want to apply it only to unique values, just do that yourself: get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.
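A minimal sketch of that pattern; guess_country here is a trivial stand-in for the real fuzzy matcher, and the data is made up:
import pandas as pd

def guess_country(name, country_names):
    # stand-in for the real matcher; returns (best_guess, confidence)
    return (name.strip().title(), 1.0)

country_names = ['Australia', 'Austria']                      # illustrative
s = pd.Series(['Austral', 'Austral', 'austria ', 'Austral'])  # messy country column

# guess each unique value once, then map the whole Series through the dict
mapping = {name: guess_country(name, country_names)[0] for name in s.unique()}
cleaned = s.map(mapping)
print(cleaned.tolist())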
One solution is to make use of the hierarchical indexing in DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index = data.index.set_levels(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]), level=0)
This works well: by rebuilding level 0 of the index (COUNTRY), the replacement propagates through the whole data model.
Call guess_country() on the unique country names, and make a country_map Series object with the original name as the index and the converted name as the value. Then you can use country_map[df.country] to do the conversion.
import pandas as pd

c = ["abc", "abc", "ade", "ade", "ccc", "bdc", "bxy", "ccc", "ccx", "ccb", "ccx"]
v = range(len(c))
df = pd.DataFrame({"country": c, "data": v})

def guess_country(c):
    return c[0]

uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)
