Advice on data structure for scalar operation between dictionary and dataframe - python

I have a dictionary of constants which needs to be multiplied into a data frame. Can anyone provide guidance on how to handle this situation or suggest an efficient data structure?
For example, the constant dictionary is like
dct = {'a': [0.1, 0.22, 0.13], 'b': [0.544, 0.65, 0.17], 'c': [0.13, 0.544, 0.65]}
and then I have a dataframe:
d = {'ID': ['A1','A2','A3'],'AAA':[0,0.4,0.8],'AA':[0,0.6,0.1],'A':[0,0.72,0.32],'BBB':[0,0.55,0.66]}
df2 = pd.DataFrame(data=d)
What I want to do is pick each constant list and apply a somewhat complex function to the data frame based on conditions. Could you advise me on the data structure I should use?
I also thought about converting the data frame to a dictionary and then performing the scalar operation using zip, but that doesn't seem right.
i.e.
df2_dct = df2.set_index('ID').T.to_dict(orient='list')
for k, v in dct.items():
    for k1, v1 in df2_dct.items():
        # here I have two lists, v and v1, to perform the operation on, but this is not efficient
The operations are like: if the value is 0, ignore it; if it is less than 0.5, apply one formula; and if it is greater than 0.5, apply another formula.
I'd appreciate any advice.
EDIT1:
Another idea I have is to iterate through each key/value pair of the dictionary 'dct', add the value list as a dataframe column and then perform the operation. It is totally doable and fast, but then how do I store all 3 dataframes?
EDIT2:
Scalar operations are like:
temp_list = [0.0] * len(v1)
for i in range(len(v1)):
    if v1[i] == 0:
        temp_list[i] = v1[i]
    elif v1[i] > 0.5:
        temp_list[i] = v1[i] * 4 + v[i] ** 2
    elif v1[i] < 0.5:
        temp_list[i] = v1[i] * 0.24 + v[i]
EDIT3:
The expected output would be either a dictionary or a dataframe. In the case of a dictionary, it would be a nested dictionary like
op_dct = {'a': {'A1': [values], 'A2': [values], 'A3': [values]},
          'b': {'A1': [values], 'A2': [values], 'A3': [values]},
          'c': {'A1': [values], 'A2': [values], 'A3': [values]}}
so that I can access a vector like op_dct[constant_type][ID].
Multiple dataframes don't seem like the right option.
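For what it's worth, here is a minimal sketch of one way to build the nested dictionary from EDIT3, with the piecewise formula from EDIT2 expressed through numpy.select instead of an explicit loop. Note one assumption: the constant lists in dct have 3 entries while df2 has 4 numeric columns, so the vectors below are padded with one extra made-up value purely to make the shapes line up.
import numpy as np
import pandas as pd

# constants padded to 4 entries so they match df2's numeric columns (assumption, see above)
dct = {'a': [0.1, 0.22, 0.13, 0.3],
       'b': [0.544, 0.65, 0.17, 0.2],
       'c': [0.13, 0.544, 0.65, 0.4]}

d = {'ID': ['A1', 'A2', 'A3'], 'AAA': [0, 0.4, 0.8], 'AA': [0, 0.6, 0.1],
     'A': [0, 0.72, 0.32], 'BBB': [0, 0.55, 0.66]}
df2 = pd.DataFrame(data=d).set_index('ID')

def scale(v1, v):
    # piecewise formula from EDIT2, vectorised over one row of df2
    v1 = np.asarray(v1, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.select([v1 == 0, v1 > 0.5, v1 < 0.5],
                     [v1, v1 * 4 + v ** 2, v1 * 0.24 + v])

op_dct = {k: {idx: scale(row.to_numpy(), v) for idx, row in df2.iterrows()}
          for k, v in dct.items()}

print(op_dct['a']['A2'])  # vector for constant 'a' and ID 'A2'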

How to flatten numpy.ndarray data in python

I have numpy.ndarray data that looks like the following and I want to flatten it out so that I can manipulate it. Please find my sample data below:
sample_data=[list([{'region': 'urn:li:region:9194', 'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}}, {'region': 'urn:li:region:7127', 'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}])]
I have tried to use the following code but no luck yet:
sample_data.flatten()
The desired output is as follows:
region               organicFollowerCount  paidFollowerCount
urn:li:region:9194   157                   0
urn:li:region:7127   17                    0
Can anyone help me achieve this, please?
Here is an approach that uses pd.json_normalize:
import pandas as pd
# note that `sample_data` has been modified into a flat list of dictionaries
sample_data = [
{'region': 'urn:li:region:9194',
'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}},
{'region': 'urn:li:region:7127',
'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}
]
Now, convert each item in the list to a data frame:
dfs = list()
# convert one dict at a time into a data frame, using json_normalize()
for sd in sample_data:
    t = pd.json_normalize(sd)
    dfs.append(t)
# convert list of dataframes into a single data frame,
# and change column labels
t = pd.concat(dfs).rename(columns={
    'followerCounts.organicFollowerCount': 'organicFollowerCount',
    'followerCounts.paidFollowerCount': 'paidFollowerCount'
}).set_index('region')
print(t)
                    organicFollowerCount  paidFollowerCount
region
urn:li:region:9194                   157                  0
urn:li:region:7127                    17                  0
As #thehumaneraser noted, this format is not ideal, but we can't always influence the format of the data we receive.
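As a side note, and assuming pandas 1.0 or newer (where json_normalize is available at the top level), the loop is not strictly necessary: json_normalize accepts a list of records directly, so the same table can presumably be built in one step:
t = (pd.json_normalize(sample_data)
       .rename(columns={'followerCounts.organicFollowerCount': 'organicFollowerCount',
                        'followerCounts.paidFollowerCount': 'paidFollowerCount'})
       .set_index('region'))
print(t)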
You are not going to be able to flatten this data the way you want with Numpy's flatten method. That method simply takes a multi-dimensional ndarray and flattens it to one dimension. You can read the docs here.
A couple of other things. First of all, your sample data above is not an ndarray, it is just a Python list. And actually, since you call list() inside square brackets, it is a nested list of dictionaries. This is really not a good way to store this information, and with this convoluted format you leave yourself very few options for nicely "flattening" it into the table you desire.
If you have many rows like this I would do the following:
headers = ["region", "organicFollowerCount", "paidFollowerCount"]
data = [headers]
for row in sample_data[0]: # Subindexing here because it is unwisely a nested list
formatted_row = []
formatted_row.append(row["region"])
formatted_row.append(row["followerCounts"]["organicFollowerCount"])
formatted_row.append(row["followerCounts"]["paidFollowerCount"]
data.append(formatted_row)
data = np.array(data)
This will give you an ndarray of the data as you have it here, but this is still an ugly solution. Really this is a highly impractical presentation of data and you should ditch it for a better one.
One last thing: don't use camel case. That is standard practice for some languages like Java but not for Python. Instead of organicFollowerCount use organic_follower_count and so on.

Fast conversion of multicolumn dataframe into dictionary

I have the following problem. I have a pandas dataframe with columns A to D, with columns A and B being kind of the identifier. My ultimate goal is to create a dictionary where the tuple (A,B) denotes the keys and the values C and D are stored under each key as a numpy array. I can write this in one line if I only want to store C or D, but I struggle to get both in at once. That's what I have:
output_dict = df.groupby(['A','B'])['C'].apply(np.array).to_dict()
works as expected, i.e. the data for each key is an array of dim (N,1). But if I try the following:
output_dict = df.groupby(['A','B'])['C','D'].apply(np.array).to_dict()
I receive the error that
TypeError: Series.name must be a hashable type
How can I include the 2nd column such that the data in the dict per key is an array of dim (N,2)?
Thanks!
You can create a new column (e.g. C_D) containing lists of the corresponding values in the columns C and D. Select columns C and D from the dataframe and use the tolist() method:
df['C_D'] = df[['C','D']].values.tolist()
Then run your code line on that new column:
output_dict = df.groupby(['A','B'])['C_D'].apply(np.array).to_dict()
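To make the idea concrete, here is a small self-contained example (the frame below is made up purely for illustration; the column names follow the question):
import numpy as np
import pandas as pd

# hypothetical data, just to show the resulting structure
df = pd.DataFrame({'A': ['x', 'x', 'y'],
                   'B': [1, 1, 2],
                   'C': [10.0, 11.0, 12.0],
                   'D': [20.0, 21.0, 22.0]})

df['C_D'] = df[['C', 'D']].values.tolist()
output_dict = df.groupby(['A', 'B'])['C_D'].apply(np.array).to_dict()
print(output_dict)  # keys are (A, B) tuples; each value holds that group's [C, D] pairs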
I played around a bit more and, next to Gerd's already helpful answer, I found the following matching my needs by using a lambda.
output_dict = df.groupby(['A','B']).apply(lambda df: np.array( [ df['C'],df['D'] ] ).T).to_dict()
Time comparison with Gerd's solution in my particular case:
Gerd's: roughly 0.055s
This one: roughly 0.035s

Dictionary with arrays as values

Not sure if this is a good idea after all, but having a dictionary with arrays as values, such as
DF = {'z_eu': array([127.45064758, 150.4478288 , 150.74781189, -98.3227338 , -98.25155681, -98.24993753]),
'Process': array(['initStep', 'Transportation', 'Transportation', 'Transportation', 'Transportation', 'phot']),
'Creator': array(['SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad']) }
I need to do a selection of the numeric data (z_eu) based on values of the other two keys.
One workaround I came up with so far, was to extract the arrays and iterate through them, thereby creating another array which contains the valid data.
proc = DF['Process']; z= DF['z_eu']; creat = DF['Creator']
data = [z for z,p,c in zip(z, proc,creat) if (p == 'initStep') and c=='SynRad' ]
But somehow this seems to me like effort which could be completely avoided by dealing more intelligently with the dictionary in the first place. Also, the zip() takes a long time.
I know that dataframes are a valid alternative but unfortunately, since I'm dealing with strings, pandas appears to be too slow.
Any hints are most welcome!
A bit simpler, using conditional slicing you could write
data = DF['z_eu'][(DF['Process'] == 'initStep') & (DF['Creator'] == 'SynRad')]
...or still using zip, you could simplify to
data = [z for z, p, c in zip(*DF.values()) if p == 'initStep' and c == 'SynRad']
Basically also conditional slicing, using a pandas DataFrame:
df = pd.DataFrame(DF)
data = df.loc[(df['Process'] == 'initStep') & (df['Creator'] == 'SynRad'), 'z_eu']
print(data)
# 0 127.450648
# Name: z_eu, dtype: float64
In principle I'd say there's nothing wrong with handling numpy arrays in a dict. You'll have a lot of flexibility and sometimes operations are more efficient if you do them straight in numpy (you could even utilize numba for purely numerical, expensive calculations) - but if that is not needed and you're fine with basically a n*m table, pandas dfs are nice and convenient.
If your dataset is large and you want to perform many look-ups as the one shown, you might not want to perform those on strings. To improve performance, you could e.g. come up with unique IDs (integers) for each 'Process' or 'Creator' from the example. You'll just need to be able to map those back to the original strings, so keep that data as well.
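A minimal sketch of that idea, using np.unique with return_inverse to turn the string arrays from the DF dict in the question into integer codes plus lookup tables (the names here are just illustrative):
import numpy as np

# integer code per row, plus the table mapping codes back to the original strings
process_names, process_codes = np.unique(DF['Process'], return_inverse=True)
creator_names, creator_codes = np.unique(DF['Creator'], return_inverse=True)

# resolve each string query once, then filter on integers only
init_step = np.where(process_names == 'initStep')[0][0]
syn_rad = np.where(creator_names == 'SynRad')[0][0]
data = DF['z_eu'][(process_codes == init_step) & (creator_codes == syn_rad)]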
You can loop through one array and, via the index, get the right element:
z_eu = DF['z_eu']
process = DF['Process']
creator = DF['Creator']
result = []
for i in range(len(z_eu)):
    if process[i] == 'initStep' and creator[i] == 'SynRad':
        result.append(z_eu[i])
print(result)

How to append a key before a value in Python dict?

I have a dict
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
Now I want to append the key before the values, so my expected output is:
{'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
I can do it using a for loop like below:
for k, v in x.items():
    x.update({k: k + v})
I am looking for a more efficient way of doing this, or should I stick to my current approach?
Your approach seems fine. You could also use a dictionary comprehension, for a more concise solution:
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
{k: k+v for k,v in x.items()}
# {'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
Another way:
x = {'[a]':'(1234)', '[b]':'(2345)', '[c]':'(xyzad)'}
dict(((key, key + x[key]) for key in x))
>>>{'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
For smaller size dictionaries, the dictionary comprehension solution by #yatu is the best.
Since you mentioned that the data set is large and you would like to avoid a for loop, pandas would be the recommended solution.
Create pandas dataframe from dict 'x'
Transform the dataframe & write to a new dictionary
Code:
# Read dictionary to a dataframe
df = pd.DataFrame(list(x.items()), columns=['Key', 'Value'])
Out[317]:
   Key   Value
0  [a]  (1234)
1  [b]  (2345)
2  [c]  (xyzad)
# Since the transformation is just concatenating both key and value, this can be done while writing to the new dictionary in a single step.
y = dict(zip(df.Key, df.Key+df.Value))
Out[324]: {'[a]': '[a](1234)', '[b]': '[b](2345)', '[c]': '[c](xyzad)'}
This would be much faster for large data sets but I'm not sure how to compare the timings.
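One rough way to compare them (a sketch using timeit on a synthetic 100,000-entry dict; the exact numbers will of course depend on the machine and the data):
import timeit

setup = "x = {f'[{i}]': f'({i})' for i in range(100_000)}"

t_comp = timeit.timeit("{k: k + v for k, v in x.items()}",
                       setup=setup, number=10)

# dataframe construction kept in setup so only the dict-building step is timed
t_pandas = timeit.timeit(
    "dict(zip(df.Key, df.Key + df.Value))",
    setup=setup + "\nimport pandas as pd\n"
          "df = pd.DataFrame(list(x.items()), columns=['Key', 'Value'])",
    number=10)

print(f"dict comprehension: {t_comp:.3f}s, pandas zip: {t_pandas:.3f}s")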

Pandas: More Efficient .map() function or method?

I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories: country, productcode, year. The country variable (which is the country name) is rather messy data consisting of items such as 'Austral', which represents 'Australia'. I have built a simple guess_country() that matches letters to words, and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of the hierarchy, it is very inefficient to use .map() on the Series country. [The guess_country function takes ~2 ms / request]
My question is: is there a more efficient .map() which takes the Series and performs the map only on unique values? (Given there are a LOT of repeated country names.)
There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.
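A minimal sketch of that suggestion, reusing names from the question and the answer above (mySeries is the messy country Series, and the [0] assumes guess_country returns a (best_guess, confidence) pair):
# pre-compute the guess once per unique value, then map the whole Series
unique_vals = mySeries.unique()
mapping = {name: guess_country(name, country_names)[0] for name in unique_vals}
mySeries = mySeries.map(mapping)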
One solution is to make use of the hierarchical indexing in the DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index.levels[0] = pd.Index(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]))
This works well by replacing data.index.levels[0]: when COUNTRY is level 0 in the index, the replacement propagates through the data model.
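One caveat in case anyone copies this on a recent pandas version: assigning to data.index.levels[0] directly is no longer allowed there, and the supported equivalent would presumably be set_levels:
data.index = data.index.set_levels(
    data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]),
    level=0,
)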
Call guess_country() on unique country names, and make a country_map Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country] to do the conversion.
import pandas as pd
c = ["abc","abc","ade","ade","ccc","bdc","bxy","ccc","ccx","ccb","ccx"]
v = range(len(c))
df = pd.DataFrame({"country":c, "data":v})
def guess_country(c):
    return c[0]
uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)
