I wrote the following code in which I create a dictionary of pandas dataframes:
import pandas as pd
import numpy as np
classification = pd.read_csv('classification.csv')
thresholdRange = np.arange(0, 70, 0.5).tolist()
classificationDict = {}
for t in thresholdRange:
    classificationDict[t] = classification
for k, v in classificationDict.iteritems():
    v['Threshold'] = k
In this case, I want to create a column called 'Threshold' in each of the dataframes, with the corresponding dictionary key as its value. However, what I get with the code above is the same value in all dataframes. What am I missing here? Perhaps I am overcomplicating things with this approach, but I'd greatly appreciate your help.
Sorry, I got your question wrong at first. Here is the actual issue:
Obviously, classification (a pandas DataFrame, I suppose) is a mutable object, and adding a mutable object to a list or a dict produces behaviour that surprises Python beginners: the same object is added every time, so if you change it through one entry, all entries appear to change. Try this:
a = [1]
b = [a, a]
b[0][0] = 2
print(b[1])  # prints [2]: both entries are the same list
This is what happens to your dict.
You have to add distinct objects to the dict; the DataFrame's .copy() method does exactly that. Alternatively, I found this post for you with (in essence) the same problem; there are further solutions there:
https://stackoverflow.com/a/2612815/6053327
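For example, a minimal sketch of the fix, assuming the same classification.csv and loops as in the question; the only changes are the .copy() call and .items() (dict.iteritems() only exists in Python 2):

import pandas as pd
import numpy as np

classification = pd.read_csv('classification.csv')
thresholdRange = np.arange(0, 70, 0.5).tolist()

classificationDict = {}
for t in thresholdRange:
    # .copy() gives every key its own independent DataFrame
    classificationDict[t] = classification.copy()

for k, v in classificationDict.items():
    v['Threshold'] = k  # now each frame keeps its own threshold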
Of course you get the same value. You are doing the same assignment over and over again in
for k, v in classificationDict.iteritems():
because all your values v are identical: you assigned the same classification object to every key in the first for loop.
Did you try debugging this yourself and printing classification? I assume that it is only the first line?
Related
I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:\eu_inventory\VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder, I want to add it to ordersInBoth; otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the plain order numbers, not arrays and not including the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc gives you a DataFrame, which becomes a 2D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
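Put together, a sketch of the adjusted script, assuming the same files and column positions as in the question (the set() is just an optional speed-up for the membership test):

import pandas as pd

dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")

dataOrderTrimmed = dataBI.iloc[:, 1].values    # 1-D array of scalar order numbers
dataVAOrder = set(dataVA05.iloc[:, 1].values)  # set membership tests are fast

ordersInBoth = []
ordersInBI = []
for order in dataOrderTrimmed:
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)

print(ordersInBoth[:5])  # plain numbers, no array(..., dtype=int64) wrappers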
I have the following problem. I have a pandas dataframe with columns A to D, where columns A and B act as the identifier. My ultimate goal is to create a dictionary where the tuple (A, B) denotes the keys and the values of C and D are stored under each key as a numpy array. I can write this in one line if I only want to store C or D, but I struggle to get both in at once. That's what I have:
output_dict = df.groupby(['A','B'])['C'].apply(np.array).to_dict()
works as expected, i.e. the data under each key is a dim(N,1) array. But if I try the following:
output_dict = df.groupby(['A','B'])['C','D'].apply(np.array).to_dict()
I receive the error that
TypeError: Series.name must be a hashable type
How can I include the second column such that the data in the dict under each key is an array of dim(N,2)?
Thanks!
You can create a new column (e.g. C_D) containing lists of the corresponding values in the columns C and D. Select columns C and D from the dataframe and use the tolist() method:
df['C_D'] = df[['C','D']].values.tolist()
Then run your code line on that new column:
output_dict = df.groupby(['A','B'])['C_D'].apply(np.array).to_dict()
I played around a bit more and, in addition to Gerd's already helpful answer, found the following lambda-based variant matching my needs.
output_dict = df.groupby(['A','B']).apply(lambda df: np.array( [ df['C'],df['D'] ] ).T).to_dict()
Time comparison with Gerd's solution in my particular case:
Gerd's: roughly 0.055s
This one: roughly 0.035s
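For reference, a tiny self-contained example (hypothetical data, just to show the shape this produces per key):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y'],
                   'C': [10, 20, 30], 'D': [0.1, 0.2, 0.3]})

output_dict = df.groupby(['A', 'B']).apply(
    lambda g: np.array([g['C'], g['D']]).T).to_dict()

print(output_dict[(1, 'x')].shape)  # (2, 2): one (C, D) pair per row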
Not sure if this is a good idea after all, but having a dictionary with arrays as values, such as
DF = {'z_eu': array([127.45064758, 150.4478288 , 150.74781189, -98.3227338 , -98.25155681, -98.24993753]),
'Process': array(['initStep', 'Transportation', 'Transportation', 'Transportation', 'Transportation', 'phot']),
'Creator': array(['SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad']) }
I need to do a selection of the numeric data (z_eu) based on values of the other two keys.
One workaround I came up with so far was to extract the arrays and iterate through them, thereby creating another array which contains the valid data.
proc = DF['Process']; z = DF['z_eu']; creat = DF['Creator']
data = [z for z, p, c in zip(z, proc, creat) if (p == 'initStep') and c == 'SynRad']
But somehow this seems like effort that could be avoided entirely by dealing with the dictionary more intelligently in the first place. Also, the zip() takes quite a long time.
I know that dataframes are a valid alternative but unfortunately, since I'm dealing with strings, pandas appears to be too slow.
Any hints are most welcome!
A bit simpler, using conditional slicing you could write
data = DF['z_eu'][(DF['Process'] == 'initStep') & (DF['Creator'] == 'SynRad')]
...or still using zip, you could simplify to
data = [z for z, p, c in zip(*DF.values()) if p == 'initStep' and c == 'SynRad']
Basically also conditional slicing, using a pandas DataFrame:
df = pd.DataFrame(DF)
data = df.loc[(df['Process'] == 'initStep') & (df['Creator'] == 'SynRad'), 'z_eu']
print(data)
# 0 127.450648
# Name: z_eu, dtype: float64
In principle I'd say there's nothing wrong with handling numpy arrays in a dict. You'll have a lot of flexibility, and sometimes operations are more efficient if you do them straight in numpy (you could even utilize numba for purely numerical, expensive calculations) - but if that is not needed and you're fine with what is basically an n*m table, pandas DataFrames are nice and convenient.
If your dataset is large and you want to perform many look-ups like the one shown, you might not want to perform those on strings. To improve performance, you could e.g. come up with unique IDs (integers) for each 'Process' or 'Creator' from the example, as sketched below. You'll just need to be able to map those back to the original strings, so keep that data as well.
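A rough sketch of that integer-ID idea, assuming the DF dict from the question (np.unique with return_inverse does the encoding, and the *_codes arrays let you map the IDs back to the original strings):

import numpy as np

# encode each string column as integer ids once
proc_codes, proc_ids = np.unique(DF['Process'], return_inverse=True)
creat_codes, creat_ids = np.unique(DF['Creator'], return_inverse=True)

# find the integer ids of the strings of interest, then compare integers only
init_id = np.where(proc_codes == 'initStep')[0][0]
syn_id = np.where(creat_codes == 'SynRad')[0][0]

data = DF['z_eu'][(proc_ids == init_id) & (creat_ids == syn_id)]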
You can loop over the indices of one array and use them to pick the matching elements from the others:
z_eu = DF['z_eu']
process = DF['Process']
creator = DF['Creator']
result = []
for i in range(len(z_eu)):
    if process[i] == 'initStep' and creator[i] == 'SynRad':
        result.append(z_eu[i])
print(result)
I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.
e.g.
import pandas as pd
import numpy as np
DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]],
                  columns=['category', 'arraydata'])
This works the way I would expect it to:
DF.groupby('category').agg(sum)
output:

           arraydata
category
1         [50 70 90]
2         [20 30 40]
However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:
g=DF.groupby('category')
g.agg({'arraydata':sum})
Here is another:
g=DF.groupby('category')
g['arraydata'].agg(sum)
Both give the same output:
Exception: must produce aggregated value
However if I have a column that uses numeric rather than array data, it works fine. I can work around this, but it's confusing and I'm wondering if this is a bug, or if I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case and indeed wasn't sure if they were supported. Ideas?
Thanks
One, perhaps clunkier, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:
grouped = DF.groupby("category")
aggregate = list((k, v["arraydata"].sum()) for k, v in grouped)
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")
This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
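Run against the example frame from the question, a quick check that this produces the element-wise sums:

import numpy as np
import pandas as pd

DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]],
                  columns=['category', 'arraydata'])

grouped = DF.groupby("category")
aggregate = [(k, v["arraydata"].sum()) for k, v in grouped]
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")

print(new_df.loc[1, "arraydata"])  # [50 70 90]
print(new_df.loc[2, "arraydata"])  # [20 30 40]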
Diving into the Internals
The problem here is that pandas is checking explicitly that the output is not an ndarray, because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named where the error occurs.
def _aggregate_named(self, func, *args, **kwargs):
    result = {}
    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)
    return result
My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:
DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
Pandas doesn't complain, because now you have an array of Python objects. [but this is really just cheating around the typecheck]. And if you want to convert back to array, just apply np.array to it.
result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over GroupBy like I've shown above.
Pandas works much more efficiently if you don't do this (e.g. using numeric data, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data.
That said, this looks like a bug; the Exception is being raised purely because the result is an array:
Exception: Must produce aggregated value
In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
   1510             if isinstance(output, np.ndarray):
-> 1511                 raise Exception('Must produce aggregated value')
   1512             result[name] = self._try_cast(output, group)
ipdb> output
array([50, 70, 90])
If you were to recklessly remove these two lines from the source code it works as expected:
In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]
Note: They're almost certainly in there for a reason...
This happens because the built-in sum function only iterates over rows, i.e. it only calculates the sum along the first axis.
You can define an aggregation function:
def mySum(dataframe):
    return np.sum(np.sum(dataframe))
And then pass this function into the agg():
DF.groupby('category').agg(mySum)
I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories: country, productcode, year. The country variable (which is the country name) is rather messy data, consisting of items such as 'Austral', which represents 'Australia'. I have built a simple guess_country() that matches letters to words and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of the hierarchy, it is very inefficient to apply .map() to the Series country. [The guess_country function takes ~2ms per request.]
My question is: Is there a more efficient .map() which takes the Series and performs the map on only the unique values? (Given there are a LOT of repeated country names.)
There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.
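A hedged sketch of that recipe, assuming a flat column named 'country' in a frame called data, and the guess_country(name, country_names) signature used elsewhere in this thread:

# pre-compute the expensive guess only once per unique name
unique_names = data['country'].unique()
name_map = {name: guess_country(name, country_names)[0] for name in unique_names}

# the per-row work is then just a dictionary lookup
data['country'] = data['country'].map(name_map)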
One solution is to make use of the hierarchical indexing in the DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index.levels[0] = pd.Index(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]))
This works well: it replaces data.index.levels[0] (COUNTRY being level 0 of the index), and the replacement then propagates through the data model.
Call guess_country() on unique country names, and make a country_map Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country] to do the conversion.
import pandas as pd
c = ["abc","abc","ade","ade","ccc","bdc","bxy","ccc","ccx","ccb","ccx"]
v = range(len(c))
df = pd.DataFrame({"country":c, "data":v})
def guess_country(c):
    return c[0]
uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)