I have a dataframe df with transactions where the values in the column Col can be repeated. I use a Counter, dictionary1, to count the frequency of each Col value. Then I would like to run a for loop on a subset of the data and obtain a value pit. I want to create a new dictionary dict1 where the key is the key from dictionary1 and the value is the value of pit. This is the code I have so far:
from collections import Counter, defaultdict

dictionary1 = Counter(df['Col'])
dict1 = defaultdict(int)
for i in range(len(dictionary1)):
    temp = df[df['Col'] == dictionary1.keys()[i]]
    b = temp['IsBuy'].sum()
    n = temp['IsBuy'].count()
    pit = b/n
    dict1[dictionary1.keys()[i]] = pit
My question is, how can I assign the key and value for dict1 based on the key of dictionary1 and the value obtained from the calculation of pit? In other words, what is the correct way to write the last line of code in the above script?
Thank you.
Since you're using pandas, I should point out that the problem you're facing is common enough that there's a built-in way to do it. Collecting "similar" data into groups and then performing operations on them is called a groupby operation. It's probably worthwhile reading the tutorial section on the groupby split-apply-combine idiom -- there are lots of neat things you can do!
The pandorable way to compute the pit values would be something like
df.groupby("Col")["IsBuy"].mean()
For example:
>>> # make dummy data
>>> N = 10**4
>>> df = pd.DataFrame({"Col": np.random.randint(1, 10, N), "IsBuy": np.random.choice([True, False], N)})
>>> df.head()
Col IsBuy
0 3 False
1 6 True
2 6 True
3 1 True
4 5 True
>>> df.groupby("Col")["IsBuy"].mean()
Col
1 0.511709
2 0.495697
3 0.489796
4 0.510658
5 0.507491
6 0.513183
7 0.522936
8 0.488688
9 0.490498
Name: IsBuy, dtype: float64
which is a Series that you could turn into a dictionary if you insisted:
>>> df.groupby("Col")["IsBuy"].mean().to_dict()
{1: 0.51170858629661753, 2: 0.49569707401032703, 3: 0.48979591836734693, 4: 0.51065801668211308, 5: 0.50749063670411987, 6: 0.51318267419962338, 7: 0.52293577981651373, 8: 0.48868778280542985, 9: 0.49049773755656106}
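For completeness, the literal question can also be answered directly: in Python 3, dict.keys() is not indexable, so dictionary1.keys()[i] fails; iterate over the keys instead. A minimal sketch of the corrected loop:

from collections import Counter, defaultdict

dictionary1 = Counter(df['Col'])
dict1 = defaultdict(int)
for key in dictionary1:  # iterate keys directly instead of indexing .keys()
    temp = df[df['Col'] == key]
    dict1[key] = temp['IsBuy'].sum() / temp['IsBuy'].count()

That said, the groupby version above computes the same thing in one pass.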
An example to illustrate the point: one column, "food_spice_levels", contains the following 5 categories:
high_heat, medium_heat, mild_heat, no_heat, bland
The goal is to create a new binary variable called "Spiciness" to show whether the food is spicy or not: bland, no_heat, mild_heat, and medium_heat should map to 0, and high_heat to 1, all in one new column.
Current code and issues:
df['Spiciness'] = df['food_spice_levels'].map({'Bland''no_heat''mild_heat''medium_heat': 0, 'high_heat': 1})
With commas between the categories in the "0" group, the code gave a syntax error. Without the commas, this warning came:
"SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
It did create a new column with high_heat correctly coded as 1, but all the desired 0 values came out as NaN, and I don't want to corrupt the dataset if the warning is telling me something that can't be ignored. Can anyone help so that I get 0s and 1s in the new column, while ideally avoiding this warning? Thanks!
IIUC
df = pd.DataFrame({'food': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'}, 'food_spice_levels': {0: 'bland', 1: 'high_heat', 2: 'mild_heat', 3: 'medium_heat', 4: 'high_heat', 5: 'bland'}})
print(df)
# food food_spice_levels
# 0 A bland
# 1 B high_heat
# 2 C mild_heat
# 3 D medium_heat
# 4 E high_heat
# 5 F bland
df['binary'] = (df['food_spice_levels']=='high_heat').astype(int)
print(df)
# food food_spice_levels binary
# 0 A bland 0
# 1 B high_heat 1
# 2 C mild_heat 0
# 3 D medium_heat 0
# 4 E high_heat 1
# 5 F bland 0
This utilizes the fact that booleans are represented as 1 (True) and 0 (False). The part df['food_spice_levels']=='high_heat' creates a boolean Series, which is then cast to int.
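As for the original attempt: without commas, Python concatenates the adjacent string literals into the single key 'Blandno_heatmild_heatmedium_heat', which never matches, so every non-high_heat value maps to NaN. If you prefer to stay with map, each category needs its own entry; a sketch, assuming the lowercase spellings from the question's category list:

spice_map = {'bland': 0, 'no_heat': 0, 'mild_heat': 0, 'medium_heat': 0, 'high_heat': 1}
df['Spiciness'] = df['food_spice_levels'].map(spice_map)

The SettingWithCopyWarning is a separate issue: it usually means df is itself a slice of another DataFrame, and working on an explicit copy (df = df.copy()) typically makes it go away.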
I am writing a piece of simulation software in python using pandas, here is my problem:
Imagine you have two pandas dataframes dfA and dfB with numeric columns A and B respectively.
Both dataframes have a different number of rows denoted by n and m.
Let's assume that n > m.
Moreover, dfA includes a binary column C with a 0 in exactly m rows (one for each row of dfB) and a 1 elsewhere.
Assume both dfA and dfB are sorted.
My question: walking through both frames in order, I want to add the values in B to the values in column A wherever column C == 0.
In the example n = 6, m = 3.
Example data:
dataA = {'A': [7, 7, 7, 7, 7, 7],
         'C': [1, 0, 1, 0, 0, 1]}
dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame([3, 5, 4], columns=['B'])
Example pseudocode:
# DOES NOT WORK
if dfA['C'] == 1:
    dfD['D'] = dfA['A']
else:
    dfD['D'] = dfA['A'] + dfB['B']
Expected result:
dfD['D']
[7,10,7,12,11,7]
I can only think of obscure for loops with index counters for each of the three vectors, but I am sure that there is a faster way by writing a function and using apply. But maybe there is something completely different that I am missing.
*NOTE: In the real problem the rows are not single values but row vectors of equal length. Moreover, in the real problem it is not simple addition but a weighted average of the two row vectors.
You can use:
m = dfA['C'].eq(1)  # True where C == 1
# keep A where the mask holds; elsewhere add B, whose index is first
# relabelled (set_axis) to line up with the rows of dfA where C == 0
dfA['C'] = dfA['A'].where(m, dfA['A'] + dfB['B'].set_axis(dfA.index[~m]))
Or:
dfA.loc[m, 'C'] = dfA.loc[m, 'A']                       # C == 1: keep A
dfA.loc[~m, 'C'] = dfA.loc[~m, 'A'] + dfB['B'].values   # C == 0: add B positionally
Output:
A C
0 7 7
1 7 10
2 7 7
3 7 12
4 7 11
5 7 7
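For the weighted-average variant mentioned in the question's note, the same mask-based indexing extends; a sketch assuming two scalar weights w1 and w2 (hypothetical placeholders for whatever weighting the real problem uses):

w1, w2 = 0.5, 0.5  # hypothetical weights
dfA.loc[~m, 'C'] = w1 * dfA.loc[~m, 'A'].to_numpy() + w2 * dfB['B'].to_numpy()

Converting both sides to numpy arrays sidesteps index alignment, so the m rows of dfB pair up positionally with the C == 0 rows of dfA.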
The other answer is pretty clever. I am just showing a different way, in case you would like to do it using loops:
# Create an empty df
dfD = pd.DataFrame()
# Create Loop
k = 0
for i in range(len(dfA)):
    if dfA.loc[i, "C"] == 1:
        dfD.loc[i, "D"] = dfA.loc[i, "A"]
    else:
        dfD.loc[i, "D"] = dfA.loc[i, "A"] + dfB.loc[k, "B"]
        k = k + 1
# Show results
dfD
I have a dataframe as follows:
A B C
1 6 1
2 5 7
3 4 9
4 2 2
I want a dictionary like this:
{A: [1,2,3,4], B:[6,5,4,2], C:[1,7,9,2]}
I have tried using the normal df.to_dict() and it is nowhere close. If I use the transposed dataframe, i.e. df.T.to_dict(), it gets close, but I get something like this:
{0: {A: 1, B: 6, C: 1}, ..., 3: {A: 4, B: 2, C: 2}}
The existing questions on Stack Overflow only cover the case where the dictionary has one value per key, not a list.
It would be very valuable for me to use to_dict() and avoid any for loop, since the database I am using is quite big and I want the computational complexity to be as low as possible.
orient='list'
df.to_dict(orient='list')
Or, since the method historically just checked the first character of the orient string (note that these one-letter abbreviations were deprecated in pandas 1.4 and removed in 2.0):
df.to_dict('l')
If you want numpy arrays as the values rather than lists:
{k: v.to_numpy() for k, v in df.items()}
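A quick check of the orient='list' call on the frame from the question:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [6, 5, 4, 2], 'C': [1, 7, 9, 2]})
print(df.to_dict(orient='list'))
# {'A': [1, 2, 3, 4], 'B': [6, 5, 4, 2], 'C': [1, 7, 9, 2]}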
I can solve my task by writing a for loop, but I wonder how to do this in a more pandorable way.
So I have this dataframe storing some lists, and I want to find all the rows whose lists share any common values.
(This code is just to obtain a df with lists:
>>> df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]})
>>> df
a b
0 A 1
1 A 2
2 B 5
3 B 1
4 B 4
5 C 6
>>> d = df.groupby('a')['b'].apply(list)
)
Here we start:
>>> d
A [1, 2]
B [5, 1, 4]
C [6]
Name: b, dtype: object
I want to select rows with index 'A' and 'B', because their lists overlap by the value 1.
I could now write a for loop, or expand the dataframe back out of these lists (reversing the way I got it above), with multiple rows duplicating the other values.
What would you do here? Or is there some way to use something like df.groupby(by=lambda x, y: not set(x).isdisjoint(y)), i.e. a grouping that compares two rows?
But groupby and boolean masking only look at one element at a time...
I then tried to overload the equality operator: lists are not hashable, so I tried tuples and then sets (forcing __hash__ to a constant to avoid identity comparison). I then used groupby and a merge of the frame with itself, but it seems to check off indexes it has already matched.
import pandas as pd
import numpy as np

class IndexTuple(set):
    def __hash__(self):
        # constant hash: force all instances into the same bucket,
        # so that __eq__ (overlap), not identity, decides equality
        return hash(1)

    def __eq__(self, other):
        # two index sets are "equal" if they overlap
        return not set(self).isdisjoint(other)
l = IndexTuple((1, 7))
l1 = IndexTuple((4, 7))
print(l == l1)

df = pd.DataFrame(np.random.randint(low=0, high=4, size=(10, 2)), columns=['a', 'b']).reset_index()
d = df.groupby('a')['b'].apply(IndexTuple).to_frame().reset_index()
print(d)
print(d.groupby('b').b.apply(list))
print(d.merge(d, on='b', how='outer'))
This outputs the following (it works fine for the first element, but at [{3}] there should be [{3}, {0, 3}] instead):
True
a b
0 0 {1}
1 1 {0, 2}
2 2 {3}
3 3 {0, 3}
b
{1} [{1}]
{0, 2} [{0, 2}, {0, 3}]
{3} [{3}]
Name: b, dtype: object
a_x b a_y
0 0 {1} 0
1 1 {0, 2} 1
2 1 {0, 2} 3
3 3 {0, 3} 1
4 3 {0, 3} 3
5 2 {3} 2
Using a merge on df:

# self-merge on 'b': rows pair up whenever they share a b value,
# producing columns a_x, b and a_y
v = df.merge(df, on='b')
# keep pairs from different groups, then collect those group labels
common_cols = set(
    np.sort(v.iloc[:, [0, -1]].query('a_x != a_y'), axis=1).ravel()
)
common_cols
{'A', 'B'}
Now, pre-filter and call groupby:
df[df.a.isin(common_cols)].groupby('a').b.apply(list)
a
A [1, 2]
B [5, 1, 4]
Name: b, dtype: object
I understand you are asking for a "pandorable" solution, but in my opinion this is not a task ideally suited to pandas.
Below is one solution using collections.defaultdict and itertools.combinations which produces your result without using a dataframe.
from collections import defaultdict
from itertools import combinations
data = {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]}
d = defaultdict(set)
for i, j in zip(data['a'], data['b']):
    d[i].add(j)

res = {frozenset({i, j}) for i, j in combinations(d, 2) if not d[i].isdisjoint(d[j])}
# {frozenset({'A', 'B'})}
Explanation
Group values into sets with collections.defaultdict; this is an O(n) pass over the data.
Iterate over pairs of groups with itertools.combinations, keeping pairs whose sets are not disjoint, via a set comprehension.
Use frozenset (or sorted tuple) for key type as lists are mutable and therefore cannot be used as dictionary keys.
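If you then want the grouped values for the overlapping keys, a small follow-up sketch building on d and res from above:

overlapping = set().union(*res)               # {'A', 'B'}
print({k: sorted(d[k]) for k in overlapping})
# {'A': [1, 2], 'B': [1, 4, 5]}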
Say I have a list (or numpy array or pandas series) as below
l = [1,2,6,6,4,2,4]
I want to return a list of each value's ordinal: 1-->1 (smallest), 2-->2, 4-->3, 6-->4, so that
to_ordinal(l) == [1,2,4,4,3,2,3]
and I want it to also work for lists of strings as input.
I can try
s = numpy.unique(l)
and then loop over each element in l to find its index in s. I just wonder if there is a more direct method?
In pandas you can call rank and pass method='dense':
In [18]:
l = [1,2,6,6,4,2,4]
s = pd.Series(l)
s.rank(method='dense')
Out[18]:
0 1
1 2
2 4
3 4
4 3
5 2
6 3
dtype: float64
This also works for strings:
In [19]:
l = ['aaa','abc','aab','aba']
s = pd.Series(l)
s
Out[19]:
0 aaa
1 abc
2 aab
3 aba
dtype: object
In [20]:
s.rank(method='dense')
Out[20]:
0 1
1 4
2 2
3 3
dtype: float64
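rank returns floats; if you want integer ordinals like in the question, one option is to cast (astype and tolist are standard Series methods):

s = pd.Series([1, 2, 6, 6, 4, 2, 4])
s.rank(method='dense').astype(int).tolist()
# [1, 2, 4, 4, 3, 2, 3]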
I don't think that there is a "direct method" for this1. The most straightforward way that I can think of is to sort a set of the elements:
sorted_unique = sorted(set(l))
Then make a dictionary mapping the value to it's ordinal:
ordinal_map = {val: i for i, val in enumerate(sorted_unique, 1)}
Now one more pass over the data and we can get your list:
ordinals = [ordinal_map[val] for val in l]
Note that this is roughly an O(N log N) algorithm (due to the sort) -- and the more non-unique elements you have, the closer it gets to O(N).
1Certainly not in vanilla Python, and I don't know of anything in numpy. I'm less familiar with pandas, so I can't speak to that.
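Putting the three steps together on the list from the question:

l = [1, 2, 6, 6, 4, 2, 4]

sorted_unique = sorted(set(l))                                    # [1, 2, 4, 6]
ordinal_map = {val: i for i, val in enumerate(sorted_unique, 1)}  # {1: 1, 2: 2, 4: 3, 6: 4}
ordinals = [ordinal_map[val] for val in l]
print(ordinals)                                                   # [1, 2, 4, 4, 3, 2, 3]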