Consider a Pandas pivot table like so:
                      E
A   B   C     D
bar one large 4       6
        small 5       8
    two large 7       9
        small 6       9
foo one large 2       9
        small 1       2
    two small 3      11
I would like to multiply each E entry that has A = bar by l and each entry that has A = foo by m. For entries with B = one I'd like to multiply by n, and for B = two by p. Every level of every dimension has its own multiplier, so each original value in E ends up multiplied by four factors, one per index level (A, B, C, D).
What is the fastest way to do this in Python? My actual table is high-dimensional and this operation will need to be done many times as part of an optimization process.
I created the pivot table using this code:
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
table = pd.pivot_table(df, values='D', index=['A', 'B', 'C', 'D'], aggfunc=np.sum)
The values to multiply by are stored in a dictionary.
For instance:
{'A': {'bar': 0.5, 'foo': 0.2},
'B': {'one': 0.1, 'two': 0.3},
'C': {'large': 2, 'small': 4},
'D': {1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6: 60, 7: 70}}
With this dictionary, the result for the first row would be 6 * 0.5 * 0.1 * 2 * 40 = 24.
You could use map on each level of your index, obtained with index.get_level_values:
table['Emult'] = table['E'] * np.prod([table.index.get_level_values(lv).map(d[lv])
                                       for lv in table.index.names],
                                      axis=0)
print (table)
                      E  Emult
A   B   C     D
bar one large 4       6   24.0
        small 5       8   80.0
    two large 7       9  189.0
        small 6       9  324.0
foo one large 2       9    7.2
        small 1       2    1.6
    two small 3      11   79.2
Here, d is the dictionary you gave in the question.
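Since the question says this multiplication will run many times inside an optimization loop, a possible speed-up is to factorize each index level once and reuse the integer codes, so each iteration avoids the dictionary-based map. This is only a sketch: it assumes the index stays fixed while the factor values in d change between iterations, and fast_emult is a hypothetical helper name.

import numpy as np
import pandas as pd

# One-off preprocessing: integer codes and the ordered keys for every index level.
level_codes, level_keys = {}, {}
for lv in table.index.names:
    codes, keys = pd.factorize(table.index.get_level_values(lv))
    level_codes[lv] = codes
    level_keys[lv] = keys

def fast_emult(table, d):
    """Multiply E by the per-level factors in d using the cached codes."""
    prod = np.ones(len(table))
    for lv in table.index.names:
        # factor for each unique key of this level, expanded to all rows
        factors = np.array([d[lv][k] for k in level_keys[lv]])
        prod *= factors[level_codes[lv]]
    return table['E'] * prod

table['Emult'] = fast_emult(table, d)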
Consider the following data:
, Animal, Color, Rank, X
0, c, b, 1, 9
1, c, b, 2, 8
2, c, b, 3, 7
3, c, r, 1, 6
4, c, r, 2, 5
5, c, r, 3, 4
6, d, g, 1, 3
7, d, g, 2, 2
8, d, g, 3, 1
I now want to group by ["Animal", "Color"] and, for every group, I want to subtract the X value that corresponds to Rank equal to 1 from every other X value in that group.
Currently I am looping like this:
dfs = []
for _, tmp in df.groupby(["Animal","Color"]):
baseline = tmp.loc[tmp["Rank"]==1,"X"].to_numpy()
tmp["Y"] = tmp["X"]-baseline
dfs.append(tmp)
dfs = pd.concat(dfs)
This yields the right result, i.e. within each group Y is X minus the X value of the row with Rank equal to 1.
The whole process is however really slow, and I would prefer to use apply or transform instead.
My problem is that I am unable to find a way to use the whole grouped data within apply or transform.
Is there a way to accelerate my computation?
For completeness, here's my MWE:
df = pd.DataFrame(
    {
        "Animal": {0: "c", 1: "c", 2: "c", 3: "c", 4: "c", 5: "c", 6: "d", 7: "d", 8: "d"},
        "Color": {0: "b", 1: "b", 2: "b", 3: "r", 4: "r", 5: "r", 6: "g", 7: "g", 8: "g"},
        "Rank": {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 1, 7: 2, 8: 3},
        "X": {0: 9, 1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1},
    }
)
Maybe it's the same as OP's solution performance-wise, but a little bit shorter:
# Just to be sure that we won't mess up the ordering after groupby
df.sort_values(['Animal', 'Color', 'Rank'], inplace=True)
df['Y'] = df['X'] - df.groupby(['Animal', 'Color']).transform('first')['X']
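A possible tweak (a sketch, not part of either answer): if you would rather not sort first, you can mask X so that only the Rank == 1 value survives in each group and broadcast it back with transform; 'max' here simply picks out the single non-NaN baseline per group.

# Baseline without pre-sorting: keep X only where Rank == 1, then broadcast
# that single surviving value back over each (Animal, Color) group.
baseline = df['X'].where(df['Rank'] == 1)
df['Y'] = df['X'] - baseline.groupby([df['Animal'], df['Color']]).transform('max')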
I think I found a solution that is faster (at least in my use case):
# get a unique identifier for every group
df["_group"] = df.groupby(["Animal", "Color"]).ngroup()
# for every group, get that identifier and the value X to be subtracted
baseline = df.loc[df["Rank"] == 1, ["_group", "X"]]
# merge the original data and the baseline data on the group
# this gives a new column with the Rank==1 value of X
df = pd.merge(df, baseline, on="_group", suffixes=("", "_baseline"))
# perform arithmetic
df["Y"] = df["X"] - df["X_baseline"]
# drop intermediate columns
df.drop(columns=["_group", "X_baseline"], inplace=True)
I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": the first time a particular "a", "b" pair of values appears it gets a label "uniqueN", and every later occurrence of that exact "a", "b" pair gets the same "uniqueN" label.
In this case:
"1", "3" (the first row) is the first unique "a", "b" pair, so it gets the value "unique1"; the seventh row is also "1", "3", so it gets "unique1" as well.
"2", "2" (the second row) is the next unique "a", "b" pair, so it gets "unique2"; the eighth row is also "2", "2", so it gets "unique2".
"3", "1" (the third row) is the next unique pair, so "unique3"; no other row in the df is "3", "1", so that value won't repeat.
And so on.
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected output (my code works, but it's not using pandas methods):
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
      .add(1)
      .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# keep only the first occurrence of each (a, b) pair
df1 = df[~df.duplicated()].copy()
# use the row index of each first occurrence as the group identifier
df1['unique'] = df1.index
# merge the identifiers back onto every row, matching on (a, b)
df2 = df.merge(df1, how='left')
print(df2)
Note that the identifiers here are the row indices of the first occurrences rather than "uniqueN" strings, but they label duplicate pairs in the same consistent way.
I have a dataframe df with a column 'ColA' whose values are dictionaries. How do I count the keys across this column using Python?
df = pd.DataFrame({
    'ColA': [
        {"a": 10, "b": 5, "c": [1, 2, 3], "d": 20},
        {"f": 1, "b": 3, "c": [0], "x": 71},
        {"a": 1, "m": 99, "w": [8, 6], "x": 88},
        {"a": 9, "m": 99, "c": [3], "x": 55},
    ]
})
Here I want to calculate the count of each key, as below, and then visualise the frequencies with a chart.
Expected answer:
a=3,
b=2,
c=3,
d=1,
f=1,
x=3,
m=2,
w=1
Try this: Series.explode transforms each list-like element into its own row, Series.value_counts counts the unique values, and Series.plot can then create a plot from the resulting series.
df.ColA.apply(lambda x : list(x.keys())).explode().value_counts()
a 3
c 3
x 3
b 2
m 2
f 1
d 1
w 1
Name: ColA, dtype: int64
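To get the chart the question asks for, one possible follow-up (assuming matplotlib is available; not shown in the original answer) is to call the plotting accessor on the counts:

import matplotlib.pyplot as plt

# bar chart of how often each key occurs across the ColA dictionaries
counts = df.ColA.apply(lambda x: list(x.keys())).explode().value_counts()
counts.plot.bar()
plt.show()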
I've created a dataframe 'Pclass':
   class deck    weight
0      3    C  0.367568
1      3    B  0.259459
2      3    D  0.156757
3      3    E  0.140541
4      3    A  0.070270
5      3    T  0.005405
My initial dataframe 'df' looks like:
    class deck
0       3  NaN
1       1    C
2       3  NaN
3       1    C
4       3  NaN
5       3  NaN
6       1    E
7       3  NaN
8       3  NaN
9       2  NaN
10      3    G
11      1    C
I want to fill in the null deck values in df by choosing a sample from the
decks given in Pclass based on the weights.
I've only managed to code the sampling procedure.
np.random.choice(a=Pclass.deck,p=Pclass.weight)
I'm having trouble implementing a method that finds the null rows belonging to class 3 and picks a random deck value for each of them (not the same value every time), so plain fillna with a single value won't do.
Note: I have another question similar to this one, but broader and using a groupby object to maximize efficiency, and I've gotten no responses to it. Any help would be greatly appreciated!
Edit: added rows to the dataframe Pclass:
1 F 0.470588
1 E 0.294118
1 D 0.235294
2 F 0.461538
2 G 0.307692
2 E 0.230769
This generates a random selection from the deck column of the Pclass dataframe and assigns the generated values (as many as are needed) to the deck column of the df dataframe. The same commands could be put in a list comprehension if you wanted to do this across different values of the class variable. I'd also recommend avoiding class as a variable name, since it's used to define new classes within Python.
import numpy as np
import pandas as pd

# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3],
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": normweights
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]

# Generate new values
new_vals = np.random.choice(a=Pclass.deck.values,
                            p=Pclass.weight.values, size=len(missing_locs))

# Assign the new values to the dataframe
# (.loc replaces the removed DataFrame.set_value; it works here because df
#  has a default RangeIndex, so positions and labels coincide)
df.loc[missing_locs, 'deck'] = new_vals
Running for multiple levels of the categorical variable
If you wanted to run this on all levels of the class variable you'd need to make sure you're selecting a subset of the data in Pclass (just the class of interest). One could use a list comprehension to find the missing data for each level of 'class' like so (I've updated the mock data below) ...
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code would be easier to read if it was in a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()
normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
    "weight": np.concatenate((normweights3, normweights2))
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

class_levels = [1, 2, 3]
for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        subset = Pclass[Pclass.cla == i]
        # Generate new values from this class's decks and weights
        new_vals = np.random.choice(a=subset.deck.values,
                                    p=subset.weight.values,
                                    size=len(missing_locs))
        # Assign the new values to the dataframe
        # (.loc replaces the removed DataFrame.set_value)
        df.loc[missing_locs, 'deck'] = new_vals
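The same idea can also be expressed with a single groupby over the class column, which is closer to what the question's note asks about. This is only a sketch of one possible approach (fill_deck is a hypothetical helper, not from the original answer), assuming Pclass holds weights for every class that actually has missing decks:

def fill_deck(group):
    # group.name is the class label of this group
    group = group.copy()
    weights = Pclass[Pclass.cla == group.name]
    missing = group['deck'].isnull()
    if missing.any() and len(weights):
        group.loc[missing, 'deck'] = np.random.choice(
            weights.deck.values, p=weights.weight.values, size=missing.sum())
    return group

df = df.groupby('cla', group_keys=False).apply(fill_deck)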
Is there a way to get a textual representation of a dataframe that I can just paste back into the REPL, but that still looks good as a table? NumPy's repr manages this pretty well; I'm talking about something like:
> df
   A  B  C
i
0  3  1  8
1  3  1  6
2  7  4  6
> df.to_python()
DataFrame(
    columns=['i', 'A', 'B', 'C'],
    data=[[0, 3, 1, 8],
          [1, 3, 1, 6],
          [2, 7, 4, 6]]
).set_index('i')
This seems like it would be especially useful for stack overflow, but I often find myself needing to share small dataframes and would love it if this were possible.
Edit: I know about to_csv, to_dict and so on; what I want is a way of exactly reproducing a dataframe that can also be read as a table. It seems this probably doesn't have a current answer (although I'd love to see pandas add one), but I think I can make pd.read_clipboard('\s\s+') work for 95% of my usages.
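For reference, a minimal clipboard round-trip along those lines (a sketch; it assumes the printed frame has already been copied to the clipboard, e.g. from a REPL session or a question) would be:

import pandas as pd

# two-or-more-spaces separator, so values containing single spaces survive
df = pd.read_clipboard(sep=r'\s\s+', engine='python')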
StringIO lets Python treat a string as a file-like object, which allows you to use the read_csv method on it; example below...
df = """ A B C
i
0 3 1 8
1 3 1 6
2 7 4 6"""#this is equivalent to str(df) or what happens when you use print df
df = pd.read_csv(StringIO.StringIO(df),sep="\s*",engine = 'python')
df.to_dict() will get you close, although you do lose the index name:
df.to_dict()
Out[5]: {'A': {0: 3, 1: 3, 2: 7}, 'B': {0: 1, 1: 1, 2: 4}, 'C': {0: 8, 1: 6, 2: 6}}
df_copy = pd.DataFrame(df.to_dict())
df_copy
Out[7]:
   A  B  C
0  3  1  8
1  3  1  6
2  7  4  6
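If losing the index bothers you, one possible workaround (not part of the original answer, and it assumes pandas 1.4 or newer) is the 'tight' orientation, which preserves index values, index names and column order:

# round-trip that keeps the index and its name (pandas >= 1.4)
df_copy = pd.DataFrame.from_dict(df.to_dict(orient='tight'), orient='tight')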