Filling in nulls in a dataframe with different elements by sampling - pandas / Python

I've created a dataframe 'Pclass':
   class deck    weight
0      3    C  0.367568
1      3    B  0.259459
2      3    D  0.156757
3      3    E  0.140541
4      3    A  0.070270
5      3    T  0.005405
My initial dataframe 'df' looks like:
    class deck
0       3  NaN
1       1    C
2       3  NaN
3       1    C
4       3  NaN
5       3  NaN
6       1    E
7       3  NaN
8       3  NaN
9       2  NaN
10      3    G
11      1    C
I want to fill in the null deck values in df by choosing a sample from the
decks given in Pclass based on the weights.
I've only managed to code the sampling procedure.
np.random.choice(a=Pclass.deck, p=Pclass.weight)
I'm having trouble implementing a method that fills in the nulls: it should find the null rows that belong to class 3 and pick a random deck value for each one (not the same value every time), so not fillna() with a single value.
Note: I have another question similar to this but broader, with a groupby object as well to maximize efficiency, but it has gotten no responses. Any help would be greatly appreciated!
Edit: added rows to dataframe Pclass for classes 1 and 2:
class deck    weight
    1    F  0.470588
    1    E  0.294118
    1    D  0.235294
    2    F  0.461538
    2    G  0.307692
    2    E  0.230769

The code below generates a random selection from the deck column of the Pclass dataframe and assigns these values to the deck column of df (generating the required number of values). These commands could be put in a list comprehension if you wanted to do this across different values of the class variable. I'd also recommend avoiding class as a variable name, since it's a reserved keyword in Python.
import numpy as np
import pandas as pd

# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3],
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": normweights
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

# Find the positions of the missing class-3 values
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]

# Generate one new value per missing row
new_vals = np.random.choice(a=Pclass.deck.values,
                            p=Pclass.weight.values,
                            size=len(missing_locs))

# Assign the new values to the dataframe
# (set_value was removed in pandas 1.0; .loc works here since df has a default RangeIndex)
df.loc[missing_locs, 'deck'] = new_vals
Running for multiple levels of the categorical variable
If you wanted to run this on all levels of the class variable you'd need to make sure you're selecting a subset of the data in Pclass (just the class of interest). One could use a list comprehension to find the missing data for each level of 'class' like so (I've updated the mock data below) ...
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code would be easier to read if it was in a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()

normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
    "weight": np.concatenate((normweights3, normweights2))
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

class_levels = [1, 2, 3]
for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        # Use only the weights for the current class
        subset = Pclass[Pclass.cla == i]
        # Generate new values
        new_vals = np.random.choice(a=subset.deck.values,
                                    p=subset.weight.values,
                                    size=len(missing_locs))
        # Assign the new values to the dataframe
        df.loc[missing_locs, 'deck'] = new_vals
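Since the original question mentions wanting a broader groupby-based version for efficiency, here is a minimal sketch of that idea, reusing the Pclass and df frames defined above (the helper name fill_deck is mine, not from the question):
import numpy as np
import pandas as pd

def fill_deck(group, weights):
    # group.name is the class level for this group
    sub = weights[weights.cla == group.name]
    mask = group['deck'].isnull()
    if mask.any() and len(sub) > 0:
        # Sample one replacement deck per missing entry, using that class's weights
        group.loc[mask, 'deck'] = np.random.choice(
            sub.deck.values, p=sub.weight.values, size=mask.sum())
    return group

# One pass over all class levels instead of an explicit loop
df = df.groupby('cla', group_keys=False).apply(fill_deck, weights=Pclass)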

Related

Return a boolean mask for a Pandas index wrt a subset of the index

I have a df like this:
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
And a subset of the indexes: s = ["a", "b"]
I would like to form the boolean mask so I can change the values of row "a" and "b" only, like so:
df.loc[m, "col"] = [10, 11]
Is there a neat way to do this?
If the order of the assigned values matches the order of the index labels, use:
df.loc[s, "col"] = [10, 11]
print (df)
   col
a   10
b   11
c    3
If you use a boolean mask built with Index.isin and the order of the list differs from the index order, you get a different output:
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
#swapped order
s = ["b", "a"]
m = df.index.isin(s)
df.loc[m, "col"] = [10, 11]
print (df)
   col
a   10
b   11
c    3
df.loc[s, "col"] = [10, 11]
print (df)
   col
a   11
b   10
c    3
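If you do need the boolean mask but want the assigned values to follow the order of your list s rather than the index order, one option (a sketch reusing df, s = ["b", "a"], and m from above) is to route the values through a Series indexed by the labels:
vals = pd.Series([10, 11], index=s)          # pairs 10 with "b", 11 with "a"
df.loc[m, "col"] = vals[df.index[m]].values  # realigned to index order: a -> 11, b -> 10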

Proper way to do this in pandas without using a for loop

I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": whenever a new combination of "a" and "b" column values appears I give it the value "uniqueN", and every later occurrence of that exact "a", "b" pair gets the same "uniqueN".
In this case
"1", "3" (the first row) from "a" and "b" is the first unique pair, so I give that the value "unique1", and the seventh row will also have the same value which is "unique1" as it is also "1", "3".
"2", "2" (the second row) is the next unique "a", "b" pair so I give them "unique2" and the eight row also has "2", "2" so that will also have "unique2".
"3", "1" (third row) is the next unique, so "unique3", no more rows in the df is "3", "1" so that value wont repeat.
and so on
I have working code that uses loops, but that is not the pandas way; can anyone suggest how I can do this using pandas functions?
Expected Output (my code works, but it's not using pandas methods)
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Code
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

c = 1
seen = {}
# Record a label for each unique (a, b) pair in order of appearance
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
# Write each pair's label back to all matching rows
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
      .add(1)
      .map('unique{}'.format)
)
df:
   a  b   unique
0  1  3  unique1
1  2  2  unique2
2  3  1  unique3
3  4  2  unique4
4  3  3  unique5
5  4  2  unique4
6  1  3  unique1
7  2  2  unique2
Setup:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# Keep only the first occurrence of each (a, b) pair
df1 = df[~df.duplicated()].copy()
# Label the unique pairs in order of appearance
df1['unique'] = ['unique' + str(i) for i in range(1, len(df1) + 1)]
# Broadcast the labels back onto all rows with a left merge
df2 = df.merge(df1, how='left')
print(df2)
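For completeness, a similar numbering can be produced with pd.factorize, which also codes values in order of first appearance; this is a sketch of mine, not from the original answers:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# factorize needs a 1-D sequence, so pack each row into a tuple first
codes, _ = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))
df['unique'] = ['unique' + str(c + 1) for c in codes]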

Multidimensional multiplication in Pandas pivot tables

Consider a Pandas pivot table like so:
                   E
A   B   C     D
bar one large 4    6
        small 5    8
    two large 7    9
        small 6    9
foo one large 2    9
        small 1    2
    two small 3   11
I would like to multiply each E entry that has A = bar by l and each that has A = foo by m. For entries that have B = one I'd like to multiply by n, and for B = two by p. For every level of every dimension I have a different value to multiply E by, so each original value in E ends up multiplied by one factor per index level (four here).
What is the fastest way to do this in Python? My actual table is high-dimensional and this operation will need to be done many times as part of an optimization process.
I created the pivot table using this code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

# Note: values must be 'E' (not 'D') to produce the table shown above
table = pd.pivot_table(df, values='E', index=['A', 'B', 'C', 'D'], aggfunc=np.sum)
The values to multiply by are stored in a dictionary.
For instance:
{'A': {'bar': 0.5, 'foo': 0.2},
'B': {'one': 0.1, 'two': 0.3},
'C': {'large': 2, 'small': 4},
'D': {1: 10, 2: 20, 3: 30, 4: 40, 5: 50, 6: 60, 7: 70}}
With this dictionary, the result for the first row would be 6 * 0.5 * 0.1 * 2 * 40 = 24.
You could use map on each level of your index, obtained with index.get_level_values:
table['Emult'] = table['E'] * np.prod([table.index.get_level_values(lv).map(d[lv])
                                       for lv in table.index.names],
                                      axis=0)
print (table)
                   E  Emult
A   B   C     D
bar one large 4    6   24.0
        small 5    8   80.0
    two large 7    9  189.0
        small 6    9  324.0
foo one large 2    9    7.2
        small 1    2    1.6
    two small 3   11   79.2
where d is the dictionary you gave in the question
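Since the question notes this multiplication will run many times inside an optimization loop, you could precompute the combined per-row factor once and reuse it. A minimal sketch, assuming d maps every index level name to a complete value-to-multiplier dictionary as shown above (the helper name level_factors is mine):
import numpy as np

def level_factors(table, d):
    # Per-row product of the mapped multipliers across all index levels
    return np.prod([table.index.get_level_values(lv).map(d[lv]).to_numpy()
                    for lv in table.index.names], axis=0)

factors = level_factors(table, d)                 # compute once
table['Emult'] = table['E'].to_numpy() * factors  # reuse on every iteration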

Pandas groupby sequential values

I have no idea how to call this operation, so I couldn't really google anything, but here's what I'm trying to do:
I have this dataframe:
df = pd.DataFrame({"name": ["A", "B", "B", "B", "A", "A", "B"], "value":[3, 1, 2, 0, 5, 2, 3]})
df
  name  value
0    A      3
1    B      1
2    B      2
3    B      0
4    A      5
5    A      2
6    B      3
And I want to group it on df.name and apply a max function to df.value, but only over consecutive runs of the same name. So my desired result is as follows:
df.groupby_sequence("name")["value"].agg(max)
  name  value
0    A      3
1    B      2
2    A      5
3    B      3
Any clue how to do this?
Using pandas, you can groupby when the name changes from row to row, using (df.name!=df.name.shift()).cumsum(), which essentially groups together consecutive names:
>>> df.groupby((df.name!=df.name.shift()).cumsum()).max().reset_index(drop=True)
  name  value
0    A      3
1    B      2
2    A      5
3    B      3
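If you want explicit control over which aggregation applies to which column, the same consecutive-run key works with named aggregation; a sketch, assuming pandas >= 0.25 (the names grp and out are mine):
# Key that increments whenever the name changes between rows
grp = (df.name != df.name.shift()).cumsum()
out = (df.groupby(grp)
         .agg(name=('name', 'first'), value=('value', 'max'))
         .reset_index(drop=True))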
Not exactly a pandas solution, but you could use groupby from itertools:
from operator import itemgetter
import pandas as pd
from itertools import groupby
df = pd.DataFrame({"name": ["A", "B", "B", "B", "A", "A", "B"], "value":[3, 1, 2, 0, 5, 2, 3]})
result = [max(group, key=itemgetter(1))
          for k, group in groupby(zip(df.name, df.value), key=itemgetter(0))]
print(result)
Output
[('A', 3), ('B', 2), ('A', 5), ('B', 3)]
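If you need the result back as a DataFrame, the list of tuples converts directly (the name out is mine):
out = pd.DataFrame(result, columns=['name', 'value'])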

Python Pandas declaration empty DataFrame only with columns

I need to declare an empty dataframe in Python in order to append to it later in a loop. Here is the declaration:
result_table = pd.DataFrame([[], [], [], [], []], columns = ["A", "B", "C", "D", "E"])
It throws an error:
AssertionError: 5 columns passed, passed data had 0 columns
Why is it so? I tried to find out the solution, but I failed.
import pandas as pd
df = pd.DataFrame(columns=['A','B','C','D','E'])
That's it!
Because you actually pass no data: each inner list is an empty row, so the data has zero columns while five column names were passed. Try this:
result_frame = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
If you then want to add data, use:
result_frame.loc[len(result_frame)] = [1, 2, 3, 4, 5]
I think it is better to create a list of tuples (or lists) and then call DataFrame only once:
L = []
for i in range(3):
    # some random data
    a = 1
    b = i + 2
    c = i - b
    d = i
    e = 10
    L.append((a, b, c, d, e))

print (L)
[(1, 2, -2, 0, 10), (1, 3, -2, 1, 10), (1, 4, -2, 2, 10)]

result_table = pd.DataFrame(L, columns = ["A", "B", "C", "D", "E"])
print (result_table)
   A  B  C  D   E
0  1  2 -2  0  10
1  1  3 -2  1  10
2  1  4 -2  2  10
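If you want to verify the efficiency claim yourself, here is a rough, machine-dependent comparison sketch (the function names and row count are arbitrary choices of mine):
import timeit
import pandas as pd

def loc_append(n=1000):
    # Grow an empty frame one row at a time
    frame = pd.DataFrame(columns=list('ABCDE'))
    for i in range(n):
        frame.loc[len(frame)] = [i, i, i, i, i]
    return frame

def build_once(n=1000):
    # Collect plain tuples first, construct the frame once
    rows = [(i, i, i, i, i) for i in range(n)]
    return pd.DataFrame(rows, columns=list('ABCDE'))

print(timeit.timeit(loc_append, number=3))  # typically far slower
print(timeit.timeit(build_once, number=3))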
