Pandas groupby sequential values - python

I have no idea what to call this operation, so I couldn't really google anything, but here's what I'm trying to do:
I have this dataframe:
df = pd.DataFrame({"name": ["A", "B", "B", "B", "A", "A", "B"], "value":[3, 1, 2, 0, 5, 2, 3]})
df
name value
0 A 3
1 B 1
2 B 2
3 B 0
4 A 5
5 A 2
6 B 3
And I want to group it on df.name and apply a max function on df.value, but only over runs of consecutive equal names. So my desired result is as follows:
df.groupby_sequence("name")["value"].agg(max)
name value
0 A 3
1 B 2
2 A 5
3 B 3
Any clue how to do this?

Using pandas, you can group on the points where the name changes from one row to the next, using (df.name != df.name.shift()).cumsum(), which essentially groups consecutive names together:
>>> df.groupby((df.name!=df.name.shift()).cumsum()).max().reset_index(drop=True)
name value
0 A 3
1 B 2
2 A 5
3 B 3
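To see why this works, inspect the intermediate grouper: the comparison flags every row whose name differs from the previous row, and the cumulative sum turns each run of consecutive names into its own group id:
>>> (df.name != df.name.shift()).cumsum()
0    1
1    2
2    2
3    2
4    3
5    3
6    4
Name: name, dtype: int64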

Not exactly a pandas solution, but you could use groupby from itertools:
from operator import itemgetter
import pandas as pd
from itertools import groupby
df = pd.DataFrame({"name": ["A", "B", "B", "B", "A", "A", "B"], "value":[3, 1, 2, 0, 5, 2, 3]})
result = [max(group, key=itemgetter(1)) for k, group in groupby(zip(df.name, df.value), key=itemgetter(0))]
print(result)
Output
[('A', 3), ('B', 2), ('A', 5), ('B', 3)]
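Note that, unlike pandas' groupby, itertools.groupby only groups consecutive equal keys, which is exactly the "in sequence" behaviour the question asks for.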

Related

Return a boolean mask for a Pandas index wrt a subset of the index

I have a df like this:
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
And a subset of the indexes: s = ["a", "b"]
I would like to form the boolean mask so I can change the values of row "a" and "b" only, like so:
df.loc[m, "col"] = [10, 11]
Is there a neat way to do this?
If the list of assigned values is in the same order as the list of indexes, use:
df.loc[s, "col"] = [10, 11]
print (df)
col
a 10
b 11
c 3
If you use a boolean mask created with Index.isin and the order of the list differs from the index order, you get different output, because mask-based assignment fills values in index order rather than list order:
df = pd.DataFrame(index=["a", "b", "c"], data={"col": [1, 2, 3]})
# swapped order
s = ["b", "a"]
m = df.index.isin(s)
df.loc[m, "col"] = [10, 11]
print (df)
col
a 10
b 11
c 3
Whereas selecting by the list of labels assigns the values in list order:
df.loc[s, "col"] = [10, 11]
print (df)
col
a 11
b 10
c 3
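As a small aside (my sketch, not part of the answer above): if you want the assignment to follow the labels regardless of order, DataFrame.update aligns a named Series on the index, so the order of s stops mattering:
vals = pd.Series([10, 11], index=["b", "a"], name="col")
df.update(vals)  # aligned on the index labels: a gets 11, b gets 10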

Proper way to do this in pandas without using for loop

The question is: I would like to avoid iterrows here.
From my dataframe I want to create a new column "unique": the first time a particular pair of "a" and "b" values appears, it gets a value "uniqueN", and every later occurrence of exactly the same "a", "b" pair gets that same "uniqueN".
In this case
"1", "3" (the first row) from "a" and "b" is the first unique pair, so I give that the value "unique1", and the seventh row will also have the same value which is "unique1" as it is also "1", "3".
"2", "2" (the second row) is the next unique "a", "b" pair so I give them "unique2" and the eight row also has "2", "2" so that will also have "unique2".
"3", "1" (third row) is the next unique, so "unique3", no more rows in the df is "3", "1" so that value wont repeat.
and so on
I have working code that uses loops, but this is not the pandas way. Can anyone suggest how I can do this using pandas functions?
Expected Output (My code works, but its not using pandas methods)
a b unique
0 1 3 unique1
1 2 2 unique2
2 3 1 unique3
3 4 2 unique4
4 3 3 unique5
5 4 2 unique4
6 1 3 unique1
7 2 2 unique2
Code
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})
c = 1
seen = {}
for i, j in df.iterrows():
    j = tuple(j)
    if j not in seen:
        seen[j] = 'unique' + str(c)
        c += 1
for key, value in seen.items():
    df.loc[(df.a == key[0]) & (df.b == key[1]), 'unique'] = value
Let's use groupby ngroup with sort=False to ensure values are enumerated in order of appearance, add 1 so group numbers start at one, then convert to string with astype so we can add the prefix unique to the number:
df['unique'] = 'unique' + \
    df.groupby(['a', 'b'], sort=False).ngroup().add(1).astype(str)
Or with map and format instead of converting and concatenating:
df['unique'] = (
    df.groupby(['a', 'b'], sort=False).ngroup()
    .add(1)
    .map('unique{}'.format)
)
df:
a b unique
0 1 3 unique1
1 2 2 unique2
2 3 1 unique3
3 4 2 unique4
4 3 3 unique5
5 4 2 unique4
6 1 3 unique1
7 2 2 unique2
Setup:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]
})
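As an aside (my sketch, not from the original answers), pandas.factorize also numbers values in order of first appearance, so you can factorize the zipped (a, b) pairs for the same labelling without groupby:
# factorize assigns 0, 1, 2, ... to each distinct (a, b) pair in order of appearance
codes, _ = pd.factorize(list(zip(df['a'], df['b'])))
df['unique'] = 'unique' + pd.Series(codes + 1, index=df.index).astype(str)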
I came up with a slightly different solution. I'll add this for posterity, but the groupby answer is superior.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 4, 1, 2], 'b': [3, 2, 1, 2, 3, 2, 3, 2]})

# keep only the first occurrence of each (a, b) pair
# (.copy() avoids a SettingWithCopyWarning on the next assignment)
df1 = df[~df.duplicated()].copy()

# label the unique pairs in order of appearance
df1['unique'] = ['unique' + str(i + 1) for i in range(len(df1))]

# a left merge on the shared columns (a, b) broadcasts each label
# back to every occurrence of its pair
df2 = df.merge(df1, how='left')
print(df2)

List of tuples for each pandas dataframe slice

I need to do something very similar to this question: Pandas convert dataframe to array of tuples
The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.
Supposing this is my data set:
t_id A B
----- ---- -----
0 AAAA 1 2.0
1 AAAA 3 4.0
2 AAAA 5 6.0
3 BBBB 7 8.0
4 BBBB 9 10.0
...
I want to produce as output:
[[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]
That is, one list for 'AAAA', another for 'BBBB' and so on.
I've tried with two nested for loops. It seems to work, but it is taking too long (the actual data set has ~1M rows):
result = []
for t in df['t_id'].unique():
    tuple_list = []
    for x in df[df['t_id'] == t].iterrows():
        row = x[1][['A', 'B']]
        tuple_list.append(tuple(row))
    result.append(tuple_list)
Is there a faster way to do it?
You can groupby column t_id, iterate through groups and convert each sub dataframe into a list of tuples:
[g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
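If the to_records round-trip proves slow on the ~1M-row data, a variant of the same groupby using plain zip (my sketch, not part of the original answer) usually does less work per group:
result = [list(zip(g['A'], g['B'])) for _, g in df.groupby('t_id')]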
I think this should work too:
import pandas as pd
import itertools
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})
tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
print(output)
Out:
[[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]

Python Pandas declaration empty DataFrame only with columns

I need to declare an empty dataframe in Python in order to append to it later in a loop. Below is the line of the declaration:
result_table = pd.DataFrame([[], [], [], [], []], columns = ["A", "B", "C", "D", "E"])
It throws an error:
AssertionError: 5 columns passed, passed data had 0 columns
Why is that? I tried to find a solution, but failed.
import pandas as pd
df = pd.DataFrame(columns=['A','B','C','D','E'])
That's it!
Because you actually pass no data: five empty rows, each with zero columns, while five column names were given. Try this
result_frame = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
if you then want to add data, use
result_frame.loc[len(result_frame)] = [1, 2, 3, 4, 5]
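One caveat worth adding (an assumption about later use, not from the answer above): a frame created with only column names gets object dtype for every column, which can slow down later numeric operations. A sketch declaring dtypes up front:
# hypothetical dtypes; pick whatever your real columns hold
result_frame = pd.DataFrame({
    'a': pd.Series(dtype='int64'),
    'b': pd.Series(dtype='float64'),
    'c': pd.Series(dtype='object'),
})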
I think it is better to create a list of tuples or lists and then call DataFrame only once:
L = []
for i in range(3):
    # some random data
    a = 1
    b = i + 2
    c = i - b
    d = i
    e = 10
    L.append((a, b, c, d, e))
print (L)
[(1, 2, -2, 0, 10), (1, 3, -2, 1, 10), (1, 4, -2, 2, 10)]
result_table = pd.DataFrame(L, columns = ["A", "B", "C", "D", "E"])
print (result_table)
A B C D E
0 1 2 -2 0 10
1 1 3 -2 1 10
2 1 4 -2 2 10
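Building the list first is preferable because appending rows to a DataFrame one at a time copies the data on every append, making the loop quadratic; collecting tuples in a plain list and calling DataFrame once keeps it linear.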

Filling in nulls in dataframe with different elements by sampling pandas

I've created a dataframe 'Pclass':
class deck weight
0 3 C 0.367568
1 3 B 0.259459
2 3 D 0.156757
3 3 E 0.140541
4 3 A 0.070270
5 3 T 0.005405
My initial dataframe 'df' looks like:
class deck
0 3 NaN
1 1 C
2 3 NaN
3 1 C
4 3 NaN
5 3 NaN
6 1 E
7 3 NaN
8 3 NaN
9 2 NaN
10 3 G
11 1 C
I want to fill in the null deck values in df by choosing a sample from the
decks given in Pclass based on the weights.
I've only managed to code the sampling procedure.
np.random.choice(a=Pclass.deck,p=Pclass.weight)
I'm having trouble implementing a method to fill in the nulls by finding the null rows that belong to class 3 and picking a random deck value for each (not the same value every time), so not fillna('with just one').
Note: I have another question similar to this,but broader with a groupby object as well to maximize efficiency but I've gotten no responses. Any help would be greatly appreciated!
edit: added rows to dataframe Pclass
1 F 0.470588
1 E 0.294118
1 D 0.235294
2 F 0.461538
2 G 0.307692
2 E 0.230769
This generates a random selection from the deck column of the Pclass dataframe and assigns the sampled values to the deck column of df (generating exactly as many as there are nulls). These commands could be put in a list comprehension if you wanted to do this across different values of the class variable. I'd also recommend avoiding class as a variable name, since it's a reserved keyword in Python.
import numpy as np
import pandas as pd
# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3],
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": normweights
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]

# Generate new values
new_vals = np.random.choice(a=Pclass.deck.values,
                            p=Pclass.weight.values, size=len(missing_locs))

# Assign the new values to the dataframe
# (set_value has been removed from current pandas; use .loc with the matching labels)
df.loc[df.index[missing_locs], 'deck'] = new_vals
Running for multiple levels of the categorical variable
If you wanted to run this on all levels of the class variable you'd need to make sure you're selecting a subset of the data in Pclass (just the class of interest). One could use a list comprehension to find the missing data for each level of 'class' like so (I've updated the mock data below) ...
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code would be easier to read if it was in a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()
normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
    "weight": np.concatenate((normweights3, normweights2))
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})

class_levels = [1, 2, 3]
for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        # Only use the weights for the class of interest
        subset = Pclass[Pclass.cla == i]
        # Generate new values
        new_vals = np.random.choice(a=subset.deck.values,
                                    p=subset.weight.values, size=len(missing_locs))
        # Assign the new values to the dataframe
        # (set_value has been removed from current pandas; use .loc instead)
        df.loc[df.index[missing_locs], 'deck'] = new_vals
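For what it's worth, here is a sketch of the same per-class fill expressed with groupby.apply instead of an explicit loop (fill_deck is a hypothetical helper name, and the sketch assumes the Pclass/df mock data above):
def fill_deck(deck):
    # deck is one class's slice of the 'deck' column; deck.name is the class label
    sub = Pclass[Pclass.cla == deck.name]
    out = deck.copy()
    missing = out.isnull()
    if sub.empty or not missing.any():
        return out
    out[missing] = np.random.choice(sub.deck.values, size=missing.sum(),
                                    p=sub.weight.values)
    return out

df['deck'] = df.groupby('cla', group_keys=False)['deck'].apply(fill_deck)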
