Dataframe age column grouping in pandas [duplicate]

It seems like a simple question, but I need your help.
For example, I have df:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]
How can I group 'x' into the ranges 1 to 5 and 6 to 10, and calculate the mean 'y' value for these two bins?
I expect to get a new df like:
x_grpd = [5, 10]
y_grpd = [3, 6.4]
The range of 'x' here is just an example. Ideally, I want to be able to set any integer value to get a different number of bins.

You can use pd.cut and groupby.mean:
import pandas as pd

# build the df from the question's lists
df = pd.DataFrame({'x': x, 'y': y})

bins = [5, 10]
df2 = (df
       .groupby(pd.cut(df['x'], [0] + bins,
                       labels=bins,
                       right=True))
       ['y'].mean()
       .reset_index()
)
Output:
    x    y
0   5  3.0
1  10  6.4
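To address the last part of the question (choosing the bin size freely), you can generate the bin edges with numpy instead of hard-coding them. A minimal sketch, assuming equal-width integer bins; step is a hypothetical name for the chosen width:
import numpy as np

step = 5  # any positive int: the bin width
edges = np.arange(0, df['x'].max() + step, step)  # here: [0, 5, 10]
df2 = (df.groupby(pd.cut(df['x'], edges, labels=edges[1:], right=True))
         ['y'].mean()
         .reset_index())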

Related

extract elements of tuple from a pandas series

I have a pandas Series whose elements are tuples of lists. Each tuple has length exactly 2, and there are a bunch of NaNs. I am trying to split each list in the tuple into its own column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})
Expected output: two columns, x and y, holding the first and second element of each tuple, with NaN preserved for the missing row.
If you know the length of the tuples is exactly 2, you can do:
df["x"] = df.val.str[0]
df["y"] = df.val.str[1]
print(df[["x", "y"]])
Prints:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
You could also convert the column to a list and pass it to the DataFrame constructor (filling None with np.nan as well):
out = pd.DataFrame(df['val'].tolist(), columns=['x','y']).fillna(np.nan)
Output:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
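If you then want these columns back on the original frame, a small follow-up (assuming the default RangeIndex, so the join aligns row-for-row):
df = df.join(out)  # adds x and y next to the original val column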
One way using pandas.Series.apply:
new_df = df["val"].apply(pd.Series)
print(new_df)
Output:
           0             1
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
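If you want the same x/y column names as the other answers, you can rename afterwards; note that apply(pd.Series) is generally much slower than the tolist() approach on large Series. A small follow-up sketch:
new_df.columns = ["x", "y"]  # replace the default 0/1 column labels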

Pandas cumsum separated by comma

I have a dataframe with a column whose data looks like:
  my_column my_column_two
0     1,2,3             A
1     5,6,8             A
2     9,6,8             B
3     5,5,8             B
if I do:
data = df.astype(str).groupby('my_column_two').agg(','.join).cumsum()
data.iloc[[0]]['my_column'].apply(print)
data.iloc[[1]]['my_column'].apply(print)
I have:
1,2,3,5,6,8
1,2,3,5,6,89,6,8,5,5,8
How can I get 1,2,3,5,6,8,9,6,8,5,5,8, so that the cumulative sum adds a comma when appending the previous row? (Notice 89 should be 8,9.)
Were you after this?
df['new']=df.groupby('my_column_two')['my_column'].apply(lambda x: x.str.split(',').cumsum())
  my_column my_column_two                 new
0     1,2,3             A           [1, 2, 3]
1     5,6,8             A  [1, 2, 3, 5, 6, 8]
2     9,6,8             B           [9, 6, 8]
3     5,5,8             B  [9, 6, 8, 5, 5, 8]
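If you need the cumulative result back as a comma-separated string rather than a list, a small follow-up sketch (new_str is a hypothetical column name):
# each cell of 'new' is a list of strings, so str.join can glue them back
df['new_str'] = df['new'].str.join(',')
# e.g. the last B row becomes '9,6,8,5,5,8'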

Maximum of an array constituting a pandas dataframe cell

I have a pandas dataframe in which a column is formed by arrays. So every cell is an array.
Say there is a column A in dataframe df, such that
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     ...]
I want to operate on each array and get, e.g., the maximum of each array, and store it in another column.
In the example, I would like to obtain another column
B = [3,
     6,
     9,
     ...]
I have tried these approaches so far, none of which gives what I want:
df['B'] = np.max(df['A'])
df.applymap (lambda B: A.max())
df['B'] = df.applymap (lambda B: np.max(np.array(df['A'].tolist()),0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just use apply(max). It doesn't matter whether the values are lists or np.arrays.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Outputs
           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
Here is one way without apply:
df['B']=np.max(df['A'].values.tolist(),axis=1)
           A  B
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
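One caveat on the numpy version (an assumption worth checking in your data): values.tolist() must form a rectangular array, so every list needs the same length. With ragged lists, the apply(max) route still works:
df = pd.DataFrame({'A': [[1, 2], [4, 5, 6], [7]]})
df['B'] = df['A'].apply(max)  # 2, 6, 7
# np.max(df['A'].values.tolist(), axis=1) would fail here,
# because the nested lists have different lengths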

picking values from columns [duplicate]

This question already has answers here:
Vectorized lookup on a pandas dataframe (3 answers)
I have a pandas DataFrame with values in a number of columns (two, for simplicity) and a column of column names I want to use to pick values from the other columns:
import pandas as pd
import numpy as np
np.random.seed(1337)
df = pd.DataFrame(
    {"a": np.arange(10),
     "b": 10 - np.arange(10),
     "c": np.random.choice(["a", "b"], 10)}
)
which gives
> df['c']
0 b
1 b
2 a
3 a
4 b
5 b
6 b
7 a
8 a
9 a
Name: c, dtype: object
That is, I want the first and second elements to be picked from column b, the third from a and so on.
This works:
def pick_vals_from_cols(df, col_selector):
    condlist = np.row_stack(col_selector.map(lambda x: x == df.columns))
    values = np.select(condlist.transpose(), df.values.transpose())
    return values
> pick_vals_from_cols(df, df["c"])
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9], dtype=object)
But it just feels so fragile and clunky. Is there a better way to do this?
lookup
df.lookup(df.index, df.c)
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9])
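A note for readers on newer pandas: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The replacement suggested in the pandas deprecation note pairs factorize with numpy indexing:
# idx maps each row to the position of its column label in cols
idx, cols = pd.factorize(df['c'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]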
Comprehension
But why, when you have lookup?
[df.at[t] for t in df.c.items()]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Bonus Hack
Not intended for actual use
[*map(df.at.__getitem__, zip(df.index, df.c))]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Because df.get_value is deprecated
[*map(df.get_value, df.index, df.c)]
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]

How to fill dataframe Nan values with empty list [] in pandas?

This is my dataframe:
         date                                                ids
0  2011-04-23  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1  2011-04-24  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2  2011-04-25  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
3  2011-04-26                                                NaN
4  2011-04-27  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
5  2011-04-28  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
I want to replace the NaN with []. How do I do that? fillna([]) did not work. I even tried replace(np.nan, []), but it gives an error:
TypeError('Invalid "to_replace" type: \'float\'',)
My approach is similar to @hellpanderrr's, but instead tests for list-ness rather than using isnan:
df['ids'] = df['ids'].apply(lambda d: d if isinstance(d, list) else [])
I originally tried using pd.isnull (or pd.notnull) but, when given a list, that returns the null-ness of each element.
After a lot of head-scratching I found this method that should be the most efficient (no looping, no apply), just assigning to a slice:
isnull = df.ids.isnull()
df.loc[isnull, 'ids'] = [ [[]] * isnull.sum() ]
The trick was to construct your list of [] of the right size (isnull.sum()), and then enclose it in a list: the value you are assigning is a 2D array (1 column, isnull.sum() rows) containing empty lists as elements.
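An index-aligned variant of the same idea, in case the nested-list shape feels too magical (a sketch, not benchmarked): build a Series of independent empty lists indexed by the missing rows and let loc align it:
isnull = df.ids.isnull()
df.loc[isnull, 'ids'] = pd.Series([[] for _ in range(isnull.sum())],
                                  index=df.index[isnull])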
A simple solution would be:
df['ids'].fillna("").apply(list)
As noted by @timgeb, this requires df['ids'] to contain lists or NaN only.
You can first use loc to locate all rows that have a nan in the ids column, and then loop through these rows using at to set their values to an empty list:
for row in df.loc[df.ids.isnull(), 'ids'].index:
    df.at[row, 'ids'] = []
>>> df
         date                                             ids
0  2011-04-23  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
1  2011-04-24  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
2  2011-04-25  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
3  2011-04-26                                              []
4  2011-04-27  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
5  2011-04-28  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Surprisingly, passing a dict with empty lists as values seems to work for Series.fillna, but not DataFrame.fillna - so if you want to work on a single column you can use this:
>>> df
     A    B    C
0  0.0  2.0  NaN
1  NaN  NaN  5.0
2  NaN  7.0  NaN
>>> df['C'].fillna({i: [] for i in df.index})
0    []
1     5
2    []
Name: C, dtype: object
The solution can be extended to DataFrames by applying it to every column.
>>> df.apply(lambda s: s.fillna({i: [] for i in df.index}))
    A   B   C
0   0   2  []
1  []  []   5
2  []   7  []
Note: for large Series/DataFrames with few missing values, this might create an unreasonable amount of throwaway empty lists.
Tested with pandas 1.0.5.
Another solution using numpy:
df.ids = np.where(df.ids.isnull(), pd.Series([[]]*len(df)), df.ids)
Or using combine_first:
df.ids = df.ids.combine_first(pd.Series([[]]*len(df)))
Without assignments:
1) Assuming the dataframe holds only floats and integers:
import math
df.apply(lambda col: col.apply(lambda x: [] if math.isnan(x) else x))
2) For any dataframe:
import math
def isnan(x):
    # math.isnan only accepts real numbers, so check the type first
    # (the original guarded with Python 2's long; int covers it in Python 3)
    return isinstance(x, (int, float)) and math.isnan(x)
df.apply(lambda col: col.apply(lambda x: [] if isnan(x) else x))
Maybe not the shortest or most optimized solution, but I think it is pretty readable:
# Masking-in nans
mask = df['ids'].isna()
# Filling nans with a list-like string and literally-evaluating such string
df.loc[mask, 'ids'] = df.loc[mask, 'ids'].fillna('[]').apply(eval)
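If those strings could ever come from untrusted input, ast.literal_eval is a safer drop-in for eval here, since it only parses Python literals:
from ast import literal_eval
df.loc[mask, 'ids'] = df.loc[mask, 'ids'].fillna('[]').apply(literal_eval)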
Maybe more dense:
df['ids'] = [[] if type(x) != list else x for x in df['ids']]
This is probably a faster, one-liner solution:
df['ids'].fillna('DELETE').apply(lambda x : [] if x=='DELETE' else x)
Another solution that is explicit:
# use apply to only replace the nulls with the list
df.loc[df.ids.isnull(), 'ids'] = df.loc[df.ids.isnull(), 'ids'].apply(lambda x: [])
Create a function that checks your condition and returns an empty list/empty set etc. when it fails.
Then apply that function to the column, assigning the result back to the same column (or to a new one if you wish).
aa = pd.DataFrame({'d': [1, 1, 2, 3, 3, np.nan], 'r': [3, 5, 5, 5, 5, 'e']})
def check_condition(x):
    if x > 0:
        return x
    else:
        return list()
aa['d'] = aa.d.apply(lambda x: check_condition(x))
You can try this (notna() gives a boolean frame, and x or [] turns each False into an empty list, so fillna fills exactly the NaN positions):
df.fillna(df.notna().applymap(lambda x: x or []))
