Applying functions to data frame columns - python

I have the following function to calculate a value for two parameters, x and y:
import numpy as np
import math
def some_func(pt1, pt2):
    return math.sqrt((pt2[0]-pt1[0])*(pt2[0]-pt1[0]) + (pt2[1]-pt1[1])*(pt2[1]-pt1[1]))
usage:
a = 1, 2
b = 4, 5
some_func(a,b)
#outputs = 4.24264
#or some_func((1,2), (4,5)) would give the same output too
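(Side note: this is just the Euclidean distance between the two points, so an equivalent definition - a sketch, not something the question requires - could use math.hypot:)
def some_func(pt1, pt2):
    # same result as above, via the standard-library hypotenuse helper
    return math.hypot(pt2[0] - pt1[0], pt2[1] - pt1[1])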
I have the following df:
seq x y points
1 2 3 (2,3)
1 10 5 (10,5)
1 6 7 (6,7)
2 8 9 (8,9)
2 10 11 (10,11)
The "points" column was obtained using the following piece of code:
df["points"] = list(zip(df.loc[:, "x"], df.loc[:, "y"]))
I want to apply the some_func function across the whole df, and also while grouping by "seq".
I tried:
df["value"] = some_func(df["points"].values, df["points"].shift(1).values)
#without using groupby
and
df["value"] = df.groupby("seq").points.apply(some_func) #with groupby
but both of them raise a TypeError complaining about a missing argument or an unsupported data type.
Expected df
seq x y points value
1 2 3 (2,3) NaN
1 10 5 (10,5) 8.24
1 6 7 (6,7) 4.47
2 8 9 (8,9) NaN
2 10 11 (10,11) 2.82

You can use groupby with DataFrameGroupBy.shift first, but then you need to replace the NaNs with tuples - one possible solution is fillna. Finally, use apply:
s = pd.Series([(np.nan, np.nan)], index=df.index)
df['shifted'] = df.groupby('seq').points.shift().fillna(s)
df['values'] = df.apply(lambda x: some_func(x['points'], x['shifted']), axis=1)
print (df)
seq x y points shifted values
0 1 2 3 (2, 3) (nan, nan) NaN
1 1 10 5 (10, 5) (2, 3) 8.246211
2 1 6 7 (6, 7) (10, 5) 4.472136
3 2 8 9 (8, 9) (nan, nan) NaN
4 2 10 11 (10, 11) (8, 9) 2.828427
Another solution is to filter out the NaNs inside apply:
df['shifted'] = df.groupby('seq').points.shift()
f = lambda x: some_func(x['points'], x['shifted']) if pd.notnull(x['shifted']) else np.nan
df['values'] = df.apply(f, axis=1)
print (df)
seq x y points shifted values
0 1 2 3 (2, 3) NaN NaN
1 1 10 5 (10, 5) (2, 3) 8.246211
2 1 6 7 (6, 7) (10, 5) 4.472136
3 2 8 9 (8, 9) NaN NaN
4 2 10 11 (10, 11) (8, 9) 2.828427
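If the x and y columns are still around, a fully vectorized sketch (my addition, not part of the original answer) skips the tuples and the per-row apply entirely by using np.hypot on the shifted columns:
# previous row within each seq via groupby().shift(), then vectorized distance
dx = df['x'] - df.groupby('seq')['x'].shift()
dy = df['y'] - df.groupby('seq')['y'].shift()
df['value'] = np.hypot(dx, dy)  # NaN for the first row of each seq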

# some_func is not vectorized, so apply it pairwise, row by row
f = lambda x, y: some_func(x, y) if isinstance(y, tuple) else np.nan
df["value"] = [f(p, q) for p, q in zip(df["points"], df["points"].shift(1))]

Related

How to squeeze a DataFrame into a Series with each row bound as a tuple, and recover it?

I want to transform a two-dimensional DataFrame into a one-dimensional Series.
Here is an example:
In [11]: df = pd.DataFrame(np.reshape(range(9), (3,3)))
In [12]: df
Out[12]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
What I want is a Series like this:
In [13]: pd.Series([(0,1,2), (3,4,5), (6,7,8)])
Out[13]:
0 (0, 1, 2)
1 (3, 4, 5)
2 (6, 7, 8)
dtype: object
which merges all elements in one row into a tuple, so we reduce the dimensions from 2 -> 1.
Is there a good method to do this?
And how can I recover the DataFrame from the tuple Series?
Use a list comprehension with df.to_numpy to convert it to a Series:
In [556]: l = [tuple(r) for r in df.to_numpy()]
In [563]: new_series = pd.Series(l)
In [564]: new_series
Out[564]:
0 (0, 1, 2)
1 (3, 4, 5)
2 (6, 7, 8)
dtype: object
To convert it back to df, pass the list to the dataframe constructor:
In [561]: pd.DataFrame(l)
Out[561]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
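If you want the round trip to be lossless, a small sketch (assuming you keep the original index and columns around) is:
# preserve index/columns so the recovered frame matches the original exactly
s = pd.Series([tuple(r) for r in df.to_numpy()], index=df.index)
df_back = pd.DataFrame(s.tolist(), index=s.index, columns=df.columns)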

Pandas Dataframe. Expand tuple values as columns with multiindex

I have a dataframe df:
A B
first second
bar one 0.0 0.0
two 0.0 0.0
foo one 0.0 0.0
two 0.0 0.0
I transform it to another one where values are tuples:
A B
first second
bar one (6, 1, 0) (0, 9, 3)
two (9, 3, 4) (6, 2, 1)
foo one (1, 9, 0) (4, 0, 0)
two (6, 1, 5) (8, 3, 5)
My question is how can I get it (expanded) to be like below where tuples values become columns with multiindex? Can I do it during transform or should I do it as an additional step after transform?
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
Code for the above:
import numpy as np
import pandas as pd
np.random.seed(123)
def expand(s):
    # complex logic of `result` has been replaced with `np.random`
    result = [tuple(np.random.randint(10, size=3)) for i in s]
    return result
index = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.zeros((4, 2)), index=index, columns=['A', 'B'])
print(df)
expanded = df.groupby(['second']).transform(expand)
print(expanded)
Try this:
df_lst = []
for col in df.columns:
    expanded_splt = expanded.apply(lambda x: pd.Series(x[col]), axis=1)
    columns = pd.MultiIndex.from_product([[col], ['m', 'n', 'k']])
    expanded_splt.columns = columns
    df_lst.append(expanded_splt)
pd.concat(df_lst, axis=1)
Output:
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
I finally found time to work out an answer that suits me.
expanded_data = expanded.agg(lambda x: np.concatenate(x), axis=1).to_numpy()
expanded_data = np.stack(expanded_data)
column_index = pd.MultiIndex.from_product([expanded.columns, ['m', 'n', 'k']])
exploded = pd.DataFrame(expanded_data, index=expanded.index, columns=column_index)
print(exploded)
A B
m n k m n k
first second
bar one 6 1 0 0 9 3
two 9 3 4 6 2 1
foo one 1 9 0 4 0 0
two 6 1 5 8 3 5
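Another possible route (a sketch of mine, assuming the sub-column names 'm', 'n', 'k' used above): expand each tuple column into its own frame and let pd.concat build the column MultiIndex from the dict keys:
parts = {col: pd.DataFrame(expanded[col].tolist(),
                           index=expanded.index, columns=['m', 'n', 'k'])
         for col in expanded.columns}
exploded = pd.concat(parts, axis=1)  # top column level becomes 'A', 'B'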

Can I split this column containing a mix of tuples/None more efficiently?

I have a simple DataFrame:
import pandas as pd
df = pd.DataFrame({'id':list('abcd')})
df['tuples'] = df.index.map(lambda i:(i,i+1))
# outputs:
# id tuples
# 0 a (0, 1)
# 1 b (1, 2)
# 2 c (2, 3)
# 3 d (3, 4)
I can then split the tuples column into two very simply, e.g.
df[['x','y']] = pd.DataFrame(df.tuples.tolist())
# outputs:
# id tuples x y
# 0 a (0, 1) 0 1
# 1 b (1, 2) 1 2
# 2 c (2, 3) 2 3
# 3 d (3, 4) 3 4
This approach also works:
df[['x','y']] = df.apply(lambda x:x.tuples,result_type='expand',axis=1)
However if my DataFrame is slightly more complex, e.g.
df = pd.DataFrame({'id':list('abcd')})
df['tuples'] = df.index.map(lambda i:(i,i+1) if i%2 else None)
# outputs:
# id tuples
# 0 a None
# 1 b (1, 2)
# 2 c None
# 3 d (3, 4)
then the first approach throws "Columns must be same length as key" (of course) because some rows have two values and some have none, and my code anticipates two.
I can use .loc to create single columns, twice.
get_rows = df.tuples.notnull() # return rows with tuples
df.loc[get_rows,'x'] = df.tuples.str[0]
df.loc[get_rows,'y'] = df.tuples.str[1]
# outputs:
# id tuples x y
# 0 a None NaN NaN
# 1 b (1, 2) 1.0 2.0
# 2 c None NaN NaN
# 3 d (3, 4) 3.0 4.0
[Aside: it's useful how the index alignment assigns only the relevant rows from the right-hand side, without having to specify them.]
However, I can't use .loc to create two columns at once, e.g.
# This isn't valid use of .loc
df.loc[get_rows,['x','y']] = df.loc[get_rows,'tuples'].map(lambda x:list(x))
as it throws the error "shape mismatch: value array of shape (2,2) could not be broadcast to indexing result of shape (2,)".
I also can't use this
df[get_rows][['x','y']] = df[get_rows].apply(lambda x:x.tuples,result_type='expand',axis=1)
as it throws the usual "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc..."
I can't help thinking I'm missing something.
Here is another way (comments inline):
c=df.tuples.astype(bool) #similar to df.tuples.notnull()
#create a dataframe by dropping the None and assign index as df.index where c is True
d=pd.DataFrame(df.tuples.dropna().values.tolist(),columns=list('xy'),index=df[c].index)
final=pd.concat([df,d],axis=1) #concat them both
id tuples x y
0 a None NaN NaN
1 b (1, 2) 1.0 2.0
2 c None NaN NaN
3 d (3, 4) 3.0 4.0
df[get_rows] is a copy; setting values via df[get_rows][['x','y']] does not change the underlying data. Just use df[['x','y']] to create the new columns.
df = pd.DataFrame({'id':list('abcd')})
df['tuples'] = df.index.map(lambda i:(i,i+1) if i%2 else None)
get_rows = df.tuples.notnull()
df[['x','y']] = df[get_rows].apply(lambda x:x.tuples,result_type='expand',axis=1)
print(df)
id tuples x y
0 a None NaN NaN
1 b (1, 2) 1.0 2.0
2 c None NaN NaN
3 d (3, 4) 3.0 4.0
Another quick fix:
pd.concat([df, pd.DataFrame(df.tuples.to_dict()).T], axis=1)
returns:
id tuples 0 1
0 a None None None
1 b (1, 2) 1 2
2 c None None None
3 d (3, 4) 3 4
One-liner with itertools.zip_longest:
In [862]: from itertools import zip_longest
In [863]: new_cols = ['x', 'y']
In [864]: df.join(df.tuples.apply(lambda x: pd.Series(dict(zip_longest(new_cols, [x] if pd.isnull(x) else list(x))))))
Out[864]:
id tuples x y
0 a None NaN NaN
1 b (1, 2) 1.0 2.0
2 c None NaN NaN
3 d (3, 4) 3.0 4.0
Or even simpler:
In [876]: f = lambda x: [x] * len(new_cols) if pd.isnull(x) else list(x)
In [877]: df.join(pd.DataFrame(df.tuples.apply(f).tolist(), columns=new_cols))
Out[877]:
id tuples x y
0 a None NaN NaN
1 b (1, 2) 1.0 2.0
2 c None NaN NaN
3 d (3, 4) 3.0 4.0
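One more sketch along the same lines (my addition): normalise None to a NaN tuple first, after which the simple constructor from the question works again:
fill = (np.nan, np.nan)
df[['x', 'y']] = pd.DataFrame([t if isinstance(t, tuple) else fill for t in df.tuples],
                              index=df.index)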

Count appearances of a value until it changes to another value

I have the following DataFrame:
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
I want to calculate the frequency of each value, but not an overall count - the count of each value until it changes to another value.
I tried:
df['values'].value_counts()
but it gives me
10 6
9 3
23 2
12 1
The desired output is
10:2
23:2
9:3
10:4
12:1
How can I do this?
Use:
df = df.groupby(df['values'].ne(df['values'].shift()).cumsum())['values'].value_counts()
Or:
df = df.groupby([df['values'].ne(df['values'].shift()).cumsum(), 'values']).size()
print (df)
values values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
Name: values, dtype: int64
Last, to remove the first level:
df = df.reset_index(level=0, drop=True)
print (df)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
Explanation:
Compare the original column with its shifted version using ne (not equal), then take the cumsum to build the helper Series. Below, a is the shifted column, b is the not-equal mask and c is its cumsum:
print (pd.concat([df['values'], a, b, c],
                 keys=('orig', 'shifted', 'not_equal', 'cumsum'), axis=1))
orig shifted not_equal cumsum
0 10 NaN True 1
1 10 10.0 False 1
2 23 10.0 True 2
3 23 23.0 False 2
4 9 23.0 True 3
5 9 9.0 False 3
6 9 9.0 False 3
7 10 9.0 True 4
8 10 10.0 False 4
9 10 10.0 False 4
10 10 10.0 False 4
11 12 10.0 True 5
You can keep track of where the changes in df['values'] occur, then group by those change markers together with df['values'] (to keep the values as index) and compute the size of each group:
changes = df['values'].diff().ne(0).cumsum()
df.groupby([changes,'values']).size().reset_index(level=0, drop=True)
values
10 2
23 2
9 3
10 4
12 1
dtype: int64
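One caveat worth noting (my addition): diff() needs numeric data, so if 'values' held strings the ne/shift comparison from the accepted answer is the safer way to build the run id, e.g.:
s = pd.Series(list('aabbaacc'))  # hypothetical non-numeric example
s.groupby([s.ne(s.shift()).cumsum(), s]).size()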
itertools.groupby
from itertools import groupby
pd.Series(*zip(*[[len([*v]), k] for k, v in groupby(df['values'])]))
10 2
23 2
9 3
10 4
12 1
dtype: int64
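The one-liner is terse, so here is the same idea spelled out (my sketch): the run lengths become the data and the run values become the index:
runs = [(k, sum(1 for _ in g)) for k, g in groupby(df['values'])]
vals, counts = zip(*runs)
pd.Series(counts, index=vals)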
It's a generator
def f(x):
    count = 1
    for this, that in zip(x, x[1:]):
        if this == that:
            count += 1
        else:
            yield count, this
            count = 1
    yield count, [*x][-1]
pd.Series(*zip(*f(df['values'])))
10 2
23 2
9 3
10 4
12 1
dtype: int64
Using crosstab
df['key']=df['values'].diff().ne(0).cumsum()
pd.crosstab(df['key'],df['values'])
Out[353]:
values 9 10 12 23
key
1 0 2 0 0
2 0 0 0 2
3 3 0 0 0
4 0 4 0 0
5 0 0 1 0
Slightly modifying the result above:
pd.crosstab(df['key'],df['values']).stack().loc[lambda x:x.ne(0)]
Out[355]:
key values
1 10 2
2 23 2
3 9 3
4 10 4
5 12 1
dtype: int64
Based on Python's groupby:
from itertools import groupby
[ (k,len(list(g))) for k,g in groupby(df['values'].tolist())]
Out[366]: [(10, 2), (23, 2), (9, 3), (10, 4), (12, 1)]
This is far from the most time/memory-efficient method in this thread, but here's an iterative approach that is pretty straightforward. Please feel encouraged to suggest improvements on this method.
import pandas as pd
df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
# dict_count keeps, for each value, the longest run seen so far
dict_count = {}
for v in df['values'].unique():
    dict_count[v] = 0
curr_val = df.iloc[0]['values']
count = 1
for i in range(1, len(df)):
    if df.iloc[i]['values'] == curr_val:
        count += 1
    else:
        if count > dict_count[curr_val]:
            dict_count[curr_val] = count
        curr_val = df.iloc[i]['values']
        count = 1
if count > dict_count[curr_val]:
    dict_count[curr_val] = count
df_count = pd.DataFrame(dict_count, index=[0])
print(df_count)
The function groupby in itertools can help you. For a str:
>>> string = 'aabbaacc'
>>> for char, freq in groupby(string):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
a:2
b:2
a:2
c:2
This function also works for list:
>>> df = pd.DataFrame([10, 10, 23, 23, 9, 9, 9, 10, 10, 10, 10, 12], columns=['values'])
>>> for char, freq in groupby(df['values'].tolist()):
...     print(char, len(list(freq)), sep=':', end='\n')
[out]:
10:2
23:2
9:3
10:4
12:1
Note: for a df, always use df['values'] to access the 'values' column, because DataFrame already has a values attribute (so df.values would not do what you want).
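A quick illustration of the difference (sketch):
df.values       # the underlying 2-D NumPy array of the whole frame
df['values']    # just the column named 'values'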

Pandas Dataframe: Expand rows with lists to multiple rows with desired indexing for all columns

I have time series data in a pandas DataFrame, with the index being the time at the start of each measurement and the columns holding lists of values recorded at a fixed sampling rate (the difference between consecutive index values divided by the number of elements in the list).
Here is the what it looks like:
Time A B ....... Z
0 [1, 2, 3, 4] [1, 2, 3, 4]
2 [5, 6, 7, 8] [5, 6, 7, 8]
4 [9, 10, 11, 12] [9, 10, 11, 12]
6 [13, 14, 15, 16] [13, 14, 15, 16 ]
...
I want to expand each row in all the columns to multiple rows such that:
Time A B .... Z
0 1 1
0.5 2 2
1 3 3
1.5 4 4
2 5 5
2.5 6 6
.......
So far I am thinking along these lines (the code doesn't work):
def expand_row(dstruc):
    for i in range(len(dstruc)):
        for j in range(1, len(dstruc[i])):
            dstruc.loc[i+j/len(dstruc[i])] = dstruc[i][j]
        dstruc.loc[i] = dstruc[i][0]
    return dstruc
expanded = testdf.apply(expand_row)
I also tried using split(',') and stack() together but I am not able to fix my indexing appropriately.
import numpy as np
import pandas as pd
df = pd.DataFrame({key: list(zip(*[iter(range(1, 17))]*4)) for key in list('ABC')},
                  index=range(0, 8, 2))
result = pd.DataFrame.from_items([(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)], orient='index', columns=df.columns)
result.index.name = 'Time'
grouped = result.groupby(level=0)
increment = (grouped.cumcount()/grouped.size())
result.index = result.index + increment
print(result)
yields
In [183]: result
Out[183]:
A B C
Time
0.00 1 1 1
0.25 2 2 2
0.50 3 3 3
0.75 4 4 4
2.00 5 5 5
2.25 6 6 6
2.50 7 7 7
2.75 8 8 8
4.00 9 9 9
4.25 10 10 10
4.50 11 11 11
4.75 12 12 12
6.00 13 13 13
6.25 14 14 14
6.50 15 15 15
6.75 16 16 16
Explanation:
One way to loop over the contents of the lists is to use a list comprehension:
In [172]: df = pd.DataFrame({key: list(zip(*[iter(range(1, 17))]*4)) for key in list('ABC')}, index=range(0,8,2))
In [173]: [(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)]
Out[173]:
[(0, (1, 1, 1)),
(0, (2, 2, 2)),
...
(6, (15, 15, 15)),
(6, (16, 16, 16))]
Once you have the values in the above form, you can build the desired DataFrame with pd.DataFrame.from_items:
result = pd.DataFrame.from_items([(index, zipped) for index, row in df.iterrows() for zipped in zip(*row)], orient='index', columns=df.columns)
result.index.name = 'Time'
yields
In [175]: result
Out[175]:
A B C
Time
0 1 1 1
0 2 2 2
...
6 15 15 15
6 16 16 16
To compute the increments to be added to the index, you can group by the index and find the ratio of the cumcount to the size of each group:
In [176]: grouped = result.groupby(level=0)
In [177]: increment = (grouped.cumcount()/grouped.size())
In [179]: result.index = result.index + increment
In [199]: result.index
Out[199]:
Float64Index([ 0.0, 0.25, 0.5, 0.75, 2.0, 2.25, 2.5, 2.75, 4.0, 4.25, 4.5,
              4.75, 6.0, 6.25, 6.5, 6.75],
             dtype='float64', name='Time')
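On recent pandas versions (Series.explode exists since 0.25), a shorter sketch of the same idea - not part of the original answer, and assuming the fixed 2-unit gap between rows from the question - would be:
result = df.apply(pd.Series.explode)  # explode every list/tuple column at once
grouped = result.groupby(level=0)
result.index = result.index + grouped.cumcount() * 2 / grouped.size()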
Probably not ideal, but this can be done using groupby and applying a function which returns the expanded DataFrame for each row (here the time difference is assumed to be fixed at 2.0):
def expand(x):
    data = {c: x[c].iloc[0] for c in x if c != 'Time'}
    n = len(data['A'])
    step = 2.0 / n
    data['Time'] = [x['Time'].iloc[0] + i*step for i in range(n)]
    return pd.DataFrame(data)
print(df.groupby('Time').apply(expand).set_index('Time', drop=True))
Output:
A B
Time
0.0 1 1
0.5 2 2
1.0 3 3
1.5 4 4
2.0 5 5
2.5 6 6
3.0 7 7
3.5 8 8
4.0 9 9
4.5 10 10
5.0 11 11
5.5 12 12
6.0 13 13
6.5 14 14
7.0 15 15
7.5 16 16
Say the DataFrame you want to expand is named df_to_expand; you could do the following using eval:
df_expanded_list = []
for coln in df_to_expand.columns:
    _df = df_to_expand[coln].apply(
        lambda x: pd.Series(eval(x), index=[coln + '_' + str(i) for i in range(len(eval(x)))]))
    df_expanded_list.append(_df)
df_expanded = pd.concat(df_expanded_list, axis=1)
References:
Convert a string which is a list into a proper list - python
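If the lists are real Python lists rather than strings (so eval is unnecessary), a shorter sketch in the same spirit, with (column, position) labels instead of the 'A_0' style names:
df_expanded = pd.concat(
    {coln: pd.DataFrame(df_to_expand[coln].tolist(), index=df_to_expand.index)
     for coln in df_to_expand.columns},
    axis=1)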
