Processing dataframe omits previous row values - python

This code:
import numpy as np
import pandas as pd

df = pd.DataFrame([['stop', '1'], ['a1', '2'], ['a1', '3'],
                   ['stop', '4'], ['a2', '5'], ['wildcard', '6']],
                  columns=['a', 'b'])
print(df)
prints:
a b
0 stop 1
1 a1 2
2 a1 3
3 stop 4
4 a2 5
5 wildcard 6
I'm attempting to create a new dataframe where, whenever 'stop' is encountered, a new row is created containing a list of tuples: in each tuple the value of column 'a' is the first element and the value of 'b' is the second element. So for the df above, the transformed dataframe df_post is:
df_post = pd.DataFrame([['stop', [('a1', '2'), ('a1', '3')]], ['stop', [('a2', '5')]]], columns=['a', 'b'])
print(df_post)
a b
0 stop [(a1, 2), (a1, 3)]
1 stop [(a2, 5)]
wildcard is also a stopping condition: if it is encountered, a new row is inserted into df_post as before.
Here is what I have so far:
df['stop_loc'] = ( (df['a'] == 'stop') | (df['a'] == 'wildcard') ).cumsum()
df_new = df[(df['a'] != 'stop') & (df['stop_loc'] != df['stop_loc'].max())].groupby('stop_loc').apply(lambda x: list(zip(x.a, x.b)))
df_new
which renders:
stop_loc
1 [(a1, 2), (a1, 3)]
2 [(a2, 5)]
dtype: object
The 'stop' value is not inserted as a row. How can I modify this so that the dataframe produced is
a b
0 stop [(a1, 2), (a1, 3)]
1 stop [(a2, 5)]
instead of:
stop_loc
1 [(a1, 2), (a1, 3)]
2 [(a2, 5)]
dtype: object

You are filtering out the stop rows with df['a'] != 'stop'. Here is an alternative:
# df['stop_loc'] = ((df['a'] == 'stop') | (df['a'] == 'wildcard')).cumsum()
df['stop_loc'] = df['a'].isin(['stop', 'wildcard']).cumsum()

def zip_entries(x):
    # the first row of each group is the stop marker; zip up the rows that follow it
    return list(x.a)[0], list(zip(x.a[1:], x.b[1:]))

df_new = (df[df['stop_loc'] != df['stop_loc'].max()]
          .groupby('stop_loc')
          .apply(zip_entries)
          .apply(pd.Series))
print(df_new)
# 0 1
# stop_loc
# 1 stop [(a1, 2), (a1, 3)]
# 2 stop [(a2, 5)]
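To get the exact layout asked for (columns a and b with a fresh integer index), a small follow-up sketch, assuming the df_new built above:
# Rename the expanded columns and drop the stop_loc index.
df_post = df_new.rename(columns={0: 'a', 1: 'b'}).reset_index(drop=True)
print(df_post)
#       a                   b
# 0  stop  [(a1, 2), (a1, 3)]
# 1  stop           [(a2, 5)]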

Related

Choose the best of three columns

I have a dataset with three columns A, B and C. I want to create a column where I select the two columns closest to each other and take the average. Take the table below as an example:
A  B  C  Best of Three
3  2  5  2.5
4  3  1  3.5
1  5  2  1.5
For the first row, A and B are the closest pair, so the best of three column is (3+2)/2 = 2.5; for the third row, A and C are the closest pair, so the best of three column is (1+2)/2 = 1.5. Below is my code. It is quite unwieldy and quickly becomes too long if there are more columns. I look forward to suggestions!
data = {'A': [3, 4, 1],
        'B': [2, 3, 5],
        'C': [5, 1, 2]}
df = pd.DataFrame(data)
df['D'] = abs(df['A'] - df['B'])
df['E'] = abs(df['A'] - df['C'])
df['F'] = abs(df['C'] - df['B'])
df['G'] = df[['D', 'E', 'F']].min(axis=1)
df.loc[df['G'] == df['D'], 'Best of Three'] = (df['A'] + df['B']) / 2
df.loc[df['G'] == df['E'], 'Best of Three'] = (df['A'] + df['C']) / 2
df.loc[df['G'] == df['F'], 'Best of Three'] = (df['B'] + df['C']) / 2
First you need a function that finds the minimum diff between two elements in a list; the function also returns the average of those two values, as a tuple (diff, average):
def min_list(values):
    return min((abs(x - y), (x + y) / 2)
               for i, x in enumerate(values)
               for y in values[i + 1:])
Then apply it to each row:
df = pd.DataFrame([[3, 2, 5, 6], [4, 3, 1, 10], [1, 5, 10, 20]],
                  columns=['A', 'B', 'C', 'D'])
df['best'] = df.apply(lambda x: min_list(x)[1], axis=1)
print(df)
Functions are your friends. You want to write a function that finds the two closest integers of an list, then pass it the list of the values of the row. Store those results and pass them to a second function that returns the average of two values.
(Also, your code would be much more readable if you replaced D, E, F, and G with descriptively named variables.)
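A minimal sketch of that decomposition, assuming df holds only the A, B, C columns from the question (closest_pair and average are illustrative names, not from the original post):
def closest_pair(values):
    # after sorting, the overall closest pair is always adjacent
    ordered = sorted(values)
    return min(zip(ordered, ordered[1:]), key=lambda p: p[1] - p[0])

def average(a, b):
    return (a + b) / 2

df['Best of Three'] = df[['A', 'B', 'C']].apply(
    lambda row: average(*closest_pair(row)), axis=1)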
Solve it by using the itertools.combinations generator:
import itertools

def get_closest_avg(s):
    # all unordered pairs of the row's values; pick the pair with the smallest diff
    c = list(itertools.combinations(s, 2))
    return sum(c[pd.Series(c).apply(lambda x: abs(x[0] - x[1])).idxmin()]) / 2

df['B3'] = df.apply(get_closest_avg, axis=1)
df:
A B C B3
0 3 2 5 2.5
1 4 3 1 3.5
2 1 5 2 1.5

Extracting rows with most frequent value

I have a dataframe with several columns, from which I want to extract, for each "family" of individuals, the row with the most frequent number ("No") in that family. I have tested this with a for-loop that seems to work, but being a newbie I wanted to know if there is a shorter/smarter way of doing it.
Here is a short example code:
import pandas as pd

ind = [('A', 'a', 0.1, 9),
       ('B', 'b', 0.6, 10),
       ('C', 'b', 0.4, 10),
       ('D', 'b', 0.2, 7),
       ('E', 'a', 0.9, 6),
       ('F', 'b', 0.7, 11)]
df = pd.DataFrame(ind, columns=['Name', 'Family', 'Prob', 'No'])
res = pd.DataFrame(columns=df.columns)
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()
    idx = g['No'] == v
    si = g[idx].iloc[0]
    res = res.append(si)
print(res)
I have looked at several examples that do some of this, but with those I can only get the "Family" and "No" and not the whole row...
Here is an alternative using duplicated and groupby with mode:
# c: rows whose 'No' equals the most frequent 'No' within their family
c = df['No'].eq(df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]))
# c1: later duplicates of the same (Family, No) pair
c1 = df[['Family', 'No']].duplicated()
output = df[c & ~c1]
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
Use GroupBy.transform with the first mode, then filter, and finally remove duplicates with DataFrame.drop_duplicates:
df1 = (df[df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]).eq(df['No'])]
       .drop_duplicates(['Family', 'No']))
print(df1)
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
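A side note on the question's loop: DataFrame.append was removed in pandas 2.0, so on recent versions the same idea can collect the rows in a list and build the result once (a minimal sketch):
# Collect one row per family, then construct the frame in a single step;
# pd.DataFrame(list_of_series) stacks the Series as rows.
rows = []
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()
    rows.append(g[g['No'] == v].iloc[0])
res = pd.DataFrame(rows)
print(res)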

Creating a union of columns based on metrics

I have a dataframe:
df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to iterate through each row, multiply each df element by the matching list element, select the top 2 scores in that row, and append the corresponding column names to a new list.
for example, in the first row we have:
1*0.91=0.91, 0*5=0, 1*2=2
therefore the top 2 columns are a and c, so we append them to a new list.
second row:
2*0.91=1.82, 3*5=15, 1*2=2
therefore list=[a,c,b]
and so on...
third row:
4*0.91=3.64, 5*5=25, 1*2=2
so the list remains unchanged [a,c,b]
so the final output is [a,c,b]
If I understand you correctly, I think the previous answers are incomplete, so here is a solution. It involves using numpy, which I hope you accept.
Create the weights:
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = {a: b for a, b in n}
weights = [d[i] for i in df.columns]
Then we create a table with weights multiplied in:
df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
df = df * weights
This yields:
a b c
0 0.9 0.0 2.0
1 1.8 15.0 2.0
2 3.6 25.0 2.0
Then we can get the top-two indices per row in numpy:
b = np.argsort(df.values, axis=1)
b = b[:, -2:]
This yields:
array([[0, 2],
       [2, 1],
       [0, 1]], dtype=int64)
Finally we can calculate the order of appearance and give back column names:
c = b.reshape(-1)
_, idx = np.unique(c, return_index=True)
d = c[np.sort(idx)]
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']
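If you also want the top-two column names per row (in ascending score order, as argsort leaves them), a small follow-on sketch using b from above:
top2 = [list(df.columns[row]) for row in b]
print(top2)
# [['a', 'c'], ['c', 'b'], ['a', 'b']]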
Try this:
dict1 = {'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]}  # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k: [j * v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this:
a b c
0 0.91 0 2
1 1.82 15 2
2 3.64 25 2
"""
IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m = df.mul(pd.DataFrame(a).set_index(0)[1])
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Applying rank along each row, summing the ranks per column, then sorting and taking the index gives your desired output:
m.rank(axis=1, method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
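To see why this works, here are the intermediate values for the same data (ranks within each row, then the per-column sums):
print(m.rank(axis=1, method='dense'))
#      a    b    c
# 0  2.0  1.0  3.0
# 1  1.0  3.0  2.0
# 2  2.0  3.0  1.0
print(m.rank(axis=1, method='dense').sum())
# a    5.0
# b    7.0
# c    6.0
# dtype: float64
# sorting these sums ascending yields the column order ['a', 'c', 'b']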

Matching two different arrays and making a new array in python

I have two two-dimensional arrays, and I have to create a new array by filtering the 2nd array to the rows whose 1st-column values also appear in the 1st column of the 1st array. The arrays are of different sizes.
Basically the idea is as follows:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it the following way:
match = []
for i in range(len(x)):
    if x[i] == x1[i]:
        new_array = x1[i]
        match.append(new_array)
print(match)
This does not seem to work. Please suggest a way to create the new 2D array
Try np.isin.
arr1 = np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]])
arr2 = np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]])
arr2[:, np.isin(arr2[0], arr1[0])]
array([[1, 5, 7],
       [1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.
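If you prefer the rows-of-pairs layout shown in the question, transpose the result:
arr2[:, np.isin(arr2[0], arr1[0])].T
# array([[1, 1],
#        [5, 1],
#        [7, 1]])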
If you make a set of the first elements in A, then it is fairly easy to find the elements in B to keep:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
    for value in new_b:
        f.write(' '.join(str(v) for v in value) + '\n')
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), delim_whitespace=True, index_col="x")
df2 = pd.read_csv(StringIO(fileB), delim_whitespace=True, index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x')
Then if you wanted to drop the unneeded rows and column and rename the remaining columns:
C = A.join(B.set_index('x1'), on='x')
C = C.dropna().drop(['y'], axis=1)
C.columns = ['newx', 'newy']
which gives you:
>>> C
newx newy
0 1 1.0
3 5 1.0
5 7 1.0
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html
Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
    for pair2 in arr2:
        if pair1[0] == pair2[0]:
            result.append(pair2)
print(result)
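For the truncated sample arrays above this prints:
[[1, 1]]
since only the x value 1 appears in both first columns. Note the nested loop is O(n*m); the set-based approach above scales better for large inputs.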
Not the best solution for smaller arrays, but for really large arrays it works fast:
import numpy as np
import pandas as pd
n1 = np.transpose(np.array([[1, 3, 2, 5, 6, 7], [2, 4, 2, 4, 4, 4]]))
n2 = np.transpose(np.array([[0, 1, 11, 5, 7], [1, 1, 1, 1, 1]]))
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))

pandas: slice a MultiIndex by range of secondary index

I have a series with a MultiIndex like this:
import numpy as np
import pandas as pd
buckets = np.repeat(['a', 'b', 'c'], [3, 5, 1])
sequence = [0, 1, 5, 0, 1, 2, 4, 50, 0]
s = pd.Series(
    np.random.randn(len(sequence)),
    index=pd.MultiIndex.from_tuples(list(zip(buckets, sequence)))
)
# In [6]: s
# Out[6]:
# a 0 -1.106047
# 1 1.665214
# 5 0.279190
# b 0 0.326364
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
# c 0 -0.091730
I'd like to get the s['b'] values where the second index ('sequence') is between 2 and 10.
Slicing on the first index works fine:
s['a':'b']
# Out[109]:
# a  0     1.828176
#    1     0.160496
#    5     0.401985
# b  0    -1.514268
#    1    -0.973915
#    2     1.285553
#    4    -0.194625
#    50   -0.144112
But not on the second, at least by what seems to be the two most obvious ways:
1) This returns the elements at positions 1 through 4, which has nothing to do with the index values:
s['b'][1:10]
# In [61]: s['b'][1:10]
# Out[61]:
# 1 0.900439
# 2 -0.653940
# 4 0.082270
# 50 -0.255482
However, if I reverse the levels, so that the first index is the integer and the second is the string, it works:
In [26]: s
Out[26]:
0 a -0.126299
1 a 1.810928
5 a 0.571873
0 b -0.116108
1 b -0.712184
2 b -1.771264
4 b 0.148961
50 b 0.089683
0 c -0.582578
In [25]: s[0]['a':'b']
Out[25]:
a -0.126299
b -0.116108
As Robbie-Clarken answers, since 0.14 you can pass a slice in the tuple you pass to loc:
In [11]: s.loc[('b', slice(2, 10))]
Out[11]:
b 2 -0.65394
4 0.08227
dtype: float64
Indeed, you can pass a slice for each level:
In [12]: s.loc[(slice('a', 'b'), slice(2, 10))]
Out[12]:
a 5 0.27919
b 2 -0.65394
4 0.08227
dtype: float64
Note: the slice is inclusive.
Old answer:
You can also do this using:
s.ix[1:10, "b"]
(It's good practice to do this in a single ix/loc/iloc call, since this form allows assignment.)
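For instance (a minimal sketch, assuming the Series s from the question), assignment through a single loc call modifies s itself, while chained indexing may write to a temporary copy:
# Single .loc call: the assignment propagates to s.
s.loc[('b', slice(2, 10))] = 0.0
# A chained form like s['b'][2:10] = 0.0 may modify a copy instead
# (pandas warns with SettingWithCopyWarning in that case).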
This answer was written prior to the introduction of iloc in early 2013, i.e. position/integer location - which may be preferred in this case. The reason it was created was to remove the ambiguity from integer-indexed pandas objects, and be more descriptive: "I'm slicing on position".
s["b"].iloc[1:10]
That said, I kinda disagree with the docs that ix is:
most robust and consistent way
it's not; the most consistent way is to describe what you're doing:
use loc for labels
use iloc for position
use ix for both (if you really have to)
Remember the zen of python:
explicit is better than implicit
Since pandas 0.15.0 this works:
s.loc['b', 2:10]
Output:
b 2 -0.503023
4 0.704880
dtype: float64
With a DataFrame it's slightly different (source):
df.loc(axis=0)['b', 2:10]
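pd.IndexSlice gives an equivalent, arguably more readable spelling for both cases (a small sketch; s and df as above):
idx = pd.IndexSlice
s.loc[idx['b', 2:10]]       # same as s.loc[('b', slice(2, 10))]
df.loc[idx['b', 2:10], :]   # DataFrame variant: slice rows, keep all columns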
As of pandas 0.14.0 it is possible to slice multi-indexed objects by providing .loc a tuple containing slice objects:
In [2]: s.loc[('b', slice(2, 10))]
Out[2]:
b 2 -1.206052
4 -0.735682
dtype: float64
The best way I can think of is to use select in this case, although even the docs say that "This method should be used only when there is no more direct way." (select has since been deprecated and removed in modern pandas, so prefer the loc-based approaches above.)
Indexing and selecting data
In [116]: s
Out[116]:
a 0 1.724372
1 0.305923
5 1.780811
b 0 -0.556650
1 0.207783
4 -0.177901
50 0.289365
0 1.168115
In [117]: s.select(lambda x: x[0] == 'b' and 2 <= x[1] <= 10)
Out[117]: b 4 -0.177901
Not sure if this is ideal, but it works by selecting the matching index tuples:
In [59]: s.index
Out[59]:
MultiIndex
[('a', 0) ('a', 1) ('a', 5) ('b', 0) ('b', 1) ('b', 2) ('b', 4)
('b', 50) ('c', 0)]
In [77]: s[(tpl for tpl in s.index if 2<=tpl[1]<=10 and tpl[0]=='b')]
Out[77]:
b 2 -0.586568
4 1.559988
EDIT: hayden's solution is the way to go.
