Dataframe get exact value using an array - python

Suppose I have the following dataframe:
A B C D Count
0 0 0 0 0 12.0
1 0 0 0 1 2.0
2 0 0 1 0 4.0
3 0 0 1 1 0.0
4 0 1 0 0 3.0
5 0 1 1 0 0.0
6 1 0 0 0 7.0
7 1 0 0 1 9.0
8 1 0 1 0 0.0
... (truncated for readability)
And an array: [1, 0, 0, 1]
I would like to access the Count value for the row whose A, B, C and D values match the array; in this case that is row 7, with Count = 9.0.
I could use iloc or at after unpacking each value in the array, but that seems inefficient. I'm wondering if there is a way to map the values in the array to the corresponding columns directly.

You can index the DataFrame with a list of the key column names and compare the resulting view to the array, using NumPy broadcasting to do it for each line at once. Then collapse the resulting Boolean DataFrame to a Boolean row index with all() and use that to index the Count column.
If df is the DataFrame and a is the array (or a list):
df.Count.loc[(df[list('ABCD')] == a).all(axis=1)]
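Put together, a minimal runnable sketch of this approach on a subset of the question's rows; the trailing .iloc[0], which extracts the scalar from the single matching row, is my addition:
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 0, 0, 0],
                   'C': [0, 0, 0, 0], 'D': [0, 1, 0, 1],
                   'Count': [12.0, 2.0, 7.0, 9.0]})
a = [1, 0, 0, 1]

# Broadcast-compare the key columns against the array, keep the rows
# where all four columns match, then take the Count of that row.
count = df.Count.loc[(df[list('ABCD')] == a).all(axis=1)].iloc[0]
print(count)  # 9.0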

You can try comparing row tuples:
out = df.loc[df[list('ABCD')].apply(tuple, axis=1) == (1, 0, 0, 1), 'Count']
Out[333]:
7 9.0
Name: Count, dtype: float64

I just used the .loc command and spelled out the condition for each column:
f = [1, 0, 0, 1]
result = df['Count'].loc[(df['A'] == f[0]) &
                         (df['B'] == f[1]) &
                         (df['C'] == f[2]) &
                         (df['D'] == f[3])].values
print(result)
OUTPUT:
[9.]
However, I like Arne's answer better :)
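As an aside, a hedged sketch that generalizes this to any number of key columns, so the conditions need not be written out by hand (np.logical_and.reduce folds the per-column comparisons into one Boolean mask):
import numpy as np

cols = list('ABCD')
f = [1, 0, 0, 1]

# One comparison per (column, value) pair, ANDed into a single mask.
mask = np.logical_and.reduce([df[c] == v for c, v in zip(cols, f)])
result = df.loc[mask, 'Count'].values
print(result)  # [9.]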

Related

Converting 2d Matrix to Single row DataFrame While Keeping Elements as Integers

I have a question about converting a 2D matrix to a single-row DataFrame.
For example, I have the following matrix (2D array) with integer elements:
array_2d = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
Is there a way to convert it to a DataFrame like the one below, keeping each element as an integer?
df =
0 1 2 3 4 5 6 7 8
0 0 1 1 1 0 1 1 1 0
I tried to flatten the 2D array first:
flattened_array = array_2d.flatten()
Then I used pandas.DataFrame:
df = pandas.DataFrame(flattened_array)
But the result was a single-column DataFrame whose elements were numpy.float64, like below:
df =
0
0 0.0
1 1.0
2 1.0
3 1.0
4 0.0
5 1.0
6 1.0
7 1.0
8 0.0
Please help. Thank you so much!
Tommy
Adding [] around the flattened array makes it a single row rather than a single column:
df = pd.DataFrame([flattened_array])
df
Out[297]:
0 1 2 3 4 5 6 7 8
0 0 1 1 1 0 1 1 1 0
Maybe you can also cast the result back to integers:
df = df.astype(int)
Another option:
pd.DataFrame(np.array(array_2d).reshape(1, -1))
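A minimal end-to-end sketch of the reshape route, which keeps the integer dtype throughout:
import numpy as np
import pandas as pd

array_2d = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])

# reshape(1, -1) turns the 3x3 array into one 1x9 row; no NaNs are
# introduced, so the int dtype is preserved.
df = pd.DataFrame(array_2d.reshape(1, -1))
print(df)
#    0  1  2  3  4  5  6  7  8
# 0  0  1  1  1  0  1  1  1  0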

How to use lambda function on a pandas data frame via map/apply where lambda takes different values for each column

The idea is to transform a data frame in the fastest way, according to values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to, and replaced with 1 if it is greater than mean(column) or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
0 1 2
0 1 2 3
1 4 5 6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code; it just illustrates the desired behavior. I used the apply method, but the solution can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [31]: df
Out[31]:
0 1 2
0 0 0 0
1 1 1 1
This is a toy example, but the solution has to be applied to a big data frame; therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
0 1 2
0 False False False
1 True True True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
0 1 2
0 0 0 0
1 1 1 1
If you need the output to be strings rather than 0 and 1, use np.where, which works as np.where(condition, value_if_true, value_if_false):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
0 1 2
0 n n n
1 m m m
Edit: addressing the question in the comments; what if m and n are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
0 1 2
0 9 10 11
1 9 10 11
2 0 1 2
3 0 1 2
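For comparison, a hedged sketch of what the apply-based version from the question could look like; apply hands each column to the function as a Series, so col.name lets each column look up its own mean. The vectorized (df > df.mean()).astype(int) above remains the faster choice on large frames:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
means = df.mean()

# apply (default axis=0) passes each column as a Series; compare it
# to its own column mean and cast the Boolean result to int.
out = df.apply(lambda col: (col > means[col.name]).astype(int))
print(out)
#    0  1  2
# 0  0  0  0
# 1  1  1  1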

Compute winning streak with pandas

I thought I knew how to do this, but I'm pulling my hair out over it. I'm trying to use a function to create a new column. The function looks at the value of the win column in the current row and needs to compare it to the previous value of the win column, as laid out in the if statements below. The win column will only ever be 0 or 1.
import pandas as pd
data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})
print (data)
win
0 0
1 0
2 1
3 1
4 1
5 0
6 1
def streak(row):
    win_current_row = row['win']
    win_row_above = row['win'].shift(-1)
    streak_row_above = row['streak'].shift(-1)
    if (win_row_above == 0) & (win_current_row == 0):
        return 0
    elif (win_row_above == 0) & (win_current_row == 1):
        return 1
    elif (win_row_above == 1) & (win_current_row == 1):
        return streak_row_above + 1
    else:
        return 0

data['streak'] = data.apply(streak, axis=1)
All this ends with this error:
AttributeError: ("'numpy.int64' object has no attribute 'shift'", 'occurred at index 0')
In other examples I see functions that are referring to df['column'].shift(1) so I'm confused why I can't seem to do it in this instance.
The output I'm trying to get to is:
result = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1], 'streak': ['NaN', 0 , 1, 2, 3, 0, 1]})
print(result)
win streak
0 0 NaN
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
Thanks for helping to get me unstuck.
A fairly common trick when using pandas is grouping by consecutive values. This trick is well-described here.
To solve your particular problem, we want to groupby consecutive values, and then use cumsum, which means that groups of losses (groups of 0) will have a cumulative sum of 0, while groups of wins (or groups of 1) will track winning streaks.
grouper = (df.win != df.win.shift()).cumsum()
df['streak'] = df['win'].groupby(grouper).cumsum()
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
For the sake of explanation, here is our grouper Series, which allows us to group by continuous regions of 1's and 0's:
print(grouper)
0 1
1 1
2 2
3 2
4 2
5 3
6 4
Name: win, dtype: int64
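Wrapped up as a small helper, a sketch of the whole trick applied to the question's data (the function name is mine):
import pandas as pd

def win_streak(s):
    # Every change in value starts a new run; cumsum labels the runs.
    grouper = (s != s.shift()).cumsum()
    # Within a run of 1s the cumulative sum counts the streak;
    # runs of 0s sum to 0.
    return s.groupby(grouper).cumsum()

data = pd.DataFrame({'win': [0, 0, 1, 1, 1, 0, 1]})
data['streak'] = win_streak(data['win'])
print(data)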
Let's try groupby and cumcount:
m = df.win.astype(bool)
df['streak'] = (
    m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m))
df
win streak
0 0 0
1 0 0
2 1 1
3 1 2
4 1 3
5 0 0
6 1 1
How it Works
Using df.win.astype(bool), convert df['win'] to its boolean equivalent (1=True, 0=False).
Next,
(~m).cumsum().where(m)
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 NaN
6 3.0
Name: win, dtype: float64
Represents all contiguous 1s with a unique number, with 0s being masked as NaN.
Now, use groupby, and cumcount to assign each row in the group with a monotonically increasing number.
m.groupby([m, (~m).cumsum().where(m)]).cumcount()
0 0
1 1
2 0
3 1
4 2
5 2
6 0
dtype: int64
This is almost what we want, but you can see that it is 1) zero-based, and 2) also assigns values to the 0 rows (no win). We can use m to fix both: adding 1 makes the count one-based, and multiplying by m zeroes out the non-win rows (x times True is x, and anything times False is 0).
m.groupby([m, (~m).cumsum().where(m)]).cumcount().add(1).mul(m)
0 0
1 0
2 1
3 2
4 3
5 0
6 1
dtype: int64
Assign this back in-place.
The reason you're getting that error is that shift() is a pandas method. Your code was retrieving the value in the current row (row['win']), which is a numpy.int64, so you were trying to call shift() on a numpy.int64. By contrast, df['column'].shift(1) takes a DataFrame column, which is a Series, and shifts that whole column by 1.
To test this for yourself try
print(type(data['win']))
and
print(type(row['win']))
and
print(type(row))
That will tell you the datatype.
Also, you're going to get an error when you reach
streak_row_above = row['streak'].shift(-1)
because you're referring to row['streak'] before it is created.

conditional change of a pandas row, with the previous row value

In the following pandas dataframe, I want to change each row with a "-1" value with the value of the previous row. So this is the original df:
position
0 0
1 -1
2 1
3 1
4 -1
5 0
And I want to transform it in:
position
0 0
1 0
2 1
3 1
4 1
5 0
I'm doing it in the following way, but I think there should be a faster approach, probably by vectorizing it (although I wasn't able to do that).
for i, row in self.df.iterrows():
    if row["position"] == -1:
        self.df.loc[i, "position"] = self.df.loc[i - 1, "position"]
So the code works, but it seems slow; is there any way to speed it up?
Use replace + ffill:
df.replace(-1, np.nan).ffill()
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
Replace will convert -1 to NaN values. ffill will replace NaNs with the value just above it.
Use .astype for an integer result:
df.replace(-1, np.nan).ffill().astype(int)
position
0 0
1 0
2 1
3 1
4 1
5 0
Don't forget to assign the result back. You can also perform the same operation on the position column only, if need be:
df['position'] = df['position'].replace(-1, np.nan).ffill().astype(int)
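One edge case worth noting: if the very first row is -1, there is nothing above it to copy, so ffill leaves a NaN there. A sketch, with 0 as my choice of fallback value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'position': [-1, 0, -1, -1, 1, -1]})

# A leading -1 has no previous value, so fillna supplies a fallback;
# consecutive -1s both pick up the last real value via ffill.
df['position'] = df['position'].replace(-1, np.nan).ffill().fillna(0).astype(int)
print(df['position'].tolist())  # [0, 0, 0, 0, 1, 1]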
Solution using np.where (note that, unlike ffill, this only looks one row back, so a run of consecutive -1 values would not be fully filled):
c = df['position']
df['position'] = np.where(c == -1, c.shift(), c)
df
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0

Select last observation per group

Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it other than writing a for loop.
I am going to modify his example to show what I am looking for.
Basically, there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
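Since the question mentions wanting both the first and the last observation, here is a sketch extending the same shift trick in both directions (the column names are mine; this again assumes contiguous groups):
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# A row is first in its group if it differs from the row above,
# and last if it differs from the row below.
df['first'] = (df.group_id != df.group_id.shift(1)).astype(int)
df['last'] = (df.group_id != df.group_id.shift(-1)).astype(int)
print(df)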
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method, which selects the last row of each group directly rather than adding an indicator column:
df = df.groupby('group_id').tail(1)
You can group by 'group_id' and call nth(-1) to get the last entry of each group, then use the resulting index to set 'indicator' to 1 for those rows and fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (a vector the same length as the dataframe) equals "group size - 1", which we compute with transform so that it also returns a vector the same length as the dataframe.
We need some other column for the transform because it won't let you transform the .groupby() column, but this can literally be any other column; it is unaffected, since it is only used to compute the new indicator. Use .astype(int) to make it binary, and done.
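A runnable sketch of this one-liner on the question's data; note that in recent pandas versions transforming the grouping column itself appears to work, so I use it here in place of 'any_other_column':
import pandas as pd

data = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})
g = data.groupby('group_id')

# cumcount numbers rows 0..size-1 within each group, so the last row
# of each group is the one where cumcount == size - 1.
data['indicator'] = (g.cumcount() == g['group_id'].transform('size') - 1).astype(int)
print(data)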
