Let's consider the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, 4, 3, 2, 5, 6, 4, 2, 1, 6])
I want to do the following: if the i-th element of the dataframe is bigger than the mean of the next two elements, we assign 1, and if not, we assign -1 to this i-th element.
My solution
An obvious solution is the following:
df_copy = df.copy()
for i in range(len(df) - 2):
    # compare the i-th value with the mean of the next two values
    if df.iloc[i, 0] > df.iloc[i + 1 : i + 3, 0].mean():
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
However, I find it a little cumbersome, and I'm wondering if there is a loop-free solution to this kind of problem.
Desired output
      0
0    -1
1    -1
2    -1
3     1
4    -1
5    -1
6    -1
7     1
8     1
9    -1
10    1
11    6
You can use rolling.mean and shift:
df['out'] = np.where(df[0].gt(df[0].rolling(2).mean().shift(-2)), 1, -1)
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 -1
11 6 -1
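To see why this works: rolling(2).mean() at position i holds the mean of elements i-1 and i, and shift(-2) moves each value up two rows, so at position i the series holds the mean of elements i+1 and i+2. A quick check on the sample data:

m = df[0].rolling(2).mean().shift(-2)
# m[0] == (2 + 3) / 2 == 2.5, m[3] == (3 + 2) / 2 == 2.5, ...
# the last two entries are NaN, so gt(NaN) is False and those rows get -1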
Keeping the last items unchanged:
m = df[0].rolling(2).mean().shift(-2)
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 1
11 6 6
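If you prefer to avoid rolling, the same helper series can be built from two explicit shifts (a sketch of the equivalent idea):

m = (df[0].shift(-1) + df[0].shift(-2)) / 2
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])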
When displaying a DataFrame in a Jupyter notebook, the index is displayed in a hierarchical way, so that repeated labels are not shown in the following row. E.g. a DataFrame with a MultiIndex with the following labels
[1, 1, 1, 1]
[1, 1, 0, 1]
will be displayed as
1 1 1 1 ...
0 1 ...
Can I change this behaviour so that all index values are shown despite repetition? Like this:
1 1 1 1 ...
1 1 0 1 ...
import pandas as pd
import numpy as np
import itertools
N_t = 5
N_e = 2
classes = list(itertools.product([0, 1], repeat=N_e))
N_c = len(classes)
noise = np.random.randint(0, 10, size=(N_c, N_t))
df = pd.DataFrame(noise, index=classes)
df
0 1 2 3 4
0 0 5 9 4 1 2
1 2 2 7 9 9
1 0 1 7 3 6 9
1 4 9 8 2 9
# should be shown as
0 1 2 3 4
0 0 5 9 4 1 2
0 1 2 2 7 9 9
1 0 1 7 3 6 9
1 1 4 9 8 2 9
Use -
with pd.option_context('display.multi_sparse', False):
    print(df)
Output
0 1 2 3 4
0 0 8 1 4 0 2
0 1 0 1 7 4 7
1 0 9 6 5 2 0
1 1 2 2 7 2 7
And globally:
pd.options.display.multi_sparse = False
Or, thanks @Kyle:
print(df.to_string(sparsify=False))
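If you set the option globally and later want the default behaviour back, it can be restored (a small sketch):

pd.reset_option('display.multi_sparse')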
So I have a large dataframe, using pandas.
When I do max(df['A']) it reports a max of 9999 when it should be 396450 by observation.
import numpy as np
import pandas as pd

# read the file into an array; lines starting with '#' are skipped as comments
lines = np.loadtxt("20170901.as-rel2.txt", dtype='str', comments="#", delimiter="|", unpack=False)
# ignore col 4
lines = lines[:, :3]
# convert to dataframe
df = pd.DataFrame(lines, columns=['A', 'B', 'C'])
After finding the max I have to count each node (col 'A') and say how many times it is repeated.
Here is a sample of the file:
df=
A B C
0 2 45714 0
1 2 52685 -1
2 3 293 0
3 3 23248 -1
4 3 133296 0
5 3 265301 -1
6 5 28599 -1
7 5 52352 0
8 5 262879 -1
9 5 265048 -1
10 5 265316 -1
11 10 46392 0
.....
384338 396238 62605 -1
384339 396371 3785 -1
384340 396434 35039 -1
384341 396450 2495 -1
384342 396450 5078 -1
Expect:
[1, 0
2, 2
3, 4
4, 0
5, 5
10, 1
....]
I was going to run a for loop with i <= maxvalue (the max value exceeds the number of rows) and use a counter. What is the most effective method?
np.bincount
pd.Series(np.bincount(df.A))
0 0
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9 0
10 1
dtype: int64
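Note that np.loadtxt with dtype='str' gives you string columns, which also explains why max(df['A']) reports 9999: strings compare lexicographically. Cast the column to int first (a minimal sketch, assuming the column is all numeric):

df['A'] = df['A'].astype(int)
df['A'].max()                 # now 396450, the numeric maximum
pd.Series(np.bincount(df.A))  # np.bincount needs non-negative integers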
Using Categorical with value_counts
df.A = pd.Categorical(df.A, categories=np.arange(1, max(df.A) + 1))
df.A.value_counts().sort_index()
Out[312]:
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9 0
10    1
Name: A, dtype: int64
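Alternatively, the zero-filled counts can be obtained from the plain integer column, without the Categorical conversion, by reindexing value_counts (a sketch):

df.A.value_counts().reindex(np.arange(1, df.A.max() + 1), fill_value=0)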
If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame so that each row becomes the previous row plus my array?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row. You can create these increments using np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add s to every row, restore the original first row, then take the cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment:
df[1:] = (df + a).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df + a).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After adding the array, you can take the cumulative sum, shift it down one row (filling the first row with 0), and add it back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
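For the 5-column example from the question, the intermediate updated frame holds the shifted cumulative increments (dtype becomes float after the shift/fillna), so adding it back to df reproduces the expected result:

# updated:
#    0  1   2   3   4
# 0  0  0   0   0   0
# 1  2  3   4   5   6
# 2  3  5   7   9  11
# 3  4  7  10  13  16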
I have a two dimensional numpy array:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
How would I go about converting this into a pandas DataFrame that has the x coordinate, the y coordinate, and the corresponding array value at that index, like this:
x y val
0 0 1
0 1 4
0 2 7
1 0 2
1 1 5
1 2 8
...
With stack and reset index:
df = pd.DataFrame(arr).stack().rename_axis(['y', 'x']).reset_index(name='val')
df
Out:
y x val
0 0 0 1
1 0 1 2
2 0 2 3
3 1 0 4
4 1 1 5
5 1 2 6
6 2 0 7
7 2 1 8
8 2 2 9
If ordering is important:
df.sort_values(['x', 'y'])[['x', 'y', 'val']].reset_index(drop=True)
Out:
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
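If you want the x-major ordering directly, without the extra sort, you can stack the transpose instead (a sketch of the same idea):

pd.DataFrame(arr.T).stack().rename_axis(['x', 'y']).reset_index(name='val')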
Here's a NumPy method -
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> shp = arr.shape
>>> r,c = np.indices(shp)
>>> pd.DataFrame(np.c_[r.ravel(), c.ravel(), arr.ravel('F')],
...              columns=['x', 'y', 'val'])
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the block of values in column x is repeated, in the same order, N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in the same order, you can reshape, where N is the number of distinct x values:
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
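For completeness, the cumcount trick also works with set_index/unstack instead of pivot (a sketch yielding the same x-by-column layout):

result = df.set_index(['x', df.groupby('x').cumcount()])['y'].unstack()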