Let's consider the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, 4, 3, 2, 5, 6, 4, 2, 1, 6])
I want to do the following: if the i-th element of the dataframe is bigger than the mean of the next two elements, we assign 1, and if not, we assign -1 to this i-th element.
My solution
An obvious solution is the following:
df_copy = df.copy()
for i in range(len(df) - 2):
    # compare the i-th value with the mean of the next two values
    if df.iloc[i, 0] > df.iloc[i + 1 : i + 3, 0].mean():
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
However, I find it a little cumbersome, and I'm wondering if there is a loop-free solution to this kind of problem.
Desired output
      0
0    -1
1    -1
2    -1
3     1
4    -1
5    -1
6    -1
7     1
8     1
9    -1
10    1
11    6
You can use rolling.mean and shift:
df['out'] = np.where(df[0].gt(df[0].rolling(2).mean().shift(-2)), 1, -1)
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 -1
11 6 -1
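To see why this works: rolling(2).mean() at position i holds the mean of elements i-1 and i, and shift(-2) moves each value up two rows, so at position i the series holds the mean of elements i+1 and i+2. A quick check on the sample data:

m = df[0].rolling(2).mean().shift(-2)
# m[0] == (2 + 3) / 2 == 2.5, m[3] == (3 + 2) / 2 == 2.5, ...
# the last two entries are NaN, so gt(NaN) is False and those rows get -1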
Keeping the last items unchanged:
m = df[0].rolling(2).mean().shift(-2)
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 1
11 6 6
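If you prefer to avoid rolling, the same helper series can be built from two explicit shifts (a sketch of the equivalent idea):

m = (df[0].shift(-1) + df[0].shift(-2)) / 2
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])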
When displaying a DataFrame in a Jupyter notebook, the index is displayed in a hierarchical way, so that repeated labels are not shown in the following row. E.g. a DataFrame with a MultiIndex with the following labels
[1, 1, 1, 1]
[1, 1, 0, 1]
will be displayed as
1 1 1 1 ...
0 1 ...
Can I change this behaviour so that all index values are shown despite repetition? Like this:
1 1 1 1 ...
1 1 0 1 ...
import pandas as pd
import numpy as np
import itertools
N_t = 5
N_e = 2
classes = list(itertools.product([0, 1], repeat=N_e))
N_c = len(classes)
noise = np.random.randint(0, 10, size=(N_c, N_t))
df = pd.DataFrame(noise, index=classes)
df
0 1 2 3 4
0 0 5 9 4 1 2
1 2 2 7 9 9
1 0 1 7 3 6 9
1 4 9 8 2 9
# should be shown as
0 1 2 3 4
0 0 5 9 4 1 2
0 1 2 2 7 9 9
1 0 1 7 3 6 9
1 1 4 9 8 2 9
Use -
with pd.option_context('display.multi_sparse', False):
    print(df)
Output
0 1 2 3 4
0 0 8 1 4 0 2
0 1 0 1 7 4 7
1 0 9 6 5 2 0
1 1 2 2 7 2 7
And globally:
pd.options.display.multi_sparse = False
Or, thanks @Kyle:
print(df.to_string(sparsify=False))
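If you set the option globally and later want the default behaviour back, it can be restored (a small sketch):

pd.reset_option('display.multi_sparse')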
So I have a large dataframe, using pandas.
When I do max(df['A']) it reports a max of 9999 when it should be 396450 by observation.
import numpy as np
import pandas as pd

# read the file into an array; lines starting with '#' are skipped as comments
lines = np.loadtxt("20170901.as-rel2.txt", dtype='str', comments="#", delimiter="|", unpack=False)
# ignore col 4
lines = lines[:, :3]
# convert to dataframe
df = pd.DataFrame(lines, columns=['A', 'B', 'C'])
After finding the max I have to count each node (col 'A') and say how many times it is repeated.
Here is a sample of the file:
df=
A B C
0 2 45714 0
1 2 52685 -1
2 3 293 0
3 3 23248 -1
4 3 133296 0
5 3 265301 -1
6 5 28599 -1
7 5 52352 0
8 5 262879 -1
9 5 265048 -1
10 5 265316 -1
11 10 46392 0
.....
384338 396238 62605 -1
384339 396371 3785 -1
384340 396434 35039 -1
384341 396450 2495 -1
384342 396450 5078 -1
Expect:
[1, 0
2, 2
3, 4
4, 0
5, 5
10, 1
....]
I was going to run a for loop with i <= maxvalue (the max value exceeds the number of rows) and use a counter. What is the most effective method?
np.bincount
pd.Series(np.bincount(df.A))
0 0
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9 0
10 1
dtype: int64
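Note that np.loadtxt with dtype='str' gives you string columns, which also explains why max(df['A']) reports 9999: strings compare lexicographically. Cast the column to int first (a minimal sketch, assuming the column is all numeric):

df['A'] = df['A'].astype(int)
df['A'].max()                 # now 396450, the numeric maximum
pd.Series(np.bincount(df.A))  # np.bincount needs non-negative integers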
Using Categorical with value_counts
df.A = pd.Categorical(df.A, categories=np.arange(1, max(df.A) + 1))
df.A.value_counts().sort_index()
Out[312]:
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9 0
10    1
Name: A, dtype: int64
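Alternatively, the zero-filled counts can be obtained from the plain integer column, without the Categorical conversion, by reindexing value_counts (a sketch):

df.A.value_counts().reindex(np.arange(1, df.A.max() + 1), fill_value=0)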
If I have an array [1, 2, 3, 4, 5] and a Pandas Dataframe
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame so that each row becomes the previous row plus my array?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row. You can create these increments using np.arange(len(df))[:,None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add s to every row, restore the original first row, then take the cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment:
df[1:] = (df + a).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df + a).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
After adding the array, you can take the cumulative sum, shift it down one row (filling the first row with 0), and add it back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
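For the 5-column example from the question, the intermediate updated frame holds the shifted cumulative increments (dtype becomes float after the shift/fillna), so adding it back to df reproduces the expected result:

# updated:
#    0  1   2   3   4
# 0  0  0   0   0   0
# 1  2  3   4   5   6
# 2  3  5   7   9  11
# 3  4  7  10  13  16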
I have a two dimensional numpy array:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
How would I go about converting this into a pandas DataFrame that has the x coordinate, the y coordinate, and the corresponding array value at that index, like this:
x y val
0 0 1
0 1 4
0 2 7
1 0 2
1 1 5
1 2 8
...
With stack and reset index:
df = pd.DataFrame(arr).stack().rename_axis(['y', 'x']).reset_index(name='val')
df
Out:
y x val
0 0 0 1
1 0 1 2
2 0 2 3
3 1 0 4
4 1 1 5
5 1 2 6
6 2 0 7
7 2 1 8
8 2 2 9
If ordering is important:
df.sort_values(['x', 'y'])[['x', 'y', 'val']].reset_index(drop=True)
Out:
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
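If you want the x-major ordering directly, without the extra sort, you can stack the transpose instead (a sketch of the same idea):

pd.DataFrame(arr.T).stack().rename_axis(['x', 'y']).reset_index(name='val')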
Here's a NumPy method -
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> shp = arr.shape
>>> r,c = np.indices(shp)
>>> pd.DataFrame(np.c_[r.ravel(), c.ravel(), arr.ravel('F')],
...              columns=['x', 'y', 'val'])
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the block of values in column x is repeated, in the same order, N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in the same order, you can reshape, where N is the number of distinct x values:
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
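For completeness, the cumcount trick also works with set_index/unstack instead of pivot (a sketch yielding the same x-by-column layout):

result = df.set_index(['x', df.groupby('x').cumcount()])['y'].unstack()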