I have a large pandas dataframe. When I do max(df['A']) it reports a max of 9999, when by inspection of the data it should be 396450.
import numpy as np
import pandas as pd

# read the file into an array; comment lines starting with "#" are skipped
lines = np.loadtxt("20170901.as-rel2.txt", dtype='str', comments="#", delimiter="|", unpack=False)
# ignore col 4
lines = lines[:, :3]
# convert to dataframe
df = pd.DataFrame(lines, columns=['A', 'B', 'C'])
After finding the max, I have to count how many times each node (col 'A') appears.
Here is a sample of the parsed data:
df=
A B C
0 2 45714 0
1 2 52685 -1
2 3 293 0
3 3 23248 -1
4 3 133296 0
5 3 265301 -1
6 5 28599 -1
7 5 52352 0
8 5 262879 -1
9 5 265048 -1
10 5 265316 -1
11 10 46392 0
.....
384338 396238 62605 -1
384339 396371 3785 -1
384340 396434 35039 -1
384341 396450 2495 -1
384342 396450 5078 -1
Expected:
[1, 0
2, 2
3, 4
4, 0
5, 5
10, 1
....]
I was going to run a for loop with i <= maxvalue (the max value exceeds the number of rows) and use a Counter. What is the most efficient method?
Use np.bincount:
pd.Series(np.bincount(df.A))
0 0
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9 0
10 1
dtype: int64
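Note that np.bincount requires non-negative integers. Because loadtxt was called with dtype='str', df.A holds strings, which is also why max(df['A']) reported 9999: string comparison is lexicographic, so '9999' > '396450'. A minimal sketch of the conversion:
df['A'] = df['A'].astype(int)    # compare as numbers, not strings
df['A'].max()                    # 396450
pd.Series(np.bincount(df['A']))  # per-node counts, indexed by node id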
Using Categorical with value_counts
df.A = pd.Categorical(df.A, categories=np.arange(1, max(df.A) + 1))
df.A.value_counts().sort_index()
Out[312]:
1 0
2 2
3 4
4 0
5 5
6 0
7 0
8 0
9     0
10    1
Name: A, dtype: int64
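An alternative sketch that avoids mutating the column: take value_counts and reindex over the full range of node ids, filling missing ones with 0 (this assumes df.A has already been converted to int):
df.A.value_counts().reindex(np.arange(1, df.A.max() + 1), fill_value=0)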
Related
Let's consider the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, 4, 3, 2, 5, 6, 4, 2, 1, 6])
I want to do the following: if the i-th element of the dataframe is bigger than the mean of the next two elements, assign 1 to it; otherwise assign -1.
My solution
An obvious solution is the following:
df_copy = df.copy()
for i in range(len(df) - 2):
    # compare element i with the mean of the next two elements;
    # the last two rows are left unchanged
    if (df.iloc[i] > np.mean(df.iloc[(i+1):(i+3)]))[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
However, I find it a little cumbersome, and I'm wondering if there is a loop-free solution to this kind of problem.
Desired output
     0
0   -1
1   -1
2   -1
3    1
4   -1
5   -1
6   -1
7    1
8    1
9   -1
10   1
11   6
(the last two rows keep their original values because the loop stops before them)
You can use rolling.mean and shift:
df['out'] = np.where(df[0].gt(df[0].rolling(2).mean().shift(-2)), 1, -1)
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 -1
11 6 -1
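To see why the shift lines up, here is a small check on a hypothetical series s: rolling(2).mean() at row i averages rows i-1 and i, and shift(-2) moves that value up two rows, so row i is compared with the mean of rows i+1 and i+2.
s = pd.Series([1, 2, 3, 4])
s.rolling(2).mean()            # NaN, 1.5, 2.5, 3.5
s.rolling(2).mean().shift(-2)  # 2.5, 3.5, NaN, NaN -> row 0 gets mean(rows 1 and 2)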
Keeping the last items unchanged (the rolling mean is NaN for the last two rows, so the original values are restored there):
m = df[0].rolling(2).mean().shift(-2)
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 1
11 6 6
Say I have a list of integers which correspond to points where I want to increase an integer value by 1, for example Int64Index([5, 10]), not necessarily evenly spaced like that. And I have a dataframe like:
val new_col
0 0.729726564 1
1 0.067509062 1
2 0.943927114 1
3 0.037718436 1
4 0.512142908 1
5 0.767198655 2
6 0.202230787 2
7 0.343767479 2
8 0.540026305 2
9 0.256425022 2
10 0.403845023 3
11 0.444475008 3
12 0.464677745 3
I want to create new_col as an int column that increases by one at each of the above index rows.
Edit:
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': np.random.rand(14)})
df['new_col'] = 1
How do I increase the value of new_col by one at each index point (5, 10)?
I see from your comment that you refer to an "arbitrary position", so you can space the bins as you wish.
example:
bins = [-1, 3, 5, 12, 14]  # space as you wish
labels = [1, 2, 3, 4]      # the values you want
df['new_col'] = pd.cut(df.index, bins=bins, labels=labels)
val new_col
0 0.509742 1
1 0.081701 1
2 0.990583 1
3 0.813398 1
4 0.905022 2
5 0.951973 2
6 0.702487 3
7 0.916432 3
8 0.647568 3
9 0.955188 3
10 0.875067 3
11 0.284496 3
12 0.393931 3
13 0.341115 4
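For the specific breakpoints in the question (5 and 10), a loop-free alternative sketch (not from the original answers) counts how many breakpoints each row has passed:
points = [5, 10]  # the index points from the question
df['new_col'] = 1 + df.index.isin(points).cumsum()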
Use numpy.split with enumerate: np.split(df['new_col'], indices) splits the column at positions 5 and 10 into three pieces, enumerate numbers the pieces 0, 1, 2, that offset is added to each piece, and concat stitches them back together:
import numpy as np
import pandas as pd

indices = [5, 10]
df['add_col'] = pd.concat([s + n for n, s in enumerate(np.split(df['new_col'], indices))])
print(df)
Output:
val new_col add_col
0 0.953431 1 1
1 0.929134 1 1
2 0.548343 1 1
3 0.080713 1 1
4 0.465212 1 1
5 0.290549 1 2
6 0.570886 1 2
7 0.232350 1 2
8 0.036968 1 2
9 0.455084 1 2
10 0.385177 1 3
11 0.811477 1 3
12 0.802502 1 3
13 0.001847 1 3
I have a Pandas DataFrame like so:
d = {'doc': [0, 0, 0, 1, 1], 'sent': [0, 1, 2, 0, 1],
     'col1': [5, 6, 1, 6, 5], 'col2': [4, 3, 2, 1, 1], 'col3': [8, 2, 9, 6, 5]}
df = pd.DataFrame(data=d)
Which looks like:
doc sent col1 col2 col3
0 0 0 5 4 8
1 0 1 6 3 2
2 0 2 1 2 9
3 1 0 6 1 6
4 1 1 5 1 5
I'd like to bind the previous row and the next row to each row, column by column, like so (the "doc" and "sent" columns act as indices, so nothing comes before the first or after the last row of a doc):
doc sent col1 col2 col3 p_col1 p_col2 p_col3 n_col1 n_col2 n_col3
0 0 0 5 4 8 0 0 0 6 3 2
1 0 1 6 3 2 5 4 8 1 2 9
2 0 2 1 2 9 6 3 2 6 1 6
3 1 0 6 1 6 0 0 0 5 1 5
4 1 1 5 1 5 6 1 6 0 0 0
Use pd.DataFrame.shift to get the prev/next rows, pd.concat to merge the dataframes, and fillna to set the nulls to zero. The presence of nulls upcasts the int columns to float, since numpy integer arrays cannot hold null values; after the nulls are replaced with 0, the columns are cast back to int.
cs = ['col1', 'col2', 'col3']
g = df.groupby('doc')
pd.concat([
    df,
    g[cs].shift(-1).add_prefix('n'),
    g[cs].shift().add_prefix('p')
], axis=1).fillna(0).astype(int)
outputs:
doc sent col1 col2 col3 ncol1 ncol2 ncol3 pcol1 pcol2 pcol3
0 0 0 5 4 8 6 3 2 0 0 0
1 0 1 6 3 2 1 2 9 5 4 8
2 0 2 1 2 9 0 0 0 6 3 2
3 1 0 6 1 6 5 1 5 0 0 0
4 1 1 5 1 5 0 0 0 6 1 6
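To get the exact p_col1/n_col1 names from the question, include the underscore in the prefix, e.g. g[cs].shift().add_prefix('p_') and g[cs].shift(-1).add_prefix('n_').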
I have a Pandas DataFrame with two columns. In some of the rows the columns are swapped; if they're swapped, then column "a" will be negative. What would be the best way to check for that and then swap the values of the two columns?
def swap(a, b):
    if a < 0:
        return b, a
    else:
        return a, b
Is there some way to use apply with this function to swap the two values?
Try this, using np.where:
ary = np.where(df.a < 0, [df.b, df.a], [df.a, df.b])
pd.DataFrame({'a': ary[0], 'b': ary[1]})
Out[560]:
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
Data input :
df
Out[561]:
a b
0 -1 3
1 -1 3
2 -1 8
3 2 9
4 0 7
5 0 4
And using apply:
def swap(x):
    # x is a row; swap when column 'a' is negative
    if x.iloc[0] < 0:
        return pd.Series([x.iloc[1], x.iloc[0]], index=x.index)
    return x

df.apply(swap, axis=1)
Out[568]:
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
Out of boredom:
# for each row pick a column order: np.eye(2)[1] == [0, 1] keeps (a, b),
# np.eye(2)[0] == [1, 0] swaps to (b, a)
df.values[:] = df.values[
    np.arange(len(df))[:, None],
    np.eye(2, dtype=int)[(df.a.values >= 0).astype(int)]
]
df
a b
0 3 -1
1 3 -1
2 8 -1
3 2 9
4 0 7
5 0 4
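Another common loop-free idiom for the same swap is boolean .loc assignment (a sketch, assuming columns 'a' and 'b' as in the question):
m = df['a'] < 0
df.loc[m, ['a', 'b']] = df.loc[m, ['b', 'a']].values  # .values drops the labels so the columns actually swap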
I use pandas:
input:
import pandas as pd
a=pd.Series([0,0,1,0,0,0,0])
output:
0 0
1 0
2 1
3 0
4 0
5 0
6 0
I want the 1 to carry over into the next few rows, like this:
output:
0 0
1 0
2 1
3 1
4 1
5 1
6 0
I can use
a + a.shift(1) + a.shift(2) + a.shift(3)
but I don't think this is a smart solution. Does anyone have a smarter solution for this?
You can try this, assuming index 6 should be value 1 too:
a=pd.Series([0,0,1,0,0,0,0])
a.eq(1).cumsum()
Out[19]:
0 0
1 0
2 1
3 1
4 1
5 1
6 1
dtype: int32
Updated: for more than one value not equal to 0.
a = pd.Series([0, 0, 1, 0, 1, 3, 0])
A = pd.DataFrame({'a': a, 'Id': a.ne(0).cumsum()})
A.groupby('Id').a.cumsum()
Out[58]:
0 0
1 0
2 1
3 1
4 1
5 3
6 3
Or you can use ffill:
a[a.eq(0)] = np.nan
a.ffill().fillna(0)
Out[64]:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 3.0
6 3.0
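Assigning NaN into a upcasts the series to float, which is why the output shows 0.0/1.0/3.0. A one-line cast restores the integer dtype:
a.ffill().fillna(0).astype(int)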
1. Filter the series for your value (SearchValue).
2. Re-index the series to a to-be-stated length (LengthOfIndex) and forward-fill the value a given number of times (LengthOfFillRange).
3. Fill the remaining gaps with zeros again.
import pandas as pd
import numpy as np

a = pd.Series([0, 0, 1, 0, 0, 0, 0])
SearchValue = 1
LengthOfIndex = 7
LengthOfFillRange = 3

a = a[a == SearchValue]\
    .reindex(np.arange(LengthOfIndex),
             method='ffill',
             limit=LengthOfFillRange)\
    .fillna(0)
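With these parameters the sketch reproduces the desired output (as floats, because reindex introduces NaN):
print(a)
0    0.0
1    0.0
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
dtype: float64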
If you need to repeat only two distinct values in a Series up to some limit, use replace to turn the 0s into NaN, then ffill (fillna with method ffill) with a limit, and finally fillna to convert the remaining NaNs back to 0 (casting back to int if necessary):
a=pd.Series([0,0,1,0,0,0,0,1,0,0,0,])
print (a)
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 1
8 0
9 0
10 0
dtype: int64
b = a.replace(0, np.nan).ffill(limit=2).fillna(0).astype(a.dtype)
print (b)
0 0
1 0
2 1
3 1
4 1
5 0
6 0
7 1
8 1
9 1
10 0
dtype: int64