Say I have a list of integers that correspond to the points where I want to increase an integer value by 1,
for example Int64Index([5, 10]) (not necessarily evenly spaced like that), and I have a dataframe like:
val new_col
0 0.729726564 1
1 0.067509062 1
2 0.943927114 1
3 0.037718436 1
4 0.512142908 1
5 0.767198655 2
6 0.202230787 2
7 0.343767479 2
8 0.540026305 2
9 0.256425022 2
10 0.403845023 3
11 0.444475008 3
12 0.464677745 3
I want to create new_col, an integer column that increases by one at each of the above index rows.
Edit:
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': np.random.rand(14)})
df['new_col'] = 1
How to increase the value of new_col by one at each index point (5, 10)?
I see from your comment that you refer to an "arbitrary position", so you can space the break points as you wish with bins.
example:
bins = [-1,3,5,12,14] #space as you wish
labels = [1,2,3,4] #labels or in your case values that you want
df['new_col'] = pd.cut(list(df.index.values), bins=bins, labels=labels)
val new_col
0 0.509742 1
1 0.081701 1
2 0.990583 1
3 0.813398 1
4 0.905022 2
5 0.951973 2
6 0.702487 3
7 0.916432 3
8 0.647568 3
9 0.955188 3
10 0.875067 3
11 0.284496 3
12 0.393931 3
13 0.341115 4
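The bins can also be derived from the list of break points rather than typed by hand. The following is a minimal sketch, assuming a default integer index (0..n-1) and the `indices = [5, 10]` from the question; the names `indices`, `bins` and `labels` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.random.rand(14)})
indices = [5, 10]  # rows at which the counter should step up (assumed sorted)

# pd.cut uses right-closed intervals by default, so every edge is the
# last row of a block: (-1, 4], (4, 9], (9, 13].
bins = [-1] + [i - 1 for i in indices] + [len(df) - 1]
labels = list(range(1, len(bins)))  # one label per interval: 1, 2, 3

cut = pd.cut(df.index.to_numpy(), bins=bins, labels=labels)
# Wrap the categorical result in a Series to convert it to plain integers.
df['new_col'] = pd.Series(cut, index=df.index).astype(int)
print(df)
```

With this construction rows 0-4 get 1, rows 5-9 get 2 and rows 10-13 get 3.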
Use numpy.split with enumerate:
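import numpy as np
import pandas as pd
indices = [5, 10]
# pd.np was removed in pandas 2.0, so NumPy is imported directly and splits the underlying array
df['add_col'] = np.concatenate([s + n for n, s in enumerate(np.split(df['new_col'].to_numpy(), indices))])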
print(df)
Output:
val new_col add_col
0 0.953431 1 1
1 0.929134 1 1
2 0.548343 1 1
3 0.080713 1 1
4 0.465212 1 1
5 0.290549 1 2
6 0.570886 1 2
7 0.232350 1 2
8 0.036968 1 2
9 0.455084 1 2
10 0.385177 1 3
11 0.811477 1 3
12 0.802502 1 3
13 0.001847 1 3
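If the goal is only the stepped counter itself, a possible alternative (not from the answers above) is np.searchsorted, which counts how many break points are less than or equal to each row label. This sketch assumes `indices` is sorted and that the index is numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': np.random.rand(14)})
indices = [5, 10]  # must be sorted for searchsorted to be meaningful

# side='right' makes the row equal to a break point belong to the new block,
# so row 5 already gets the value 2 and row 10 the value 3.
df['new_col'] = 1 + np.searchsorted(indices, df.index.to_numpy(), side='right')
print(df)
```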
I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
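To see why the chain works, it may help to look at each intermediate step. The sketch below uses the data from the question; the variable names `step`, `new_run` and `counter` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

# Difference to the previous A *within the same B group*;
# the first row of every group has no predecessor and becomes NaN.
step = df.groupby('B')['A'].diff()

# A new run starts wherever the step is not exactly 1 (NaN counts as "not 1").
new_run = step.ne(1)

# The running total of run starts gives every run its own number, starting at 1.
counter = new_run.cumsum()

# factorize re-encodes the run numbers as 0, 1, 2, ... in order of appearance.
df['C'] = counter.factorize()[0]
print(df)
```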
Use SeriesGroupBy.diff, compare the result for not equal to 1, take Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
I have this table:
import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']
I want to keep at most 3 consecutive duplicates, or keep all of them if there are fewer than 3 (or no) duplicates.
The result should look like this:
list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head with a helper Series of consecutive-run ids, created by comparing the values with their shifted values for inequality and taking the cumulative sum with Series.cumsum:
df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
A
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Detail:
print (df.A.ne(df.A.shift()).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 5
10 5
11 5
12 5
13 6
14 6
Name: A, dtype: int32
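The same idea can be wrapped in a small helper so that the column and the limit are parameters. This is a sketch only; the function name `keep_first_n_consecutive` is hypothetical and not part of pandas.

```python
import pandas as pd


def keep_first_n_consecutive(frame, column, n):
    """Keep at most n rows from every run of equal consecutive values in column."""
    # A run starts wherever a value differs from the previous one;
    # the cumulative sum turns those starts into a run id per row.
    run_id = frame[column].ne(frame[column].shift()).cumsum()
    # head(n) keeps the first n rows of each run and preserves the original order.
    return frame.groupby(run_id).head(n)


df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3, 3, 3, 4, 1, 1, 1, 1, 2, 2]})
result = keep_first_n_consecutive(df, 'A', 3)
print(result)
```

The original index labels are kept, so the dropped rows (7 and 12 here) remain visible as gaps; reset_index(drop=True) can be appended if a fresh index is wanted.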
Let us do (the column is still named 0 here, i.e. before it is renamed to 'A'):
df[df.groupby(df[0].diff().ne(0).cumsum())[0].cumcount()<3]
0
0 1
1 1
2 2
3 2
4 3
5 3
6 3
8 4
9 1
10 1
11 1
13 2
14 2
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing the first 3 elements of each group:
import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i,g in itertools.groupby(df['A'])))
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 1
9 1
10 1
11 2
12 2
dtype: int64
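The itertools approach also works on a plain list, without pandas. A minimal sketch, using islice in place of the list slice so that each run is not fully materialised first (the name `capped` is illustrative):

```python
from itertools import chain, groupby, islice

values = [1, 1, 2, 2, 3, 3, 3, 3, 4, 1, 1, 1, 1, 2, 2]

# groupby yields one iterator per run of equal consecutive values;
# islice takes at most the first 3 items of each run.
capped = list(chain.from_iterable(islice(run, 3) for _, run in groupby(values)))
print(capped)
```

Note that this variant returns only the values; the original row positions are lost, which is why the Series above has a fresh 0..12 index.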
When a DataFrame is displayed in a Jupyter notebook, the index is shown in a hierarchical way, so that repeated labels are not shown again in the following rows. E.g. a DataFrame with a MultiIndex with the following labels
[1, 1, 1, 1]
[1, 1, 0, 1]
will be displayed as
1 1 1 1 ...
0 1 ...
Can I change this behaviour so that all index values are shown despite repetition, like this?
1 1 1 1 ...
1 1 0 1 ...
import pandas as pd
import numpy as np
import itertools
N_t = 5
N_e = 2
classes = tuple(list(itertools.product([0, 1], repeat=N_e)))
N_c = len(classes)
noise = np.random.randint(0, 10, size=(N_c, N_t))
df = pd.DataFrame(noise, index=classes)
df
0 1 2 3 4
0 0 5 9 4 1 2
1 2 2 7 9 9
1 0 1 7 3 6 9
1 4 9 8 2 9
# should be shown as
0 1 2 3 4
0 0 5 9 4 1 2
0 1 2 2 7 9 9
1 0 1 7 3 6 9
1 1 4 9 8 2 9
Use -
with pd.option_context('display.multi_sparse', False):
    print(df)
Output
0 1 2 3 4
0 0 8 1 4 0 2
0 1 0 1 7 4 7
1 0 9 6 5 2 0
1 1 2 2 7 2 7
And globally:
pd.options.display.multi_sparse = False
or (thanks @Kyle):
print(df.to_string(sparsify=False))
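The options can be compared side by side on a small frame. This sketch assumes the default setting of display.multi_sparse (True) and uses a made-up 4x3 frame rather than the random data from the question.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[0, 1], [0, 1]])
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=index)

sparse = df.to_string()                # default: repeated outer labels are blanked
dense = df.to_string(sparsify=False)   # every index label is printed on every row

# The option affects to_string (and the repr) only inside the with-block.
with pd.option_context('display.multi_sparse', False):
    with_option = df.to_string()

print(sparse)
print(dense)
```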
I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
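Applied to the DataFrame of zeroes described in the question, the same call gives rows whose values equal their index. A minimal sketch, assuming a default integer index (the names `zeros` and `result` are illustrative):

```python
import numpy as np
import pandas as pd

zeros = pd.DataFrame(np.zeros((4, 3), dtype=int))

# axis=0 aligns the Series with the row labels, so the value for
# index i is added to every column of row i.
result = zeros.add(zeros.index.to_series(), axis=0)
print(result)
```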
I have created a matrix:
items = [0, 1, 2, 3]
item_to_item = pd.DataFrame(index=items, columns=items)
I've put values in it so:
It's symmetric
Its diagonal is all 0's
for example:
0 1 2 3
0 0 4 5 9
1 4 0 3 7
2 5 3 0 3
3 9 7 3 0
I want to create a data frame of all possible pairs (from [0, 1, 2, 3]) so that there won't be pairs of (x, x), and if (x, y) is in, I don't want (y, x), because the matrix is symmetric and holds the same value.
In the end I will have the following Dataframe (or numpy 2d array)
item, item, value
0 1 4
0 2 5
0 3 9
1 2 3
1 3 7
2 3 3
Here's a NumPy solution with np.triu_indices -
In [453]: item_to_item
Out[453]:
0 1 2 3
0 0 4 5 9
1 4 0 3 7
2 5 3 0 3
3 9 7 3 0
In [454]: r,c = np.triu_indices(len(items),1)
In [455]: pd.DataFrame(np.column_stack((r,c, item_to_item.values[r,c])))
Out[455]:
0 1 2
0 0 1 4
1 0 2 5
2 0 3 9
3 1 2 3
4 1 3 7
5 2 3 3
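As a self-contained script, the same approach could look like the sketch below. It assumes the matrix is square with identical row and column labels; the column names `item_a`, `item_b` and `value` are made up for illustration, and the labels are looked up through the index so that non-positional labels would also work.

```python
import numpy as np
import pandas as pd

items = [0, 1, 2, 3]
item_to_item = pd.DataFrame([[0, 4, 5, 9],
                             [4, 0, 3, 7],
                             [5, 3, 0, 3],
                             [9, 7, 3, 0]], index=items, columns=items)

# Row/column positions of the strict upper triangle (k=1 skips the diagonal).
r, c = np.triu_indices(len(items), k=1)

labels = item_to_item.index.to_numpy()
pairs = pd.DataFrame({'item_a': labels[r],
                      'item_b': labels[c],
                      'value': item_to_item.to_numpy()[r, c]})
print(pairs)
```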
numpy's np.triu gives you the upper triangle with all other elements set to zero. You can use that to construct your DataFrame and replace them with NaNs (so that they are dropped when you stack the columns):
pd.DataFrame(np.triu(df), index=df.index, columns=df.columns).replace(0, np.nan).stack()
Out:
0 1 4.0
2 5.0
3 9.0
1 2 3.0
3 7.0
2 3 3.0
dtype: float64
You can use reset_index at the end to convert indices to columns.
Another alternative would be stacking, resetting the index, and then using a callable to slice the DataFrame:
df.stack().reset_index()[lambda x: x['level_0'] < x['level_1']]
Out:
level_0 level_1 0
1 0 1 4
2 0 2 5
3 0 3 9
6 1 2 3
7 1 3 7
11 2 3 3
This one requires pandas 0.18.0.
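One difference between the two stack-based variants is worth noting: replacing 0 with NaN also discards any genuine zero above the diagonal, whereas filtering on the labels does not. The sketch below illustrates this with a modified matrix in which the (1, 2) entry is 0 (an assumption for the example, not the data from the question); the column names are again illustrative.

```python
import pandas as pd

items = [0, 1, 2, 3]
item_to_item = pd.DataFrame([[0, 4, 5, 9],
                             [4, 0, 0, 7],
                             [5, 0, 0, 3],
                             [9, 7, 3, 0]], index=items, columns=items)

# Long format: one row per (row label, column label, value).
pairs = item_to_item.stack().reset_index()
pairs.columns = ['item_a', 'item_b', 'value']

# Keep each unordered pair once and drop the diagonal by comparing labels.
pairs = (pairs[pairs['item_a'] < pairs['item_b']]
         .sort_values(['item_a', 'item_b'])
         .reset_index(drop=True))
print(pairs)
```

The pair (1, 2) with value 0 is retained here, while the replace(0, np.nan) variant would drop it.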