I need to go through a large pandas DataFrame and select consecutive rows that share the same value in a column. For example, given the DataFrame below and column x:
col row x y
1 1 1 1
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
The expected output would be:
col row x y
6 3 3 8
9 2 3 4
5 3 3 9
5 5 5 1
3 7 5 2
Not sure how to do this.
IIUC, use boolean indexing with a mask that flags every row whose x equals the x of the previous or the next row:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
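If runs need a minimum length other than two, one possible variation is to label each run and filter on its size. The sketch below rebuilds the question's data; the names run_id, run_len and min_run are illustrative and not from the original answer.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "col": [1, 2, 6, 9, 5, 4, 5, 3, 6],
    "row": [1, 2, 3, 2, 3, 9, 5, 7, 6],
    "x":   [1, 2, 3, 3, 3, 4, 5, 5, 6],
    "y":   [1, 2, 8, 4, 9, 4, 1, 2, 6],
})

# The run id increases by one every time x differs from the previous row.
run_id = df["x"].ne(df["x"].shift()).cumsum()

# Length of the run each row belongs to.
run_len = df.groupby(run_id)["x"].transform("size")

min_run = 2  # hypothetical threshold: keep runs of at least this many rows
result = df[run_len >= min_run]
print(result)
```

With min_run = 2 this should select the same rows as the shift-based mask above; raising the threshold keeps only longer runs.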
I start with:
df
0 1 2 3 4
0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
and want to end up with:
df
0 1 2 3 4
A B C
1 2 0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
where A and B are known after df creation, and C is the original
index of the df.
MWE:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2
breakpoint()
This is what I have in mind, but it raises an unhashable type error:
df.reindex([df_a, df_b, df.index])
Try with pd.MultiIndex.from_product:
df.index = pd.MultiIndex.from_product(
[[df_a], [df_b], df.index], names=['A','B','C'])
df
Out[682]:
0 1 2 3 4
A B C
1 2 0 7 0 1 9 9
1 0 4 7 3 2
2 7 2 0 0 4
3 5 5 6 8 4
4 1 4 9 8 1
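As a rough check of the approach, the sketch below applies the same from_product call to deterministic data (np.arange instead of random integers, which is an assumption made only so the result is reproducible) and then shows how the extra levels can be read back or removed.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame in the question.
df = pd.DataFrame(np.arange(25).reshape(5, 5))
df_a, df_b = 1, 2

# Prepend two constant levels (A, B) to the existing index (C).
original_index = df.index
df.index = pd.MultiIndex.from_product(
    [[df_a], [df_b], original_index], names=["A", "B", "C"]
)

# Select everything under A=1, B=2; xs drops the selected levels by default.
subset = df.xs((df_a, df_b), level=["A", "B"])

# Remove the added levels again, leaving only C.
restored = df.droplevel(["A", "B"])
print(df)
```

Because A and B each hold a single value, the product has exactly as many entries as the original index, so the assignment to df.index lines up row for row.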
I have a DataFrame that looks like this:
A B....X
1 1 A
2 2 B
3 3 A
4 6 K
5 7 B
6 8 L
7 9 M
8 1 N
9 7 B
1 6 A
7 7 A
That is, "rising edges" occur from time to time in column X (in this example, an edge is X == 'B').
What I need is a new column Y that increments every time the value B occurs in X:
A B....X Y
1 1 A 0
2 2 B 1
3 3 A 1
4 6 K 1
5 7 B 2
6 8 L 2
7 9 M 2
8 1 N 2
9 7 B 3
1 6 A 3
7 7 A 3
In SQL I would use a trick like sum(case when x=B then 1 else 0) over ... rows between first and previous. How can I do this in pandas?
Use cumsum on the boolean mask; each True counts as 1, so the running total increments at every B:
df['Y'] = (df.X == 'B').cumsum()
Out[8]:
A B X Y
0 1 1 A 0
1 2 2 B 1
2 3 3 A 1
3 4 6 K 1
4 5 7 B 2
5 6 8 L 2
6 7 9 M 2
7 8 1 N 2
8 9 7 B 3
9 1 6 A 3
10 7 7 A 3
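The question speaks of "rising edges", so it may matter whether two B rows in a row should count once or twice. The plain cumsum counts every B. The sketch below reproduces the result above and adds a hypothetical helper, count_rising_edges (not part of the original answer), that only counts a B when the previous row was not B.

```python
import pandas as pd

def count_rising_edges(values: pd.Series, target) -> pd.Series:
    """Running count of transitions from 'not target' to 'target'."""
    is_target = values.eq(target)
    # A rising edge is a target row whose predecessor is not a target row.
    previous = is_target.shift(fill_value=False)
    return (is_target & ~previous).cumsum()

df = pd.DataFrame({"X": list("ABAKBLMNBAA")})

# Counts every occurrence of B, as in the answer above.
df["Y"] = df["X"].eq("B").cumsum()

# Counts only the start of each run of B values.
df["Y_edges"] = count_rising_edges(df["X"], "B")
print(df)
```

For the data in the question the two columns are identical, since B never appears twice in a row; they would differ only if it did.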
I have a dataframe
C V S D LOC
1 2 3 4 X
5 6 7 8
1 2 3 4
5 6 7 8 Y
9 10 11 12
How can I select the rows from LOC X to LOC Y and export them to another CSV?
Use idxmax to get the index label of the first True value in each condition:
df = df.loc[(df['LOC'] == 'X').idxmax():(df['LOC'] == 'Y').idxmax()]
print (df)
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
In [133]: df.loc[df.index[df.LOC=='X'][0]:df.index[df.LOC=='Y'][0]]
Out[133]:
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
PS: this will select all rows between the first occurrence of X and the first occurrence of Y, both included.
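Neither answer covers the export step, which can be done with DataFrame.to_csv. The sketch below rebuilds the question's data, guards against a missing marker (on an all-False mask idxmax simply returns the first label, which would silently select the wrong rows), and writes to an in-memory buffer. The buffer stands in for a real file path, which is left as an assumption.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C": [1, 5, 1, 5, 9],
    "V": [2, 6, 2, 6, 10],
    "S": [3, 7, 3, 7, 11],
    "D": [4, 8, 4, 8, 12],
    "LOC": ["X", np.nan, np.nan, "Y", np.nan],
})

is_x = df["LOC"].eq("X")
is_y = df["LOC"].eq("Y")

# idxmax on an all-False mask returns the first label, so check explicitly.
if not (is_x.any() and is_y.any()):
    raise ValueError("LOC must contain both 'X' and 'Y'")

# Label-based slicing with .loc includes both end points.
selected = df.loc[is_x.idxmax():is_y.idxmax()]

# Write to a buffer here; pass a file name instead to create a CSV on disk.
buffer = io.StringIO()
selected.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False leaves the row labels out of the file; drop that argument if the original index should be kept.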
I have a dictionary as follows:
d = {1: array([2, 3]), 2: array([8, 4, 5]), 3: array([6, 7, 8, 9])}
As shown, the value for each key is a variable-length array.
Now I want to convert it to a DataFrame, so that the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.DataFrame(d), but it does not handle the one-to-many mapping. Any help would be appreciated.
Use the Series constructor with str.len to get the length of each array (str.len also works on list-like values).
Then create a new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
import numpy as np
import pandas as pd
d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}
print(d)
a = pd.Series(d)
l = a.str.len()  # number of elements per key
df = pd.DataFrame({'A': np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print(df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
pd.DataFrame(
[[k, v] for k, a in d.items() for v in a.tolist()],
columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}
Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
.stack()
.dropna()  # drop padding NaNs in case stack() keeps them (depends on pandas version)
.reset_index(name='B')
.drop('level_1', axis=1)
.astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
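Another option, not used in the answers above, may be Series.explode, which expands each list-like value (NumPy arrays included) into one row per element while repeating the index. A minimal sketch with the same dictionary:

```python
import numpy as np
import pandas as pd

d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}

df = (
    pd.Series(d)                 # keys become the index, arrays the values
    .explode()                   # one row per array element, key repeated
    .rename_axis("A")            # name the index so it becomes column A
    .reset_index(name="B")       # move the index into a column, values into B
    .astype({"B": int})          # explode leaves object dtype; cast back to int
)
print(df)
```

The final astype is there because explode returns an object-dtype Series even when every element is an integer.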