I have the following table in a Pandas dataframe:
Seconds     Color  Break  NaN  End
0.639588    123    4      NaN  -
1.149597    123    1      NaN  -
1.671333    123    2      NaN  -
1.802052    123    2      NaN  -
1.900091    123    1      NaN  -
2.031240    123    4      NaN  -
2.221477    123    3      NaN  -
2.631840    123    2      NaN  -
2.822245    123    1      NaN  -
2.911147    123    4      NaN  -
3.133344    123    1      NaN  -
3.531246    123    1      NaN  -
3.822389    123    1      NaN  -
3.999389    123    2      NaN  -
4.327990    123    4      NaN  -
I'm trying to extract subgroups of the column labelled as 'Break' in such a way that the first and last item of each group is a '4'. So, the first group should be: [4,1,2,2,1,4]; the second group: [4,3,2,1,4]; the third group: [4,1,1,1,2,4]. The last '4' of each group is the first '4' of the following group.
I have the following code:
groups = []

def extract_phrases_between_four(data, new_group = []):
    row_iterator = data.iterrows()
    for i, row in row_iterator:  # for index, row in row_iterator
        if row['Break_Level_Annotation'] != '4':
            new_group.append(row['Break_Level_Annotation'])
        if row['Break_Level_Annotation'] == '4':
            new_group = []
            new_group.append(row['Break_Level_Annotation'])
        groups.append(new_group)
    return groups
but my output is:
[[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2]].
It's returning the same new_group repeatedly as many times as there are items in each new_group, while at the same time not including the final '4' of each new_group.
I've tried moving the code around, but I can't work out what the problem is. How can I get each new_group to include both its first and final '4', and have each new_group appear only once in the array 'groups'?
IIUC you can extract the index and use a list comprehension:

import numpy as np

s = df.loc[df["Break"].eq(4)].index
print([df.loc[np.arange(x, y + 1), "Break"].tolist() for x, y in zip(s, s[1:])])
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
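Note that df.loc[np.arange(x, y + 1), "Break"] treats the index labels as consecutive positions, so this relies on the default RangeIndex. If the frame has been filtered or re-indexed, a purely positional variant is safer (a sketch, not part of the original answer):

# Positions (not labels) of every 4 in the column.
pos = np.flatnonzero(df["Break"].eq(4))
# Slice positionally from each 4 up to and including the next 4.
print([df["Break"].iloc[x:y + 1].tolist() for x, y in zip(pos, pos[1:])])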
The problem is that in each step in the for loop, you are adding new_group to groups, although you are still adding elements to new_group. You need to execute groups.append(new_group) inside the second if statement.
Also, note that you can iterate directly over the "Break" column's values instead of iterating over the whole dataframe and accessing the value in each row.
I rewrote the code a little bit, and it looks as follows:
groups = []
new_group = []

for i in data["Break"]:
    new_group.append(i)
    if i == 4:
        if len(new_group) > 1:
            groups.append(new_group)
        new_group = [4]

print(groups)
And here is the result:
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
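As an aside, the repeated identical groups in the original output come from list aliasing: groups.append(new_group) stores a reference to the list, not a copy, so appending the same object several times shows the same contents several times. A minimal illustration:

a = []
groups = [a, a]   # two references to the same list object
a.append(4)
print(groups)     # [[4], [4]] - both entries reflect the mutation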
I have the following dataframe:
d_test = {
    'random_staff' : ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Some values may repeat, but none are missing. In the example above those values are: 1, 2, 3, 4.
I want to be able to select some value from the cluster_number column and change every repeated occurrence of it to a new unique value, so that no value goes missing. For example, if we select the value 2, the desired outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note we had three 2s in the column: we kept the first one as 2, changed the next occurrence to 5, and changed the last one to 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for loop, and I am trying to understand whether it can be transformed into the pandas .apply method (or any other efficient vectorized solution).
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1 & m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
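If you need to do this for more than one value, the same masks can be wrapped in a small helper; a minimal sketch (the function name renumber_duplicates is mine, not from the original answer):

import pandas as pd

def renumber_duplicates(df, col, value):
    # Rows holding the selected value.
    m1 = df[col].eq(value)
    # Repeated occurrences; the first occurrence stays untouched.
    m2 = df[col].duplicated()
    # Replace each duplicate with max+1, max+2, ...
    df.loc[m1 & m2, col] = (
        m2.where(m1).cumsum()
          .add(df[col].max())
          .convert_dtypes()
    )
    return df

renumber_duplicates(df_test, 'cluster_number', 2)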
I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
Two types of events can be detected from this column; I want to label them 1 and 2.
I need to get the indexes of each label, and to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff with a mask testing for values less than 0, then use boolean indexing to get the indices:
m = df1.val.diff().lt(0)
#if need test less like -7
#m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print(one)
Int64Index([0, 1, 3, 5], dtype='int64')

print(two)
Int64Index([2, 4], dtype='int64')
If you need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print(df1.val.diff())
0       NaN
1      0.02
2     -8.80
3     10.55
4    -15.06
5    917.49
Name: val, dtype: float64
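If the requirement is literally a change of more than ±7 in either direction (not only drops), the same pattern works with an absolute threshold; a sketch under that assumption (note it would also flag the upward jumps at rows 3 and 5, which differs from the expected output above):

# Flag rows whose absolute change from the previous row exceeds 7.
m = df1.val.diff().abs().gt(7)
one = df1.index[~m].tolist()
two = df1.index[m].tolist()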
I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, suppose I wanted to subtract a different value at position [0], use the original subtraction value on positions [1:3], then use the second value again at position [4], and repeat this pattern down the column. How would I do this? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do it another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
Output:

   test  new_column
0    12           6
1     4           1
2     5           2
3     4           1
4     1          -5
5     3           0
6     2          -1
7     5           2
8    10           4
9     9           6
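One caveat: data.index % 4 assumes the default RangeIndex. If the index has been shuffled or re-labelled, a positional counter is safer (a small variation, not part of the original answer):

# Position-based variant that does not depend on the index labels.
data['new_column'] = np.where(np.arange(len(data)) % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)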
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data['new_column'][::4] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
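Be aware that data['new_column'][::4] = ... is chained indexed assignment: it can raise SettingWithCopyWarning, and under pandas 2.x copy-on-write it may silently have no effect. A safer equivalent of the same logic writes through .loc:

data['new_column'] = data.test - subtraction_value
# Assign every 4th row in a single .loc call instead of chained indexing.
data.loc[::4, 'new_column'] = data.test[::4] - subtraction_value_2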
Normally, when you want to turn a set of data into a DataFrame, you make a list for each column, create a dictionary from those lists, then create the data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one by one isn't going to work. Instead I decided to make a single list and iteratively put a certain chunk of it into each column of the data frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list

# Result I want from the example list:
df =
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a': [], 'b': [], 'c': [], 'd': [], 'e': []}
df = pd.DataFrame(dict)
# Here is my test data frame; it contains 5 columns and no rows.

lst = np.arange(10).tolist()
# This is my test list; it looks like lst = [0, 1, ..., 9]

for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))

# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# For the second column, i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True, so that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
   a    b    c    d    e
0  0  NaN  NaN  NaN  NaN
1  1  NaN  NaN  NaN  NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
import numpy as np
import pandas as pd

lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2, -1, order='F'), columns = [*'abcde'])
print(df)
Output:
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
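The key detail is order='F' (Fortran order): the reshaped array is filled column by column, so consecutive list items land in the same column. If you prefer to avoid the order argument, reshaping row-wise and transposing gives the same layout (a small sketch, same data as above):

# reshape(-1, 2) pairs consecutive items; .T turns each pair into a column.
df = pd.DataFrame(alst.reshape(-1, 2).T, columns = [*'abcde'])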
What is the proper way to query top N rows by group in python datatable?
For example, to get the top 2 rows having the largest v3 value per (id2, id4) group, I would write the following pandas expression:
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
in R using data.table:
DT[order(-v3), head(v3, 2L), by=.(id2, id4)]
or in R using dplyr:
DF %>% arrange(desc(v3)) %>% group_by(id2, id4) %>% filter(row_number() <= 2L)
Example data and expected output using pandas:
import datatable as dt
dt = dt.Frame(id2=[1, 2, 1, 2, 1, 2], id4=[1, 1, 1, 1, 1, 1], v3=[1, 3, 2, 3, 3, 3])
df = dt.to_pandas()
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
#   id2  id4  v3
# 1   2    1   3
# 3   2    1   3
# 4   1    1   3
# 2   1    1   2
Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:
from datatable import *
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])
DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]
which produces
   id2  id4  v3
   ---  ---  --
0    1    1   3
1    1    1   2
2    2    1   3
3    2    1   3

[4 rows x 3 columns]
Explanation:
- by(f.id2, f.id4) groups the data by columns "id2" and "id4";
- sort(-f.v3) tells datatable to sort the records by column "v3" in descending order; in the presence of by() this operator is applied within each group;
- the first :2 selects the top 2 rows, again within each group;
- the second : selects all columns; if needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group, as in the sketch below.
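For example, replacing the second : with an expression keeps only the chosen columns while still operating per group; a small sketch using the same DT as above:

# Keep only v3 (plus the group keys) for the top-2 rows of each group.
DT[:2, f.v3, by(f.id2, f.id4), sort(-f.v3)]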