I have a data column of 828 values. The column repeats the numbers 1-18 many times. I want a new column that takes the first 5 values of every 18, so that it repeats 1-5 many times. This is not exactly what my data looks like, but the method is the same as what I want to do with my actual data.
My data column with the repeating 1-18 is stored in df_rois
I tried this line of code, but it just takes every 5th value without keeping the numbers in between:
df_rois2 = df_rois[::5]
What I currently have in df_rois:
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]
What I want in df_rois2:
[ 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5]
You can loop through the 18-length slices and extract the first 5 elements, then use extend to add them to df_rois2:
df_rois2 = []
for i in range(0, len(df_rois), 18):
    df_rois2.extend(df_rois[i:i + 5])
print(df_rois2)
Output:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
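If df_rois is a NumPy array (or a pandas column converted with to_numpy()) and its length is an exact multiple of 18, a reshape should give the same values without a Python loop. This is only a sketch: the np.tile line builds stand-in data, since the real column is not shown in the question.

```python
import numpy as np

# Stand-in for the real column (assumption): 1-18 repeated 46 times = 828 values.
df_rois = np.tile(np.arange(1, 19), 46)

# View the data as rows of 18, keep the first 5 columns, then flatten again.
# reshape(-1, 18) raises an error if the length is not a multiple of 18.
df_rois2 = df_rois.reshape(-1, 18)[:, :5].ravel()

print(df_rois2[:10])
```

The result is an array rather than a list; call df_rois2.tolist() if a list is needed.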
Related
Given
import pandas as pd
df = pd.DataFrame({
"a": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 2, 2, 3, 3, ],
})
print(df)
a
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
10 3
11 1
12 1
13 2
14 2
15 3
16 3
I need to calculate the following result:
res_df = pd.DataFrame({
"starts": [0, 3, 7, 11, 13, 15],
"ends": [3, 7, 11, 13, 15, 17]
})
print(res_df)
starts ends
0 0 3
1 3 7
2 7 11
3 11 13
4 13 15
5 15 17
If values were not duplicated, I could do something like zeroing out all duplicates, keeping the length of the group in groupby, then a cumsum.
However, there are duplicates, and order should be preserved.
Is there a way to do this in pandas?
As a follow-up, I would like to calculate starts and ends only for df["a"] == 3, if that would be computationally less expensive.
Let's try this:
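# give every run of equal consecutive values its own block number
blocks = df['a'].diff().ne(0).cumsum()
# some_mask is a placeholder for whatever boolean mask you need,
# e.g. df['a'].eq(3); use df itself to keep all rows
out = (df[some_mask]
       .index.to_frame()
       .groupby(blocks)[0]
       .agg(['min','max'])
      )
# make the end exclusive
out['max'] += 1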
Output:
min max
a
1 0 3
2 3 7
3 7 11
4 11 13
5 13 15
6 15 17
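The snippet above leaves some_mask undefined. A minimal sketch of the follow-up case, assuming the mask is simply df["a"] == 3 (the names mask and positions are illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 2, 2, 3, 3]})

# Label each run of equal consecutive values with its own block number.
blocks = df["a"].diff().ne(0).cumsum()

# Assumed mask: keep only the rows where a == 3.
mask = df["a"].eq(3)

# Row labels of the surviving rows, grouped by the block they belong to.
positions = df.index.to_series()[mask]
out = positions.groupby(blocks[mask]).agg(["min", "max"])
out["max"] += 1  # make the end exclusive, as in the requested result

print(out)
```

With this mask only the two runs of 3 remain, with starts 7 and 15 and ends 11 and 17.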
You could slice the index and the column a with a mask that is True wherever the shifted value is not equal to the current value, then create a dataframe from the result. The result can also include the original value of the column a.
m = df['a'].ne(df['a'].shift())
res = pd.DataFrame({'a':df.loc[m,'a'],
                    'starts':df.index[m]})
res['ends'] = res['starts'].shift(-1, fill_value=len(df))
print(res)
a starts ends
0 1 0 3
3 2 3 7
7 3 7 11
11 1 11 13
13 2 13 15
15 3 15 17
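On the question of cost, the same change-point idea can also be written directly in NumPy. The sketch below is an alternative to the pandas answers rather than a restatement of them, and whether it is actually faster for a given dataset is an assumption that would need timing; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 2, 2, 3, 3])

# Positions where a value differs from the one before it.
change = np.flatnonzero(a[1:] != a[:-1]) + 1

# Every run starts at 0 or at a change point, and ends at the next start.
starts = np.concatenate(([0], change))
ends = np.concatenate((change, [len(a)]))

# Follow-up: keep only the runs whose value is 3.
keep = a[starts] == 3
print(starts[keep], ends[keep])
```

For a DataFrame column, a would be df["a"].to_numpy().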
I have a pandas table below, which can be copy/pasted and read in with pd.read_clipboard(). The rows come in groups of 5, as you can see from column y1, and within each group I need to take a slice of the first 2 consecutive values and a slice of the next 3 consecutive values.
So rows 0,1 form one slice, rows 2,3,4 form the next, and the pattern continues for each group of 5. I need to slice the entire list this way.
In column what, 14,1 is the first slice and 4,10,8 is the second, and it is the same for every group of 5.
what W1 W2 W8 W9 W0 y Y x y4 y1
0 14 4 14 12 14 2 15 4 7 1 1
1 1 11 1 3 1 13 0 14 8 10 1
2 4 14 4 6 4 8 5 5 13 13 1
3 10 0 10 8 10 6 11 9 3 8 1
4 8 2 8 10 8 4 9 12 1 8 1
5 15 15 13 11 15 0 4 15 4 11 11
6 11 11 9 15 11 4 0 9 0 2 11
7 9 9 11 13 9 6 2 0 2 10 11
8 2 2 0 6 2 13 9 9 9 0 11
9 0 0 2 4 0 15 11 15 11 10 11
10 4 6 4 13 4 12 13 6 7 9 9
11 9 11 9 0 9 1 0 1 10 2 9
12 3 1 3 10 3 11 10 10 0 7 9
13 2 0 2 11 2 10 11 3 1 10 9
14 10 8 10 3 10 2 3 12 9 14 9
15 13 13 5 14 13 2 6 13 2 11 11
16 11 11 3 8 11 4 0 4 4 8 11
17 4 4 12 7 4 11 15 7 11 4 11
18 8 8 0 11 8 7 3 7 7 4 11
19 4 4 12 7 4 11 15 9 11 7 11
I have tried this which gives the right results, but it doesn't repeat.
In [1540]: df['what'][:].to_numpy()[0:2:]
Out[1540]: array([14, 1], dtype=int8)
In [1538]: df['what'][2:].to_numpy()[0:3:]
Out[1538]: array([ 4, 10, 8], dtype=int8)
which is exactly what I want, but it doesn't continue slicing to the end of the list. I want it to keep slicing so that I get all the slices, like below:
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4]) and, on the flip side, array([14, 1, 15, 11, 4, 9, 13, 11])
How do I change my code, or use pandas .loc/.iloc or numpy slicing, to continue slicing like my examples for the entire set?
The reason I need this is that I need to XOR the first two values of each group by one number, and the next three values by a separate number. I'd like to XOR the first two and set the values in another column, and then XOR the next three and set their values in another column, at the correct index locations.
Thanks in advance.
Convert the data to a NumPy array, then index it with a boolean array of True and False values.
np.resize repeats the boolean pattern until it matches the size of the what array.
import numpy as np
#create array
what = df.what.to_numpy()
what
array([14, 1, 4, 10, 8, 15, 11, 9, 2, 0, 4, 9, 3, 2, 10, 13, 11,
4, 8, 4], dtype=int64)
#create array of boolean
#ignore first two entries, gimme the next three entries
index = np.array([False,False,True,True,True])
#resize index to match size of what array
index = np.resize(index,what.shape[0])
what[index]
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4], dtype=int64)
#reverse the direction of the boolean
#keep first two entries, ignore next three
what[~index]
array([14, 1, 15, 11, 4, 9, 13, 11], dtype=int64)
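The question's end goal was to XOR the two kinds of slices by different numbers and store the results in new columns at the matching index locations. A minimal sketch of that last step, reusing the boolean index from above; the constants KEY_PAIR and KEY_TRIPLE and the column names xor_pair and xor_triple are hypothetical, since the question gives neither.

```python
import numpy as np
import pandas as pd

# Only the "what" column from the question is needed here.
df = pd.DataFrame({"what": [14, 1, 4, 10, 8, 15, 11, 9, 2, 0,
                            4, 9, 3, 2, 10, 13, 11, 4, 8, 4]})
what = df["what"].to_numpy()

# False, False, True, True, True repeated to the length of the column.
index = np.resize(np.array([False, False, True, True, True]), what.shape[0])

KEY_PAIR = 5    # hypothetical XOR constant for the first two rows of each group
KEY_TRIPLE = 9  # hypothetical XOR constant for the next three rows

# Building a Series on the selected row labels lets pandas align each result
# with its original row; rows outside the selection become NaN.
df["xor_pair"] = pd.Series(what[~index] ^ KEY_PAIR, index=df.index[~index])
df["xor_triple"] = pd.Series(what[index] ^ KEY_TRIPLE, index=df.index[index])

print(df.head(10))
```

Because of the NaN gaps both new columns end up as floats. If a single integer column is acceptable, np.where(index, what ^ KEY_TRIPLE, what ^ KEY_PAIR) avoids the gaps.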
I have a dataframe that holds multiple data series in 2 columns (0, 1). The data is composed of different iterations of a measurement and is structured like so:
df = pd.DataFrame({
0: ['user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10],
1: ['iteration=0', 'y',5, 7, 9, 12, 'iteration=1', 'y',20, 8, 12, 12, 'iteration=2', 'y',3, 17, 19, 112]
})
0 user iteration=0
1 x y
2 1 5
3 4 7
4 7 9
5 10 12
6 user iteration=1
7 x y
8 1 20
9 4 8
10 7 12
11 10 12
12 user iteration=2
13 x y
14 1 3
15 4 17
16 7 19
17 10 112
I want to plot x vs y grouped by iteration.
I am trying to do this by first creating a single dataframe with the iteration as a column to perform the groupby on:
1 x y iteration
2 1 5 0
3 4 7 0
4 7 9 0
5 10 12 0
8 1 20 1
9 4 8 1
10 7 12 1
11 10 12 1
14 1 3 2
15 4 17 2
16 7 19 2
17 10 112 2
To create this joined dataframe, I implemented this code :
meta=df.loc[df[0]=='user']
lst=[]
ind=0
for index, row in meta.iterrows():
    if index==0: #continue to start loop from second value
        continue
    splitvalue = meta.loc[ind][1].split('=')[1]
    print (splitvalue)
    temp=df.iloc[ind:index].copy() #rows of the previous block
    temp['iteration']=splitvalue
    ind=index
    lst.append(temp)
pd.concat(lst)
Is there a way to create this joined dataframe without creating lists of sub-dataframes? Or is there a way to plot directly from the original dataframe?
You can use:
numeric=~pd.Series([isinstance(key,str) for key in df[0]])
iterations=df[1].where(df[1].str.contains('=').fillna(False)).ffill()
iterations=[int(key.replace('iteration=','')) for key in iterations]
df['iterations']=iterations
df=df.loc[numeric]
df.columns=['x','y','iteration']
df.reset_index(drop=True,inplace=True)
print(df)
x y iteration
0 1 5 0
1 4 7 0
2 7 9 0
3 10 12 0
4 1 20 1
5 4 8 1
6 7 12 1
7 10 12 1
8 1 3 2
9 4 17 2
10 7 19 2
11 10 112 2
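A variant of the same idea that stays within pandas string methods, offered only as a sketch. It assumes every block starts with a row whose first column is 'user' and whose second column looks like 'iteration=N'; the names raw, is_header, is_data and tidy are illustrative.

```python
import pandas as pd

raw = pd.DataFrame({
    0: ['user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10, 'user', 'x', 1, 4, 7, 10],
    1: ['iteration=0', 'y', 5, 7, 9, 12, 'iteration=1', 'y', 20, 8, 12, 12,
        'iteration=2', 'y', 3, 17, 19, 112],
})

# Rows that open a new block carry the iteration label in column 1.
is_header = raw[0].eq('user')

# Pull the number out of 'iteration=N' on header rows and carry it down the block.
iteration = (raw[1].where(is_header)
                   .str.extract(r'=(\d+)', expand=False)
                   .ffill()
                   .astype(int))

# Data rows are the ones whose first column is numeric.
is_data = pd.to_numeric(raw[0], errors='coerce').notna()

tidy = (raw.loc[is_data]
           .rename(columns={0: 'x', 1: 'y'})
           .astype(int)
           .assign(iteration=iteration[is_data])
           .reset_index(drop=True))

print(tidy)
```

From there, looping over tidy.groupby('iteration') and plotting x against y for each group is one way to get one line per iteration.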
import pandas as pd
import numpy as np
data=[]
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two data frames, df_train and df_test, such that no value of column 'C' appears in both sets. E.g. the element 5 in column C should be in either the training set or the testing set, so the rows [0, 10, 5] and [0, 12, 5] will both go in the training set or both go in the testing set, but not one in each. The choice of elements of column C should be made randomly.
I am stuck on this step and cannot proceed.
First shuffle your df with sample, then group by C and use cumcount to number the rows within each group, so that the duplicated C values within the same group can be told apart.
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
The question is not very clear about what the expected train and test dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe by column C:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
uniqe_c = df_sorted['C'].unique().tolist()
print(uniqe_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= uniqe_c[2]]
val_set = df[df['C'] > uniqe_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
Of the 11 samples, 6 go to the train set and 5 go to the validation set after the split, so no samples are missing from the two dataframes combined.
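If the C values should be assigned to the two sets at random, as the question asks, one option is to shuffle the unique values of C and cut that list instead of cutting the sorted one. This is a sketch under assumptions: the 40% test share and the seed are arbitrary, and rng, groups, n_test and in_test are illustrative names.

```python
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8],
        [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)

rng = np.random.default_rng(0)  # fixed seed only to make the sketch repeatable

# Shuffle the distinct C values, then send the first share of them to the test set.
groups = rng.permutation(df['C'].unique())
n_test = max(1, int(round(0.4 * len(groups))))  # assumed 40% of the groups
test_groups = groups[:n_test]

in_test = df['C'].isin(test_groups)
df_test = df[in_test]
df_train = df[~in_test]

print(df_train)
print(df_test)
```

Every row with a given C value lands in the same set, so no C value is shared between df_train and df_test. scikit-learn's GroupShuffleSplit provides a similar group-aware split if that library is already in use.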
I have the following ModelFrame
import pandas as pd
import pandas_ml as pdml
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
dfml = pdml.ModelFrame(df)
In[20]: dfml
Out[20]:
A B
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
7 8 10
8 9 11
9 10 12
Added scaling
dfml['A'] = dfml.preprocessing.StandardScaler().fit_transform(dfml['A'])
0 -1.566699
1 -1.218544
2 -0.870388
3 -0.522233
4 -0.174078
5 0.174078
6 0.522233
7 0.870388
8 1.218544
9 1.566699
After that, I got the train and test datasets
X, Y = dfml.cross_validation.train_test_split()
A
4 -0.174078
3 -0.522233
7 0.870388
Eventually, I performed fit and predict and got
A PREDICTED
4 -0.174078 8
3 -0.522233 2
7 0.870388 1
And now I want to combine my predicted result with the original frame dfml and get the final result as:
A B PREDICTED
0 1 3
1 2 4
2 3 5
3 4 6 2
4 5 7 8
5 6 8
6 7 9
7 8 10 1
8 9 11
9 10 12
Is something like dfml = dfml.join(Y) possible? Or is there another approach, such as using inverse_transform?
dfml.join(Y) should work, except that you have overlapping columns named A.
Try:
dfml = dfml.join(Y[['PREDICTED']])
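The index alignment that join relies on can be seen with plain pandas. In the sketch below an ordinary DataFrame stands in for the ModelFrame, and Y is rebuilt by hand from the values shown in the question; both are assumptions made for illustration.

```python
import pandas as pd

# Original, unscaled frame from the question.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'B': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})

# Hand-built stand-in for the prediction frame, indexed like the test rows.
Y = pd.DataFrame({'A': [-0.174078, -0.522233, 0.870388],
                  'PREDICTED': [8, 2, 1]},
                 index=[4, 3, 7])

# Selecting only PREDICTED avoids the clash between the two A columns.
out = df.join(Y[['PREDICTED']])

print(out)
```

Rows without a prediction get NaN in PREDICTED. Joining onto the unscaled frame also keeps the original values of A, which may make inverse_transform unnecessary for this particular result.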