I have the following ModelFrame:
import pandas as pd
import pandas_ml as pdml
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
dfml = pdml.ModelFrame(df)
In[20]: dfml
Out[20]:
A B
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
7 8 10
8 9 11
9 10 12
Then I scaled column A:
dfml['A'] = dfml.preprocessing.StandardScaler().fit_transform(dfml['A'])
0 -1.566699
1 -1.218544
2 -0.870388
3 -0.522233
4 -0.174078
5 0.174078
6 0.522233
7 0.870388
8 1.218544
9 1.566699
After that, I split the data into train and test datasets:
X, Y = dfml.cross_validation.train_test_split()
A
4 -0.174078
3 -0.522233
7 0.870388
Finally, I ran fit and predict and got:
A PREDICTED
4 -0.174078 8
3 -0.522233 2
7 0.870388 1
Now I want to combine my predicted result with the original frame dfml and get this final result:
A B PREDICTED
0 1 3
1 2 4
2 3 5
3 4 6 2
4 5 7 8
5 6 8
6 7 9
7 8 10 1
8 9 11
9 10 12
Is something like dfml = dfml.join(Y) possible? Or is there another approach, for example one that uses inverse_transform?
dfml.join(Y) should work, except that you have overlapping columns named A.
Try:
dfml = dfml.join(Y[['PREDICTED']])
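As a rough sketch of the join above, here is the same idea with plain pandas instead of pandas_ml (that substitution is my assumption, as is standardizing by hand rather than with StandardScaler). The frame Y is a stand-in built from the values shown in the question. Joining the whole of Y would raise a ValueError about overlapping columns unless lsuffix or rsuffix is given, which is why only PREDICTED is selected. The last line undoes the scaling so that A shows the original values again.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1, 11), 'B': np.arange(3, 13)})

# Standardize A by hand (population std, ddof=0) so the inverse is explicit.
mean, std = df['A'].mean(), df['A'].std(ddof=0)
df['A'] = (df['A'] - mean) / std

# Stand-in for the prediction frame; values copied from the question.
Y = pd.DataFrame({'A': [-0.174078, -0.522233, 0.870388],
                  'PREDICTED': [8, 2, 1]},
                 index=[4, 3, 7])

# Join only the new column; rows without a prediction get NaN.
result = df.join(Y[['PREDICTED']])

# Undo the scaling to recover the original A values.
result['A'] = (result['A'] * std + mean).round().astype(int)
print(result)
```

Because of the NaN entries, PREDICTED comes back as a float column.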
I have a small subset of data here:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df1 is a column of data and df2 is a row of data. I need a 3x9 dataframe in which the row is multiplied by each value in the column.
The end result should look like:
df3 = [2 4 2 4 2 4 2 4 2
4 8 4 8 4 8 4 8 4
6 12 6 12 6 12 6 12 6 ]
The way I currently have it for my larger dataset, only a few datapoints are correctly multiplied and most are NaNs.
The dot product is one solution to this problem:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.DataFrame(time)
# use dot
df3 = df1.dot(df2.T)
df3
Output
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
Try this:
df1.dot(df2.to_frame().T)
Output:
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
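For what it's worth, the NaNs most likely come from index alignment: DataFrame * Series matches the Series index against the DataFrame columns, so only the label 0 lines up. A small sketch of that, plus np.outer as an alternative to dot (my suggestion, not something from the answers above):

```python
import numpy as np
import pandas as pd

days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]

# DataFrame * Series aligns the Series index with the DataFrame columns,
# so only column 0 is multiplied and the other eight columns are NaN.
aligned = pd.DataFrame(days) * pd.Series(time)

# The outer product multiplies every value in days by every value in time.
df3 = pd.DataFrame(np.outer(days, time))
print(df3)
```

np.outer works on the raw lists, so no labels are involved and nothing can be misaligned.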
I have a pandas table below which can be copied and read in with pd.read_clipboard(). I need to take a slice of 2 consecutive values, then a slice of 3 consecutive values, and so on through the whole list, as you can see from column y1. So rows 0,1 are one group, rows 2,3,4 are the next group, and the pattern continues for each block of 5. I need to slice the entire list into these groups. Each block of 5 consists of a group of 2 followed by a group of 3.
So 14,1 is a group and 4,10,8 is a group, and the same holds for every block of 5.
what W1 W2 W8 W9 W0 y Y x y4 y1
0 14 4 14 12 14 2 15 4 7 1 1
1 1 11 1 3 1 13 0 14 8 10 1
2 4 14 4 6 4 8 5 5 13 13 1
3 10 0 10 8 10 6 11 9 3 8 1
4 8 2 8 10 8 4 9 12 1 8 1
5 15 15 13 11 15 0 4 15 4 11 11
6 11 11 9 15 11 4 0 9 0 2 11
7 9 9 11 13 9 6 2 0 2 10 11
8 2 2 0 6 2 13 9 9 9 0 11
9 0 0 2 4 0 15 11 15 11 10 11
10 4 6 4 13 4 12 13 6 7 9 9
11 9 11 9 0 9 1 0 1 10 2 9
12 3 1 3 10 3 11 10 10 0 7 9
13 2 0 2 11 2 10 11 3 1 10 9
14 10 8 10 3 10 2 3 12 9 14 9
15 13 13 5 14 13 2 6 13 2 11 11
16 11 11 3 8 11 4 0 4 4 8 11
17 4 4 12 7 4 11 15 7 11 4 11
18 8 8 0 11 8 7 3 7 7 4 11
19 4 4 12 7 4 11 15 9 11 7 11
I have tried this which gives the right results, but it doesn't repeat.
In [1540]: df['what'][:].to_numpy()[0:2:]
Out[1540]: array([14, 1], dtype=int8)
In [1538]: df['what'][2:].to_numpy()[0:3:]
Out[1538]: array([ 4, 10, 8], dtype=int8)
which is exactly what I want, but it doesn't continue slicing to the end of the list. I want it to keep slicing so that I get all the groups, like below:
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4]) and, on the flip side, array([14, 1, 15, 11, 4, 9, 13, 11])
How do I change my code, or use pandas .loc/iloc or numpy slicing, to continue slicing like my examples for the entire set?
The reason I need this is that I need to XOR each group of two by one number and each group of three by a separate number. I'd like to XOR the groups of two and set the values in another column, then XOR the groups of three and set their values in another column, each at the correct index location.
Thanks in advance.
Convert the data to a NumPy array, then index it with a boolean mask of True and False values.
np.resize repeats the five-element mask until it matches the size of the what array.
import numpy as np
# create array
what = df.what.to_numpy()
what
array([14, 1, 4, 10, 8, 15, 11, 9, 2, 0, 4, 9, 3, 2, 10, 13, 11,
4, 8, 4], dtype=int64)
# create a boolean mask
# skip the first two entries, keep the next three
index = np.array([False,False,True,True,True])
#resize index to match size of what array
index = np.resize(index,what.shape[0])
what[index]
array([ 4, 10, 8, 9, 2, 0, 3, 2, 10, 4, 8, 4], dtype=int64)
#reverse the direction of the boolean
#keep first two entries, ignore next three
what[~index]
array([14, 1, 15, 11, 4, 9, 13, 11], dtype=int64)
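To go one step further towards the XOR part of the question, the same mask can be used to write the results back at the right index locations. This is only a sketch: the two key values and the column names xor_two and xor_three are made up for illustration.

```python
import numpy as np
import pandas as pd

# The "what" column from the question.
df = pd.DataFrame({'what': [14, 1, 4, 10, 8, 15, 11, 9, 2, 0,
                            4, 9, 3, 2, 10, 13, 11, 4, 8, 4]})

# True for the last three rows of every block of five.
is_three = np.resize([False, False, True, True, True], len(df))

KEY_TWO, KEY_THREE = 5, 9  # hypothetical XOR keys, not from the question

values = df['what'].to_numpy()
# Assigning an index-labelled Series aligns on the index;
# rows that are not in the Series become NaN.
df['xor_two'] = pd.Series(values[~is_three] ^ KEY_TWO,
                          index=df.index[~is_three])
df['xor_three'] = pd.Series(values[is_three] ^ KEY_THREE,
                            index=df.index[is_three])
print(df)
```

The new columns come out as float because of the NaN gaps. If a single integer column is acceptable, np.where(is_three, values ^ KEY_THREE, values ^ KEY_TWO) would avoid that.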
Guys,
I have a DataFrame read from a CSV file. It looks like the following (simple numbers and X are used to simplify):
X,X,X,X,X,X,X,X,X,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
Here is what I need. I want to generate 1D arrays/lists from the DataFrame above:
Remove all X values (which occur only in the first row, first column, and last column)
Create a column-wise array - 1,1,1,1,2,2,2,2,3,3,3,3,...,8,8,8,8
Create a row-wise array - 1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,...
I would like to know the simplest code for this; I'd really appreciate it.
Regards,
Wangyang
1 : iloc
df.iloc[1:, 1:-1]
1 2 3 4 5 6 7 8
1 1 2 3 4 5 6 7 8
2 1 2 3 4 5 6 7 8
3 1 2 3 4 5 6 7 8
4 1 2 3 4 5 6 7 8
2 : unstack
For column-wise
df.iloc[1:, 1:-1].unstack()
3: stack
For row-wise
df.iloc[1:, 1:-1].stack()
____________________________________________
A Numpy take
a = np.array([center for left, *center, right in zip(*map(df.get, df))][1:], dtype=int)
a
array([[1, 2, 3, 4, 5, 6, 7, 8],
[1, 2, 3, 4, 5, 6, 7, 8],
[1, 2, 3, 4, 5, 6, 7, 8],
[1, 2, 3, 4, 5, 6, 7, 8]])
Column-wise
np.ravel(a, order='F')
[1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8]
Row-wise
np.ravel(a)
# np.ravel(a, order='C')
[1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8]
For your row-wise array use,
df.iloc[1:,1:-1].stack().values
For your column-wise array use,
df.iloc[1:,1:-1].unstack().values
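Putting the pieces together, here is a self-contained sketch. It assumes the file is read with header=None so that the first row of X values is ordinary data (that matches the labels in the iloc output above, but it is my guess about how the CSV was loaded), and it uses an in-memory string in place of the real file.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file from the question.
csv_text = """X,X,X,X,X,X,X,X,X,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
X,1,2,3,4,5,6,7,8,X
"""
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Drop the first row, first column and last column, then convert to int
# (the columns are read as strings because of the X in the first row).
inner = df.iloc[1:, 1:-1].astype(int).to_numpy()

row_wise = inner.ravel()           # 1,2,...,8,1,2,...,8,...
col_wise = inner.ravel(order='F')  # 1,1,1,1,2,2,2,2,...
print(row_wise)
print(col_wise)
```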
I have a data column of 828 values. The column repeats the numbers 1-18 many times. I want a new column that takes the first 5 lines of every 18, so that it repeats 1-5 many times instead. This is not exactly what my data looks like, but the method is the same as what I want to do with my actual data.
My data column with the repeating 1-18 is stored in df_rois
I tried this line of code, but it just takes every 5th value without keeping the numbers in between:
df_rois2 = df_rois[::5]
What I currently have in df_rois:
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18]
What I want in df_rois2:
[ 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5]
You can loop through the 18-length slices and extract the first 5 elements, then use extend to add them to df_rois2:
df_rois2 = []
for i in range(0, len(df_rois), 18):
    df_rois2.extend(df_rois[i:i + 5])
print(df_rois2)
Output:
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
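If the data is (or can be turned into) a NumPy array, a loop-free variant is possible. This is a sketch under the assumption that the length is an exact multiple of 18, which holds for 828 values; the array built with np.tile is only a stand-in for the real column.

```python
import numpy as np

# Stand-in for the real column: 1-18 repeated 46 times gives 828 values.
df_rois = np.tile(np.arange(1, 19), 46)

# View the data as rows of 18, keep the first 5 columns, flatten again.
df_rois2 = df_rois.reshape(-1, 18)[:, :5].ravel()

# Equivalent mask that also works when the length is not a multiple of 18.
keep = np.arange(len(df_rois)) % 18 < 5
df_rois2_masked = df_rois[keep]

print(df_rois2[:10])
```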
import pandas as pd
import numpy as np
data=[]
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two data frames, df_train and df_test, such that no value of column 'C' appears in both sets. E.g. the element 5 in column C should be in either the training set or the testing set. So the rows [0, 10, 5], [0, 12, 5], [2, 34, 13] will go either in the training set or in the testing set, but not in both. The choice of elements of column C should be made randomly.
I am stuck on this step and cannot proceed.
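First shuffle your df with sample, then group by C and use cumcount to number the rows within each group, which tells apart the duplicated values in the same group. Rows numbered 0 form one set and rows numbered 1 the other; a row numbered 2 (index 9 below) ends up in neither.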
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
The question does not make clear what the expected train and test dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
uniqe_c = df_sorted['C'].unique().tolist()
print(uniqe_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= uniqe_c[2]]
val_set = df[df['C'] > uniqe_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
Of the 11 samples, 6 go to the train set and 5 go to the validation set after the split, so no samples are missing from the two dataframes combined.
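If the split also has to be random, as the question asks, one possible sketch is to pick the C values for the test set at random and route whole groups accordingly. The seed and the choice of 2 test groups out of 5 are arbitrary assumptions of mine.

```python
import numpy as np
import pandas as pd

columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8],
        [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Pick which distinct C values go to the test set (2 of 5 here, arbitrary).
groups = df['C'].unique()
test_groups = rng.choice(groups, size=2, replace=False)

# Every row with a chosen C value goes to test, all others to train.
in_test = df['C'].isin(test_groups)
df_test = df[in_test]
df_train = df[~in_test]

print(df_train)
print(df_test)
```

Since whole groups are moved, the number of rows in each set depends on which C values are drawn.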