I have a small subset of data here:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df1 is a column of data and df2 is a row of data. I need a 3x9 DataFrame where the row is multiplied by each value in the column.
The end result should look like:
df3 = [2 4 2 4 2 4 2 4 2
4 8 4 8 4 8 4 8 4
6 12 6 12 6 12 6 12 6 ]
The way I currently have it for my larger dataset, only a few data points are correctly multiplied and most are NaNs.
The dot product is one solution to this problem: a 3x1 matrix times a 1x9 matrix gives exactly the 3x9 result you want.
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.DataFrame(time)
# use dot: (3x1) . (1x9) -> (3x9)
df3 = df1.dot(df2.T)
df3
Output
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
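For context, the NaNs in the original attempt come from label alignment: multiplying a one-column DataFrame by a Series aligns the Series' index against the DataFrame's columns, so only column 0 matches and every other cell becomes NaN. If you'd rather bypass alignment entirely, numpy's outer product gives the same 3x9 result; a minimal sketch:
import numpy as np
import pandas as pd

days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]

# np.outer multiplies every element of days by every element of time
df3 = pd.DataFrame(np.outer(days, time))
print(df3)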
Alternatively, keeping df2 as a Series as in your original code, convert it to a one-row frame on the fly:
df1.dot(df2.to_frame().T)
Output:
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
I have a carid and I would like to see all buyers who had something to do with this carid. So I would like to have all buyers who have bought carid 3.
How do I do that?
import pandas as pd
d = {'Buyerid': [1,1,2,2,3,3,3,4,5,5,5],
'Carid': [1,2,3,4,4,1,2,4,1,3,5]}
df = pd.DataFrame(data=d)
print(df)
Buyerid Carid
0 1 1
1 1 2
2 2 3
3 2 4
4 3 4
5 3 1
6 3 2
7 4 4
8 5 1
9 5 3
10 5 5
# What I want
Buyerid Carid
2 2 3
3 2 4
8 5 1
9 5 3
10 5 5
I have already tested df = df.loc[df['Carid']==3,'Buyerid'], but this only gives me the rows with Carid 3, not all the rows of the matching buyers.
How to select rows from a DataFrame based on column values
I looked at that, but I only get this:
Buyerid Carid
2 2 3
9 5 3
Do the following:
import pandas as pd
d = {'Buyerid': [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5],
'Carid': [1, 2, 3, 4, 4, 1, 2, 4, 1, 3, 5]}
df = pd.DataFrame(data=d)
# all buyers who bought carid 3
buyers = set(df.loc[df['Carid'] == 3, 'Buyerid'])
# boolean mask for filtering
mask = df['Buyerid'].isin(buyers)
print(df[mask])
Output
Buyerid Carid
2 2 3
3 2 4
8 5 1
9 5 3
10 5 5
You can use df.loc to get the matching buyer IDs, then filter the frame with isin:
df[df['Buyerid'].isin(df.loc[df['Carid'] == 3, 'Buyerid'])]
I want to swap all the values of my data frame. The largest value must be replaced with the smallest (i.e. 7 with 1, 6 with 2, 5 with 3, 4 with 4, 3 with 5, and so on).
import numpy as np
import pandas as pd
import io
data = '''
Values
6
1
3
7
5
2
4
1
4
7
2
5
'''
df = pd.read_csv(io.StringIO(data))
Trial
First I want to get all the unique values from my data.
df1 = df.Values.unique()
print(df1)
[6 1 3 7 5 2 4]
I have sorted it in ascending order:
sorted1 = list(np.sort(df1))
print(sorted1)
[1, 2, 3, 4, 5, 6, 7]
Then I reverse-sorted the list:
rev_sorted = list(reversed(sorted1))
print(rev_sorted)
[7, 6, 5, 4, 3, 2, 1]
Now I need to replace the max value with the min value and so on in my main data set (df). The old values can be replaced, or a new column can be added.
Expected Output:
Values,New_Values
6,2
1,7
3,5
7,1
5,3
2,6
4,4
1,7
4,4
7,1
2,6
5,3
Here's a vectorized one: np.unique with return_inverse=True returns the sorted unique values m and each element's rank n, so indexing m with the mirrored ranks n.max()-n swaps largest with smallest -
In [51]: m,n = np.unique(df['Values'], return_inverse=True)
In [52]: df['New_Values'] = m[n.max()-n]
In [53]: df
Out[53]:
Values New_Values
0 6 2
1 1 7
2 3 5
3 7 1
4 5 3
5 2 6
6 4 4
7 1 7
8 4 4
9 7 1
10 2 6
11 5 3
Translating to pandas with pandas.factorize -
m, n = pd.factorize(df.Values, sort=True)  # m: integer codes, n: sorted uniques
df['New_Values'] = n[m.max() - m]          # mirror each code, then look up its value
Use Series.map with a dictionary built from the sorted and reverse-sorted lists:
df['New'] = df['Values'].map(dict(zip(sorted1, rev_sorted)))
print(df)
Values New
0 6 2
1 1 7
2 3 5
3 7 1
4 5 3
5 2 6
6 4 4
7 1 7
8 4 4
9 7 1
10 2 6
11 5 3
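For reference, if you'd rather not keep the intermediate sorted1 and rev_sorted lists around, the same dictionary can be built directly from the sorted uniques; a small sketch of the idea:
import numpy as np

# df is the frame from the question
u = np.sort(df['Values'].unique())  # [1 2 3 4 5 6 7]
# pair each sorted unique with its mirror: 1->7, 2->6, ..., 7->1
df['New'] = df['Values'].map(dict(zip(u, u[::-1])))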
Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this, of course, is that rows 5 & 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.
One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: generally you don't need to be doing this (it's unclear why you'd want to!). Also, since your columns will mostly be strings, be aware that Python's hash() is randomized per interpreter session (PYTHONHASHSEED), so these keys will not be stable across runs.
Try this:
df['key'] = df.apply(lambda x: '-'.join(map(str, x)), axis=1)  # cast to str so join also works on numeric columns
m = ~df['key'].duplicated()
s = (df.groupby(m.cumsum()).cumcount() + 1).astype(str)
df['key'] = df['key'] + '_' + s
print(df)
Output:
   a  b  c  d  e          key
0  1  2  3  4  5  1-2-3-4-5_1
1  1  2  3  4  6  1-2-3-4-6_1
2  1  2  3  4  7  1-2-3-4-7_1
3  1  2  2  5  6  1-2-2-5-6_1
4  2  3  4  5  6  2-3-4-5-6_1
5  2  3  4  5  6  2-3-4-5-6_2
6  3  4  5  6  7  3-4-5-6-7_1
7  1  2  3  4  5  1-2-3-4-5_2
Another, much simpler way (add 1 to the cumcount to keep the _1-based numbering):
df['key'] = df['key'] + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Explanation:
First create your unique id by joining the row values.
Then create a sequence s using duplicated and cumsum; the counter restarts when a new key is found. (Note this can mis-number when duplicates of different keys interleave; the groupby('key').cumcount() variant above has no such restriction.)
Finally, concatenate the key and your sequence s.
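Since the edit says the columns will mostly be strings, note that the joined key can also be built without a lambda; a small sketch, assuming every column should take part in the key:
# cast all columns to str, then join each row's values with '-'
df['key'] = df.astype(str).agg('-'.join, axis=1)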
Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for __ in range(df.index.size)]
Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'],
                  data=[[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7],
                        [1, 2, 2, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6],
                        [3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000,99999), len(df), replace=False)
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8],
        [1, 2, 4], [1, 3, 4], [3, 8, 12], [4, 10, 12], [6, 7, 12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two data frames, df_train and df_test, such that no value of column 'C' appears in both sets. E.g. the element 5 of column C should be in either the training set or the testing set, so the rows [0, 10, 5] and [0, 12, 5] must go into the same set, not be split across both (and likewise [2, 34, 13] and [2, 3, 13] for C=13). Which C values go where should be chosen randomly.
I am stuck on this step and cannot proceed.
First shuffle your df with sample, then group by 'C' and take the cumcount to tell apart rows that share the same C value. (Note that splitting on s==0 vs s==1 leaves out any row whose cumcount reaches 2, like index 9, the third C=12 row below.)
s = df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test = df.loc[s[s == 1].index]
train = df.loc[s[s == 0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
The question is not so clear about what the expected output of the train and test set dataframes should look like.
Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
From 11 samples, after the split, 6 go to the train set and 5 to the validation set, so no samples are missing from the two combined dataframes.
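If the split also needs to be random, as the question asks, another option is to sample the unique C values themselves and move whole groups at once; a minimal pandas-only sketch (the 40% test fraction and the fixed random_state are arbitrary choices):
import pandas as pd

# df is the frame from the question
# randomly draw ~40% of the distinct C values for the test set
test_c = pd.Series(df['C'].unique()).sample(frac=0.4, random_state=0)

# every row whose C value was drawn goes to test, the rest to train
df_test = df[df['C'].isin(test_c)]
df_train = df[~df['C'].isin(test_c)]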
I have the following ModelFrame:
import pandas as pd
import pandas_ml as pdml
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'B': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
dfml = pdml.ModelFrame(df)
In[20]: dfml
Out[20]:
A B
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
7 8 10
8 9 11
9 10 12
I added scaling:
dfml['A'] = dfml.preprocessing.StandardScaler().fit_transform(dfml['A'])
0 -1.566699
1 -1.218544
2 -0.870388
3 -0.522233
4 -0.174078
5 0.174078
6 0.522233
7 0.870388
8 1.218544
9 1.566699
After that, I got train and test datasets:
X, Y = dfml.cross_validation.train_test_split()
A
4 -0.174078
3 -0.522233
7 0.870388
Eventually, I performed fit and predict and got
A PREDICTED
4 -0.174078 8
3 -0.522233 2
7 0.870388 1
Now I want to combine my predicted result with the original frame dfml to get a final result like:
A B PREDICTED
0 1 3
1 2 4
2 3 5
3 4 6 2
4 5 7 8
5 6 8
6 7 9
7 8 10 1
8 9 11
9 10 12
Is something like dfml = dfml.join(Y) possible? Or is there another approach, perhaps using inverse_transform?
dfml.join(Y) should work, except that you have overlapping columns named A.
Try:
dfml = dfml.join(Y[['PREDICTED']])
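Selecting just the PREDICTED column avoids the overlapping A column entirely. If you do want to keep both A columns, note that DataFrame.join can disambiguate them with a suffix instead; a small sketch (the suffix name is an arbitrary choice):
# keep both A columns, renaming the one that comes from Y
dfml = dfml.join(Y, rsuffix='_pred')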