Sampling with fixed column ratio in pandas - python

I have this dataframe:
import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F']
}
# Creating a dataframe
df = pd.DataFrame(record)
I would like to create, for example, 2 samples of this dataframe while keeping a fixed 50-50 ratio on the Sex column.
I tried like this:
df_dict = {}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
F1 F2 Sex
1 x2 a2 M
3 x4 a4 M
4 x5 a5 M
0 x1 a1 F
Any help?

This might not be the best approach, but I believe it can help you solve your problem:
n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
pd.concat([fDf, mDf])
Output
F1 F2 Sex
0 x1 a1 F
2 x3 a3 F
5 x6 a6 M
1 x2 a2 M

This should also work:
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))

Don't use frac, which gives you a fraction of each group; use n, which gives you a fixed number of rows per group:
df.groupby('Sex').sample(n=2)
example output:
F1 F2 Sex
2 x3 a3 F
0 x1 a1 F
3 x4 a4 M
4 x5 a5 M
Using a custom ratio:
import numpy as np

ratios = {'F': 0.4, 'M': 0.6}  # the ratios should sum to 1
# total number of rows desired
total = 4
# note that the exact number of rows in the output depends on the
# rounding method used to convert the per-group counts to int:
# round should give the correct total, but floor/ceil might
# under/over-sample (see below for an example)
s = pd.Series(ratios) * total
# convert to integer (choose your method: ceil/floor/round...)
s = np.ceil(s).astype(int)
df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
F1 F2 Sex
0 x1 a1 F
6 x7 a7 F
4 x5 a5 M
3 x4 a4 M
1 x2 a2 M
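To produce the several samples the original question asked for, here is a minimal sketch combining the question's loop with GroupBy.sample (available since pandas 1.1); offsetting random_state is just one way to make the samples differ:
df_dict = {}
for i in range(2):
    # fixed n per Sex group keeps the ratio constant in every sample
    df_dict['df{}'.format(i)] = df.groupby('Sex').sample(n=2, random_state=123 + i)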

Related

how to solve pandas multi-column explode issue?

I am trying to explode multiple columns at a time, systematically, so that each list element ends up on its own row (the input dataframe resembles the setup shown in the last answer below).
I tried:
df = df.explode('sauce', 'meal')
but this only explodes the first column (sauce in this case); the second one is not exploded.
I also tried:
df = df.explode(['sauce', 'meal'])
but this code raises a
ValueError: column must be a scalar
error.
I tried a couple of approaches from similar questions; none worked.
Note: I cannot move these columns to the index, as there are some non-unique values in the fruits column.
Prior to pandas 1.3.0 use:
df.set_index(['fruits', 'veggies'])[['sauce', 'meal']].apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Many columns? Try:
df.set_index(df.columns.difference(['sauce', 'meal']).tolist())\
.apply(pd.Series.explode).reset_index()
Output:
fruits veggies sauce meal
0 x1 y2 a d
1 x1 y2 b e
2 x1 y2 c f
3 x2 y2 g k
4 x2 y2 h l
Update your version of Pandas
# Setup
df = pd.DataFrame({'fruits': ['x1', 'x2'],
                   'veggies': ['y1', 'y2'],
                   'sauce': [list('abc'), list('gh')],
                   'meal': [list('def'), list('kl')]})
print(df)
# Output
fruits veggies sauce meal
0 x1 y1 [a, b, c] [d, e, f]
1 x2 y2 [g, h] [k, l]
Explode (Pandas 1.3.5):
out = df.explode(['sauce', 'meal'])
print(out)
# Output
fruits veggies sauce meal
0 x1 y1 a d
0 x1 y1 b e
0 x1 y1 c f
1 x2 y2 g k
1 x2 y2 h l
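If the code has to run on both older and newer pandas, a possible sketch dispatches on the installed version (this assumes the packaging distribution is available, which is common but not guaranteed):
from packaging import version
import pandas as pd

# multi-column explode was added in pandas 1.3.0;
# fall back to the set_index/apply approach on older versions
if version.parse(pd.__version__) >= version.parse('1.3.0'):
    out = df.explode(['sauce', 'meal'])
else:
    out = (df.set_index(['fruits', 'veggies'])[['sauce', 'meal']]
             .apply(pd.Series.explode)
             .reset_index())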

Map values based off matched columns - Python

I want to map values based on how two columns match up. For instance, the df below contains different labels, A or B. I want to assign a new column that describes these labels, determined by comparing columns Z L and Z P. Z L will always contain values from ['X1','X2','X3','X4'], while Z P will correspondingly contain values from ['LA','LB','LC','LD'].
These will always be in ascending order or reverse order. Ascending order means X1 corresponds to LA, X2 corresponds to LB, etc. Reverse order means X1 corresponds to LD, X2 corresponds to LC, etc.
If the order is ascending I want to map an R. If reversed, an L.
import numpy as np
import pandas as pd

X = ['X1','X2','X3','X4']
R = ['LA','LB','LC','LD']
L = ['LD','LC','LB','LA']
df = pd.DataFrame({
    'Period': [1,1,1,1,1,2,2,2,2,2],
    'labels': ['A','B','A','B','A','B','A','B','A','B'],
    'Z L': [np.nan,np.nan,'X3','X2','X4',np.nan,'X2','X3','X3','X1'],
    'Z P': [np.nan,np.nan,'LC','LC','LD',np.nan,'LC','LC','LB','LA'],
})
df = df.dropna()
This is the dataset after dropping NaNs, used to determine the combinations. I have a large df with repeated combinations, so I'm not too concerned with returning all of them; I'm mainly concerned with the unique Mapped values for each Period.
Period labels Z L Z P
2 1 A X3 LC
3 1 B X2 LC
4 1 A X4 LD
6 2 A X2 LC
7 2 B X3 LC
8 2 A X3 LB
9 2 B X1 LA
Attempt:
labels = df['labels'].unique().tolist()
I = df.loc[df['labels'] == labels[0]]
J = df.loc[df['labels'] == labels[1]]
I['Map'] = ((I['Z L'].isin(X)) | (I['Z P'].isin(R))).map({True:'R', False:'L'})
J['Map'] = ((J['Z L'].isin(X)) | (J['Z P'].isin(R))).map({True:'R', False:'L'})
If I drop duplicates from period and labels the intended df is:
Period labels Map
0 1 A R
1 1 B L
2 2 A L
3 2 B R
Here's my approach:
# the ascending orders
lst1,lst2 = ['X1','X2','X3','X4'], ['LA','LB','LC','LD']
# enumerate the orders
d1, d2 = ({v:k for k,v in enumerate(l)} for l in (lst1, lst2))
# check if the enumerations in `Z L` and `Z P` are the same
df['Map'] = np.where(df['Z L'].map(d1)== df['Z P'].map(d2), 'R', 'L')
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
and df.drop_duplicates(['Period', 'labels']):
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
6 2 A X2 LC L
7 2 B X3 LC R
You said your data is always in either ascending or reversed order, so you only need to define one fixed mapping between Z L and Z P for the ascending (R) case and check against it: if a row matches, it is R, otherwise L. The solution may be reduced to this:
r_dict = dict(zip(['X1','X2','X3','X4'], ['LA','LB','LC','LD']))
df['Map'] = (df['Z L'].map(r_dict) == df['Z P']).map({True: 'R', False: 'L'})
Output:
Period labels Z L Z P Map
2 1 A X3 LC R
3 1 B X2 LC L
4 1 A X4 LD R
6 2 A X2 LC L
7 2 B X3 LC R
8 2 A X3 LB L
9 2 B X1 LA R
For the bottom desired output, just drop_duplicates as in QuangHoang's answer, as sketched below.
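A minimal sketch of that final step (the column selection is assumed from the desired output):
out = (df.drop_duplicates(['Period', 'labels'])
         [['Period', 'labels', 'Map']]
         .reset_index(drop=True))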

pandas outcome variable is NaN

I have set the outcome variable y as a column in a csv. It loads properly and works when I print just y, but when I use y = y[x:] I start getting NaN as values.
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
Then later in the file I print the outcome column. final_df is a dataframe which does not yet have the outcome variable set, so I set it below:
final_df['outcome'] = y
print(final_df['outcome'])
But the outcome is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
It looks like the last value is correct (they should all be 'W' or 'L').
How can I line up my data frames properly so I do not get NaN?
Entire Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
# Predictor variables
X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
    X = previous_games_stats[['GF', 'GA']]
    X = X[count:numGamesToLookBack] # num games to look back
    stats_feature_names = list(X.columns.values)
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
    final_df = final_df.append(stats_df, ignore_index=True)
    count += 1
    numGamesToLookBack += 1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
This is expected, because y has no values for the first 9 indices, so after assigning it back you get NaNs (pandas aligns on the index).
If the column is new and the length of y is the same as the length of the dataframe, assign the underlying numpy array:
final_df['outcome'] = y.values
But if the lengths differ, it is a bit more complicated, because the lengths need to match:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
If len(final_df) < len(y):
Trim y to the length of final_df, then convert it to a numpy array so the indices are not aligned:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
If len(final_df) > len(y):
Create a new Series from y's values, indexed by the first len(y) index labels of final_df:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN
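Applied back to the question (a sketch, assuming y has at least len(final_df) values): final_df is built with a fresh 0-9 RangeIndex while y keeps indices 9, 10, ..., so only index 9 overlaps, which is exactly the single non-NaN row in the output. The first case above fits:
# take the first len(final_df) values of y positionally, ignoring the index
final_df['outcome'] = y.iloc[:len(final_df)].values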

Group data frame given that they have something in common

I have a pandas dataframe of over 1000 lines that looks somewhat like this:
Copy name type ntv
G1 BA X 0.45
G1 BB X 0.878
G1 C Z 0.19
G1 LA1 Y 1.234
G1 L Y 0.09
G1 LB Y 1.056
F2 BA1 X -7.890
F2 BB X 2.345
F2 MA Y -0.871
F2 LB1 Y 0.737
In the example above (df1), the 'Copy' column has two groups, G1 and F2, with various names and three types: X, Y and Z.
I would like to create another data frame (df2) that looks like the one below, where they are grouped together in the form X-Y or Z-Y.
Model ntv_1 ntv_2
G1BA-LA1 0.45 1.234
G1BB-LB 0.878 1.056
G1C-L 0.19 0.09
F2BA1-MA -7.890 -0.871
F2BB-LB1 2.345 0.737
For an X-Y pair, the two rows have the second character of df1['name'] in common. So I decided to approach it this way:
c = df1[(df1['name'].str[0]=='B') & (df1['ntv'] != 0.0)]
h = df1[((df1['name'].str[0]=='L')|(df1['name'].str[0]=='M')) & (df1['ntv'] != 0.0)]
b = (c.loc[:,c['name'].str[1]] == h.loc[:,h['name'].str[1]]).groupby('Copy')
df2['Model'] = c['Copy'].astype(str) + c['name'].astype(str) + '-' + h['name'].astype(str)
df2['ntv_1'] = c['ntv']
df2['ntv_2'] = h['ntv']
I got a KeyError message. So I decided to do this:
ca = c['name'].str[1].dropna()
ha = h['name'].str[1].dropna()
if ca == ha:
    df2['Model'] = c['Copy'].astype(str) + c['name'].astype(str) + '-' + h['name'].astype(str)
    df2['ntv_1'] = c['ntv']
    df2['ntv_2'] = h['ntv']
But I got a ValueError: "Series length must match to compare."
How can I group the dataframe into the form X-Y or Z-Y? Thanks in advance!
The problem is that c and h are not aligned, because they have different indices and possibly different lengths:
# added a condition to remove all rows with no second character in name
c = df1[(df1['name'].str[0]=='B') & (df1['ntv'] != 0.0) &
        (df1['name'].str[1].notnull())].copy()
# create a MultiIndex (second letter + cumulative counter) to align duplicates
ca = c['name'].str[1]
c.index = [ca, c.groupby(ca).cumcount()]
# added a condition to remove all rows with no second character in name
h = df1[((df1['name'].str[0]=='L')|(df1['name'].str[0]=='M')) &
        (df1['ntv'] != 0.0) & (df1['name'].str[1].notnull())].copy()
# create a MultiIndex (second letter + cumulative counter) to align duplicates
ha = h['name'].str[1]
h.index = [ha, h.groupby(ha).cumcount()]
print (c)
Copy name type ntv
name
A 0 G1 BA X 0.450
B 0 G1 BB X 0.878
A 1 F2 BA1 X -7.890
B 1 F2 BB X 2.345
print (h)
Copy name type ntv
name
A 0 G1 LA1 Y 1.234
B 0 G1 LB Y 1.056
A 1 F2 MA Y -0.871
B 1 F2 LB1 Y 0.737
# join the DataFrames together
df2 = pd.concat([c, h.add_suffix('_2')], axis=1)
# with real data it is possible the frames are not aligned and you get NaNs;
# to remove all NaN rows use:
# df2 = df2.dropna()
df2['Model'] = df2['Copy'].astype(str) + df2['name'].astype(str) + '-' + df2['name_2'].astype(str)
# filter the columns and remove the MultiIndex
df2 = df2[['Model','ntv','ntv_2']].reset_index(drop=True)
print (df2)
Model ntv ntv_2
0 G1BA-LA1 0.450 1.234
1 G1BB-LB 0.878 1.056
2 F2BA1-MA -7.890 -0.871
3 F2BB-LB1 2.345 0.737
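An alternative sketch that merges on an explicit key instead of aligning on a MultiIndex (same filters as above; the inner merge drops rows without a partner, such as the single-letter name 'L'):
# pair each B* row with the L*/M* row sharing Copy and the second letter of name
c = df1[df1['name'].str[0].eq('B') & df1['ntv'].ne(0)].copy()
c['key'] = c['name'].str[1]
h = df1[df1['name'].str[0].isin(['L', 'M']) & df1['ntv'].ne(0)].copy()
h['key'] = h['name'].str[1]
df2 = c.merge(h, on=['Copy', 'key'], suffixes=('', '_2'))
df2['Model'] = df2['Copy'] + df2['name'] + '-' + df2['name_2']
print (df2[['Model', 'ntv', 'ntv_2']])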

change table format of the output

I would like to change the format of my output for the following code.
import pandas as pd
x= pd.read_csv('x.csv')
y= pd.read_csv('y.csv')
z= pd.read_csv('z.csv')
list = pd.merge(x, y, how='left', on=['xx'])
list = pd.merge(list, z, how='left', on=['xx'])
columns_to_keep = ['yy','zz', 'uu']
list = list.set_index(['xx'])
list = list[columns_to_keep]
list = list.sort_index(axis=0, level=None, ascending=True, inplace=False,
                       sort_remaining=True, by=None)
with open('write.csv','w') as f:
    list.to_csv(f, header=True, index=True, index_label='xx')
from this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2 a2
2 1/8/2007 3 a3
3 12/14/2007 4 a4
4 3/6/2008 5 a5
4 4/14/2009 6 a6
4 5/30/2008 7 a7
4 5/30/2008 8 a8
5 6/17/2007 9 a9
to this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2;3 a2;a3
3 12/14/2007 4 a4
4 3/6/2008 5;6;7;8 a5;a6;a7;a8
5 6/17/2007 9 a9
I think the following should work on the final dataframe (list), though I would suggest not using "list" as a variable name, as it shadows a built-in Python function you might want to use elsewhere. So in my code I will use "df" instead of "list":
ind = list(set(df.index.values))
finaldf = pd.DataFrame(columns=list(df.columns))
for val in ind:
    # double brackets keep a DataFrame even when val occurs only once
    tempDF = df.loc[[val]]
    print(tempDF)
    for i in range(tempDF.shape[0]):
        for jloc, j in enumerate(list(df.columns)):
            if i != 0 and j != 'date':
                finaldf.loc[val, j] += (";" + str(tempDF.iloc[i, jloc]))
            elif i == 0:
                finaldf.loc[val, j] = str(tempDF.iloc[i, jloc])
print(finaldf)
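A more idiomatic alternative is a groupby/agg sketch (assuming, as in the loop above, that 'id' is the dataframe's index): keep the first date and join the remaining columns with ';'.
finaldf = df.groupby(level=0).agg({
    'date': 'first',
    'user_id': lambda s: ';'.join(s.astype(str)),
    'user_name': lambda s: ';'.join(s.astype(str)),
})
print(finaldf)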
