How to split values of a row into different columns in pandas? - python

I have the following dataframe in pandas:
a = ['[16.01319488 6.1095932 -0.14837995]',
'[16.10400501 6.23724404 -0.1727245 ]',
'[16.195107 6.36434895 -0.19695716]',
'[16.2864465 6.49178233 -0.22124142]',
'[16.37796913 6.62041857 -0.24574078]',
'[16.46962054 6.75113206 -0.27061875]',
'[16.56134636 6.88479719 -0.29603881]',
'[16.65309334 7.02229002 -0.32216479]',
'[16.74480491 7.16448166 -0.34915957]',
'[16.83642781 7.31224812 -0.37718693]',
'[16.92790769 7.46646379 -0.4064104 ]',
'[17.0190533 7.62784345 -0.4369622 ]',
'[17.10912594 7.79646343 -0.46884957]',
'[17.19725 7.97224045 -0.50204846]']
b = [0.0,
0.01999999989745438,
0.03999999979490875,
0.05999999969236312,
0.0799999995898175,
0.09999999948727188,
0.1199999993847262,
0.1399999992821806,
0.159999999179635,
0.1799999990770894,
0.1999999989745438,
0.2199999988719981,
0.2399999987694525,
0.2599999986669069]
dDictionary = {
    'A': a,
    'B': b,
}
test = pd.DataFrame(dDictionary)
Each value in column 'A' consists of three values that I want to split into three separate columns. Is there a simple and robust way to do this?

Use Series.str.strip with Series.str.split, then cast to floats:
test[['c','d','e']] = test.A.str.strip('[]').str.split(expand=True).astype(float)
print (test)
A B c d e
0 [16.01319488 6.1095932 -0.14837995] 0.00 16.013195 6.109593 -0.148380
1 [16.10400501 6.23724404 -0.1727245 ] 0.02 16.104005 6.237244 -0.172725
2 [16.195107 6.36434895 -0.19695716] 0.04 16.195107 6.364349 -0.196957
3 [16.2864465 6.49178233 -0.22124142] 0.06 16.286447 6.491782 -0.221241
4 [16.37796913 6.62041857 -0.24574078] 0.08 16.377969 6.620419 -0.245741
5 [16.46962054 6.75113206 -0.27061875] 0.10 16.469621 6.751132 -0.270619
6 [16.56134636 6.88479719 -0.29603881] 0.12 16.561346 6.884797 -0.296039
7 [16.65309334 7.02229002 -0.32216479] 0.14 16.653093 7.022290 -0.322165
8 [16.74480491 7.16448166 -0.34915957] 0.16 16.744805 7.164482 -0.349160
9 [16.83642781 7.31224812 -0.37718693] 0.18 16.836428 7.312248 -0.377187
10 [16.92790769 7.46646379 -0.4064104 ] 0.20 16.927908 7.466464 -0.406410
11 [17.0190533 7.62784345 -0.4369622 ] 0.22 17.019053 7.627843 -0.436962
12 [17.10912594 7.79646343 -0.46884957] 0.24 17.109126 7.796463 -0.468850
13 [17.19725 7.97224045 -0.50204846] 0.26 17.197250 7.972240 -0.502048
If you need to remove A, use DataFrame.pop:
test[['c','d','e']] = test.pop('A').str.strip('[]').str.split(expand=True).astype(float)
print (test)
B c d e
0 0.00 16.013195 6.109593 -0.148380
1 0.02 16.104005 6.237244 -0.172725
2 0.04 16.195107 6.364349 -0.196957
3 0.06 16.286447 6.491782 -0.221241
4 0.08 16.377969 6.620419 -0.245741
5 0.10 16.469621 6.751132 -0.270619
6 0.12 16.561346 6.884797 -0.296039
7 0.14 16.653093 7.022290 -0.322165
8 0.16 16.744805 7.164482 -0.349160
9 0.18 16.836428 7.312248 -0.377187
10 0.20 16.927908 7.466464 -0.406410
11 0.22 17.019053 7.627843 -0.436962
12 0.24 17.109126 7.796463 -0.468850
13 0.26 17.197250 7.972240 -0.502048
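If the bracketed strings are messier (uneven spacing, stray whitespace), an alternative sketch is to parse each string with plain Python floats instead of relying on str.split; this is an illustration, not part of the answer above:

```python
import pandas as pd

a = ['[16.01319488 6.1095932 -0.14837995]',
     '[16.10400501 6.23724404 -0.1727245 ]']
test = pd.DataFrame({'A': a})

# Strip the brackets, split on whitespace, and convert each token to float.
parsed = test['A'].apply(lambda s: [float(x) for x in s.strip('[]').split()])
test[['c', 'd', 'e']] = pd.DataFrame(parsed.tolist(), index=test.index)
```

This handles any number of spaces between values and fails loudly with a ValueError on malformed tokens.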

Here is another approach to expand the column 'A' dynamically using pandas.concat.
expanded_test = (pd.concat([test['A'].str.strip('[]').str.split(expand=True)],
                           axis=1, keys=test.columns)
                )
expanded_test.columns = expanded_test.columns.map(lambda x: '_'.join((x[0], str(x[1]+1))))
out = test.join(expanded_test)
>>> print(out)
A B A_1 A_2 A_3
0 [16.01319488 6.1095932 -0.14837995] 0.00 16.01319488 6.1095932 -0.14837995
1 [16.10400501 6.23724404 -0.1727245 ] 0.02 16.10400501 6.23724404 -0.1727245
2 [16.195107 6.36434895 -0.19695716] 0.04 16.195107 6.36434895 -0.19695716
3 [16.2864465 6.49178233 -0.22124142] 0.06 16.2864465 6.49178233 -0.22124142
4 [16.37796913 6.62041857 -0.24574078] 0.08 16.37796913 6.62041857 -0.24574078
5 [16.46962054 6.75113206 -0.27061875] 0.10 16.46962054 6.75113206 -0.27061875
6 [16.56134636 6.88479719 -0.29603881] 0.12 16.56134636 6.88479719 -0.29603881
7 [16.65309334 7.02229002 -0.32216479] 0.14 16.65309334 7.02229002 -0.32216479
8 [16.74480491 7.16448166 -0.34915957] 0.16 16.74480491 7.16448166 -0.34915957
9 [16.83642781 7.31224812 -0.37718693] 0.18 16.83642781 7.31224812 -0.37718693
10 [16.92790769 7.46646379 -0.4064104 ] 0.20 16.92790769 7.46646379 -0.4064104
11 [17.0190533 7.62784345 -0.4369622 ] 0.22 17.0190533 7.62784345 -0.4369622
12 [17.10912594 7.79646343 -0.46884957] 0.24 17.10912594 7.79646343 -0.46884957
13 [17.19725 7.97224045 -0.50204846] 0.26 17.19725 7.97224045 -0.50204846


Initial value of multiple variables dataframe for time dilation

Dataframe:
product1  product2  product3  product4  product5
straws    orange    melon     chair     bread
melon     milk      book      coffee    cake
bread     melon     coffe     chair     book

CountProduct1  CountProduct2  CountProduct3  Countproduct4  Countproduct5
1              1              1              1              1
2              1              1              1              1
2              3              2              2              2

RatioProduct1  RatioProduct2  RatioProduct3  Ratioproduct4  Ratioproduct5
0.28           0.54           0.33           0.35           0.11
0.67           0.25           0.13           0.11           0.59
2.5            1.69           1.9            2.5            1.52
I want to create five other columns that keep the initial ratio of each item along the dataframe.
Output:
InitialRatio1  InitialRatio2  InitialRatio3  InitialRatio4  InitialRatio5
0.28           0.54           0.33           0.35           0.11
0.33           0.25           0.13           0.31           0.59
0.11           0.33           0.31           0.35           0.13
Check the code again: do you have a typo with product3 = coffe versus product4 = coffee? I fixed coffe to coffee, so the value 0.31 should not appear in the output.
import pandas as pd

pd.set_option('display.max_rows', None)     # print all rows
pd.set_option('display.max_columns', None)  # print all columns

df = pd.DataFrame({
    'product1': ['straws', 'melon', 'bread'],
    'product2': ['orange', 'milk', 'melon'],
    'product3': ['melon', 'book', 'coffee'],
    'product4': ['chair', 'coffee', 'chair'],
    'product5': ['bread', 'cake', 'book'],
    'time': [1, 2, 3],
    'Count1': [1, 2, 2],
    'Count2': [1, 1, 3],
    'Count3': [1, 1, 2],
    'Count4': [1, 1, 2],
    'Count5': [1, 1, 2],
    'ratio1': [0.28, 0.67, 2.5],
    'ratio2': [0.54, 0.25, 1.69],
    'ratio3': [0.33, 0.13, 1.9],
    'ratio4': [0.35, 0.11, 2.5],
    'ratio5': [0.11, 0.59, 1.52],
})
print(df)

product = df[['product1', 'product2', 'product3', 'product4', 'product5']].stack().reset_index()
count = df[['Count1', 'Count2', 'Count3', 'Count4', 'Count5']].stack().reset_index()
ratio = df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']].stack().reset_index()
print(ratio)

arr = pd.unique(product[0])
# indexes in arr of products that occur more than once
aaa = [i for i in range(len(arr)) if product[product[0] == arr[i]].count()[0] > 1]
for i in aaa:
    prod_ind = product[product[0] == arr[i]].index
    val_ratio = ratio.loc[prod_ind[0], 0]
    ratio.loc[prod_ind, 0] = val_ratio

print(ratio.pivot_table(index='level_0', columns='level_1', values=[0]))
Output:
level_1 ratio1 ratio2 ratio3 ratio4 ratio5
level_0
0 0.28 0.54 0.33 0.35 0.11
1 0.33 0.25 0.13 0.11 0.59
2 0.11 0.33 0.11 0.35 0.13
To work with the data, the columns first need to be turned into a single column using stack().reset_index(). arr is the list of unique products; the list aaa then collects the indexes in arr of products that occur more than once.
prod_ind = product[product[0] == arr[i]].index
Inside the loop, this selects the row indexes of every occurrence of the current product.
val_ratio = ratio.loc[prod_ind[0], 0]
This reads the ratio at the product's first occurrence.
ratio.loc[prod_ind, 0] = val_ratio
This writes that value to all occurrences of the product.
Explicit loc indexing is used to access the values: row indexes go on the left of the comma and column names on the right.
Finally, pivot_table rebuilds the original table shape.
To insert the processed data into the original dataframe, simply use the following:
table = ratio.pivot_table(index='level_0', columns='level_1', values=[0])
df[['ratio1', 'ratio2', 'ratio3', 'ratio4', 'ratio5']] = table
print(df)
If you're after code to create the init_rateX columns, the following will work:
import numpy as np

pd.DataFrame(
    np.divide(
        df[["ratio1", "ratio2", "ratio3", "ratio4", "ratio5"]].to_numpy(),
        df[["Count1", "Count2", "Count3", "Count4", "Count5"]].to_numpy(),
    ),
    columns=["init_rate1", "init_rate2", "init_rate3", "init_rate4", "init_rate5"],
)
which gives
init_rate1 init_rate2 init_rate3 init_rate4 init_rate5
0 0.28 0.25 0.33 0.57 0.835
1 0.33 0.13 0.97 0.65 0.760
2 0.54 0.11 0.45 0.95 1.160
3 0.35 0.59 0.34 1.25 1.650
However, it does not agree with your calculations for init_rate4 or init_rate5, so some clarification might be needed.

How can I compute the cumulative weighted average in new column?

I've read all the related pages on Google and Stack Overflow, and I still can't find the solution.
Given this df fragment:
key_br_acc_posid lot_in price
ix
1 1_885020_76141036 0.03 1.30004
2 1_885020_76236801 0.02 1.15297
5 1_885020_76502318 0.50 2752.08000
8 1_885020_76502318 4.50 2753.93000
9 1_885020_76502318 0.50 2753.93000
... ... ...
1042 1_896967_123068980 0.01 1.17657
1044 1_896967_110335293 0.01 28.07100
1047 1_896967_110335293 0.01 24.14000
1053 1_896967_146913299 25.00 38.55000
1054 1_896967_147039856 2.00 121450.00000
How can I create a new column w_avg_price computing the moving weighted average price by key_br_acc_posid? The lot_in is the weight and the price is the value.
I tried many approaches with groupby() + np.average(), but I have to avoid aggregating the data; I need this value in each row.
Use groupby and then perform the calculation for each group using cumsum():
(df.groupby('key_br_acc_posid', as_index=False)
   .apply(lambda g: g.assign(w_avg_price=(g['lot_in'] * g['price']).cumsum() / g['lot_in'].cumsum()))
   .reset_index(level=0, drop=True)
)
result:
key_br_acc_posid lot_in price w_avg_price
---- ------------------ -------- ------------ -------------
1 1_885020_76141036 0.03 1.30004 1.30004
2 1_885020_76236801 0.02 1.15297 1.15297
5 1_885020_76502318 0.5 2752.08 2752.08
8 1_885020_76502318 4.5 2753.93 2753.74
9 1_885020_76502318 0.5 2753.93 2753.76
1044 1_896967_110335293 0.01 28.071 28.071
1047 1_896967_110335293 0.01 24.14 26.1055
1042 1_896967_123068980 0.01 1.17657 1.17657
1053 1_896967_146913299 25 38.55 38.55
1054 1_896967_147039856 2 121450 121450
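The same cumulative weighted average can be sketched without groupby().apply(), using two grouped cumsum()s. The column name 'key' below is a shortened stand-in for the question's key_br_acc_posid, and the sample values echo a few rows above:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b'],          # stand-in for key_br_acc_posid
    'lot_in': [0.5, 4.5, 0.5, 2.0],
    'price': [2752.08, 2753.93, 2753.93, 121450.0],
})

# Cumulative weighted average per group:
# cumsum(weight * value) / cumsum(weight), each restricted to its group.
num = (df['lot_in'] * df['price']).groupby(df['key']).cumsum()
den = df.groupby('key')['lot_in'].cumsum()
df['w_avg_price'] = num / den
```

Both cumsum()s preserve the original row order, so no reset_index is needed afterwards.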
What you want is built from cumsum(): divide the cumulative sum of lot_in * price by the cumulative sum of lot_in:
df = pd.DataFrame({'lot_in': [.1, .2, .3], 'price': [1.0, 1.25, 1.3]})
df['mvg_avg'] = (df['lot_in'] * df['price']).cumsum() / df['lot_in'].cumsum()
print(df)
   lot_in  price   mvg_avg
0     0.1   1.00  1.000000
1     0.2   1.25  1.166667
2     0.3   1.30  1.233333

Only one index label in the dataset

I am working with the ecoli dataset from http://archive.ics.uci.edu/ml/datasets/Ecoli. The values are separated by tabs. I would like to index each column and give them a name. But when I do that using the following code:
import pandas as pd
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
d = pd.read_table('ecoli.csv', sep=' ', header=None, names=ecoli_cols)
Instead of naming the columns I already have, it creates 6 new columns. But I would like those names on the existing columns, and later I would like to extract information from this dataset, so it is important that the values are separated properly. Thanks
You can use the url with the data and the separator \s+ (one or more whitespace characters):
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data'
ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4', 'info5', 'info6', 'info7', 'type']
df = pd.read_table(url, sep=r'\s+', header=None, names=ecoli_cols)
# alternative: use parameter delim_whitespace
# df = pd.read_table(url, delim_whitespace=True, header=None, names=ecoli_cols)
print(df.head())
N_ecoli info1 info2 info3 info4 info5 info6 info7 type
0 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
1 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
2 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
3 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
4 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
But if you want to use your file with tab as the separator:
d = pd.read_table('ecoli.csv', sep='\t', header=None, names=ecoli_cols)
And if the separator is ;:
d = pd.read_table('ecoli.csv', sep=';', header=None, names=ecoli_cols)
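As a side note, newer pandas versions deprecate read_table in favor of read_csv, which accepts the same arguments. Here is a self-contained sketch using an in-memory sample of the dataset's whitespace-separated layout (in practice, pass the file path or URL instead of the StringIO buffer):

```python
import io
import pandas as pd

ecoli_cols = ['N_ecoli', 'info1', 'info2', 'info3', 'info4',
              'info5', 'info6', 'info7', 'type']

# Two sample rows in the dataset's whitespace-separated layout.
sample = io.StringIO(
    'AAT_ECOLI  0.49  0.29  0.48  0.50  0.56  0.24  0.35  cp\n'
    'ACEA_ECOLI  0.07  0.40  0.48  0.50  0.54  0.35  0.44  cp\n'
)
df = pd.read_csv(sample, sep=r'\s+', header=None, names=ecoli_cols)
```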

Create new DF with values representing difference between two dataframes

I am working with two numeric data.frames, both with 13803 observations and 13803 variables. Their column and row names are identical, but their entries differ. What I want to do is create a new data.frame holding the entry-wise differences.
The "formula" would be df1(entry values) - df2(entry values) = df3 (differences). In other words, the purpose is to find the difference between all corresponding entries.
My problem illustrated here.
DF1
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.71 0.98 0.32
[GENE128] 0.23 0.61 0.90
[GENE271] 0.87 0.95 0.63
DF2
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.70 0.94 0.30
[GENE128] 0.25 0.51 0.80
[GENE271] 0.82 0.92 0.60
NEW DF3
[GENE128] [GENE271] [GENE2983]
[GENE231] 0.01 0.04 0.02
[GENE128] -.02 0.10 0.10
[GENE271] 0.05 0.03 0.03
So, in DF3 the values are the difference between DF1 and DF2 for each entry.
DF1(GENE231) - DF2(GENE231) = DF3(DIFFERENCE-GENE231)
DF1(GENE271) - DF2(GENE271) = DF3(DIFFERENCE-GENE271)
and so on...
Help would be much appreciated!
Kind regards,
Harebell
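If the same matrices are held as pandas DataFrames with identical row and column labels, the entry-wise difference is a single label-aligned subtraction; a minimal sketch using the sample values above:

```python
import pandas as pd

genes = ['GENE231', 'GENE128', 'GENE271']
cols = ['GENE128', 'GENE271', 'GENE2983']
df1 = pd.DataFrame([[0.71, 0.98, 0.32],
                    [0.23, 0.61, 0.90],
                    [0.87, 0.95, 0.63]], index=genes, columns=cols)
df2 = pd.DataFrame([[0.70, 0.94, 0.30],
                    [0.25, 0.51, 0.80],
                    [0.82, 0.92, 0.60]], index=genes, columns=cols)

# Subtraction aligns on both row and column labels, so the row/column
# order of the two frames does not even need to match.
df3 = df1.sub(df2)
```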

Make console-friendly string a useable pandas dataframe python

A quick question as I'm currently changing from R to pandas for some projects:
I get the following print output from metrics.classification_report from sci-kit learn:
precision recall f1-score support
0 0.67 0.67 0.67 3
1 0.50 1.00 0.67 1
2 1.00 0.80 0.89 5
avg / total 0.83 0.78 0.79 9
I want to use this (and similar ones) as a matrix/dataframe so, that I could subset it to extract, say the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$class_outcome == 1, 'precision']
And I can do this in pandas, but the object that I want to use is simply a string (see scikit-learn's docs).
How can I make the table output here to a useable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
avg / total 0.70 0.60 0.61 5
'''
Use that as the string input for StringIO:
import io  # For Python 2.x use: import StringIO
df = pd.read_table(io.StringIO(s), sep=r'\s{2,}', engine='python')  # For Python 2.x use StringIO.StringIO(s)
df
Out:
precision recall f1-score support
class 0 0.5 1.00 0.67 1
class 1 0.0 0.00 0.00 1
class 2 1.0 0.67 0.80 3
avg / total 0.7 0.60 0.61 5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
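As a side note, newer scikit-learn versions (0.20+) can skip the string parsing entirely via the output_dict=True parameter of classification_report; a sketch with made-up labels:

```python
import pandas as pd
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

report = classification_report(y_true, y_pred, output_dict=True)
report.pop('accuracy', None)  # keep only per-class and average rows
df = pd.DataFrame(report).T   # classes/averages become the index

# Subset exactly like before, e.g. df.loc['0', 'precision']
```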
