output multiple files based on column value python pandas


I have a sample pandas DataFrame:
import pandas as pd
df = {'ID': [73, 68,1,94,42,22, 28,70,47, 46,17, 19, 56, 33 ],
'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ],
'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311]}
df = pd.DataFrame(df)
It looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
I want to write a simple script to output each CloneID to a different output file, so in this case there would be 4 different files.
The first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
The second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
The third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
And the last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
The code I found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
for group_name, data in data.groupby('CloneID'):
    with open('results.csv', 'a') as f:
        data.to_csv(f)
But it writes everything to one file instead of multiple files.

You can do something like the following:
In [19]:
gp = df.groupby('CloneID')
for g in gp.groups:
    print('CloneID' + str(g) + '.txt')
    print(gp.get_group(g).to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
Here we iterate over the group keys with for g in gp.groups, build the output file name from each key, and call to_csv on the corresponding group. The following should work for you:
gp = df.groupby('CloneID')
for g in gp.groups:
    path = 'CloneID' + str(g) + '.txt'
    gp.get_group(g).to_csv(path)
Actually the following would be even simpler:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))
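The loop above can also be written by iterating the groupby object directly, which yields each key together with its sub-frame. The following is a minimal sketch under a few assumptions that are not in the original answer: the files go to a temporary directory (out_dir is a hypothetical name), the index is dropped, and a tab separator is used so the files resemble the layout shown in the question.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'ID': [73, 68, 1, 94, 42, 22, 28, 70, 47, 46, 17, 19, 56, 33],
    'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311],
})

# Hypothetical output directory; a temporary one keeps the sketch self-contained.
out_dir = Path(tempfile.mkdtemp())

written = []
# Iterating a groupby yields (key, sub-frame) pairs, one per distinct CloneID.
for clone_id, group in df.groupby('CloneID'):
    path = out_dir / f'CloneID{clone_id}.txt'
    # Column order and separator are chosen to resemble the question's layout.
    group[['CloneID', 'ID', 'VGene']].to_csv(path, sep='\t', index=False)
    written.append(path.name)

print(written)
```

Passing index=False leaves out the leading index column that appears in the to_csv output printed earlier.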

Related

how to do complex calculations in pandas dataframe

sample dataframe:
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
'2020-01': [24,42,18,68,24,30],
'2020-02': [24,42,18,68,24,30],
'2020-03': [64,24,70,70,88,57],
'2020-04': [22,11,44,3,5,78],
'2020-05': [11,35,74,12,69,51]})
I want to compute df['L2'] as described below.
I studied pandas rolling, groupby, etc., but could not solve it.
Please read the L2 formula and give me your opinion.
L2 formula
L2(Jan-20) = 24
-------------------
sales 2020-01
0 2020-01 24
-------------------
L2(Feb-20) = 132 (sum of below matrix 2x2)
sales 2020-01 2020-02
0 2020-01 24 24
1 2020-02 42 42
-------------------
L2(Mar-20) = 154 (sum of matrix 2x2)
sales 2020-02 2020-03
0 2020-02 42 24
1 2020-03 18 70
-------------------
L2(Apr-20) = 187 (sum of below matrix 2x2)
sales 2020-03 2020-04
0 2020-03 70 44
1 2020-04 70 3
output
Unnamed: 0 sales Jan-20 Feb-20 Mar-20 Apr-20 May-20 L2 L3
0 0 Jan-20 24 24 64 22 11 24 24
1 1 Feb-20 42 42 24 11 35 132 132
2 2 Mar-20 18 18 70 44 74 154 326
3 3 Apr-20 68 68 70 3 12 187 350
4 4 May-20 24 24 88 5 69 89 545
5 5 Jun-20 30 30 57 78 51 203 433

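import numpy as np
# f is the DataFrame from the question; drop the 'sales' label column
Values=f.values[:,1:]
L2=[]
RANGE=Values.shape[0]
for a in range(RANGE):
    if a==0:
        result=Values[a,a]
    else:
        # on the last row the column window runs off the table, so shift it left
        if Values[a-1:a+1,a-1:a+1].shape==(2,1):
            result=np.sum(Values[a-1:a+1,a-2:a])
        else:
            result=np.sum(Values[a-1:a+1,a-1:a+1])
    L2.append(result)
print(L2)
L2 output: [24, 132, 154, 187, 89, 203]
f["L2"]=L2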
f:

import pandas as pd
import numpy as np
# make a dataset
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
'2020-01': [24,42,18,68,24,30],
'2020-02': [24,42,18,68,24,30],
'2020-03': [64,24,70,70,88,57],
'2020-04': [22,11,44,3,5,78],
'2020-05': [11,35,74,12,69,51]})
print(df)
# datawork(L2)
for i in range(0,df.shape[0]):
    if i==0:
        df.loc[i,'L2']=df.loc[i,'2020-01']
    else:
        if i!=df.shape[0]-1:
            df.loc[i,'L2']=df.iloc[i-1:i+1,i:i+2].sum().sum()
        if i==df.shape[0]-1:
            df.loc[i,'L2']=df.iloc[i-1:i+1,i-1:i+1].sum().sum()
print(df)
# sales 2020-01 2020-02 2020-03 2020-04 2020-05 L2
#0 2020-01 24 24 64 22 11 24.0
#1 2020-02 42 42 24 11 35 132.0
#2 2020-03 18 18 70 44 74 154.0
#3 2020-04 68 68 70 3 12 187.0
#4 2020-05 24 24 88 5 69 89.0
#5 2020-06 30 30 57 78 51 203.0
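The same rule can be wrapped in a small function that works on the numeric block only. This is a sketch, not the original poster's code: the name l2_window_sums and the clamping of the column window on the last row are my own reading of the formula in the question (the last row reuses the final two columns because there is no later column).

```python
import pandas as pd

df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
                   '2020-01': [24, 42, 18, 68, 24, 30],
                   '2020-02': [24, 42, 18, 68, 24, 30],
                   '2020-03': [64, 24, 70, 70, 88, 57],
                   '2020-04': [22, 11, 44, 3, 5, 78],
                   '2020-05': [11, 35, 74, 12, 69, 51]})


def l2_window_sums(frame, label_col='sales'):
    """Sum the 2x2 block ending at row i, keeping the column window inside the table."""
    values = frame.drop(columns=label_col).to_numpy()
    n_rows, n_cols = values.shape
    out = []
    for i in range(n_rows):
        if i == 0:
            # The first row has no previous row: take the single top-left cell.
            out.append(int(values[0, 0]))
            continue
        # Normally the window covers columns i-1 and i; when column i does not
        # exist (last row here), the window is shifted left to the final two columns.
        right = min(i + 1, n_cols)
        block = values[i - 1:i + 1, right - 2:right]
        out.append(int(block.sum()))
    return out


l2 = l2_window_sums(df)
print(l2)
```

Assigning the result with df['L2'] = l2 would keep the column as integers rather than the floats produced by the cell-by-cell df.loc assignment.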

I tried another method.
This method uses reshape long (in Python: melt), but I applied reshape long twice because the time frequency of sales and of the other columns in df is monthly rather than daily, so I reshaped once more to create an integer column corresponding to the monthly date.
(I have used Stata more often than Python. In Stata, I would only need to reshape long once because it supports a monthly time frequency, and the reshape task is much easier there than in pandas.)
If you are interested, take a look:
# 00.module
import pandas as pd
import numpy as np
from order import order # https://stackoverflow.com/a/68464246/16478699
# 0.make a dataset
df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
'2020-01': [24, 42, 18, 68, 24, 30],
'2020-02': [24, 42, 18, 68, 24, 30],
'2020-03': [64, 24, 70, 70, 88, 57],
'2020-04': [22, 11, 44, 3, 5, 78],
'2020-05': [11, 35, 74, 12, 69, 51]}
)
df.to_stata('dataset.dta', version=119, write_index=False)
print(df)
# 1.reshape long(in python: melt)
t = list(df.columns)
t.remove('sales')
df_long = df.melt(id_vars='sales', value_vars=t, var_name='var', value_name='val')
df_long['id'] = list(range(1, df_long.shape[0] + 1)) # make id for another reshape long
print(df_long)
# 2.another reshape long(in python: melt, reason: make int(col name: tid) corresponding to monthly date of sales and monthly columns in df)
df_long2 = df_long.melt(id_vars=['id', 'val'], value_vars=['sales', 'var'])
df_long2['tid'] = df_long2['value'].apply(lambda x: 1 + list(df_long2.value.unique()).index(x))
print(df_long2)
# 3.back to wide form with tid(in python: pd.pivot)
df_wide = pd.pivot(df_long2, index=['id', 'val'], columns='variable', values=['value', 'tid'])
df_wide.columns = df_wide.columns.map(lambda x: x[1] if x[0] == 'value' else f'{x[0]}_{x[1]}') # change multiindex columns name into just normal columns name
df_wide = df_wide.reset_index()
print(df_wide)
# 4.make values of L2
for i in df_wide.tid_sales.unique():
    if list(df_wide.tid_sales.unique()).index(i) + 1 == len(df_wide.tid_sales.unique()):
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i - 1) | (
            df_wide['tid_var'] == i - 2))), 'val'].sum()
    else:
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i) | (
            df_wide['tid_var'] == i - 1))), 'val'].sum()
print(df_wide)
# 5.back to shape of df with L2(reshape wide, in python: pd.pivot)
df_final = df_wide.drop(columns=df_wide.filter(regex='^tid').columns) # columns starting with tid are no longer needed
df_final = pd.pivot(df_final, index=['sales', 'L2'], columns='var', values='val').reset_index()
df_final = order(df_final, 'L2', f_or_l='last') # order function is made by me
print(df_final)

How to create files from a groupby object, based on the length of the dataframe

I have a dataframe (df) that looks like this (highly simplified):
ID A B C VALUE
1 10 462 2241 217
2 11 498 6953 217
3 67 120 6926 654
4 68 898 7153 654
5 87 557 4996 654
6 88 227 6475 911
7 47 875 5097 911
8 48 143 8953 111
9 65 157 4470 111
10 66 525 9328 111
The 'VALUE' column contains a variable number of rows with identical values. I am trying to output a series of csv files that contain all of the rows that contain a 'VALUE' length == 2, ==3 etc. For example:
to_csv('/Path/to/VALUE_len_2.csv')
ID A B C VALUE
1 10 462 2241 217
2 11 498 6953 217
6 88 227 6475 911
7 47 875 5097 911
to_csv('/Path/to/VALUE_len_3.csv')
ID A B C VALUE
3 67 120 6926 654
4 68 898 7153 654
5 87 557 4996 654
to_csv('/Path/to/VALUE_len_4.csv')
ID A B C VALUE
7 47 875 5097 111
8 48 143 8953 111
9 65 157 4470 111
10 66 525 9328 111
I can get the desired output of one length value at a time, e.g., using:
df = pd.concat(v for _, v in df.groupby("VALUE") if len(v) == 2)
df.to_csv("/Path/to/VALUE_len_2.csv")
However, I have dozens of values to test. I would like to put this in a for loop on the order of:
mylist = [2,3,4,5,6,7,8,9] or len([2,3,4,5,6,7,8,9])
grouped = df.groupby(['VALUE'])
output = '/Path/to/VALUE_len_{}.csv'
for loop here:
    if item in my list found in grouped:
        output rows to csv
    else:
        pass
I've tried various constructions to iterate the groupby object using the items in the list, and I haven't been able to make anything work.
It might be an issue with trying to use a groupby object this way, but it is more than likely it is my inability to get the syntax right to complete the iteration.

There is no need for a predetermined list to create the filenames.
Instead, the length of each group (df_len) is used to build the filename with an f-string.
Path.exists() is used to determine whether the file already exists.
import pandas as pd
from pathlib import Path
# test data
data = {'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66], 'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525], 'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328], 'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]}
df = pd.DataFrame(data)
# groupby value
for group, data in df.groupby('VALUE'):
    # get the length of the dataframe
    df_len = len(data)
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    # if the file exists, append without the header
    if file.exists():
        data.to_csv(file, index=False, mode='a', header=False)
    # create a new file
    else:
        data.to_csv(file, index=False)
If you must only create a file for dataframes of a specific length:
desired_length = [2, 3, 4, 5, 6, 7, 8, 9]
# groupby value
for group, data in df.groupby('VALUE'):
    # get the length of the dataframe
    df_len = len(data)
    # create a filename with df_len
    file = Path(f'/path/to/VALUE_len_{df_len}.csv')
    # check if the length of the dataframe is in the desired length
    if df_len in desired_length:
        # if the file exists, append without the header
        if file.exists():
            data.to_csv(file, index=False, mode='a', header=False)
        # create a new file
        else:
            data.to_csv(file, index=False)

# find the VALUE_len of every VALUE first
df['VALUE_len'] = df.groupby('VALUE')['ID'].transform('count')
cols = df.columns[:-1]
# ['ID', 'A', 'B', 'C', 'VALUE']
# save group by VALUE_len
for VALUE_len, group in df.groupby('VALUE_len'):
    file = Path(f'/path/to/VALUE_len_{VALUE_len}.csv')
    group[cols].to_csv(file, index=False)
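The transform-based approach can be combined with the list of wanted lengths from the question. The sketch below is one possible way to do that, with some assumptions of mine: a temporary directory stands in for /path/to, and desired_lengths, group_len and written are illustrative names. Because each file is written once in full, rerunning the script does not append duplicate rows.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'A': [10, 11, 67, 68, 87, 88, 47, 48, 65, 66],
                   'B': [462, 498, 120, 898, 557, 227, 875, 143, 157, 525],
                   'C': [2241, 6953, 6926, 7153, 4996, 6475, 5097, 8953, 4470, 9328],
                   'VALUE': [217, 217, 654, 654, 654, 911, 911, 111, 111, 111]})

desired_lengths = {2, 3, 4}          # lengths to export (illustrative)
out_dir = Path(tempfile.mkdtemp())   # hypothetical stand-in for /path/to

# For every row, the number of rows that share its VALUE (aligned to df's index).
group_len = df.groupby('VALUE')['ID'].transform('count')

written = {}
# Grouping by the length Series collects all rows whose VALUE occurs 2x, 3x, ...
for length, rows in df.groupby(group_len):
    if length not in desired_lengths:
        continue
    rows.to_csv(out_dir / f'VALUE_len_{length}.csv', index=False)
    written[int(length)] = rows['ID'].tolist()

print(written)
```

With the sample data no VALUE occurs four times, so no VALUE_len_4.csv is produced.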

Python DataFrames concat or append problem

I have a problem with dataframes in Python. I am trying to copy certain rows to a new dataframe but I can't figure it out.
There are two DataFrames:
pokemon_data
# HP Attack Defense Sp. Atk Sp. Def Speed
0 1 45 49 49 65 65 45
1 2 60 62 63 80 80 60
2 3 80 82 83 100 100 80
3 4 80 100 123 122 120 80
4 5 39 52 43 60 50 65
... ... ... ... ... ... ... ...
795 796 50 100 150 100 150 50
796 797 50 160 110 160 110 110
797 798 80 110 60 150 130 70
798 799 80 160 60 170 130 80
799 800 80 110 120 130 90 70
800 rows × 7 columns
combats_data
First_pokemon Second_pokemon Winner
0 266 298 1
1 702 701 1
2 191 668 1
3 237 683 1
4 151 231 0
... ... ... ...
49995 707 126 0
49996 589 664 0
49997 303 368 1
49998 109 89 0
49999 9 73 0
50000 rows × 3 columns
I created a third dataset with these columns:
output1
HP0 Attack0 Defense0 Sp. Atk0 Sp. Def0 Speed0 HP1 Attack1 Defense1 Sp. Atk1 Sp. Def1 Speed1 Winner
What I'm trying to do is copy attributes from pokemon_data into output1, in the order given by combats_data.
HP0 and HP1 are, respectively, the HP of the first Pokemon and the HP of the second Pokemon.
I want to use that data in a neural network with TensorFlow to predict which Pokemon would win.

For this type of wrangling, you should first "melt" or "tidy" the combats_data so each ID has its own row, then do a "join" or "merge" of the two dataframes.
You didn't provide a minimal reproducible example, so here's mine:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4,5],
'var1': [10,20,30,40,50],
'var2': [15,25,35,45,55]})
df2 = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'outcome': [1,4]})
df2tidy = pd.melt(df2, id_vars=['outcome'], value_vars=['id1', 'id2'],
var_name='name', value_name='id')
df2tidy
# outcome name id
# 0 1 id1 1
# 1 4 id1 2
# 2 1 id2 3
# 3 4 id2 4
output = pd.merge(df2tidy, df1, on='id')
output
# outcome name id var1 var2
# 0 1 id1 1 10 15
# 1 4 id1 2 20 25
# 2 1 id2 3 30 35
# 3 4 id2 4 40 45
You could then train some sort of classifier on outcome.
(Btw, you should make outcome a 0 or 1 (for pokemon1 vs pokemon2) instead of the actual ID of the winner.)

So I would like to create a new DataFrame based on these two. For example:
#ids represent pokemons and their attributes
pokemons = pd.DataFrame({'id': [1,2,3,4,5],
'HP': [10,20,30,40,50],
'Attack': [15,25,35,45,55],
'Defense' : [25,15,45,15,35]})
#here 0 or 1 represents whether first or second pokemon won
combats = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'winner': [0,1]})
#in output data i want to replace ids with attributes, the order is based on combats array
output = pd.DataFrame({'HP1': [10,20],
'Attack1': [15,25],
'Defense1': [25,15],
'HP2': [30,40],
'Attack2': [35,45],
'Defense2': [45,15],
'winner': [0,1]})
Not sure if this is the right way to think about it. I want to train a neural network to figure out which Pokemon will win.

This is a solution from user part on the 4programmers.net forum.
import pandas as pd
if __name__ == "__main__":
    pokemon_data = pd.DataFrame({
        "Id": [1, 2, 3, 4, 5],
        "HP": [45, 60, 80, 80, 39],
        "Attack": [49, 62, 82, 100, 52],
        "Defense": [49, 63, 83, 123, 43],
        "Sp. Atk": [65, 80, 100, 122, 60],
        "Sp. Def": [65, 80, 100, 120, 50],
        "Speed": [45, 60, 80, 80, 65]})
    combats_data = pd.DataFrame({
        "First_pokemon": [1, 2, 3],
        "Second_pokemon": [2, 3, 4],
        "Winner": [1, 0, 1]})
    output = pokemon_data.merge(combats_data, left_on="Id", right_on="First_pokemon")
    output = output.merge(pokemon_data, left_on="Second_pokemon", right_on="Id",
                          suffixes=("_pokemon1", "_pokemon2"))
    print(output)
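For the small pokemons/combats example given in the follow-up, the wide layout (HP1, Attack1, ..., HP2, ..., winner) can also be produced with an index lookup instead of two merges. This is only a sketch: it assumes every id in combats exists in pokemons (a missing id would raise a KeyError), and the names stats, first and second are illustrative.

```python
import pandas as pd

pokemons = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'HP': [10, 20, 30, 40, 50],
                         'Attack': [15, 25, 35, 45, 55],
                         'Defense': [25, 15, 45, 15, 35]})
# 0 or 1 says whether the first or the second pokemon won
combats = pd.DataFrame({'id1': [1, 2],
                        'id2': [3, 4],
                        'winner': [0, 1]})

stats = pokemons.set_index('id')

# .loc with a sequence of ids returns rows in exactly that order, so the
# attribute rows line up with the rows of combats.
first = stats.loc[combats['id1']].add_suffix('1').reset_index(drop=True)
second = stats.loc[combats['id2']].add_suffix('2').reset_index(drop=True)

output = pd.concat([first, second, combats[['winner']].reset_index(drop=True)], axis=1)
print(output)
```

The row order of output follows combats, which is what the question asks for; merge does not always guarantee that ordering.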

How to apply a function to all the columns in a data frame and take output in the form of dataframe in python

I have two functions that do some calculations and give me results. For now, I am able to apply them to one column and get the result in the form of a dataframe.
I need to know how I can apply the function on all the columns in the dataframe and get results as well in the form of a dataframe.
Say I have a data frame as below and I need to apply the function on each column in the data frame and get a dataframe with results corresponding for all the columns.
A B C D E F
1456 6744 9876 374 65413 1456
654 2314 674654 2156 872 6744
875 653 36541 345 4963 9876
6875 7401 3654 465 3547 374
78654 8662 35 6987 6874 65413
658 94512 687 489 8756 5854
Results
A B C D E F
2110 9058 684530 2530 66285 8200
1529 2967 711195 2501 5835 16620
7750 8054 40195 810 8510 10250
85529 16063 3689 7452 10421 65787

Here is a simple example:
df
A B C D
0 10 11 12 13
1 20 21 22 23
2 30 31 32 33
3 40 41 42 43
# Assume your user-defined function is
def mul(x, y):
    return x * y
which multiplies the two values.
Let's say you want to multiply the first column 'A' by 3:
df['A'].apply(lambda x: mul(x,3))
0 30
1 60
2 90
3 120
Now, you want to apply the mul function to all columns of the dataframe and create a new dataframe with the results:
df1 = df.applymap(lambda x: mul(x, 3))
df1
A B C D
0 30 33 36 39
1 60 63 66 69
2 90 93 96 99
3 120 123 126 129
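As far as I know, recent pandas releases (2.1 and later) offer the same element-wise behaviour under the name DataFrame.map and deprecate applymap. The sketch below picks whichever is available and compares the result with plain vectorized arithmetic; elementwise is an illustrative name, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40],
                   'B': [11, 21, 31, 41],
                   'C': [12, 22, 32, 42],
                   'D': [13, 23, 33, 43]})


def mul(x, y):
    return x * y


# Use DataFrame.map where it exists, otherwise fall back to applymap
# (older pandas versions only have the latter).
elementwise = df.map if hasattr(df, 'map') else df.applymap
df1 = elementwise(lambda x: mul(x, 3))

# For simple arithmetic, the vectorized form avoids calling a Python
# function once per cell.
df2 = df * 3

print(df1)
print((df1 == df2).all().all())
```

Element-wise application is mainly useful when the function cannot be expressed with vectorized operations.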

A pd.DataFrame object also has its own apply method.
From the example given in its documentation:
>>> df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
>>> df
A B
0 4 9
1 4 9
2 4 9
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Conclusion: you should be able to apply your function to the whole dataframe.

It looks like this is what you are trying to do in your output:
df = pd.DataFrame(
[[1456, 6744, 9876, 374, 65413, 1456],
[654, 2314, 674654, 2156, 872, 6744],
[875, 653, 36541, 345, 4963, 9876],
[6875, 7401, 3654, 465, 3547, 374],
[78654, 8662, 35, 6987, 6874, 65413],
[658, 94512, 687, 489, 8756, 5854]],
columns=list('ABCDEF'))
def fn(col):
    return col[:-2].values + col[1:-1].values
Apply the function as mentioned in previous answers:
>>> df.apply(fn)
A B C D E F
0 2110 9058 684530 2530 66285 8200
1 1529 2967 711195 2501 5835 16620
2 7750 8054 40195 810 8510 10250
3 85529 16063 3689 7452 10421 65787
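The same pairwise sums can be sketched without a custom function by adding the frame to a copy shifted up by one row. This is an alternative of mine rather than the answerer's method; like fn above, it drops the final pair so that four rows remain, matching the Results table in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[1456, 6744, 9876, 374, 65413, 1456],
     [654, 2314, 674654, 2156, 872, 6744],
     [875, 653, 36541, 345, 4963, 9876],
     [6875, 7401, 3654, 465, 3547, 374],
     [78654, 8662, 35, 6987, 6874, 65413],
     [658, 94512, 687, 489, 8756, 5854]],
    columns=list('ABCDEF'))

# shift(-1) moves every row up by one, so row i is added to row i+1.
# The last row of the sum is NaN and the one before it is the pair fn() skips,
# so both are cut off before converting back to integers.
pairwise = (df + df.shift(-1)).iloc[:-2].astype(int)
print(pairwise)
```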

R's pdIdent function in RPy

I am working on translating the code for the lmeSplines tutorial to RPy.
I am now stuck at the following line:
fit1s <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))
I have worked with nlme.lme before, and the following works just fine:
from rpy2.robjects.packages import importr
nlme = importr('nlme')
nlme.lme(r.formula('y ~ time'), data=some_data, random=r.formula('~1|ID'))
But this one has a different random argument. I am wondering how I can translate the list(all=pdIdent(~Zt - 1)) part and put it into my RPy code as well.
The structure of the (preprocessed) example data smSplineEx1 looks like this (with Zt.* up to 98):
time y y.true all Zt.1 Zt.2 Zt.3
1 1 5.797149 4.235263 1 1.168560e+00 2.071261e+00 2.944953e+00
2 2 5.469222 4.461302 1 1.487859e-01 1.072013e+00 1.948857e+00
3 3 4.567237 4.678477 1 -5.449190e-02 7.276623e-02 9.527613e-01
4 4 3.645763 4.887137 1 -5.364552e-02 -1.359115e-01 -4.333438e-02
5 5 5.094126 5.087615 1 -5.279913e-02 -1.337708e-01 -2.506194e-01
6 6 4.636121 5.280233 1 -5.195275e-02 -1.316300e-01 -2.466158e-01
7 7 5.501538 5.465298 1 -5.110637e-02 -1.294892e-01 -2.426123e-01
8 8 5.011509 5.643106 1 -5.025998e-02 -1.273485e-01 -2.386087e-01
9 9 6.114037 5.813942 1 -4.941360e-02 -1.252077e-01 -2.346052e-01
10 10 5.696472 5.978080 1 -4.856722e-02 -1.230670e-01 -2.306016e-01
11 11 6.615363 6.135781 1 -4.772083e-02 -1.209262e-01 -2.265980e-01
12 12 8.002526 6.287300 1 -4.687445e-02 -1.187854e-01 -2.225945e-01
13 13 6.887444 6.432877 1 -4.602807e-02 -1.166447e-01 -2.185909e-01
14 14 6.319205 6.572746 1 -4.518168e-02 -1.145039e-01 -2.145874e-01
15 15 6.482771 6.707130 1 -4.433530e-02 -1.123632e-01 -2.105838e-01
16 16 7.938015 6.836245 1 -4.348892e-02 -1.102224e-01 -2.065802e-01
17 17 7.585533 6.960298 1 -4.264253e-02 -1.080816e-01 -2.025767e-01
18 18 7.560287 7.079486 1 -4.179615e-02 -1.059409e-01 -1.985731e-01
19 19 7.571020 7.194001 1 -4.094977e-02 -1.038001e-01 -1.945696e-01
20 20 8.922418 7.304026 1 -4.010338e-02 -1.016594e-01 -1.905660e-01
21 21 8.241394 7.409737 1 -3.925700e-02 -9.951861e-02 -1.865625e-01
22 22 7.447076 7.511303 1 -3.841062e-02 -9.737785e-02 -1.825589e-01
23 23 7.317292 7.608886 1 -3.756423e-02 -9.523709e-02 -1.785553e-01
24 24 7.077333 7.702643 1 -3.671785e-02 -9.309633e-02 -1.745518e-01
25 25 8.268601 7.792723 1 -3.587147e-02 -9.095557e-02 -1.705482e-01
26 26 8.216013 7.879272 1 -3.502508e-02 -8.881481e-02 -1.665447e-01
27 27 8.968495 7.962427 1 -3.417870e-02 -8.667405e-02 -1.625411e-01
28 28 9.085605 8.042321 1 -3.333232e-02 -8.453329e-02 -1.585375e-01
29 29 9.002575 8.119083 1 -3.248593e-02 -8.239253e-02 -1.545340e-01
30 30 8.763187 8.192835 1 -3.163955e-02 -8.025177e-02 -1.505304e-01
31 31 8.936370 8.263695 1 -3.079317e-02 -7.811101e-02 -1.465269e-01
32 32 9.033403 8.331776 1 -2.994678e-02 -7.597025e-02 -1.425233e-01
33 33 8.248328 8.397188 1 -2.910040e-02 -7.382949e-02 -1.385198e-01
34 34 5.961721 8.460035 1 -2.825402e-02 -7.168873e-02 -1.345162e-01
35 35 8.400489 8.520418 1 -2.740763e-02 -6.954797e-02 -1.305126e-01
36 36 6.855125 8.578433 1 -2.656125e-02 -6.740721e-02 -1.265091e-01
37 37 9.798931 8.634174 1 -2.571487e-02 -6.526645e-02 -1.225055e-01
38 38 8.862758 8.687729 1 -2.486848e-02 -6.312569e-02 -1.185020e-01
39 39 7.282970 8.739184 1 -2.402210e-02 -6.098493e-02 -1.144984e-01
40 40 7.484208 8.788621 1 -2.317572e-02 -5.884417e-02 -1.104949e-01
41 41 8.404670 8.836120 1 -2.232933e-02 -5.670341e-02 -1.064913e-01
42 42 8.880734 8.881756 1 -2.148295e-02 -5.456265e-02 -1.024877e-01
43 43 8.826189 8.925603 1 -2.063657e-02 -5.242189e-02 -9.848418e-02
44 44 9.827906 8.967731 1 -1.979018e-02 -5.028113e-02 -9.448062e-02
45 45 8.528795 9.008207 1 -1.894380e-02 -4.814037e-02 -9.047706e-02
46 46 9.484073 9.047095 1 -1.809742e-02 -4.599961e-02 -8.647351e-02
47 47 8.911947 9.084459 1 -1.725103e-02 -4.385885e-02 -8.246995e-02
48 48 10.201343 9.120358 1 -1.640465e-02 -4.171809e-02 -7.846639e-02
49 49 8.908016 9.154849 1 -1.555827e-02 -3.957733e-02 -7.446283e-02
50 50 8.202368 9.187988 1 -1.471188e-02 -3.743657e-02 -7.045927e-02
51 51 7.432851 9.219828 1 -1.386550e-02 -3.529581e-02 -6.645572e-02
52 52 8.063268 9.250419 1 -1.301912e-02 -3.315505e-02 -6.245216e-02
53 53 10.155756 9.279810 1 -1.217273e-02 -3.101429e-02 -5.844860e-02
54 54 7.905281 9.308049 1 -1.132635e-02 -2.887353e-02 -5.444504e-02
55 55 9.688337 9.335181 1 -1.047997e-02 -2.673277e-02 -5.044148e-02
56 56 9.437176 9.361249 1 -9.633582e-03 -2.459201e-02 -4.643793e-02
57 57 9.165873 9.386295 1 -8.787198e-03 -2.245125e-02 -4.243437e-02
58 58 9.120195 9.410358 1 -7.940815e-03 -2.031049e-02 -3.843081e-02
59 59 9.955840 9.433479 1 -7.094432e-03 -1.816973e-02 -3.442725e-02
60 60 9.314230 9.455692 1 -6.248048e-03 -1.602897e-02 -3.042369e-02
61 61 9.706852 9.477035 1 -5.401665e-03 -1.388821e-02 -2.642014e-02
62 62 9.615765 9.497541 1 -4.555282e-03 -1.174746e-02 -2.241658e-02
63 63 7.918843 9.517242 1 -3.708898e-03 -9.606695e-03 -1.841302e-02
64 64 9.352892 9.536172 1 -2.862515e-03 -7.465935e-03 -1.440946e-02
65 65 9.722685 9.554359 1 -2.016132e-03 -5.325176e-03 -1.040590e-02
66 66 9.186888 9.571832 1 -1.169748e-03 -3.184416e-03 -6.402346e-03
67 67 8.652299 9.588621 1 -3.233650e-04 -1.043656e-03 -2.398788e-03
68 68 8.681421 9.604751 1 5.230184e-04 1.097104e-03 1.604770e-03
69 69 10.279181 9.620249 1 1.369402e-03 3.237864e-03 5.608328e-03
70 70 9.314963 9.635140 1 2.215785e-03 5.378623e-03 9.611886e-03
71 71 6.897151 9.649446 1 3.062168e-03 7.519383e-03 1.361544e-02
72 72 9.343135 9.663191 1 3.908552e-03 9.660143e-03 1.761900e-02
73 73 9.273135 9.676398 1 4.754935e-03 1.180090e-02 2.162256e-02
74 74 10.041796 9.689086 1 5.601318e-03 1.394166e-02 2.562612e-02
75 75 9.724713 9.701278 1 6.447702e-03 1.608242e-02 2.962968e-02
76 76 8.593517 9.712991 1 7.294085e-03 1.822318e-02 3.363323e-02
77 77 7.401988 9.724244 1 8.140468e-03 2.036394e-02 3.763679e-02
78 78 10.258688 9.735057 1 8.986852e-03 2.250470e-02 4.164035e-02
79 79 10.037192 9.745446 1 9.833235e-03 2.464546e-02 4.564391e-02
80 80 9.637510 9.755427 1 1.067962e-02 2.678622e-02 4.964747e-02
81 81 8.887625 9.765017 1 1.152600e-02 2.892698e-02 5.365102e-02
82 82 9.922013 9.774230 1 1.237239e-02 3.106774e-02 5.765458e-02
83 83 10.466709 9.783083 1 1.321877e-02 3.320850e-02 6.165814e-02
84 84 11.132830 9.791588 1 1.406515e-02 3.534926e-02 6.566170e-02
85 85 10.154038 9.799760 1 1.491154e-02 3.749002e-02 6.966526e-02
86 86 10.433068 9.807612 1 1.575792e-02 3.963078e-02 7.366881e-02
87 87 9.666781 9.815156 1 1.660430e-02 4.177154e-02 7.767237e-02
88 88 9.478004 9.822403 1 1.745069e-02 4.391230e-02 8.167593e-02
89 89 10.002749 9.829367 1 1.829707e-02 4.605306e-02 8.567949e-02
90 90 7.593259 9.836058 1 1.914345e-02 4.819382e-02 8.968305e-02
91 91 10.915754 9.842486 1 1.998984e-02 5.033458e-02 9.368660e-02
92 92 8.855580 9.848662 1 2.083622e-02 5.247534e-02 9.769016e-02
93 93 8.884683 9.854596 1 2.168260e-02 5.461610e-02 1.016937e-01
94 94 9.757451 9.860298 1 2.252899e-02 5.675686e-02 1.056973e-01
95 95 10.222361 9.865775 1 2.337537e-02 5.889762e-02 1.097008e-01
96 96 9.090410 9.871038 1 2.422175e-02 6.103838e-02 1.137044e-01
97 97 8.837872 9.876095 1 2.506814e-02 6.317914e-02 1.177080e-01
98 98 9.413135 9.880953 1 2.591452e-02 6.531990e-02 1.217115e-01
99 99 9.295531 9.885621 1 2.676090e-02 6.746066e-02 1.257151e-01
100 100 9.698118 9.890106 1 2.760729e-02 6.960142e-02 1.297186e-01

You can define list(all=pdIdent(~Zt - 1)) in R's global environment using the reval() method:
In [55]:
import rpy2.robjects as ro
import pandas.rpy.common as com
mydata = ro.r['data.frame']
read = ro.r['read.csv']
head = ro.r['head']
summary = ro.r['summary']
library = ro.r['library']
In [56]:
formula = '~ time'
library('lmeSplines')
ro.reval('data(smSplineEx1)')
ro.reval('smSplineEx1$all <- rep(1,nrow(smSplineEx1))')
ro.reval('smSplineEx1$Zt <- smspline(~ time, data=smSplineEx1)')
ro.reval('rnd <- list(all=pdIdent(~Zt - 1))')
#result = ro.r.smspline(formula=ro.r(formula), data=ro.r.smSplineEx1) #notice: data=ro.r.smSplineEx1
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
In [57]:
print com.convert_robj(result.rx('coefficients'))
{'coefficients': {'random': {'all': Zt1 Zt2 Zt3 Zt4 Zt5 Zt6 Zt7 \
1 0.000509 0.001057 0.001352 0.001184 0.000869 0.000283 -0.000424
Zt8 Zt9 Zt10 ... Zt89 Zt90 Zt91 \
1 -0.001367 -0.002325 -0.003405 ... -0.001506 -0.001347 -0.000864
Zt92 Zt93 Zt94 Zt95 Zt96 Zt97 Zt98
1 -0.000631 -0.000569 -0.000392 -0.000049 0.000127 0.000114 0.000071
[1 rows x 98 columns]}, 'fixed': (Intercept) 6.498800
time 0.038723
dtype: float64}}
Be careful: the result is a little awkward to work with. It is a nested dictionary, which cannot be converted into a pandas.DataFrame directly.
You can access y in smSplineEx1 with ro.r.smSplineEx1.rx('y'), similar to smSplineEx1$y in R.
Now suppose you have the result variable in Python, generated by
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
and you want to plot it using R (instead of, say, matplotlib). You then need to assign it to a variable in the R workspace:
ro.R().assign('result', result)
Now there is a variable named result in the R workspace, which you can access using ro.r.result.
Plotting it using R:
In [17]:
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
Out[17]:
rpy2.rinterface.NULL
In [21]:
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
Out[21]:
rpy2.rinterface.NULL
Or you can do everything in R:
ro.reval('result <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))')
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
and access the R variables using ro.r.smSplineEx1.rx2('time') or ro.r.result.
Edit
Note that some R objects cannot be converted to a pandas.DataFrame as-is because they mix data structures:
In [62]:
ro.r["smSplineEx1"]
Out[62]:
<DataFrame - Python:0x108525518 / R:0x109e5da38>
[FloatVe..., FloatVe..., FloatVe..., FloatVe..., Matrix]
time: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x10807e518 / R:0x1022599e0>
[1.000000, 2.000000, 3.000000, ..., 98.000000, 99.000000, 100.000000]
y: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x108525a70 / R:0x102259d30>
[5.797149, 5.469222, 4.567237, ..., 9.413135, 9.295531, 9.698118]
y.true: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085257a0 / R:0x10225dfb0>
[4.235263, 4.461302, 4.678477, ..., 9.880953, 9.885621, 9.890106]
all: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085258c0 / R:0x10225e300>
[1.000000, 1.000000, 1.000000, ..., 1.000000, 1.000000, 1.000000]
Zt: <class 'rpy2.robjects.vectors.Matrix'>
<Matrix - Python:0x108525908 / R:0x103e8ba00>
[1.168560, 0.148786, -0.054492, ..., -0.030141, -0.030610, 0.757597]
Notice that we have a few vectors but the last one is a Matrix. We have to convert smSplineEx1 to Python in two parts.
In [63]:
ro.r["smSplineEx1"].names
Out[63]:
<StrVector - Python:0x108525dd0 / R:0x1042ca7c0>
['time', 'y', 'y.true', 'all', 'Zt']
In [64]:
print com.convert_robj(ro.r["smSplineEx1"].rx(ro.IntVector(range(1, 5)))).head()
time y y.true all
1 1 5.797149 4.235263 1
2 2 5.469222 4.461302 1
3 3 4.567237 4.678477 1
4 4 3.645763 4.887137 1
5 5 5.094126 5.087615 1
In [65]:
print com.convert_robj(ro.r["smSplineEx1"].rx2('Zt')).head(2)
0 1 2 3 4 5 6 \
1 1.168560 2.071261 2.944953 3.782848 4.584037 5.348937 6.078121
2 0.148786 1.072013 1.948857 2.789264 3.593423 4.361817 5.095016
7 8 9 ... 88 89 90 \
1 6.772184 7.431719 8.057321 ... 0.933947 0.769591 0.619420
2 5.793601 6.458153 7.089255 ... 0.904395 0.745337 0.599976
91 92 93 94 95 96 97
1 0.484029 0.36401 0.259959 0.172468 0.102133 0.049547 0.015305
2 0.468893 0.35267 0.251890 0.167135 0.098986 0.048026 0.014836
[2 rows x 98 columns]
com.convert_robj(ro.r["smSplineEx1"]) will not work because of the mixed data structures.
