I can make numpy record arrays with recfromcsv:
data = recfromcsv(dataset1, names=True)
xvars = ['exp','exp_sqr','wks','occ','ind','south','smsa','ms','union','ed','fem','blk']
y = data['lwage']
X = data[xvars]
c = ones_like(data['lwage'])
X = add_field(X, 'constant', c)
But I have no idea how to get this into an R data frame usable by rpy2. This is what I tried:
p = roptim(theta,robjects.r['ols'],method="BFGS",hessian=True ,y= robjects.FloatVector(y),X = base.matrix(X))
ValueError: Nothing can be done for the type <class 'numpy.core.records.recarray'> at the moment.
p = roptim(theta,robjects.r['ols'],method="BFGS",hessian=True ,y= robjects.FloatVector(y),X = base.matrix(array(X)))
ValueError: Nothing can be done for the type <type 'numpy.ndarray'> at the moment.
Just to get an rpy2 DataFrame from a csv file, in rpy2 2.3 you can simply do:
df = robjects.DataFrame.from_csvfile('filename.csv')
See the rpy2 documentation for DataFrame.from_csvfile for details.
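A minimal sketch of using it, assuming rpy2 2.3+ (the file name is just a placeholder):
import rpy2.robjects as robjects

df = robjects.DataFrame.from_csvfile('filename.csv')   # placeholder path
print(robjects.r['colnames'](df))   # column names as seen on the R side
print(robjects.r['nrow'](df))       # number of rows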
I'm not 100% sure I understand your issue, but a couple of things:
1) If it's OK, you can read the csv into R directly, that is:
robjects.r('name <- read.csv("filename.csv")')
After which you can refer to the resulting data frame in later functions.
Or 2) you can convert a numpy array into a data frame. To do this you need to import the package 'rpy2.robjects.numpy2ri' (and, on newer rpy2 versions, call its activate() function).
Then you could do something like:
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()  # needed on newer rpy2 versions; the import alone used to suffice

array_ex = np.array([[4, 3], [3, 2], [1, 5]])
rmatrix = robjects.r('matrix')
rdf = robjects.r('data.frame')
rlm = robjects.r('lm')
mat_ex = rmatrix(array_ex, ncol=2)     # R matrix built from the numpy array
df_ex = rdf(mat_ex)                    # data.frame with default column names X1, X2
fit_ex = rlm('X1 ~ X2', data=df_ex)
Or whatever other functions you wanted.
There may be a more direct way - I get frustrated going between the two data types and so I am much more likely to use option 1) if possible.
Would either of these methods get you to where you need to be?
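In case it helps to connect this back to the recarray in the question, here is a rough sketch of option 2) applied to it; the column stacking and the numpy2ri.activate() call are my additions, not something in the original code:
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()   # newer rpy2 versions want this explicit call

# Flatten the record array into a plain (n, k) float matrix, which the
# numpy converter knows how to pass to R.
X_plain = np.column_stack([X[name] for name in X.dtype.names]).astype(float)
r_X = robjects.r['matrix'](X_plain, ncol=X_plain.shape[1])
r_y = robjects.FloatVector(y)
# r_X and r_y can then be handed to optim() / your ols() on the R side.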
So I'm using OMNeT++, a discrete event network simulator, to simulate different networking scenarios. At some point one can further process the OMNeT++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalue arrays are stored in a single cell of the .csv file. When imported into a pandas DataFrame, I get something like this:
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab
def runningAvg(x):
    sigma_x = np.cumsum(x)
    sigma_n = np.arange(1, x.size + 1)
    return sigma_x / sigma_n

for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
... to obtain this ...
My question is: what's best in terms of performance:
use the data as is, meaning using those arrays inside each cell, looping over the DF to plot each array;
convert those arrays to pd.Series. In this case, would it be better to still have the module as the index?
would I benefit from unnesting those arrays into pd.Series?
thanks!
Well, I've poked around a bit and it seems that converting the OMNeT++ data to pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using the OMNeT++ data as is, with lists inside the pandas DataFrame.
figure(1)
start = datetime.datetime.now()

for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)

total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571 seconds.
2) Converting the OMNeT++ data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()
t = df1.vectime
v = df1.vecvalue
t = t.apply(pd.Series)
v = v.apply(pd.Series)
t = t.T
v = v.T
sigma_v = np.cumsum(v)
sigma_n = np.arange(1,v.shape[0]+1)
sigma = sigma_v.T / sigma_n
plot(t,sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the latter, the total is 0.57266 seconds.
So it seems that I'll stick to method 1, looping over the different rows.
What is the purpose of the cols method in PyTables? I have a big dataset and I am interested in reading only one column from it.
These two methods give me the same time, but totally different memory consumption for the resulting variable:
import tables
from sys import getsizeof
f = tables.open_file(myhdf5_path, 'r')
# These two methods takes the same amount of time
x = f.root.set1[:500000]['param1']
y = f.root.set1.cols.param1[:500000]
# But totally different memory consumption:
print(getsizeof(x)) # gives me 96
print(getsizeof(y)) # gives me 2000096
They are both the same numpy array data type. Can anybody explain to me what the purpose of the cols method is?
%time x = f.root.set1[:500000]['param1'] # gives ~7ms
%time y = f.root.set1.cols.param1[:500000] # gives also about 7ms
Your question caught my curiosity. I typically use table.read(field='name') because it complements the other table.read_*() methods I use (for example: .read_where() and .read_coordinates()).
After reviewing the docs, I found at least 4 ways to read one column of table data with PyTables. You showed 2, and there are 2 more:
table.read(field='name')
table.col('name') (singular)
I ran some tests with all 4, plus 2 tests on the entire table (dataset) for additional comparisons. I called getsizeof() for all 6 objects, and the size varies based on method. Although all 4 behave the same with numpy indexing, I suspect there's a difference in the returned object. However, I'm not a PyTables developer, so this is more inference than fact. It could also be that getsizeof() interprets the object differently.
Code Below:
import tables as tb
import numpy as np
from sys import getsizeof
# Create h5 file with 1 dataset
h5f = tb.open_file('SO_55254831.h5', 'w')
mydtype = np.dtype([('param1',float),('param2',float),('param3',float)])
arr = np.array(np.arange(3.*500000.).reshape(500000,3))
recarr = np.core.records.array(arr,dtype=mydtype)
h5f.create_table('/', 'set1', obj=recarr )
# Close, then Reopen file READ ONLY
h5f.close()
h5f = tb.open_file('SO_55254831.h5', 'r')
testds_1 = h5f.root.set1
print ("\nFOR: testds_1 = h5f.root.set1")
print (testds_1.dtype)
print (testds_1.shape)
print (getsizeof(testds_1)) # gives 128
testds_2 = h5f.root.set1.read()
print ("\nFOR: testds_2 = h5f.root.set1.read()")
print (getsizeof(testds_2)) # gives 12000096
x = h5f.root.set1[:500000]['param1']
print ("\nFOR: x = h5f.root.set1[:500000]['param1']")
print(getsizeof(x)) # gives 96
print ("\nFOR: y = h5f.root.set1.cols.param1[:500000]")
y = h5f.root.set1.cols.param1[:500000]
print(getsizeof(y)) # gives 4000096
print ("\nFOR: z = h5f.root.set1.read(stop=500000,field='param1')")
z = h5f.root.set1.read(stop=500000,field='param1')
print(getsizeof(z)) # also gives 4000096
print ("\nFOR: a = h5f.root.set1.col('param1')")
a = h5f.root.set1.col('param1')
print(getsizeof(a)) # also gives 4000096
h5f.close()
Output from Above:
FOR: testds_1 = h5f.root.set1
[('param1', '<f8'), ('param2', '<f8'), ('param3', '<f8')]
(500000,)
128
FOR: testds_2 = h5f.root.set1.read()
12000096
FOR: x = h5f.root.set1[:500000]['param1']
96
FOR: y = h5f.root.set1.cols.param1[:500000]
4000096
FOR: z = h5f.root.set1.read(stop=500000,field='param1')
4000096
FOR: a = h5f.root.set1.col('param1')
4000096
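One likely explanation for the tiny getsizeof(x): slicing the whole table first materializes all three columns, and picking a field from that record array returns a NumPy view, so getsizeof() only sees the ~96-byte array header while the 4 MB buffer belongs to the parent array. The other three readers return standalone column arrays that own their buffers. A quick check with the objects from the script above (my own sketch, not from the PyTables docs):
# x was made by reading full records and then selecting one field -> a view
print(x.base is not None, x.flags['OWNDATA'])   # expect: True False
print(x.nbytes)                                 # 4,000,000 bytes of float64 either way
# y, z and a were read as standalone column arrays, so they should own their data
print(y.flags['OWNDATA'], z.flags['OWNDATA'], a.flags['OWNDATA'])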
I have two pandas dataframes that on inspection look identical. One was created using the Pandas builtin:
df.corr(method='pearson')
While the other was created with a custom function:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def cor_matrix(dataframe, method):
    coeffmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    pvalmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    for i in range(dataframe.shape[1]):
        for j in range(dataframe.shape[1]):
            x = np.array(dataframe[dataframe.columns[i]])
            y = np.array(dataframe[dataframe.columns[j]])
            bad = ~np.logical_or(np.isnan(x), np.isnan(y))
            if method == 'spearman':
                corrtest = spearmanr(np.compress(bad, x), np.compress(bad, y))
            if method == 'pearson':
                corrtest = pearsonr(np.compress(bad, x), np.compress(bad, y))
            coeffmat.iloc[i, j] = corrtest[0]
            pvalmat.iloc[i, j] = corrtest[1]
    return (coeffmat, pvalmat)
Both look identical and have the same type (pandas.core.frame.DataFrame), and their entries are also of the same type (numpy.float64).
However when I try to plot these using:
import matplotlib.pyplot as plt
plt.imshow((df))
Only the dataframe created with the pandas built-in function works. For the other dataframe I receive the error TypeError: Image data cannot be converted to float. Can anyone explain what is going on, how the two dataframes are different, and what can be done to address the error?
Edit: it looks as though there is one difference. When I convert the dataframes to numpy arrays, the one that doesn't work comes out with dtype=object. Is there a way to remove this?
Amending the function to specify the dataframe as float fixed the issue:
def cor_matrix(dataframe, method):
    coeffmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    pvalmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns)
    for i in range(dataframe.shape[1]):
        for j in range(dataframe.shape[1]):
            x = np.array(dataframe[dataframe.columns[i]])
            y = np.array(dataframe[dataframe.columns[j]])
            bad = ~np.logical_or(np.isnan(x), np.isnan(y))
            if method == 'spearman':
                corrtest = spearmanr(np.compress(bad, x), np.compress(bad, y))
            if method == 'pearson':
                corrtest = pearsonr(np.compress(bad, x), np.compress(bad, y))
            coeffmat.iloc[i, j] = corrtest[0]
            pvalmat.iloc[i, j] = corrtest[1]
    # Convert to float type, otherwise the object dtype can cause problems when e.g. plotting
    coeffmat = coeffmat.apply(pd.to_numeric, errors='ignore')
    pvalmat = pvalmat.apply(pd.to_numeric, errors='ignore')
    return (coeffmat, pvalmat)
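A shorter alternative (a sketch, not tested against the exact data above) is to cast the finished frames, or to give them a float dtype when they are created inside the same function:
# cast once after the loops ...
coeffmat = coeffmat.astype(float)
pvalmat = pvalmat.astype(float)
# ... or declare the dtype up front when building the empty frames:
coeffmat = pd.DataFrame(index=dataframe.columns, columns=dataframe.columns, dtype=float)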
I want to do a simple normalization of the data in a numpy ndarray. Specifically I want (X - mu) / sigma. I tried using the exact code I found in earlier questions and kept getting TypeError: cannot perform reduce with flexible type. I gave up and tried a simpler normalization, (X - mu) / X.ptp(), and got the same error.
import csv
import urllib.request
import numpy as np

# Import comma-separated data from the UCI archive
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
urllib.request.urlretrieve(url, 'F:/Python/Wine Dataset/wine_data')

# Open the file
filename = 'F:/Python/Wine Dataset/wine_data'
raw_data = open(filename, 'rt')

# Put raw_data into a numpy.ndarray
reader = csv.reader(raw_data)
x = list(reader)
data = np.array(x)

# First column is the classification, the other columns are features
y = data[:, 0]
X_raw = data[:, 1:13]

# Attempt at normalizing the data - really wanted (X - mu) / sigma, gave up;
# even this simplified version doesn't work.
# Latest error is: TypeError: cannot perform reduce with flexible type
X = (X_raw - X_raw.min(0)) / X_raw.ptp(0)
print(X)
Finally figured it out. The line data = np.array(x) returned an array containing string data.
It was:
data = np.array(x)
Changed to:
data = np.array(x).astype(np.float)
After that everything worked. A simple issue that cost me hours.
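Side note: np.float is deprecated in recent NumPy releases, so plain float (or np.float64) is the safer spelling today. A minimal sketch of the (X - mu) / sigma normalization the question originally wanted, assuming x is the list of rows read from the csv as above:
data = np.array(x).astype(float)                        # cast the string fields to float64
y = data[:, 0]                                          # first column: class label
X_raw = data[:, 1:]                                     # remaining columns: features
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)    # (X - mu) / sigma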
I am using rpy2 to do some statistical analyses in R via python. After importing a data file I want to sort the data and do a couple other things with it in R. Once I import the data and try to sort the data I get this error message:
TypeError: 'tuple' object cannot be interpreted as an index
The last 2 lines of my code are where I am trying to sort my data, and the few lines before that are where I import the data.
root = os.getcwd()
dirs = [os.path.abspath(name) for name in os.listdir(".") if os.path.isdir(name)]

for d in dirs:
    os.chdir(d)
    cwd = os.getcwd()
    files_to_analyze = glob.glob("*.afa")
    for f in files_to_analyze:
        afa_file = os.path.join(cwd + '/' + f)
        readfasta = robjects.r['read.fasta']
        mydatafasta = readfasta(afa_file)
        names = robjects.r['names']
        IDnames = names(mydatafasta)
        substr = robjects.r['substr']
        ID = substr(IDnames, 1, 8)
        #print ID
        readtable = robjects.r['read.table']
        gps_file = os.path.join(root + '/' + "GPS.txt")
        xy = readtable(gps_file, sep="\t")
        #print xy
        order = robjects.r['order']
        gps = xy[order(xy[:,2]),]
I don't understand why my data is a tuple and not a dataframe that I can manipulate further using R. Is there a way to transform this into a workable dataframe that can be used by R?
My xy data look like:
Species AB425882 35.62 -83.4
Species AB425905 35.66 -83.33
Species KC413768 37.35 127.03
Species AB425841 35.33 -82.82
Species JX402724 29.38 -82.2
I want to sort the data alphanumerically by the second column using the order function in R.
There is quite a bit of guesswork here, since the example is not sufficient to reproduce what you have.
In the following, if xy is an R data frame, you will want to use the rx delegator dedicated to R-style subsetting to perform R-style subsetting (see the rpy2 docs):
# Note R indices are 1-based while Python indices are 0-based.
# When using R-style subsetting the indices are 1-based.
gps = xy.rx(order(xy.rx(True, 2)), True)
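As a rough end-to-end sketch of that pattern on a toy data frame (the column names and values below are made up, not taken from your GPS.txt):
import rpy2.robjects as robjects

order = robjects.r['order']
# Toy stand-in for the GPS table
xy = robjects.DataFrame({'id': robjects.StrVector(['KC413768', 'AB425882', 'AB425905']),
                         'lat': robjects.FloatVector([37.35, 35.62, 35.66])})
# R-style subsetting: rows reordered by the 'id' column, all columns kept
gps = xy.rx(order(xy.rx2('id')), True)
print(gps)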