I am trying to build this converter for one of my personal projects using numpy and I am getting a MemoryError. I am new to Python. The code works fine for small data but breaks when I give it 5 MB of data as input (data attached). Here is the code. Could experts point out where the memory is blowing up? Link to data can be found here
import numpy as np
import gc
"""
USAGE: convert(data, cols)
data - numpy array of data
cols - tuple of columns to process. These columns should be categorical columns.
IMP: Indexing of columns in data starts at 0. You can't index the last column.
Ex: if you want to index the second column here, then
data
a b c
a b c
x y z
cols=(1,)
If you want to index the first and second, then
cols=(0,1)
All 3:
cols=(0,1,2)
You can also skip a numeric column that you don't want to encode, e.g.
cols=(0,2) will skip column 1
"""
def lookupBuilder(strArray):
    # Map each unique string to an integer code, starting at 1
    a = np.arange(len(strArray)) + 1
    lookups = {k: v for (k, v) in zip(strArray, a)}
    return lookups
def convert(data, cols):
    for ix, i in enumerate(cols):
        col = data[:, i:i+1]
        lookup_data = lookupBuilder(np.unique(col))
        for idx, value in enumerate(col):
            col[idx] = lookup_data[value[0]]
        np.delete(data, i, 1)
        gc.collect()
        np.insert(data, i, col, axis=1)
    return data

if __name__ == "__main__":
    pass
Error
Traceback (most recent call last):
File "C:\MLDatabases\python_scripts\MLP.py", line 230, in <module>
data=cc.convert(data,(1,2,3,4,5,6,7,8,9,13,19))
File "C:\MLDatabases\python_scripts\categorical_converter.py", line 49, in convert
np.insert(data,i,col,axis=1)
File "C:\python\lib\site-packages\numpy\lib\function_base.py", line 4906, in insert
new = empty(newshape, arr.dtype, arrorder)
MemoryError
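For context, a hypothetical usage sketch based on the docstring, at a small scale where the converter works as intended (the MemoryError only appears on the 5 MB input):

import numpy as np

# Hypothetical small-scale usage per the docstring: encode columns 0 and 2,
# leaving the numeric middle column alone.
data = np.array([['a', '1', 'x'],
                 ['b', '2', 'y'],
                 ['a', '3', 'x']])
encoded = convert(data, (0, 2))
print(encoded)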
I'm writing a script to perform LLoD analysis for qPCR assays for my lab. I import the relevant columns from the .csv of data from the instrument using pandas.read_csv() with the usecols parameter, make a list of the unique values of RNA quantity/concentration column, and then I need to determine the detection rate / hit rate at each given concentration. If the target is detected, the result will be a number; if not, it'll be listed as "TND" or "Undetermined" or some other non-numeric string (depends on the instrument). So I wrote a function that (should) take a quantity and the dataframe of results and return the probability of detection for that quantity. However, on running the script, I get the following error:
Traceback (most recent call last):
File "C:\Python\llod_custom.py", line 34, in <module>
prop[idx] = hitrate(val, data)
File "C:\Python\llod_custom.py", line 29, in hitrate
df = pd.to_numeric(list[:,1], errors='coerce').isna()
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 1)' is an invalid key
The idea in the line that's throwing the error (df = pd.to_numeric(list[:,1], errors='coerce').isna()) is to change any non-numeric values in the column to NaN, then get a boolean array telling me whether a given row's entry is NaN, so I can count the number of numeric entries with df.sum() later.
I'm sure it's something that would be obvious to anyone who's worked with pandas dataframes, but I haven't used dataframes in Python before, so I'm at a loss. I'm also much more familiar with C and JavaScript, so a language as flexible as Python can actually be a bit confusing to me. Any help would be greatly appreciated.
N.B. the conc column will consist of 5 to 10 different values, each repeated 5-10 times (i.e. 5-10 replicates at each of the 5-10 concentrations); the detect column will contain either a number or a character string in each row -- numbers mean success, strings mean failure... For my purposes the value of the numbers is irrelevant, I only need to know if the target was detected or not for a given replicate. My script (up to this point) follows:
import os
import pandas as pd
import numpy as np
import statsmodels as sm
from scipy.stats import norm
from tkinter import filedialog
from tkinter import *
# initialize tkinter
root = Tk()
root.withdraw()
# prompt for data file and column headers, then read those columns into a dataframe
print("In the directory prompt, select the .csv file containing data for analysis")
path = filedialog.askopenfilename()
conc = input("Enter the column header for concentration/number of copies: ")
detect = input("Enter the column header for target detection: ")
tnd = input("Enter the value listed when a target is not detected (e.g. \"TND\", \"Undetected\", etc.): ")
data = pd.read_csv(path, usecols=[conc, detect])
# create a list of unique values for quantity of RNA, and initialize vectors of the
# same length to store probabilities and probit scores for each
qtys = data[conc].unique()
prop = probit = [0] * len(qtys)
# Function to get the hitrate/probability of detection for a given quantity
def hitrate(qty, dataFrame):
    list = dataFrame[dataFrame.iloc[:,0] == qty]
    df = pd.to_numeric(list[:,1], errors='coerce').isna()
    return (len(df) - (len(df)-df.sum()))/len(df)
# iterate over quantities to calculate the corresponding probability of Detection
# and its associate probit score
for idx, val in enumerate(qtys):
    prop[idx] = hitrate(val, data)
    probit[idx] = norm.ppf(hitrate(val, data))
# create an array of the quantities with their associated probabilities & Probit scores
hitTable = vstack([qtys,prop,probit])
A sample dataframe can be created with:
d = {'qty':[1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20, 50,50,50,50,50, 100,100,100,100,100], 'result':['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND', 5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)
Then just use exData as the dataframe data in the original code
EDIT: I've fixed the problem by tweaking Loic RW's answer slightly. The function hitrate should be
def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s)-t_s.sum())/len(t_s)
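A quick sanity check against the exData sample dataframe above: at qty 10 the results are ['TND', 5, 'TND', 5, 'TND'], i.e. 2 numeric hits out of 5 replicates, so the fixed function should return 0.4.

# Sanity check using the exData sample frame defined above
print(hitrate(10, exData))  # expected output: 0.4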
Does the following achieve what you want? I made some assumptions on the structure of your data.
def hitrate(qty, df):
    target_subset = df[df.qty == qty].target
    target_subset = target_subset.apply(pd.to_numeric, args=('coerce',)).isna()
    return 1-((target_subset.sum())/len(target_subset))
If I run the following:
data = pd.DataFrame({'qty': [1, 2, 2, 2, 3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
hitrate(2, data)
I get:
0.33333333333333337
Ok, I've got a weird one. I might have found a bug, but let's assume I made a mistake first. Anyways, I am running into some issues with pandas.
I want to look at the last two rows of a dataframe and compare the values in column 'Col'. I run the code inside a for loop because it needs to run over all files in a folder. This code:
import pandas
for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.iloc[1]['Col']
    valB = df.iloc[0]['Col']
This mostly works: I ran it over 1040 data frames without issues. Then, at number 1041 of about 2000, it causes this error:
Traceback (most recent call last):
File "/path/to/script.py", line 206, in <module>
valA = df.iloc[1]['Col']
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1373, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1830, in _getitem_axis
self._is_valid_integer(key, axis)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexing.py", line 1713, in _is_valid_integer
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
From this I thought the data frame might be too short. It shouldn't be (I test for this elsewhere), but OK, mistakes happen, so let's print(df) to figure this out.
If I print(df) before the assignment of .tail(2) like this:
print(df)
df = df[['Col']].tail(2)
valA = df.iloc[1]['Col']
valB = df.iloc[0]['Col']
I see a data frame of 37 rows. In my world, 37 > 2.
Now, let's move the print(df) down one line like so:
df = df[['Col']].tail(2)
print(df)
The output is usually two rows, as one would expect. However, at the error, df.tail(2) returns a single-row data frame out of a data frame with 37 rows. Not two rows, one row. This only happens for one item in the loop; all others work fine. If I skip over the item manually like so:
for item in itemList:
    if item == 'troublemaker':
        continue
... the script runs through to the end. No errors happen.
I must add that I am fairly new to all this, so I might be overlooking something entirely. Am I? Suggestions appreciated. Thanks.
Edit: Here's the output of print(df) in case of the error
Col
Date
2018-11-30 True
and in all other cases:
Col
Date
2018-10-31 False
2018-11-30 True
Since that data frame does not have a second row, df.iloc[1] is what raises the error. Try using tail and head instead, but be aware that for your single-row sample df, valA and valB will be the same value:
import pandas

for item in itemList:
    df = df[['Col']].tail(2)
    valA = df.tail(1)['Col']
    valB = df.head(1)['Col']
I don't think it's a bug since it only happens to one df in 2000. Can you show that df?
I also don't think you need tail here. Have you tried
valA = df.iloc[-2]['Col']
valB = df.iloc[-1]['Col']
to get the last values.
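If you'd rather not skip the item, a hedged sketch that guards against the short frame (the actual trigger here) could look like this:

tail2 = df[['Col']].tail(2)
if len(tail2) >= 2:
    valA = tail2['Col'].iloc[-1]  # last value
    valB = tail2['Col'].iloc[-2]  # second-to-last value
else:
    # single-row (or empty) frame: decide how to handle it, e.g. skip the item
    pass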
I have some very noisy (astronomy) data in csv format. Its shape is (815900, 2): 815k points giving the mass of a disk at a given time. The fluctuations are pretty noticeable when you look at the data close up. For example, here is a snippet of the data where the first column is time in seconds and the second is mass in kg:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
So it looks like there is a 1.53E+028 data point of noise, and also probably the 2.19E+028 and 2.35E+028 points.
To fix this, I am trying to set up a Python script that will read in the csv data, apply a restriction so that if the mass is e.g. < 2.35E+028 it removes the whole row, and then create a new csv file with only the "good" data points:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41242600,2.40936E+028
Following the top answer by n8henrie to this old question, I have this so far:
import pandas as pd
import csv
# Here are the locations of my csv file of my original data and an EMPTY csv file that will contain my good, noiseless set of data
originaldata = '/Users/myname/anaconda2/originaldata.csv'
gooddata = '/Users/myname/anaconda2/gooddata.csv'
# I use pandas to read in the original data because then I can separate the columns of time as 'T' and mass as 'M'
originaldata = pd.read_csv('originaldata.csv',delimiter=',',header=None,names=['t','m'])
# Numerical values of the mass values
M = originaldata['m'].values
# Now to put a restriction in
for row in M:
    new_row = []
    for column in row:
        if column > 2.35E+028:
            new_row.append(column)
    csv.writer(open(newfile,'a')).writerow(new_row)
print('\n\n')
print('After:')
print(open(newfile).read())
However, when I run this, I get this error:
TypeError: 'numpy.float64' object is not iterable
I know the first column (time) is dtype int64 and the second column (mass) is dtype float64... but as a beginner, I'm still not quite sure what this error means or where I'm going wrong. Any help at all would be appreciated. Thank you very much in advance.
You can select rows by a boolean operation. Example:
import pandas as pd
from io import StringIO
data = StringIO('''\
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
''')
df = pd.read_csv(data,names=['t','m'])
good = df[df.m > 2.35e+28]
out = StringIO()
good.to_csv(out,index=False,header=False)
print(out.getvalue())
Output:
40023700,2.40896e+28
40145700,2.44487e+28
40267700,2.44487e+28
40389700,2.44478e+28
40755400,2.44496e+28
40877200,2.44489e+28
40999000,2.44489e+28
41242600,2.40936e+28
This returns a single column: M = originaldata['m'].values. So when you do for row in M:, each row is a single scalar value, and you can't iterate over a scalar.
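If you want to keep a row-by-row approach, here is a minimal sketch along those lines, reusing the originaldata file and the gooddata path from the question (the boolean-mask answer above is still the more idiomatic fix):

import csv
import pandas as pd

originaldata = pd.read_csv('originaldata.csv', header=None, names=['t', 'm'])
T = originaldata['t'].values  # time values
M = originaldata['m'].values  # mass values

with open('gooddata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for t, m in zip(T, M):   # iterate over both columns together
        if m > 2.35e+28:     # keep only the "good" rows
            writer.writerow([t, m])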
This very simple piece of code,
# imports...
from lifelines import CoxPHFitter
import pandas as pd
src_file = "Pred.csv"
df = pd.read_csv(src_file, header=0, delimiter=',')
df = df.drop(columns=['score'])
cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
produces an error:
Traceback (most recent call last):
  File "C:/Users/.../predictor.py", line 11, in <module>
    cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 298, in fit
    self._check_values(df)
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 323, in _check_values
    cols = str(list(X.columns[low_var]))
  File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\pandas\core\indexes\base.py", line 1754, in __getitem__
    result = getitem(key)
IndexError: boolean index did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
However, when I print df itself, everything looks all right. As you can see, the failure happens entirely inside the library, and the library's own examples work fine.
Without knowing what your data look like: I had the same error, and it was resolved when I removed all but the duration, event, and coefficient columns from the pandas df I was using. That is, I had a lot of extra columns in the df that were confusing the Cox PH fitter, since you don't actually specify which coefficients to include as an argument to cph.fit().
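For example, something like this, with hypothetical covariate names ('age', 'sex') standing in for whatever your real columns are:

# Keep only the duration column, the event column, and the covariates
# you actually want the Cox model to fit ('age' and 'sex' are placeholders).
cols_to_keep = ['Length', 'Status', 'age', 'sex']
cph = CoxPHFitter()
cph.fit(df[cols_to_keep], duration_col='Length', event_col='Status')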
I am an avid user of R, but recently switched to Python for a few different reasons. However, I am struggling a little to run the vector AR model in Python from statsmodels.
Q#1. I get an error when I run this, and I have a suspicion it has something to do with the type of my vector.
import numpy as np
import statsmodels.tsa.api
from statsmodels import datasets
import datetime as dt
import pandas as pd
from pandas import Series
from pandas import DataFrame
import os
df = pd.read_csv('myfile.csv')
speedonly = DataFrame(df['speed'])
results = statsmodels.tsa.api.VAR(speedonly)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
results = statsmodels.tsa.api.VAR(speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 336, in __init__
super(VAR, self).__init__(endog, None, dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 40, in __init__
self._init_dates(dates, freq)
File "C:\Python27\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 54, in _init_dates
raise ValueError("dates must be of type datetime")
ValueError: dates must be of type datetime
Now, interestingly, when I run the VAR example from here https://github.com/statsmodels/statsmodels/blob/master/docs/source/vector_ar.rst#id5, it works fine.
I try the VAR model with a third, shorter vector, ts, from Wes McKinney's "Python for Data Analysis," page 293, and it doesn't work either.
Okay, so now I'm thinking it's because the vectors are different types:
>>> speedonly.head()
speed
0 559.984
1 559.984
2 559.984
3 559.984
4 559.984
>>> type(speedonly)
<class 'pandas.core.frame.DataFrame'> #DOESN'T WORK
>>> type(data)
<type 'numpy.ndarray'> #WORKS
>>> ts
2011-01-02 -0.682317
2011-01-05 1.121983
2011-01-07 0.507047
2011-01-08 -0.038240
2011-01-10 -0.890730
2011-01-12 -0.388685
>>> type(ts)
<class 'pandas.core.series.TimeSeries'> #DOESN'T WORK
So I convert speedonly to an ndarray... and it still doesn't work. But this time I get another error:
>>> nda_speedonly = np.array(speedonly)
>>> results = statsmodels.tsa.api.VAR(nda_speedonly)
Traceback (most recent call last):
File "<pyshell#47>", line 1, in <module>
results = statsmodels.tsa.api.VAR(nda_speedonly)
File "C:\Python27\lib\site-packages\statsmodels\tsa\vector_ar\var_model.py", line 345, in __init__
self.neqs = self.endog.shape[1]
IndexError: tuple index out of range
Any suggestions?
Q#2. I have exogenous feature variables in my data set that appear to be useful for predictions. Is the above model from statsmodels even the best one to use?
When you give a pandas object to a time-series model, it expects that the index is dates. The error message is improved in the current source (to be released soon).
ValueError: Given a pandas object and the index does not contain dates
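A minimal sketch of giving the frame a date index before fitting (the start date and daily frequency here are assumptions for illustration; use whatever your data actually has):

import pandas as pd

df = pd.read_csv('myfile.csv')
# Attach a DatetimeIndex so the time-series model can find dates.
df.index = pd.date_range('2011-01-01', periods=len(df), freq='D')
speedonly = df[['speed']]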
In the second case, you're giving a single 1-d series to a VAR. VARs are used when you have more than one series, which is why you get the shape error: the model expects a second dimension in your array. We could probably improve the error message here. For a single-series AR model with exogenous variables, you probably want to use sm.tsa.ARMA. Note that there is a known bug in ARMA.predict for models with exogenous variables, to be fixed soon. If you could provide a test case for this, it would be helpful.
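A hedged sketch of the ARMA route with exogenous regressors, using placeholder arrays and the constructor signature from statsmodels releases where sm.tsa.ARMA takes the order up front (the class was removed in later versions in favor of ARIMA):

import numpy as np
import statsmodels.api as sm

# Placeholder data: endog is the single series to model, exog the extra features.
endog = np.random.randn(200)
exog = np.random.randn(200, 2)

# ARMA(1, 0) with exogenous variables; the order is an arbitrary choice here.
model = sm.tsa.ARMA(endog, order=(1, 0), exog=exog)
results = model.fit()
print(results.params)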