Python, manipulating dataframes - python

Department = input("what dept")
editfile = pd.read_csv('52.csv', encoding='Latin-1')
editfilevalues= editfile.loc[editfile['Customer'].str.contains(Department, na=False), 'May-18\nQty']
editfilevalues = editfilevalues.fillna(int(0))
print(int(editfilevalues) *1.3)
I have looked through Stack Overflow and no answer seems to help with this problem. I simply want to manipulate the data in a Series like this, but I keep getting different errors. With the current code I receive this:
"{0}".format(str(converter))) TypeError: cannot convert the series to <class 'int'>
My main issue is converting a Series to an int type. I have tried several different ways to do this and none of them give me the result I want.

So a pandas Series is a bit like a list, but with different functions and properties. You can't convert the Series to int using int() because the function wasn't designed to work on list-like objects in that way.
If you need to convert the Series to all integers, this method will work.
int_series = your_series.astype(int)
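This converts your entire Series to an integer dtype: 'int64' on most platforms, or 'int32' on some Windows setups with older NumPy versions. The cast raises an error if the Series still contains NaN, so fill missing values first. Below is a bonus if you want it in a numpy array.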
int_array = your_series.values.astype(int)
From here you have a few options to do your calculation.
# where x is a value in your series and lambda is a nameless function
calculated_series = int_series.apply(lambda x: some_number*x)
The output will be another Series object with your rows calculated. Bonus using numpy array below.
calculated_array = int_array * some_number
Edit to show everything at once.
# for series
int_series = your_series.astype(int)
calculated_series = int_series.apply(lambda x: x * some_number)
# for np.array
int_array = your_series.values.astype(int)
calculated_array = int_array * some_number
Either will work, and it is ultimately up to what kind of data structure you want at the end of it all.
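As a rough sketch of the whole flow, assuming a small made-up Series in place of the filtered column from the question (the values and some_number = 1.3 are illustrative only): fill the missing entries, cast to int, and multiply the Series directly. Plain arithmetic on a Series is applied element-wise, so apply() is optional here, and sum() reduces the result to a single number, which is something int() cannot do for a whole Series.

```python
import pandas as pd

# Hypothetical stand-in for the filtered column from the question.
your_series = pd.Series([10.0, None, 4.0])
some_number = 1.3

# NaN cannot be cast to int, so fill missing values before astype(int).
int_series = your_series.fillna(0).astype(int)

# Arithmetic on a Series is element-wise; the result is another Series.
calculated_series = int_series * some_number

# Reduce to one number only if a single value is what you need.
total = calculated_series.sum()
print(calculated_series)
print(total)
```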

Related

How to use np.vectorize() with a Pandas function?

I have a function that operates on data in Pandas DataFrame format. It works with pandas.apply() but it does not work with np.vectorize(). The function is below:
def AMTTL(inputData, amortization = []):
    rate = inputData['EIR']
    payment = inputData['INSTALMENT']
    amount = inputData['OUTSTANDING']
    amortization = [amount]
    if amount - payment <= 0:
        return amortization
    else:
        while amount > 0:
            amount = BALTL(rate, payment, amount)
            if amount <= 0:
                continue
            amortization.append(amount)
        return amortization
The function receives inputData in Pandas DataFrame format. EIR, INSTALMENT and OUTSTANDING are the column names. This function works well with pandas.apply():
data.apply(AMTTL, axis = 1)
However, when I try np.vectorize() with the code below, it does not work:
vfunc = np.vectorize(AMTTL)
vfunc(data)
It raised an error like 'Timestamp' object is not subscriptable. So I dropped the columns that are not used, but then I got another error: invalid index to scalar variable.
I am not sure how to adapt my pandas.apply() call to np.vectorize().
Any suggestions? Thank you in advance.
np.vectorize is nothing more than a map function applied to every element of the array, meaning you cannot differentiate between the columns within the function. It has no idea about column names like EIR or INSTALMENT. Therefore your current implementation will not work with np.vectorize.
From the docs:
The vectorized function evaluates pyfunc over successive tuples of the
input arrays like the python map function, except it uses the
broadcasting rules of numpy.
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Based on your problem, you should try np.apply_along_axis instead, where you can refer to the different columns by their positional indexes.
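A minimal sketch of that suggestion, under some assumptions: BALTL is not shown in the question, so baltl below is a hypothetical stand-in (one period of interest, then the payment is subtracted), and the function returns a single number per row because np.apply_along_axis expects every call to return a result of the same shape, which a variable-length amortization list would not satisfy. The sample values are made up.

```python
import numpy as np
import pandas as pd

def baltl(rate, payment, amount):
    # Hypothetical replacement for the question's BALTL (not shown there).
    return amount * (1 + rate) - payment

def periods_to_repay(row, max_periods=1000):
    # row is a plain 1-D array, so columns are addressed by position:
    # 0 = EIR, 1 = INSTALMENT, 2 = OUTSTANDING.
    rate, payment, amount = row[0], row[1], row[2]
    periods = 0
    # max_periods guards against loans whose payment never covers the interest.
    while amount > 0 and periods < max_periods:
        amount = baltl(rate, payment, amount)
        periods += 1
    return periods

data = pd.DataFrame({
    "EIR": [0.0, 0.0],
    "INSTALMENT": [50.0, 100.0],
    "OUTSTANDING": [100.0, 250.0],
})

# Select only the numeric columns, in a known order, before dropping to NumPy.
values = data[["EIR", "INSTALMENT", "OUTSTANDING"]].to_numpy()
result = np.apply_along_axis(periods_to_repay, 1, values)
print(result)
```

Like np.vectorize, np.apply_along_axis still loops in Python, so it is a convenience rather than a speed-up.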

Numpy/Pandas: Error converting ndarray to series

I have the ndarray "diffTemp":
diffTemp = np.diff([df.Temp])
Where Temp are temperature values whose differences I compute using the difference operator. In this case using print() I get:
print(diffTemp) = [[-0.16 -0.05]]
To convert it into a column vector I use:
diffTemp = diffTemp.transpose()
And then convert it from ndarray into Series using:
diffTemp = pd.Series([diffTemp])
(This allows me later to concatenate diffTime with its corresponding Series dates (diffDates).)
Unfortunately this outputs that diffTemp is:
print(diffTemp) = 0 [[-0.16000000000000014], [-0.05000000000000071]]
If I instead use (i.e. without hard brackets [ ]), such that instead:
diffTemp = pd.Series(diffTemp)
I instead get the error message:
Exception: Data must be 1-dimensional
I am totally new to Python and have been googling for the last few days without any success. Any help is much appreciated.
The issue here is that you are trying to convert a two-dimensional array into a 1-dimensional series. Notice that there are two brackets around [[-0.16 -0.05]]. You can write the following to get back a series by just grabbing the 1-d array that you want:
diffTemp = pd.Series(diffTemp[0])
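To illustrate, here is a small sketch with made-up temperatures (the values in df are assumptions, not the asker's data). It shows where the extra dimension comes from and two ways to end up with a 1-D Series: indexing the first row as above, or leaving out the square brackets in the np.diff call so the result is 1-D from the start.

```python
import numpy as np
import pandas as pd

# Made-up temperatures standing in for the asker's df.Temp.
df = pd.DataFrame({"Temp": [20.0, 19.84, 19.79]})

# Wrapping the column in [ ] creates a 2-D input, so the result is 2-D too.
diff_2d = np.diff([df.Temp])
print(diff_2d.shape)  # one row, two differences

# Option 1: take the single row of the 2-D result.
diff_from_row = pd.Series(diff_2d[0])

# Option 2: pass the column as a 1-D array so np.diff returns a 1-D result.
diff_direct = pd.Series(np.diff(df.Temp.to_numpy()))

print(diff_direct)
```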

string indices must be integers pandas dataframe

I am pretty new to data science. I am trying to work with DataFrame data inside a list. I have read almost every post about "string indices must be integers", but it did not help at all.
My DataFrame looks like this:
And my list looks like this:
myList -> [0098b710-3259-4794-9075-3c83fc1ba058 1.561642e+09 32.775882 39.897459],
[0098b710-3259-4794-9075-3c83fc1ba057 1.561642e+09 32.775882 39.897459],
and goes on...
This is the Data in case you need to reproduce something guys.
I need to access the list items (DataFrames) one by one, and then split each DataFrame wherever the difference between two consecutive timestamps is greater than 60000.
I wrote this code, but it gives an error whenever I try to access timestamp. Can you help with the problem?
mycode:
a = []
for i in range(0,len(data_one_user)):
    x = data_one_user[i]
    x['label'] = (x['timestamp'] - x['timestamp'].shift(1))
    x['trip'] = np.where(x['label'] > 60000, True, False)
    x = x.drop('label', axis=1)
    x['trip'] = np.where(x['trip'] == True, a.append(x) , a.extend(x))
    #a = a.drop('trip', axis=1)
x = a
Edit: If you wonder the object types
data_one_user -> list
data_one_user[0] = x -> pandas.core.frame.DataFrame
data_one_user[0]['timestamp'] = x['timestamp'] -> pandas.core.series.Series
Edit2: I added the error print out
Edit3: Output of x
I found the problem that causes the error. At the end of the list, labels are repeated.
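For the splitting step itself, one possible approach (a sketch, not the asker's final code) is to flag every row whose gap to the previous timestamp exceeds 60000, turn the flags into a running trip number with cumsum(), and group on that number. The frame x below is made up; only the timestamp column name and the 60000 threshold come from the question.

```python
import pandas as pd

# Made-up frame standing in for one element of data_one_user.
x = pd.DataFrame({"timestamp": [0, 1000, 2000, 100000, 101000]})

# True where the gap to the previous row is larger than the threshold.
# The first row has no predecessor (NaN), which compares as False.
new_trip = x["timestamp"].diff() > 60000

# Running count of gaps seen so far: 0 for the first trip, 1 for the next, ...
trip_id = new_trip.cumsum()

# One DataFrame per trip.
trips = [part for _, part in x.groupby(trip_id)]
for part in trips:
    print(part)
```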

Strange issue when storing FFT periods in Pandas dataframe

I am trying to store the results of FFT calculations in a Pandas data frame:
ft = pd.DataFrame(index=range(90))
ft['y'] = ft.index.map(lambda x: np.sin(2*x))
ft['spectrum'] = np.fft.fft(ft['y'])
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
ft['T'] = ft['freq'].apply(lambda f: 1/f if f != 0 else 0)
Everything seems to work fine until the last line: the column T, which is supposed to store periods, for some reason contains all the columns of the frame, i.e.:
In [499]: ft.T[0]
Out[499]:
y 0j
spectrum (0.913756021471+0j)
freq 0j
T 0j
Name: 0, dtype: complex128
I cannot figure out why that is. It also happens when I only take the real part of freq:
ft['freq'] = np.fft.fftfreq(len(ft.index)).real
or I try to calculate T values using alternative ways, such as:
ft.T = ft.index.map(lambda i: 1/ft.freq[i] if ft.freq[i] else np.inf)
ft.T = 1/ft.freq
All other columns look tidy when I run head() or describe() on them no matter if they contain real or complex values. The freq column looks like a normal 1D series, because np.fft.fftfreq() returns 1D array of complex numbers, so what could be the reason why the column T is so messed up?
I am using Pandas v. 1.19.2 and Numpy v. 1.12.0.
Pandas DataFrame objects have a property called T, which is used "to transpose index and columns" of the DataFrame object. If you use a different column name instead of T, everything works as expected.
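A short sketch of the difference, reusing the frame from the question with the spectrum column left out for brevity; the column name period is simply a choice made here. Attribute access ft.T always returns the transposed frame, whereas bracket access ft['T'] reaches a column that happens to be named T.

```python
import numpy as np
import pandas as pd

ft = pd.DataFrame(index=range(90))
ft["y"] = np.sin(2 * ft.index.to_numpy())
ft["freq"] = np.fft.fftfreq(len(ft.index))

# A name that does not clash with a DataFrame attribute works as expected.
ft["period"] = ft["freq"].apply(lambda f: 1 / f if f != 0 else 0)
print(ft["period"].head())

# A column named "T" can still be created and read with brackets...
ft["T"] = ft["period"]
print(ft["T"].shape)

# ...but ft.T is the transpose: one row per column of ft.
print(ft.T.shape)
```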

Use numpy.average with weights for resampling a pandas array

I need to resample some data with numpy's weighted average function, and it just doesn't work.
This is my test-case:
import datetime
import numpy as np
import pandas as pd
time_vec = [datetime.datetime(2007,1,1,0,0)
,datetime.datetime(2007,1,1,0,1)
,datetime.datetime(2007,1,1,0,5)
,datetime.datetime(2007,1,1,0,8)
,datetime.datetime(2007,1,1,0,10)
]
df = pd.DataFrame([2,3,1,7,4],index = time_vec)
A normal resampling without weights works fine (using the lambda function as a parameter to how is suggested here: Pandas resampling using numpy percentile? Thanks!):
df.resample('5min',how = lambda x: np.average(x[0]))
But if I try to use some weights, it always returns a TypeError: Axis must be specified when shapes of a and weights differ:
df.resample('5min',how = lambda x: np.average(x[0],weights = [1,2,3,4,5]))
I tried this with many different numbers of weights, but it did not get better:
for i in xrange(20):
    try:
        print range(i)
        print df.resample('5min',how = lambda x:np.average(x[0],weights = range(i)))
        print i
        break
    except TypeError:
        print i,'typeError'
I'd be glad about any suggestions.
The short answer here is that the weights in your lambda need to be created dynamically based on the length of the series that is being averaged. In addition, you need to be careful about the types of objects that you're manipulating.
The code that I got to compute what I think you're trying to do is as follows:
df.resample('5min', how=lambda x: np.average(x, weights=1+np.arange(len(x))))
There are two differences compared with the line that was giving you problems:
1. x[0] is now just x. The x object in the lambda is a pd.Series, and so x[0] gives just the first value in the series. This was working without raising an exception in the first example (without the weights) because np.average(c) just returns c when c is a scalar. But I think it was actually computing incorrect averages even in that case, because each of the sampled subsets was just returning its first value as the "average".
2. The weights are created dynamically based on the length of data in the Series being resampled. You need to do this because the x in your lambda might be a Series of different length for each time interval being computed.
The way I figured this out was through some simple type debugging, by replacing the lambda with a proper function definition:
def avg(x):
    print(type(x), x.shape, type(x[0]))
    return np.average(x, weights=np.arange(1, 1+len(x)))

df.resample('5Min', how=avg)
This let me have a look at what was happening with the x variable. Hope that helps!
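Note that recent pandas releases no longer accept the how argument of resample(); the function is passed to .apply() on the resampler instead. The sketch below assumes such a version and, to sidestep the x[0] confusion, uses a Series rather than a one-column DataFrame. The name weighted_avg is chosen here for illustration.

```python
import datetime

import numpy as np
import pandas as pd

time_vec = [
    datetime.datetime(2007, 1, 1, 0, 0),
    datetime.datetime(2007, 1, 1, 0, 1),
    datetime.datetime(2007, 1, 1, 0, 5),
    datetime.datetime(2007, 1, 1, 0, 8),
    datetime.datetime(2007, 1, 1, 0, 10),
]
s = pd.Series([2, 3, 1, 7, 4], index=time_vec)

def weighted_avg(x):
    # x is the Series of values falling into one 5-minute bin.
    if len(x) == 0:
        return np.nan  # empty bins have nothing to average
    # Weights 1, 2, ..., len(x), rebuilt for every bin.
    weights = np.arange(1, len(x) + 1)
    return np.average(x.to_numpy(), weights=weights)

result = s.resample("5min").apply(weighted_avg)
print(result)
```

With this data the bins hold the values (2, 3), (1, 7) and (4), so the weights are (1, 2), (1, 2) and (1) respectively.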
