I'm trying to construct a pandas Series to concatenate onto a dataframe.
import numpy as np
import pandas as pd
rawData = pd.read_csv(input, header=1) # the DataFrame
strikes = pd.Series() # the empty Series
for i, row in rawData.iterrows():
    sym = rawData.loc[i, 'Symbol']
    strike = float(sym[-6:]) / 1000
    strikes = strikes.set_value(i, strike)
print("at26: ", strikes.values)
This program works, but I get this warning:
"line 25: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead."
Every way I have tried to substitute .at, I get a syntax error. Many of the suggestions posted relate to DataFrames, not Series. .append requires another Series and complains when I give it a scalar.
What is the proper way to do it?
Replace strikes.set_value(i, strike) with strikes.at[i] = strike.
Note that assigning the result back to the series is not necessary: set_value modifies the series in place:
s = pd.Series()
s.set_value(0, 10)
s.at[1] = 20
print(s)
0 10
1 20
dtype: int64
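Applied to your loop, that would look something like this (a sketch; .at also modifies the series in place, so no assignment back is needed, and passing dtype=float avoids the ambiguous empty-Series dtype):
strikes = pd.Series(dtype=float)
for i, row in rawData.iterrows():
    strikes.at[i] = float(row['Symbol'][-6:]) / 1000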
For the algorithm you are looking to run, you can simply use assignment:
strikes = rawData['Symbol'].str[-6:].astype(float) / 1000
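For instance, with some hypothetical symbols whose last six characters encode the strike price in thousandths:
import pandas as pd

rawData = pd.DataFrame({'Symbol': ['XYZ012500', 'XYZ013750', 'XYZ015000']})
strikes = rawData['Symbol'].str[-6:].astype(float) / 1000
print(strikes.tolist())  # [12.5, 13.75, 15.0]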
I'm trying to give numerical representations of strings, so I'm using Pandas'
factorize
For example, Toyota = 1, Safeway = 2, Starbucks = 3.
Currently it looks like (and this works):
#Create easy unique IDs for subscription names i.e. 1,2,3,4,5...etc..
df['SUBS_GROUP_ID'] = pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1
However, I only want to factorize subscription names where SUBS_GROUP_ID is null. So my thought was: grab all null rows, then run the factorize function.
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df[mask_to_grab_nulls]['SUBS_GROUP_ID'] = pd.factorize(df[mask_to_grab_nulls]['SUBSCRIPTION_NAME'])[0] + 1
This runs, but does not change any values... any ideas on how to solve this?
This is likely related to chained assignments (see more here). Try the solution below, which isn't optimal but should work fine in your case:
df2 = df[df['SUBS_GROUP_ID'].isnull()] # isolate the Null IDs
df2['SUBS_GROUP_ID'] = pd.factorize(df2['SUBSCRIPTION_NAME'])[0] + 1 # factorize
df = df.dropna() # drop the null rows from the original table (note: dropna() checks all columns)
df_fin = pd.concat([df,df2]) # concat df and df2
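Since both pieces keep their original index labels, you can optionally restore the original row order afterwards:
df_fin = pd.concat([df, df2]).sort_index()  # restore original row order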
What you are doing is called chained indexing, which has two major downsides and should be avoided:
It can be slower than the alternative, because it involves more function calls.
The result is unpredictable: Why does assignment fail when using chained indexing?
I'm a bit surprised you haven't seen a SettingWithCopy warning. The warning points you in the right direction:
... Try using .loc[row_indexer,col_indexer] = value instead
So this should work:
mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
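To see it in action, here is a minimal sketch with hypothetical data (note that factorize numbers the null rows starting at 1, independently of any IDs already present):
import numpy as np
import pandas as pd

df = pd.DataFrame({'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks'],
                   'SUBS_GROUP_ID': [7.0, np.nan, np.nan]})

mask_to_grab_nulls = df['SUBS_GROUP_ID'].isnull()
df.loc[mask_to_grab_nulls, 'SUBS_GROUP_ID'] = pd.factorize(
    df.loc[mask_to_grab_nulls, 'SUBSCRIPTION_NAME']
)[0] + 1
print(df)
#   SUBSCRIPTION_NAME  SUBS_GROUP_ID
# 0            Toyota            7.0
# 1           Safeway            1.0
# 2         Starbucks            2.0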
You can use LabelEncoder.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df = df.dropna(subset=['SUBS_GROUP_ID'])  # drop null values
df_results = le.fit_transform(df.SUBS_GROUP_ID.values)  # encode strings to classes
df_results
I would use numpy.where to factorize only the non-NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'SUBS_GROUP_ID': ['ID-001', 'ID-002', np.nan, 'ID-004', 'ID-005'],
                   'SUBSCRIPTION_NAME': ['Toyota', 'Safeway', 'Starbucks', 'Safeway', 'Toyota']})

df['SUBS_GROUP_ID'] = np.where(~df['SUBS_GROUP_ID'].isnull(),
                               pd.factorize(df['SUBSCRIPTION_NAME'])[0] + 1,
                               np.nan)
print(df)
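For reference, that should print something like (alignment approximate):
  SUBS_GROUP_ID SUBSCRIPTION_NAME
0           1.0            Toyota
1           2.0           Safeway
2           NaN         Starbucks
3           2.0           Safeway
4           1.0            Toyota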
My DataFrame has a complex128 in one column. When I access another value via the .loc method it returns a complex128 instead of the stored dtype.
I encountered the problem when I was using some values from a DataFrame inside a class in a function.
Here is a minimal example:
import pandas as pd
arrays = [["f","i","c"],["float","int","complex"]]
ind = pd.MultiIndex.from_arrays(arrays, names=("varname", "intended dtype"))
a = pd.DataFrame(columns=ind)
m1 = 1.33+1e-9j
parms1 = [1.,2,None]
a.loc["aa"] = parms1
a.loc["aa","c"] = m1
print(a.dtypes)
print(a.loc["aa","f"])
print("-----------------------------")
print(a.loc["aa",("f","float")])
print("-----------------------------")
print(a["f"])
If the MultiIndex is taken away, this does not happen, so it seems to play a role. But accessing the value the MultiIndex way does not help either.
I noticed that the dtype coercion happens because I have not specified any index when creating the DataFrame. That is unavoidable, because I do not know at the outset what will be filled in.
Is this normal behavior, or can I get rid of it?
My pandas version is 0.24.2; the behavior is also reproducible in 0.25.3.
import pandas as pd
import numpy as np
test_df = pd.DataFrame([[1,2]]*4, columns=['x','y'])
test_df.iloc[0,0] = '1'
test_df.iloc[0,0] = 1
test_df.select_dtypes(include=['number'])
I want to know why column x is not included in this case.
I can reproduce on Pandas v0.19.2. The issue is when, if at all, Pandas chooses to check and recast series. You first define the series as dtype object with this assignment:
test_df.iloc[0, 0] = '1'
Pandas stores any series with strings as object dtype. You then overwrite a value in the next line without explicitly changing the dtype of the series:
test_df.iloc[0, 0] = 1
But you should not assume this automatically triggers conversion to a numeric dtype for the entire series. As far as I am aware, this is not a documented behaviour. While it may work in more recent versions, it is not a behaviour you should assume for a production workflow.
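If you need select_dtypes to pick the column up again, one option is to force re-inference explicitly; a minimal sketch using infer_objects (available since pandas 0.21):
import pandas as pd

test_df = pd.DataFrame([[1, 2]] * 4, columns=['x', 'y'])
test_df.iloc[0, 0] = '1'  # column x becomes object dtype
test_df.iloc[0, 0] = 1    # x stays object; no automatic recast

test_df = test_df.infer_objects()  # soft-convert object columns back to numeric
print(test_df.select_dtypes(include=['number']).columns.tolist())  # ['x', 'y']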
I have spent a lot of time trying to insert data into a pandas DataFrame, but I just cannot get the result I expect.
There are two index levels:
1. current_time
2. company_name
After I use data.ix[] to insert a row, the DataFrame creates another column (named after the company_name).
Can anyone give me some advice, please?
import pandas
data=pandas.DataFrame(columns=['Date', 'Name', 'd1'])
data.set_index(['Date', 'Name'], inplace=True)
now = pandas.datetime.now()
data.ix[now, 'ACompany'] = [1]
To let pandas know that now and 'ACompany' are the levels of the index, you have to use some extra parentheses:
data.ix[(now, 'ACompany'), :] = 1
By just doing data.ix[now, 'ACompany'], pandas will by default try to interpret this as index=now, column='ACompany' (in the sense of .ix[rows, columns])
Further, it is recommended to use .loc instead of .ix if you want to index solely by the labels.
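For example, the equivalent insert with .loc might look like this (a sketch: I use datetime.datetime.now() because pandas.datetime has been removed in newer pandas, and setting-with-enlargement on a MultiIndex has varied somewhat across versions):
import datetime
import pandas as pd

data = pd.DataFrame(columns=['Date', 'Name', 'd1'])
data.set_index(['Date', 'Name'], inplace=True)

now = datetime.datetime.now()
data.loc[(now, 'ACompany'), :] = 1  # the tuple addresses both index levels
print(data)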
I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to a matrix-form leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting to a numpy array beforehand is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
import numpy as np
from scipy.stats import friedmanchisquare

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))