A series query on groupby function - python

I have a data frame called active and it has 10 unique POS column values.
I group by POS, mean normalize the OPW column, and store the normalized values as a separate column ['resid'].
If I group by POS values, shouldn't the new active data frame's POS column contain only unique POS values?
For example:
import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
print(df2)
df2.groupby(['X']).sum()
I get an output like this:
   Y
X
A  7
B  3
In my example, shouldn't I get a column with only unique POS values, as shown below?
POS  Other Columns
RF   values
2B   values
LF   values
2B   values
OF   values

I can't be 100% sure without the actual data, but I'm pretty sure the problem here is that you are not aggregating the data.
Let's go through the groupby step by step.
When you do active.groupby('POS'), what actually happens is that the dataframe is sliced per unique POS value, and each of these slices is passed, sequentially, to the applied function.
You can get a better view of what's happening by using get_group (e.g. active.groupby('POS').get_group('RF')).
So you're applying your meanNormalizeOPW function to each of those slices. That function computes a mean normalized value in the column 'resid' for each line of the passed dataframe, and you return that dataframe, ending up with a shape similar to what was passed in.
So if you just add an aggregation function to the returned df, it should work fine. I guess here you want a mean, so just change return df into return df.mean().
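For reference, a minimal sketch of what that change could look like. The body of meanNormalizeOPW is a hypothetical reconstruction (the real function isn't shown in the question), and 'OPW' is assumed to be a numeric column:

def meanNormalizeOPW(df):
    # hypothetical reconstruction: mean-normalize OPW within the group
    df['resid'] = df['OPW'] - df['OPW'].mean()
    return df.mean(numeric_only=True)  # aggregate, so each POS yields a single row

result = active.groupby('POS').apply(meanNormalizeOPW)

With the aggregation in place, the result has exactly one row per unique POS value.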

Related

Multi-slice pandas dataframe

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. (An image in the original question illustrated this split, with the original values shown in blue and divided into a green set and a red set.)
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe and then get the rest by checking what is not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, that could do the second part of the slicing and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one go?
Use modulo 3 and keep the positions that are not equal to 0 (the same positions removed by the first slice):
import numpy as np

# for default RangeIndex
df2 = df[df.index % 3 != 0]

# for any Index
df2 = df[np.arange(len(df)) % 3 != 0]

print(df2)
  val
1   b
2   c
4   e
5   f
7   h
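If you want both frames from a single pass, one option (a small sketch building on the answer above; the variable names are just illustrative) is to build the boolean mask once and reuse it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# every 3rd row goes to df1, everything else to df2
mask = np.arange(len(df)) % 3 == 0
df1, df2 = df[mask], df[~mask]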

Python: Extract unique index values and use them in a loop

I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index and, as you can see below, it is stored in a FrozenList. I am only interested in the index values that contain a long number. The word "vehicle", which is also an index level, can be removed as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
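Applied back to the original question, a short sketch (assuming the long IDs sit in level 0 and SERIAL_NUMBER is a regular column, as shown in the question):

# check that each top-level index value maps to a single serial number
for idx in df.index.get_level_values(0).unique():
    serials = df.xs(idx, level=0)["SERIAL_NUMBER"].unique()
    print(idx, serials)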

How to pass a series to call a user defined function?

I am trying to pass a series to a user defined function and getting this error:
Function:
from sklearn.preprocessing import StandardScaler

def scale(series):
    sc = StandardScaler()
    sc.fit_transform(series)
    print(series)
Code for calling:
df['Value'].apply(scale) # df['Value'] is a Series having float dtype.
Error:
ValueError: Expected 2D array, got scalar array instead:
array=28.69.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Can anyone help address this issue?
The method apply will apply a function to each element in the Series (or, in the case of a DataFrame, to each row or each column depending on the chosen axis). Here, however, you expect your function to process the entire Series and to output a new Series in its place.
You can therefore simply run:
StandardScaler().fit_transform(df['Value'].values.reshape(-1, 1))
StandardScaler expects a 2D array as input, where each row is a sample consisting of one or more features. Even if there is just a single feature (as seems to be the case in your example), it has to have the right dimensions. Therefore, before handing your Series over to sklearn, I am accessing the values (the numpy representation) and reshaping them accordingly.
For more details on reshape(-1, ...) check this out: What does -1 mean in numpy reshape?
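If you do want to keep a helper function, here is a minimal sketch of one that operates on the whole Series and returns the scaled result (the function name and return type are my own choice, not from the question):

from sklearn.preprocessing import StandardScaler
import pandas as pd

def scale(series):
    # scale the whole Series at once instead of element by element
    scaled = StandardScaler().fit_transform(series.values.reshape(-1, 1))
    return pd.Series(scaled.ravel(), index=series.index, name=series.name)

# replace the column in the question's df with its scaled version
df['Value'] = scale(df['Value'])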
Now, the best bit. If your entire DataFrame consists of a single column you could simply do:
StandardScaler().fit_transform(df)
And even if it doesn't, you could still avoid the reshape:
StandardScaler().fit_transform(df[['Value']])
Note how in this case 'Value' is surrounded by two sets of square brackets, so this time it is not a Series but rather a DataFrame with a subset of the original columns (in case you do not want to scale all of them). Since a DataFrame is already 2-dimensional, you don't need to worry about reshaping.
Finally, if you want to scale just some of the columns and update your original DataFrame all you have to do is:
>>> df = pd.DataFrame({'A': [1,2,3], 'B': [0,5,6], 'C': [7, 8, 9]})
>>> columns_to_scale = ['A', 'B']
>>> df[columns_to_scale] = StandardScaler().fit_transform(df[columns_to_scale])
>>> df
          A         B  C
0 -1.224745 -1.397001  7
1  0.000000  0.508001  8
2  1.224745  0.889001  9

A fast method to add a label column to large pd dataframe based on a range of another column

I'm fairly new to python and am working with large dataframes with upwards of 40 million rows. I would like to be able to add another 'label' column based on the value of another column.
If I have a pandas dataframe (much smaller here to illustrate the problem):
import pandas as pd
import numpy as np
#using random to randomly get vals (as my data is not sorted)
my_df = pd.DataFrame(np.random.randint(0,100,1000),columns = ['col1'])
I then have another dictionary containing ranges associated with a specific label, similar to something like:
my_label_dict ={}
my_label_dict['label1'] = np.array([[0,10],[30,40],[50,55]])
my_label_dict['label2'] = np.array([[11,15],[45,50]])
Any value in my_df should be 'label1' if it is between 0 and 10, 30 and 40, or 50 and 55,
and 'label2' if it is between 11 and 15 or 45 and 50.
I have only managed to isolate data based on the labels and retrieve an index through something like:
idx_save = np.full(len(my_df['col1']), False, dtype=bool)
for rng in my_label_dict['label1']:
    idx_temp = np.logical_and(my_df['col1'] > rng[0], my_df['col1'] < rng[1])
    idx_save = idx_save | idx_temp
and then use this index to pull the label1 rows from my_df, and then repeat for label2.
Ideally I would like to add another column to my_df named 'labels' which would hold 'label1' for all datapoints that satisfy the corresponding ranges, and so on. Or just a quick method to retrieve all values from the dataframe that satisfy the ranges for a given label.
I'm new to generator functions, and haven't completely gotten my head around them, but maybe they could be used here?
Thanks for any help!!
You can do the task in a more "pandasonic" way.
Start from creating a Series, named labels, initially with empty strings:
labels = pd.Series([''] * 100).rename('label')
The length is 100, just as the upper limit of your values.
Then fill it with proper labels:
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
And the only thing to do is to merge your DataFrame with labels:
my_df = my_df.merge(labels, how='left', left_on='col1', right_index=True)
I also noticed a contradiction in my_label_dict:
you have label1 for the range between 50 and 55 (I assume inclusive),
but you also have label2 for the range between 45 and 50,
so the value 50 has two definitions.
My code acts on the "last decision takes precedence" principle, so the label
for 50 is label2. Maybe you should change one of these range borders?
Edit
A modified solution if the upper limit of col1 is "unpredictable":
Define labels the following way:
import itertools

rngMax = max(np.array(list(itertools.chain.from_iterable(
    my_label_dict.values())))[:, 1])
labels = pd.Series([np.nan] * (rngMax + 1)).rename('label')
for key, val in my_label_dict.items():
    for v in val:
        labels[v[0]:v[1]+1] = key
labels.dropna(inplace=True)
Add .fillna('') to my_df.merge(...).
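In other words, the merge line in that case could look roughly like this (a sketch; the fillna target assumes that only the label column picks up NaNs from the left join):

my_df = (my_df.merge(labels, how='left', left_on='col1', right_index=True)
              .fillna({'label': ''}))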
Here is a solution that would also work for float ranges, where you can't enumerate a mapping for every possible value. This solution requires re-sorting your dataframes.
# build a dataframe you can join and sort it for the from-field
join_df = pd.DataFrame({
    'from':  [ 0, 30, 50, 11, 45],
    'to':    [10, 40, 55, 15, 50],
    'label': ['label1', 'label1', 'label1', 'label2', 'label2']
})
join_df.sort_values('from', axis='index', inplace=True)
# calculate the maximum range length (but you could alternatively set it to any value larger than your largest range as well)
max_tolerance=(join_df['to'] - join_df['from']).max()
# sort your value dataframe for the column to join on and do the join
my_df.sort_values('col1', axis='index', inplace=True)
result= pd.merge_asof(my_df, join_df, left_on='col1', right_on='from', direction='backward', tolerance=max_tolerance)
# now you just have to remove the labels for the rows where the value passed the end of the range, and drop the two range columns
result.loc[result['to']<result['col1'], 'label']= np.NaN
result.drop(['from', 'to'], axis='columns', inplace=True)
The merge_asof(..., direction='backward', ...) joins, for each row in my_df, the row in join_df with the largest 'from' value that still satisfies from <= col1. It doesn't look at the 'to' column at all. That is why we afterwards remove the labels where the 'to' boundary is violated, by assigning np.NaN in the line with the .loc.
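As a quick self-contained illustration of that behaviour (the col1 values here are made up; the ranges are the ones from the question):

import numpy as np
import pandas as pd

my_df = pd.DataFrame({'col1': [5, 12, 20, 47, 53, 60]})
join_df = pd.DataFrame({
    'from':  [0, 11, 30, 45, 50],
    'to':    [10, 15, 40, 50, 55],
    'label': ['label1', 'label2', 'label1', 'label2', 'label1']
})

max_tolerance = (join_df['to'] - join_df['from']).max()
result = pd.merge_asof(my_df.sort_values('col1'), join_df.sort_values('from'),
                       left_on='col1', right_on='from',
                       direction='backward', tolerance=max_tolerance)
result.loc[result['to'] < result['col1'], 'label'] = np.NaN
result = result.drop(['from', 'to'], axis='columns')
# col1 = 20 and col1 = 60 fall outside every range and end up with NaN labels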

Python Pandas imputation of Null values

I am attempting to impute null values with an offset that combines the average of the row (df.loc[row, 'avg']) and the average of the column (impute[col]). Is there a way to do this that would parallelize with .map? Or is there a better way to iterate through the indexes containing null values?
import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
test = test[['a', 'b', 'c', 'avg']]
impute = {'a': 2, 'b': 3.33, 'c': 6}

def smarterImpute(df, impute):
    df2 = df
    for col in df.columns[:-1]:
        for row in df.index:
            if pd.isnull(df.loc[row, col]):
                df2.loc[row, col] = (impute[col]
                                     + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg']))
    return print(df2)

smarterImpute(test, impute)
Notice that in your 'filling' expression:
impute[col] + (df.loc[:, 'avg'].mean() - df.loc[row, 'avg'])
The first term only depends on the column and the third only on the row; the second is just a constant. So we can create an imputation dataframe to look up whenever there's a value that needs to be filled:
impute_df = pd.DataFrame(impute, index = test.index).add(test.avg.mean() - test.avg, axis = 0)
Then, there's a pandas method called .combine_first() that allows you to fill the NAs in one dataframe with the values of another, which is exactly what we need. We use it, and we're done:
test.combine_first(impute_df)
With pandas, you generally want to avoid using loops, and seek to make use of vectorization.
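Putting the pieces together, a minimal end-to-end sketch (reusing the test and impute objects from the question):

import numpy as np
import pandas as pd

test = pd.DataFrame({'a': [None, 2, 3, 1], 'b': [2, np.nan, 4, 2],
                     'c': [3, 4, np.nan, 3], 'avg': [2.5, 3, 3.5, 2]})
impute = {'a': 2, 'b': 3.33, 'c': 6}

# column offset plus (global mean of 'avg' minus the row's 'avg'), broadcast to every row
impute_df = pd.DataFrame(impute, index=test.index).add(test.avg.mean() - test.avg, axis=0)

# fill each NaN in test from the corresponding cell of impute_df
filled = test.combine_first(impute_df)
print(filled)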
