Processing dataframe with conditionals, using df.apply

I have a catalog of trees, which I've imported into a dataframe. It looks like this:
>>> df
ID Tree Zone Temp_Limit Grade
0 1 Apple 1 21 A
1 2 Apple 1 21 B
2 3 Orange 3 28 B
3 4 Pear 2 26 A
4 5 Apple 4 24 C
The idea is that depending on the type of tree, zone, and temp_limit, the dosage for irrigation, fertilizers, estimated transplant date, etc. would be calculated. Those would be additional columns in the dataframe.
The problem is that the formulas are conditional. It's not just "multiply temp limit by 5 and add 4", more like "if it's an apple tree in zone 2, apply this formula, if it's an orange tree in zone 1, formula goes like this... etc"
And to make things a bit more complicated, there might be rows that have an ID, a Tree type, and no data, that correspond to trees that haven't been delivered, etc.
My current solution is to use df.apply and have a function to do the conditionals and skip the blank rows:
def calculate_irrigation(species, zone, templimit, grade):
    if species.lower() == "apple":
        if zone == 3:
            ...  # etc etc etc

# axis=1 passes each row to the lambda; the column is named Temp_Limit
df['irrigation'] = df.apply(lambda x: calculate_irrigation(x['Tree'], x['Zone'], x['Temp_Limit'], x['Grade']), axis=1)
Question: is a Dataframe and df.apply the best solution for this? I used a df because it adapts very well to the data I'm working with and getting the data in there is pretty straightforward. Plus exporting the final results is easy. But when you have to do different operations based on values, and have to start putting functions in there, it makes you wonder if there's a better way you're not seeing.
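One alternative to a row-wise df.apply is to express each "if it's this tree in this zone" rule as a boolean mask and let numpy.select pick the matching formula column-wise. The sketch below is only an illustration: the two formulas and the extra blank row (ID 6) are made up for the example, not taken from the real catalog.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Tree": ["Apple", "Apple", "Orange", "Pear", "Apple", "Pear"],
    "Zone": [1, 1, 3, 2, 4, np.nan],          # row 6 is a tree not yet delivered
    "Temp_Limit": [21, 21, 28, 26, 24, np.nan],
    "Grade": ["A", "B", "B", "A", "C", None],
})

species = df["Tree"].str.lower()

# Each condition is a boolean mask over the whole column.
conditions = [
    (species == "apple") & (df["Zone"] == 1),
    (species == "orange") & (df["Zone"] == 3),
]

# Hypothetical formulas, one per condition, computed for every row at once.
choices = [
    df["Temp_Limit"] * 5 + 4,
    df["Temp_Limit"] * 2 - 1,
]

# Rows matching no condition (including blank rows) get NaN.
df["irrigation"] = np.select(conditions, choices, default=np.nan)
```

Comparisons against NaN evaluate to False, so the blank rows fall through to the default without any special handling. If the rules grow too irregular to write as masks, df.apply with a function remains a reasonable choice.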

Related

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think the trouble comes from the fact that the dataframe I'm working with is loaded from a really big csv file, almost 10 GB in size. Once loaded into a Pandas dataframe, it has ~60 million rows and 5 columns.
Example
Initially, the dataframe looks something like this:
In [1]: df
Out[1]:
category other_col
0 animal.cat 5
1 animal.dog 3
2 clothes.shirt.sports 6
3 shoes.laces 1
4 None 0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column in three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
other_col main_cat sub_category_1 sub_category_2
0 5 animal cat None
1 3 animal dog None
2 6 clothes shirt sports
3 1 shoes laces None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it sucks up too much memory when this is applied on a dataframe that big and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how should I proceed after that; I've obviously tried searching this kind of problem here on stack overflow, but I didn't manage to find anything useful.
Any help is appreciated!

The split and join methods do the job:
results = df['category'].str.split(".", expand=True)
df_after = df.join(results)
After that you can freely filter the resulting dataframe.
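Put together with the filtering and renaming the question asks for, the split-and-join approach might look like the sketch below. It uses the small example frame from the question; the n=2 cap (at most three pieces) is an assumption based on the desired output, which has one main category and two subcategories.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["animal.cat", "animal.dog", "clothes.shirt.sports", "shoes.laces", None],
    "other_col": [5, 3, 6, 1, 0],
})

# Drop rows with no category first, so the split never sees missing values.
df = df.dropna(subset=["category"])

# n=2 caps the split at two dots, i.e. at most three pieces per row.
parts = df["category"].str.split(".", n=2, expand=True)
parts.columns = ["main_cat", "sub_category_1", "sub_category_2"]

# join aligns on the index, so no separate concat step is needed.
df_after = df.drop(columns=["category"]).join(parts)
```

For a file too large to hold in memory, the same steps could in principle be applied to each chunk returned by pd.read_csv(..., chunksize=...) and the results written out piece by piece; that variant is not shown here.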

Sort values in dataframe, but randomize order of items with same value

I am writing a recommendation system that recommends products based on a score assigned to each product, for example in the following dataframe:
index product_name score
0 prod_1 2
1 prod_2 2
2 prod_3 1
3 prod_4 3
I can of course sort this dataframe by score, using sort_values('score', ascending = False), however, this will always result in the following dataframe:
index product_name score
3 prod_4 3
0 prod_1 2
1 prod_2 2
2 prod_3 1
However, I would like to randomly shuffle the order of prod_1 and prod_2, as they have the same score. It doesn't seem like sort_values has any way of achieving this.
The only solution I can come up with is to fetch all possible scores from the dataframe, then make a new dataframe for each score, shuffle those, and then stitch them back together, but it seems like there should be a better way.

What about adding a new column of random numbers (e.g. with numpy.random.randint) and then sorting by both columns?
df.sort_values(by=["score","rand_col"], ascending=[False,False])
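A minimal sketch of the random tie-breaker idea, using the example frame from the question. It swaps randint for random floats from NumPy's Generator so that two rows are very unlikely to draw the same key; rand_col is a temporary helper column and is dropped after sorting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product_name": ["prod_1", "prod_2", "prod_3", "prod_4"],
    "score": [2, 2, 1, 3],
})

rng = np.random.default_rng()

# Temporary random key; it only affects the order where scores tie.
df["rand_col"] = rng.random(len(df))

ranked = (
    df.sort_values(by=["score", "rand_col"], ascending=[False, False])
      .drop(columns="rand_col")
)
```

Each run keeps the scores in descending order but may swap prod_1 and prod_2.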

How to handle categorical data for preprocessing in Machine Learning

This may be a basic question. I have categorical data that I want to feed into my machine learning model, which accepts only numerical data. What is the correct way to convert categorical data into numerical data?
My Sample DF:
T-size Gender Label
0 L M 1
1 L M 1
2 M F 1
3 S F 0
4 M M 1
5 L M 0
6 S F 1
7 S F 0
8 M M 1
I know the following code converts my categorical data into numerical codes:
Type-1:
df['T-size'] = df['T-size'].cat.codes
The line above simply maps the categories to 0 through N-1. It doesn't respect any relationship between them.
For this example I know S < M < L. What should I do when I want to convert data like this?
Type-2:
In this case there is no relationship between M and F, but I can tell that M is more likely than F to have Label 1, i.e., (number of samples with Label 1) / (total number of samples):
for Male,
(4/5)
for Female,
(2/4)
We know that
(4/5) > (2/4)
How should I encode this kind of column?
Can I replace M with (4/5) and F with (2/4) for this problem?
What is the proper way to deal with such a column?
Please help me understand this better.

There are many ways to encode categorical data, and some of them depend on exactly what you plan to do with it. For example, one-hot encoding, which is easily the most popular choice, is an extremely poor choice if you're planning on using a decision tree / random forest / GBM.
Regarding your t-shirts above, you can give a pandas categorical type an order:
df['T-size'] = df['T-size'].astype(pd.api.types.CategoricalDtype(['S','M','L'], ordered=True))
If you set up your T-size categorical like that, your .cat.codes method works as intended, because the codes follow the declared order. If you prefer to stay within scikit-learn, note that LabelEncoder is intended for target labels; for feature columns, OrdinalEncoder accepts an explicit category order and fits neatly into pipelines.
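The ordered-categorical approach can be sketched as follows, using the T-size column from the question. The T-size_code column name is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"T-size": ["L", "L", "M", "S", "M", "L", "S", "S", "M"]})

# Declare the order explicitly: S < M < L.
size_dtype = pd.api.types.CategoricalDtype(["S", "M", "L"], ordered=True)
df["T-size"] = df["T-size"].astype(size_dtype)

# Codes follow the declared order: S -> 0, M -> 1, L -> 2.
df["T-size_code"] = df["T-size"].cat.codes
```

Without the explicit dtype, pandas would order the categories alphabetically (L, M, S), which is why the plain .cat.codes call in the question loses the size relationship.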
Regarding your encoding of gender, you need to be very careful when using your target variable (your Label). You don't want to do this encoding before your train-test split, otherwise you're using knowledge of your unseen data, making it not truly unseen. This gets even more complicated if you're using cross-validation, as you'll need to do the encoding within each CV iteration (i.e. a new encoding per fold). If you want to do this, I recommend you check out TargetEncoder from scikit-learn-contrib's Category Encoders, but again, be sure to use it within an sklearn Pipeline or you will mess up the train-test splits and leak information from your test set into your training set.
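To make the leakage point concrete, here is a deliberately naive sketch in plain pandas. It is not TargetEncoder, which does more than this, and the fixed 6/3 split is an assumption for the demo. The per-gender label rate is computed from the training rows only and then applied to the held-out rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M", "M", "F", "F", "M"],
    "Label":  [1, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Hypothetical split: first six rows for training, last three held out.
train = df.iloc[:6]
test = df.iloc[6:]

# Fit the encoding on the training rows only.
gender_means = train.groupby("Gender")["Label"].mean()

# Apply the same mapping to both sets; the test labels are never consulted.
train_encoded = train["Gender"].map(gender_means)
test_encoded = test["Gender"].map(gender_means)
```

Note that the training-only rates (0.75 for M, 0.5 for F) differ from the 4/5 and 2/4 computed on the full sample in the question; that difference is exactly the information that would leak if the encoding were fitted before the split.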

If you want to keep the ordering of your size parameter, you may consider using a linear mapping for it. This would be:
size_mapping = {"S": 1, "M":2 , "L":3}
# apply the mapping to the DataFrame
df['T-size_num'] = df['T-size'].map(size_mapping)
This allows you to treat the input as numerical data while preserving the ordering.
As for the gender, you are conflating the class distribution with the preprocessing. If you feed that distribution in as an input, you will introduce a bias in your data. Treat male and female as two distinct categories regardless of how they are distributed: map them to two different numbers, without introducing proportions.
df['Gender_num'] = df['Gender'].map({'M':0 , 'F':1})
For a more detailed explanation that goes beyond the scope of your question, I suggest reading this article on categorical data in machine learning.

For the first question, if you have a small number of categories, you could map the column with a dictionary. In this way you can set an order:
d = {'L':2, 'M':1, 'S':0}
df['T-size'] = df['T-size'].map(d)
Output:
T-size Gender Label
0 2 M 1
1 2 M 1
2 1 F 1
3 0 F 0
4 1 M 1
5 2 M 0
6 0 F 1
7 0 F 0
8 1 M 1
For the second question, you can use the same method, but I would simply encode males and females as 0 and 1. If you only need to tell the categories apart and don't have to do arithmetic on the values, one value is as good as another.

It might be overkill for the M/F example, since it's binary - but if you are ever concerned about mapping a categorical into a numerical form, then consider one hot encoding. It basically stretches your single column containing n categories, into n binary columns.
So a dataset of:
Gender
M
F
M
M
F
Would become
Gender_M Gender_F
1 0
0 1
1 0
1 0
0 1
This takes away any notion of one thing being more "positive" than another - an absolute must for categorical data with more than 2 options, where there's no transitive A > B > C relationship and you don't want to smear your results by forcing one into your encoding scheme.
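In pandas, one way to produce this layout is pd.get_dummies. A minimal sketch on the Gender column above:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M", "F"]})

# One binary column per category; dtype=int gives 0/1 rather than booleans.
encoded = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
```

The resulting columns come out in alphabetical order (Gender_F before Gender_M), and exactly one of them is 1 in every row.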

Pandas: Nesting Dataframes

Hello, I want to store a dataframe in a cell of another dataframe.
I have data that looks like this:
I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. It would be easy to put the minute-by-minute data in a two-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both kinds of data in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!

Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057

Machine Learning: combining features into single feature

I am a beginner in machine learning. I am confused how to combine different features of a data set into one single feature.
For example, I have a data set in Python Pandas data frame of features like this:
movie unknown action adventure animation fantasy horror romance sci-fi
Toy Story 0 1 1 0 1 0 0 1
Golden Eye 0 1 0 0 0 0 1 0
Four Rooms 1 0 0 0 0 0 0 0
Get Shorty 0 0 0 1 1 0 1 0
Copy Cat 0 0 1 0 0 1 0 0
I would like to convert these n features into one single feature named "movie_genre". One solution would be to assign an integer value to each genre (unknown = 0, action = 1, adventure = 2, etc.) and create a data frame like this:
movie genre
Toy Story 1,2,4,7
Golden Eye 1,6
Four Rooms 0
Get Shorty 3,4,6
Copy Cat 2,5
But in this case the entries in the column will no longer be integer/float values. Will that affect later steps in my machine learning process, like fitting the model and evaluating the algorithms?

Convert each row of zeros and ones into an 8-bit number:
Toy Story = 01101001
In decimal, that's 105.
Similarly, Golden Eye = 01000010 = 66.
You can do the rest manually here: http://www.binaryhexconverter.com/binary-to-decimal-converter
It's relatively straightforward to do programmatically: just loop through each label, assign it the appropriate power of two, then sum them up.
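The power-of-two idea can be sketched as below, using the genre table from the question. Treating the leftmost column (unknown) as the most significant bit is an assumption chosen to match the 01101001 reading above, and the movie titles are used as the index for brevity.

```python
import numpy as np
import pandas as pd

genres = ["unknown", "action", "adventure", "animation",
          "fantasy", "horror", "romance", "sci-fi"]

df = pd.DataFrame(
    [[0, 1, 1, 0, 1, 0, 0, 1],
     [0, 1, 0, 0, 0, 0, 1, 0],
     [1, 0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 1, 1, 0, 1, 0],
     [0, 0, 1, 0, 0, 1, 0, 0]],
    index=["Toy Story", "Golden Eye", "Four Rooms", "Get Shorty", "Copy Cat"],
    columns=genres,
)

# Leftmost column is the most significant bit: 2**7, 2**6, ..., 2**0.
weights = np.array([2 ** i for i in range(len(genres) - 1, -1, -1)])

# Weight each flag by its power of two and sum across the row.
df["movie_genre"] = df[genres].to_numpy() @ weights
```

Each distinct combination of genres maps to a distinct integer, but the integers carry no meaningful ordering: a model that treats movie_genre as a number will see 128 (unknown only) as "larger" than 105 (four genres).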

It may be effective to leave them in their current multi-feature format and perform some sort of dimensionality reduction technique on that data.
This is very similar to a classic question: how do we treat categorical variables? One answer is one-hot or dummy encoding, which your original DataFrame is very similar to. With one-hot encoding, you start with a single, categorical feature. Using that feature, you make a column for each level, and assign a binary value to that column. The encoded result looks quite similar to what you are starting with. This sort of encoding is popular and many find it quite effective. Yours takes this one step further as each movie could be multiple genres. I'm not sure reversing that is a good idea.
Simply having more features is not always a bad thing if it is representing the data appropriately, and if you have enough observations. If you end up with a prohibitive number of features, there are many ways of reducing dimensionality. There is a wealth of knowledge on this topic out there, but one common technique is to apply principal component analysis (PCA) to a higher-dimensional dataset to find a lower-dimensional representation.
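As a rough illustration of the PCA idea, here is a minimal sketch using plain NumPy (centering followed by an SVD) on the genre matrix from the question. Keeping k = 2 components is an arbitrary assumption for the demo, and five rows are far too few for a meaningful result; in practice scikit-learn's PCA class is the usual tool.

```python
import numpy as np

# Genre flags from the question: one row per movie, one column per genre.
X = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
], dtype=float)

# PCA works on mean-centered data.
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2  # number of components to keep (arbitrary for this demo)
X_reduced = X_centered @ Vt[:k].T

# Share of total variance captured by each component.
explained = (S ** 2) / (S ** 2).sum()
```

X_reduced has one row per movie and k columns, and explained[:k].sum() indicates how much of the variance those k components retain.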
Since you're using python, you might want to check out what's available in scikit-learn for more ideas. A few resources in their documentation can be found here and here.

One thing you can do is to make a matrix of all possible combinations and reshape it into a single vector. If you want to account for all combinations it will have the same length as the original. If there are combinations that you don't need simply don't take them into account. Your network is label-agnostic and it won't mind.
But why is that a problem? Your dataset looks small enough.
