I use a GradientBoosting classifier to predict the gender of users. The data has many predictors, one of which is the country. For each country I have a binary column, and exactly one of these columns is set to 1 in each row. But this representation is very slow from a computational point of view. Is there a correct way to represent the country columns with only one column?
You can replace the binary variable with the actual country name then collapse all of these columns into one column. Use LabelEncoder on this column to create a proper integer variable and you should be all set.
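A minimal sketch of the collapse-and-encode idea, using pandas only (the column names are hypothetical, and `.cat.codes` produces the same kind of integer labels as `LabelEncoder`):

```python
import pandas as pd

# Hypothetical one-hot country columns like those described in the question
df = pd.DataFrame({
    'country_US': [1, 0, 0],
    'country_DE': [0, 1, 0],
    'country_FR': [0, 0, 1],
})

# Collapse the binary columns back into a single country-name column:
# idxmax finds, per row, the name of the column holding the 1
one_hot = df.filter(like='country_')
df['country'] = one_hot.idxmax(axis=1).str.replace('country_', '', regex=False)

# Integer-encode the names (equivalent to what LabelEncoder produces)
df['country_code'] = df['country'].astype('category').cat.codes
```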
I have two datasets, patient data and disease data.
The patient dataset has diseases written in alphanumeric code format which I want to search in the disease dataset to display the disease name.
Patient dataset snapshot
Disease dataset snapshot
I want to use the groupby function on the ICD column to find the occurrences of each disease, rank them in descending order, and display the top 5. I have been trying to find a reference for this, but could not.
Would appreciate the help!
EDIT!!
avg2 = joined.groupby('disease_name').TIME_DELTA.mean().disease_name.value_counts()
I am getting this error "'Series' object has no attribute 'disease_name'"
Assuming that your data are in two pandas dataframes called patients and diseases, and that the diseases dataset has the columns disease_id and disease_name, this could be a solution:
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined.disease_name.value_counts().head(5)
This solution joins the data together and then uses value_counts instead of grouping. It should solve what I perceive to be what you are asking for, even if it is not exactly the functionality you asked for.
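Regarding the error in the EDIT: `groupby(...).mean()` already returns a Series indexed by disease_name, so chaining `.disease_name` on it fails. A sketch of the separated steps, with toy data standing in for the merged dataframe (column names taken from the question):

```python
import pandas as pd

# Toy stand-in for the merged patients/diseases data
joined = pd.DataFrame({
    'disease_name': ['flu', 'flu', 'cold'],
    'TIME_DELTA': [2.0, 4.0, 6.0],
})

# The mean is a Series indexed by disease_name; no further attribute
# access is needed (or possible) to get the per-disease averages
avg2 = joined.groupby('disease_name')['TIME_DELTA'].mean()

# The occurrence counts come from a separate value_counts call
counts = joined['disease_name'].value_counts().head(5)
```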
I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create this new column.
I tried some code already, but I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one has 1518 rows x 3 columns and the other has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset, with the Postcode and Regio columns mapping each ZipCode to its corresponding Regio.
df2 is the second dataset, where the Regio column is missing, as you can see. I would like to add a new column to df2 containing the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the ZipCode from dataframe 2 to the Regio column from the first dataframe, assuming Postcode and ZipCode are the same.
First create a dictionary from df1, then build the new column by replacing the zip code values with the dictionary's regions:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
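A self-contained sketch of the dictionary approach, with toy stand-ins for both dataframes (the real Postcode/ZipCode values are of course different). Using `map` instead of `replace` leaves unmatched zip codes as NaN, which makes gaps easy to spot:

```python
import pandas as pd

# Toy stand-ins: df1 is the Postcode -> Regio lookup, df2 needs a Regio column
df1 = pd.DataFrame({'Postcode': [1011, 2511, 3511],
                    'Regio': ['West', 'West', 'Midden']})
df2 = pd.DataFrame({'ZipCode': [2511, 1011, 9999]})

# Build the lookup dictionary and create the new column
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2['ZipCode'].map(zip_dict)  # 9999 has no match -> NaN
```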
I have a few questions about preparing data for learning.
I'm very confused about how to convert columns to categorical and binary columns when I want to use them for correlations and a decision tree classifier.
For example, in NBA_df, to convert the position column to a categorical column for use with a decision tree, can I convert it with .astype('category').cat.codes? (I know that in basketball you can denote the position by a number 1-5.)
NBA_df
And in students_df, why is it more correct to convert the 'gender', 'race/ethnicity', 'lunch', and 'test preparation course' columns to new binary columns with .get_dummies rather than doing the categorical conversion in the same column?
students_df
Is it the same for correlations and trees?
I'm not sure I totally understand what you mean by converting to categorical "in the same column", but I assume you mean replacing the categorical response from positions into numbers 1 through 5 and keeping those numbers in the same column.
Assuming this is what you meant, you have to think about how the computer will interpret the input. Is a Small Forward (position 3 in basketball) 3 times a Point Guard (1 * 3)? Of course not, but a computer will see it that way. It will determine relationships with the target that are not realistic. For this reason, you need separate columns with a binary indicator like .get_dummies is doing. That way, the computer will not see the positions as numeric values that can be operated on, but it will see the positions as separate entities.
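A small sketch contrasting the two encodings (the position labels here are hypothetical abbreviations): `.cat.codes` keeps one column but imposes an arbitrary ordering, while `get_dummies` produces one binary column per position with no ordering at all.

```python
import pandas as pd

# Toy positions column
df = pd.DataFrame({'position': ['PG', 'SF', 'C', 'PG']})

# Ordinal encoding: one column, but the integers imply a fake magnitude
codes = df['position'].astype('category').cat.codes

# One-hot encoding: one binary indicator column per position
dummies = pd.get_dummies(df['position'], prefix='pos')
```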
I have a dataframe from which I want to extract values from two columns, where the criterion is the unique values of one of the columns. In the image below, I want to extract the unique values of 'education' along with the corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique(), but I am stuck on extracting the matching 'education-num' values.
image of the dataframe.
(Originally the task was to compute the population of people with an education of Bachelors, Masters, or Doctorate, and I assume this would be easier by comparing 'education-num' rather than using logical operators on strings. But if there is any way to do it directly from 'education', that would also be helpful.
Edit: It turns out DataFrame.isin helps to select rows by a list of strings, as given in the solution here.)
P.S. stack-overflow didn't allow me to post the image directly and posted a link to it instead...😒
Select columns by subset and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If you need to count the population, use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')
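For the original task mentioned in the question's edit, the `DataFrame.isin` approach could look like this (the data here are toy rows mimicking the columns in the question):

```python
import pandas as pd

# Toy data mimicking the education columns
df = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'Doctorate', 'Bachelors'],
    'education-num': [13, 9, 14, 16, 13],
})

# Count rows whose education is in the target list
target = ['Bachelors', 'Masters', 'Doctorate']
population = df['education'].isin(target).sum()
```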
I have a data set that is almost perfect, but there is a column called Refferer which has 3672 missing values; here you can find an image of the current data set. The Refferer column contains categorical attributes, and I would like to replace all the NaN cells while keeping the proportions of the strings already populating the column (find the proportions here).
Is there a way to automatically do that with a pandas DataFrame?
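One possible sketch of such proportional imputation (the column name comes from the question; the values and the use of `numpy.random.Generator.choice` are illustrative assumptions): compute the proportions among the non-missing values, then sample replacements with those probabilities.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy column with missing values, standing in for the real Refferer column
df = pd.DataFrame({'Refferer': ['google', 'direct', 'google', None, None, 'google']})

# Observed proportions among the non-missing values
props = df['Refferer'].value_counts(normalize=True)

# Draw replacements with matching probabilities and fill the gaps
n_missing = df['Refferer'].isna().sum()
fill = rng.choice(props.index, size=n_missing, p=props.values)
df.loc[df['Refferer'].isna(), 'Refferer'] = fill
```

Note that this only preserves the proportions in expectation; a deterministic alternative would be to allocate the missing slots exactly according to the rounded proportions.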