I have this column in my dataset where the values are not consistent.
Some values have just one decimal place, while others have four.
I need this column to calculate some means.
How can I treat this column?
Related
I have two DataFrames, each containing customer account information (e.g. name, address, sales, latitude and longitude). Both DataFrames have the latitude and longitude of the account. I'd like to match the accounts in the two DataFrames based on latitude and longitude, assuming that if the latitude and longitude both match to 4 decimal places then the accounts must also match. The output would be a new series (e.g. "Matched") in one of the two DataFrames that is "1" if there is a match or "0" if there isn't a match.
How can I do this using Pandas?
Synthesize a string column of "lat, lng" formatted to your favorite decimal precision.
Then just .merge( ... ) your dataframes on that.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
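A minimal sketch of that idea, assuming both frames have columns named lat and lon (the sample data, the coord_key helper and the "key" column name are made up for illustration). It formats each coordinate to 4 decimals, joins them into one string key, and uses merge with indicator=True to flag matches:

```python
import pandas as pd

# Hypothetical sample data; only the lat/lon column names come from the question.
df1 = pd.DataFrame({
    "name": ["A", "B", "C"],
    "lat": [40.712811, 34.052235, 41.878113],
    "lon": [-74.006012, -118.243683, -87.629799],
})
df2 = pd.DataFrame({
    "name": ["A2", "C2"],
    "lat": [40.712838, 41.878949],
    "lon": [-74.006049, -87.629799],
})

def coord_key(frame, decimals=4):
    """Build a 'lat, lng' string key at the given decimal precision."""
    fmt = lambda v: f"{v:.{decimals}f}"
    return frame["lat"].map(fmt) + ", " + frame["lon"].map(fmt)

df1["key"] = coord_key(df1)
# Keep only unique keys from df2 so the left merge cannot duplicate rows of df1.
keys2 = pd.DataFrame({"key": coord_key(df2)}).drop_duplicates()

merged = df1.merge(keys2, on="key", how="left", indicator=True)
df1["Matched"] = (merged["_merge"] == "both").astype(int).to_numpy()
print(df1[["name", "key", "Matched"]])
```

Note that two points that are very close but fall on opposite sides of a rounding boundary will get different keys, so this treats "match" strictly as "same value after formatting".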
I have a few questions about preparing data for learning.
I'm very confused about how to convert columns to categorical and binary columns when I want to use them for correlations and a decision tree classifier.
For example, in NBA_df, to convert the position column to a categorical column for use with a decision tree, can I convert it with .astype('category').cat.codes? (I know that in basketball you can denote the position by a number from 1 to 5.)
NBA_df
And in students_df, why is it more correct to convert the 'gender', 'race/ethnicity', 'lunch' and 'test preparation course' columns to new binary columns with .get_dummies rather than doing the categorical conversion in the same column?
students_df
Is it the same for correlations and trees?
I'm not sure I totally understand what you mean by converting to categorical "in the same column", but I assume you mean replacing the categorical response from positions into numbers 1 through 5 and keeping those numbers in the same column.
Assuming this is what you meant, you have to think about how the computer will interpret the input. Is a Small Forward (position 3 in basketball) 3 times a Point Guard (1 * 3)? Of course not, but a computer will see it that way. It will determine relationships with the target that are not realistic. For this reason, you need separate columns with a binary indicator like .get_dummies is doing. That way, the computer will not see the positions as numeric values that can be operated on, but it will see the positions as separate entities.
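To make the difference concrete, here is a small sketch comparing the two encodings. The nba_df frame and its position values are invented for illustration; only .astype('category').cat.codes and get_dummies come from the discussion above.

```python
import pandas as pd

# Hypothetical stand-in for the position column of NBA_df.
nba_df = pd.DataFrame({"position": ["PG", "SF", "C", "PG"]})

# Integer codes in a single column: the values imply an ordering (C < PG < SF)
# that has no real meaning for positions.
codes = nba_df["position"].astype("category").cat.codes

# One binary indicator column per position: no ordering is implied.
dummies = pd.get_dummies(nba_df["position"], prefix="position")

print(codes.tolist())
print(dummies)
```

Each row of dummies has exactly one indicator set, so a model sees the positions as separate entities rather than as points on a numeric scale.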
Suppose I have a dataframe with an index column filled with strings. Now, suppose I have very similar but somewhat different strings that I want to use to look up rows within the dataframe. How would I do this since they aren't identical? My guess would be to simply choose the row with the lowest distance between the two strings, but I'm not really sure how I could do that efficiently.
For example, if my dataframe is:
and I want to lookup "Lord of the rings", I should get the 2nd row. How would I do this in pandas?
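One possible sketch of the "closest string" idea, using difflib from the standard library rather than a dedicated edit-distance package. The frame, its index labels and the fuzzy_loc helper are all hypothetical, since the original example data is not shown:

```python
import difflib
import pandas as pd

# Hypothetical frame with a string index; not the data from the question.
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=["Harry Potter", "The Lord of the Rings", "Star Wars"],
)

def fuzzy_loc(frame, query, cutoff=0.6):
    """Return the row whose index label is most similar to query."""
    # Compare case-insensitively, but remember the original labels.
    lowered = {str(label).lower(): label for label in frame.index}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    if not hits:
        raise KeyError(query)
    return frame.loc[lowered[hits[0]]]

row = fuzzy_loc(df, "Lord of the rings")
print(row.name)
```

This compares the query against every label, so it is linear in the size of the index; for very large indexes a specialised fuzzy-matching library may be a better fit.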
I have a dataset like:
pointID lat lon otherinfo
I want to round up the coordinates and aggregate all the points whose coordinates become equal into one single item, and assign it a new name, which would probably be a new dataframe column. The "otherinfo" column must be preserved, meaning that by the end of the operation I will have the same number of rows I had before, but with new IDs based on the rounded coordinates.
How can I achieve this using pandas? Is it any easier if I use geoPandas?
If you already have columns for the coordinates (lat and lon), you can do, for example (rounding to 2 decimal places):
df['new_id'] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()
The ngroup method on the groupby returns, for each original row, the number of the group it belongs to, which in effect gives you a new unique ID based on the rounded lat/lon.
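A small runnable illustration, with made-up points that use the column layout from the question (pointID, lat, lon, otherinfo):

```python
import pandas as pd

# Hypothetical sample points; the first two round to the same coordinates.
df = pd.DataFrame({
    "pointID": [1, 2, 3],
    "lat": [40.7128, 40.7131, 34.0522],
    "lon": [-74.0060, -74.0059, -118.2437],
    "otherinfo": ["a", "b", "c"],
})

# Same number of rows as before; new_id is shared by rows whose
# rounded lat/lon are equal.
df["new_id"] = df.groupby([df.lat.round(2), df.lon.round(2)]).ngroup()
print(df)
```

The first two rows receive the same new_id and the third gets a different one, while otherinfo is left untouched.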
I use a GradientBoosting classifier to predict the gender of users. The data have a lot of predictors, and one of them is the country. For each country I have a binary column. Exactly one of the country columns is set to 1 in each row. But this design is very slow from a computational point of view. Is there any way to represent the country columns with only one column? I mean a correct way.
You can replace the binary variable with the actual country name then collapse all of these columns into one column. Use LabelEncoder on this column to create a proper integer variable and you should be all set.
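A rough sketch of those two steps, assuming the binary columns share a "country_" prefix (the column names and data here are invented). idxmax(axis=1) picks the name of the column holding the 1 in each row, and scikit-learn's LabelEncoder turns the names into integers:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical one-hot country columns; exactly one is 1 per row.
df = pd.DataFrame({
    "country_DE": [0, 1, 0, 0],
    "country_FR": [0, 0, 1, 0],
    "country_US": [1, 0, 0, 1],
})
country_cols = ["country_DE", "country_FR", "country_US"]

# Collapse the indicator columns into a single column of country names.
df["country"] = (
    df[country_cols].idxmax(axis=1).str.replace("country_", "", regex=False)
)

# Encode the names as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
df["country_code"] = encoder.fit_transform(df["country"])

df = df.drop(columns=country_cols)
print(df)
```

Keep in mind that the resulting integers carry an arbitrary ordering, so how well a single encoded column works depends on the model.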