How to one-hot encode a dataframe column with multiple strings? - python

I am currently working on building a regressor model to predict food delivery time.
This is the dataframe, with a few observations.
As you can see, the Cuisines column contains multiple strings per row. I used the code
pd.get_dummies(data.Cuisines.str.split(',',expand=True),prefix='c')
This helped me split the strings and one-hot encode them; however, there is a new issue to deal with.
I merged the dataframe and the dummies. fastfood appears in the 1st and 3rd rows. The expected output was a single fastfood column with value 1 in the first and third rows; instead, two fastfood columns are created: fastfood (4th column) for the first row and fastfood (15th column) for the third row.
Can someone help me get a single fastfood column with value 1 in the first and third rows, and similarly for the other cuisines?

The two Fast Food columns differ by a trailing space. You probably want to try:
data.Cuisines.str.get_dummies(',\s*')
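If the separator-as-regex call doesn't behave as expected in your pandas version, an alternative is to normalize the whitespace around the commas before encoding. A sketch with made-up sample data:

```python
import pandas as pd

# Hypothetical sample reproducing the issue: whitespace around the commas
# makes "Fast Food" and " Fast Food" look like different cuisines.
data = pd.DataFrame({"Cuisines": ["Fast Food, Chinese", "Pizza", "Chinese,Fast Food"]})

# Strip the whitespace around commas, then one-hot encode on a literal comma.
dummies = (
    data["Cuisines"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(",")
)
print(dummies)
```

This yields a single Fast Food column with 1 in the first and third rows, as the question expects.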

Related

How to parse batches of flagged rows and keep the row satisfying some conditions in a Pandas dataframe?

I have a dataframe containing duplicates that are flagged by a specific variable.
The rows to keep and their duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. For each batch, I would like to keep the row based on one condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" column, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algorithm).
Unlike already-posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition selecting the row to keep is not fixed (like keeping the first or last row within the batch of duplicates) but depends on the completeness of the rows in each batch.
How can I parse the dataframe according to this "duplicate" column and extract the row I want from each batch?
I tried to assign a unique label to each batch, in order to iterate over these labels, but it failed.
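A possible sketch, assuming the flag column is named "duplicate" and each batch starts with a row flagged "keep" (the sample data below is made up): derive a batch id with a cumulative sum over the flag, then keep the row with the fewest empty cells in each batch.

```python
import pandas as pd

# Made-up sample: each batch starts with the row flagged "keep".
df = pd.DataFrame({
    "name":      ["Alice", "Alice", "Bob",  "Bob"],
    "lastname":  [None,    "Smith", None,   "Jones"],
    "phone":     [None,    "123",   "456",  "456"],
    "duplicate": ["keep",  "dup",   "keep", "dup"],
})

# A new batch begins whenever the flag is "keep".
df["batch"] = (df["duplicate"] == "keep").cumsum()

# Within each batch, keep the row with the smallest number of empty cells.
best = df.loc[
    df.drop(columns=["duplicate", "batch"])
      .isna().sum(axis=1)
      .groupby(df["batch"])
      .idxmin()
]
print(best)
```

For the "Alice" batch this picks the second row (0 empty cells) rather than the row flagged "keep", matching the condition described in the question.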

Checking if a pandas column value is present in another pandas column (list)

I have a pandas column with a string value, and I want to see if a separate column (in list format) contains that string at all.
Category: Category A. Molecular Pathogenesis and Physiology
top predicted: (see the list below)
[("Category A. Molecular Pathogenesis and Physiology::HiClass::Separator::1. Amyloid beta::HiClass::Separator::f. Amyloid Structure",
0.054),
('Category B. Diagnosis and Assessment::HiClass::Separator::8. Methodologies::HiClass::Separator::None',
0.049),
('Category B. Diagnosis and Assessment::HiClass::Separator::1. Fluid Biomarkers::HiClass::Separator::b. Blood-based',
0.035)]
The generated list provides the Category and 2 further sub-categories.
What I want is a way to determine how many times the Category column value appears in the list in the top predicted column. In the above case, "Category A. Molecular Pathogenesis and Physiology" would return 1; if the value were "Category B. Diagnosis and Assessment", then 2 would be returned.
This would then iterate through the rows of the pandas dataframe.
Any help in achieving this would be much appreciated :) Many thanks!
Your second column contains a list of tuples, which in turn contain the strings to check for. The following line of code should do it:
df['count'] = df.apply(lambda row: sum(1 for x in row['top predicted'] if row['Category'] in x[0]), axis=1)
You should use apply() instead of iterating over the rows as you suggested.
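A quick check of that line on a small made-up sample mirroring the question's structure (tuples of a "::"-joined category path and a score):

```python
import pandas as pd

# Hypothetical sample: one row whose Category matches two list entries.
df = pd.DataFrame({
    "Category": ["Category B. Diagnosis and Assessment"],
    "top predicted": [[
        ("Category A. Molecular Pathogenesis and Physiology::1. Amyloid beta", 0.054),
        ("Category B. Diagnosis and Assessment::8. Methodologies", 0.049),
        ("Category B. Diagnosis and Assessment::1. Fluid Biomarkers", 0.035),
    ]],
})

# Count, per row, how many tuples mention the row's Category.
df["count"] = df.apply(
    lambda row: sum(1 for x in row["top predicted"] if row["Category"] in x[0]),
    axis=1,
)
print(df["count"].tolist())  # [2]
```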

Sort the DataFrame columns which are dynamically generated

I have a dataframe which is similar to this
d1 = pd.DataFrame({'name':['xyz','abc','dfg'],
'age':[15,34,22],
'sex':['s1','s2','s3'],
'w-1(6)':[96,66,74],
'w-2(5)':[55,86,99],
'w-3(4)':[11,66,44]})
Note that in my original DataFrame the week columns are generated dynamically, i.e. the columns w-1(6), w-2(5) and w-3(4) change every week. I want to sort by all three week columns in descending order of their values.
But the names of the columns cannot be used, as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks; if w-1 has no data, I won't have that column in the dataset at all, which would mean only two week columns instead of three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]] , ascending=False)
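Since the number of week columns can vary (per the edit), a variant that slices every column after the three fixed ones (name, age, sex) instead of hard-coding three indices; a sketch with made-up data:

```python
import pandas as pd

# Made-up sample with only two week columns this time.
d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99]})

# Everything after the three fixed columns is a week column,
# however many of them there happen to be this week.
week_cols = list(d1.columns[3:])
result = d1.sort_values(by=week_cols, ascending=False)
print(result)
```

This works unchanged whether one, two, or three week columns are present.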

How can I create a column with matching values from different datasets with different lengths?

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create the new column.
I tried some code already; however, I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets: one has 1518 rows x 3 columns and the other has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset, with the Postcode and Regio columns, i.e. the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset, where the Regio column is missing, as you can see. I would like to add a new column to df2 that contains the corresponding Regio.
I hope someone can help me out.
Kind regards.
I believe you need to map the zipcodes in df2 to the region column of df1 (assuming Postcode and ZipCode are the same thing).
First create a dictionary from df1, then translate the zipcode values through that dictionary and assign the result to a new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
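As a sketch with made-up data, the same idea using Series.map, which leaves NaN for zip codes that have no match in df1:

```python
import pandas as pd

# Hypothetical miniatures of the two dataframes described in the question.
df1 = pd.DataFrame({"Postcode": [1011, 1012, 2011],
                    "Regio":    ["North", "North", "South"]})
df2 = pd.DataFrame({"ZipCode": [1012, 2011, 1011]})

# Build a Postcode -> Regio lookup and map it onto df2 as a new column.
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2["Regio"] = df2.ZipCode.map(zip_dict)
print(df2)
```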

Python Pandas extract unique values from a column and another column

I am studying pandas, bokeh, etc. to get started with data visualization. Right now I am practising with a giant table containing different birds. There are plenty of columns; two of those columns are "SCIENTIFIC NAME" and "OBSERVATION COUNT".
I want to extract those two columns.
I did
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]]
but the problem then is that every entry stays in the table (there are sometimes multiple rows for the same SCIENTIFIC NAME due to other columns, even though the OBSERVATION COUNT is always the same for a given scientific name).
How can I get those two columns but with unique values, so that every scientific name appears once, with the corresponding observation count?
EDIT: I just realized that the same scientific name sometimes has different observation counts due to another column. Is there a way to extract the first occurrence of each unique item in a column?
IIUC, You can use drop_duplicates:
df2 = df[["SCIENTIFIC NAME" , "OBSERVATION COUNT"]].drop_duplicates()
To get counts:
df2 = df.groupby(["SCIENTIFIC NAME" , "OBSERVATION COUNT"])["SCIENTIFIC NAME"].count()
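For the EDIT, where the same name can carry different counts, a sketch (with made-up data) that keeps the first occurrence of each scientific name via drop_duplicates with a subset:

```python
import pandas as pd

# Made-up sample: the same name appears with two different counts.
df = pd.DataFrame({
    "SCIENTIFIC NAME":   ["Corvus corax", "Corvus corax", "Pica pica"],
    "OBSERVATION COUNT": [10, 12, 5],
    "LOCATION":          ["A", "B", "A"],
})

# Keep only the first row seen for each scientific name.
first_counts = (
    df[["SCIENTIFIC NAME", "OBSERVATION COUNT"]]
    .drop_duplicates(subset="SCIENTIFIC NAME", keep="first")
)
print(first_counts)
```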
