How can we extract duplicate values from multiple columns? - python

I have a dataset regarding Big Mart sales.
(You can find it here)
https://www.kaggle.com/brijbhushannanda1979/bigmart-sales-data
In the dataset there are columns like 'Outlet_Location_Type' and 'Outlet_Size'.
I want to find how many Tier 1 locations have a Medium 'Outlet_Size' and visualize this with a grouped bar chart. I need a Pythonic solution to this.
Any help is appreciated.

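You need to use the groupby method:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Test.csv')
# Keep only Tier 1 rows, then count the rows for each outlet size
counts = df[df['Outlet_Location_Type'] == 'Tier 1'].groupby('Outlet_Size').count()
After count(), every column holds the number of non-null values per group, so pick a column without missing values (for example 'Item_Identifier') to plot the count:
counts['Item_Identifier'].plot(kind='bar')
plt.show()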
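For the grouped bar chart the question asks about, one option is pd.crosstab, which counts every combination of the two columns at once. This is only a minimal sketch: the sample rows are made up to stand in for the Kaggle file, and plot_grouped is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample standing in for the Big Mart CSV (not real data).
df = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 1', 'Tier 1', 'Tier 2', 'Tier 3', 'Tier 3'],
    'Outlet_Size': ['Medium', 'Small', 'Medium', 'Small', 'Medium', 'High'],
})

# Rows: location tier, columns: outlet size, values: number of rows.
counts = pd.crosstab(df['Outlet_Location_Type'], df['Outlet_Size'])

# Number of Tier 1 rows whose outlet size is Medium.
tier1_medium = counts.loc['Tier 1', 'Medium']


def plot_grouped(table):
    """Draw one group of bars per tier, with one bar per outlet size."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    table.plot(kind='bar')
    plt.ylabel('Number of rows')
    plt.show()
```

Calling plot_grouped(counts) draws the chart; tier1_medium holds the single number asked for.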

Related

Pandas - How do you find the top n elements of 1 column based on a condition from another column

I am struggling with a question based on Pandas. I have an earthquake data set with columns of countries and magnitudes. I am asked to:
"Find the top 10 states / countries where the strongest and weakest
earthquakes occurred."
From this question, I gathered that I am meant to find the top 10 countries ["country"] with the highest values (value_counts), but sorted by magnitude ["mag"].
How would I go about doing this? I've looked around but there's nothing I've found about this online.
Are you sure you did not find anything useful? If I understand your question correctly, it is a simple one. After loading the data into a dataframe, the methods below will give you what you need.
import pandas as pd
df = pd.read_csv(".csv")  # path to your CSV file
df.nlargest(x, ['Column Name'])
Here x is the number of rows to return, ordered by the largest values in that column.
The same goes for nsmallest. The signatures are:
DataFrame.nsmallest(n, columns, keep='first')
DataFrame.nlargest(n, columns, keep='first')
Please read and check the documentation first.
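As a rough illustration, here is one possible reading of the task: rank countries by their strongest and by their weakest recorded quake. The rows are invented; only the column names "country" and "mag" come from the question, and 2 is used instead of 10 to keep the sample small.

```python
import pandas as pd

# Hypothetical earthquake records; only the column names come from the question.
quakes = pd.DataFrame({
    'country': ['Chile', 'Japan', 'Chile', 'Peru', 'Japan', 'Italy'],
    'mag': [8.8, 9.1, 6.2, 7.5, 5.4, 4.9],
})

# Strongest quake per country, then the 2 countries with the largest value.
strongest = quakes.groupby('country')['mag'].max().nlargest(2)

# Weakest quake per country, then the 2 countries with the smallest value.
weakest = quakes.groupby('country')['mag'].min().nsmallest(2)
```

If the task only needs the individual strongest rows rather than one entry per country, quakes.nlargest(2, 'mag') would be enough.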

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are in total 5 Regions. Every Region consists of an x amount of ZipCodes. I would like to use the two different datasets to create a new column.
I tried some codes already, however, I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets, one of them has 1518 rows x 3 columns and the other one has
46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zip codes in the second dataframe to the Regio column of the first dataframe, assuming Postcode and ZipCode hold the same kind of values.
First create a dictionary from df1, then look up each zip code in it and store the result in a new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2['ZipCode'].map(zip_dict)
Zip codes that do not appear in df1 become NaN.
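A small runnable illustration of the dictionary lookup, using Series.map and made-up zip codes and region names (the real data is only shown in the picture, so every value here is an assumption):

```python
import pandas as pd

# Hypothetical lookup table (df1) and target table (df2).
df1 = pd.DataFrame({
    'Postcode': [1011, 1012, 2011],
    'Regio': ['North', 'North', 'South'],
})
df2 = pd.DataFrame({'ZipCode': [2011, 1011, 9999]})

# Build the Postcode -> Regio lookup and apply it row by row.
zip_dict = dict(zip(df1['Postcode'], df1['Regio']))
df2['Regio'] = df2['ZipCode'].map(zip_dict)
```

The differing lengths of df1 and df2 do not matter here, because the lookup goes through the dictionary rather than through the row labels. The zip code 9999 has no entry in df1, so its Regio is NaN.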

Extract unique value with multiple columns from DataFrame

I have a dataframe where I want to extract values from two columns but the criteria set is unique values from one of the columns. In the image below, I want to extract unique values of 'education' along with its corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique() and I am stuck with not being able to extract the 'education-num'.
image of the dataframe.
(Originally the task was to compute the population of people with education of Bachelors, Masters and Doctorate and I assume this would be easier when comparing the 'education-num' rather than logical operators on string. But if there's any way we could do it directly from the 'education' that would also be helpful.
Edit: Turns out the Dataframe.isin helps to select rows by the list of string as given in the solution here.)
P.S. stack-overflow didn't allow me to post the image directly and posted a link to it instead...😒
Select the two columns and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If you need the population count for each education level, use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')
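A minimal sketch on made-up rows (only the column names come from the question), which also counts degree holders directly from 'education' with isin, as mentioned in the question's edit:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the question.
df = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', 'Bachelors', 'Doctorate', 'HS-grad'],
    'education-num': [13, 14, 9, 13, 16, 9],
})

# One row per distinct (education, education-num) pair.
pairs = df[['education', 'education-num']].drop_duplicates()

# Count people whose education is one of the listed degrees.
degrees = ['Bachelors', 'Masters', 'Doctorate']
n_degree_holders = df['education'].isin(degrees).sum()
```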

Perform actions such as median, mean, etc. on certain unique values from a particular column using Pandas

animals = pd.DataFrame({'animal': ['Dog','Cat','Snake','Snake','Dog','Hamster','Cat','Alligator','Cat','Cat','Dog','Hamster','Alligator'],
'age':[2,1,5,7,5,1,4,15,6,9,3,2,40],
'weight':[10,4,3,20,15,0.1,6,300,7.1,10,12,0.15,350],
'length':[1,0.45,1,2,1.2,0.16,0.40,4.8,0.45,0.50,0.49,0.14,5]})
Suppose I have such a data frame
and I want to find out let's say what the average weight of cats is.
How can this be done?
Look into groupby and mean. It's similar to what you'd do with SQL.
animals.groupby('animal').mean().loc['Cat', 'weight']
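If only one animal is of interest, a boolean mask avoids aggregating every group first. A short sketch using the 'animal' and 'weight' columns from the question, plus a per-animal table with several statistics at once:

```python
import pandas as pd

# The 'animal' and 'weight' columns from the question's data frame.
animals = pd.DataFrame({
    'animal': ['Dog', 'Cat', 'Snake', 'Snake', 'Dog', 'Hamster', 'Cat',
               'Alligator', 'Cat', 'Cat', 'Dog', 'Hamster', 'Alligator'],
    'weight': [10, 4, 3, 20, 15, 0.1, 6, 300, 7.1, 10, 12, 0.15, 350],
})

# Select only the cat rows, then average their weight.
cat_mean = animals.loc[animals['animal'] == 'Cat', 'weight'].mean()

# Mean and median weight for every animal in one table.
per_animal = animals.groupby('animal')['weight'].agg(['mean', 'median'])
```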

Z-score calculation/standardisation using pandas

I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean'])/df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T looks like this and df looks like this
df.describe().T['mean'] and df.describe().T['std'] are two individual Series whose index holds the df column names (in df.describe().T the summary statistics are the columns), while df is an ordinary pd.DataFrame with a numerical index and the column names in the usual place.
My question is: how does that line make sense when the shapes do not match at all? In particular, how do they ensure that every observation (x_i) is matched with the mean and std of its own column?
Thank you.
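A minimal sketch of why the line can work, assuming the intended call is df.describe(): when a DataFrame and a Series are combined arithmetically, pandas aligns the Series index with the DataFrame's column labels, so each column is paired with its own mean and std by name and the operation is applied to every row. The data below is made up, and the statistics are deliberately reordered to show that matching is by label rather than by position.

```python
import pandas as pd

# Hypothetical numeric data with a default integer index.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})

stats = df.describe().T  # index: column names of df; columns: count, mean, std, ...

# Reorder the statistics on purpose: labels, not positions, drive the matching.
mean = stats['mean'][['b', 'a']]
std = stats['std'][['b', 'a']]

# Each column of df is combined with the entry that carries the same label.
df_z = (df - mean) / std

# The same computation written with the column-wise helpers.
same = (df - df.mean()) / df.std()
```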
