Stacked Bar Plot with Two Key DataFrame - python

I have a dataframe with two keys. I'm looking to do a stacked bar plot of the number of items within key2 (meaning taking the count values from a fully populated column of data).
A small portion of the dataframe I have is:
Sector            industry
Basic Industries  Agricultural Chemicals         17
                  Aluminum                        3
                  Containers/Packaging            1
                  Electric Utilities: Central     2
                  Engineering & Construction     12
Name: Symbol, dtype: int64
Key1 is Sector, Key2 is industry. I want the values in Symbol (the counted column) to be represented as industry stackings in a bar comprising Basic Industries.
I know if I do a df.reset_index I'll have a column with (non-unique) Sectors and Industries with an integer counter. Is there a way to simply assign the column 1,2,3 data to pandas plot or matplotlib to make a stacked bar chart?
Alternatively, is there a way to easily specify using both keys in the aforementioned dataframe?
I'm looking for both guidance on approach from more experienced people as well as help with the actual syntax.

I just added a new Sector to improve the example.
                                                 Symbol
Sector             industry
Basic Industries   Agricultural Chemicals            17
                   Aluminum                           3
                   Containers/Packaging               1
                   Electric Utilities: Central        2
                   Engineering & Construction        22
Basic Industries2  Agricultural Chemicals             7
                   Aluminum                           8
                   Containers/Packaging              11
                   Electric Utilities: Central        7
                   Engineering & Construction         4
Assuming your dataframe is indexed by ["Sector", "industry"], first call reset_index, then pivot the dataframe so that each industry becomes a column, and finally draw the stacked plot:
df.reset_index().pivot_table(index="industry", columns="Sector", values="Symbol").T.plot(kind='bar', stacked=True, figsize=(14, 6))

Alternatively, instead of reset_index, you can unstack the inner index level directly:
df.unstack().Symbol.plot(kind='bar', stacked=True)
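
A self-contained sketch of the unstack approach, built from the sample counts in the question. The names `wide` and `plot_stacked` are illustrative and not part of the original answer; the plotting call sits in a function so the reshaped table can be inspected on its own first.

```python
import pandas as pd

sectors = ["Basic Industries", "Basic Industries2"]
industries = [
    "Agricultural Chemicals",
    "Aluminum",
    "Containers/Packaging",
    "Electric Utilities: Central",
    "Engineering & Construction",
]

# Rebuild the two-key frame from the question: (Sector, industry) -> Symbol count.
index = pd.MultiIndex.from_product([sectors, industries], names=["Sector", "industry"])
df = pd.DataFrame({"Symbol": [17, 3, 1, 2, 22, 7, 8, 11, 7, 4]}, index=index)

# Move the inner key into the columns: one row per Sector, one column per industry.
wide = df["Symbol"].unstack("industry")


def plot_stacked(table):
    """Draw one bar per row of `table`, stacking its columns (requires matplotlib)."""
    ax = table.plot(kind="bar", stacked=True, figsize=(14, 6))
    ax.set_ylabel("Number of symbols")
    return ax
```

Calling `plot_stacked(wide)` should give one bar per Sector with one coloured segment per industry.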

Related

Which imputation technique to use for filling missing population data based on 3 categorical columns?

I am fairly new to data science. Apologies if the question is unclear.
My data is in the following format:
   year  age_group      pop     Gender  Ethnicity
0  1957  0 - 4 Years    264727  Mixed   Mixed
1  1957  5 - 9 Years    218097  Male    Indian
2  1958  10 - 14 Years  136280  Female  Indian
3  1958  15 - 19 Years  135679  Female  Chinese
4  1959  20 - 24 Years  119266  Mixed   Mixed
...
Here Mixed means both Male and Female for Gender, and Indian, Chinese and others for Ethnicity. The pop column is the population.
I have some rows with missing values like the following:
     year  age_group        pop  Gender  Ethnicity
344  1958  70 - 74 Years    NaN  Mixed   Mixed
345  1958  75 - 79 Years    NaN  Male    Indian
346  1958  80 - 84 Years    NaN  Mixed   Mixed
349  1958  75 Years & Over  NaN  Mixed   Mixed
350  1958  80 Years & Over  NaN  Female  Chinese
...
These can't be dropped or filled with mean/median/previous values.
I am looking for a cold-deck or other imputation technique that would let me fill pop based on the values in year, age_group, Gender and Ethnicity.
Please do provide any sample code or documentation that would help me.
It's hard to give a specific answer without knowing what you want to use the data for. But here are some questions you should ask:
How many null values are there?
If there are only a few, e.g. fewer than 20, and you have the time, then you can look at each one individually. In this case, you might want to look up census data online and make a guesstimate for each cell.
If there are more than can be individually assessed then we'll need to work some other magic.
Do you know how the other variables should relate to population?
Think about how other variables should relate to population. For example, if you know there are 500 males in one age cohort of a certain ethnicity but you don't know how many females, then 500 females would be a fair guess.
This would only cover some nulls, but is a logical assumption. You might be able to step through imputations of decreasing strength:
Find all single-sex null values where the population of the other sex in the same cohort is known, and assume a 50:50 gender ratio for the cohort.
Find all null values where both the older and the younger cohort are known, and impute this cohort's population linearly between them.
Something else...
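
The first two steps could be sketched roughly as below. This is only an illustration on made-up numbers: the `age_start` column (a numeric lower bound parsed from age_group) and the `pop_imputed` column are assumptions, not part of the question's data, and the "Mixed" rows are assumed to have been filtered out beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical single-sex rows; the population values are invented for the demo.
df = pd.DataFrame({
    "year": [1958] * 8,
    "age_start": [0, 5, 10, 15] * 2,  # assumed numeric version of age_group
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "Ethnicity": ["Indian"] * 8,
    "pop": [500.0, np.nan, np.nan, 200.0, np.nan, 410.0, np.nan, 210.0],
})

# Step 1: 50:50 assumption. Where one sex is missing, reuse the value known
# for the other sex in the same year / age / ethnicity cohort.
cohort = ["year", "age_start", "Ethnicity"]
cohort_mean = df.groupby(cohort)["pop"].transform("mean")  # NaNs are skipped
df["pop_imputed"] = df["pop"].fillna(cohort_mean)

# Step 2: within each year / gender / ethnicity, interpolate linearly between
# the neighbouring age cohorts. limit_area="inside" leaves gaps at the ends alone.
df = df.sort_values(["year", "Gender", "Ethnicity", "age_start"])
df["pop_imputed"] = (
    df.groupby(["year", "Gender", "Ethnicity"])["pop_imputed"]
    .transform(lambda s: s.interpolate(limit_area="inside"))
)
```

Keeping the result in a separate column makes it easy to see afterwards which values were observed and which were imputed, and by which step.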
This is a lot of work -- but again -- what do you want the data for? If you're looking for a graph it's probably not worth it. But if you're doing a larger study / trying to win a kaggle competition... then maybe it is?
What real world context do you have?
For example, if this data is for population in a certain country, then you might know the age distribution curve of that country? You could then impute values for ethnicities along the age distribution curve given where other age cohorts for the same ethnicity sit. This is brutally simplistic, but might be ok for your use case.
Do you need this column?
If there are lots of nulls, then whatever imputation you do will likely add a good degree of error. So what are you doing with the data? If you don't have to have the column, and there are a lot of nulls, then perhaps you're better without it.
Hope that helps -- good luck!

Pandas count unique value and change to percentage and put into the plotly bar chart

I have a pandas dataframe which looks like this:
Country
Japan
Japan
Korea
India
India
USA
USA
USA
I need to count how often each unique value occurs in the Country column, convert the counts to percentages, and use the countries as the x-axis and the percentages as the y-axis of a plotly bar chart. Can anyone show me how to do it?
Use value_counts:
df.Country.value_counts(normalize=True)
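
A sketch of how the normalized counts might be turned into a table that plotly can plot. The names `pct`, `percent` and `percent_bar` are illustrative, and `plotly.express` is assumed to be installed when the function is called.

```python
import pandas as pd

df = pd.DataFrame(
    {"Country": ["Japan", "Japan", "Korea", "India", "India", "USA", "USA", "USA"]}
)

# Share of rows per country, scaled to percent, as a two-column table.
pct = (
    df["Country"]
    .value_counts(normalize=True)
    .mul(100)
    .rename_axis("Country")
    .reset_index(name="percent")
)


def percent_bar(table):
    """Bar chart with countries on x and percentages on y (requires plotly)."""
    import plotly.express as px

    return px.bar(table, x="Country", y="percent")
```

`percent_bar(pct).show()` would then display the chart.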

How should I structure my time series dataframe in Python/Pandas?

I have a dataframe with multiple repeating series, over time.
I want to create visualizations that compare series over time, as well as with one another.
What is the best way to structure my data to accomplish this?
So far I have been trying to make smaller dataframes from this, either by dropping years or by selecting only one series, using a variety of indexes, lists or series calls that refer to the multiple years (.Series, .loc, .drop, etc.).
I always seem to encounter the same issues when making the actual graphs; usually relating to the years.
My best result has been making simple bar graphs with countries on the x axis and GDP from only 2018 on the Y axis.
I would like to eventually be able to have countries represented by color with 3D plotly graphs, wherein a series like GDP is Z (depth), Years are Y, and some other series like GNI could be X.
For now I am just aiming to make a scatterplot.
I am also fine with using matplotlib, seaborn, or whatever makes the most sense here.
Columns: [country, series, 1994, 1995, 1996, etc.]
Country  Series   1994         1995         1996  ...
USA      GDP      3.12         4.13
USA      Export%  25.5         32
USA      GNI      867,123,111  989,666,123
UK       GDP      2.87         etc.
UK       Export%  43.1
UK       GNI      981,125,555
China    GDP      5.98
China    Export%  NaN
China    GNI      787,123,447
...
df1 = df.loc[df['series']== 'GDP']
time = df1['1994':'1996']
gdp_time = px.scatter(df1, x = time, y= 'series', color="country")
gdp_time.show()
#Desired Graph
gdp_time = px.scatter(df1, x = years, y= GDP, color= Countries)
gdp_time.show()
I find it hard to believe that I can't simply create a series that references the multiple year columns as a single 'time'.
What am I missing?
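
One possible restructuring, sketched under assumptions: melt the year columns into a single `year` column, then spread the series back out so that each series (GDP, Export%, ...) becomes its own column. The 1995 values for the UK are placeholders invented for the demo, and `wide_df`, `long_df`, `tidy` and `gdp_scatter` are illustrative names.

```python
import pandas as pd

# Small stand-in for the wide frame in the question (UK 1995 values are made up).
wide_df = pd.DataFrame({
    "country": ["USA", "USA", "UK", "UK"],
    "series": ["GDP", "Export%", "GDP", "Export%"],
    "1994": [3.12, 25.5, 2.87, 43.1],
    "1995": [4.13, 32.0, 3.00, 44.0],
})

# Long format: one row per (country, series, year).
long_df = wide_df.melt(id_vars=["country", "series"], var_name="year", value_name="value")
long_df["year"] = long_df["year"].astype(int)

# Tidy format: one row per (country, year), one column per series.
tidy = (
    long_df.set_index(["country", "year", "series"])["value"]
    .unstack("series")
    .reset_index()
)
tidy.columns.name = None


def gdp_scatter(table):
    """GDP over time, coloured by country (requires plotly)."""
    import plotly.express as px

    return px.scatter(table, x="year", y="GDP", color="country")
```

With the data in this shape, `year` is an ordinary column, so it can be passed as x, y or z to plotly, matplotlib or seaborn like any other variable.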

Making pie chart with xlsxwriter

I have a big Excel file and I want to make a pie chart from two of its columns, Name and State. If I want to use State as the categories and the Names as the values, how would I go about doing that using the xlsxwriter charts?
my two columns look like this:
   Name    State
0  Jeff    MN
1  Jeff    MN
2  Jack    MI
3  Jill    TX
4  Parker  TX
5  Kalic   AZ
6  Kalic   AZ
7  Kalic   AZ
8  Kalic   TX
I have gotten it to work, but instead of returning a pie chart with just one category for MN or AZ it returns multiple categories. I want it to take the unique State names and group everything under each unique State. I don't want a separate slice in my pie chart for each entry.
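
A sketch of one way to get a single slice per State: aggregate with pandas first, write the small summary table to the workbook, and point the pie chart at that table rather than at the raw rows. This assumes the value wanted is the number of rows per State; the sheet name "Summary", the file name and the function name `write_pie` are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jeff", "Jeff", "Jack", "Jill", "Parker", "Kalic", "Kalic", "Kalic", "Kalic"],
    "State": ["MN", "MN", "MI", "TX", "TX", "AZ", "AZ", "AZ", "TX"],
})

# One row per unique State with the number of entries in that State.
counts = df.groupby("State")["Name"].count().reset_index(name="Count")


def write_pie(table, path="states.xlsx"):
    """Write `table` to an Excel sheet and add a pie chart (requires xlsxwriter)."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        table.to_excel(writer, sheet_name="Summary", index=False)
        workbook = writer.book
        worksheet = writer.sheets["Summary"]

        last_row = len(table)  # row 0 holds the header, data fills rows 1..last_row
        chart = workbook.add_chart({"type": "pie"})
        chart.add_series({
            "name": "Entries per state",
            # [sheet, first_row, first_col, last_row, last_col]
            "categories": ["Summary", 1, 0, last_row, 0],
            "values": ["Summary", 1, 1, last_row, 1],
        })
        worksheet.insert_chart("D2", chart)
```

If the number of distinct names per State is wanted instead, `nunique()` could replace `count()` in the aggregation.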

Seaborn lmplot - Changing Marker Style and Color of single Datapoint

I was trying to answer Harvard's CS109, Homework 1, Part 1c from 2013 using seaborn, which the course itself does not use.
"Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the on the plot. Show this plot across multiple years. In which years can you detect a competitive advantage from the Oakland baseball team of using data science? When did this end?"
So we have salaries as well as wins for multiple years and multiple teams. I want to build a seaborn facet for each year, regressing salaries against wins, AND call out the data point for Oakland. Building the facets with one regression per year works fine. But how would I change the data point for Oakland?
This is what my data looks like (the first 5 entries):
   teamID  yearID  salary    W
0  ANA     1997    31135472  84
1  ANA     1998    41281000  85
2  ANA     1999    55388166  70
3  ANA     2000    51464167  82
4  ANA     2001    47535167  75
...
This is how I am plotting the data:
facetplots = sns.lmplot(x="salary", y="W", col="yearID", data=df_data, col_wrap=4, size=3)
Any help would be much appreciated.
Regards
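
One way this might be done is to draw the facets as usual and then overlay the Oakland point on each facet's axes. The sketch below assumes Oakland's teamID is "OAK" and uses invented rows for every team except ANA; it relies on `FacetGrid.axes_dict` (available in seaborn 0.11 and later) and passes `height` where the question has `size`, since newer seaborn releases renamed that parameter.

```python
import pandas as pd

# Toy data: only the ANA rows come from the question, the rest are made up.
df_data = pd.DataFrame({
    "teamID": ["ANA", "BOS", "OAK", "ANA", "BOS", "OAK"],
    "yearID": [1997, 1997, 1997, 1998, 1998, 1998],
    "salary": [31135472, 40000000, 20000000, 41281000, 50000000, 22000000],
    "W": [84, 78, 65, 85, 92, 74],
})

# Rows that will be highlighted; "OAK" is an assumed identifier.
team_rows = df_data[df_data["teamID"] == "OAK"]


def lmplot_with_team(df, team="OAK"):
    """One regression facet per year, with `team` drawn as a red star (requires seaborn)."""
    import seaborn as sns

    g = sns.lmplot(x="salary", y="W", col="yearID", data=df, col_wrap=4, height=3)
    for year, ax in g.axes_dict.items():
        rows = df[(df["teamID"] == team) & (df["yearID"] == year)]
        # Overlay the team's point on top of the regular scatter.
        ax.scatter(rows["salary"], rows["W"], color="red", marker="*", s=150, zorder=3)
        for _, row in rows.iterrows():
            ax.annotate(team, (row["salary"], row["W"]),
                        xytext=(5, 5), textcoords="offset points")
    return g
```

The overlay is looked up in the original dataframe rather than in the grid's own data, so the teamID column is available regardless of which columns lmplot keeps internally.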
