Why does pandas not recognize X values in a chart as distinct? - python

I have a DataFrame with the following Columns:
countriesAndTerritories = Name of the Countries (Only contains Portugal and Spain)
cases = Number of Covid Cases
This is what the DataFrame looks like:
We have 1 row per "dateRep".
I tried to create a BarChart with the following code:
df.plot.bar(x="countriesAndTerritories", y="cases", rot=70,
            title="Number of Covid cases per Country")
The result is the following:
As you can see, instead of having the total number of cases per country (Portugal and Spain), I have multiple values on the X axis.
I've tried to investigate a little, but the examples I found used an inline df. So, if someone can help me, I would appreciate it.
PS: I'm used to QlikSense, and what I'm trying to achieve would be something along these lines:
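A likely explanation is that df.plot.bar draws one bar per row, so with one row per "dateRep" each country name repeats on the X axis. A minimal sketch of aggregating first, assuming the column names from the question (the dates and case counts below are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import pandas as pd

# Made-up sample with one row per dateRep, as described in the question
df = pd.DataFrame({
    "dateRep": ["01/03/2020", "02/03/2020", "01/03/2020", "02/03/2020"],
    "countriesAndTerritories": ["Portugal", "Portugal", "Spain", "Spain"],
    "cases": [2, 3, 10, 20],
})

# Sum the cases per country first; the result has one row per country
totals = df.groupby("countriesAndTerritories")["cases"].sum()

# Plot the aggregated Series: one bar per country
ax = totals.plot.bar(rot=70, title="Number of Covid cases per Country")
```

Unlike QlikSense, pandas does not aggregate by the X dimension automatically, so the groupby step has to be explicit.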

Related

Pandas histogram bins

Coming from other reporting layers I have to apologise for what is probably a stupid question. I have a dataframe that looks like so:
I am attempting to create a histogram for Total Household Income. The data type for the column is int and all rows contain a value. The dataframe has about 40000 rows and the range of the values is roughly what can be seen in the screenshot.
I then run this command:
df.hist(column='Total Household Income', bins=10)
The result is this:
The bins are weird though. I then created a test data frame with just the values that can be seen in the screenshot above. It has the same data type.
my_dict = {'Total Household Income' :[480332, 189235, 82785, 107589, 189322]}
df2 = pd.DataFrame(my_dict)
If I then run df2.hist(column='Total Household Income', bins=10) the result looks a lot better.
Can anyone tell me what I am doing wrong?
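One possible cause, offered as an assumption because the full data is not shown: bins=10 splits the range from the minimum to the maximum into ten equal-width bins, so a few very large incomes among the 40000 rows would push almost every row into the first bin. A small sketch with made-up numbers that reproduces the effect and limits the range to the 99th percentile:

```python
import numpy as np
import pandas as pd

# Made-up incomes: 1000 ordinary values plus one extreme outlier
incomes = np.concatenate([np.linspace(50_000, 500_000, 1000), [11_000_000]])
df = pd.DataFrame({"Total Household Income": incomes.astype(int)})
col = df["Total Household Income"]

# Ten equal-width bins over the full range: the outlier stretches the range,
# so every ordinary value lands in the first bin
counts_all, edges_all = np.histogram(col, bins=10)

# Restrict the binned range to the 99th percentile; values above it are ignored
upper = col.quantile(0.99)
counts_trim, edges_trim = np.histogram(col, bins=10, range=(col.min(), upper))

# The same idea should carry over to the plot, since extra keyword arguments
# of df.hist are forwarded to matplotlib:
# df.hist(column="Total Household Income", bins=10, range=(col.min(), upper))
```

Checking df['Total Household Income'].describe() would confirm or rule out this explanation for the real data.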

How do I create new pandas dataframe by grouping multiple variables?

I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that adds the individual player salaries for a given team for a given year, organized by team and by year. Using the following technique I have come up with this: team_salaries_groupby_team = salaries.groupby(['teamID','yearID']).agg({'salary' : ['sum']}), which outputs this: my output. On screen it looks sort of like what I want, but I want a dataframe with three columns (plus an index on the left). I can't really do the sort of analysis I want to do with this output.
Lastly, I have also tried this method:
new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
salaries
teamyear = salaries.groupby(['teamyear']).agg({'salary' : ['sum']})
print(teamyear)
This gives another output. It adds the individual player salaries per team for a given year, but now I don't know how to separate the year and put it into its own column. Help please?
You just need to reset_index()
Here is sample code :
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
rows = [
    (1985, 'ATL', 'NL', 'A', 10000),
    (1985, 'ATL', 'NL', 'B', 20000),
    (1985, 'ATL', 'NL', 'A', 10000),
    (1985, 'ATL', 'NL', 'C', 5000),
    (1985, 'ATL', 'NL', 'B', 20000),
    (2016, 'ATL', 'NL', 'A', 100000),
    (2016, 'ATL', 'NL', 'B', 200000),
    (2016, 'ATL', 'NL', 'C', 50000),
    (2016, 'ATL', 'NL', 'A', 100000),
    (2016, 'ATL', 'NL', 'B', 200000),
]
salaries = pd.DataFrame(rows, columns=['yearID', 'teamID', 'igID', 'playerID', 'salary'])
After that, group by team and year, sum the salaries, and call reset_index:
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for?
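As an alternative sketch, passing as_index=False to groupby keeps the grouping keys as ordinary columns, so no reset_index call is needed. The data below is a made-up subset, and the total_salary column name is only illustrative:

```python
import pandas as pd

# Made-up subset of the salary data
salaries = pd.DataFrame({
    "yearID": [1985, 1985, 2016, 2016],
    "teamID": ["ATL", "ATL", "ATL", "BOS"],
    "salary": [10000, 20000, 100000, 50000],
})

# as_index=False returns teamID and yearID as regular columns
team_salaries = salaries.groupby(["teamID", "yearID"], as_index=False)["salary"].sum()

# Named aggregation gives a flat column name instead of the two-level
# ('salary', 'sum') header produced by agg({'salary': ['sum']})
named = salaries.groupby(["teamID", "yearID"], as_index=False).agg(
    total_salary=("salary", "sum")
)
```

Either result is a plain three-column dataframe that can be filtered, sorted, or merged like any other.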

How can I show the labels of the lines in a groupby data plot in pandas, python

I am trying to use a for loop to plot the total order amount for each hour in each city from the total data.
But I only get one plot and I don't know which line belongs to which city in that plot. How can I label those lines in my code?
If possible, I also want to know how I can have one plot for each city instead of having multiple lines in one plot.
Your advice will be much appreciated!
Here is my code:
city_grp = all_data.groupby('city')  # to get the list of the cities
for cty in all_data['city'].unique():
    cgroup = city_grp.get_group(cty)  # to get the df of each city group
    h_grp = cgroup.groupby('Hour')  # to group the df by hours
    hs = [h for h, df in h_grp['Quantity Ordered']]
    plt.xticks(hs)
    plt.xlabel('{} Hour in a day'.format(cty))
    plt.ylabel('Quantity Ordered')
    plt.plot(hs, h_grp['Quantity Ordered'].sum())
Here is the plot that I got
Here's a solution (with fake data). Basically, you should use the legend method.
cities = ["New York", "Los Angeles"]
for city in cities:
    p = plt.plot(df.index, df[city])
plt.legend(cities)
The result is:
You can try something like this.
all_data.groupby(["city", "Hour"])["Quantity Ordered"].sum().unstack(level="city").plot()
You may need one more groupby().
See example here https://cmdlinetips.com/2020/05/fun-with-pandas-groupby-aggregate-multi-index-and-unstack/
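Both parts of the question can be sketched together: pass label= to each plot call and then call legend() to name the lines, and use subplots=True on a wide table to get one panel per city. This assumes the column names from the question; the cities and quantities are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up order data with the columns used in the question
all_data = pd.DataFrame({
    "city": ["Boston", "Boston", "Boston", "Austin", "Austin", "Austin"],
    "Hour": [9, 10, 10, 9, 9, 10],
    "Quantity Ordered": [1, 2, 3, 4, 5, 6],
})

# One figure, one labelled line per city
fig, ax = plt.subplots()
for cty, cgroup in all_data.groupby("city"):
    hourly = cgroup.groupby("Hour")["Quantity Ordered"].sum()
    ax.plot(hourly.index, hourly.values, label=cty)  # label feeds the legend
ax.set_xlabel("Hour of day")
ax.set_ylabel("Quantity Ordered")
ax.legend()

# One panel per city: hours as the index, one column per city
wide = all_data.pivot_table(index="Hour", columns="city",
                            values="Quantity Ordered", aggfunc="sum")
axes = wide.plot(subplots=True)
```

Iterating over the groupby object directly yields each city name together with its sub-frame, which avoids the separate unique() and get_group() calls.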

How to set parameters for new column in pandas dataframe OR for a value count on python?

I'm using some data from Kaggle about blue plaques in Europe. Many of these plaques describe famous people, but others describe places or events or animals. The dataframe includes the years of both birth and death for those famous people, and I have added a new column that displays the age of the lead subject at their time of death with the following code:
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
This works for some of the dataset, but since some of the subjects don't have values for the columns 'lead_subject_died_in' and 'lead_subject_born_in', some of my results are funky.
I was trying to determine the most common age of death with this:
agecount = plaques['subject_age'].value_counts()
print(agecount)
--and I got some crazy stuff-- negative numbers, 600+, etc.-- how do I make it so that it only counts the values for people who actually have data in both of those columns?
By the way, I'm a beginner, so if the operations you suggest are very difficult, please explain what they're doing so that I can learn and use it in the future!
You can use the dropna function to remove the rows that have NaN values in certain columns:
# drop rows where either of these two columns is NaN
plaques = plaques.dropna(subset=['lead_subject_died_in', 'lead_subject_born_in'])
plaques['subject_age'] = plaques['lead_subject_died_in'] - plaques['lead_subject_born_in']
# get the most frequent age
plaques['subject_age'].value_counts().idxmax()
# get the five most common ages
plaques['subject_age'].value_counts().head()
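Dropping NaN rows may not remove the negative or 600+ ages, since those come from rows where both years are present but implausible. A hedged sketch that also keeps only ages inside a range; the 0 to 120 bounds are an assumption for illustration, and the sample years are made up:

```python
import numpy as np
import pandas as pd

# Made-up sample: some missing years and some implausible ones
plaques = pd.DataFrame({
    "lead_subject_born_in": [1800, 1850, np.nan, 1900, 1200, 1950],
    "lead_subject_died_in": [1870, 1920, 1900, np.nan, 1880, 1940],
})

# Subtraction yields NaN wherever either year is missing
age = plaques["lead_subject_died_in"] - plaques["lead_subject_born_in"]

# between() is False for NaN, so this drops missing and out-of-range ages at once
# (the 0-120 limits are an assumed definition of a plausible human age)
plausible = age[age.between(0, 120)]

# Most common age among the remaining rows
most_common = plausible.value_counts().idxmax()
```

Here age.between(0, 120) builds a True/False mask, and indexing with that mask keeps only the rows marked True.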

Factorplot with multiindex dataframe

This is the dataframe I am working with:
(Only the first two years don't have data for country 69; I will fix this.) nkill is the number of people killed in that year, summed from the original long-form dataframe.
I am trying to do something similar to this plot:
However, with the country code as a hue. I know there are similar posts but none have helped me solve this, thank you in advance.
By hue I mean the seaborn parameter of that name, as pictured in this third picture. In this example, hue draws a separate bar for every value in that column. So if I had two country codes in the country column, it would plot two bars (one for each country) side by side for every year.
Just looking at the data it should be possible to directly use the hue argument.
But first you would need to create actual columns from the dataframe
df.reset_index(inplace=True)
Then something like
sns.barplot(x = "year", y="nkill", hue="country", data=df)
should give you the desired plot.
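To illustrate what reset_index does here, the sketch below assumes the index levels are named year and country, as the barplot call implies; the values and country code 217 are made up. It also shows a pandas-only route to the same grouped bars, which is a swapped-in technique and not the seaborn call above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import pandas as pd

# Made-up frame with a (year, country) MultiIndex and an nkill column
index = pd.MultiIndex.from_product([[1970, 1971], [69, 217]],
                                   names=["year", "country"])
df = pd.DataFrame({"nkill": [0, 5, 2, 7]}, index=index)

# reset_index turns both index levels into ordinary columns,
# which is the long format that hue= expects
flat = df.reset_index()

# pandas-only alternative: one column per country, then grouped bars per year
wide = df["nkill"].unstack("country")
ax = wide.plot.bar()
```

The flat frame can be passed to sns.barplot as data, with the column names used for x, y, and hue.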
