Pandas histogram bins - python

Coming from other reporting layers I have to apologise for what is probably a stupid question. I have a dataframe that looks like so:
I am attempting to create a histogram for Total Household Income. The column's data type is int and every row contains a value. The dataframe has about 40000 rows, and the range of the values is roughly what can be seen in the screenshot.
I then run this command:
df.hist(column='Total Household Income', bins=10)
The result is this:
The bins are weird though. I then created a test data frame with just the values that can be seen in the screenshot above. It has the same data type.
my_dict = {'Total Household Income' :[480332, 189235, 82785, 107589, 189322]}
df2 = pd.DataFrame(my_dict)
If I then run df2.hist(column='Total Household Income', bins=10) the result looks a lot better.
Can anyone tell me what I am doing wrong?
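One possible explanation, which the post itself does not confirm, is that a few very large incomes stretch the range that the ten equal-width bins have to cover, so nearly every row falls into the first bin. The sketch below uses synthetic incomes (the lognormal parameters and the 99th-percentile cutoff are assumptions for illustration) and `numpy.histogram` to show the effect without plotting.

```python
import numpy as np

# Synthetic, right-skewed "incomes"; the distribution is an assumption,
# not the asker's real data.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=12.0, sigma=1.0, size=40_000).astype(int)

# Ten equal-width bins over the full range: the largest values set the
# bin width, so most rows land in the first bin.
full_counts, full_edges = np.histogram(income, bins=10)

# Same number of bins, but restricted to values up to the 99th percentile.
upper = np.percentile(income, 99)
clipped_counts, clipped_edges = np.histogram(
    income, bins=10, range=(income.min(), upper)
)

print("share in first bin, full range:", full_counts[0] / income.size)
print("share in first bin, clipped   :", clipped_counts[0] / income.size)
```

If this is the cause, passing similar bounds to the plotting call, for example `df.hist(column='Total Household Income', bins=10, range=(lo, hi))` with `lo` and `hi` chosen from the data, should give more informative bins.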

Related

How do I create new pandas dataframe by grouping multiple variables?

I am having tremendous difficulty getting my data sorted. I'm at the point where I could have manually created a new .csv file in the time I have spent trying to figure this out, but I need to do this through code. I have a large dataset of baseball salaries by player going back 150 years.
This is what my dataset looks like.
I want to create a new dataframe that adds the individual player salaries for a given team for a given year, organized by team and by year. Using the following technique I have come up with this:
team_salaries_groupby_team = salaries.groupby(['teamID','yearID']).agg({'salary' : ['sum']})
which outputs this: my output. On screen it looks sort of like what I want, but I want a dataframe with three columns (plus an index on the left). I can't really do the sort of analysis I want to do with this output.
Lastly, I have also tried this method:
new_column = salaries['teamID'] + salaries['yearID'].astype(str)
salaries['teamyear'] = new_column
salaries
teamyear = salaries.groupby(['teamyear']).agg({'salary' : ['sum']})
print(teamyear)
Another output. It adds the individual player salaries per team for a given year, but now I don't know how to separate the year and put it into its own column. Help please?
You just need to call reset_index().
Here is some sample code:
import pandas as pd
# Build all rows at once; DataFrame.append was removed in pandas 2.0.
# Each tuple is (yearID, teamID, igID, playerID, salary).
rows = [
    (1985, 'ATL', 'NL', 'A', 10000),
    (1985, 'ATL', 'NL', 'B', 20000),
    (1985, 'ATL', 'NL', 'A', 10000),
    (1985, 'ATL', 'NL', 'C', 5000),
    (1985, 'ATL', 'NL', 'B', 20000),
    (2016, 'ATL', 'NL', 'A', 100000),
    (2016, 'ATL', 'NL', 'B', 200000),
    (2016, 'ATL', 'NL', 'C', 50000),
    (2016, 'ATL', 'NL', 'A', 100000),
    (2016, 'ATL', 'NL', 'B', 200000),
]
salaries = pd.DataFrame(rows, columns=['yearID','teamID','igID','playerID','salary'])
After that, group by team and year, sum the salaries, and reset the index:
sample_df = salaries.groupby(['teamID', 'yearID']).salary.sum().reset_index()
Is this what you are looking for?
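To illustrate why the `.agg({'salary': ['sum']})` version from the question looked awkward: passing a list of functions yields two-level column labels, whereas summing the single column and resetting the index gives three flat columns. A small sketch with made-up rows (the values are illustrative only):

```python
import pandas as pd

# Illustrative rows only; not the asker's real dataset.
salaries = pd.DataFrame({
    "yearID": [1985, 1985, 2016, 2016],
    "teamID": ["ATL", "ATL", "ATL", "ATL"],
    "salary": [10000, 20000, 100000, 200000],
})

# A list of aggregations produces MultiIndex columns: ('salary', 'sum').
nested = salaries.groupby(["teamID", "yearID"]).agg({"salary": ["sum"]})
print(nested.columns.tolist())

# Summing one column and resetting the index gives flat columns.
flat = salaries.groupby(["teamID", "yearID"])["salary"].sum().reset_index()
print(flat.columns.tolist())

# as_index=False is another way to keep the group keys as columns.
flat2 = salaries.groupby(["teamID", "yearID"], as_index=False)["salary"].sum()
print(flat2)
```

Either flat form can be filtered, merged, or plotted like any ordinary three-column dataframe.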

Why Pandas does not recognize X values in a Chart as Distinct?

I have a DataFrame with the following Columns:
countriesAndTerritories = Name of the Countries (Only contains Portugal and Spain)
cases = Number of Covid Cases
This is what the DataFrame looks like:
We have 1 row per "dateRep".
I tried to create a bar chart with the following code:
df.plot.bar(x="countriesAndTerritories", y="cases", rot=70,
title="Number of Covid cases per Country")
The result is the following:
As you can see, instead of having the total number of cases per country (Portugal and Spain), I have multiple values on the X axis.
I've tried to investigate a little, but the examples I found all used an inline df. So if someone can help me, I'd appreciate it.
PS: I'm used to QlikSense, and what I'm trying to achieve would be something along these lines:
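As a possible approach to this question: `plot.bar` draws one bar per row and does no aggregation, so each daily row gets its own bar. Summing the cases per country first leaves one row per country. The rows below are made up to stand in for the real data.

```python
import pandas as pd

# Made-up rows standing in for the real data (one row per dateRep).
df = pd.DataFrame({
    "dateRep": ["01/03/2020", "02/03/2020", "01/03/2020", "02/03/2020"],
    "countriesAndTerritories": ["Portugal", "Portugal", "Spain", "Spain"],
    "cases": [2, 5, 30, 45],
})

# Collapse the daily rows to one total per country.
totals = df.groupby("countriesAndTerritories", as_index=False)["cases"].sum()
print(totals)

# The aggregated frame has one row per country, so the original plotting
# call would then draw one bar each (requires matplotlib):
# totals.plot.bar(x="countriesAndTerritories", y="cases", rot=70,
#                 title="Number of Covid cases per Country")
```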

Python Pandas - Sample certain number of individuals from binned data

Here is a dummy example of the DF I'm working with. It effectively comprises binned data, where the first column gives a category and the second column the number of individuals in that category.
df = pd.DataFrame(data={'Category':['A','B','C','D','E','F','G','H','I'],
'Count':[1000,200,850,350,4000,20,35,4585,2],})
I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:
sample1 = pd.DataFrame(data={'Category':['A','B','C','D','E','F','G','H','I'],
'Count':[15,2,4,4,35,0,15,25,0],})
I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).
How could I go about doing this? I feel like there must be a simple answer but I can't think of it!
Thank you in advance!
You can try sample with replacement:
df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()
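Note that sampling with replacement can, in principle, draw a small category more often than its Count allows, and categories that are never drawn are missing from the grouped result rather than shown as 0. One alternative sketch, assuming NumPy 1.18 or later, is to draw from a multivariate hypergeometric distribution, which models picking individuals without replacement from the pooled population. The seed is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                        'Count': [1000,200,850,350,4000,20,35,4585,2]})

rng = np.random.default_rng(42)  # seed chosen arbitrarily for repeatability

# Draw 100 individuals without replacement from the pooled population;
# each entry of the result is how many came from that category.
drawn = rng.multivariate_hypergeometric(df['Count'].to_numpy(), nsample=100)

sample = pd.DataFrame({'Category': df['Category'], 'Count': drawn})
print(sample)
```

Every category appears in the result, possibly with a count of 0, and no count can exceed the number of individuals available in that category.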

Factorplot with multiindex dataframe

This is the dataframe I am working with:
(Only the first two years don't have data for country 69; I will fix this.) nkill is the number of people killed in that year, summed from the original long-form dataframe.
I am trying to do something similar to this plot:
However, with the country code as the hue. I know there are similar posts, but none have helped me solve this. Thank you in advance.
By hue I mean the seaborn usage, as pictured in this third picture. In that example, hue creates a separate bar for every distinct value in that column. So if I had two country codes in the country column, it would plot two bars (one for each country) side by side for every year.
Just looking at the data, it should be possible to use the hue argument directly.
But first you would need to turn the index levels into actual columns:
df.reset_index(inplace=True)
Then something like
sns.barplot(x = "year", y="nkill", hue="country", data=df)
should give you the desired plot.
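To make the reset_index step concrete, here is a minimal pandas-only sketch. The numbers are hypothetical, and the index level names year and country are assumed from the plotting call above.

```python
import pandas as pd

# Hypothetical numbers; only the (year, country) index layout mirrors the post.
index = pd.MultiIndex.from_product([[1970, 1971], [69, 95]],
                                   names=["year", "country"])
df = pd.DataFrame({"nkill": [0, 3, 1, 7]}, index=index)

# Before: 'year' and 'country' are index levels, not columns.
print(df.columns.tolist())

# After: both levels become ordinary columns that hue= can refer to.
long_df = df.reset_index()
print(long_df.columns.tolist())
```

A frame shaped like `long_df` can then be passed as `data=` to the `sns.barplot` call.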

Pandas: Calculate mean, var of similar columns grouped together

I am trying to do an analysis of network trace data using pandas. I have read the dump file and created the following DataFrame:
So to detect the individual flows in the DataFrame data2, I have grouped the entire DataFrame according to ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service'] using the following piece of code:
flow = ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service']
grp1 = data2.groupby(flow, sort=False)
So when I do grp1.size() of the first twenty rows of data2, I get the following information:
What I would like to do now is calculate the mean and variance of ip_len and packet_len, and the mean of the inter-packet arrival times (using the timestamps of packets belonging to the same flow).
How can I accomplish this in pandas so that the resulting dataframe contains the statistics of each flow, i.e. the columns should be ip_src, ip_dst, sport, dport, ip_proto, service, and the mean and var values described above? I have tried both the agg and apply methods, but haven't been able to do it. Thanks in advance!
data2.groupby(['colName1','colName2']).mean()
should do the job.
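The one-liner above gives only the mean of each numeric column. For the mean and variance of specific columns plus a mean inter-arrival time per flow, named aggregation is one option. This is a sketch under assumptions: the column name timestamp (in seconds), the helper mean_interarrival, and all sample values are invented for illustration.

```python
import pandas as pd

# Tiny invented trace; 'timestamp' (seconds) is an assumed column name.
data2 = pd.DataFrame({
    "ip_src":     ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2,
    "ip_dst":     ["10.0.0.9"] * 5,
    "sport":      [1234, 1234, 1234, 5555, 5555],
    "dport":      [80] * 5,
    "ip_proto":   [6] * 5,
    "service":    ["http"] * 5,
    "ip_len":     [40, 60, 80, 100, 140],
    "packet_len": [54, 74, 94, 114, 154],
    "timestamp":  [0.0, 0.5, 1.5, 2.0, 2.4],
})

flow = ["ip_src", "ip_dst", "sport", "dport", "ip_proto", "service"]

def mean_interarrival(ts):
    # Mean gap between consecutive packets of one flow (NaN for a single packet).
    return ts.sort_values().diff().mean()

stats = (
    data2.groupby(flow, sort=False)
         .agg(ip_len_mean=("ip_len", "mean"),
              ip_len_var=("ip_len", "var"),
              packet_len_mean=("packet_len", "mean"),
              packet_len_var=("packet_len", "var"),
              iat_mean=("timestamp", mean_interarrival))
         .reset_index()  # turn the flow keys back into ordinary columns
)
print(stats)
```

Here var is the sample variance (pandas' default, ddof=1), and each row of `stats` describes one flow.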
