I have a big Excel file and I want to make a pie chart from two of its columns, Name and State. If I want to use State as the categories and the Names as the values, how would I go about doing that with xlsxwriter charts?
My two columns look like this:
Name State
0 Jeff MN
1 Jeff MN
2 Jack MI
3 Jill TX
4 Parker TX
5 Kalic AZ
6 Kalic AZ
7 Kalic AZ
8 Kalic TX
I have gotten it to work, but instead of returning a pie chart with just one category for MN or AZ, it returns multiple categories. I want it to take the unique State names and group everything under each unique State. I don't want a separate slice in my pie chart for each entry.
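One possible approach is to aggregate in pandas first, write the small summary table to the workbook, and point the pie chart at that summary instead of the raw rows. The sketch below assumes each row should count once toward its State; the file name state_pie.xlsx and the sheet name Summary are made up for illustration.

```python
import pandas as pd
import xlsxwriter

# Sample rows from the question.
df = pd.DataFrame({
    "Name": ["Jeff", "Jeff", "Jack", "Jill", "Parker",
             "Kalic", "Kalic", "Kalic", "Kalic"],
    "State": ["MN", "MN", "MI", "TX", "TX", "AZ", "AZ", "AZ", "TX"],
})

# One row per unique State with the number of Name entries in it.
counts = df.groupby("State")["Name"].count().reset_index(name="Count")

workbook = xlsxwriter.Workbook("state_pie.xlsx")  # hypothetical output file
worksheet = workbook.add_worksheet("Summary")     # hypothetical sheet name

# Write the summary table: header in row 0, data from row 1 down.
worksheet.write_row(0, 0, ["State", "Count"])
worksheet.write_column(1, 0, counts["State"].tolist())
worksheet.write_column(1, 1, counts["Count"].tolist())

# The chart reads the aggregated cells, so each State gets one slice.
last_row = len(counts)
chart = workbook.add_chart({"type": "pie"})
chart.add_series({
    "name": "Names per State",
    "categories": ["Summary", 1, 0, last_row, 0],
    "values": ["Summary", 1, 1, last_row, 1],
})
worksheet.insert_chart("D2", chart)
workbook.close()
```

If each distinct name should count only once per State, swapping count() for nunique() in the groupby line should do it.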
Related
I have the following source dataframe
Person  Country  Is Rich?
0       US       Yes
1       India    No
2       India    Yes
3       US       Yes
4       US       Yes
5       India    No
6       US       No
7       India    No
I need to convert it to another dataframe, like the one below, so that the data is easy to access when plotting a bar graph.
Bar chart of economic status per country
Data frame to be created is like below.
Country  Rich  Poor
US       3     1
India    1     3
I am new to pandas and exploratory data science. Please help.
You can try pivot_table
df['Is Rich?'] = df['Is Rich?'].replace({'Yes': 'Rich', 'No': 'Poor'})
out = df.pivot_table(index='Country', columns='Is Rich?', values='Person', aggfunc='count')
print(out)
Is Rich? Poor Rich
Country
India 3 1
US 1 3
You could do:
converted = df.assign(Rich=df['Is Rich?'].eq('Yes')).eval('Poor = ~Rich').groupby('Country').agg({'Rich': 'sum', 'Poor': 'sum'})
print(converted)
Rich Poor
Country
India 1 3
US 3 1
However, if you want to plot it as a barplot, the following format might work best with a plotting library like seaborn:
plot_df = converted.reset_index().melt(id_vars='Country', value_name='No. of people', var_name='Status')
print(plot_df)
Country Status No. of people
0 India Rich 1
1 US Rich 3
2 India Poor 3
3 US Poor 1
Then, with seaborn:
import seaborn as sns
sns.barplot(x='Country', hue='Status', y='No. of people', data=plot_df)
Resulting plot:
I have the following dataset:
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 2 dropkick murphys f Germany
4 2 le tigre f Germany
.
.
289950 19718 bob dylan f Canada
289951 19718 pixies f Canada
289952 19718 the clash f Canada
I want to create a Boolean indicator matrix from this dataframe, with one row for each user and one column for each artist. For each row (user), the value should be 1 if the user has that artist and 0 otherwise.
Just to mention, there are 1004 unique artists and 15000 unique users—it’s a large data set.
I have created an empty matrix using the following:
pd.DataFrame(index=user, columns=artist)
I am having difficulty populating the dataframe correctly.
There is a method in pandas called notnull.
Supposing your dataframe is named df, you would use:
df['has_artist'] = df['artist'].notnull()
This adds a boolean column named has_artist to your dataframe.
If you want 0 and 1 instead, do:
df['has_artist'] = df['artist'].notnull().astype(int)
You can also store it in a different variable and not alter your dataframe.
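Note that notnull only flags whether each existing row has an artist value. For the user-by-artist matrix described in the question, one possible sketch uses pd.crosstab, which counts each (user, artist) pair; clipping at 1 turns the counts into a 0/1 indicator. The small frame below reuses the first five rows shown in the question.

```python
import pandas as pd

# First five rows from the question (sex and country omitted for brevity).
df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "artist": [
        "red hot chili peppers",
        "the black dahlia murder",
        "goldfrapp",
        "dropkick murphys",
        "le tigre",
    ],
})

# Rows are users, columns are artists, cells count how often the pair occurs.
# clip(upper=1) caps repeated pairs so every cell is 0 or 1.
indicator = pd.crosstab(df["user"], df["artist"]).clip(upper=1)
print(indicator)
```

With about 15000 users and 1004 artists the dense result has roughly 15 million cells, which should still fit in memory as integers.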
I have a dataframe like this:
Country Year Column1 Column2
1 Guatemala 1999 5 1
4 Mexico 2000 1 3
5 Mexico 2000 2 2
6 Mexico 2000 2 1
8 Guatemala 2000 3 2
11 Guatemala 2003 4 3
12 Guatemala 2003 6 4
13 Guatemala 2003 5 5
What I want to make is a boxplot for each group in Country, displaying one box for each unique value in Year. These boxes should represent the values in Column2.
I group the data and get boxplots like this:
df1 = df.groupby('Country').boxplot(column='Column2', subplots=True)
That gives me a boxplot for each Country, but with just one plot in it, representing all the values from that group, not separated by years. How can I get a box for each unique value in year, representing the values in Column2 in my code?
I would use the seaborn package, in particular combining the FacetGrid with boxplot.
For your situation, the code might look like this:
import seaborn as sns
g = sns.FacetGrid(df, col="Country", sharex=False)
g.map(sns.boxplot, 'Year', 'Column2')
Edit: this is what I get for your data above:
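If you would rather stay with pandas and matplotlib, a rough equivalent is to loop over the Country groups and call DataFrame.boxplot with by='Year' on each one. This is only a sketch built from the sample data in the question; the figure size and the Agg backend line are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "Country": ["Guatemala", "Mexico", "Mexico", "Mexico",
                    "Guatemala", "Guatemala", "Guatemala", "Guatemala"],
        "Year": [1999, 2000, 2000, 2000, 2000, 2003, 2003, 2003],
        "Column1": [5, 1, 2, 2, 3, 4, 6, 5],
        "Column2": [1, 3, 2, 1, 2, 3, 4, 5],
    },
    index=[1, 4, 5, 6, 8, 11, 12, 13],
)

groups = list(df.groupby("Country"))
fig, axes = plt.subplots(1, len(groups), figsize=(10, 4), squeeze=False)

# One subplot per country, one box per unique Year within that country.
for ax, (country, sub) in zip(axes[0], groups):
    sub.boxplot(column="Column2", by="Year", ax=ax)
    ax.set_title(country)

fig.suptitle("")  # remove the automatic "Boxplot grouped by Year" title

# Number of boxes each subplot should contain.
boxes_per_country = {country: sub["Year"].nunique() for country, sub in groups}
print(boxes_per_country)
```

Call plt.show() or fig.savefig(...) afterwards to see the figure.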
I have a dataframe with two keys. I'm looking to do a stacked bar plot of the number of items within key2 (meaning taking the count values from a fully populated column of data).
A small portion of the dataframe I have is:
Sector industry
Basic Industries Agricultural Chemicals 17
Aluminum 3
Containers/Packaging 1
Electric Utilities: Central 2
Engineering & Construction 12
Name: Symbol, dtype: int64
Key1 is Sector, Key2 is industry. I want the values in Symbol (the counted column) to be represented as per-industry stacks in a single bar for Basic Industries.
I know if I do a df.reset_index I'll have a column with (non-unique) Sectors and Industries with an integer counter. Is there a way to simply assign the column 1,2,3 data to pandas plot or matplotlib to make a stacked bar chart?
Alternatively, is there a way to easily specify using both keys in the aforementioned dataframe?
I'm looking for both guidance on approach from more experienced people as well as help with the actual syntax.
I just added a new Sector to improve the example.
Symbol
Sector industry
Basic Industries Agricultural Chemicals 17
Aluminum 3
Containers/Packaging 1
Electric Utilities: Central 2
Engineering & Construction 22
Basic Industries2 Agricultural Chemicals 7
Aluminum 8
Containers/Packaging 11
Electric Utilities: Central 7
Engineering & Construction 4
Assuming your dataframe is indexed by ["Sector", "industry"], you first need to call reset_index, then pivot the dataframe, and finally make the stacked plot.
df.reset_index().pivot_table(index="industry", columns="Sector", values="Symbol").T.plot(kind='bar', stacked=True, figsize=(14, 6))
Another way, instead of reset_index, you can use this:
df.unstack().Symbol.plot(kind='bar', stacked=True)
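To make the reshaping step easier to follow, here is a self-contained sketch that rebuilds the two-sector example from the question and prints the wide table before plotting it. It assumes the counts live in a column named Symbol under a ["Sector", "industry"] MultiIndex, as in the question; the variable name wide is mine.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

industries = [
    "Agricultural Chemicals",
    "Aluminum",
    "Containers/Packaging",
    "Electric Utilities: Central",
    "Engineering & Construction",
]
index = pd.MultiIndex.from_product(
    [["Basic Industries", "Basic Industries2"], industries],
    names=["Sector", "industry"],
)
# Counts copied from the updated example in the question.
df = pd.DataFrame({"Symbol": [17, 3, 1, 2, 22, 7, 8, 11, 7, 4]}, index=index)

# Move the inner index level to the columns: one row per Sector,
# one column per industry.
wide = df["Symbol"].unstack("industry")
print(wide)

# Each row becomes one bar; each column becomes one stacked segment.
ax = wide.plot(kind="bar", stacked=True, figsize=(14, 6))
```

Each Sector ends up as one bar whose segments are the industries, which is the same layout the two one-liners above produce.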
I am trying to get a daily status count from the following DataFrame (it's a subset, the real data set is ~14k jobs with overlapping dates, only one status at any given time within a job):
Job Status User
Date / Time
1/24/2011 10:58:04 1 A Ted
1/24/2011 10:59:20 1 C Bill
2/11/2011 6:53:14 1 A Ted
2/11/2011 6:53:23 1 B Max
2/15/2011 9:43:13 1 C Bill
2/21/2011 15:24:42 1 F Jim
3/2/2011 15:55:22 1 G Phil Jr.
3/4/2011 14:57:45 1 H Ted
3/7/2011 14:11:02 1 I Jim
3/9/2011 9:57:34 1 J Tim
8/18/2014 11:59:35 2 A Ted
8/18/2014 13:56:21 2 F Bill
5/21/2015 9:30:30 2 G Jim
6/5/2015 13:17:54 2 H Jim
6/5/2015 14:40:38 2 I Ted
6/9/2015 10:39:15 2 J Tom
1/16/2015 7:45:58 3 A Phil Jr.
1/16/2015 7:48:23 3 C Jim
3/6/2015 14:09:42 3 A Bill
3/11/2015 11:16:04 3 K Jim
My initial thought (from the following link) was to groupby the job column, fill in the missing dates for each group and then ffill the statuses down.
Pandas reindex dates in Groupby
I was able to make this work...kinda...if two statuses occurred on the same date, one would not be included in output and consequently some statuses were missing.
I then found the following, it supposedly handles the duplicate issue, but I am unable to get it to work with my data.
Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe
Am I on the right path thinking that filling in the missing dates and then ffill down the statuses is the correct way to ultimately capture daily counts of individual statuses? Is there another method that might better use pandas features that I'm missing?
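For what it's worth, the fill-then-ffill idea can be sketched without a per-group reindex by pivoting to one column per job. The sketch below makes two assumptions that may not match your needs: when a job has several statuses on the same day, only the last one counts for that day, and a job stops being counted after its final recorded event. The rows are the ones from the question with the User column left out; names such as events and daily_counts are mine.

```python
import pandas as pd

# (Job, timestamp, Status) rows transcribed from the question.
rows = [
    (1, "2011-01-24 10:58:04", "A"), (1, "2011-01-24 10:59:20", "C"),
    (1, "2011-02-11 06:53:14", "A"), (1, "2011-02-11 06:53:23", "B"),
    (1, "2011-02-15 09:43:13", "C"), (1, "2011-02-21 15:24:42", "F"),
    (1, "2011-03-02 15:55:22", "G"), (1, "2011-03-04 14:57:45", "H"),
    (1, "2011-03-07 14:11:02", "I"), (1, "2011-03-09 09:57:34", "J"),
    (2, "2014-08-18 11:59:35", "A"), (2, "2014-08-18 13:56:21", "F"),
    (2, "2015-05-21 09:30:30", "G"), (2, "2015-06-05 13:17:54", "H"),
    (2, "2015-06-05 14:40:38", "I"), (2, "2015-06-09 10:39:15", "J"),
    (3, "2015-01-16 07:45:58", "A"), (3, "2015-01-16 07:48:23", "C"),
    (3, "2015-03-06 14:09:42", "A"), (3, "2015-03-11 11:16:04", "K"),
]
events = pd.DataFrame(rows, columns=["Job", "Date", "Status"])
events["Date"] = pd.to_datetime(events["Date"])
events["Day"] = events["Date"].dt.normalize()

# Assumption: the last status of the day is the job's status for that day.
last_per_day = (
    events.sort_values("Date")
          .drop_duplicates(subset=["Job", "Day"], keep="last")
)

# One column per job, one row per calendar day (gaps become NaN).
wide = last_per_day.pivot(index="Day", columns="Job", values="Status").asfreq("D")

# Forward-fill statuses, but only up to each job's final event:
# bfill() is NaN after the last event, so the mask cuts the tail off.
filled = wide.ffill().where(wide.bfill().notna())

# Back to long form, then count jobs per status per day.
long = (
    filled.reset_index()
          .melt(id_vars="Day", var_name="Job", value_name="Status")
          .dropna(subset=["Status"])
)
daily_counts = pd.crosstab(long["Day"], long["Status"])
print(daily_counts.loc[pd.Timestamp("2015-03-01")])
```

Days on which no job is active do not appear in daily_counts. If every status a job passes through during a day should be counted, the drop_duplicates step would need a different rule.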