I'm new to Python and after a lot of tinkering, have managed to clean up some .csv data.
I now have a bunch of countries as rows and a bunch of dates as columns, and am trying to create a chart showing a line for each country's value over time.
The problem is that when I enter df.plot() it results in a chart with each date as a line.
I have melted the data such that the first column is country, second is date, and third is value, but all I get is a single blue block growing over time (not multiple lines). How can I fix this?
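You can use the transpose function in pandas (df.T or df.transpose()) so that the dates become the index and each country becomes a column. df.T.plot() then draws one line per country.
Or, instead of df.plot, you can use plot(column, row).
As was mentioned in the comments, it is always better to provide an example (see the comment by ImportanceOfBeingErnest).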
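The transpose suggestion, and the equivalent fix for the melted layout described in the question, can be sketched as follows. The country names, dates, and values are made up for illustration, and plot_lines is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Hypothetical data: one row per country, one column per date
wide = pd.DataFrame(
    {"2020-01": [1.0, 2.0], "2020-02": [1.5, 2.5], "2020-03": [2.0, 3.5]},
    index=["France", "Japan"],
)

# Transposing makes the dates the index and each country a column,
# which is the layout DataFrame.plot() expects for one line per country
by_date = wide.T

# The melted (country, date, value) layout can be pivoted back to that shape
long = wide.rename_axis("country").reset_index().melt(
    id_vars="country", var_name="date", value_name="value"
)
pivoted = long.pivot(index="date", columns="country", values="value")


def plot_lines(frame):
    """Hypothetical helper: draw one line per column (requires matplotlib)."""
    return frame.plot()
```

Calling plot_lines(by_date) or plot_lines(pivoted) should then draw one line per country rather than one per date.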
I'm a beginner in coding and I wrote some code in Python with pandas that I don't fully understand, so I need some clarification.
Let's say this is the data: DeathYear, Age, Gender and Country are all columns in an Excel file.
How to plot a table with non-numeric values in python?
I saw this question and I used this command
df.groupby('Gender')['Gender'].count().plot.pie(autopct='%.2f',figsize=(5,5))
It works and gives me a pie chart of the percentage of each gender,
but the normal pie chart command that I know for numerical data looks like this:
df["Gender"].plot.pie(autopct="%.2f",figsize=(5,5))
My question is: why did we add the .count()?
Is it to transform non-numerical data into numerical data?
And why did we use the groupby and type the column twice, ('Gender')['Gender']?
I'll address the second part of your question first, since it makes more sense to explain it that way.
The reason that you use ('Gender')['Gender'] is that it does two different things. The first ('Gender') is the argument to the groupby function. It tells pandas that you want the DataFrame to be grouped by the 'Gender' column. Note that the groupby function needs to have a column or level to group by or else it will not work.
The second ['Gender'] tells pandas to look only at the 'Gender' column of the grouped data. The easiest way to see what the second ['Gender'] does is to compare the output of df.groupby('Gender').count() and df.groupby('Gender')['Gender'].count() and see what happens.
One detail that I omitted in the first part for clarity is that the output of df.groupby('Gender') is not a DataFrame, but actually a DataFrameGroupBy object. The details of what exactly this object is are not important to your question, but the key is that to get a result back you need to apply a function that says what to put in the row for each group. The .count() function is one of those options (along with many others such as .mean(), etc.). In your case, since you want the total counts to make a pie chart, the .count() function does exactly that: it counts the number of times 'Female' and 'Male' appear in the 'Gender' column (missing values are not counted), and those counts become the entries in the corresponding rows. The result (a Series here, because a single column was selected) can then be used to create a pie chart. So you are correct in that the .count() function transforms the non-numeric 'Female' and 'Male' entries into numeric values which correspond to how often those entries appeared in the initial DataFrame.
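To make the comparison above concrete, here is a small sketch. The rows are invented; only the column names DeathYear, Age, Gender and Country come from the question.

```python
import pandas as pd

# Hypothetical rows using the column names from the question
df = pd.DataFrame({
    "DeathYear": [1990, 2001, 2005, 2010, 2012],
    "Age": [70, 65, 80, 55, 90],
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Country": ["FR", "US", "US", "JP", "FR"],
})

grouped = df.groupby("Gender")             # a DataFrameGroupBy, not a result yet
all_counts = grouped.count()               # DataFrame: counts for every other column
gender_counts = grouped["Gender"].count()  # Series: one count per gender

print(type(grouped).__name__)
print(all_counts)
print(gender_counts)
```

gender_counts is the numeric Series that .plot.pie(autopct='%.2f') turns into percentages.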
I have a prebuilt and populated powerpoint presentation where I am modifying the data for charts and tables. I would like to retain all formatting (and much of the text), but replace the data in a line chart within a slide.
I have a function that will replace the data using a pandas data frame that works with bar charts.
def replaceCategoryChart(df, chart, skipLastCol=0):
    """
    Replaces Category chartdata for a simple series chart. e.g. Nonfarm Employment
    Parameters:
        df: dataframe containing new data. column 0 is the categories
        chart: the powerpoint shape chart.
        skipLast: 0=don't skip last column, 1+ will skip that many columns from the end
    Returns: replaced chart(?)
    """
    cols1= list(df)
    #print(cols1)
    #create chart data object
    chart_data = CategoryChartData()
    #create categories
    chart_data.categories=df[cols1[0]]
    # Loop over all series
    for col in cols1[1:-skipLastCol]:
        chart_data.add_series(col, df[col])
    #replace chart data
    chart.replace_data(chart_data)
...
S0_L= pd.read_excel(EXCEL_BOOK, sheet_name="S0_L", usecols="A:F")
S0_L_chart = prs.slides[0].shapes[3].chart
print(S0_L)
replaceCategoryChart(S0_L, S0_L_chart)
...
The Python file runs successfully. However, when I open the PowerPoint file I get the error:
Powerpoint found a problem with content in Name.pptx.
Powerpoint can attempt to repair the presentation.
If you trust the source of this presentation, click Repair.
After clicking repair, the slide I attempted to modify is replaced by a blank layout.
Because this function works for bar charts, I think there is a mistake in the way I am understanding how to use replace_data() for a line chart.
Thank you for your help!
If your "line chart" is an "XY Scatter" chart, you'll need a different chart-data object, XyChartData, and you'll then need to populate its XySeries objects: https://python-pptx.readthedocs.io/en/latest/api/chart-data.html#pptx.chart.data.XyChartData
I would recommend starting by getting it working using literal values, e.g. "South" and 1.05, and then proceed to supply the values from Pandas dataframes. That way you're sure the python-pptx part of your code is properly structured and you'll know where to go looking for any problems that arise.
As scanny mentioned, replace_data() does work for category line charts.
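The repair error was (probably) caused by incorrectly adding series data. The loop was wrong: with skipLastCol=0 the slice cols1[1:-skipLastCol] becomes cols1[1:-0], which is empty, so no series were added. The corrected loop is below.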
# Loop over all series
for col in cols1[1:len(cols1)-skipLastCol]:
    print('Type of column is ' + str(type(col)))
    chart_data.add_series(col, df[col])
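The difference between the two slices can be checked without PowerPoint at all. In this sketch the column names and the two helper functions are hypothetical and exist only to compare the slices.

```python
# Hypothetical column names; column 0 holds the categories
cols1 = ["Month", "Series A", "Series B", "Series C"]


def series_columns_buggy(cols, skip_last=0):
    # -0 is just 0, so cols[1:-0] is cols[1:0], an empty list
    return cols[1:-skip_last]


def series_columns_fixed(cols, skip_last=0):
    # An explicit end index also works when skip_last is 0
    return cols[1:len(cols) - skip_last]


print(series_columns_buggy(cols1))     # no series columns at all
print(series_columns_fixed(cols1))     # every column after the first
print(series_columns_fixed(cols1, 1))  # drops the last column
```

The two versions agree whenever skip_last is 1 or more; they differ only for the default of 0.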
Ok, so I have aggregated a bunch of data that looks like this:
   X-mean   y-Mean    z-Mean
1  0.3444   2.34987   1.347
2 etc.
3
4
5
6
Except, it is not three columns, but 561 of them :-)
So, it seems like such a simple problem to me: I know how to plot the first column vs. the x column using Mean_f_values.plot(y=y_vals, use_index=True). The column names are often a bunch of gibberish, so I want to plot individual columns by referring to their position rather than their name. I want to write some kind of for loop and display several graphs as I try to weed out useless columns. But all I can find (so far) is that we can only refer to a column by its name, not its position, when plotting. It seems obvious to me that this cannot be true, at least with some kind of simple plotting method. I am kind of a noob, so what am I missing? Thanks!
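One possible approach, not taken from the thread itself, is to select columns by position with DataFrame.iloc and plot the resulting Series. In this sketch only the first data row comes from the question; the other rows and both helper functions are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 561-column frame
mean_f_values = pd.DataFrame(
    {
        "X-mean": [0.3444, 0.31, 0.35],
        "y-Mean": [2.34987, 2.40, 2.29],
        "z-Mean": [1.347, 1.30, 1.41],
    },
    index=[1, 2, 3],
)


def column_at(frame, position):
    """Return the column at an integer position, whatever its name is."""
    return frame.iloc[:, position]


def plot_columns(frame, positions):
    """Hypothetical helper: one figure per column position."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    for pos in positions:
        plt.figure()
        column_at(frame, pos).plot(title=str(frame.columns[pos]), use_index=True)
    plt.show()
```

For example, plot_columns(mean_f_values, range(3)) would show the first three columns one after another. frame.columns[pos] gives the name at a position if a name-based call such as plot(y=...) is preferred.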
Whenever I try to plot data using the plotly python library (in this case from Modeanalytics dataframe), it ends up connecting out-of-order data points together and causing a mess as follows:
If I sort my data with the SQL query that generates the dataframe, then the plot looks great!
However, I want to actually sort the data in python and not in SQL.
I attempted to take the out-of-order dataframe and do this:
df.sort_values(by=['time'])
but it still resulted in the messy plot.
How can I sort my data frame in python such that it is plotted correctly?
By default, sort_values() returns a new dataframe without modifying the original.
You can either set the inplace flag to True or assign the output back to the original dataframe.
Try:
df = df.sort_values(by=['time'])
Or
df.sort_values(by=['time'], inplace=True)
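A small sketch of the difference, using made-up data with a 'time' column like the one in the question:

```python
import pandas as pd

# Hypothetical out-of-order data
df = pd.DataFrame({"time": [3, 1, 2], "value": [30, 10, 20]})

# The sorted copy is discarded here, so df itself is unchanged
df.sort_values(by=["time"])
times_before = df["time"].tolist()

# Assigning the result back keeps the sorted frame
df = df.sort_values(by=["time"])
times_after = df["time"].tolist()

print(times_before)
print(times_after)
```

Plotting should only connect the points in time order once the sorted frame is the one being passed to the plotting call.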
Hi hoping someone can help. I have a data frame where one of the columns contains a list of names. These names are repeated in some circumstances but not all. I am trying to plot a graph where the x-axis contains the name and then the y-axis contains the number of times that name appears in the column.
I have used the following to count the number of times each name appears.
df.groupby('name').name.count()
Then I tried to use the following to plot the graph. However, I get a KeyError message.
df.plot.bar(x='name', y=df.groupby('name').name.count())
Anyone able to tell me what I am doing wrong?
Thanks
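I believe you need to plot the Series returned by the count function using Series.plot.bar (the y argument of df.plot.bar expects a column label, not a Series of counts, hence the KeyError):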
df.groupby('name').name.count().plot.bar()
Or use value_counts:
df['name'].value_counts().plot.bar()
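Both variants produce the same counts and differ only in row order, as this sketch with made-up names shows. plot_counts is a hypothetical helper.

```python
import pandas as pd

# Hypothetical column of repeated names
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"]})

# groupby + count: one row per name, ordered by name
by_group = df.groupby("name")["name"].count()

# value_counts: the same counts, ordered from most to least frequent
by_value = df["name"].value_counts()


def plot_counts(counts):
    """Hypothetical helper: bar chart with names on the x-axis (requires matplotlib)."""
    return counts.plot.bar()
```

Either Series can be passed to plot_counts. The index supplies the x-axis labels and the counts supply the bar heights.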