Plotting Unsorted Dataframes with Plotly Scatter Plots - python

Whenever I try to plot data using the plotly python library (in this case from a Modeanalytics dataframe), it ends up connecting out-of-order data points together and producing a messy plot.
If I sort my data with the SQL query that generates the dataframe, then the plot looks great!
However, I want to actually sort the data in python and not in SQL.
I attempted to take the out-of-order dataframe and do this:
df.sort_values(by=['time'])
but it still resulted in the messy plot.
How can I sort my data frame in python such that it is plotted correctly?

By default, sort_values() returns a new dataframe without modifying the original.
You can either set the inplace flag to True or assign the output back to the original dataframe.
Try:
df = df.sort_values(by=['time'])
Or
df.sort_values(by=['time'], inplace=True)
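A minimal sketch of the difference, using a small made-up dataframe (the `value` column and the sample numbers are assumptions for illustration, not data from the question):

```python
import pandas as pd

# Hypothetical out-of-order data standing in for the query result
df = pd.DataFrame({"time": [3, 1, 2], "value": [30, 10, 20]})

# Calling sort_values() without keeping the result leaves df unchanged
df.sort_values(by=["time"])
unchanged = df["time"].tolist()

# Assigning the result back gives a dataframe sorted by time
df = df.sort_values(by=["time"])
sorted_times = df["time"].tolist()
```

Here `unchanged` still holds the original order, while `sorted_times` is in ascending order; the sorted dataframe is the one to pass on to the plotting call.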

Related

Order of the values in the streamlit line chart is upside down

I am getting data from an SQL database and converting it into a pandas dataframe. When I try to display my chart in streamlit, the order of the values is upside down.
dashboard_chart1 = st.line_chart(df, x="time", width=300, height=500)
I was trying to find something in the official streamlit docs, but there is no argument for the order.
Yes, I found a solution!
I was getting data from the database with the pandas function pd.read_sql(). All columns in the dataframe were objects. I used df['column_name'] = df['column_name'].astype(float) to convert them to floats. Now my data is shown correctly.
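A small sketch of why the dtype matters, assuming the values arrived as strings (the column name `reading` and the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose values arrived as strings (object dtype)
df = pd.DataFrame({"reading": ["9", "10", "100"]}, dtype=object)

# As strings, the values order lexicographically rather than numerically
as_text = df["reading"].sort_values().tolist()

# After conversion to float, they order as numbers
df["reading"] = df["reading"].astype(float)
as_numbers = df["reading"].sort_values().tolist()
```

In this sketch `as_text` puts "10" and "100" before "9", whereas `as_numbers` is in ascending numeric order, which is the kind of ordering a chart axis needs.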

Key error while plotting a bar graph using Matplotlib

I have been facing one issue while I am trying to plot a bar graph using the matplotlib library.
[Image: sample data]
count_movies_year = n_db.groupby('release_year').agg({'title':'count'}).rename(columns={'title':'no_of_titles'})
count_movies_year.reset_index()
With the code above I grouped the data, counted the titles per year, and renamed the resulting column. I then wanted to plot a bar graph of the result using matplotlib, so I wrote the code below:
plt.bar(count_movies_year['release_year'],count_movies_year['no_of_titles'])
plt.xlabel('release_year')
plt.ylabel('no_of_titles')
plt.show()
But when I run this I get a KeyError for 'release_year'. What is wrong here? I am new to Python and Matplotlib, so I would appreciate guidance on where exactly things go wrong so that I can correct them next time.
After a groupby, "release_year" no longer exists as a column in your dataframe, since it is now the index.
You have multiple solutions:
Use reset_index as you did, but assign the result back to your variable:
count_movies_year = count_movies_year.reset_index()
Or use the inplace parameter:
count_movies_year.reset_index(inplace=True)
Or use the .index directly in your plot:
plt.bar(count_movies_year.index, count_movies_year['no_of_titles'])
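The index behaviour can be checked without plotting anything. A minimal sketch, assuming a tiny made-up stand-in for `n_db` (the titles and years are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for n_db
n_db = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [2019, 2020, 2020],
})

count_movies_year = (
    n_db.groupby("release_year")
    .agg({"title": "count"})
    .rename(columns={"title": "no_of_titles"})
)

# After the groupby, release_year is the index, not a column
in_columns_before = "release_year" in count_movies_year.columns

# reset_index() returns a new frame; assign it back to keep the change
count_movies_year = count_movies_year.reset_index()
in_columns_after = "release_year" in count_movies_year.columns
```

Once the result is assigned back, `count_movies_year['release_year']` can be used in `plt.bar` without raising a KeyError.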

How do you replace data for an existing line chart using python-pptx?

I have a prebuilt and populated powerpoint presentation where I am modifying the data for charts and tables. I would like to retain all formatting (and much of the text), but replace the data in a line chart within a slide.
I have a function that will replace the data using a pandas data frame that works with bar charts.
def replaceCategoryChart(df, chart, skipLastCol=0):
    """
    Replaces Category chartdata for a simple series chart. e.g. Nonfarm Employment
    Parameters:
    df: dataframe containing new data. column 0 is the categories
    chart: the powerpoint shape chart.
    skipLast: 0=don't skip last column, 1+ will skip that many columns from the end
    Returns: replaced chart(?)
    """
    cols1= list(df)
    #print(cols1)
    #create chart data object
    chart_data = CategoryChartData()
    #create categories
    chart_data.categories=df[cols1[0]]
    # Loop over all series
    for col in cols1[1:-skipLastCol]:
        chart_data.add_series(col, df[col])
    #replace chart data
    chart.replace_data(chart_data)
...
S0_L= pd.read_excel(EXCEL_BOOK, sheet_name="S0_L", usecols="A:F")
S0_L_chart = prs.slides[0].shapes[3].chart
print(S0_L)
replaceCategoryChart(S0_L, S0_L_chart)
...
The python file runs successfully, however, when I open the powerpoint file I get the error
Powerpoint found a problem with content in Name.pptx.
Powerpoint can attempt to repair the presentation.
If you trust the source of this presentation, click Repair.
After clicking repair, the slide I attempted to modify is replaced by a blank layout.
Because this function works for bar charts, I think there is a mistake in the way I am understanding how to use replace_data() for a line chart.
Thank you for your help!
If your "line chart" is an "XY Scatter" chart, you'll need a different chart-data object, the XyChartData object and then to populate its XySeries objects: https://python-pptx.readthedocs.io/en/latest/api/chart-data.html#pptx.chart.data.XyChartData
I would recommend starting by getting it working using literal values, e.g. "South" and 1.05, and then proceed to supply the values from Pandas dataframes. That way you're sure the python-pptx part of your code is properly structured and you'll know where to go looking for any problems that arise.
As scanny mentioned, replace_data() does work for category line charts.
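The repair error was (probably) caused by incorrectly adding series data. With skipLastCol=0, the original slice cols1[1:-skipLastCol] evaluates to cols1[1:0], which is empty, so the loop added no series. The corrected loop: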
# Loop over all series
for col in cols1[1:len(cols1)-skipLastCol]:
    print('Type of column is ' + str(type(col)))
    chart_data.add_series(col, df[col])
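The slicing behaviour can be checked in isolation, without python-pptx. A short sketch with made-up column names (only `cols1` and `skipLastCol` come from the code above):

```python
# Hypothetical column list: first entry is the category column
cols1 = ["category", "series_a", "series_b", "series_c"]
skipLastCol = 0

# -0 is just 0, so this slice is cols1[1:0], which is empty
original = cols1[1:-skipLastCol]

# An explicit end index works for 0 as well as for positive values
corrected = cols1[1:len(cols1) - skipLastCol]
skip_one = cols1[1:len(cols1) - 1]
```

With `skipLastCol = 0`, `original` contains no columns at all, while `corrected` contains every series column; `skip_one` shows the same expression dropping the last column.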

Swapping dataframe column data without changing the index for the table

While compiling a pandas table to plot certain activity on a tool, I have encountered a rare error in the data that creates an extra 2 columns for certain entries. This means that one of my computed column values goes into the table 2 cells further along than the others and kills the plot.
I was hoping to find a way to pull the contents of a single cell in a row and swap it into the other cell beside it, which contains irrelevant information in the error case, but which is used for the plot of all the other pd data.
I've tried a couple of different ways to swap the data around but keep hitting errors.
My attempts to fix it include:
for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7] # both versions of this and below
        df['server'].replace(df['server'],df[7])
    else:
        pass
if '%USERID' in df['server']: # Attempt to fix missing server name
    df['server'] = df[7];
else:
    pass
if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass
I'd like the data from column '7' to be replicated in 'server', only in the case of the error - where the data in the cell contains a string starting with '%USERID'
Turns out I was over-thinking this one. I took a step back, worked the code a bit and solved it.
Rather than trying to force one-size-fits-all code onto all of the data, I wrote a nested loop that built separate lists for the general data and the 2 exceptions I found, and created 3 data frames from them. These were easy enough to manipulate individually and finally concatenate together. All working fine now.
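For comparison, a different route from the three-dataframe approach described above (not the author's method) is a boolean mask that copies values only for the affected rows. This is a sketch under assumptions: the sample values are made up, and column 7 is assumed to carry the integer label 7 rather than the string '7'.

```python
import pandas as pd

# Hypothetical data; column 7 is assumed to have an integer label
df = pd.DataFrame({
    "server": ["srv-a", "%USERID% x", "srv-b"],
    7: ["junk", "srv-c", "junk"],
})

# Rows where 'server' holds a string starting with the error marker
mask = df["server"].astype(str).str.startswith("%USERID")

# Copy column 7 into 'server' for those rows only
df.loc[mask, "server"] = df.loc[mask, 7]
```

Rows that do not match the mask keep their original `server` value, so the rest of the table is left untouched.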

Switching rows and columns in pyplot

I'm new to Python and after a lot of tinkering, have managed to clean up some .csv data.
I now have a bunch of countries as rows and a bunch of dates as columns, and am trying to create a chart showing a line for each country's value over time.
The problem is that when I enter df.plot() it results in a chart with each date as a line.
I have melted the data such that the first column is country, second is date, and third is value, but all I get is a single blue block growing over time (not multiple lines). How can I fix this?
You can use the transpose function in pandas (df.transpose(), or df.T for short).
Or, instead of df.plot, you can use plot(column, row).
As was mentioned in the comments, it is always better to provide an example (see ImportanceOfBeingErnest's comment).
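A minimal sketch of both layouts from the question, using made-up countries, dates, and values (all names here are assumptions for illustration). The plotting call is left as a comment so the snippet runs without a display:

```python
import pandas as pd

# Hypothetical wide table: countries as rows, dates as columns
wide = pd.DataFrame(
    {"2020-01": [1, 4], "2020-02": [2, 5], "2020-03": [3, 6]},
    index=["France", "Spain"],
)

# Transposing puts dates on the index, so each country becomes one column
by_country = wide.T
# by_country.plot() would then draw one line per country

# Hypothetical long (melted) table with country, date, value columns
long = wide.reset_index().melt(id_vars="index", var_name="date", value_name="value")
long = long.rename(columns={"index": "country"})

# Pivoting the long table gives the same one-column-per-country layout
pivoted = long.pivot(index="date", columns="country", values="value")
```

Because df.plot() draws one line per column, both `by_country` and `pivoted` are shaped so that each country gets its own line over time.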
