Converting pandas dataframe to numeric; seaborn can't plot - python

I'm trying to create some charts using weather data, pandas, and seaborn. I'm having trouble using lmplot (or any other seaborn plot function for that matter), though. I'm being told it can't concatenate str and float objects, but I used convert_objects(convert_numeric=True) beforehand, so I'm not sure what the issue is, and when I just print the dataframe I don't see anything wrong, per se.
import numpy as np
import pandas as pd
import seaborn as sns
new.convert_objects(convert_numeric=True)
sns.lmplot("AvgSpeed", "Max5Speed", new)
Some of the examples of unwanted placeholder characters that I saw in the few non-numeric spaces just glancing through the dataset were "M", " ", "-", "null", and some other random strings. Would any of these cause a problem for convert_objects? Does seaborn know to ignore NaN? I don't know what's wrong. Thanks for the help.

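You need to assign the result back to the variable, because convert_objects returns a new DataFrame instead of modifying the existing one in place: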
new = new.convert_objects(convert_numeric=True)
See the docs
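convert_objects is deprecated as of version 0.21.0; use pd.to_numeric instead. For a single column (errors="coerce" turns unparseable placeholders into NaN):
new["AvgSpeed"] = pd.to_numeric(new["AvgSpeed"], errors="coerce")
If you have multiple columns:
new = new.apply(pd.to_numeric, errors="coerce")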
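To see how the placeholder strings from the question behave, here is a minimal sketch. Only the column names and the placeholders "M", " ", "-", and "null" come from the question; the numeric values are invented for illustration, and using errors="coerce" assumes you want anything unparseable to become NaN.

```python
import pandas as pd

# Hypothetical sample data: the placeholder strings are the ones mentioned
# in the question, the numbers are made up for illustration.
new = pd.DataFrame({
    "AvgSpeed": ["5.1", "M", "7.3", " ", "6.0"],
    "Max5Speed": ["12", "-", "15", "null", "14"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising an error.
new = new.apply(pd.to_numeric, errors="coerce")

print(new.dtypes)        # dtype of each column after conversion
print(new.isna().sum())  # how many values per column became NaN

# Rows with NaN can be dropped explicitly before plotting if desired.
clean = new.dropna(subset=["AvgSpeed", "Max5Speed"])
```

Checking new.dtypes after the conversion is a quick way to confirm that the assignment actually took effect before calling any plotting function.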

Related

How to show more categories in a line plot of a pivot table

I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d", aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10,10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery() but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, like making the lines dashed, etc.
Here is the dataframe
Here is the resulting plot generated by matplotlib
Here are the same data plotted in Excel. I'm trying to make a similar plot using matplotlib
Solution
Change pivot_table(..., fill_value="0") to pivot_table(..., fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that did not have a 0 value for any year. This is why they were displayed. Any category that contained a "0" string was dropped from the plot.
A simpler and better solution is pd.crosstab(df['year'], df['category']) in place of line 5 of my code above.
The problem comes from the pivot. Most likely you don't need it, since you are just tabulating year against category; the y_m_d column is not needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':np.random.randint(2008,2020,1000),
'category':np.random.choice(np.arange(10),size=1000,p=np.arange(10)/sum(np.arange(10))),
'y_m_d':np.random.choice(['a','b','c'],1000)})
pd.crosstab(df['year'],df['category']).plot()
Looking at the code you have, the problem comes from:
pivot_table(..., fill_value="0")
You are filling with the string "0", which turns the affected columns into non-numeric columns, and those are skipped when plotting. It should be fill_value=0 and it will work, though the pivot is still a more complicated approach than necessary.
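A small sketch of the difference between the two fill values, on a made-up table rather than the asker's Excel file (the rows below are hypothetical; only the column names year, category, and y_m_d come from the question). It counts how many columns pandas still treats as numeric in each case. The exact dtypes may vary between pandas versions, so take this as an illustration rather than a guarantee.

```python
import pandas as pd

# Hypothetical toy data: category "a" occurs in every year, "b" and "c" do not.
df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "category": ["a", "b", "a", "a", "c"],
    "y_m_d": ["2019-01-01", "2019-02-01", "2020-01-01",
              "2020-03-01", "2020-04-01"],
})


def count_table(fill_value):
    """Count rows per year and category, filling gaps with fill_value."""
    return df.pivot_table(index="year", columns="category", values="y_m_d",
                          aggfunc="count", fill_value=fill_value)


bad = count_table("0")   # string fill value, as in the question
good = count_table(0)    # integer fill value

# Only numeric columns are drawn by DataFrame.plot().
n_numeric_bad = bad.select_dtypes(include="number").shape[1]
n_numeric_good = good.select_dtypes(include="number").shape[1]
print(n_numeric_bad, n_numeric_good)

# crosstab produces the same counts without needing a values column.
counts = pd.crosstab(df["year"], df["category"])
print(counts)
```

Calling .plot() on good or counts should then draw one line per category.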

Adding a key on a density graph with Pandas

I want to add a key so that I can tell which color corresponds to which column in my data frame. I made the graph by calling df.column_name.plot.density() once for each column. I've seen other examples with a key, but I haven't been able to find the code that adds it.
In matplotlib, the display you're talking about is called a legend. I'm not sure if it's the same in pandas, but it's worth looking at!
Since your example didn't include enough code for me to try it out, I didn't.
Don't plot the variables one by one. Use df.plot.density(). If you want to plot a subset of variables, use df[var_list].plot.density(). If you do need to plot them one by one for some reason, pass the label argument to each plot call and add a legend at the end.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.normal(size = (10,4)),
columns = ["Col1", "Col2", "Col3", "Col4"])
df.plot.density()
plt.show()
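If you do want to draw the columns one by one, a sketch of the label-plus-legend approach could look like the following. The column names and random data are placeholders, the Agg backend is selected only so the snippet runs without a display, and plot.density needs SciPy to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data: three normally distributed columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Col1", "Col2", "Col3"])

# Draw the chosen columns on the same axes, labelling each curve.
fig, ax = plt.subplots()
for name in ["Col1", "Col3"]:
    df[name].plot.density(ax=ax, label=name)

# The legend (the "key") is built from the labels given above.
ax.legend()

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Plotting the subset in one call with df[["Col1", "Col3"]].plot.density() adds the legend automatically, so the loop is only needed when each curve requires its own styling.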

ValueError: Could not interpret input 'index' when using index with seaborn lineplot

I want to use the index of a pandas DataFrame as the x value for a seaborn plot. However, this raises a ValueError.
A small test example:
import pandas as pd
import seaborn as sns
sns.lineplot(x='index',y='test',hue='test2',data=pd.DataFrame({'test':range(9),'test2':range(9)}))
It raises:
ValueError: Could not interpret input 'index'
Is it not possible to use the index as x values? What am I doing wrong?
Python 2.7, seaborn 0.9
I would prefer to pass the index directly, as shown below. You also need to remove hue; I assume it serves a different purpose that doesn't apply to your current DataFrame, because you have a single line. Visit the official docs here for more info.
df=pd.DataFrame({'test':range(9),'test2':range(9)})
sns.lineplot(x=df.index, y='test', data=df)
Output
You would need to make sure the string you provide to the x argument is actually a column in your dataframe. The easiest solution to achieve that is to reset the index of the dataframe to convert the index to a column.
sns.lineplot(x='index', y='test', data=pd.DataFrame({'test':range(9),'test2':range(9)}).reset_index())
I know it's an old question, and maybe this wasn't around back then, but there's a much simpler way to achieve this:
If you just pass a series from a dataframe as the 'data' parameter, seaborn will automatically use the index as the x values.
sns.lineplot(data=df.column1)
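For what it's worth, the reset_index route can be checked with pandas alone. This is just a sketch: the index name "step" is a hypothetical choice for illustration, and the seaborn calls are left as comments so the snippet runs without seaborn installed.

```python
import pandas as pd

df = pd.DataFrame({"test": range(9), "test2": range(9)})

# "index" is not a column, which is why x='index' cannot be interpreted.
has_index_column_before = "index" in df.columns

# reset_index() moves an unnamed index into a regular column called "index".
flat = df.reset_index()
has_index_column_after = "index" in flat.columns

# If the index has a name, that name is used for the new column instead.
named = df.rename_axis("step").reset_index()
named_columns = list(named.columns)

print(has_index_column_before, has_index_column_after, named_columns)

# Either frame can then be passed to seaborn with a column name, e.g.
# sns.lineplot(x="index", y="test", data=flat)
# sns.lineplot(x="step", y="test", data=named)
```

The named-index case is worth keeping in mind: after reset_index the column is called whatever the index was called, so x='index' only works for an unnamed index.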

How to use matplotlib to plot line charts

I use pandas to read my csv file and turn two columns into arrays as independent/dependent variables respectively.
(Image: reading the data, converting the columns to arrays, and assigning the values)
Then, when I try to plot the line chart with matplotlib.pyplot, I get the error "'numpy.ndarray' object has no attribute 'find'".
import numpy as np
import matplotlib.pyplot as plt
plt.plot(x,y)
The problem is probably with your dtypes. Assuming your data are in df, check df.dtypes. The columns you are trying to plot must be numeric (float, int, bool).
I guess that at least one of the columns you are plotting has object dtype. Try to find out why (maybe missing values were read as some sort of string, or everything is being treated as a string) and convert it to the correct type with astype, e.g.
df['float_col'] = df['float_col'].astype(np.float64)
Edit:
If you are trying to plot dates, make sure that the dtype is actually a date type, i.e. datetime64[ns], and use matplotlib's dedicated method plot_date.
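A sketch of the dtype check, using an invented two-column CSV (the "missing" marker and the column names x and y are assumptions for illustration, not taken from the question). It shows how one stray string makes a whole column non-numeric, and one way to fix that at read time.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV: the stray "missing" forces the y column to be read as text.
csv_text = "x,y\n1,2.5\n2,missing\n3,4.0\n"

raw = pd.read_csv(io.StringIO(csv_text))
y_is_numeric_raw = pd.api.types.is_numeric_dtype(raw["y"])

# Telling read_csv which marker means "no value" lets it parse y as float.
fixed = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
y_is_numeric_fixed = pd.api.types.is_numeric_dtype(fixed["y"])

# astype works once every remaining value is a valid number.
fixed["x"] = fixed["x"].astype(np.float64)

print(raw.dtypes)
print(fixed.dtypes)

# Numeric arrays like these can be passed to plt.plot(x, y).
x = fixed["x"].to_numpy()
y = fixed["y"].to_numpy()
```

Note that calling astype(np.float64) directly on the raw y column would raise an error here, because "missing" cannot be converted to a float.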

Efficiently plotting a table in csv format using Python

I am trying to plot a csv formatted table using Python. So far, I was able to get the result I wanted by reading similar questions on the site, but my solution doesn't seem very "pythonic", nor did I find a very straightforward way of doing this. I am sure there is a more efficient way of plotting a table, so I'm asking this question to learn more about Python and to let others have a straight answer to the same problem. Here it goes:
I have a table of data with a header row and a leading label column. In my case, these hold the months and the years, respectively, i.e.:
Year,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DIC
1998,,0.78,0.60,0.50,0.50,,,,,0.62,,0.45
1999,0.40,0.30,0.28,0.22,0.26,0.50,0.52,0.76,0.89,0.85,0.74,0.67
2000,0.58,0.58,0.51,0.47,0.63,0.92,1.00,1.00,0.99,1.00,0.96,0.91
2001,0.86,0.83,0.80,0.71,0.83,0.98,1.05,1.11,1.09,0.99,0.87,0.80
...
As you can see, there is missing data too.
My solution was the following:
import numpy as np
from matplotlib import pyplot as plt
#Import Data
Data=np.genfromtxt('LakeLevels.csv',delimiter=',',names=True,dtype=float)
#Extract data
Months=list(Data.dtype.names[1:])
Years=Data['Year']
Level=Data.view(dtype=float).reshape(Data.shape + (-1,))[:,1:]
Level_masked= np.ma.array (Level, mask=np.isnan(Level))
#Plot
fig=plt.pcolor(np.linspace(1,12,12),Years,Level_masked)
plt.colorbar()
plt.xticks(range(12),Months,rotation=45)
I found the solution was too complex for a very simple task. Is there a better way of achieving the same result or parts of the code I can improve? Maybe even a function that does this automatically.
Thanks in advance.
You might consider using Pandas for this munging + plotting of data.
I didn't follow through your logic all the way (i.e., the mask), but here is the output of the following two lines (on part of your data):
import pandas as pd
pd.read_csv('stuff.csv', delimiter=',', index_col='Year').T.plot()
The more you need to handle (e.g., missing data), the bigger the difference in lines of code will become. NumPy is great, but for this sort of thing you should probably use the higher-level libraries built on top of it.
I will post my final solution based on @Ami Tavory's answer.
import pandas as pd
import seaborn as sns
df = pd.read_csv('LakeLevels.csv', delimiter=',', index_col='Year')
sns.heatmap(df)
So by using these 2 packages (i.e. pandas and seaborn) I was able to get my desired result in 2 lines!
Best regards.
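For anyone who wants to try this without the original file, here is a sketch that inlines the four sample rows from the question (the real LakeLevels.csv has more rows, which are not reproduced here). It only covers the reading step and shows that the empty fields arrive as NaN; the plotting calls are left as comments.

```python
import io

import pandas as pd

# The sample rows shown in the question, inlined so the sketch is self-contained.
csv_text = """Year,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DIC
1998,,0.78,0.60,0.50,0.50,,,,,0.62,,0.45
1999,0.40,0.30,0.28,0.22,0.26,0.50,0.52,0.76,0.89,0.85,0.74,0.67
2000,0.58,0.58,0.51,0.47,0.63,0.92,1.00,1.00,0.99,1.00,0.96,0.91
2001,0.86,0.83,0.80,0.71,0.83,0.98,1.05,1.11,1.09,0.99,0.87,0.80
"""

# Years become the row index, months the columns; empty fields become NaN.
df = pd.read_csv(io.StringIO(csv_text), index_col="Year")

n_missing = int(df.isna().sum().sum())
print(df.shape, n_missing)

# Plotting then works directly on the frame, for example:
# sns.heatmap(df)   # years on the y axis, months on the x axis
# df.T.plot()       # one line per year across the months
```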
