I am trying to plot a CSV-formatted table using Python. So far, I have been able to get the result I wanted by reading similar questions on the site, but my solution doesn't seem very "pythonic", nor did I find a straightforward way of doing this. I am sure there is a more efficient way to plot a table, so I'm asking this question to learn more about Python and to give others a straight answer to the same problem. Here it goes:
I have a table of data with a header row and a first column; in my case they are months and years respectively, i.e.:
Year,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DIC
1998,,0.78,0.60,0.50,0.50,,,,,0.62,,0.45
1999,0.40,0.30,0.28,0.22,0.26,0.50,0.52,0.76,0.89,0.85,0.74,0.67
2000,0.58,0.58,0.51,0.47,0.63,0.92,1.00,1.00,0.99,1.00,0.96,0.91
2001,0.86,0.83,0.80,0.71,0.83,0.98,1.05,1.11,1.09,0.99,0.87,0.80
...
As you can see, there is missing data too.
My solution was the following:
import numpy as np
from matplotlib import pyplot as plt

# Import Data
Data = np.genfromtxt('LakeLevels.csv', delimiter=',', names=True, dtype=float)

# Extract data
Months = list(Data.dtype.names[1:])
Years = Data['Year']
Level = Data.view(dtype=float).reshape(Data.shape + (-1,))[:, 1:]
Level_masked = np.ma.array(Level, mask=np.isnan(Level))

# Plot
fig = plt.pcolor(np.linspace(1, 12, 12), Years, Level_masked)
plt.colorbar()
plt.xticks(range(12), Months, rotation=45)
I find this solution too complex for a very simple task. Is there a better way of achieving the same result, or parts of the code I can improve? Maybe there is even a function that does this automatically.
Thanks in advance.
You might consider using Pandas for this munging + plotting of data.
I didn't follow your logic all the way through (i.e., the mask), but here is the output of the following two lines (on part of your data):
import pandas as pd
pd.read_csv('stuff.csv', delimiter=',', index_col='Year').T.plot();
The more stuff you have to handle (e.g., missing data, etc.), the greater the difference in lines of code will become. NumPy is great, but for this sort of work you should probably use the higher-level libraries built on top of it.
I will post my final solution, based on @Ami Tavory's answer.
import pandas as pd
import seaborn as sns
df = pd.read_csv('LakeLevels.csv', delimiter=',', index_col='Year')
sns.heatmap(df)
So by using these 2 packages (i.e. pandas and seaborn) I was able to get my desired result in 2 lines!
Best regards.
I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d",
               aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10, 10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery, but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, e.g. making the lines dashed.
Here is the dataframe, the resulting plot generated by matplotlib, and the same data plotted in Excel (screenshots not reproduced here). I'm trying to make a plot similar to the Excel one using matplotlib.
Solution
Change pivot_table(..., fill_value="0") to pivot_table(..., fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that never needed a fill value for any year, which is why they were displayed. Any category column that received the string "0" was ignored by matplotlib.
A simpler and better solution is pd.crosstab(df['year'], df['category']) rather than the pivot_table line above.
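For reference, here is a minimal sketch of the corrected call, assuming the same file and column names as in the question (only fill_value changes):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("table_merged.xlsx", sheet_name="records")

# fill_value=0 (an int, not the string "0") keeps every category column numeric,
# so .plot() draws all 22 lines
df.pivot_table(index="year", columns="category", values="y_m_d",
               aggfunc=np.count_nonzero, fill_value=0).plot(figsize=(10, 10))
plt.show()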
The problem comes from the pivot; most likely you don't need it, since you are just tabulating year and category. The y_m_d column is not needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': np.random.randint(2008, 2020, 1000),
                   'category': np.random.choice(np.arange(10), size=1000,
                                                p=np.arange(10) / sum(np.arange(10))),
                   'y_m_d': np.random.choice(['a', 'b', 'c'], 1000)})
pd.crosstab(df['year'], df['category']).plot()
And looking at the code you have, the error comes from:
pivot(...,fill_value="0")
You are filling with the string "0", which changes the column dtype to object, and object columns are silently ignored by matplotlib. It should be fill_value=0 and it will work, though it is a rather complicated approach.
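To see why, here is a tiny self-contained sketch (made-up data, not from the question) of how pandas plotting treats a string column:

import pandas as pd
import matplotlib.pyplot as plt

# One numeric column and one string column: DataFrame.plot() silently drops
# the string (object-dtype) column, which is what happened to every category
# column that was filled with "0"
demo = pd.DataFrame({'numeric': [1, 2, 3], 'strings': ['1', '2', '3']})
print(demo.dtypes)   # numeric -> int64, strings -> object
demo.plot()          # only the 'numeric' line is drawn
plt.show()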
I want to add a key so that I'm able to tell which color corresponds to which column in my data frame. I made this plot by calling df.column_name.plot.density() multiple times. I've seen other examples with the key, but I haven't been able to locate the code that adds it.
In matplotlib, the display you're talking about is called a legend. I'm not sure if it's the same in pandas, but it's worth looking at!
Since your example didn't include enough code for me to try it out, I didn't.
Don't plot the variables one by one. Use df.plot.density(). If you want to plot a subset of variables, use df[var_list].plot.density(). If you want to plot them one by one for some reason, you may need to pass the label argument to each plot call and add a legend at the end (see the sketch after the code below).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.normal(size=(10, 4)),
                  columns=["Col1", "Col2", "Col3", "Col4"])
df.plot.density()
plt.show()
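And if you do go the column-by-column route mentioned above, a minimal sketch (same hypothetical df) that labels each curve and adds the legend:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.normal(size=(10, 4)),
                  columns=["Col1", "Col2", "Col3", "Col4"])

ax = None
for col in df.columns:
    # label each density curve so the legend can identify it by column name
    ax = df[col].plot.density(ax=ax, label=col)
ax.legend()
plt.show()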
I'm trying to filter some data from my plot. Specifically, it is the 'hair' at the top of the cycle that I wish to remove.
The 'hair' is due to some numerical errors, and I'm not aware of any method in Python, or how to write code myself, that would filter away points that don't occur frequently.
I wish to filter the data as I plot it.
You can use a smoothing function for this to get rid of the noise:
import pandas as pd
import matplotlib.pyplot as plt

# ts is your raw data as a pandas Series
# (pd.rolling_mean is gone from recent pandas; use the .rolling() method)
smooth_data = ts.rolling(5).mean()
smooth_data.plot(style='k')
plt.show()
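A self-contained sketch with made-up noisy data (the signal and the window size of 15 are illustrative, not from the question):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic noisy cycle standing in for the real data
t = np.linspace(0, 4 * np.pi, 500)
ts = pd.Series(np.sin(t) + np.random.normal(scale=0.15, size=t.size), index=t)

ts.plot(alpha=0.4, label='raw')                                  # noisy original
ts.rolling(15, center=True).mean().plot(style='k', label='smoothed')
plt.legend()
plt.show()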
I'm working with some Electrodermal data in Python and hoping to be able to calculate and graph z scores for my data. My data is structured as a single column in a csv. I've managed to get as far as importing this and turning it into a list with this:
import csv
with open("1538130011EDA.csv", 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row

import numpy as np
EDAdata = np.genfromtxt('1538130011EDA.csv', delimiter=',')
EDAlist = EDAdata.tolist()
print EDAlist
Then I imported the zscore function from scipy and checked it was working:
from scipy.stats import zscore
print zscore([1, 2, 3])
I'm not sure how to apply that to EDAlist, whether I can do it directly or need to transform the list in some way first.
I'm really sorry if this is a dumb question or I've overlooked something really simple. I am very much a beginner and really just need this one bit of code to help me get started on my project. Thank you so much for your help.
You can apply zscore to any array-like object. Your list is an array-like object, so you can apply the function directly to it:
zscore(EDAlist)
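Putting it together with the EDAlist already built above (the EDAscores name is just illustrative):

from scipy.stats import zscore

# zscore accepts any array-like, so the plain Python list works directly
EDAscores = zscore(EDAlist)
print(EDAscores)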
I'm trying to create some charts using weather data, pandas, and seaborn. I'm having trouble using lmplot (or any other seaborn plot function for that matter), though. I'm being told it can't concatenate str and float objects, but I used convert_objects(convert_numeric=True) beforehand, so I'm not sure what the issue is, and when I just print the dataframe I don't see anything wrong, per se.
import numpy as np
import pandas as pd
import seaborn as sns
new.convert_objects(convert_numeric=True)
sns.lmplot("AvgSpeed", "Max5Speed", new)
Some of the examples of unwanted placeholder characters that I saw in the few non-numeric spaces just glancing through the dataset were "M", " ", "-", "null", and some other random strings. Would any of these cause a problem for convert_objects? Does seaborn know to ignore NaN? I don't know what's wrong. Thanks for the help.
You need to assign the result to itself:
new = new.convert_objects(convert_numeric=True)
See the docs
convert_objects is deprecated as of pandas 0.21.0; you have to use to_numeric instead. For a single column:
new['AvgSpeed'] = pd.to_numeric(new['AvgSpeed'])
if you have multiple columns:
new = new.apply(pd.to_numeric)
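Given the placeholder strings mentioned in the question ("M", "-", "null", and so on), errors='coerce' is probably what you want so that they become NaN instead of raising. A self-contained sketch with invented values (only the column names come from the question):

import pandas as pd
import seaborn as sns

# Stand-in for the weather DataFrame; "M" and "null" mimic the placeholders
new = pd.DataFrame({"AvgSpeed":  ["3.1", "M", "4.8", "null", "5.2", "6.0"],
                    "Max5Speed": ["7.0", "8.1", "M", "9.3", "10.2", "11.5"]})

# Coerce every column to numeric; unparseable placeholders become NaN
new = new.apply(pd.to_numeric, errors='coerce')

# seaborn's regression plots drop rows with missing values before fitting
sns.lmplot(x="AvgSpeed", y="Max5Speed", data=new)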