dataframe line plot is not plotting a line with column values - python

I think there is something wrong with the data in my dataframe, but I am having a hard time coming to a conclusion. I think there might be some missing datetime values, which is the index of the dataframe. Given that there are over 1000 rows, it isn't possible for me to check each row manually. Here is a picture of my data and the corresponding line plot. Clearly this isn't a line plot!
Is there any way to supplement the possible missing values in my dataframe somehow?
I also did a line plot in seaborn to get another perspective, but I don't think it was immediately helpful.

You have effectively got the same situation as the one I have simulated below. Really you have a multi-index of date and age_group; plotting both together means the line jumps between the two groups. Separate them out, plot them as separate lines, and it looks as you expect.
import numpy as np
import pandas as pd

d = pd.date_range("1-jan-2020", "16-mar-2021")
df = pd.concat([pd.DataFrame({"daily_percent": np.sort(np.random.uniform(0.5, 1, len(d)))}, index=d).assign(age_group="0-9 Years"),
                pd.DataFrame({"daily_percent": np.sort(np.random.uniform(0, 0.5, len(d)))}, index=d).assign(age_group="20-29 Years")])
df.plot(kind="line", y="daily_percent", color="red")  # one series: the line jumps between the two groups
df.set_index("age_group", append=True).unstack(1).droplevel(0, axis=1).plot(kind="line", color=["red", "blue"])  # one line per group
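For the seaborn attempt mentioned in the question, the same separation can be expressed with hue, which draws one line per age group. A minimal sketch using the simulated df above ("date" is just a hypothetical name for the reset datetime index):
import seaborn as sns

sns.lineplot(data=df.rename_axis("date").reset_index(),
             x="date", y="daily_percent", hue="age_group")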

Related

How to remove the line connecting discontinuous points in matplotlib.pyplot.plot

I have a dataframe which has one column showing price, and its index is datetime.
2018-09-18T02:29:56.5 524.6
2018-09-18T02:29:57.0 524.6
2018-09-18T02:29:57.5 524.8
2018-09-18T02:29:59.0 525.1
2018-09-18T02:29:59.5 525.1
2018-09-18T02:30:00.0 524.8
2018-09-19T21:00:00.5 527.1
2018-09-19T21:00:01.0 527.1
2018-09-19T21:00:01.5 527.3
2018-09-19T21:00:02.0 527.7
2018-09-19T21:00:02.5 527.5
2018-09-19T21:00:03.0 527.6
2018-09-19T21:00:03.5 527.4
I'm trying to plot the time series with matplotlib.pyplot.plot(df).
It gives a plot, but with a long straight line connecting the discontinuous data points (the last data point on 2018-09-18T02:30:00.0 and the first data point on 2018-09-19T21:00:00.5). Is there a way to remove the connecting line across the gap between the data points?
Sorry... I think I figured out how to do it... just use
df.plot(x=df.index.astype(str))
Basically, convert the index from datetime to string, and use the strings as the x-axis.
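A minimal sketch of that idea, using a few of the price points copied from the sample above:
import pandas as pd
import matplotlib.pyplot as plt

# A handful of the question's prices, with the big overnight gap in between.
idx = pd.to_datetime(["2018-09-18 02:29:56.5", "2018-09-18 02:29:57.0",
                      "2018-09-18 02:30:00.0", "2018-09-19 21:00:00.5",
                      "2018-09-19 21:00:01.0"])
df = pd.DataFrame({"price": [524.6, 524.6, 524.8, 527.1, 527.1]}, index=idx)

# With string labels the x-axis is treated as categories, so the points are
# placed side by side and no long line is drawn across the missing time range.
df_str = df.copy()
df_str.index = df_str.index.astype(str)
df_str["price"].plot()
plt.show()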

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows (for the values in column 1) into a certain array, and do the same for the values in column 2 for a different array. This seems like it would normally be a fairly easy thing to do, so I think I am missing something; however, I can't find much online that doesn't get too complicated and doesn't seem to do what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]
Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
In this case you want the first column's values, so do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())
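A quick self-contained check of the answer, using the 3x3 example from the question:
import pandas as pd

# The 3x3 frame from the question; the default column labels are 0, 1, 2.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

playerNames = df[df.columns[0]].tolist()
print(playerNames)  # [1, 4, 7]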

Iterate through Groupby.unstack() items to make separate plots

I have a dataframe called afplot:
apple_fplot = apple_f1.groupby(['Year','Domain Category'])['Value'].sum()
afplot = apple_fplot.unstack('Domain Category')
I now need to produce a plot for each column of afplot, and need to save each plot to a unique filename.
I've been trying to do this through a for loop (I know that's inefficient), but I can't seem to get it right.
for index, column in afplot.iteritems():
    plt.figure(index); afplot[column].plot(figsize=(12,6))
    plt.xlabel('Year')
    plt.ylabel('Fungicide used / lb')
    plt.title('Amount of fungicides used on apples in the US')
    plt.legend()
    plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/apple_fplot{}'.format(index))
I'm not sure if I'm going about this the right way, but the whole idea is to have the plot be reset each time it goes through the iteration, plotting only the next column's line plot, and then saves it to a new filename.
The df.iteritems() iterator returns (column name, series) pairs (see the docs), so you can simplify:
import matplotlib.pyplot as plt

for col, data in afplot.iteritems():
    ax = data.plot(title='Amount of fungicides used on apples in the US')
    ax.set_ylabel('Fungicide used / lb')
    plt.gcf().savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/apple_fplot{}'.format(col))
    plt.close()
The xlabel should already be 'Year' as this seems to be the name of the index. Legend is True by default. See additional plot parameters.
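A small aside: in pandas 2.x iteritems() has been removed, and DataFrame.items() yields the same (column name, series) pairs, so only the loop header changes:
for col, data in afplot.items():  # pandas >= 2.0: items() replaces iteritems()
    ...  # body identical to the loop above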

My time series plot showing the wrong order

I'm plotting:
df['close'].plot(legend=True,figsize=(10,4))
The original data series comes in descending order, so I then did:
df.sort_values(['quote_date'])
The table now looks good and sorted in the desired manner, but the graph is still the same, showing today first and then going back in time.
Does .plot() order by index? If so, how can I fix this?
Alternatively, I'm importing the data with:
df = pd.read_csv(url1)
Can I somehow sort the data there already?
There are two problems with this code:
1) df.sort_values(['quote_date']) does not sort in place. It returns a sorted data frame, but df itself is unchanged, so assign the result back:
df = df.sort_values(['quote_date'])
2) Yes, the plot() method plots by index by default, but you can change this behavior with the use_index keyword:
df['close'].plot(use_index=False, legend=True,figsize=(10,4))
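Putting both fixes together (a minimal sketch; url1 and the column names are the ones from the question):
import pandas as pd

df = pd.read_csv(url1)                    # url1 as defined in the question
df = df.sort_values(['quote_date'])       # assign back: sort_values is not in place
df['close'].plot(use_index=False, legend=True, figsize=(10, 4))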

Adding names and assigning data types to ASCII data

My professor uses IDL and sent me a file of ASCII data that I need to eventually be able to read and manipulate.
He used the following command to read the data:
readcol, 'sn-full.txt', format='A,X,X,X,X,X,F,A,F,A,X,X,X,X,X,X,X,X,X,A,X,X,X,X,A,X,X,X,X,F,X,I,X,F,F,X,X,F,X,F,F,F,F,F,F', $
sn, off1, dir1, off2, dir2, type, gal, dist, htype, d1, d2, pa, ai, b, berr, b0, k, kerr
Here's a picture of what the first two rows look like: http://i.imgur.com/hT7YIE3.png
Since I'm not going to be an astronomer, I am using Python but since I am new to it, I am having a hard time reading the data.
I know that his code assigns the data type A (string data) to column one, skips columns two through six by using an X, and then assigns the data type F (floating point) to column seven, etc. Then sn is assigned to the first column that isn't skipped, and so on.
I have been trying to replicate this by using either numpy.loadtxt("sn-full.txt") or ascii.read("sn-full.txt") but am not sure how to enter the dtype parameter. I know I could assign everything to be a certain data type, but how do I assign data types to individual columns?
Using astropy.io.ascii you should be able to read your file relatively easily:
from astropy.io import ascii
# Give names for ALL of the columns, as there is no easy way to skip columns
# for a table with no column header.
colnames = ('sn', 'gal_name1', 'gal_name2', 'year', 'month', 'day', ...)
table = ascii.read('sn_full.txt', Reader=ascii.NoHeader, names=colnames)
This gives you a table with all of the data columns. The fact that you have some columns you don't need is not a problem unless the table is mega-rows long. For the table you showed you don't need to specify the dtypes explicitly since io.ascii.read will figure them out correctly.
One slight catch here is that the table you've shown is really a fixed width table, meaning that all the columns line up vertically. Notice that the first row begins with 1998S NGC 3877. As long as every row has the same pattern with three space-delimited columns indicating the supernova name and the galaxy name as two words, then you're fine. But if any of the galaxy names are a single word then the parsing will fail. I suspect that if the IDL readcol is working then the corresponding io.ascii version should work out of the box. If not then io.ascii has a way of reading fixed width tables where you supply the column names and positions explicitly.
[EDIT]
It looks like in this case a fixed-width reader is needed to tell the parser how to split the columns instead of just using whitespace as the delimiter. So basically you need to add two rows at the top of the table file, where the first one gives the column names and the second has dashes that indicate the span of each column:
a    b            c
---- ------------ ------
1.2  hello there  2
2.4  worlds       3
It's also possible in astropy.io.ascii to specify the start and stop position of each column in code, if you don't have the option of modifying the input data file, e.g.:
>>> ascii.read(table, Reader=ascii.FixedWidthNoHeader,
...            names=('Name', 'Phone', 'TCP'),
...            col_starts=(0, 9, 18),
...            col_ends=(5, 17, 28),
...            )
http://casa.colorado.edu/~ginsbura/pyreadcol.htm looks like it does what you want. It emulates IDL's readcol function.
Another possibility is https://pypi.python.org/pypi/fortranformat. It looks like it might be more capable, and the data you're looking at is in fixed format; the format specifiers (X, A, etc.) are Fortran format specifiers.
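A minimal sketch of how fortranformat is typically used; the format string and the sample record below are hypothetical, not the actual layout of sn-full.txt:
from fortranformat import FortranRecordReader

# Hypothetical record layout: a 10-character name, an F8.2 float and an I4 integer.
reader = FortranRecordReader('(A10, F8.2, I4)')
name, dist, count = reader.read('1998S       123.45  42')
print(name.strip(), dist, count)  # 1998S 123.45 42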
I would use Pandas for that particular purpose. The easiest way to do it is, assuming your columns are single-tab-separated:
import pandas as pd
import scipy as sp # Provides all functionality from numpy, too
mydata = pd.read_table(
    'filename.dat', sep='\t', header=None,
    names=['sn', 'gal_name1', 'gal_name2', 'year', 'month', ...],
    dtype={'sn': sp.float64, 'gal_name1': object, 'year': sp.int64, ...},)
(Strings here fall into the general 'object' datatype).
Each column now has a name and can be accessed as mydata['colname'], and this can then be sliced like a regular NumPy 1D array, e.g. mydata['colname'][20:50].
Pandas has built-in plotting calls to matplotlib, so you can quickly get an overview of a numerical column with mydata['column'].plot(), or plot two columns against each other with mydata.plot('col1', 'col2'). All normal plotting keywords can be passed.
If you want to plot the data in a normal matplotlib routine, you can just pass the columns to matplotlib, where they will be treated as ordinary Numpy vectors.
Each column can be accessed as an ordinary Numpy vector as mydata['colname'].values.
EDIT
If your data are not uniformly separated, numpy's genfromtxt() function is better. You can then convert it to a Pandas DataFrame by
mydf = pd.DataFrame(myarray, columns=['col1', 'col2', ...])
# The DataFrame constructor only takes a single dtype, so set per-column dtypes afterwards:
mydf = mydf.astype({'col1': sp.float64, 'col2': object, ...})
