Matplotlib.plot squiggly lines - python

I am trying to plot the closing price for Starbucks. I have parsed the dates and taken the close price. Here are the code and the output below:
df2 = pd.read_csv('sbux.csv', parse_dates=['date'], index_col='date')
df2 = df2[['close']]
plt.plot(df2)
When I run it through matplotlib, this is the result I get (shown below).
I'm not sure if it has to do with the format or something else. I did try changing the format using datetime, but it still returns the same thing. Any help would be appreciated.

Have you tried ordering your results by date? You can use:
df2 = df2.sort_values(by=['date'])
to order the values by date so that the trend moves as expected rather than jumping back and forth. (Since 'date' is the index in your code, df2 = df2.sort_index() does the same job.)
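A minimal sketch of the idea, using a tiny hypothetical frame in place of the asker's sbux.csv (the dates and prices below are made up):

```python
import pandas as pd

# Hypothetical unsorted price data standing in for sbux.csv
df2 = pd.DataFrame(
    {"close": [848.78, 850.10, 847.20]},
    index=pd.to_datetime(["2017-03-16", "2017-03-14", "2017-03-15"]),
)
df2.index.name = "date"

# Sort by the date index so the line no longer jumps back and forth
df2 = df2.sort_index()
print(df2.index.is_monotonic_increasing)  # True
```

After sorting, plt.plot(df2) draws a single left-to-right trend instead of a squiggle.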


Column names are not recognized? How to set the column names?

I have a dataset for which I am not able to call the columns. In the screenshot below, I have marked in yellow what I need to be recognized as columns (Vale On, Petroleo, etc.) and the Date column, which I need to be recognized as a date since I am working with time series data.
I have tried resetting the index and some related solutions, but nothing worked. I am new to Python, so I am sorry if it is too obvious.
# use first row as column names
df.columns = df.iloc[0]
# and then drop it
df = df.iloc[1:]
# convert first col to date
# if it doesn't work, try passing format=... ; see https://strftime.org/
# and https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
df['Date'] = pd.to_datetime(df['Date'])
A debugging hint if the date parsing keeps failing: check whether your date strings are consistent, perhaps like so: df['Date'].str.len().value_counts(). That should return only one length. If it returns multiple rows, you have inconsistent, anomalous data that you'll have to clean first.
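For example, with a hypothetical column where one row uses a shorter date format than the rest:

```python
import pandas as pd

# Hypothetical Date column with one malformed entry ("2020-1-3")
df = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02", "2020-1-3"]})

# Consistent data should yield a single string length;
# here we get two lengths (10 and 8), flagging the anomalous row
lengths = df["Date"].str.len().value_counts()
print(lengths)
```

Two rows in the output means at least one date string deviates from the common format and needs cleaning before pd.to_datetime will parse reliably with a fixed format string.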

.fillna breaking .dt.normalize()

I am trying to clean up some data, by formatting my floats to show no decimal points and my date/time to only show date. After this, I want to fill in my NaNs with an empty string, but when I do that, my date goes back to showing both date/time. Any idea why? Or how to fix it.
This is before I run the fillna() method, with a picture of what my data looks like:
#Creating DataFrame from path variable
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include date (excluded time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
#daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
(screenshot: output with NaNs, dates shown without times)
This is when I run the fillna() method:
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include date (excluded time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
(screenshot: output showing both date and time again)
Using normalize() does not change the dtype of the column; it only sets every timestamp's time to midnight, and pandas simply stops displaying the time portion when printing because all the values share the same (midnight) time.
Once you call fillna(''), the column is upcast to object dtype (a mix of Timestamps and strings), and each Timestamp is then printed in full, time included.
I would recommend the cleaner solution, which is to convert the column to actual datetime.date objects instead of using normalize():
df['date'] = pd.to_datetime(df['date']).dt.date
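A sketch of the fix on a hypothetical frame mimicking the Excel data (the timestamps and values are made up, not the asker's actual file):

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the Excel data: timestamps plus missing rows
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-05-01 08:30", "2021-05-02 17:45", None]),
    "Prod": [100.0, 95.0, None],
})

# Convert to plain datetime.date objects instead of using normalize()
df["Date"] = pd.to_datetime(df["Date"]).dt.date

# The column is now object dtype holding dates, so fillna('') cannot
# bring a time component back into the display
df = df.fillna("")
print(df["Date"].iloc[0])  # 2021-05-01
```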

Sum Cells with Same Date

I am a complete noob at this Python and Jupyter Notebook stuff. I am taking an Intro to Python course and have been assigned a task: to extract information from a .csv file. The following is a snapshot of my .csv file, titled "feeds1.csv":
https://i.imgur.com/BlknyC3.png
I can import the .csv into Jupyter Notebook and have tried the groupby function to sort it, but it doesn't work because the column also contains the time.
import pandas as pd
df = pd.read_csv("feeds1.csv")
I need it to output as follows:
https://i.imgur.com/BDfnZrZ.png
The ultimate goal is to create a csv file with this accumulated data and use it to plot a chart.
If you do not need the time of day but just the date, you can simply use this:
df.created_at = df.created_at.str.split(' ').str[0]
dfout = df.groupby(['created_at']).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[['created_at', 'entry_id']]
finaldf.columns = ['Date', 'field2']
finaldf.to_csv('outputfile.csv', index=False)
The first line will split the created_at column at the space between the date and time. The .str[0] means it will only keep the first part of the split (which is the date).
The second line groups them by date and gives you the count.
When writing to csv, if you do not want the index to show (as in your pic), then use index=False. If you want the index, then just leave that portion out.
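Put together end to end, with a few hypothetical rows inlined in place of feeds1.csv (the column names follow the answer; the timestamps and values are invented):

```python
import io
import pandas as pd

# Hypothetical feeds1.csv contents (created_at holds date and time)
csv = io.StringIO(
    "created_at,entry_id,field2\n"
    "2019-01-01 10:00:00,1,5\n"
    "2019-01-01 14:30:00,2,3\n"
    "2019-01-02 09:15:00,3,7\n"
)
df = pd.read_csv(csv)

# Keep only the date part, then count entries per date
df.created_at = df.created_at.str.split(" ").str[0]
dfout = df.groupby(["created_at"]).count()
dfout.reset_index(level=0, inplace=True)
finaldf = dfout[["created_at", "entry_id"]]
finaldf.columns = ["Date", "field2"]
print(finaldf)  # field2 column holds the per-day counts: [2, 1]
# finaldf.to_csv('outputfile.csv', index=False)
```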
First you need to parse your date right:
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
# You can choose to drop the earlier columns
# Now you just group by the date and apply the aggregation you want, e.g.:
df = df.groupby(["date_time"])["field2"].sum().reset_index()  # for example
df.to_csv("abc.csv", index=False)
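The same hypothetical rows run through this second approach, summing field2 per day instead of counting entries (again, the data is invented for illustration):

```python
import io
import pandas as pd

# Hypothetical feeds1.csv again
csv = io.StringIO(
    "created_at,entry_id,field2\n"
    "2019-01-01 10:00:00,1,5\n"
    "2019-01-01 14:30:00,2,3\n"
    "2019-01-02 09:15:00,3,7\n"
)
df = pd.read_csv(csv)

# Parse the date properly, then sum field2 per day
df["date_string"] = df["created_at"].str.split(" ").str[0]
df["date_time"] = pd.to_datetime(df["date_string"])
out = df.groupby("date_time")["field2"].sum().reset_index()
print(out)  # field2 per day: [8, 7]
```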

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage from the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
# pd.datetime was deprecated and later removed; use the datetime module directly
from datetime import datetime
date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
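The whole recipe, runnable on a small hypothetical frame with repeated date strings (the dates and values are made up; the point is that each distinct date is parsed exactly once):

```python
from datetime import datetime
import pandas as pd

# Hypothetical frame where each date string repeats many times
df = pd.DataFrame({
    "Date": ["01/02/2008", "01/02/2008", "02/03/2008"] * 2,
    "value": range(6),
})

date_parser = lambda x: datetime.strptime(str(x), "%m/%d/%Y")
df["date_index"] = df["Date"]
# Parse each distinct date once, then broadcast back via index alignment
dates = df.groupby(["date_index"]).first()["Date"].apply(date_parser)
df = df.set_index(["date_index"])
df["New Date"] = dates
df = df.reset_index()
print(df["New Date"].nunique())  # 2 -> only two parses for six rows
```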
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd (zero-based) columns as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
Following a lead in #pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This would be, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the custom date parsers from the answers above, pandas falls back to parsing the values one by one in a Python-level loop. The same happens when we specify the format we want (instead of the format the data actually has) in pd.to_datetime.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.
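A minimal sketch of the accepted fix, on a couple of hypothetical MM/DD/YYYY strings like the asker's Date column:

```python
import datetime
import pandas as pd

# Hypothetical MM/DD/YYYY strings like the asker's Date column
s = pd.Series(["01/02/2008", "12/31/2008"])

# Supplying the *current* format skips the slow dateutil fallback;
# .dt.date then yields plain datetime.date objects (YYYY-MM-DD when printed)
d = pd.to_datetime(s, format="%m/%d/%Y").dt.date
print(d.iloc[0])  # 2008-01-02
```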
Thank you to everyone who helped!

Can't plot dataframe when index is a date

I have a CSV file that looks like this:
Date,Close
16-Mar-17,848.78
15-Mar-17,847.2
Whenever I try to load it in and set the date as the index by doing:
df = pd.read_csv("new_data.csv")
df.set_index("Date")
I get ValueError: could not convert string to float: '18-Mar-16'. Why is this happening? I thought you could set a date as the index even if it was a string. I am a novice with pandas, so it is most likely a simple misunderstanding.
EDIT:
I was reading the error on the wrong line; here is the chunk of code that throws the error:
df = pd.read_csv("new_data.csv")
Close = df.sort_index(ascending=True)
plt.plot(Close)
plt.gca().invert_xaxis()
plt.show()
You need to convert the string Date column to datetime first:
Close['Date'] = pd.to_datetime(Close['Date'])
I had similar issues, all related to taking good care of the date data.
A good practice is to use pandas' own functionality to parse the date info while loading the data:
df = pd.read_csv("new_data.csv", parse_dates=[0], infer_datetime_format=True)
where column 0 is where the date column is located. (In pandas 2.0+, infer_datetime_format is deprecated and strict format inference is the default, so parse_dates=[0] alone is enough.)
Pandas will then do the "magic" and handle the xticks of the plot nicely.
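A self-contained sketch using the asker's two CSV rows, inlined via io.StringIO so there is no file dependency:

```python
import io
import pandas as pd

# The asker's CSV, inlined for a self-contained example
csv = io.StringIO("Date,Close\n16-Mar-17,848.78\n15-Mar-17,847.2\n")

# Let pandas parse the first column as dates while loading
df = pd.read_csv(csv, parse_dates=[0])
df = df.set_index("Date").sort_index()
print(df.index.dtype)  # datetime64[ns]
```

With a real DatetimeIndex, df['Close'].plot() orders the x axis correctly with no need for invert_xaxis().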
