Pandas DataFrames and sorting values - python

I am having a difficult time with this homework assignment and am not sure where I went wrong. I have tried several things and believe my issue lies in sort_values or maybe in the groupby command.
I only want to display graph data from the year 2007 (using pandas and Plotly in a Jupyter notebook for my class). I mostly have the graph I want, but I cannot get it to display the data correctly: it isn't filtering by year or taking data from the specific dates requested.
import pandas as pd
import plotly.express as px
df = pd.read_csv('Data/Country_Data.csv')
print(df.shape)
df.head(2)
df_Q1 = df.query("year == '2007'")
print(df_Q1.shape)
df_Q1.head()
This is where the issue begins: it prints a table with only the header. That is, it prints all the column names but none of the data for them, and later on it displays a graph of what I assume is the most recent death data rather than the year 2007 as specified.
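One possible cause, sketched here as an assumption because the CSV itself is not shown: if read_csv inferred the year column as integers, comparing it with the string '2007' matches no rows, which would explain a table with headers only. The small frame below is a hypothetical stand-in for Data/Country_Data.csv, and its column names other than year are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Data/Country_Data.csv (the real columns are not shown)
df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2002, 2007, 2007],
    "deaths": [10, 12, 7],
})

# An integer dtype here means the string '2007' can never match
print(df["year"].dtype)

# Comparing an integer column with a string matches no rows,
# which leaves only the column headers
mismatch = df[df["year"] == "2007"]

# Comparing with the number 2007 keeps the 2007 rows
df_Q1 = df.query("year == 2007")
print(mismatch.shape, df_Q1.shape)
```

If the dtype turns out to be object instead (years stored as text), the string comparison would be the right one and the empty result would need a different explanation.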

Related

Why doesn't Seaborn relplot print datetime values on the x-axis?

I'm trying to solve a Kaggle Competition to get deeper into data science knowledge. I'm dealing with an issue with seaborn library. I'm trying to plot a distribution of a feature along the date but the relplot function is not able to print the datetime value. On the output, I see a big black box instead of values.
Here is my plotting code:
rainfall_types = list(auser.loc[:,1:])
grid = sns.relplot(x='Date', y=rainfall_types[0], kind="line", data=auser);
grid.fig.autofmt_xdate()
Here are the Seaborn relplot output and the head of my dataset.
I found the error. When you use pandas.read_csv(dataset) and your dataset contains a datetime column, pandas does not parse it by default: the column gets the object dtype, and Python sees its values as 'str' (strings). So when you plot them, matplotlib is not able to show them correctly.
To avoid this behaviour, convert the values into datetime objects by using:
df = pandas.read_csv(dataset, parse_dates=['Column_Date'])
This tells pandas that the column identified by the key 'Column_Date' holds dates and has to be converted into datetime objects.
If you want, you can use the date column as the index of your dataframe to speed up analysis over time. To do so, add the argument index_col='Column_Date' to your read_csv call.
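As a minimal sketch of the difference, the snippet below reads the same small CSV text twice, once without and once with parse_dates. The CSV content and the Rainfall column are made up for illustration; only the 'Column_Date' key comes from the answer above.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real dataset
csv_text = "Column_Date,Rainfall\n2020-01-01,1.2\n2020-01-02,0.0\n"

# Without parse_dates the date column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates (and index_col) the dates become a datetime index
parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Column_Date"],
    index_col="Column_Date",
)

print(raw["Column_Date"].dtype)
print(parsed.index.dtype)
```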
I hope you will find it helpful.

How to show more categories in a line plot of a pivot table

I have an Excel file containing rows of objects with at least two columns of variables: one for year and one for category. There are 22 types in the category variable.
So far, I can read the Excel file into a DataFrame and apply a pivot table to show the count of each category per year. I can also plot these yearly counts by category. However, when I do so, only 4 of the 22 categories are plotted. How do I instruct Matplotlib to show plot lines and labels for each of the 22 categories?
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("table_merged.xlsx", sheet_name="records", encoding="utf8")
df.pivot_table(index="year", columns="category", values="y_m_d", aggfunc=np.count_nonzero, fill_value="0").plot(figsize=(10,10))
I checked the matplotlib documentation for plot(). The only argument that seemed remotely related to what I'm trying to accomplish is markevery() but it produced the error "positional argument follows keyword argument", so it doesn't seem right. I was able to use several of the other arguments successfully, like making the lines dashed, etc.
Here is the dataframe
Here is the resulting plot generated by matplotlib
Here are the same data plotted in Excel. I'm trying to make a similar plot using matplotlib
Solution
Change pivot_table(..., fill_value="0") to pivot_table(..., fill_value=0) and all of the categories appear in the figure as coded above. In the original figure, the four displayed categories were the only ones of the 22 that did not have a 0 value for any year, which is why they were displayed. Any category that had a "0" value was ignored by matplotlib.
A simpler and better solution is pd.crosstab(df['year'], df['category']) rather than my line 5 above.
The problem comes from the pivot. Most likely you don't need it, since you are just tabulating years and categories; the y_m_d column is not needed at all.
Try something like below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'year':np.random.randint(2008,2020,1000),
'category':np.random.choice(np.arange(10),size=1000,p=np.arange(10)/sum(np.arange(10))),
'y_m_d':np.random.choice(['a','b','c'],1000)})
pd.crosstab(df['year'],df['category']).plot()
Looking at the code you have, the error comes from:
pivot_table(..., fill_value="0")
You are filling with the string "0", which turns the affected columns into a non-numeric dtype, so they are skipped when plotting. It should be fill_value=0 and then it will work, though it is a very complicated approach.
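To see the effect of the fill value without plotting anything, one can compare which columns stay numeric. This is only a sketch on three hypothetical records, using aggfunc="count" instead of the np.count_nonzero from the question.

```python
import pandas as pd

# Hypothetical records: category "b" has no rows in 2020
df = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "category": ["a", "b", "a"],
    "y_m_d": ["2019-01-05", "2019-03-09", "2020-02-11"],
})

def counts(fill_value):
    """Count records per year and category, filling empty cells with fill_value."""
    return df.pivot_table(index="year", columns="category", values="y_m_d",
                          aggfunc="count", fill_value=fill_value)

with_string = counts("0")   # empty cells become the text "0"
with_number = counts(0)     # empty cells become the number 0

# Only numeric columns are drawn by DataFrame.plot()
numeric_with_string = list(with_string.select_dtypes("number").columns)
numeric_with_number = list(with_number.select_dtypes("number").columns)
print(numeric_with_string, numeric_with_number)

# The crosstab alternative gives integer counts directly
crosstab = pd.crosstab(df["year"], df["category"])
```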

Plotting multiple time series from pandas dataframe

I have a pandas dataframe loaded from file in the following format:
ID,Date,Time,Value1,Value2,Value3,Value4
0063,04/21/2020,11:22:55,0.0347,0.41,1440,10.5
0064,04/21/2020,11:22:56,0.0355,0.41,1440,10.4
...
9849,04/22/2020,10:46:19,0.058,1.05,1460,10.6
I have tried multiple methods of plotting a line graph of each value vs date/time or a single graph with multiple subplots with limited success. I am hoping someone with much more experience may have an elegant solution to try as opposed to my blind swinging. Note that the dataset may have large breaks in time between days.
Thanks!
Parsing dates during the import of the pandas dataframe seemed to be my biggest issue. Once I added parse_dates to pd.read_csv, I was able to define the dt column and plot with matplotlib as expected.
df = pd.read_csv(input_text, parse_dates = [["Date", "Time"]])
dt = df["Date_Time"]
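Combining columns through a nested parse_dates list is deprecated in recent pandas releases (since 2.2), so an alternative sketch is to read the file normally and build the timestamp with pd.to_datetime. The two sample rows come from the question; the plot_values helper is a hypothetical name added for illustration and needs matplotlib installed.

```python
import io
import pandas as pd

# Two sample rows in the format from the question
input_text = io.StringIO(
    "ID,Date,Time,Value1,Value2,Value3,Value4\n"
    "0063,04/21/2020,11:22:55,0.0347,0.41,1440,10.5\n"
    "0064,04/21/2020,11:22:56,0.0355,0.41,1440,10.4\n"
)
df = pd.read_csv(input_text)

# Join the two text columns, then parse them with an explicit format
df["Date_Time"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%m/%d/%Y %H:%M:%S"
)
df = df.set_index("Date_Time")

def plot_values(frame):
    """Draw one subplot per Value column (hypothetical helper, needs matplotlib)."""
    value_columns = [c for c in frame.columns if c.startswith("Value")]
    return frame[value_columns].plot(subplots=True, figsize=(10, 8))
```

Calling plot_values(df) should give one line plot per value column against the combined date and time.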

OHLC python chart

I'm new to pandas and matplotlib and I'm trying to code some algorithmic trading.
I bought this course, and now I understand more, but it does not include complete sample code for an intraday OHLC chart.
I have other problems too: my native language is not English, and there is no quality material in Spanish about these libraries.
All the material I found online only plots daily charts and is based on matplotlib.finance, which is now deprecated; its current replacement is mplfinance.
I need sample code to chart the CSV file in seconds, minutes, hours and days.
I really have tried, and I'm not a lazy person, but it is taking a lot of time just to plot that chart, and the course does not cover my requirement.
Here is the CSV file for Alibaba (BABA) with 1-second, 5-second, 15-second, 30-second and 1-minute OHLC data.
My data
MPLFINANCE
You can use mplfinance. I tried it and it worked; here is the sample code.
Note: you need to rename the columns in your source data so that Open, High, Low and Close start with an uppercase letter.
import mplfinance as mpf
import pandas as pd
data = pd.read_csv('NYSE_BABA, 5s.csv', index_col=0)
data.index = pd.to_datetime(data.index)
mpf.plot(data,type='candle')
The candlesticks are difficult to see because of the short-interval data, but you get the idea. Hope it helps!
PLOTLY
You might want to consider Plotly for a nicer visualization.
import plotly.graph_objects as go
import pandas as pd
data = pd.read_csv('NYSE_BABA, 5s.csv')
data['time'] = pd.to_datetime(data['time'], unit='s')
fig = go.Figure(data=[go.Candlestick(x=data['time'],
open=data['Open'],
high=data['High'],
low=data['Low'],
close=data['Close'])])
fig.show()
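Neither answer covers the request for minute, hour and day charts. One way to get them from second-level data, sketched here on hypothetical 5-second bars rather than the real BABA file, is to resample with pandas before plotting.

```python
import pandas as pd

# Hypothetical 5-second bars standing in for the BABA file
index = pd.date_range("2020-01-02 09:30:00", periods=24,
                      freq=pd.Timedelta(seconds=5))
opens = [100.0 + i for i in range(24)]
data = pd.DataFrame({
    "Open": opens,
    "High": [o + 1.0 for o in opens],
    "Low": [o - 1.0 for o in opens],
    "Close": [o + 0.5 for o in opens],
}, index=index)

# How several short bars collapse into one longer bar
ohlc_rules = {"Open": "first", "High": "max", "Low": "min", "Close": "last"}

# 24 five-second bars become two one-minute bars
minute_bars = data.resample("1min").agg(ohlc_rules).dropna()
print(minute_bars)
```

The resampled frame keeps the datetime index and the capitalised column names, so it can be passed to mpf.plot(minute_bars, type='candle') in the same way as above. Other rule strings, such as "D" for daily bars, follow the same pattern.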

Plotly date formatting issue for pandas dataframe

I am using the Plotly python API to upload a pandas dataframe that contains a date column (which is generated via pandas date_range function).
If I look at the pandas dataframe locally, the date formatting is as I'd expect, i.e. YYYY-MM-DD. However, when I view it in Plotly I see it in the form YYYY-MM-DD HH:MM:SS. I really don't need this level of precision, and such a wide column causes formatting issues when I try to fit in all the other columns that I want.
Is there a way to prevent Plotly from re-formatting the pandas dataframe?
A basic example of my current approach looks like:
import plotly.plotly as py
from plotly.tools import FigureFactory as FF
import pandas as pd
dates = pd.date_range('2016-01-01', '2016-02-01', freq='D')
df = pd.DataFrame(dates)
table = FF.create_table(df)
py.plot(table, filename='example table')
It turns out that this problem wasn't solvable - Plotly just happened to treat datetimes in that way.
This has since been updated (fixed) - you can read more here.
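A possible workaround on the pandas side, given as an assumption and not tested against the Plotly table API, is to turn the dates into YYYY-MM-DD strings before building the table, so there is no datetime left to reformat. The column name 'date' is hypothetical.

```python
import pandas as pd

dates = pd.date_range('2016-01-01', '2016-02-01', freq='D')
df = pd.DataFrame({'date': dates})

# Replace the datetime values with fixed-format text
df['date'] = df['date'].dt.strftime('%Y-%m-%d')
print(df.head(2))
```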
