pandas dataset OverflowError when trying to use datetime data


Continuation from: Getting date/time and data out of csv into matplotlib
import datetime
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import pandas
import StringIO
f = open(r'clean data.csv')
#Make a string buffer and read in the CSV file while stripping \x00's
output = StringIO.StringIO()
for x in f.readlines():
    output.write(x.replace('\x00',''))
#Close out input file
f.close()
#Set position back to start for pandas to read
output.seek(0)
df = pandas.read_csv(output, skiprows=38, parse_dates=['Time'], index_col="Time")
fig, ax = plt.subplots()
ax.plot(df.index,df['108 <Air> (C)'])
#ax.xaxis.set_major_locator(mdates.DayLocator())
#ax.format_xdata = mdates.DateFormatter('%Y-%m-%d')
#fig.autofmt_xdate()
plt.show()
So I can plot this data with the current code. The problem occurs when I try to continue with this example: https://matplotlib.org/gallery/api/date.html#sphx-glr-gallery-api-date-py
If I uncomment
ax.xaxis.set_major_locator(mdates.DayLocator())
I get
OverflowError: Python int too large to convert to C long
What's up with that?
Here is some input data: https://pastebin.com/SSZyaSJ4
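One thing worth checking, offered as an assumption rather than a confirmed diagnosis: if parse_dates could not interpret the Time column, the index may still hold plain strings, and a date locator applied to such an axis can misbehave. The sketch below uses made-up rows and a hypothetical timestamp format (the real file's format may differ) to show how to confirm that the index holds real timestamps and how to parse it with an explicit format.

```python
import pandas as pd

# Hypothetical rows; the timestamp format in the real file may differ.
raw = pd.DataFrame({
    "Time": ["02/20/2018 11:38:18:833", "02/21/2018 11:38:18:833"],
    "108 <Air> (C)": [21.5, 21.7],
})


def ensure_datetime_index(frame, column, fmt):
    """Return a copy indexed by timestamps parsed with an explicit format."""
    out = frame.copy()
    out[column] = pd.to_datetime(out[column], format=fmt)
    return out.set_index(column)


df = ensure_datetime_index(raw, "Time", "%m/%d/%Y %H:%M:%S:%f")

# A datetime64 dtype here means matplotlib receives real dates.
print(df.index.dtype)
```

If df.index.dtype prints object instead of a datetime64 dtype on the real data, the date parsing step is the first thing to fix before adding a DayLocator.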

Related

How to read only part of a CSV file?

I have the following code in which I read CSV files and get a graph plotted:
import numpy as np
import matplotlib.pyplot as plt
import scipy.odr
from scipy.interpolate import interp1d
plt.rcParams["figure.figsize"] = (15,10)
def readPV(filename="HE3.csv",d=32.5e-3):
    t=np.genfromtxt(fname=filename,delimiter=',',skip_header=1, usecols=0)
    P=np.genfromtxt(fname=filename,delimiter=',',skip_header=1, usecols=1)
    V=np.genfromtxt(fname=filename,delimiter=',',skip_header=1, usecols=2,filling_values=np.nan)
    V=V*np.pi*(d/2)**2
    Vi= interp1d(t[~np.isnan(V)],V[~np.isnan(V)],fill_value="extrapolate")
    V=Vi(t)
    return P,V,t
P,V,t=readPV(filename="HE3.csv")
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(V,P,'ko')
ax.set_xlabel("Volume")
ax.set_ylabel("Pressure")
plt.show()
From this code, the following graph is made:
The CSV file has several data points in one column, separated by commas; I want to know how to pick a range of columns to read, instead of all of them.
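A minimal sketch of reading only part of a file with np.genfromtxt: usecols selects columns and max_rows limits how many data rows are read. The inline data below is made up for illustration and only mimics the t, P, V layout assumed from the code above.

```python
import io

import numpy as np

# Made-up stand-in for HE3.csv: a header row, then t, P, V columns.
csv_text = """t,P,V
0.0,1.0,10.0
1.0,1.1,
2.0,1.2,9.0
3.0,1.3,8.5
"""

# Read only columns 0 and 1, and only the first three data rows.
subset = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,
    usecols=(0, 1),
    max_rows=3,
)
t, P = subset[:, 0], subset[:, 1]
print(subset.shape)
```

With a real file, pass the filename instead of the io.StringIO object. Reading several columns in one call also avoids parsing the file once per column.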

How to plot dates from a csv file?

I am new to Python and I am having trouble plotting the dates from a CSV file.
The code is the following:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from pandas import DataFrame
import matplotlib.pyplot as plt
df = pd.read_csv(r"file.csv",index_col=0)
print(df.describe())
BHSI_cycle, BHSI_trend = sm.tsa.filters.hpfilter(df['BHSI-TCA'])
df['BHSI_trend'] = BHSI_trend
df['BHSI_cycle'] = BHSI_cycle
BHSI_plot = df[['BHSI-TCA','BHSI_trend']].plot(figsize=(12,10))
plt.show(BHSI_plot)
BHSI_plot2 = df[['BHSI_cycle']].plot(figsize=(12,10))
plt.show(BHSI_plot2)
And the CSV file is:
Date BHSI-TCA
23/05/2006 14821
25/05/2006 14878
30/05/2006 14837
How can I plot the dates?

Try parsing the dates properly when you import from CSV. Your dates are in the index column and are written day-first (DD/MM/YYYY), so:
df = pd.read_csv(r"file.csv", index_col=0, parse_dates=True, dayfirst=True)
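As a rough illustration, the sketch below parses the three sample rows from the question with an explicit day-first format. The comma separator and the inline io.StringIO data are assumptions made so the example is self-contained; the real file may be delimited differently.

```python
import io

import pandas as pd

# The sample rows from the question, assumed here to be comma-separated.
csv_text = "Date,BHSI-TCA\n23/05/2006,14821\n25/05/2006,14878\n30/05/2006,14837\n"

df = pd.read_csv(io.StringIO(csv_text))

# An explicit format removes any ambiguity between day-first and month-first.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.set_index("Date")

print(df.index)
```

Once the index is a DatetimeIndex, df['BHSI-TCA'].plot() puts the dates on the x-axis.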

Plotting 2 columns from multiple csv files from NASDAQ in a directory

I am trying to write a program that reads multiple CSV files in a directory. The files have been downloaded from http://www.nasdaqomxnordic.com/aktier/historiskakurser
The first row is sep= and it is skipped. The separator is ';'.
The problem is that even though the data from all the CSV files is printed, I get only blank plots.
The idea is to show a plot of the data in column 6 with the date (column 0) as the x-axis, one CSV file at a time, until every file in the given directory has been processed.
I would prefer only the name of the CSV file (the paper) as the title. Right now I get the directory/CSV name.
It seems that matplotlib does not read the CSV file correctly even though the data is printed.
My code looks like this:
import pandas as pd
#import csv
import glob
import matplotlib.pyplot as plt
#from matplotlib.dates import date2num
import pylab
#import numpy as np
#from matplotlib import style
ferms = glob.glob("OMX-C20_ScrapeData_Long_Name/*.csv")
for ferm in ferms:
    print(ferm)
    # define the dataframe
    data = pd.read_csv(ferm, skiprows=[0], encoding='utf-8', sep=';', header=0)
    print(data)
    data.head()
    pylab.rcParams['figure.figsize'] = (25, 20)
    plt.rcParams['figure.dpi'] = 80
    plt.rcParams['legend.fontsize'] = 'medium'
    plt.rcParams['figure.titlesize'] = 'large'
    plt.rcParams['figure.autolayout'] = 'true'
    plt.rcParams['xtick.minor.visible'] = 'true'
    plt.xlabel('Date')
    plt.ylabel('Closing price')
    plt.title(ferm)
    plt.show()
I have tried some other ways to open the csv files but the result is the same. No curves.
Hope one of you experienced guys can give a hint.

I have made a few additions to your code. I downloaded a single file from the page you linked and ran the code below. Change ferms back to your own path and add the for loop again. One reason you weren't getting anything is that you never plot the data. You set the labels, title and rcParams, but nowhere in your code do you tell Python to plot the data.
Secondly, even with a plotting command added, it still wouldn't plot correctly because neither Date nor Closing price is in a numeric format. I convert the Date column to datetime. Your Closing price is a string containing a comma, which could be either a thousands separator or a decimal separator. I have assumed it is a decimal separator, although it is more likely a thousands separator. I convert the column to numeric with a self-defined function called to_num, applied with the apply method of the pandas column. It strips whitespace and replaces the comma with a decimal point.
import pandas as pd
#import csv
import glob
import matplotlib.pyplot as plt
#from matplotlib.dates import date2num
import pylab
#import numpy as np
#from matplotlib import style
ferm = glob.glob("Downloads/trial/*.csv")[0]
def to_num(inpt_string):
    nums = [x.strip() for x in inpt_string.split()]
    return float(''.join(nums).replace(',', '.'))
# print(ferm)
data = pd.read_csv(ferm, skiprows=[0], encoding='utf-8', sep=';', header=0)
data['Date'] = pd.to_datetime(data['Date'])
data['Closing price'] = data['Closing price'].apply(to_num)
# print(data)
# data.head()
pylab.rcParams['figure.figsize'] = (25, 20)
plt.rcParams['figure.dpi'] = 80
plt.rcParams['legend.fontsize'] = 'medium'
plt.rcParams['figure.titlesize'] = 'large'
plt.rcParams['figure.autolayout'] = 'true'
plt.rcParams['xtick.minor.visible'] = 'true'
plt.xlabel('Date')
plt.ylabel('Closing price')
plt.title(ferm)
plt.plot(data.loc[:,'Date'], data.loc[:,'Closing price']) # this line plots the data
plt.show()
EDIT
Maintaining the same code structure as yours -
import pandas as pd
#import csv
import glob
import matplotlib.pyplot as plt
#from matplotlib.dates import date2num
import pylab
#import numpy as np
#from matplotlib import style
ferms = glob.glob("OMX-C20_ScrapeData_Long_Name/*.csv")
def to_num(inpt_string):
    nums = [x.strip() for x in inpt_string.split()]
    return float(''.join(nums).replace(',', '.'))
for ferm in ferms:
    data = pd.read_csv(ferm, skiprows=[0], encoding='utf-8', sep=';', header=0)
    data['Date'] = pd.to_datetime(data['Date'])
    data['Closing price'] = data['Closing price'].apply(to_num) # change to numeric
    # print(data)
    # data.head()
    pylab.rcParams['figure.figsize'] = (25, 20)
    plt.rcParams['figure.dpi'] = 80
    plt.rcParams['legend.fontsize'] = 'medium'
    plt.rcParams['figure.titlesize'] = 'large'
    plt.rcParams['figure.autolayout'] = 'true'
    plt.rcParams['xtick.minor.visible'] = 'true'
    plt.xlabel('Date')
    plt.ylabel('Closing price')
    plt.title(ferm)
    plt.plot(data.loc[:,'Date'], data.loc[:,'Closing price'])
    plt.show()
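Two details from the question can also be handled at read time, sketched here under assumptions. The helper paper_name is a hypothetical name introduced for illustration; it keeps only the file name without directory or extension, for use as the title. The decimal=',' option of pd.read_csv is an alternative to to_num only if the comma really is a decimal separator, which the answer above merely assumes. The inline sample, including the file name CARL-B.csv, is made up and is not the real NASDAQ layout.

```python
import io
import os

import pandas as pd


def paper_name(path):
    """Hypothetical helper: file name without directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


# Made-up sample imitating the described layout: a 'sep=' first row,
# ';' as separator, and a comma used as the decimal mark.
sample = "sep=;\nDate;Closing price\n2018-02-20;123,45\n2018-02-21;124,00\n"

data = pd.read_csv(
    io.StringIO(sample),
    skiprows=[0],
    sep=";",
    decimal=",",
    parse_dates=["Date"],
)

title = paper_name("OMX-C20_ScrapeData_Long_Name/CARL-B.csv")
print(title)
print(data.dtypes)
```

Inside the loop, plt.title(paper_name(ferm)) would then show only the paper name.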

Plotting data from multiple pandas data frames in one plot

I am interested in plotting a time series with data from several different pandas data frames. I know how to plot data for a single time series and I know how to do subplots, but how would I plot data from several different data frames in a single plot? My code is below. Basically, I scan through a folder of JSON files and parse each JSON file into a pandas data frame so that I can plot it. When I run this code it only plots from one of the data frames instead of all ten that are created. I know that ten data frames are created because I have a print statement to check that they are all correct.
import sys, re
import numpy as np
import smtplib
import matplotlib.pyplot as plt
from random import randint
import csv
import pylab as pl
import math
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import argparse
import matplotlib.patches as mpatches
import os
import json
parser = argparse.ArgumentParser()
parser.add_argument('-file', '--f', help = 'folder where JSON files are stored')
if len(sys.argv) == 1:
    parser.print_help()
    sys.exit(1)
args = parser.parse_args()
dat = {}
i = 0
direc = args.f
directory = os.fsencode(direc)
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
for files in os.listdir(direc):
    filename = os.fsdecode(files)
    if filename.endswith(".json"):
        path = '/Users/Katie/Desktop/Work/' + args.f + "/" +filename
        with open(path, 'r') as data_file:
            data = json.load(data_file)
        for r in data["commits"]:
            dat[i] = (r["author_name"], r["num_deletions"], r["num_insertions"], r["num_lines_changed"],
                      r["num_files_changed"], r["author_date"])
            name = "df" + str(i).zfill(2)
            i = i + 1
        name = pd.DataFrame.from_dict(dat, orient='index').reset_index()
        name.columns = ["index", "author_name", "num_deletions",
                        "num_insertions", "num_lines_changed",
                        "num_files_changed", "author_date"]
        del name['index']
        name['author_date'] = name['author_date'].astype(int)
        name['author_date'] = pd.to_datetime(name['author_date'], unit='s')
        ax1.plot(name['author_date'], name['num_lines_changed'], '*',c=np.random.rand(3,))
        print(name)
        continue
    else:
        continue
plt.xticks(rotation='35')
plt.title('Number of Lines Changed vs. Author Date')
plt.show()

Quite straightforward actually. Don't let pandas confuse you. Underneath it every column is just a numpy array.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(df1['A'])
ax1.plot(df2['B'])
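The same idea extends to any number of data frames: create the Axes once, then call ax.plot inside the loop. The sketch below uses made-up frames whose column names are borrowed from the question. The Agg backend line is only there so that it runs without a display; drop it when working interactively.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; omit when working interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-ins for the data frames built from the JSON files.
frames = {
    "df{:02d}".format(i): pd.DataFrame({
        "author_date": pd.date_range("2017-01-01", periods=5, freq="D"),
        "num_lines_changed": rng.integers(0, 100, size=5),
    })
    for i in range(3)
}

# One figure and one Axes, created before the loop and shared by all frames.
fig, ax = plt.subplots()
for label, frame in frames.items():
    ax.plot(frame["author_date"], frame["num_lines_changed"], "*", label=label)

ax.legend()
fig.autofmt_xdate()  # rotate the date labels
print(len(ax.lines))
```

Each call to ax.plot adds one line to the shared Axes, so every data frame ends up in the same plot with its own legend entry.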

The pd.DataFrame.plot method has an ax argument for this:
fig = plt.figure()
ax = plt.subplot(111)
df1['Col1'].plot(ax=ax)
df2['Col2'].plot(ax=ax)

If you are using pandas plot, the return value of DataFrame.plot is the axes, so you can pass that axes to the next DataFrame.plot call.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Frame 1':np.arange(5)*2},index=np.arange(5))
df2 = pd.DataFrame({'Frame 2':np.arange(5)*.5},index=np.arange(5))
ax = df1.plot(label='df1')
df2.plot(ax=ax)
Or if your dataframes have the same index, you can use pd.concat:
pd.concat([df1,df2],axis=1).plot()

Trust me, @omdv's answer is the only solution I have found so far. In my case the pandas DataFrame plot function doesn't show the plot at all when I pass ax to it.
df_hdf = pd.read_csv(f_hd, header=None,names=['degree', 'rank', 'hits'],
dtype={'degree': np.int32, 'rank': np.float32, 'hits': np.float32})
df_hdf_pt = pd.read_csv(pt_f_hd, header=None,names=['degree', 'rank', 'hits'],
dtype={'degree': np.int32, 'rank': np.float32, 'hits': np.float32})
ax = plt.subplot()
ax.plot(df_hdf_pt['hits'])
ax.plot(df_hdf['hits'])

Title not appearing in pdf

I am iterating through the files in a folder, and for each file I am plotting the close_price on the x-axis and the date on the y-axis.
Here is the code. Everything works fine except that I want the title "abc" to appear on each page, but it does not show up. What am I doing wrong here?
import os
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
import matplotlib.pyplot as plt
pp = PdfPages('multipage.pdf')
pth = "D:/Technical_Data/"
for fle in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, fle),usecols=(0, 4))
    if not df.empty:
        df=df.astype(float)
        plt.title("abc")
        df.plot()
        pp.savefig()
pp.close()

You should pass the title as an argument of the plot() method, like:
import os
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd
import matplotlib.pyplot as plt
pp = PdfPages('multipage.pdf')
pth = "D:/Technical_Data/"
for fle in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, fle),usecols=(0, 4))
    if not df.empty:
        df=df.astype(float)
        df.plot(title="abc")
        pp.savefig()
pp.close()
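Another way would be to put plt.title("abc") after df.plot(). In your code the title is set before df.plot() is called, and df.plot() most likely draws on a new figure, so the title "abc" ends up on a different figure from the one that pp.savefig() writes to the PDF.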
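A minimal sketch of a variant that keeps an explicit handle on each figure, so the title and the saved page always refer to the same Axes. The frames, their names and the temporary output path are made up for illustration. The Agg backend line only makes the sketch runnable without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend; omit when working interactively
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# Made-up stand-ins for the CSV files in the folder.
frames = {
    "abc": pd.DataFrame({"close": [1.0, 2.0, 3.0]}),
    "def": pd.DataFrame({"close": [3.0, 2.0, 1.0]}),
}

pdf_path = os.path.join(tempfile.mkdtemp(), "multipage.pdf")
titles = []

with PdfPages(pdf_path) as pp:
    for name, frame in frames.items():
        fig, ax = plt.subplots()   # one figure per page
        frame.plot(ax=ax)          # draw on that figure's Axes
        ax.set_title(name)         # title the same Axes
        titles.append(ax.get_title())
        pp.savefig(fig)            # save exactly this figure
        plt.close(fig)             # free the figure once it is saved
    page_count = pp.get_pagecount()

print(page_count, titles)
```

Closing each figure after saving also keeps memory use flat when the folder contains many files.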
