pd.read_csv fails after converting the timezone - python
So after I converted the UTC timezone in the Time column of my dataframe and saved it to a new csv file, I decided to draw a time plot of frequency of tweets. My time plot was initially working when timezone was UTC but after being converted to Eastern, it gives me the error below. How should I fix it?
import pandas as pd
import matplotlib.pyplot as plt
time_interval = pd.offsets.Second(10)
fig, ax = plt.subplots(figsize=(6, 3.5))
ax = (
pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
.resample(time_interval, on='Time')['ID']
.count()
.plot.line(ax=ax)
)
plt.show()
And the error is:
/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
File "/scratch2/debate_tweets/temporal_analysis.py", line 18, in <module>
pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'])
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 411, in _read
data = parser.read(nrows)
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Process finished with exit code 1
converted_timezone_tweets.csv looks like this:
,Candidate,ID,Time,Username,Tweet
0,Clinton,788948653016842240,2016-10-19 23:43:11-04:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics
1,Clinton,788948666501464064,2016-10-19 23:43:14-04:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
2,Clinton,788948673594097664,2016-10-19 23:43:16-04:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
3,Both,788948662881751040,2016-10-19 23:43:13-04:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain
4,Both,788948675313696769,2016-10-19 23:43:16-04:00,erwoti,Can't wait to hear #realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President #debatenight #ChrisWallace
5,Clinton,788948671756955650,2016-10-19 23:43:15-04:00,isaac_urner,"The Clinton campaign already has redirecting to their site. That's what a real campaign looks like.
#badhombres2016"
Same code works for valid_tweets.csv and creates a plot like below:
valid_tweets.csv lines look like:
Candidate,ID,Time,Username,Tweet
Clinton,788948653016842240,2016-10-20 03:43:11+00:00,Tamayo_castle,Hillary Clinton dresses as Christian Bale at the debate via /r/pics
Clinton,788948666501464064,2016-10-20 03:43:14+00:00,ThinkCenter1968,"It's like I told my kids, a reason U don't want 2 vote 4 Hillary is U want the inheritance I'm leaving U, Right? They changed their minds!"
Clinton,788948673594097664,2016-10-20 03:43:16+00:00,21stCenRevolt,When hearing about Saudi Arabia murdering people for being gay. Hillary laughed with glee. She disgusting and disgraceful. #debatenight
Both,788948662881751040,2016-10-20 03:43:13+00:00,mikeywan,MEGYN IS A PAID HILLARY WHORE #TrumpPence2016 #TrumpTrain
Both,788948675313696769,2016-10-20 03:43:16+00:00,erwoti,Can't wait to hear #realDonaldTrump call that Nasty Woman (Hillary Clinton) - Madam President #debatenight #ChrisWallace
Clinton,788948671756955650,2016-10-20 03:43:15+00:00,isaac_urner,"The Clinton campaign already has redirecting to their site. That's what a real campaign looks like.
#badhombres2016"
Update:
in my first file I have:
import pandas as pd
import matplotlib.pyplot as plt
#2016-10-20 03:43:11+00:00
tweets_df = pd.read_csv('valid_tweets.csv')
tweets_df['Time'] = pd.Index(pd.to_datetime(tweets_df['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern')
tweets_df.to_csv('converted_timezone_tweets.csv', index=False)
In my second file I have:
import pandas as pd
import matplotlib.pyplot as plt
time_interval = pd.offsets.Second(10)
fig, ax = plt.subplots(figsize=(6, 3.5))
ax = (
pd.read_csv('converted_timezone_tweets.csv', engine='python', parse_dates=['Time'])
.resample(time_interval, on='Time')['ID']
.count()
.plot.line(ax=ax)
)
plt.show()
After using the engine='python' as in one of the answers, I get this error:
/scratch/sjn/anaconda/bin/python /scratch2/debate_tweets/temporal_analysis.py
Traceback (most recent call last):
File "/scratch2/debate_tweets/temporal_analysis.py", line 11, in <module>
.resample(time_interval, on='Time')['ID']
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 4729, in resample
base=base, key=on, level=level)
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 969, in resample
return tg._get_resampler(obj, kind=kind)
File "/scratch/sjn/anaconda/lib/python3.6/site-packages/pandas/core/resample.py", line 1091, in _get_resampler
"but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
Process finished with exit code 1
I did a vimdiff of the first 5 lines of each csv and this is what I get:
It seems the error comes from using the C engine to parse the CSV. I'm not knowledgeable enough to know why that might be, but a possible workaround is to force pd.read_csv() to use the Python engine by passing the engine='python' argument. As per the pandas documentation, pd.read_csv() defaults to the C engine for speed. Given that your error hints at a problem with the C engine, that might be a good place to start. So try pd.read_csv('converted_timezone_tweets.csv', parse_dates=['Time'], engine='python'). There was also something on GitHub hinting at similar problems and fixes.
Per the comment, this code
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
print(df1.loc[mask, 'Time'])
prints
9941 None
27457 None
27458 None
...
This implies there are a number of entries in converted_timezone_tweets.csv whose Time field is the string 'None'.
You might want to go back and investigate what these values were in your original CSV:
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
mask = pd.isnull(pd.to_datetime(df1['Time'], errors='coerce'))
tweets_df = pd.read_csv('valid_tweets.csv')
print(tweets_df.loc[mask, 'Time'])
If there is no Time data for these tweets perhaps the most sensible thing to do is throw them away since we can't classify what time interval they belong to.
With mask redefined as pd.notnull(df1['Time']), df1 = df1.loc[mask, :] keeps only the rows with a valid Time and so removes the offending ones:
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]
df1 = df1.set_index('Time')
counts = df1.resample('10S')['ID'].count()
fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()
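As a minimal sketch of the coerce-and-drop step, the snippet below runs the same idea on a small in-memory CSV instead of the real file. The sample rows and the csv_text name are made up for illustration, and utc=True is an added assumption so the parsed column has one consistent dtype regardless of pandas version.

```python
import io

import pandas as pd

# Hypothetical stand-in for converted_timezone_tweets.csv.
# The third row has the literal text None in its Time field.
csv_text = """ID,Time
1,2016-10-19 23:43:11-04:00
2,2016-10-19 23:43:14-04:00
3,None
4,2016-10-19 23:43:26-04:00
"""

df1 = pd.read_csv(io.StringIO(csv_text))

# errors='coerce' turns anything unparseable (or already missing) into NaT
# instead of raising. utc=True is an assumption made here for a uniform dtype.
df1["Time"] = pd.to_datetime(df1["Time"], errors="coerce", utc=True)

# Flag the rows without a usable timestamp, then keep only the valid ones.
bad = df1["Time"].isna()
df1 = df1.loc[~bad]

# Count tweets per 10-second bin, as in the plotting code.
counts = df1.resample(pd.offsets.Second(10), on="Time")["ID"].count()
print(counts)
```

Only the three rows with a real timestamp reach resample, so it no longer sees a plain object-dtype column.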
To avoid parsing errors, we call pd.read_csv (above) without setting the parse_dates parameter. So pd.read_csv returns a DataFrame whose Time column contains date strings:
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
# ID Time
# 0 5 2016-10-19 23:43:00-04:00
# 1 5 2016-10-19 23:43:05-04:00
# 2 5 2016-10-19 23:43:10-04:00
# 3 5 2016-10-19 23:43:15-04:00
# ...
We then use pd.to_datetime to parse the date strings into datetimes.
pd.to_datetime parses the date strings by converting them to UTC while taking timezone offsets into account. In the pandas version used here, the resulting datetimes are naive -- no timezone information is attached (newer releases may instead keep the offset and return timezone-aware values). This behavior is derived from the underlying NumPy datetime64[ns] data type used by Pandas to represent datetimes.
Therefore, to make the datetimes once again timezone-aware, you would need to call tz_localize/tz_convert again:
df1['Time'] = pd.Index(df1['Time']).tz_localize('UTC').tz_convert('US/Eastern')
But this also shows there was nothing gained by calling tz_convert the first time and storing the result in converted_timezone_tweets.csv.
So a better solution (which does not require calling tz_convert after loading converted_timezone_tweets.csv) is to write converted_timezone_tweets.csv without the timezone offset. You can drop the timezone offset by calling tz_localize(None):
df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
N = 10
df = pd.DataFrame({'Time':pd.date_range('2016-10-20 03:43:00', periods=N, freq='5S'), 'ID':np.random.randint(N)})
df1 = df.copy()
df1['Time'] = pd.Index(pd.to_datetime(df1['Time'], utc=True)).tz_localize('UTC').tz_convert('US/Eastern').tz_localize(None)
df1.to_csv('converted_timezone_tweets.csv', index=False)
df1 = pd.read_csv('converted_timezone_tweets.csv', engine='python')
df1['Time'] = pd.to_datetime(df1['Time'], errors='coerce')
mask = pd.notnull(df1['Time'])
df1 = df1.loc[mask, :]
df = df.set_index('Time')
df1 = df1.set_index('Time')
counts1 = df1.resample('10S')['ID'].count()
counts = df.resample('10S')['ID'].count()
fig, ax = plt.subplots(figsize=(6, 3.5), nrows=2)
counts.plot.line(ax=ax[0])
counts1.plot.line(ax=ax[1])
plt.show()
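The round trip can also be checked without touching the disk. This is a small sketch under a few assumptions: the two timestamps are sample values, the CSV is written to an in-memory buffer, and 'America/New_York' is used as the canonical name for the zone that 'US/Eastern' refers to.

```python
import io

import pandas as pd

# Two sample UTC timestamps (illustrative values only).
utc_times = pd.to_datetime(
    ["2016-10-20 03:43:11+00:00", "2016-10-20 03:43:14+00:00"], utc=True
)

# Convert to Eastern wall-clock time, then drop the offset with tz_localize(None).
eastern_naive = utc_times.tz_convert("America/New_York").tz_localize(None)
df = pd.DataFrame({"ID": [1, 2], "Time": eastern_naive})

# Write to an in-memory buffer instead of a file and read it straight back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["Time"])

print(csv_text)
print(reloaded.dtypes)
```

Because the written strings carry no offset, parse_dates can read them back directly as naive local datetimes, with no second tz_convert needed.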
Note that it might be more appealing to store all time-related data in UTC rather than with respect to some other local timezone. That way, if you have many CSV files, you do not have to keep track of which timezone the time data is relative to. From this point of view, it would be preferable to keep valid_tweets.csv, drop converted_timezone_tweets.csv, and do the conversion to US/Eastern only when necessary:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('valid_tweets.csv')
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
mask = pd.notnull(df['Time'])
df = df.loc[mask, :]
df['Time'] = pd.Index(df['Time']).tz_localize('UTC').tz_convert('US/Eastern')
df = df.set_index('Time')
counts = df.resample('10S')['ID'].count()
fig, ax = plt.subplots(figsize=(6, 3.5))
counts.plot.line(ax=ax)
plt.show()
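One way to see why converting late is safe: the per-interval counts do not depend on the display timezone, only the bin labels do. The sketch below uses made-up timestamps and 'America/New_York' (the canonical name for the 'US/Eastern' zone) as assumptions.

```python
import pandas as pd

# Six sample tweet times stored in UTC (illustrative values only).
utc_index = pd.to_datetime(
    [
        "2016-10-20 03:43:11+00:00",
        "2016-10-20 03:43:14+00:00",
        "2016-10-20 03:43:16+00:00",
        "2016-10-20 03:43:21+00:00",
        "2016-10-20 03:43:25+00:00",
        "2016-10-20 03:43:38+00:00",
    ],
    utc=True,
)
df = pd.DataFrame({"ID": range(6)}, index=utc_index)

ten_seconds = pd.offsets.Second(10)

# Count per 10-second bin with the index left in UTC ...
counts_utc = df.resample(ten_seconds)["ID"].count()

# ... and again after converting the index to Eastern time just for display.
counts_eastern = df.tz_convert("America/New_York").resample(ten_seconds)["ID"].count()

print(counts_utc)
print(counts_eastern)
```

Both series hold the same counts; only the index labels shift from 03:43 UTC to 23:43 Eastern on the previous day.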
Related
How can i Parse my Date Column after getting Nasdaq dataset from yahoofinance in Python
I got live data from Yahoo Finance as follows:
ndx = yf.Ticker("NDX")
# get stock info
print(ndx.info)
# get historical market data
hist = ndx.history(period="1825d")
I downloaded it and exported it to a CSV file as follows:
# Download stock data then export as CSV
df = yf.download("NDX", start="2016-01-01", end="2022-11-02")
df.to_csv('ndx.csv')
I viewed the data as follows:
df = pd.read_csv("ndx.csv")
df
The data was displayed as seen in the picture:
THE PROBLEM: Any time I try to use the Date column, it throws KeyError: 'Date'. Here is my Auto ARIMA model and the error thrown. Please help.
ERROR THROWN
I want to be able to use the Date column. I tried parsing the Date column, but it throws the same error. I need help parsing the data first so that I can convert Date to a day format or string. Thanks
Always great to see people trying to learn financial analysis. Before I get into the solution, I just want to remind you to put your imports in your question (yfinance isn't always aliased as yf). Also make sure you type or copy/paste your code so that we can easily grab it and run it!
So, I am going to assume the variable "orig_df" is just the call to pd.read_csv('ndx.csv'), since that's what the screenshot looks like.
Firstly, always check the data types of your columns after reading in the file (assuming you are using Jupyter):
orig_df = pd.read_csv('ndx.csv')
orig_df.dtypes
Date is an object, which just means string in pandas. If orig_df is the actual call to yf.Ticker(...), then "Date" is your index, so it does not act like a column.
How to fix and run:
import pandas as pd
from statsmodels.api import tsa
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt, timedelta
# parse the Date column and use it as the index
orig_df = pd.read_csv('ndx.csv', parse_dates=['Date'], index_col=0)
model = tsa.arima.ARIMA(np.log(orig_df['Close']), order=(10, 1, 10))
fitted = model.fit()
fc = fitted.get_forecast(5)
fc = (fc.summary_frame(alpha=0.05))
fc_mean = fc['mean']
fc_lower = fc['mean_ci_lower']
fc_upper = fc['mean_ci_upper']
orig_df.iloc[-50:,:].plot(y='Close', title='Nasdaq 100 Closing price', figsize=(10, 6))
# call orig_df.index[-1] for most recent trading day, not just today
future_5_days = [orig_df.index[-1] + timedelta(days=x) for x in range(5)]
plt.plot(future_5_days, np.exp(fc_mean), label='mean_forecast', linewidth=1.5)
plt.fill_between(future_5_days, np.exp(fc_lower), np.exp(fc_upper), color='b', alpha=.1, label='95% confidence')
plt.title('Nasdaq 5 Days Forecast')
plt.legend(loc='upper left', fontsize=8)
plt.show()
Bug when indexing date column in Pandas
I'm trying to make pandas recognise the first column as a date.
import csv
import pandas as pd
import plotly.express as px
cl = open('cl.csv')
cl = pd.read_csv('CL.csv', parse_dates=['Date'], index_col=['Date'])
cl.info()
Then to visualise the price:
fig = px.line(cl, y="Adj Close", title='Crude Oil Price', labels = {'Adj Close':'Crude Oil Price(in USD)'})
But it gives back a ruined chart:
Date indexed chart
If I comment out parse_dates=['Date'], index_col=['Date'] and just leave cl = pd.read_csv('CL.csv'), the chart looks just fine.
Chart without date
What am I doing wrong here?
If you print cl out and the dates look fine, then the likely reason for the broken graph is that cl wasn't sorted by Date. Do the following before visualizing it:
cl = cl.sort_values('Date')
I think this problem can be caused by the date format in the 'Date' column. The documentation says the following: For non-standard datetime parsing, use pd.to_datetime after pd.read_csv. To parse an index or column with a mixture of timezones, specify date_parser to be a partially-applied pandas.to_datetime() with utc=True. See Parsing a CSV with mixed timezones for more. You could then replace
cl = pd.read_csv('CL.csv', parse_dates=['Date'], index_col=['Date'])
with
cl = pd.read_csv('CL.csv', parse_dates=['Date'], date_parser=lambda col: pd.to_datetime(col, utc=True))
Conversion RGB to xyY with colormath
With colormath I convert RGB values to xyY. It works fine for one RGB value, but I can't find the right code to do the conversion for multiple RGB values imported from an Excel file. I use the following code:
from colormath.color_objects import sRGBColor, xyYColor
from colormath.color_conversions import convert_color
import pandas as pd
data = pd.read_excel(r'C:/Users/User/Desktop/Color/Fontane/RGB/FontaneHuco.xlsx')
df = pd.DataFrame(data, columns=['R', 'G', 'B'])
#print(df)
rgb = sRGBColor(df['R'],df['G'],df['B'], is_upscaled=True)
xyz = convert_color(rgb, xyYColor)
print(xyz)
But when I run this code I receive the following error:
Traceback (most recent call last):
File "C:\Users\User\PycharmProjects\pythonProject4\Overige\Chroma.py", line 9, in <module>
lab = sRGBColor(df['R'], df['G'], df['B'])
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\colormath\color_objects.py", line 524, in __init__
self.rgb_r = float(rgb_r)
File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 141, in wrapper
raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'float'>
Does anyone have an idea how to fix this problem?
sRGBColor is expecting floats and you're giving it dataframe columns instead. You need to apply the conversion one row at a time, which can be done as follows:
xyz = df.apply(
lambda row: convert_color(
sRGBColor(row.R, row.G, row.B, is_upscaled=True), xyYColor
),
axis=1,
)
Python - Pandas - resample issue
I am trying to adapt a Pandas.Series with a certain frequency to a Pandas.Series with a different frequency. I therefore used the resample function, but it does not recognize, for instance, that 'M' is a subperiod of '3M', and it raises an error:
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_higher_freq = data_1.resample('3M', kind="Period").sum()
Raises the following exception:
Traceback (most recent call last):
File "/home/mitch/Programs/Infrastructure_software/Sandbox/spyderTest.py", line 15, in <module>
data_higher_freq = data_1.resample('3M', kind="Period").sum()
File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 758, in f
return self._downsample(_method, min_count=min_count)
File "/home/mitch/anaconda3/lib/python3.6/site-packages/pandas/core/resample.py", line 1061, in _downsample
'sub or super periods'.format(ax.freq, self.freq))
pandas._libs.tslibs.period.IncompatibleFrequency: Frequency <MonthEnd> cannot be resampled to <3 * MonthEnds>, as they are not sub or super periods
This seems to be due to the pd.tseries.frequencies.is_subperiod function:
import pandas as pd
pd.tseries.frequencies.is_subperiod('M', '3M')
pd.tseries.frequencies.is_subperiod('M', 'Q')
Indeed it returns False for the first command and True for the second.
I would really appreciate any hints about any solution. Thks.
Try changing from PeriodIndex to DatetimeIndex before resampling:
import pandas as pd
idx_1 = pd.period_range('2017-01-01', periods=6, freq='M')
data_1 = pd.Series(range(6), index=idx_1)
data_1.index = data_1.index.astype('datetime64[ns]')
data_higher_freq = data_1.resample('3M', kind='period').sum()
Output:
data_higher_freq
Out[582]:
2017-01 3
2017-04 12
Freq: 3M, dtype: int64
Lifelines boolean index in Python did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
This very simple piece of code,
# imports...
from lifelines import CoxPHFitter
import pandas as pd
src_file = "Pred.csv"
df = pd.read_csv(src_file, header=0, delimiter=',')
df = df.drop(columns=['score'])
cph = CoxPHFitter()
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
produces an error:
Traceback (most recent call last):
File "C:/Users/.../predictor.py", line 11, in <module>
cph.fit(df, duration_col='Length', event_col='Status', show_progress=True)
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 298, in fit
self._check_values(df)
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 323, in _check_values
cols = str(list(X.columns[low_var]))
File "C:\Users\...\AppData\Local\conda\conda\envs\hrpred\lib\site-packages\pandas\core\indexes\base.py", line 1754, in __getitem__
result = getitem(key)
IndexError: boolean index did not match indexed array along dimension 0; dimension is 88 but corresponding boolean dimension is 76
However, when I print df itself, everything's all right. As you can see, everything is inside the library. And the library's examples work fine.
Without knowing what your data look like - I had the same error, which was resolved when I removed all but the duration, event and coefficient(s) from the pandas df I was using. That is, I had a lot of extra columns in the df that were confusing the cox PH fitter since you don't actually specify which coef you want to include as an argument to cph.fit().