Over the last two weeks I had to plot different time series with different datetime formats, and converting those into one common format was no problem. Now I face a new challenge and am struggling to solve it. All the data (csv) I got from my colleagues had one specific field with both date and time inside, so I read it into a pandas DataFrame and reformatted the datetime column. Today I got new data to process from a different system, with two index columns: one for the date and a second one for the time. My problem is that those index columns form a MultiIndex (see below).
Old Data:
Datetime          Data
01/01/2021 00:00  0,15
01/01/2021 00:15  5,18
Datetime;Data
2021-01-01 00:15:00;1,829
2021-01-01 00:30:00;1,675
2021-01-01 00:45:00;1,501
New Data:
Date        Time   Data
01/01/2021  00:00  0,15
            00:15  5,18
Date; Time; Data
01/01/2021;00:15;71,04
;00:30;62,8
;00:45;73,2
;01:00;73,48
;01:15;66,8
;01:30;67,48
;01:45;71,12
;02:00;73,88
After reading this csv into a pandas DataFrame with the following code, I am not able to add the time-specific data to the existing data because the indexes are not equal.
obtain = pd.read_csv('csv/data.csv', sep=';', encoding='utf-8',
                     index_col=['Date', 'Time'],
                     names=['Date', 'Time', 'Data'],
                     dtype={'Date': 'string', 'Time': 'string', 'Data': 'float'},
                     decimal=',')
How do I reset the index of the new data to a single Index in a pandas dataframe as a datetime column?
I tried to just convert the index to datetime as follows:
obtain.index = pd.to_datetime(obtain.index.map(' '.join))
obtain.index = pd.to_datetime(obtain.index)
You can add the parse_dates parameter if the Date value is repeated on every row:
obtain = pd.read_csv('csv/data.csv',
                     sep=';',
                     encoding='utf-8',
                     index_col=['Date', 'Time'],
                     parse_dates=['Date', 'Time'],
                     names=['Date', 'Time', 'Data'],
                     dtype={'Data': 'float'},
                     decimal=',')
But if the Date values are left blank on subsequent rows:
obtain = pd.read_csv('csv/data.csv',
                     sep=';',
                     encoding='utf-8',
                     names=['Date', 'Time', 'Data'],
                     dtype={'Data': 'float'},
                     decimal=',')
obtain.index = pd.to_datetime(obtain.pop('Date').ffill() + ' ' + obtain.pop('Time'))
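A self-contained sketch of that second approach, with io.StringIO standing in for the hypothetical csv/data.csv; the dayfirst format string is an assumption based on the decimal-comma locale of the sample:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for csv/data.csv: semicolon-separated, decimal comma,
# Date filled only on the first row of the day
raw = StringIO(
    "01/01/2021;00:15;71,04\n"
    ";00:30;62,8\n"
    ";00:45;73,2\n"
)

obtain = pd.read_csv(raw, sep=';',
                     names=['Date', 'Time', 'Data'],
                     dtype={'Data': 'float'},
                     decimal=',')

# Forward-fill the blank dates, join date and time, then parse;
# the explicit dayfirst format avoids month/day ambiguity
obtain.index = pd.to_datetime(obtain.pop('Date').ffill() + ' ' + obtain.pop('Time'),
                              format='%d/%m/%Y %H:%M')
print(obtain)
```

The result is a single DatetimeIndex with one remaining Data column, which can then be aligned with the old data.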
Related
I parsed my CSV data with
data = pd.read_csv('Data.csv', parse_dates= True, index_col=6, date_parser = parser)
Then, when I try to access the Time column with something like data["Time"], I get a key access error. If I don't parse the dates in read_csv and instead parse them afterwards with data['Date'] = pd.to_datetime(data['Date'], format='%m/%d/%Y %H:%M:%S'), then my graphs don't automatically have the time on the x-axis when I only plot y. My end goal is to let the user select the time frame of the graph, and I'm having trouble because I can't access the Data column after I parse dates. Any help would be appreciated, thanks.
The sample CSV headers are these:
"Name","Date", "Data"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100
Given a CSV that looks like this:
Name,Date,Data
Data,05/13/2022 21:30:00,100
Data,05/14/2022 21:30:00,100
Data,05/15/2022 21:30:00,100
Data,05/16/2022 21:30:00,100
Note: no double quotes and no space after the comma delimiter
You have several options to load the data.
Below is the easiest way if the data is a time series (all dates in the Date column are different):
import pandas as pd
data = pd.read_csv("Data.csv", parse_dates=True, index_col="Date")
The above returns a dataframe with the Date column as a DatetimeIndex with a dtype of datetime64[ns] and is accessed via data.index.
Resulting dataframe:
Name Data
Date
2022-05-13 21:30:00 Data 100
2022-05-14 21:30:00 Data 100
2022-05-15 21:30:00 Data 100
2022-05-16 21:30:00 Data 100
You can then plot the data with a simple data.plot().
If you want to filter which data is plotted based on time, e.g. only the data on 05/14 and 05/15:
data[(data.index < "2022-05-16") & (data.index > "2022-05-13")].plot()
or
new_data = data[(data.index < "2022-05-16") & (data.index > "2022-05-15")]
new_data.plot()
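Because the index is a DatetimeIndex, partial-string slicing with .loc is an equivalent, shorter filter. A minimal sketch, with the sample CSV inlined via io.StringIO:

```python
import pandas as pd
from io import StringIO

raw = StringIO(
    "Name,Date,Data\n"
    "Data,05/13/2022 21:30:00,100\n"
    "Data,05/14/2022 21:30:00,100\n"
    "Data,05/15/2022 21:30:00,100\n"
    "Data,05/16/2022 21:30:00,100\n"
)
data = pd.read_csv(raw, parse_dates=True, index_col="Date")

# Partial-string slicing: the end label is expanded to cover the whole day,
# so this keeps the rows on 05/14 and 05/15 inclusive
subset = data.loc["2022-05-14":"2022-05-15"]
print(subset)
```

subset.plot() then draws only the selected window, which is one way to let a user choose the time frame.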
I have a DataFrame that looks like this
date Burned
8/11/2019 7:00 0.0
8/11/2019 7:00 10101.0
8/11/2019 8:16 5.2
I have this code:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv")
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
a=pd.concat([df['DateTime'], df['Burned']], axis=1, keys=['date', 'Burned'])
b=a.groupby(df.index.date).count()
But I get this error: AttributeError: 'RangeIndex' object has no attribute 'date'
Basically I want to group all these times by day, since there are timestamps throughout the day. I don't care at what time of day different things occurred; I just want the total 'Burned' per day.
First add parse_dates=['DateTime'] to read_csv to convert the DateTime column:
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
Or first column:
df = pd.read_csv("../example.csv", parse_dates=[0])
In your solution a has a date column, so you need Series.dt.date with sum:
b = a.groupby(a['date'].dt.date)['Burned'].sum().reset_index(name='Total')
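Putting it together, a self-contained sketch with the sample rows inlined via io.StringIO (the extra 8/12 row is made up to show a second group):

```python
import pandas as pd
from io import StringIO

raw = StringIO(
    "DateTime,Burned\n"
    "8/11/2019 7:00,0.0\n"
    "8/11/2019 7:00,10101.0\n"
    "8/11/2019 8:16,5.2\n"
    "8/12/2019 9:00,7.0\n"
)
df = pd.read_csv(raw, parse_dates=['DateTime'])

# Total 'Burned' per calendar day, ignoring the time of day
totals = df.groupby(df['DateTime'].dt.date)['Burned'].sum().reset_index(name='Total')
print(totals)
```

Grouping on Series.dt.date avoids touching the index at all, so the RangeIndex error never comes up.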
I imported a csv file in python. Then, I changed the first column to datetime format.
datetime Bid32 Ask32
2019-01-01 22:06:11.699 1.14587 1.14727
2019-01-01 22:06:12.634 1.14567 1.14707
2019-01-01 22:06:13.091 1.14507 1.14647
I saw three ways for indexing first column.
df.index = df.datetime
del datetime
or
df.set_index('datetime', inplace=True)
and
df.set_index(pd.DatetimeIndex('datetime'), inplace=True)
My question is about the second and third ways. Why do some sources use pd.DatetimeIndex() with df.set_index() (like the third snippet) when the second was enough?
In case you are not converting the 'datetime' column with to_datetime():
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index('datetime', inplace=True) # option 2
print(type(df.index))
Result:
pandas.core.indexes.base.Index
vs.
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True) # option 3
print(type(df.index))
Result:
pandas.core.indexes.datetimes.DatetimeIndex
So the third one with pd.DatetimeIndex() makes it an actual datetime index, which is what you want.
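Equivalently, if you convert the column with to_datetime() first, the plain set_index() of option 2 also produces a DatetimeIndex; a minimal sketch of that route:

```python
import pandas as pd

# Same setup as above; the only change is converting the column up front
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']

df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)  # option 2, now yielding a DatetimeIndex
print(type(df.index))
```

So pd.DatetimeIndex() in option 3 is only needed when the column still holds plain strings at the moment you set the index.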
Documentation:
pandas.Index
pandas.DatetimeIndex
I currently have a df in pandas with a variable called 'Dates' that records the date a complaint was filed.
data = pd.read_csv("filename.csv")
Dates
Initially Received
07-MAR-08
08-APR-08
19-MAY-08
As you can see, there are missing dates between complaint filings, and multiple complaints may have been filed on the same day. Is there a way to fill in the missing days while keeping same-day complaints as separate rows?
I tried creating a new df with datetime and merging the dataframes together,
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df = pd.DataFrame(data=days)
df.index = range(3653)
dates = pd.merge(days, data['Dates'], how='inner')
but I get the following error:
ValueError: can not merge DataFrame with instance of type <class
'pandas.tseries.index.DatetimeIndex'>
Here are the first four rows of data
You were close; there's an issue with your input.
First do:
df = pd.read_csv('filename.csv', skiprows = 1)
Then
days = pd.date_range(start='01-JAN-2008', end='31-DEC-2017')
df_clean = df.reset_index()
df_clean['idx dates'] = pd.to_datetime(df_clean['Initially Received'])
df2 = pd.DataFrame(data=days, index = range(3653), columns=['full dates'])
dates = pd.merge(df2, df_clean, left_on='full dates', right_on = 'idx dates', how='left')
Create your date range, and use merge to outer join it to the original dataframe, preserving duplicates.
import pandas as pd
from io import StringIO
TESTDATA = StringIO(
"""Dates;fruit
05-APR-08;apple
08-APR-08;banana
08-APR-08;pear
11-APR-08;grapefruit
""")
df = pd.read_csv(TESTDATA, sep=';', parse_dates=['Dates'])
dates = pd.date_range(start='04-APR-2008', end='12-APR-2008').to_frame()
pd.merge(
df, dates, left_on='Dates', right_on=0,
how='outer').sort_values(by=['Dates']).drop(columns=0)
# Dates fruit
# 2008-04-04 NaN
# 2008-04-05 apple
# 2008-04-06 NaN
# 2008-04-07 NaN
# 2008-04-08 banana
# 2008-04-08 pear
# 2008-04-09 NaN
# 2008-04-10 NaN
# 2008-04-11 grapefruit
# 2008-04-12 NaN
I'm trying to import a csv file that looks like this
Irrelevant row
"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg",
"TS","RN","","","metres/second","metres/second",
"","","Smp","Smp","Avg","Avg",
"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996
"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571
The second row is the header, but rows 0, 2 and 3 are irrelevant. My code at the moment is:
parse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv('data.csv', header=1, index_col=['TIMESTAMP'],
parse_dates=['TIMESTAMP'], date_parser = parse)
The problem is that since rows 2 and 3 don't contain valid dates, I get an error (or at least I think that's the error).
Would it be possible to exclude these rows, using something like skiprows, but for rows that are not in the beginning of the file? Or do you have any other suggestions?
You can use the skiprows keyword to ignore the rows:
pd.read_csv('data.csv', skiprows=[0, 2, 3],
index_col=['TIMESTAMP'], parse_dates=['TIMESTAMP'])
Which for your sample data gives:
RECORD Site Logger Avg_70mSE_Avg Avg_60mS_Avg
TIMESTAMP
2010-05-18 12:30:00 0 Sisters 5068 5.162 4.996
2010-05-18 12:40:00 1 Sisters 5068 5.683 5.571
The first remaining row (row 1) becomes the header, and read_csv's default parser correctly parses the timestamp column.
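For reference, the same call can be reproduced end to end with io.StringIO standing in for data.csv (the trailing commas from the sample trimmed for brevity, since they would otherwise add an empty unnamed column):

```python
import pandas as pd
from io import StringIO

raw = StringIO(
    'Irrelevant row\n'
    '"TIMESTAMP","RECORD","Site","Logger","Avg_70mSE_Avg","Avg_60mS_Avg"\n'
    '"TS","RN","","","metres/second","metres/second"\n'
    '"","","Smp","Smp","Avg","Avg"\n'
    '"2010-05-18 12:30:00",0,"Sisters",5068,5.162,4.996\n'
    '"2010-05-18 12:40:00",1,"Sisters",5068,5.683,5.571\n'
)

# skiprows drops the junk lines before the header is detected,
# so the TIMESTAMP line becomes the header
df = pd.read_csv(raw, skiprows=[0, 2, 3],
                 index_col=['TIMESTAMP'], parse_dates=['TIMESTAMP'])
print(df)
```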