How can I access dates parsed by pd.read_csv? - python

I parsed my CSV data with
data = pd.read_csv('Data.csv', parse_dates=True, index_col=6, date_parser=parser)
Then, when I try to access the Date column with something like data["Date"], I get a KeyError. If I don't parse the dates in read_csv and instead parse them afterwards with data['Date'] = pd.to_datetime(data['Date'], format='%m/%d/%Y %H:%M:%S'), my graphs don't automatically put the time on the x axis when I only plot y. My end goal is to let the user select the time frame of the graph, and I'm having trouble doing so because I can't access the Date column after parsing the dates. Any help would be appreciated, thanks.
The sample CSV headers are these:
"Name","Date", "Data"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100"
"Data", "05/14/2022 21:30:00", "100

Given a CSV that looks like this:
Name,Date,Data
Data,05/13/2022 21:30:00,100
Data,05/14/2022 21:30:00,100
Data,05/15/2022 21:30:00,100
Data,05/16/2022 21:30:00,100
Note: no double quotes and no space after the comma delimiter
You have several options to load the data.
Below is the easiest way if the data is a time series (all dates in the Date column are different):
import pandas as pd
data = pd.read_csv("Data.csv", parse_dates=True, index_col="Date")
The above returns a dataframe with the Date column as a DatetimeIndex (dtype datetime64[ns]), accessed via data.index.
Resulting dataframe:
                     Name  Data
Date
2022-05-13 21:30:00  Data   100
2022-05-14 21:30:00  Data   100
2022-05-15 21:30:00  Data   100
2022-05-16 21:30:00  Data   100
You can then plot the data with a simple data.plot().
If you want to filter which data is plotted based on time, e.g. only data on 05/14 and 05/15:
data[(data.index < "2022-05-16") & (data.index > "2022-05-13")].plot()
or
new_data = data[(data.index < "2022-05-16") & (data.index > "2022-05-13")]
new_data.plot()
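If you also want the parsed dates available as an ordinary column (the KeyError in the question comes from the parsed column being moved into the index), one option is to copy the index back out after reading. A minimal sketch, assuming the same Data.csv as above:

import pandas as pd

# Parse Date into the index so plots get time on the x axis, then
# copy it back out as a normal column so data["Date"] keeps working.
data = pd.read_csv("Data.csv", parse_dates=True, index_col="Date")
data["Date"] = data.index

# A user-selected time frame can then be applied to the index:
start, end = "2022-05-14", "2022-05-16"
data[(data.index >= start) & (data.index < end)].plot()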

Related

Convert "Price" column values in a CSV file from "46.25 lacs" to 4625000 using python

I have a CSV file with a column named Price. The price values are strings like 6.35 crore and 27.2 lacs. I want to convert these to actual values (63500000 and 2720000) with integer data type.
This is what I have done so far:
import pandas as pd
df = pd.DataFrame({'selling_price': ['5.5 Lakh*', '5.7 Lakh*', '3.5 Lakh*', '3.15 Lakh*'],
                   'new-price': ['Rs.7.11-7.48 Lakh*', 'Rs.10.14-13.79 Lakh*', 'Rs.5.16-6.94 Lakh*', 'Rs.6.54-6.63 Lakh*']})
df = pd.DataFrame({'selling_price' :[int(float(str(x).strip(' Lakh*'))*100000) for x in df['selling_price'].to_list()]})
print(df)
This gives me the actual values, but I cannot figure out how to apply it to a CSV file. Improvements or any better solution would be highly appreciated. Thanks.
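A minimal sketch of one way to apply the same conversion to a whole CSV file, assuming a hypothetical cars.csv with a Price column whose values use lakh/lacs or crore units (both the file and column names are placeholders):

import pandas as pd

def to_rupees(value):
    # Convert strings like '6.35 crore' or '27.2 lacs' to integers.
    text = str(value).replace('*', '').strip()
    number, _, unit = text.partition(' ')
    multipliers = {'lakh': 100_000, 'lacs': 100_000, 'crore': 10_000_000}
    return int(float(number) * multipliers.get(unit.lower(), 1))

df = pd.read_csv('cars.csv')            # hypothetical file name
df['Price'] = df['Price'].apply(to_rupees)
df.to_csv('cars_converted.csv', index=False)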

Revert multiindex Date and time to singleindex datetime

For the last two weeks I have had to plot different time series with different datetime formats. There was no problem converting those into one format. Now I face a new challenge and am struggling to solve it. All the data (csv) I got from my colleagues had one field with both date and time inside, so I read it into a pandas data frame and reformatted the datetime. Today I got new data from a different system, with two index columns: one for the date and a second one for the time. My problem is that these index columns are designed as multiindex columns (see below).
Old Data:
Datetime          Data
01/01/2021 00:00  0,15
01/01/2021 00:15  5,18
Datetime;Data
2021-01-01 00:15:00;1,829
2021-01-01 00:30:00;1,675
2021-01-01 00:45:00;1,501
New Data:
Date        Time   Data
01/01/2021  00:00  0,15
            00:15  5,18
Date; Time; Data
01/01/2021;00:15;71,04
;00:30;62,8
;00:45;73,2
;01:00;73,48
;01:15;66,8
;01:30;67,48
;01:45;71,12
;02:00;73,88
After reading this csv into a pandas dataframe with the following code, I am not able to add the time-specific data to the existing data because the indexes are not equal.
obtain = pd.read_csv('csv/data.csv',
                     sep=';',
                     encoding='utf-8',
                     index_col=['Date', 'Time'],
                     names=['Date', 'Time', 'Data'],
                     dtype={'Date': 'string', 'Time': 'string', 'Data': 'float'},
                     decimal=',')
How do I reset the index of the new data to a single Index in a pandas dataframe as a datetime column?
I tried to just convert the index to datetime as follows:
obtain.index = pd.to_datetime(obtain.index.map(' '.join))
obtain.index = pd.to_datetime(obtain.index)
You can add the parse_dates parameter if the Date values are repeated on every row:
obtain = pd.read_csv('csv/data.csv',
                     sep=';',
                     encoding='utf-8',
                     index_col=['Date', 'Time'],
                     parse_dates=['Date', 'Time'],
                     names=['Date', 'Time', 'Data'],
                     dtype={'Data': 'float'},
                     decimal=',')
But if the Date values are blank on most rows (as in your sample), read Date and Time as plain columns, forward fill the dates, and combine them:
obtain = pd.read_csv('csv/data.csv',
                     sep=';',
                     encoding='utf-8',
                     names=['Date', 'Time', 'Data'],
                     dtype={'Data': 'float'},
                     decimal=',')
obtain.index = pd.to_datetime(obtain.pop('Date').ffill() + ' ' + obtain.pop('Time'))
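If the dataframe was already loaded with the ('Date', 'Time') MultiIndex, a minimal sketch of collapsing it to a single DatetimeIndex without re-reading the file (dayfirst=True is an assumption about the 01/01/2021 format):

# Turn both index levels back into columns, fill the blank dates,
# then combine the two parts into one datetime index.
flat = obtain.reset_index()
flat['Date'] = flat['Date'].replace('', pd.NA).ffill()
flat.index = pd.to_datetime(flat['Date'] + ' ' + flat['Time'], dayfirst=True)
obtain = flat.drop(columns=['Date', 'Time'])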

How to change datetime format while converting df to to_json using pandas

When reading a SQL query, the pandas dataframe shows the correct date and timestamp formats, but when converting the df to JSON using to_json, the date and timestamp come out in the wrong format.
import json
import pandas as pd
from ast import literal_eval
sql_data = pd.read_sql_query(''' select * from sample_table ''',con)
sql_data
tabId  tab_int  tab_char  tab_decimal  tab_date     tab_timestamp
1      100      test5     99.54        2021-08-16   2021-08-16 23:30:48
2      20       test1     85.24        2021-08-16   2021-08-16 23:31:10
json_data = sql_data.to_json(orient="records", date_format='iso')
Output:
[{"tabId":1,"tab_int":100,"tab_char":"test5","tab_decimal":99.54,"tab_date":"2021-08-16T00:00:00.000Z","tab_timestamp":"2021-08-16T23:30:48.000Z"},{"tabId":2,"tab_int":20,"tab_char":"test1","tab_decimal":85.24,"tab_date":"2021-08-16T00:00:00.000Z","tab_timestamp":"2021-08-16T23:31:10.000Z"}]
Expected output format:
[{"tabId":1,"tab_int":100,"tab_char":"test5","tab_decimal":99.54,"tab_date":"2021-08-16","tab_timestamp":"2021-08-16 23:30:48"},{"tabId":2,"tab_int":20,"tab_char":"test1","tab_decimal":85.24,"tab_date":"2021-08-16","tab_timestamp":"2021-08-16 23:31:10"}]
If I knew the column names, I could achieve this using the method below before converting to JSON:
sql_data['tab_timestamp'] = sql_data['tab_timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
But I need to read data from arbitrary tables and convert them, so I don't know in advance which columns are the right ones. Any suggestions would be appreciated.
There is a PR for this issue which is still open: No way with to_json to write only date out of datetime.
Maybe one thing you could try:
sql_data.to_json(orient='records', date_format='iso', date_unit='s')
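If that still doesn't give the layout you want, a minimal sketch of formatting datetime columns by dtype before serializing, so the column names don't need to be known in advance (treating midnight-only columns as plain dates is an assumption):

import pandas as pd

def to_json_plain_dates(df):
    out = df.copy()
    for col in out.select_dtypes(include=['datetime64[ns]']).columns:
        # Columns holding only midnight timestamps are assumed to be dates.
        if (out[col].dt.normalize() == out[col]).all():
            out[col] = out[col].dt.strftime('%Y-%m-%d')
        else:
            out[col] = out[col].dt.strftime('%Y-%m-%d %H:%M:%S')
    return out.to_json(orient='records')

json_data = to_json_plain_dates(sql_data)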

Collect all transactions for each day and report total spent that day

I have a DataFrame that looks like this:
date            Burned
8/11/2019 7:00      0.0
8/11/2019 7:00  10101.0
8/11/2019 8:16      5.2
I have this code:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv")
# Flag a transaction's Quantity as Burned when it was sent to the zero (burn) address
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
a=pd.concat([df['DateTime'], df['Burned']], axis=1, keys=['date', 'Burned'])
b=a.groupby(df.index.date).count()
But I get this error: AttributeError: 'RangeIndex' object has no attribute 'date'
Basically I want to group all these times by day, since there are timestamps throughout the day. I don't care what time of day different things occurred; I just want the total 'Burned' per day.
First add parse_dates=['DateTime'] to read_csv to convert the DateTime column:
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
Or parse the first column by position:
df = pd.read_csv("../example.csv", parse_dates=[0])
In your solution there is a date column, so you need Series.dt.date with sum:
b = a.groupby(a['date'].dt.date)['Burned'].sum().reset_index(name='Total')
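Putting it together, a minimal sketch of the whole flow, with the DateTime, To and Quantity column names assumed from the snippet above:

import pandas as pd

df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
# Quantity counts as Burned only for transfers to the zero address.
df['Burned'] = df['Quantity'].where(
    df['To'] == '0x0000000000000000000000000000000000000000', 0.0)
daily = df.groupby(df['DateTime'].dt.date)['Burned'].sum().reset_index(name='Total')
print(daily)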

Time Series using numpy or pandas

I'm a beginner in the Python ecosystem and I have a problem working with time series data.
Below is my OHLC 1-minute data.
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the last field ("Close") from each row and convert the data into a format like the following:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd also like to calculate the highest Close value and its time (minute) for EACH DAY, like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be much appreciated. Thanks in advance.
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data, parse the first two columns as a single
# date-time, and set it as the index
df = pd.read_csv('your_file', parse_dates=[[0, 1]], index_col=0, header=None)
# keep only the fifth column (close)
df = df[[5]]
# Resample to daily frequency and take the max value for each day
df.resample('D').max()
If you want to show the times as well, keep them in your DataFrame as a column and pass a function that determines the max close value and returns that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0, 1]], index_col=0, header=None,
...                  usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.groupby(pd.Grouper(freq='D')).apply(lambda group: group.iloc[group['close'].argmax()])
close time
d_t
2011-11-01 250.60 2011-11-01 15:04:00
2011-11-02 246.35 2011-11-02 09:01:00
And if you want a list of the prices per day, just do a groupby per day and return the list of all the prices from each group, using apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that some days have no data at all, so the function applied to each daily group should handle that case:
def func(group):
    if len(group) == 0:
        return None
    return group.iloc[group['close'].argmax()]

df.groupby(pd.Grouper(freq='D')).apply(func).dropna()
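For the first requested layout (one row per day with all that day's closes across the columns), a minimal sketch using pivot, assuming each minute occurs at most once per day:

# Split the DatetimeIndex into a date part (rows) and a time part
# (columns), then pivot the close prices into a wide table.
wide = (df.assign(day=df.index.date, minute=df.index.time)
          .pivot(index='day', columns='minute', values='close'))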
