I have a table which contains a lot of data about plants from different dates.
I am trying to select all the data from a specific date, but whenever I do that all the data disappears and I get a table that has only the column names.
This is my code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_plants = pd.read_csv('Data_plants_26_11_2019.csv')
df_Nit=pd.read_csv('chemometrics.csv')
df_plants.head()
# create a new column which contains only the hour, using a lambda
df_plants['Hour']=df_plants['time'].apply(lambda time: time.split(' ')[1])
df_plants['date']=df_plants['time'].apply(lambda time: time.split(' ')[0])
df_plants['Hour'] = pd.to_datetime(df_plants['Hour']).apply(lambda x: str(x.hour) + ':00')
df_indices=df_plants[['plant','date','Hour','Treatment','Line','NDVI','YU_index','Zhao 405-715']]
df_indices[df_indices['date']==6/22/2019]
The result: instead of the filtered rows, I get an empty table that shows only the column names.
Before the date filter, head() shows the table fully populated.
My end goal is to get a new table which contains ONLY the values from a specific date I choose.
The main issue is that you are comparing against an unquoted value: Python evaluates 6/22/2019 as division (6/22/2019 ≈ 0.000135), so pandas compares the column to a number, not a date.
You should put the value between quotes, like this:
df_indices[df_indices['date']=='6/22/2019']
Or, if the column has already been converted to datetime (pandas 0.19 and above):
df_indices[df_indices['date'] == pd.Timestamp(year=2019, month=6, day=22)]
(Note that .dt is a Series accessor; a scalar Timestamp has no .dt attribute.)
I would do:
# first create a datetime column with the date
# (probably you should make some changes here because of your
# datetime format)
df_indices['date'] = pd.to_datetime(df_indices['date'])
# then compare against a pd.to_datetime() timestamp
df_indices[df_indices['date']==pd.to_datetime('2019-06-22')]
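Putting it together, a minimal self-contained sketch (with a tiny made-up frame, not the poster's data):
import pandas as pd
df_indices = pd.DataFrame({'plant': ['A', 'B', 'C'],
                           'date': ['6/22/2019', '6/23/2019', '6/22/2019']})
# parse the strings once, then compare against a real timestamp
df_indices['date'] = pd.to_datetime(df_indices['date'], format='%m/%d/%Y')
print(df_indices[df_indices['date'] == pd.to_datetime('2019-06-22')])
# keeps only the two rows from June 22, 2019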
I'm not sure if I'm going about this the right way; however, I'm trying to create a multi-index from three fields, which will eventually be three separate joins based on input data.
In the example below I'm creating a second dataframe, which I want to use as a date lookup table that can join on the expiration date.
When I try to create the multi-index I get the following error:
ValueError: Length of names must match number of levels in MultiIndex.
I've tried turning the dataframe data_index into a series, but I'm still getting the error.
import pandas as pd
import numpy as np
import requests
from datetime import date
raw_data = requests.get("https://cdn.cboe.com/api/global/delayed_quotes/options/_SPX.json")
dict_data = pd.DataFrame.from_dict(raw_data.json())
#create dataframe from options key
data = pd.DataFrame(dict_data.loc["options", "data"])
# regex to extract the contract type, strike, and expiration date
data["contract_type"] = data.option.str.extract(r"\d([A-Z])\d")
data["strike_price"] = data.option.str.extract(r"\d[A-Z](\d+)\d\d\d").astype(int)
data["expiration_date"] = str(20) + data.option.str.extract(r"[A-Z](\d+)").astype(str)
# Convert expiration to datetime format
data["expiration_date"] = pd.to_datetime(data["expiration_date"], format="%Y-%m-%d")
data_index = pd.MultiIndex.from_frame(data, names=[["expiration_date", "strike_price", "contract_type"]]) ## this is where the error occurs
print(data_index)
date_df = pd.DataFrame({"expiration_date": data["expiration_date"].unique()})
date_df.set_index('expiration_date').join(data_index.set_index('expiration_date'))
today_date = date.today().strftime("%Y-%m-%d")
date_df["days_till_expire"] = np.busday_count(today_date, date_df["expiration_date"].dt.strftime("%Y-%m-%d").to_list()) / 252
date_df.loc[date_df["days_till_expire"].eq(0), "days_till_expire"] = 1 / 252
print(date_df)
How can I get the multi-index to work with a join on a single index, joining date_df into the multi-indexed dataframe? I believe the logic above should work if the multi-index is set up correctly.
It's probably too long to put into a comment, so here it is in the form of an answer:
I suppose that what you wanted to achieve is this (building the index not from the entire frame, but from three of its columns):
data_index = pd.MultiIndex.from_arrays(
    [data["expiration_date"], data["strike_price"], data["contract_type"]],
    names=["expiration_date", "strike_price", "contract_type"])
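Equivalently, from_frame also works, as long as you pass only the three key columns (not the whole frame) and a flat list of names rather than a nested one:
data_index = pd.MultiIndex.from_frame(
    data[["expiration_date", "strike_price", "contract_type"]],
    names=["expiration_date", "strike_price", "contract_type"])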
For how to join on a single index check out:
How to join a multi-index series to a single index dataframe with Pandas?
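In short (a sketch, assuming data still carries the three key columns): pandas joins a MultiIndexed frame to a single-indexed frame on the shared index level name:
indexed = data.set_index(["expiration_date", "strike_price", "contract_type"])
# the join matches on the common level/index name 'expiration_date'
joined = indexed.join(date_df.set_index("expiration_date"))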
Does this help?
I have two Excel files. The first one holds the dependent variable: data indexed by date, with station IDs as the column headers, as shown below.
The second one (the independent variable) holds the data I use to simulate the dependent variable (the first file above). It also carries the date, formatted as one column for the year and two more for the month and day respectively, as shown in the image below.
What I want is: 1. skip the NaN values in the first file; 2. add the values from the first table to the second file based on the same date and the same water monitoring station ID.
This is the code I have written so far; I am new to Python and have been struggling for days.
import pandas as pd
import numpy as np
# first, exclude the rows that have no values
# read the csv file
csvB4reflectance = pd.read_csv('GEEdownload.csv')
b4 = pd.read_csv('GEEdownload.csv',sep=',',parse_dates=['system:time_start'])
b4.set_index('system:time_start', inplace=True)  # set the index so that all-NaN rows can be dropped below
print(csvB4reflectance)
path = 'F:/72hourtimewindow/project/waterqualitydate/29UMT/'
excelorder = pd.read_excel(path+'Stationwithorder.xls',header = 0, index_col=0)
print(excelorder)
b41 = b4.dropna(axis=0,how='all')
print(b41)
# process this table: start calculating wherever the data is not NaN
b41num = b41.to_numpy()
print(b41num)
# loop over the array, skipping NaN values
for i in b41num:
    for j in i:
        if np.isnan(j):
            break
        else:
            print(j)
I have come across the solution to this problem: the second table needs to be melted first, and then everything becomes easy.
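For reference, a minimal sketch of that melt-then-merge approach (the file and column names 'dependent.xlsx', 'independent.xlsx', 'year', 'month', 'day', and 'station' are placeholders, not from my actual files):
import pandas as pd
# first table: dates down the side, one column per station ID
dep = pd.read_excel('dependent.xlsx', index_col=0)
dep.index.name = 'date'
# melt to long format: one row per (date, station, value), dropping NaNs
dep_long = (dep.reset_index()
               .melt(id_vars='date', var_name='station', value_name='value')
               .dropna(subset=['value']))
# second table: year/month/day columns plus a station column
indep = pd.read_excel('independent.xlsx')
indep['date'] = pd.to_datetime(indep[['year', 'month', 'day']])
# join the measured values on matching date and station
merged = indep.merge(dep_long, on=['date', 'station'], how='left')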
I saw this code
combine rows and add up value in dataframe,
but I want to add up the values in the cells for the same day, i.e. sum all the data for a day. How do I modify the code to achieve this?
Check below code:
import pandas as pd
df = pd.DataFrame({'Price': [10000, 10000, 10000, 10000, 10000, 10000],
                   'Time': ['2012.05', '2012.05', '2012.05', '2012.06', '2012.06', '2012.07'],
                   'Type': ['Q', 'T', 'Q', 'T', 'T', 'Q'],
                   'Volume': [10, 20, 10, 20, 30, 10]})
df.assign(daily_volume=df.groupby('Time')['Volume'].transform('sum'))
Output:
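   Price     Time Type  Volume  daily_volume
0  10000  2012.05    Q      10            40
1  10000  2012.05    T      20            40
2  10000  2012.05    Q      10            40
3  10000  2012.06    T      20            50
4  10000  2012.06    T      30            50
5  10000  2012.07    Q      10            10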
I have a pandas Dataframe with an index using UTC time and a column with data (in the example the column "value_1").
My question is: how can I create a new column in which each value is the value of the first column but 20 seconds later? Using the example below, the first value of this second column would be the value at "2011-01-01 00:00:20".
import pandas as pd
import numpy as np
data_1 = pd.DataFrame(index=pd.date_range('1/1/2011', periods = 1000, freq ='S'))
data_1['value_1'] = 100 + np.random.randint(0,1000,size=(1000, 1))
data_1['value_2'] = ??¿¿
I don't know whether it would become possible if I changed the index to a different format.
I have seen that pandas has some useful functionality for working with time series, but I have not found the piece that solves this problem yet.
Thank you in advance.
You can either use shift with the number of rows corresponding to 20 seconds (here 20, since the index has a 1-second frequency):
data_1['value_2'] = data_1['value_1'].shift(-20)
or you can reindex with the index + 20s and take the values with to_numpy():
data_1['value_2'] = data_1['value_1'].reindex(data_1['value_1'].index+pd.Timedelta(seconds=20)).to_numpy()
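Note that shift(-20) means "20 seconds later" only because the index in the question has a regular 1-second frequency. For an irregular DatetimeIndex you can shift the timestamps themselves with shift's freq argument (a sketch; alignment on assignment fills the missing labels with NaN):
data_1['value_2'] = data_1['value_1'].shift(-1, freq='20s')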
I am working on a dataframe loaded from a CSV. I have tried changing the data types in the CSV file itself and saving it, but for some reason it doesn't let me save, so when I load the file into pandas the date and time columns appear as object.
I have tried a few ways to transform them to datetime, without much success:
1) df['COLUMN'] = pd.to_datetime(df['COLUMN'].str.strip(), format='%m/%d/%Y')
gives me the error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
2) Defining dtypes at the beginning and then passing them to the read_csv command gave me an error as well, since dtype accepts only string/int types, not datetime.
Some of the columns I want in a date format, such as 2019/1/1, and some in a time format, such as 20:00:00.
Do you know of an effective way of transforming these object columns to either date or time?
Based on the discussion, I downloaded the dataset from the link you provided and read it with pandas. I took one column, and a part of it which has the date, and used pandas' datetime handling as you did. By doing so I can use the script you mentioned.
# import necessary libraries
import numpy as np
import pandas as pd
# load the data from the csv
data = pd.read_csv("NYPD_Complaint_Data_Historic.csv")
#take one column which contains the datatime as an example
dte = data['CMPLNT_FR_DT']
# =============================================================================
# I will try to take a part of the data from dte which contains the
# date time and convert it to date time
# =============================================================================
test_data = dte[0:10]
df1 = pd.DataFrame(test_data)
df1['new_col'] = pd.to_datetime(df1['CMPLNT_FR_DT'])
df1['year'] = df1['new_col'].dt.year
df1['month'] = df1['new_col'].dt.month
df1['day'] = df1['new_col'].dt.day
#The way you used to convert the data also works
df1['COLUMN'] = pd.to_datetime(df1['CMPLNT_FR_DT'].str.strip(), format='%m/%d/%Y')
It might be the way you get the data. You can see the output in the attached screenshot. As the result is stored in a dataframe, it won't be a problem to save it in any format. Please let me know if I understood correctly and whether this helped. The month is not shown in the image, but you can get it the same way.
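If it helps, here is a minimal sketch of an alternative route (the column names 'DATE_COL' and 'TIME_COL' are placeholders, not from your data): let read_csv parse the date column up front, then split out the date and time parts with the .dt accessor:
import pandas as pd
# parse the date column while reading; avoids the dtype limitation mentioned above
df = pd.read_csv('data.csv', parse_dates=['DATE_COL'])
df['date_only'] = df['DATE_COL'].dt.date  # e.g. 2019-01-01
df['time_only'] = pd.to_datetime(df['TIME_COL'], format='%H:%M:%S').dt.time  # e.g. 20:00:00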