I am trying to combine 2 separate data series of one-minute data to create a ratio, then create Open High Low Close (OHLC) files for the ratio for the entire day. I am bringing in two time series and creating associated dataframes using pandas. The time series have missing data, so I am creating a datetime variable in each file and then merging the files on that datetime variable using pd.merge. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the dates are being sorted on the month rather than the whole date, so the result isn't chronological. My goal is to get the index sorted by date so the output is chronological. Perhaps I need to create a new column in the dataframe referencing the index (not sure how), or maybe there is a way to tell pandas that the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by the month regardless of the year, and my output file ends up out of order. More generally, I am not sure how to reference or manipulate the index of a pandas dataframe, so any related material would be useful.
Thank you
Years later...
This fixes the problem.
df is a dataframe
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the index to a DatetimeIndex
df = df.sort_index()                 # sort by the converted index
This should get the sorting back into chronological order
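For reference, here is a minimal sketch of how the daily OHLC itself could be built once the index is a real DatetimeIndex. The names merged and ratio are placeholders (not from the question), and resample replaces the groupby/for-loop approach described above:
import pandas as pd

merged.index = pd.to_datetime(merged.index)        # one-minute data, datetime index
merged = merged.sort_index()                       # chronological order
daily_ohlc = merged['ratio'].resample('D').ohlc()  # open/high/low/close for each day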
I have a CSV with more than 500,000 rows and a lot of duplicates. I'm creating a new dataframe with unique values and trying to find the min value of a date.
I have this code here that works but takes way too much time to run. How can I improve this?
for i in range(len(df_leads)):
    df_leads.loc[i, 'Created Date'] = df[df['customer_lead_id'] == df_leads.loc[i, 'Lead']].min()['created_date']
I'm creating a new dataframe with unique values and trying to find the min value of a date.
The second df comes from doing the unique of df['customer_lead_id']
Instead of that, we can drop_duplicates on the DataFrame after we sort_values by the date; this keeps the first occurrence, i.e. the min value:
df.sort_values('created_date').drop_duplicates('customer_lead_id')
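If you prefer an explicit aggregation, a groupby with min should give the same result. This is a sketch using the column names from the snippet above, not tested against your data:
# Earliest created_date per customer_lead_id
earliest = df.groupby('customer_lead_id', as_index=False)['created_date'].min()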
In the long run, I'm trying to merge dataframes of data coming from different sources. The dataframes are all time series, and I'm having difficulty with one dataset. Its first column is DateTime. The initial data has a temporal resolution of 15 s, but in my code I resample it and average it for each minute (to match the temporal resolution of my other datasets). What I'm trying to do is build a separate column of the datetimes (it ends up with the column header 0) and then concatenate it horizontally to the initial data. I'm doing this because when I set the index column to 'DateTime', that column seems to disappear (when I export to CSV and open it in Excel, or print the dataframe, the column is no longer there), and concatenating the 0 column (df1_DateTimes in the code below) back onto the dataframe seems to restore the lost data. The 0 header is generated automatically when I create df1_DateTimes.
All of the input datetime data is in the format dd/mm/yyyy HH:MM. However, when I create df1_DateTimes, the datetimes come out as mm/dd/yyyy HH:MM, and the column length is equal to that of the data before it was resampled.
I'm wondering if anyone knows a way to get df1_DateTimes into the format dd/mm/yyyy HH:MM and to make the column the same length as the resampled data. The latter isn't as important because I could just live with some empty data. I've tried things like passing format='%d%m%y %H:%M', but it didn't seem to work.
Alternatively, does anyone know how to resample the data without losing the DateTimes, and have the DateTimes in 1-minute increments as well? Any information on any of this would be greatly appreciated, as long as the end result is a dataframe with the values resampled to every minute and the DateTime column intact, with a dtype of datetime64 (so I can merge it with my other datasets). I have included my code below.
import pandas as pd

df1 = pd.read_csv('PATH',
                  parse_dates=True, usecols=[0, 7, 10, 13, 28],
                  infer_datetime_format=True, index_col='DateTime')

# Resample data to take minute averages
df1.dropna(inplace=True)  # Drop missing values
df1 = df1.resample('Min').mean()
df1.to_csv('df1', index=False, encoding='utf-8-sig')

df1_DateTimes = pd.to_datetime(df1.index.values)
df1_DateTimes = df1_DateTimes.to_frame()
df1_DateTimes.to_csv('df1_DateTimes', index=False, encoding='utf-8-sig')
Thanks for reading and hope to hear back.
import pandas as pd

k = df1_DateTimes  # k is the dataframe holding the dates column
k['TITLE OF DATES COLUMN'] = pd.to_datetime(k['TITLE OF DATES COLUMN']).dt.strftime('%d/%m/%y')
I think using the above snippet solves your issue.
It assigns the date column to the formatted version (dd/mm/yy) of itself.
More on the Kite docs
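For what it's worth, here is a minimal sketch of the full pipeline the question describes: parse the dates as day-first, resample to one minute, and keep DateTime as a real datetime64 column by resetting the index. 'PATH', the column positions, and the 'DateTime' header are taken from the question's code; everything else is an assumption:
import pandas as pd

df1 = pd.read_csv('PATH', usecols=[0, 7, 10, 13, 28])
df1['DateTime'] = pd.to_datetime(df1['DateTime'], dayfirst=True)  # keeps dd/mm/yyyy from being read as mm/dd/yyyy
df1 = df1.set_index('DateTime').dropna().resample('Min').mean()
df1 = df1.reset_index()  # DateTime comes back as a normal datetime64 column
df1.to_csv('df1.csv', index=False, encoding='utf-8-sig')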
Hello Stack Overflow community. I am having an issue while trying to do a simple merge between two dataframes which share the same date column. Sorry, I am new to Python and perhaps the way I express myself is not very clear. I am working on a project related to stock price calculation. The first dataframe has date and closing price columns, while the second one only has a similar date column. My goal is to obtain a single date column with the matching closing price column next to it.
This is what I have done to merge the two dataframes:
inner_join = pd.merge(df.iloc[7:79],df1[['Ex-Date','FDX UN Equity']],on ='Ex-date',how ='inner')
inner_join
Ex-date refers to the date column and FDX UN Equity refers to the column with closing prices.
I get this as a result:
(traceback truncated; the merge fails inside pandas' merge-key validation)
KeyError: 'Ex-date'
Pandas reads the format of the date columns differently, so I set the same format for the date columns in the original Excel file, but it hasn't helped. I tried all sorts of merges but none of them worked either.
Does anyone have any ideas about what is going on?
The code would look like this
import pandas as pd
inner_join = pd.merge_asof(df, df1, on = 'Ex-date')
Change both column header names to the same case and merge again. Check 'Ex-Date': the column name headers should be the same before you merge, and use how='left'.
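A minimal sketch of that suggestion, assuming the date column is spelled 'Ex-Date' in df1 and you want the merge key to read 'Ex-date' everywhere (the names come from the snippets above and are purely illustrative):
import pandas as pd

df1 = df1.rename(columns={'Ex-Date': 'Ex-date'})  # make the key names identical
inner_join = pd.merge(df.iloc[7:79],
                      df1[['Ex-date', 'FDX UN Equity']],
                      on='Ex-date', how='inner')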
I have just started learning pandas, so I am only at the beginning of the road. :)
The situation:
I have two dataframes (df1 and df2).
df1 contains multiple sensor data of a machine. The sensors transmit data every minute. I set the index of df1 in datetime format (this is actually the date and time when the sensors sent the data).
df2 contains the data of one production unit, meaning the unit id number (this is named 'Sarzs' in the dataframe) and the datetime when the process started and ended as well as the output quality of that particular production unit. The dataframe does not contain the data of the production unit related to that particular time (in the dataframe you can see that the column "Sarzs_no" is set to NaN at this stage). The starting and stopping dates and times of the production unit are stored in the "Start" and "Stop" columns and are in datetime format.
The problem:
I would like to iterate through the rows of df1 and through the rows of df2, check whether the df1 timestamps fall within (or between) the "Start" and "Stop" times of df2, and if so, update the df1['Sarzs_no'] value with the df2['Output'] value.
The progress so far:
So far I have written the code below:
for i in range(0, len(df2.index)):
    for j in range(0, len(df1.index)):
        print(df1.index)
and I have two questions basically:
How to actually write the filtering code and do the update?
Isn't there (there should be, I guess) a better way to do the filtering than iterating through all the rows of both dataframes, which seems very time-consuming and therefore inefficient to me?
Thank you in advance for your help.
With dataframes containing timestamps as datetime objects, you could use something like the following:
# Loop over the dataframe containing start and end timestamps
for index, row in df2.iterrows():
    # Create a boolean mask to filter data
    mask = (df1.index > row['Start']) & (df1.index < row['Stop'])
    df1.loc[mask, 'Sarzs_no'] = row['Output']
This makes the rows that match the mask condition take the Output label of the row, for each row of your dataframe containing start and end timestamps.
The .loc accessor selects the rows that match the condition, and iterrows creates an iterator that moves through your dataframe row by row.
EDIT
As you have a datetime index, you can just use:
df1[row['Start']:row['Stop']]
instead of .loc() to get the rows you need to update
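Putting the edit together, a small sketch of the slice-based version; this assumes df1's DatetimeIndex is sorted, and the column names come from the question:
for index, row in df2.iterrows():
    # Label every df1 row whose timestamp falls between Start and Stop
    df1.loc[row['Start']:row['Stop'], 'Sarzs_no'] = row['Output']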
I have been having some problems adding the contents of a pandas Series to a pandas DataFrame. I start with an empty DataFrame, initialised with several columns (corresponding to consecutive dates).
I would like to then sequentially fill the DataFrame using different pandas Series, each one corresponding to a different date. However, each Series has a (potentially) different index.
I would like the resulting DataFrame to have an index that is essentially the union of each of the Series indices.
I have been doing this so far:
for date in dates:
    df[date] = series_for_date
However, my df index corresponds to that of the first Series, so any data in successive Series that corresponds to an index 'key' not in the first Series is lost.
Any help would be much appreciated!
Ben
If I understand correctly, you can use concat:
pd.concat([series1,series2,series3],axis=1)
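A tiny illustration of what concat does with differing indices (the series values and dates here are made up): axis=1 lines the series up on the union of their indices, and keys= labels each column with its date.
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])  # different index than s1
df = pd.concat([s1, s2], axis=1, keys=['2020-01-01', '2020-01-02'])
# Resulting index is the union {'a', 'b', 'c'}; missing entries become NaN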