pandas - Merging on date columns is not working - python

Hello Stack Overflow community. I am having an issue while trying to do a simple merge between two dataframes that share the same date column. Sorry, I am new to Python, so perhaps the way I express myself is not very clear. I am working on a project related to stock price calculation. The first dataframe has date and closing price columns, while the second one only has a similar date column. My goal is to obtain a single date column with the matching closing prices column next to it.
This is what I have done to merge the two dataframes:
inner_join = pd.merge(df.iloc[7:79], df1[['Ex-Date', 'FDX UN Equity']], on='Ex-date', how='inner')
inner_join
Ex-date refers to the date column and FDX UN Equity refers to the column with the closing prices.
I get this as a result:
(traceback truncated; the failure happens inside pandas' merge-key lookup)
KeyError: 'Ex-date'
Pandas read the format of the date columns differently, so I set the same format for the date columns in the original Excel file, but it hasn't helped. I tried all sorts of merges, but none of them worked either.
Does anyone have any idea what is going on?

The code would look like this:
import pandas as pd
inner_join = pd.merge_asof(df, df1, on='Ex-date')

Change both column header names to the same lower case and merge again. Check 'Ex-Date': the column name headers should be the same before you merge, and use how='left'.
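For example, a minimal sketch of that fix, assuming the header is spelled 'Ex-Date' in df1 and 'Ex-date' in df, as the question's code suggests:

import pandas as pd

# Align the header spelling so both frames use 'Ex-date'
df1 = df1.rename(columns={'Ex-Date': 'Ex-date'})

# Parse both date columns the same way so the keys actually match
df['Ex-date'] = pd.to_datetime(df['Ex-date'])
df1['Ex-date'] = pd.to_datetime(df1['Ex-date'])

inner_join = pd.merge(df.iloc[7:79], df1[['Ex-date', 'FDX UN Equity']], on='Ex-date', how='inner')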

Related

Key Error when Merging columns from 2 Data Frames

I am trying to create a new df with certain columns from two others.
The first is called visas_df:
And the second is called cpdf:
I only need the highlighted columns. But when I try this:
df_joined = pd.merge(cpdf, visas_df["visas"], on="date")
the error that appears is: KeyError: 'date'
I imagine this is due to how I created cpdf. It was a "bad dataset", so I did some fidgeting. Line 12 of the code snippet below might have something to do with it, but I am clueless...
I even renamed the date columns of both dfs to "date" and checked that the dtypes and number of rows are the same.
Any feedback would be much appreciated. Thanks!
visas_df["visas"] in the merge call is a Series, not a DataFrame, and it does not contain the date column. If you want it as a DataFrame, you have to use double square brackets [[]], like this:
df_joined = pd.merge(cpdf, visas_df[["date", "visas"]], on="date")
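A small sketch of the difference, using made-up data (the column names follow the question):

import pandas as pd

visas_df = pd.DataFrame({'date': ['2020-01', '2020-02'], 'visas': [10, 20]})
cpdf = pd.DataFrame({'date': ['2020-01', '2020-02'], 'cp': [1.2, 3.4]})

print(type(visas_df["visas"]))            # Series: no 'date' column to merge on
print(type(visas_df[["date", "visas"]]))  # DataFrame: keeps the 'date' key

df_joined = pd.merge(cpdf, visas_df[["date", "visas"]], on="date")
print(df_joined)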

How to create a multi-indexed DataFrame using Python

I need to create a multi-indexed table of data using DataFrames in Python.
Basically, I want the left index to be a timestamp (it's in date-time), and the following data to be in columns indexed by date. [I.e. I have a timestamp and two columns of data stored in this DataFrame, say DF0.]
Say each of the DataFrames (e.g. DF0) has an ID attached to it. That ID would be the secondary index sitting above the column titles.
[This is the table after merging two DataFrames, say DF0 and DF1.]
This is the ideal output, but it needs a secondary index that I can assign; say 5 and 6 for this example.
[The ideal output is this picture.]
Thank you in advance for your time and effort.
Try using:
pd.concat([df1, df2], keys=['ID1', 'ID2'], axis=1)
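A self-contained sketch of that approach; the data and the IDs 5 and 6 are invented to mirror the question:

import pandas as pd

idx = pd.date_range('2021-01-01', periods=3, freq='D')
df1 = pd.DataFrame({'open': [1.0, 2.0, 3.0], 'close': [1.5, 2.5, 3.5]}, index=idx)
df2 = pd.DataFrame({'open': [4.0, 5.0, 6.0], 'close': [4.5, 5.5, 6.5]}, index=idx)

# keys= adds an outer column level, so each frame's columns sit under its own ID
combined = pd.concat([df1, df2], keys=[5, 6], axis=1)
print(combined)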

How to add a column to dataframe with multiindex columns with vlookup function?

I'm trying to figure out how to add a column from this dataframe:
To this one:
As you can see, both dataframes have an "SPU" column, so the data needs to be added according to this column (like the VLOOKUP function). The problem is that the second dataframe has multi-index columns, so something like:
pv = pd.merge(dataframe1, dataframe2[['SPU', 'Adv_per_unit']], on='SPU', how='left')
is not working.
I tried to figure it out by myself by adding:
dataframe1['Ads', 'Adv_per_unit'] = dataframe2['Adv_per_unit']
but obviously this doesn't solve the problem: the data in 'Adv_per_unit' is not matched against the data from dataframe2 because it wasn't merged properly.
P.S. I checked many existing similar topics on Stack Overflow, but didn't find a solution for the case where data needs to be added VLOOKUP-style.
If the SPU entries are unique, you could use that column as the index and then run
dataframe1['Ads', 'Adv_per_unit'] = dataframe2['Adv_per_unit']
Edit: expanded answer
To set SPU as the index, run:
dataframe1.set_index("SPU", inplace=True)
dataframe2.set_index("SPU", inplace=True)
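Put together, a minimal sketch with invented data, assuming dataframe1 carries two-level columns as in the question:

import pandas as pd

cols = pd.MultiIndex.from_tuples([('Info', 'SPU'), ('Info', 'Qty')])
dataframe1 = pd.DataFrame([['A', 3], ['B', 4]], columns=cols)
dataframe2 = pd.DataFrame({'SPU': ['B', 'A'], 'Adv_per_unit': [2.0, 1.0]})

dataframe1 = dataframe1.set_index(('Info', 'SPU'))
dataframe2 = dataframe2.set_index('SPU')

# Assignment aligns on the shared SPU index, which is the VLOOKUP-style match
dataframe1['Ads', 'Adv_per_unit'] = dataframe2['Adv_per_unit']
print(dataframe1)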

How do I add a new column to an existing dataframe and fill it with partial data from another column?

I have a dataframe called jobs (shown in a screenshot in the original post).
I need to add a new column 'year' to the jobs dataframe. This column should contain the corresponding year for each post_date (which is already a column). For example: for a post_date value of 2017-08-16, the 'year' value should be 2017.
I am unsure how to insert a new column while also pulling data from a pre-existing column.
Use dt.year:
jobs['year'] = pd.to_datetime(jobs['post_date'], errors='coerce').dt.year
I would begin by converting the post_date column into datetime format. After doing this, you can use a simple accessor to extract the year.
jobs["post_date"] = pd.to_datetime(jobs["post_date"])
should be enough to change it into a datetime type. If it doesn't, pass an explicit strptime-style format string so pandas knows the specific format of the "post_date" column and can read it as a date. After that, do the following:
jobs["year"] = jobs["post_date"].dt.year
If I understand your question correctly, you want to add a new column of year values to the existing dataframe, derived from a column your dataframe already has.
To extract only the year values, first make sure the column is parsed as datetimes; then pandas' datetime accessor can pull just the year out of your post_date column. Have a look at this or this.
For storing these year values, you can simply do this:
jobs['year'] = jobs['post_date'].dt.year

Python Pandas Index Sorting/Grouping/DateTime

I am trying to combine two separate data series of one-minute data to create a ratio, then create Open High Low Close (OHLC) files for the ratio for the entire day. I bring in the two time series and create associated dataframes using pandas. The time series have missing data, so I create a datetime variable in each file and then merge the files with pd.merge on the datetime variable. Up to this point everything is going fine.
Next I group the data by the date using groupby. I then feed the grouped data to a for loop that calculates the OHLC and feeds that into a new dataframe for each respective day. However, the newly populated dataframe uses the date (from the grouping) as the dataframe index and the sorting is off. The index data looks like this (even when sorted):
01/29/2013
01/29/2014
01/29/2015
12/2/2013
12/2/2014
In short, the sorting is being done only on the month, not on the whole date as a date, so it isn't chronological. My goal is to get it sorted chronologically. Perhaps I need to create a new column in the dataframe referencing the index (I'm not sure how), or maybe there is a way to tell pandas that the index is a date, not just a value? I tried various sort approaches, including sort_index, but since the dates are the index and don't seem to be treated as dates, the sort functions sort by month regardless of year, and my output file ends up out of order. More generally, I am not sure how to reference or manipulate the unique identifier index in a pandas dataframe, so any related material would be useful.
Thank you
Years later...
This fixes the problem (df is a dataframe):
import pandas as pd
df.index = pd.to_datetime(df.index)  # convert the string index to a DatetimeIndex
df = df.sort_index()  # sort the converted index
This should get the sorting back into chronological order.
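A quick sketch of why the conversion matters, rebuilt from the index values in the question:

import pandas as pd

df = pd.DataFrame({'close': [1, 2, 3, 4, 5]},
                  index=['01/29/2013', '01/29/2014', '01/29/2015',
                         '12/2/2013', '12/2/2014'])

print(df.sort_index())   # string sort: all the 01/... dates come before 12/2/2013

df.index = pd.to_datetime(df.index)
print(df.sort_index())   # chronological: 12/2/2013 now comes before 01/29/2014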
