Concatenate/Merge/Join two different Dataframes Pandas - python

I am looking to join two dataframes using pandas on the 'Date' column. I usually use df2 = pd.concat([df, df1], axis=1), however for some reason this is not working here.
In this example, I am pulling the data from a SQL file, creating a new column called 'Date' by merging my year and quarter columns, and then pivoting. When I try to concatenate the two dataframes, they show up side by side instead of merged together.
What comes up:
Date Count of Cats Date Count of Dogs
What I want to come up:
Date Count of Cats Count of Dogs
Any ideas?
My other problem is that I am trying to make sure the Date column writes to Excel as a string and not as a datetime. Please keep this in mind when thinking about a solution.
Here is my code:
executeScriptsFromFile('cats.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1': '3/31', 'Q2': '6/30', 'Q3': '9/30', 'Q4': '12/31'}
df['Date'] = df['QUARTER'].map(monthend) + '/' + df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df10 = df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum, fill_value=0)
df10.reset_index(drop=False, inplace=True)
df10.reindex_axis(['Breed', 'Count of Cats'], axis=1)
df10.columns = ('Breed', 'Count of Cats')
executeScriptsFromFile('dogs.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1': '3/31', 'Q2': '6/30', 'Q3': '9/30', 'Q4': '12/31'}
df['Date'] = df['QUARTER'].map(monthend) + '/' + df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df11 = df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum, fill_value=0)
df11.reset_index(drop=False, inplace=True)
df11.reindex_axis(['Breed', 'Count of Dogs'], axis=1)
df11.columns = ('Breed', 'Count of Dogs')
df11a = df11.round(0)
df12 = pd.concat([df10, df11a], axis=1)

I think you have to remove these lines:
df10.reset_index(drop=False, inplace=True)
df11.reset_index(drop=False, inplace=True)
because you need the Date level in the index so that concat can align the two dataframes by date.
Also, to convert the index to strings use:
df.index = df.index.astype(str)
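A minimal sketch of that approach, assuming two pivoted frames that both keep Date as the index (the data and column names here are made up):
import pandas as pd
# hypothetical pivoted frames, each indexed by Date
cats = pd.DataFrame({'Count of Cats': [3, 5]},
                    index=pd.to_datetime(['2017-03-31', '2017-06-30']))
dogs = pd.DataFrame({'Count of Dogs': [2, 7]},
                    index=pd.to_datetime(['2017-03-31', '2017-06-30']))
# with Date left in the index, concat aligns the two frames on it
combined = pd.concat([cats, dogs], axis=1)
# convert the index to strings so Excel receives text, not datetimes
combined.index = combined.index.astype(str)
combined.index.name = 'Date'
combined.to_excel('counts.xlsx')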

Related

How to add a column from df1 to df2 if it is not present in df2, else do nothing

I have 2 dataframes from a basic web scrape using Pandas (below). The second table has fewer columns than the first, and I need to concat the dataframes. I have been manually inserting columns for a while, but since they change frequently I would like a function that can look at the columns in df1, check whether they are all present in df2, and if not, add the missing columns to df2 with the data from df1.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link, header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.', axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop(['Abs.'], axis=1)
Many thanks,
#divingTobi's answer:
pd.concat([df1, df2]) does the trick.
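This works because pd.concat aligns on column names and fills anything missing from one frame with NaN. A small sketch with made-up columns:
import pandas as pd
df1 = pd.DataFrame({'Candidate': ['A', 'B'], 'Poll': [40, 35], 'Abs.': [5, 6]})
df2 = pd.DataFrame({'Candidate': ['C'], 'Poll': [25]})   # missing the 'Abs.' column
# rows are stacked; columns absent from one frame are filled with NaN
combined = pd.concat([df1, df2], ignore_index=True, sort=False)
print(combined)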

How to solve a Pandas Merge Error: key must be integer or timestamp?

I'm trying to merge two pandas dataframes, one called DAILY and the other SF1.
DAILY csv:
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9
SF1 csv (not sure why ticker is indented, just ignore that):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Data sorting/cleaning code:
sf1 = sf1.drop(columns=['number','dimension', 'datekey', 'reportperiod','lastupdated', 'ev', 'evebit', 'evebitda', 'marketcap', 'pb', 'pe', 'ps'])
daily = daily.sort_values('date', ascending=True)
sf1 = sf1.sort_values('calendardate', ascending=True)
daily = daily.sort_values('ticker')
sf1 = sf1.sort_values('ticker')
Code to merge the dataframes:
df = pd.merge_asof(daily, sf1, by = 'ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
I think what may be causing the error is that the dataframes are being merged by the ticker column. I'm not sure if that has to be an int or in a date format, or any specific format; currently it is just the company ticker, as shown above.
The dataframes are being merged on the date column in the DAILY csv and the calendardate column in the SF1 csv.
It would also help if someone could explain what merging with by does, and how that changes if you only have left_on and right_on.
You are facing this problem because the date column in 'daily' and the calendardate column in 'sf1' are of type object, i.e. strings.
Just change their type to datetime with the pd.to_datetime() method, so add these 2 lines to your data sorting/cleaning code:
daily['date'] = pd.to_datetime(daily['date'])
sf1['calendardate'] = pd.to_datetime(sf1['calendardate'])
Now write:
df = pd.merge_asof(daily, sf1, by='ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
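On the side question about by versus left_on/right_on: in merge_asof, by requires an exact match (here, rows are only paired within the same ticker), while left_on/right_on are the sorted keys used for the nearest-key asof match. A minimal sketch with made-up data:
import pandas as pd
daily = pd.DataFrame({'ticker': ['A', 'A'],
                      'date': pd.to_datetime(['2020-09-14', '2020-12-01'])})
sf1 = pd.DataFrame({'ticker': ['A'],
                    'calendardate': pd.to_datetime(['2020-09-14']),
                    'assets': [7107000000]})
# exact match on ticker, then backward nearest match of date onto calendardate
out = pd.merge_asof(daily.sort_values('date'), sf1.sort_values('calendardate'),
                    by='ticker', left_on='date', right_on='calendardate',
                    tolerance=pd.Timedelta(days=100), direction='backward')
print(out)
Without by, every daily row would be matched against the nearest calendardate from any ticker, not just its own.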

How do I reclassify a data frame so that the text is indexed

df = pd.DataFrame(data=JLSData)
DataFrame = df.transpose()
DataFrame.head()
Employees = DataFrame['Employees']
That is, I want to be able to index the data frame both by date and by the column names.
Without a copy/pastable example and desired output, this is what I can give you:
df = pd.DataFrame(data=JLSData)
df = df.transpose()
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]
df_new = df[1:]
df_new.columns = df.iloc[0]
To set the date column as the index:
df_new = df_new.rename(columns={df_new.columns[0]: 'Date'})
df_new = df_new.set_index('Date')
Are you trying to rename the first column that contains dates (where the column name says "Unnamed: 0") to DATE? If so, then do this:
df = df.rename(columns={df.columns[0]: "whatever"})
where whatever is the name you want to give it.
If you can provide a copy of the data in text format, and provide more clarity, we can try to provide a better answer.

Extract holidays from a dataframe

I have a dataframe with dates as an index and values as the first column.
I want to take all of the Belgian holidays out of that dataframe and create a new dataframe.
Things I've tried:
be_holidays = holidays.BE()
#example of the data frame (same format)
index = pd.date_range(start='08/08/2018',end='08/09/2018',freq='1H')
df = pd.DataFrame([1,2,3,5,3,5,4,6,2,4,6,6,3,2,5,9,7,8,8,5,1,2,5,3,6],columns=['A'], index = index)
new_df = df.applymap(lambda x: str(df.index[x]).split()[0] in be_holidays)
new_df = df[~(str(df.index).split()[0]).isin(be_holidays)]
#for context
type(df.index[0])
#results is
pandas._libs.tslib.Timestamp
I think this would work:
import pandas as pd
# index must be a datetime
df.index = pd.to_datetime(df.index)
# boolean mask, to identify holidays
mask = df.index.isin( set(be_holidays) )
# drop holidays, keep rest (~ is `not`)
df = df.loc[~mask]
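One caveat (an assumption on my part about the holidays package): a bare holidays.BE() is populated lazily, so when building a set it helps to request the relevant years explicitly. A sketch along those lines:
import holidays
import pandas as pd

index = pd.date_range(start='2018-08-14', end='2018-08-16', freq='1H')
df = pd.DataFrame({'A': range(len(index))}, index=index)

# ask for the years present in the index so the holiday dates actually exist in the set
be_holidays = holidays.BE(years=df.index.year.unique().tolist())
holiday_dates = pd.to_datetime(list(be_holidays))   # holiday dates as Timestamps

# normalize() drops the time of day so hourly rows match whole-day holidays
mask = df.index.normalize().isin(holiday_dates)
holidays_df = df.loc[mask]   # e.g. 2018-08-15 (Assumption Day) should land here
rest_df = df.loc[~mask]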

python pandas merge multiple csv files

I have around 600 csv file datasets, all with the same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised'], all economic indicators and all time-series data sets.
The aim is to merge them all together into one csv file, with 'DateTime' as the index.
I want the merged file indexed as one timeline. For example, say the first event in the first csv is dated 12/18/2017 10:00:00, the first event in the second csv is dated 12/29/2017 09:00:00, and the first event in the third csv is dated 12/20/2017 09:00:00.
So I want to index them with the earlier events first and the newer ones after, regardless of which source csv they came from.
I tried to merge just 3 of them as an experiment, and the problem is the 'DateTime' column, because it prints the three of them side by side, like this: ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00')
Here is the code:
import pandas as pd
df1 = pd.read_csv("E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv("E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv("E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)
print(df.head())
df.to_csv('df.csv')
Consider using the read_csv() args index_col and parse_dates to create the indices during import and parse them as datetimes. Then run your needed horizontal merge. The below assumes the date is in the first column of each csv. At the end, use sort_index() on the final dataframe to sort the datetimes.
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()
And for a DRY-er approach, especially across the hundreds of csv files, use a list comprehension:
import os
...
os.chdir('E:\\Business\\Economic Indicators')
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
for f in os.listdir(os.getcwd()) if f.endswith('csv')]
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. Also, you don't need to specify a type of join; it will have no effect since the column names are the same in each dataframe.
df = pd.concat([df1, df2, df3])
should be enough in order to concatenate the datasets.
(see https://pandas.pydata.org/pandas-docs/stable/merging.html )
Your call to set_index to define an index using the values in the DateTime column should then work.
dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')
new_dataset = pd.merge(dataset_1, dataset_2, left_on='same column name', right_on='same column name', how='left')  # how can be 'left', 'right', 'inner' or 'outer'
The problem is twofold: merging the csv files into a single dataframe, and then ordering it by date.
As John Smith pointed out, to merge dataframes along rows you need to use:
df = pd.concat([df1,df2,df3])
Then you want to set an index and reorder your dataframe according to the index.
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)
or in descending order
df.sort_index(inplace=True,ascending=False)
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
import numpy as np
import pandas as pd

timeindex = pd.date_range('2018/01/01', '2018/01/10')
randtimeindex = np.random.permutation(timeindex)

# Create three dataframes
df1 = pd.DataFrame(index=range(3), data=np.random.rand(3, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df1.DateTime = randtimeindex[:3]
df2 = pd.DataFrame(index=range(3), data=np.random.rand(3, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df2.DateTime = randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4), data=np.random.rand(4, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df3.DateTime = randtimeindex[6:]
# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)
# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)
print(df4.head())
