I'm trying to merge two pandas dataframes, one called DAILY and the other SF1.
DAILY csv:
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9
SF1 csv:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Data sorting/cleaning code:
sf1 = sf1.drop(columns=['number','dimension', 'datekey', 'reportperiod','lastupdated', 'ev', 'evebit', 'evebitda', 'marketcap', 'pb', 'pe', 'ps'])
daily = daily.sort_values('date', ascending=True)
sf1 = sf1.sort_values('calendardate', ascending=True)
daily = daily.sort_values('ticker')
sf1 = sf1.sort_values('ticker')
Code to merge the dataframes:
df = pd.merge_asof(daily, sf1, by = 'ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
I think what may be causing the error is that the dataframes are being merged by the ticker column. I'm not sure whether it has to be an int, a date, or some other specific format; currently it is just the company ticker, as shown above.
The dataframes are being merged on the date column in the DAILY csv and the calendardate column in the SF1 csv.
Could someone also explain the distinction between merging with by and how that changes if you only have left_on and right_on?
You are facing this problem because your date column in 'daily' and the calendardate column in 'sf1' are of type object, i.e. strings.
Just change their type to datetime with the pd.to_datetime() method.
So just add these two lines to your data sorting/cleaning code:
daily['date']=pd.to_datetime(daily['date'])
sf1['calendardate']=pd.to_datetime(sf1['calendardate'])
Now write:
df = pd.merge_asof(daily, sf1, by = 'ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
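On the by versus left_on/right_on question: left_on and right_on name the ordered date keys that merge_asof matches approximately (here, the nearest earlier calendardate within 100 days of each date), while by names a column that must match exactly, so rows are only paired within the same ticker; without by, the nearest-date match would be made across all tickers. Below is a minimal sketch of the full flow (the file names are illustrative assumptions); note that merge_asof also requires both frames to be sorted by the date keys, so sort by date last rather than by ticker:
import pandas as pd

daily = pd.read_csv('daily.csv')   # assumed file names, for illustration
sf1 = pd.read_csv('sf1.csv')

# the asof keys must be real datetimes, not strings
daily['date'] = pd.to_datetime(daily['date'])
sf1['calendardate'] = pd.to_datetime(sf1['calendardate'])

# merge_asof requires both frames sorted by the asof key
daily = daily.sort_values('date')
sf1 = sf1.sort_values('calendardate')

# by='ticker': exact match per ticker; left_on/right_on: nearest earlier date
df = pd.merge_asof(daily, sf1, by='ticker',
                   left_on='date', right_on='calendardate',
                   tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')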
I have a dataframe with dates as an index and values as the first column.
I want to take all of the Belgian holidays out of that dataframe and create a new dataframe.
Things I've tried:
be_holidays = holidays.BE()
#example of the data frame (same format)
index = pd.date_range(start='08/08/2018',end='08/09/2018',freq='1H')
df = pd.DataFrame([1,2,3,5,3,5,4,6,2,4,6,6,3,2,5,9,7,8,8,5,1,2,5,3,6],columns=['A'], index = index)
new_df = df.applymap(lambda x: str(df.index[x]).split()[0] in be_holidays)
new_df = df[~(str(df.index).split()[0]).isin(be_holidays)]
#for context
type(df.index[0])
#result is
pandas._libs.tslib.Timestamp
I think this would work:
import pandas as pd
# index must be a datetime
df.index = pd.to_datetime(df.index)
# boolean mask, to identify holidays
mask = df.index.isin( set(be_holidays) )
# drop holidays, keep rest (~ is `not`)
df = df.loc[~mask]
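One subtlety with the example data in the question, as a non-authoritative sketch: the index has an hourly frequency and holidays.BE() is only populated lazily, so comparing each timestamp's calendar date against the calendar directly may be more reliable than isin against an (initially empty) set:
import pandas as pd
import holidays

be_holidays = holidays.BE()

# same example frame as in the question
index = pd.date_range(start='08/08/2018', end='08/09/2018', freq='1H')
df = pd.DataFrame([1, 2, 3, 5, 3, 5, 4, 6, 2, 4, 6, 6, 3,
                   2, 5, 9, 7, 8, 8, 5, 1, 2, 5, 3, 6],
                  columns=['A'], index=index)

# `ts.date() in be_holidays` populates the calendar on demand and
# ignores the hour component of each timestamp
mask = pd.Series([ts.date() in be_holidays for ts in df.index], index=df.index)
new_df = df.loc[~mask]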
I have around 600 csv datasets, all with the very same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised']; all are economic indicators and all are time-series data sets.
The aim is to merge them all together into one csv file,
with 'DateTime' as the index.
I want the merged file indexed along a single timeline. For example, say the first event in the first csv is dated 12/18/2017 10:00:00, the first event in the second csv is dated 12/29/2017 09:00:00, and the first event in the third csv is dated 12/20/2017 09:00:00.
I want the rows indexed in date order, one after the other, regardless of which source csv each row originally came from.
I tried to merge just 3 of them as an experiment, and the problem is 'DateTime': it prints the three dates together, like ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00').
Here is the code:
import pandas as pd
df1 = pd.read_csv("E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv("E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv("E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)
print(df.head())
df.to_csv('df.csv')
Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run your horizontal merge. The code below assumes the date is in the first column of each csv. At the end, use sort_index() on the final dataframe to sort the datetimes.
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()
And for a DRY-er approach, especially across the hundreds of csv files, use a list comprehension:
import os
...
os.chdir('E:\\Business\\Economic Indicators')
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
for f in os.listdir(os.getcwd()) if f.endswith('csv')]
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
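One hedged aside: with axis=1, join='inner' keeps only the DateTime values present in every file. If the 600 indicators are released on different dates, the default join='outer' (which keeps the union of all dates) may be closer to the goal of a single timeline:
finaldf = pd.concat(dfs, axis=1, join='outer').sort_index()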
You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. Also, you don't need to specify a type of join; it will have no effect, since the column names are the same for each dataframe.
df = pd.concat([df1, df2, df3])
should be enough in order to concatenate the datasets.
(see https://pandas.pydata.org/pandas-docs/stable/merging.html )
Your call to set_index to define an index using the values in the DateTime column should then work.
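A minimal sketch of that flow, with the extra (assumed) step of parsing DateTime first; if the column stays as strings like '12/18/2017 10:00:00', the index sorts lexicographically rather than chronologically:
df = pd.concat([df1, df2, df3])                   # stack rows; axis=0 is the default
df['DateTime'] = pd.to_datetime(df['DateTime'])   # parse strings into datetimes
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)                       # chronological order
df.to_csv('df.csv')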
dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')
new_dataset = pd.merge(dataset_1, dataset_2, left_on='same column name', right_on='same column name', how='left')  # how: type of join, e.g. 'left'
The problem is twofold: merging the csvs into a single dataframe, and then ordering it by date.
As John Smith pointed out, to merge dataframes along rows you need to use:
df = pd.concat([df1,df2,df3])
Then you want to set an index and reorder your dataframe according to the index.
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)
or in descending order
df.sort_index(inplace=True,ascending=False)
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
import numpy as np
import pandas as pd

# Build a shuffled range of dates to spread across the demo dataframes
timeindex = pd.date_range('2018/01/01','2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
columns=['Actual','Consensus','DateTime'])
df1.DateTime=randtimeindex[:3]
df2 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
columns=['Actual','Consensus','DateTime'])
df2.DateTime=randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4),data=np.random.rand(4,3),
columns=['Actual','Consensus','DateTime'])
df3.DateTime=randtimeindex[6:]
# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)
# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)
print(df4.head())