python pandas merge multiple csv files


I have around 600 CSV files that all share the same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised']. They are all economic indicators, and all are time series.
The aim is to merge them all into one CSV file, with 'DateTime' as the index.
I want this file indexed chronologically. For example, say the first event in the first CSV is dated 12/18/2017 10:00:00, the first event in the second CSV is dated 12/29/2017 09:00:00, and the first event in the third CSV is dated 12/20/2017 09:00:00.
I want the rows ordered with the earlier events first and the later ones after them, regardless of which source CSV they came from.
I tried merging just 3 of them as an experiment, and the problem is 'DateTime': it prints the three values together, like this: ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00')
Here is the code:
import pandas as pd
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)
print(df.head())
df.to_csv('df.csv')

Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run the horizontal merge you need. The code below assumes the date is in the first column of each CSV. At the end, call sort_index() on the final dataframe to sort by datetime.
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
                  index_col=[0], parse_dates=[0])
finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()
For a DRY-er approach, especially across hundreds of CSV files, use a list comprehension:
import os
...
os.chdir('E:\\Business\\Economic Indicators')
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
       for f in os.listdir(os.getcwd()) if f.endswith('csv')]
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()

You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. You also don't need to specify a join type: it has no effect, since the column names are the same in every dataframe.
df = pd.concat([df1, df2, df3])
should be enough in order to concatenate the datasets.
(see https://pandas.pydata.org/pandas-docs/stable/merging.html )
Your call to set_index to define an index using the values in the DateTime column should then work.
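As a rough sketch of how this row-wise approach might scale to a whole directory of files: the helper name merge_csvs, the extra Source column, and the two tiny temporary files below are all made up for illustration, and the code assumes every CSV has a 'DateTime' column in month/day/year format.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csvs(directory):
    """Stack every CSV in `directory` row-wise and sort the rows by DateTime."""
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    frames = []
    for path in paths:
        frame = pd.read_csv(path, parse_dates=["DateTime"])
        # Hypothetical extra column: remember which file each row came from.
        frame["Source"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(frame)
    merged = pd.concat(frames, ignore_index=True)  # axis=0 is the default
    return merged.set_index("DateTime").sort_index()


# Tiny demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"DateTime": ["12/29/2017 09:00:00"], "Actual": [1.0]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"DateTime": ["12/18/2017 10:00:00"], "Actual": [2.0]}).to_csv(
        os.path.join(tmp, "b.csv"), index=False)
    merged = merge_csvs(tmp)

print(merged)
```

Keeping a Source column is optional, but with identical column names in every file it is otherwise impossible to tell which indicator a row belongs to after the merge.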

dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')
new_dataset = pd.merge(dataset_1, dataset_2, left_on='same column name', right_on='same column name', how='left')  # how can be 'left', 'right', 'inner' or 'outer'

The problem is twofold: merging the CSV files into a single dataframe, and then ordering it by date.
As John Smith pointed out, to merge dataframes along rows you need to use:
df = pd.concat([df1,df2,df3])
Then you want to set an index and reorder your dataframe according to the index.
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)
or in descending order
df.sort_index(inplace=True,ascending=False)
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
import numpy as np
import pandas as pd
timeindex = pd.date_range('2018/01/01','2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                   columns=['Actual','Consensus','DateTime'])
df1.DateTime=randtimeindex[:3]
df2 = pd.DataFrame(index=range(3),data=np.random.rand(3,3),
                   columns=['Actual','Consensus','DateTime'])
df2.DateTime=randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4),data=np.random.rand(4,3),
                   columns=['Actual','Consensus','DateTime'])
df3.DateTime=randtimeindex[6:]
# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)
# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)
print(df4.head())

Related

How to add a column from df1 to df2 if it not present in df2, else do nothing

I have two dataframes from a basic web scrape using Pandas (below). The second table has fewer columns than the first, and I need to concat the dataframes. I have been inserting columns manually for a while, but since they change frequently I would like a function that can go through the columns in df1, check whether they are all in df2, and if not, add the missing column with the data from df1.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link,header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.',axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop(['Abs.'],axis=1)
Many thanks,

@divingTobi's answer:
pd.concat([df1, df2]) does the trick.
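A small sketch of why this works: pd.concat aligns columns by name and fills the ones missing from a frame with NaN. The frames and the column names Poll, X and Y here are invented for illustration and are not the scraped tables.

```python
import pandas as pd

# Made-up stand-ins: df2 lacks column "Y" that df1 has.
df1 = pd.DataFrame({"Poll": ["first"], "X": [24.0], "Y": [22.0]})
df2 = pd.DataFrame({"Poll": ["second"], "X": [25.0]})

# Columns are matched by name; the missing "Y" in df2's row becomes NaN.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```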

How to solve a Pandas Merge Error: key must be integer or timestamp?

I'm trying to merge two pandas dataframes, one is called DAILY and the other SF1.
DAILY csv:
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9
SF1 csv (not sure why ticker is indented, just ignore that):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Datasorting/cleaning code:
sf1 = sf1.drop(columns=['number','dimension', 'datekey', 'reportperiod','lastupdated', 'ev', 'evebit', 'evebitda', 'marketcap', 'pb', 'pe', 'ps'])
daily = daily.sort_values('date', ascending=True)
sf1 = sf1.sort_values('calendardate', ascending=True)
daily = daily.sort_values('ticker')
sf1 = sf1.sort_values('ticker')
Code to merge the dataframes:
df = pd.merge_asof(daily, sf1, by = 'ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
I think the error may be caused by merging the dataframes by the ticker column. I'm not sure whether that column has to be an int, a date, or some other specific format. Currently it just holds company tickers, as shown above.
The dataframes are being merged on the date column in the DAILY csv and the calendardate column in the SF1 csv.
Could someone also explain what merging with by does, and how the result changes if you only pass left_on and right_on?

You are facing this problem because the date column in 'daily' and the calendardate column in 'sf1' are of type object, i.e. strings.
Just change their type to datetime with pd.to_datetime().
Add these two lines to your data sorting/cleaning code:
daily['date']=pd.to_datetime(daily['date'])
sf1['calendardate']=pd.to_datetime(sf1['calendardate'])
Now write:
df = pd.merge_asof(daily, sf1, by = 'ticker', left_on='date', right_on='calendardate', tolerance=pd.Timedelta(value=100, unit='D'), direction='backward')
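The following is a minimal sketch of the whole sequence on invented data (the tickers, dates and values are not from the real files). It also illustrates the role of by: with by='ticker', a row is only matched against rows carrying the same ticker before the nearest earlier date is searched; with only left_on and right_on, rows would be matched on date alone, across all tickers. Note that merge_asof expects both frames to be sorted by the date keys, so the sketch sorts by date last.

```python
import pandas as pd

# Invented sample rows; dates start out as strings, as they do after read_csv.
daily = pd.DataFrame({
    "ticker": ["A", "B", "A"],
    "date": ["2020-09-14", "2020-09-14", "2020-09-15"],
    "pe": [44.4, 10.0, 44.9],
})
sf1 = pd.DataFrame({
    "ticker": ["A", "B"],
    "calendardate": ["2020-06-30", "2020-09-01"],
    "assets": [100, 200],
})

# Convert the merge keys from object (string) to datetime.
daily["date"] = pd.to_datetime(daily["date"])
sf1["calendardate"] = pd.to_datetime(sf1["calendardate"])

# merge_asof requires both frames to be sorted by the "on" keys.
daily = daily.sort_values("date")
sf1 = sf1.sort_values("calendardate")

merged = pd.merge_asof(
    daily, sf1,
    by="ticker",                      # exact match on ticker first
    left_on="date", right_on="calendardate",
    tolerance=pd.Timedelta(days=100),
    direction="backward",             # take the latest earlier report
)
print(merged)
```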

Python: Pandas merge leads to NaN

I'm trying to execute a merge with pandas. The two files have a common key ("KEY_PLA") which I'm trying to use with a left join. But unfortunately, all columns which are transferred from the second file to the first file have NaN values.
Here is what I have done so far:
df_1 = pd.read_excel(path1, skiprows=1)
df_2 = pd.read_excel(path2, skiprows=1)
df_1.columns = ["Index", "KEY", "KEY_PLA", "INFO1", "INFO2"]
df_2.columns = ["Index", "KEY_PLA", "INFO4"]
df_1.drop(["Index"], axis=1, inplace=True)
df_2.drop(["Index"], axis=1, inplace=True)
# Merge all dataframes
df_merge = pd.DataFrame()
df_merge = df_1.merge(df_2, left_on="KEY_PLA", right_on="KEY_PLA", how="left")
print(df_merge)
What is wrong with the code? I also checked the types and even converted the columns in strings. But nothing works.

I think the problem is that the joined KEY_PLA columns have different types: one holds strings and the other integers.
The solution is to cast both to the same type, e.g. to int:
print(df_1['KEY_PLA'].dtype)  # object
print(df_2['KEY_PLA'].dtype)  # int64
df_1['KEY_PLA'] = df_1['KEY_PLA'].astype(int)
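A minimal sketch of the cast followed by the merge, on made-up frames. It assumes the string keys are clean numeric strings; values with stray whitespace or non-numeric text would need cleaning before astype.

```python
import pandas as pd

# Made-up frames: the key is stored as strings in df_1 and as integers in df_2.
df_1 = pd.DataFrame({"KEY_PLA": ["1", "2"], "INFO1": ["a", "b"]})
df_2 = pd.DataFrame({"KEY_PLA": [1, 2], "INFO4": ["x", "y"]})

# Cast the string keys so that both sides share the same dtype.
df_1["KEY_PLA"] = df_1["KEY_PLA"].astype("int64")

df_merge = df_1.merge(df_2, on="KEY_PLA", how="left")
print(df_merge)
```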

Pandas concatenation deleting original index

I am trying to concat two dataframes using the below code. df1 is a daily update of values to the indexes in df2, which is an ongoing monthly dataset. df3 is the result that is saved.
The problem I am experiencing is that when an index value is not in df1 (no values for that particular day), it gets deleted from df3 altogether. In other words, if the index value is not in df1, then it doesn't appear in df3 at all.
How can I keep the original index of df3, so that if the index value is not in df1, it doesn't delete it? I also cannot enter 0 values, as it is relevant to the data that it is empty.
import os
import pandas as pd
import glob
def Monthly_aggregation_merge(month, date):
    # file to be merged
    df1 = pd.read_csv(r'Data\{}\{}\Aggregated\Aggregated_Daily_All.csv'.format(month,date), usecols=['CU', 'Parameters', 'Total/Max/Min'], index_col =[0,1])
    df1 = df1.rename(columns = {'Total/Max/Min':date}) # Change column name
    # original file that data should be merged with
    df2 = pd.read_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month), index_col = [0,1])
    df3 = pd.concat([df2, df1], axis=1).reindex(df1.index)
    df3.to_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month))
    print('Monthly Merge Done!')
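One possible explanation, sketched on invented data with a simple index instead of the two-level index used above: the trailing .reindex(df1.index) keeps only the rows present in the daily file, whereas pd.concat with axis=1 on its own keeps the union of both indexes and leaves the missing cells as NaN.

```python
import pandas as pd

# Invented data: the monthly frame has rows a, b, c; the daily update lacks b.
df2 = pd.DataFrame({"2017-07-01": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
df1 = pd.DataFrame({"2017-07-02": [10.0, 30.0]}, index=["a", "c"])

# Reindexing to df1's index drops every row that df1 does not have.
dropped = pd.concat([df2, df1], axis=1).reindex(df1.index)

# Without the reindex, the outer join keeps all rows; "b" is NaN for the new day.
kept = pd.concat([df2, df1], axis=1)

print(dropped)
print(kept)
```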

Concatenate/Merge/Join two different Dataframes Pandas

I am looking to join two dataframes using pandas on the 'Date' columns. I usually use df2 = pd.concat([df, df1], axis=1); however, for some reason this is not working.
In this example, I am pulling the data from a SQL file, creating a new column called 'Date' that merges my year and month columns, and then pivoting. When I try to concatenate the two dataframes, they show up side by side instead of merged together.
What comes up:
Date Count of Cats Date Count of Dogs
What I want to come up:
Date Count of Cats Count of Dogs
Any ideas?
My other problem is that I am trying to make sure the Date column is written to Excel as a string and not as a datetime. Please keep this in mind when thinking about a solution.
Here is my code:
executeScriptsFromFile('cats.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1':'3/31','Q2':'6/30','Q3':'9/30','Q4':'12/31'}
df['Date']=df['QUARTER'].map(monthend)+'/'+ df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df10= df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum,fill_value=0)
df10.reset_index(drop=False, inplace=True)
df10.reindex_axis(['Breed', 'Count of Cats'], axis=1)
df10.columns = ('Breed', 'Count of Cats')
executeScriptsFromFile('dogs.sql')
df = pd.DataFrame(cursor.fetchall())
df.columns = [rec[0] for rec in cursor.description]
monthend = {'Q1':'3/31','Q2':'6/30','Q3':'9/30','Q4':'12/31'}
df['Date']=df['QUARTER'].map(monthend)+'/'+ df['YEAR']
df['Date'] = pd.to_datetime(df['Date'])
df11= df.pivot_table(['Breed'], ['Date'], aggfunc=np.sum,fill_value=0)
df11.reset_index(drop=False, inplace=True)
df11.reindex_axis(['Breed', 'Count of Dogs'], axis=1)
df11.columns = ('Breed', 'Count of Dogs')
df11a= df11.round(0)
df12= pd.concat([df10, df11a],axis=1)

I think you have to remove these lines:
df10.reset_index(drop=False, inplace=True)
df11.reset_index(drop=False, inplace=True)
because Date needs to stay in the index for concat to align the rows by date.
Also, to convert the index to strings, use:
df.index = df.index.astype(str)
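A short sketch of the idea on made-up counts: with Date kept as the index of both frames, concat along axis=1 lines the rows up by date and yields a single Date index. Using strftime for the string conversion is a swapped-in alternative to astype(str) that also controls the date format.

```python
import pandas as pd

# Made-up pivot results, both indexed by Date.
dates = pd.to_datetime(["3/31/2017", "6/30/2017"])
cats = pd.DataFrame({"Count of Cats": [3, 5]}, index=pd.Index(dates, name="Date"))
dogs = pd.DataFrame({"Count of Dogs": [2, 4]}, index=pd.Index(dates, name="Date"))

# Rows are aligned on the shared Date index, so Date appears only once.
combined = pd.concat([cats, dogs], axis=1)

# Turn the datetime index into plain strings before writing to Excel.
combined.index = combined.index.strftime("%m/%d/%Y")
print(combined)
```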
