Pandas concatenation deleting original index - python

I am trying to concatenate two dataframes using the code below. df1 is a daily update of values for the indexes in df2, which is an ongoing monthly dataset. df3 is the result that gets saved.
The problem I am experiencing is that when an index value is not in df1 (no values for that particular day), it gets deleted from df3 altogether. In other words, if an index value from df2 is not in df1, it doesn't appear in df3 at all.
How can I keep the original index of df2, so that if an index value is not in df1, it doesn't get deleted? I also cannot enter 0 values, as it is relevant to the data that it is empty.
import os
import glob
import pandas as pd

def Monthly_aggregation_merge(month, date):
    # file to be merged
    df1 = pd.read_csv(r'Data\{}\{}\Aggregated\Aggregated_Daily_All.csv'.format(month, date),
                      usecols=['CU', 'Parameters', 'Total/Max/Min'], index_col=[0, 1])
    df1 = df1.rename(columns={'Total/Max/Min': date})  # change column name
    # original file that the data should be merged with
    df2 = pd.read_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month), index_col=[0, 1])
    df3 = pd.concat([df2, df1], axis=1).reindex(df1.index)
    df3.to_csv(r'Data\{}\MonthlyData\July2017NEW.csv'.format(month))
    print('Monthly Merge Done!')
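Since the `.reindex(df1.index)` call is what drops the monthly rows, one fix is to reindex on the union of the two indexes instead (or omit the reindex entirely, since `pd.concat` with `axis=1` already aligns on the union of indexes). A minimal sketch with made-up `(CU, Parameters)` data, since the real CSVs aren't shown:

```python
import pandas as pd

# Hypothetical monthly data (df2) and a daily update (df1) that
# covers only some of the monthly (CU, Parameters) index pairs.
idx2 = pd.MultiIndex.from_tuples(
    [('A', 'p1'), ('A', 'p2'), ('B', 'p1')], names=['CU', 'Parameters'])
df2 = pd.DataFrame({'01/07': [1.0, 2.0, 3.0]}, index=idx2)

idx1 = pd.MultiIndex.from_tuples(
    [('A', 'p1'), ('B', 'p1')], names=['CU', 'Parameters'])
df1 = pd.DataFrame({'02/07': [10.0, 30.0]}, index=idx1)

# Reindexing on df1.index alone would drop ('A', 'p2'); reindexing on
# the union keeps rows missing from df1, filled with NaN.
df3 = pd.concat([df2, df1], axis=1).reindex(df2.index.union(df1.index))
print(df3)
```

Rows present only in df2 survive with NaN in the new date column, which preserves the "empty means no data that day" semantics instead of writing zeros.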

Related

I have a list of three dataframes. I want to update df1's values into df2, and then df2's values into df3, so that my final df3 contains all the updated values.

My code is:
dfs = [df1,df2,df3]
le = dfs[0].drop_duplicates(subset=['id'])
df = dfs[1].set_index('id')
df.update(le.set_index('id'))
df.reset_index(inplace=True)
le1 = df.drop_duplicates(subset=['id'])
df1 = dfs[2].set_index('id')
df1.update(le1.set_index('id'))
df1.reset_index(inplace=True)
My final output is:
myfinalupdateddf = df1
How can I make the above code dynamic by using a for loop, instead of creating multiple variables, to get my final output?
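One way to make this dynamic is to fold the frames pairwise in a loop, carrying the updated result forward each time. A sketch with dummy data, assuming every frame has an 'id' column as in the question:

```python
import pandas as pd

# Hypothetical frames sharing an 'id' column; df1 updates df2,
# and the result then updates df3.
df1 = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'val': ['x', 'y', 'z']})
df3 = pd.DataFrame({'id': [1, 2, 3, 4], 'val': ['p', 'q', 'r', 's']})

dfs = [df1, df2, df3]
updated = dfs[0]
for nxt in dfs[1:]:
    # Propagate values from the carried frame into the next one,
    # aligning on 'id' just as in the original code.
    target = nxt.set_index('id')
    target.update(updated.drop_duplicates(subset=['id']).set_index('id'))
    updated = target.reset_index()

myfinalupdateddf = updated
print(myfinalupdateddf)
```

The loop body is the same set_index/update/reset_index pattern as the original; only the variable juggling is replaced by the `updated` accumulator, so it works unchanged for any number of frames in `dfs`.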

How to add a column from df1 to df2 if it not present in df2, else do nothing

I have 2 dataframes from a basic web scrape using Pandas (below). The second table has fewer columns than the first, and I need to concat the dataframes. I have been manually inserting columns for a while, but since they change frequently I would like a function that checks the columns of df1 against df2 and, for any column missing from df2, adds it before concatenating.
import pandas as pd
link = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_French_presidential_election'
df = pd.read_html(link,header=0)
df1 = df[1]
df1 = df1.drop([0])
df1 = df1.drop('Abs.',axis=1)
df2 = df[2]
df2 = df2.drop([0])
df2 = df2.drop(['Abs.'],axis=1)
Many thanks,
#divingTobi's answer:
pd.concat([df1, df2]) does the trick.
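To see why plain concat suffices: `pd.concat` aligns on column names and fills any column missing from one frame with NaN, so no manual column insertion is needed. A minimal sketch with made-up polling-style columns standing in for the scraped tables:

```python
import pandas as pd

# df1 has an 'Other' column that df2 lacks.
df1 = pd.DataFrame({'Macron': [24, 25], 'Le Pen': [26, 27], 'Other': [5, 4]})
df2 = pd.DataFrame({'Macron': [23], 'Le Pen': [28]})

# Rows coming from df2 gain an 'Other' column filled with NaN automatically.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```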

Concat dataframes - One with no column name

So I have 2 csv files with the same number of columns. The first csv file has named columns (age, sex). The second file doesn't name its columns like the first one, but its data corresponds to the matching columns of the first csv file. How can I concat them properly?
First csv.
Second csv.
This is how I read my files:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header=None)
I tried using concat() like this, but I get 4 columns as a result:
df = pd.concat([df1, df2])
You can also use the append function (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the modern replacement). Be careful to have the same column names for both, otherwise you will end up with 4 columns.
Check this link, I found it very useful.
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header=None)
df2.columns = df1.columns
df = df1.append(df2, ignore_index=True)
I found a solution. After reading the second file I added
df2.columns = df1.columns
Works just like I wanted. I guess I'd better research more next time :). Thanks
Final code:
df1 = pd.read_csv("input1.csv")
df2 = pd.read_csv("input2.csv", header = None)
df2.columns = df1.columns
df = pd.concat([df1, df2])

python pandas merge multiple csv files

I have around 600 CSV datasets, all with the very same column names ['DateTime', 'Actual', 'Consensus', 'Previous', 'Revised'], all economic indicators and all time-series datasets.
The aim is to merge them all together into one CSV file, with 'DateTime' as the index.
I want the file indexed chronologically: say the first event in the first CSV is dated 12/18/2017 10:00:00, the first event in the second CSV is dated 12/29/2017 09:00:00, and the first event in the third CSV is dated 12/20/2017 09:00:00. I want them ordered by date, regardless of which source CSV each row originally came from.
I tried to merge just 3 of them as an experiment, and the problem is the 'DateTime' column, because it prints the 3 of them side by side, like this: ('12/18/2017 10:00:00', '12/29/2017 09:00:00', '12/20/2017 09:00:00')
Here is the code:
import pandas as pd
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv")
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv")
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv")
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.set_index('DateTime', inplace=True)
print(df.head())
df.to_csv('df.csv')
Consider using the read_csv() arguments index_col and parse_dates to create the index during import and parse it as datetime. Then run your needed horizontal merge. The below assumes the date is in the first column of each CSV. At the end, use sort_index() on the final dataframe to sort the datetimes.
df1 = pd.read_csv(r"E:\Business\Economic Indicators\Consumer Price Index - Core (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df2 = pd.read_csv(r"E:\Business\Economic Indicators\Private loans (YoY) - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
df3 = pd.read_csv(r"E:\Business\Economic Indicators\Current Account s.a - European Monetary Union.csv",
index_col=[0], parse_dates=[0])
finaldf = pd.concat([df1, df2, df3], axis=1, join='inner').sort_index()
And for a DRY-er approach, especially across the hundreds of CSV files, use a list comprehension:
import os
...
os.chdir('E:\\Business\\Economic Indicators')
dfs = [pd.read_csv(f, index_col=[0], parse_dates=[0])
for f in os.listdir(os.getcwd()) if f.endswith('csv')]
finaldf = pd.concat(dfs, axis=1, join='inner').sort_index()
You're trying to build one large dataframe out of the rows of many dataframes that all have the same column names. axis should be 0 (the default), not 1. Also, you don't need to specify a type of join; it will have no effect since the column names are the same for each dataframe.
df = pd.concat([df1, df2, df3])
should be enough in order to concatenate the datasets.
(see https://pandas.pydata.org/pandas-docs/stable/merging.html )
Your call to set_index to define an index using the values in the DateTime column should then work.
dataset_1 = pd.read_csv('csv path')
dataset_2 = pd.read_csv('csv path')
new_dataset = pd.merge(dataset_1, dataset_2, left_on='same column name', right_on=('same column name'), how=('how to join ex:left'))
The problem is twofold: merging the CSVs into a single dataframe, and then ordering it by date.
As John Smith pointed out, to merge dataframes along rows you need:
df = pd.concat([df1,df2,df3])
Then you want to set an index and reorder your dataframe according to the index.
df.set_index('DateTime', inplace=True)
df.sort_index(inplace=True)
or in descending order
df.sort_index(inplace=True,ascending=False)
(see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html)
import numpy as np
import pandas as pd

timeindex = pd.date_range('2018/01/01', '2018/01/10')
randtimeindex = np.random.permutation(timeindex)
# Create three dataframes
df1 = pd.DataFrame(index=range(3), data=np.random.rand(3, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df1.DateTime = randtimeindex[:3]
df2 = pd.DataFrame(index=range(3), data=np.random.rand(3, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df2.DateTime = randtimeindex[3:6]
df3 = pd.DataFrame(index=range(4), data=np.random.rand(4, 3),
                   columns=['Actual', 'Consensus', 'DateTime'])
df3.DateTime = randtimeindex[6:]
# Merge them
df4 = pd.concat([df1, df2, df3], axis=0)
# Reindex the merged dataframe, and sort it
df4.set_index('DateTime', inplace=True)
df4.sort_index(inplace=True, ascending=False)
print(df4.head())

pandas merge column by another column

I have two Excel files, loaded as df1 and df2.
df1.columns : url, content, ortheryy
df2.columns : url, content, othterxx
Some contents in df1 are empty, and df1 and df2 share some urls (not all).
What I want to do is fill df1's empty content from df2 when the row has the same url.
I tried
ndf = pd.merge(df1, df2[['url', 'content']], on='url', how='left')
# how='inner' result same
Which results in two columns: content_x and content_y.
I know it can be solved by looping through df1 and df2, but I'd like to do it the pandas way.
I think you need Series.combine_first or Series.fillna:
df1['content'] = df1['content'].combine_first(ndf['content_y'])
Or:
df1['content'] = df1['content'].fillna(ndf['content_y'])
It works because the left join creates in ndf the same index values as df1 has.
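A minimal sketch of the merge-then-fill pattern, using dummy urls and hypothetical 'otheryy'/'otherxx' columns standing in for the question's extra columns:

```python
import pandas as pd

df1 = pd.DataFrame({'url': ['u1', 'u2', 'u3'],
                    'content': ['kept', None, None],
                    'otheryy': [1, 2, 3]})
df2 = pd.DataFrame({'url': ['u2', 'u4'],
                    'content': ['filled', 'unused'],
                    'otherxx': [9, 8]})

# Left join keeps df1's row order and index, so ndf aligns with df1;
# the duplicated 'content' columns become content_x and content_y.
ndf = pd.merge(df1, df2[['url', 'content']], on='url', how='left')
df1['content'] = df1['content'].combine_first(ndf['content_y'])
print(df1)
```

Only rows whose url also appears in df2 get filled; a url like 'u3' with no match stays empty, which matches the question's requirement.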
