Merging data from python list into one dataframe - python
I have the files AAMC_K.txt, AAU.txt, ACU.txt, ACY.txt in a folder called AMEX, and I am trying to merge these text files into one dataframe. I tried pd.merge(), but I get an error saying the merge function needs a left and a right parameter, and my data is in a Python list. How can I merge the data in data_list into one pandas dataframe?
import pandas as pd
import os
textfile_names = os.listdir("AMEX")
textfile_names.sort()
data_list = []
for i in range(len(textfile_names)):
    data = pd.read_csv("AMEX/"+textfile_names[i], index_col=None, header=0)
    data_list.append(data)
frame = pd.merge(data_list, on='<DTYYYYMMDD>', how='outer')
AE.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
AAU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
AAU,D,20020513,000000,0.4220,0.4220,0.4220,0.4220,0,0
AAU,D,20020514,000000,0.4177,0.4177,0.4177,0.4177,0,0
ACU.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0
ACY.txt
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACY,D,19980116,000000,9.7500,9.7500,8.8125,8.8125,289,0
ACY,D,19980120,000000,8.7500,8.7500,8.1250,8.1250,151,0
I want the rows to be matched on <DTYYYYMMDD> and put into one dataframe, frame.
OUTPUT
<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>,<TICKER>,<PER>,<DTYYYYMMDD>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>
ACU,D,19970102,000000,5.2500,5.3750,5.1250,5.1250,52,0,AE,D,19970102,000000,12.6250,12.6250,11.7500,11.7500,144,0
ACU,D,19970103,000000,5.1250,5.2500,5.0625,5.2500,12,0,AE,D,19970103,000000,11.8750,12.1250,11.8750,12.1250,25,0
As @busybear says, pd.concat is the right tool for this job: frame = pd.concat(data_list).
merge is for when you're joining two dataframes which usually have some of the same columns and some different ones. You choose a column (or index or multiple) which identifies which rows in the two dataframes correspond to each other, and pandas handles making a dataframe whose rows are combinations of the corresponding rows in the two original dataframes. This function only works on 2 dataframes at a time; you'd have to do a loop to merge more in (it's uncommon to need to merge many dataframes this way).
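If you do need the "loop to merge more in" route, here is a minimal sketch of what it could look like. The in-memory CSV strings are trimmed-down stand-ins for the question's files (only three columns kept), and renaming the columns with the ticker is my own assumption to keep the names unique after several merges; it is not something the question asked for.

```python
from functools import reduce
from io import StringIO

import pandas as pd

KEY = "<DTYYYYMMDD>"

# Trimmed-down stand-ins for the question's files (only three columns kept).
raw = {
    "AE": "<TICKER>,<DTYYYYMMDD>,<CLOSE>\nAE,19970102,11.7500\nAE,19970103,12.1250\n",
    "ACU": "<TICKER>,<DTYYYYMMDD>,<CLOSE>\nACU,19970102,5.1250\nACU,19970103,5.2500\n",
    "AAU": "<TICKER>,<DTYYYYMMDD>,<CLOSE>\nAAU,20020513,0.4220\n",
}

data_list = []
for ticker, text in raw.items():
    df = pd.read_csv(StringIO(text))
    # Assumption: tag every non-key column with the ticker so that column
    # names stay unique no matter how many frames get merged in.
    df = df.rename(columns=lambda c: c if c == KEY else f"{c}_{ticker}")
    data_list.append(df)

# pd.merge takes exactly two frames, so fold it over the list pairwise.
frame = reduce(
    lambda left, right: pd.merge(left, right, on=KEY, how="outer"),
    data_list,
)
print(frame)
```

With how="outer", dates that appear in only some of the files are kept, and the missing tickers get NaN in their columns.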
concat is for when you have multiple dataframes and want to just append all of their rows or columns into one large dataframe. (Let's assume you're concatenating rows, as you want here.) It doesn't use an identifier to determine which rows correspond. All it does is create a new dataframe which has each row from each of the concatenated dataframes (all the rows from the first, then all from the second, etc.).
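A small sketch of the row-wise concat, using two shortened copies of the question's sample files as in-memory strings instead of reading from the AMEX folder (the trimmed column set is just for brevity):

```python
from io import StringIO

import pandas as pd

# Shortened stand-ins for AE.txt and ACU.txt from the question.
ae = pd.read_csv(StringIO(
    "<TICKER>,<DTYYYYMMDD>,<CLOSE>\n"
    "AE,19970102,11.7500\n"
    "AE,19970103,12.1250\n"
))
acu = pd.read_csv(StringIO(
    "<TICKER>,<DTYYYYMMDD>,<CLOSE>\n"
    "ACU,19970102,5.1250\n"
    "ACU,19970103,5.2500\n"
))

data_list = [ae, acu]

# Stack all rows: first every row of ae, then every row of acu.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1.
frame = pd.concat(data_list, ignore_index=True)
print(frame)
```

Unlike merge, concat accepts the whole list in one call, which is why it fits the data_list from the question directly.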
I think the above is a decent TLDR on merge vs concat but see here for a lengthy but much more comprehensive guide on using merge/join/concat with dataframes.
Related
Pandas: IndexingError: Unalignable boolean Series provided as indexer for merging two dataframes into one
I am trying to write code to reconcile an invoice listing against credit card receipts. My plan was to turn each of these into a separate dataframe, and then merge the two dataframes in pandas based on two columns (amount paid and date). After merging, I would like to identify which columns are the ones that have been merged, and then create another spreadsheet with the remaining rows of each dataframe that have not been merged. When I run the code, I get the error "IndexingError: Unalignable boolean Series provided as indexer for merging two dataframes into one", and I'm not sure which part of the code is raising the error or how to fix it. I would like any advice on how to fix this error so the code can run. My code looks like this:
import pandas as pd
# Read in the specific sheets from the workbook
first_spreadsheet = pd.read_excel('C:/Users/ianch/Downloads/revenue_sep22.xlsx', sheet_name='Precious CMS')
second_spreadsheet = pd.read_excel('C:/Users/ianch/Downloads/revenue_sep22.xlsx', sheet_name='First Data')
second_spreadsheet.insert(0, 'Index', range(1, 1+len(second_spreadsheet)))
filtered_first_spreadsheet = first_spreadsheet[first_spreadsheet['Mode'].isin(['MasterCard','Visa'])]
# Merge the two dataframes using left_on and right_on parameters
merged_df = pd.merge(filtered_first_spreadsheet, second_spreadsheet, left_on=['Date','Payment'], right_on=['Date','Payment'])
merged_df1 = merged_df.drop_duplicates(subset=['Patient','Payment'], keep='first')
# Separate out the merged dataframe into two original dataframes
newfirst = merged_df1.iloc[:,:10]
newsecond = merged_df1.iloc[:,10:]
# Create filters for indexes from the separated-merged-dataframe that are in the original dataframe
invnofilter = newfirst['#'].isin(first_spreadsheet['#'])
transactnofilter = newsecond['Index'].isin(second_spreadsheet['Index'])
filteredfirst = first_spreadsheet[invnofilter].reset_index()
filteredfirst.to_excel("C:/Users/ianch/Downloads/filteredpreciouscms.xlsx")
# Save the merged dataframe to a new excel file
merged_df1.to_excel("C:/Users/ianch/Downloads/merged_spreadsheet4.xlsx", sheet_name="Merged_Data", index=False)
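A likely cause, going only by the code shown: invnofilter is a boolean Series indexed like merged_df1, but it is used to index first_spreadsheet, whose index is different, so pandas cannot align the two. One way to sidestep the masks entirely is an outer merge with indicator=True. The sketch below uses made-up stand-in data; only the column names "#", "Index", "Date" and "Payment" come from the question, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets (values are invented).
invoices = pd.DataFrame({
    "#": [101, 102, 103],
    "Date": ["2022-09-01", "2022-09-01", "2022-09-02"],
    "Payment": [50.0, 75.0, 20.0],
})
receipts = pd.DataFrame({
    "Index": [1, 2, 3],
    "Date": ["2022-09-01", "2022-09-02", "2022-09-03"],
    "Payment": [50.0, 20.0, 99.0],
})

# indicator=True adds a "_merge" column saying where each row came from:
# "both", "left_only" or "right_only".
merged = pd.merge(
    invoices, receipts, on=["Date", "Payment"], how="outer", indicator=True
)

matched = merged[merged["_merge"] == "both"]
invoices_only = merged[merged["_merge"] == "left_only"]
receipts_only = merged[merged["_merge"] == "right_only"]

print(matched)
print(invoices_only)
print(receipts_only)
```

Each of the three pieces can then be written out with to_excel as in the original code. Note that this does not reproduce the drop_duplicates step from the question, which would still need to be applied to the matched rows if duplicates matter.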
Merge multiple df in python and keep the same rows only one time
I am trying to merge multiple dataframes and create a new dataframe containing all the rows from each dataframe, but with rows that are the same appearing only once. For example, the dataframes that I have as input: (image: input dataframes). The dataframe that I want to have as output: (image: output dataframe). Do you know if there is a way to do that? If you could help me, I would be more than thankful! Thanks, Eleni
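The original images are not available here, so the following is only a sketch under the assumption that "the same" means identical in every column. The two small frames are invented for illustration.

```python
import pandas as pd

# Invented example frames; the row (2, "y") appears in both.
a = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})
b = pd.DataFrame({"id": [2, 3], "name": ["y", "z"]})

# Stack all rows, then keep only the first occurrence of each identical row.
combined = (
    pd.concat([a, b], ignore_index=True)
      .drop_duplicates()
      .reset_index(drop=True)
)
print(combined)
```

If rows should count as duplicates based on only some columns, drop_duplicates accepts a subset argument listing those columns.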
Join Two Data frames- Python pandas
I have two Excel files which I loaded into dataframes. In the first frame I have the states (State1, 2, 3...) as column names. In the second frame I have State1, 2, 3... as column values. I need to merge these two dataframes based on the value of state. Kindly suggest.
Merge multiple int columns/rows into one numpy array (pandas dataframe)
I have a pandas dataframe with a few columns and rows. I want to merge the columns into one and then merge the rows into one based on id and date. Currently I am doing so by:
cols = [f'col{i}' for i in range(1, 49)]  # col1 ... col48
df['matrix'] = df[cols].values.tolist()
df = df.groupby(['id','date'])['matrix'].apply(list).reset_index(name='matrix')
This gives me the matrix in the form of a list. Later I convert it into a numpy.ndarray using:
df['matrix'] = df['matrix'].apply(np.array)
This is a small segment of my dataset for reference:
id,date,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19,col20,col21,col22,col23,col24,col25,col26,col27,col28,col29,col30,col31,col32,col33,col34,col35,col36,col37,col38,col39,col40,col41,col42,col43,col44,col45,col46,col47,col48
16,2014-06-22,0,0,0,10,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,2,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
16,2014-06-22,3,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0
16,2014-06-22,4,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,0,0,0,0
Although the above code works fine for small datasets, it sometimes crashes for larger ones, specifically at the df['matrix'].apply(np.array) statement. Is there a way to perform the merging so that I get a numpy.array directly? This would save a lot of time.
No need to merge the columns first. Split the DataFrame using groupby and then flatten the result:
matrix = df.set_index(['id','date']).groupby(['id','date']).apply(lambda x: x.values.flatten())
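A small runnable sketch of that one-liner, on a made-up three-column frame rather than the 49-column one from the question (the values are invented):

```python
import pandas as pd

# Made-up miniature of the question's data: two rows for one (id, date)
# pair and one row for another.
df = pd.DataFrame({
    "id":   [16, 16, 17],
    "date": ["2014-06-22", "2014-06-22", "2014-06-23"],
    "col0": [0, 1, 0],
    "col1": [10, 0, 7],
    "col2": [0, 6, 0],
})

# Move the keys into the index so only the value columns remain, then
# flatten each group's block of values into a single 1-D numpy array.
matrix = (
    df.set_index(["id", "date"])
      .groupby(["id", "date"])
      .apply(lambda x: x.values.flatten())
)

first = matrix.iloc[0]  # the array for (16, "2014-06-22")
print(matrix)
```

The result is a Series with one numpy array per (id, date) pair. Note that flatten() gives a 1-D array; if a 2-D array with one row per original row is wanted, returning x.values without flattening should keep that shape.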
How do I append an existing column to another column, aligning with the indices?
I have three dataframes that each have different columns, but they all have the same number of rows and exactly the same index. How do I combine them into a single dataframe, keeping each column separate but joining on the index? Currently, when I attempt to append them together, I get NaNs and the same indices are duplicated. I created an empty dataframe to append all three dataframes into. Maybe this is wrong? What I am doing is as follows:
df = pd.DataFrame()
frames = [...]  # a list of the three dataframes
for x in frames:
    df = df.append(x)
DataFrames have a join method which does exactly this. You'll just have to modify your code a bit so that you're calling the method from the real dataframes rather than the empty one.
df = pd.DataFrame()
frames = [...]  # a list of the three dataframes
for x in frames:
    df = x.join(df)
More in the docs.
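For reference, a self-contained sketch with three invented frames that share an index. It shows join called with a list of frames, plus pd.concat with axis=1 as an alternative that should give the same layout here. The frame contents and column names are made up.

```python
import pandas as pd

# Three invented frames with identical indices but different columns.
idx = ["a", "b", "c"]
frames = [
    pd.DataFrame({"x": [1, 2, 3]}, index=idx),
    pd.DataFrame({"y": [4, 5, 6]}, index=idx),
    pd.DataFrame({"z": [7, 8, 9]}, index=idx),
]

# join aligns on the index; it also accepts a list of other frames.
joined = frames[0].join(frames[1:])

# Column-wise concat is another way to line frames up on their index.
side_by_side = pd.concat(frames, axis=1)

print(joined)
print(side_by_side)
```

Either way the rows are matched by index label rather than stacked, which avoids the NaNs and duplicated indices described in the question. DataFrame.append was also removed in pandas 2.0, so the original loop would not run on current versions.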
I was able to come up with a solution by grouping by the index: df = df.groupby(df1.index)