Stacking multiple columns with different names into one giant dataframe - python

I have several data frames (with equal # columns but different names). I'm trying to create one data frame with rows stacked below each other. I don't care now about the column names (I can always rename them later). I saw different SO links but they don't address this problem completely.
Note: I have 21 data frames, so scalability is important. I was looking at this.
How I get df:
df = []
for f in files:
    data = pd.read_csv(f, usecols=[0, 1, 2, 3, 4])
    df.append(data)

Assuming your DataFrames are stored in some list df_l, rename the columns and concat:
df_l = [df1, df2, df3]
for df in df_l:
    df.columns = df_l[0].columns  # just pick any one DataFrame's columns
pd.concat(df_l)  # columns come from that DataFrame; the original index is preserved
Or construct a new DataFrame:
pd.DataFrame(np.vstack([df.to_numpy() for df in df_l]))  # columns and index are both a RangeIndex
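With 21 files, the rename-and-concat can also be done in one pass with set_axis, which leaves the original frames untouched (a minimal sketch with made-up frames):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"x": [5, 6], "y": [7, 8]})
df_l = [df1, df2]

# Relabel every frame's columns to match the first frame, then stack the rows.
stacked = pd.concat(
    [df.set_axis(df_l[0].columns, axis=1) for df in df_l],
    ignore_index=True,
)
print(list(stacked.columns))  # ['a', 'b']
print(stacked["a"].tolist())  # [1, 2, 5, 6]
```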

I would do it at read time, adding skiprows=1 and names:
names = [0, 1, 2, 3, 4]  # or whatever you want to call them
pd.concat([pd.read_csv(f, usecols=[0, 1, 2, 3, 4], skiprows=1, names=names) for f in files])

Once you put all the data frames into a list, try this code.
import pandas as pd

df_list = [df1, df2, df3]
result = pd.DataFrame(columns=df1.columns)
for df in df_list:
    # set_axis assigns df1's column labels positionally before concatenating
    result = pd.concat([result, df.set_axis(df1.columns, axis=1)], ignore_index=True)

Related

Merge two dataframes one common column

I have two data frames, one with three columns and another with two columns. Two columns are common in both data frames:
I have to update the Marks column of df1 from df2 only where the data is missing, and keep the existing values in df1 unchanged.
I have tried pd.merge but the result created a separate column which was not intended.
The following worked for me (df1 here is the merged result, which has the suffixed columns Marks_x and Marks_y):
df1['Mark'] = df1.Marks_x.combine_first(df1.Marks_y)
df1['Marks_x'] = df1['Mark']
df1 = df1.drop(['Marks_y', 'Mark'], axis=1)
df1 = df1.rename(columns = {'Marks_x':'Marks'})
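End to end, the snippet above assumes df1 is already the output of pd.merge with the default _x/_y suffixes. A minimal sketch of the whole flow (column and key names are made up):

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [90, None, 70]})
df2 = pd.DataFrame({"Name": ["A", "B"], "Marks": [85, 60]})

# The merge produces Marks_x (from df1) and Marks_y (from df2).
merged = df1.merge(df2, on="Name", how="left")

# Fill only the missing df1 marks from df2; existing values win.
merged["Marks"] = merged["Marks_x"].combine_first(merged["Marks_y"])
merged = merged.drop(columns=["Marks_x", "Marks_y"])
print(merged["Marks"].tolist())  # [90.0, 60.0, 70.0]
```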

Split one column with mixed data into dataframe with 2 columns using pandas

I have around 80 .txt files that contain only 1 column originally, but the one column has two parameters that I need to concatenate and split.
Something like this:
Column A
Cell 1
Cell N
--------
Column B
--------
Cell 1
Cell N
I need to aggregate column A in one dataset (or .txt file) and I can remove the data from column B.
How do I do that when I only have strings to do so?
I tried to merge the files into one, but the problem is that the column A and column B data still end up one above the other.
I tried this:
import os
import pandas as pd

path_dir = '/content/'
for fname in os.listdir(path_dir):
    if fname.endswith(".txt"):
        print(fname)

new_df = pd.DataFrame()
for fname in os.listdir(path_dir):
    if fname.endswith(".txt"):
        frame = pd.read_csv(path_dir + fname)
        new_df = pd.concat([new_df, frame], ignore_index=True)
The code is working fine, but the problem is that the columns stay merged into one, like Column A / Column B / Column A / Column B, and so on.
Thanks for your time in advance!
When you need to merge several files into one dataset, you usually need one of pandas' combining methods: please refer to the "Merge, join, concatenate and compare" article in the pandas docs.
So steps are:
Read the data and create 2 Dataframes each having 1 column (that's what you already have).
Concat them with the axis parameter specified.
# Mimic the merged single-column data
df = pd.DataFrame([1, 2, 3, 'ColumnB', 'b', 'c', 'd'], columns=['ColumnA'])
# Locate the row where the second column's header landed
second_column_index = df.index[df['ColumnA'] == 'ColumnB'].tolist()[0]
# Split at that row and place the halves side by side
df1 = df.iloc[:second_column_index, :].reset_index(drop=True)
df2 = df.iloc[second_column_index + 1:, :].rename(columns={'ColumnA': 'ColumnB'}).reset_index(drop=True)
df = pd.concat([df1, df2], axis=1)

Appending dataframes with non matching rows [duplicate]

I have a initial dataframe D. I extract two data frames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0 and removed in pandas 2.0; use pd.concat instead.
Use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets, calling concat many times can become a performance problem.
If you don't want to create a new df each time, you can instead aggregate the changes and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs, under "Concatenating objects", at the bottom of the section:
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps.
Step 1: Set the index of the first dataframe (df1):
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2):
df2 = df2.set_index('id')
Finally, update the dataframe using the following snippet (set_index returns a new frame, hence the reassignments above):
df1.update(df2)
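As a runnable sketch (the id key and values are made up; note that set_index returns a new frame, so it must be reassigned):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3], "val": ["B", "C"]})

df1 = df1.set_index("id")
df2 = df2.set_index("id")

df1.update(df2)  # overwrite df1's values where the ids match
df1 = df1.reset_index()
print(df1["val"].tolist())  # ['a', 'B', 'C']
```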
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
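A quick sketch of an index-based join (frames and labels are made up; note that join with a list requires non-overlapping column names):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r1", "r2"])
c = pd.DataFrame({"z": [5, 6]}, index=["r1", "r2"])

everything = a.join([b, c])  # align all frames on the shared index
print(list(everything.columns))  # ['x', 'y', 'z']
```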
# collect excel content into a list of dataframes
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))

# concatenate dataframes horizontally
df = pd.concat(data, axis=1)

# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally. Hope this helps someone!
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify along which axis you intend to merge the two frames.


concat a list of DataFrames, only take columns that do not exist in previous DataFrames

I have a list of DataFrames [df1, df2, df3, df4, ...]
I want to do pd.concat, and only take from each df the columns that don't exist in the previous dfs.
dfs = [df1, df2, df3, df4, ...]
final_df = df1
for df in dfs[1:]:
    final_df = pd.concat([final_df, df[df.columns.difference(final_df.columns)]])
Is there a better way than what I did above?
You can achieve this using sets. Consider the following example:
s1 = set(('col1', 'col2'))
s2 = set(('col2', 'col3'))
s2.difference(s1) # this returns {'col3'}
So what you can do is accumulate a set of column names; for every new frame, append only its unseen columns to your data frame and add those names to the accumulated set.
On further reading of your question, it appears that you did pretty much the same thing. The only improvement I can think of: if you have many frames and/or large frames, don't concat on every iteration, since each call creates a new frame and gets slower and slower. Instead, collect the pieces in a list and build one frame from them at the end.
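Putting both points together, a sketch that tracks seen columns in a set and calls concat exactly once (frames are made up; the concat is along rows, as in the question's loop):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"b": [3], "c": [4]})
df3 = pd.DataFrame({"c": [5], "d": [6]})
dfs = [df1, df2, df3]

seen = set(dfs[0].columns)
parts = [dfs[0]]
for df in dfs[1:]:
    new_cols = [c for c in df.columns if c not in seen]  # columns not seen before
    parts.append(df[new_cols])
    seen.update(new_cols)

final_df = pd.concat(parts)  # a single concat at the end
print(list(final_df.columns))  # ['a', 'b', 'c', 'd']
```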
