Can't concatenate pandas dataframes with the same length? - python

This is weird. From the documentation I have already read how to do concat and merge operations with pandas. I also already know that concatenating to the right side can be done as follows:
df = pd.concat([df1, df2], axis=1)
The issue is that I generated the following dataframes:
In:
links = pd.DataFrame(links, columns=['link'])
So, I just want to concatenate the links dataframe column to the intersection dataframe (note that links and intersection both have 78 rows). Thus:
In:
full_table = pd.concat([lis_, lis_2], axis=1)
The problem is that, as you can see in the above dataframe, it added some NaN values. So what is the correct way of concatenating the links and intersection dataframes?

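Maybe your indexes don't match up. With axis=1, concat aligns rows on their index labels, and ignore_index=True would only renumber the column labels, so reset the row index of both dataframes before concatenating:
full_table = pd.concat([intersection.reset_index(drop=True), links.reset_index(drop=True)], axis=1)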
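A minimal sketch of where the NaN values can come from, using made-up data in place of the asker's intersection and links frames (the column names and index labels below are assumptions, not the original data):

```python
import pandas as pd

# Hypothetical stand-ins: same length, but different row labels.
intersection = pd.DataFrame({"value": [10, 20, 30]}, index=[5, 6, 7])
links = pd.DataFrame({"link": ["a", "b", "c"]})  # default index 0, 1, 2

# With axis=1, concat aligns rows by index label, so labels that exist
# in only one frame produce NaN in the other frame's columns.
misaligned = pd.concat([intersection, links], axis=1)

# Resetting both indexes makes the rows line up by position instead.
full_table = pd.concat(
    [intersection.reset_index(drop=True), links.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape)
print(full_table.shape)
```

Comparing intersection.index with links.index before concatenating is a quick way to confirm whether this is the cause.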

Related

How to merge 2 dataframe rows in a new dataframe row with pandas?

I have 2 variables (dataframes): one is 47 columns wide and the other is 87. They are DF1 and DF2.
Then I have a variable (dataframe) called full_data. DF1 and DF2 are two different subsets of data that I want to merge together once I find that 2 rows are equal.
So far I am doing everything I want except appending the right value to the new dataframe.
Below is the line of code I have been playing around with:
full_data = full_data.append(pd.concat([df1[i:i+1].copy(),df2[j:j+1]].copy(), axis=1), ignore_index = True)
Once I find that the rows in both DF1 and DF2 are equal, I am trying to read both rows and put them one after the other as a single row in the variable full_data. What is happening right now is that the line of code is writing 2 rows and not one as I want.
What I want is full_data.append(DF1 DF2), and right now I am getting:
full_data(i)=DF1
full_data(i+1)=DF2
Any help would be appreciated.
EM
In the end I solved my problem. My question was probably not clear enough, but what was happening when concatenating is that I was getting duplicated or multiple rows when the expected result was a single concatenated row.
The issue turned out to be with the indexing. pd.concat with axis=1 aligns rows on their index labels, so the index of each one-row slice had to be reset before concatenating.
I found an example and explanation here
My solution here:
df3 = df2[j:j+1].copy()
df4 = df1[i:i+1].copy()
full_data = full_data.append(
    pd.concat([df4.reset_index(drop=True), df3.reset_index(drop=True)], axis=1),
    ignore_index=True
)
I first created a copy of my variables and then reset the indexes.
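Since DataFrame.append is no longer available in pandas 2.x, the same idea can be sketched with pd.concat alone. The data, the column names key and key2, and the equality test used to decide that two rows match are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical data standing in for the asker's 47- and 87-column frames.
df1 = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"key2": [3, 1], "b": [0.3, 0.1]})

rows = []
for i in range(len(df1)):
    for j in range(len(df2)):
        # Assumed matching rule: the two key columns are equal.
        if df1["key"].iloc[i] == df2["key2"].iloc[j]:
            # One-row slices keep their original labels (i and j), so reset
            # both to 0 before joining them side by side into a single row.
            left = df1.iloc[i:i + 1].reset_index(drop=True)
            right = df2.iloc[j:j + 1].reset_index(drop=True)
            rows.append(pd.concat([left, right], axis=1))

# Stack the collected one-row frames once, instead of appending in the loop.
full_data = pd.concat(rows, ignore_index=True)
print(full_data)
```

Collecting the one-row frames in a list and concatenating once also avoids copying full_data on every iteration.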

Appending dataframes with non matching rows [duplicate]

I have an initial dataframe D. I extract two data frames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0 and removed in v2.0.0, so on current pandas use pd.concat instead.
On older versions you can use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
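As a small illustration of the question's setup, the sketch below splits a made-up frame D by label and recombines the pieces. The data and the value of k are assumptions:

```python
import pandas as pd

# Hypothetical frame D, split by label as in the question.
D = pd.DataFrame({"label": [0, 1, 0, 1], "x": [1.0, 2.0, 3.0, 4.0]})
k = 1
A = D[D.label == k]  # keeps index labels 1 and 3 from D
B = D[D.label != k]  # keeps index labels 0 and 2 from D

# Stack the rows and renumber them 0..n-1.
renumbered = pd.concat([A, B], ignore_index=True)

# Stack the rows but keep the labels inherited from D.
kept = pd.concat([A, B])

print(renumbered.index.tolist())
print(kept.index.tolist())
```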
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets, calling concat many times can get performance-intensive.
If you don't want to create a new df each time, you can instead collect the frames in a list and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs (at the bottom of the section on concatenating objects):
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
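A rough sketch of the pattern the note describes: build the list of frames first (here with list comprehensions), then call concat a single time. The chunk contents and the source column are invented for the example:

```python
import pandas as pd

# Hypothetical chunks; in practice these might come from files or queries.
chunks = [pd.DataFrame({"id": [n], "value": [n * n]}) for n in range(5)]

# Apply any per-frame operations while building the list...
frames = [chunk.assign(source=f"chunk_{i}") for i, chunk in enumerate(chunks)]

# ...then copy the data only once.
result = pd.concat(frames, ignore_index=True)
print(result)
```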
If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps:
Step 1: Set the index of the first dataframe (df1). set_index returns a new dataframe, so assign the result back:
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2):
df2 = df2.set_index('id')
Finally, update the dataframe in place using the following snippet:
df1.update(df2)
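The steps above can be sketched end to end as follows. The id and price columns and their values are made up:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"id": [2, 3], "price": [25.0, 35.0]})

# set_index returns a new frame, so the result is assigned back.
df1 = df1.set_index("id")
df2 = df2.set_index("id")

# update modifies df1 in place (it returns None), overwriting values where
# df2 has a non-NaN value for the same index label and column.
df1.update(df2)
print(df1)
```

Rows of df1 whose id does not appear in df2 are left unchanged.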
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
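A short sketch of both calls with made-up frames. The index labels and column names are assumptions, and the frames have no overlapping column names, which join requires unless suffixes are given:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"y": [10, 20]}, index=["r1", "r2"])
c = pd.DataFrame({"z": [0.5, 0.7]}, index=["r2", "r3"])

# join is a left join on the index by default: every row of a is kept,
# and rows missing from the other frame get NaN.
both = a.join(b)
everything = a.join([b, c])

print(both)
print(everything)
```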
# collect excel content into list of dataframes
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate dataframes horizontally
df = pd.concat(data, axis=1)
# save combined data to excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally! Hope this helps someone.
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify along which axis you intend to merge the two frames.
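One detail worth checking when combining axis=1 with ignore_index=True: the index that gets ignored is the one along the concatenation axis, which here means the column labels. A tiny sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})

# Column names are preserved.
named = pd.concat([df1, df2], axis=1)

# Column names are replaced by 0, 1, ...; row alignment is unchanged.
numbered = pd.concat([df1, df2], axis=1, ignore_index=True)

print(named.columns.tolist())
print(numbered.columns.tolist())
```

If the column names matter, leaving ignore_index at its default keeps them.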

Pandas concat columns

I have two df-s:
I want to concatenate along the columns, e.g. get a 1000x61118 DataFrame, so I'm doing:
df_full = pd.concat([df_dev, df_temp2], axis=1)
df_full
This, however, yields a 2000x61118 df and fills everything with NaNs, and I have no idea why. What could cause this behaviour?
Create default index values with DataFrame.reset_index and drop=True so that both DataFrames align correctly:
df_full = pd.concat([df_dev.reset_index(drop=True), df_temp2.reset_index(drop=True)], axis=1)
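A scaled-down sketch of how the doubled row count can arise, with small made-up frames standing in for df_dev and df_temp2 (3 rows each instead of 1000, and invented column names). Checking whether the two indexes are equal is one way to diagnose it:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: same number of rows, disjoint index labels.
df_dev = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["d0", "d1"])
df_temp2 = pd.DataFrame(
    np.arange(9).reshape(3, 3), columns=["t0", "t1", "t2"], index=[3, 4, 5]
)

# Diagnostic: the row labels differ even though the lengths match.
print(df_dev.index.equals(df_temp2.index))

# Aligning on disjoint labels gives the union of both indexes (6 rows).
doubled = pd.concat([df_dev, df_temp2], axis=1)

# After resetting, rows pair up by position (3 rows, no NaN).
df_full = pd.concat(
    [df_dev.reset_index(drop=True), df_temp2.reset_index(drop=True)], axis=1
)
print(doubled.shape, df_full.shape)
```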

pandas df.fillna - filling NaNs after outer join with correct values

I have two dataframes that share some columns.
I'm trying to:
1) Merge the two dataframes together, i.e. adding the columns which are different:
diff = df2[df2.columns.difference(df1.columns)]
merged = pd.merge(df1, diff, how='outer', sort=False, on='ID')
Up to here, everything works as expected.
2) Now, to replace the NaN values with the values of df2
merged = merged[~merged.index.duplicated(keep='first')]
merged.fillna(value=df2)
And it is here that I get:
pandas.core.indexes.base.InvalidIndexError
I don't have any duplicates, and I can't find any information as to what can cause this.
The solution to this problem is to use a different method, combine_first().
This way, each row with missing data is filled with data from the other dataframe, as can be seen here: Merging together values within Series or DataFrame columns
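A minimal sketch of combine_first on two made-up frames indexed by ID. The columns and values are assumptions, not the asker's data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "a": [1.0, np.nan, 3.0]}).set_index("ID")
df2 = pd.DataFrame(
    {"ID": [2, 3, 4], "a": [20.0, 30.0, 40.0], "b": ["p", "q", "r"]}
).set_index("ID")

# Values from df1 take priority; its NaN cells are filled from df2, and the
# result covers the union of both frames' rows and columns.
combined = df1.combine_first(df2)
print(combined)
```

Because the result is aligned on the index, setting ID as the index on both frames first is what lets the rows pair up by ID.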
If the number of rows changes because of the merge, fillna sometimes causes an error. Try the following:
merged.fillna(df2.groupby(level=0).transform("mean"))
