Creating new pandas dataframe by extracting columns from other dataframes - ValueError - python

I have to extract columns from different pandas dataframes and merge them into a single new dataframe. This is what I am doing:
newdf=pd.DataFrame()
newdf['col1']=sorted(df1.columndf1.unique())
newdf['col2']=df2.columndf2.unique(),
newdf['col3']=df3.columndf3.unique()
newdf
I am sure that the three columns have the same length (I have checked) but I get the error
ValueError: Length of values does not match length of index
I have tried to pass them as pd.Series but the result is the same. I am on Python 2.7.

It seems the problem is that the number of unique values differs between the columns.
One possible solution is to concat all the data together and apply unique to each column.
If the unique values do not have the same size, the shorter columns get NaNs in their last rows.
newdf = (pd.concat([df1.columndf1, df2.columndf2, df3.columndf3], axis=1)
           .apply(lambda x: pd.Series(x.unique())))
EDIT:
Another possible solution (it works only if a, b and c all have the same length):
a = sorted(df1.columndf1.unique())
b = list(df2.columndf2.unique())
c = list(df3.columndf3.unique())
newdf=pd.DataFrame({'col1':a, 'col2':b, 'col3':c})
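To make the length mismatch concrete, here is a minimal sketch on made-up data (the three small frames are hypothetical; only the column names come from the question). It first reproduces the ValueError, then builds the frame by wrapping each set of unique values in a Series so that the shorter columns are padded with NaN.

```python
import pandas as pd

# Hypothetical inputs; only the column names come from the question.
df1 = pd.DataFrame({"columndf1": [3, 1, 2, 1]})
df2 = pd.DataFrame({"columndf2": ["a", "b", "a", "b"]})
df3 = pd.DataFrame({"columndf3": [10, 10, 20, 30]})

# Column-by-column assignment fails as soon as a later column is shorter
# than the index created by the first assignment.
bad = pd.DataFrame()
bad["col1"] = sorted(df1["columndf1"].unique())  # 3 values, index becomes 0..2
try:
    bad["col2"] = df2["columndf2"].unique()      # only 2 values
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping each result in a Series lets concat align the columns on a
# 0..n-1 index and pad the shorter ones with NaN.
newdf = pd.concat(
    [
        pd.Series(sorted(df1["columndf1"].unique()), name="col1"),
        pd.Series(df2["columndf2"].unique(), name="col2"),
        pd.Series(df3["columndf3"].unique(), name="col3"),
    ],
    axis=1,
)
print(error_message)
print(newdf)
```

If the lengths really are identical, the plain dictionary constructor is enough; the Series version is only useful when they might differ.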

Related

How to merge multiple (6) dataframes together based on one common column in python/pandas?

Before I start: I have found similar questions and tried the corresponding answers; however, I am still running into an issue and can't figure out why.
I have 6 data frames. I want one resulting data frame that merges all 6 into one, based on their common index column Country. Things to note: the data frames have different numbers of rows, and some countries do not have corresponding values, resulting in NaN.
Here is what I have tried:
data_frames = [WorldPopulation_df, WorldEconomy_df, WorldEducation_df, WorldAggression_df, WorldCorruption_df, WorldCyberCapabilities_df]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Country'], how = 'outer'), data_frames)
This doesn't work as the final resulting data frame pairs up the wrong values with wrong country. Any suggestions?
Let's see: "pd.merge" is used when you want to add new columns based on a key.
If your 6 dataframes have the same columns, in the same order, you can try this:
columns_order = ['country', 'column_1']
concat_ = pd.concat(
    [data_1[columns_order], data_2[columns_order], data_3[columns_order],
     data_4[columns_order], data_5[columns_order], data_6[columns_order]],
    ignore_index=True,
    axis=0
)
From here, if you want a single row per value of the "country" column, you can apply a group by to it:
concat_.groupby(by=['country']).agg({'column_1': 'max'}).reset_index()
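As a rough illustration of the stack-then-group approach above, here is a sketch with two tiny made-up frames instead of six (the data and the name `stacked` are hypothetical). It assumes every frame has the same two columns.

```python
import pandas as pd

# Hypothetical frames sharing the same columns.
data_1 = pd.DataFrame({"country": ["A", "B"], "column_1": [1, 5]})
data_2 = pd.DataFrame({"country": ["B", "C"], "column_1": [7, 2]})

columns_order = ["country", "column_1"]

# Stack the frames vertically, then keep one row per country.
stacked = pd.concat(
    [data_1[columns_order], data_2[columns_order]],
    ignore_index=True,
    axis=0,
)
result = stacked.groupby("country", as_index=False).agg({"column_1": "max"})
print(result)
```

Country B appears in both inputs, so the group by keeps only its largest column_1 value.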

pandas df.fillna - filling NaNs after outer join with correct values

I have two dataframes, sharing some columns together.
I'm trying to:
1) Merge the two dataframes together, i.e. adding the columns which are different:
diff = df2[df2.columns.difference(df1.columns)]
merged = pd.merge(df1, diff, how='outer', sort=False, on='ID')
Up to here, everything works as expected.
2) Now, to replace the NaN values with the values of df2
merged = merged[~merged.index.duplicated(keep='first')]
merged.fillna(value=df2)
And it is here that I get:
pandas.core.indexes.base.InvalidIndexError
I don't have any duplicates, and I can't find any information as to what can cause this.
The solution to this problem is to use a different method, combine_first().
This way, each row with missing data is filled with data from the other dataframe, as can be seen in "Merging together values within Series or DataFrame columns".
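A small sketch of combine_first() on made-up data may help. The frames and values here are hypothetical, and it assumes ID is set as the index of both frames, since combine_first aligns on index and column labels.

```python
import pandas as pd

# Hypothetical frames; ID becomes the index so rows align by ID.
df1 = pd.DataFrame({"ID": [1, 2, 3], "x": [1.0, None, 3.0]}).set_index("ID")
df2 = pd.DataFrame(
    {"ID": [2, 3, 4], "x": [20.0, 30.0, 40.0], "y": ["b", "c", "d"]}
).set_index("ID")

# Values from df1 win where present; gaps are filled from df2.
# The result covers the union of both indexes and both column sets.
combined = df1.combine_first(df2)
print(combined)
```

Here the missing x for ID 2 is taken from df2, the existing x for ID 3 is kept from df1, and ID 1 has no y because df2 has no such row.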
If the number of rows changes because of the merge, fillna can sometimes raise an error. Try the following:
merged.fillna(df2.groupby(level=0).transform("mean"))

Concatenate two dataframes and remove duplicate rows based on condition

I am trying to concatenate two dataframes, and in case of duplication I'd like to keep the row that has the maximum value for a column C.
I tried this command :
df = pd.concat([df1, df2]).max(level=0)
So if two rows have same value for columns A and B, I will just take that row with the maximum value for column C.
You can sort by column C, then drop duplicates by columns A & B:
df = pd.concat([df1, df2])\
.sort_values('C')\
.drop_duplicates(subset=['A', 'B'], keep='last')
Your attempt exhibits a couple of misunderstandings:
pd.DataFrame.max is used to calculate maximum values, not to filter a dataframe.
The level parameter is relevant only for MultiIndex dataframes.
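The sort-then-drop approach can be checked on a small example. The two frames below are made up for illustration; only the column names A, B and C come from the question.

```python
import pandas as pd

# Hypothetical frames; the (A, B) pair (1, "x") appears in both.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "C": [10, 5]})
df2 = pd.DataFrame({"A": [1, 3], "B": ["x", "z"], "C": [4, 8]})

# After sorting by C, the last row of each (A, B) group has the largest C,
# so keep='last' retains exactly that row.
df = (
    pd.concat([df1, df2])
    .sort_values("C")
    .drop_duplicates(subset=["A", "B"], keep="last")
)
print(df)
```

The duplicate pair keeps its row with C equal to 10 and drops the one with C equal to 4.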

How do I append an existing column to another column, aligning with the indices?

I have three dataframes that each have different columns, but they all have the same indices and the same number of rows (exact same index). How do I combine them into a single dataframe, keeping each column separate but joining on the indices?
Currently, when I attempt to append them together, I get NaNs and the same indices are duplicated. I created an empty dataframe so that I can append all three dataframes to it. Maybe this is wrong?
What I am doing is as follows:
df = pd.DataFrame()
frames = a list of the three dataframes
for x in frames:
    df = df.append(x)
DataFrames have a join method which does exactly this. You'll just have to modify your code a bit so that you're calling the method from the real dataframes rather than the empty one.
df = pd.DataFrame()
frames = a list of the three dataframes
for x in frames:
    df = x.join(df)
More in the docs.
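For reference, a sketch of an index-based join without the loop, on made-up frames (the index labels and column names are hypothetical). It assumes the three frames share an identical, unique index and have no overlapping column names.

```python
import pandas as pd

# Hypothetical frames with the same index and different columns.
idx = ["r1", "r2", "r3"]
frames = [
    pd.DataFrame({"a": [1, 2, 3]}, index=idx),
    pd.DataFrame({"b": [4, 5, 6]}, index=idx),
    pd.DataFrame({"c": [7, 8, 9]}, index=idx),
]

# join accepts a list of frames and aligns them on the index.
df = frames[0].join(frames[1:])

# Column-wise concat gives the same layout for identically indexed frames.
same = pd.concat(frames, axis=1)
print(df)
```

Either form keeps each column separate and produces one row per index label, with no NaNs and no duplicated indices.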
I was able to come up with a solution by grouping by the index:
df = df.groupby(df1.index)

Adding DataFrame columns in Python pandas

I have a pandas DataFrame that has a number of columns (about 20) containing string objects. I'm looking for a simple method to add all of the columns together into one new column, but have so far been unsuccessful, e.g.:
for i in df.columns:
    df['newcolumn'] = df['newcolumn'] + '/' + df.ix[:,i]
This results in an empty DataFrame column ‘newcolumn’ instead of the concatenated column.
I’m new to pandas, so any help would be much appreciated.
df['newcolumn'] = df.apply(''.join, axis=1)
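The one-liner above joins the values with no separator. If the '/' separator from the question is wanted, a minimal sketch on made-up data could look like this (the three-column frame is hypothetical, and it assumes every column already holds strings):

```python
import pandas as pd

# Hypothetical frame of string columns.
df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"], "c": ["p", "q"]})

# Each row is passed to '/'.join, which concatenates its values in column order.
# The right-hand side is evaluated before 'newcolumn' exists, so the new
# column is not included in its own result.
df["newcolumn"] = df.apply("/".join, axis=1)
print(df)
```

Columns with non-string values would need converting first, for example with df.astype(str).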
