How to append two pandas.DataFrame with different numbers of columns - python

When I directly append two DataFrames with different numbers of columns, an error occurs: pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 242, saw 5. How can I avoid this error with pandas?
I have figured out one naive approach: preprocess the original data so that both DataFrames have the same number of columns.
Can it be more elegant? I think the missing columns could be filled with np.nan after appending.

You should be able to concat the DataFrames as shown below.
You will need to rename the columns to suit your needs.
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[1,2,3,4]})
df2 = pd.DataFrame({'a':[1,2,3,4],'c':[1,2,3,4]})
df = pd.concat([df1,df2])
print('df1')
print(df1)
print('\ndf2')
print(df2)
print('\ndf')
print(df)
Output:
df1
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
df2
a c
0 1 1
1 2 2
2 3 3
3 4 4
df
a b c
0 1 1.0 1
1 2 2.0 2
2 3 3.0 3
3 4 4.0 4
0 1 NaN 1
1 2 NaN 2
2 3 NaN 3
3 4 NaN 4
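If you would rather have a fresh 0..7 index than the repeated 0-3 row labels above, pd.concat accepts ignore_index=True; a minimal variation on the same df1/df2:
df = pd.concat([df1, df2], ignore_index=True)  # rows relabelled 0..7; 'b' is still NaN for df2's rows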

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the value from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, tdf['b'] at index 0 should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
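Series.map looks each value of s up in the index of df2['a'], and NaN keys simply map to NaN, so no dropna() is needed. To store the result as in the question (assuming tdf is your result frame):
tdf['b'] = s.map(df2['a'])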

summing columns from different dataframes Pandas

I have 3 DataFrames, all with over 100 rows and 1000 columns. I am trying to combine them into one DataFrame in such a way that common columns from each DataFrame are summed up. I understand there is a summation method, pd.DataFrame.sum(), but remember, I have over 1000 columns and I cannot add each common column manually. I am attaching sample DataFrames and the result I want. Help will be appreciated.
#Sample DataFrames.
df_1 = pd.DataFrame({'a':[1,2,3],'b':[2,1,0],'c':[1,3,5]})
df_2 = pd.DataFrame({'a':[1,1,0],'b':[2,1,4],'c':[1,0,2],'d':[2,2,2]})
df_3 = pd.DataFrame({'a':[1,2,3],'c':[1,3,5], 'x':[2,3,4]})
#Result.
df_total = pd.DataFrame({'a':[3,5,6],'b':[4,2,4],'c':[3,6,12],'d':[2,2,2], 'x':[2,3,4]})
df_total
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Let us do pd.concat, then sum:
out = pd.concat([df_1,df_2,df_3],axis=1).sum(level=0,axis=1)
Out[7]:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
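Note that sum(level=..., axis=1) was deprecated and later removed in newer pandas; an equivalent on recent versions (a sketch on the same data) is to group the transposed frame by column name:
out = pd.concat([df_1, df_2, df_3], axis=1).T.groupby(level=0).sum().T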
You can add with fill_value=0:
df_1.add(df_2, fill_value=0).add(df_3, fill_value=0).astype(int)
Output:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Note: pandas intrinsically aligns most operations along indexes (index and column headers).
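With many DataFrames, the add(..., fill_value=0) pattern chains naturally via functools.reduce; a sketch assuming the frames are collected in a list dfs:
from functools import reduce
dfs = [df_1, df_2, df_3]
df_total = reduce(lambda a, b: a.add(b, fill_value=0), dfs).astype(int)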

Remove duplicate column based on a condition in pandas

I have a DataFrame that contains a duplicated column, namely Weather, as seen in the attached picture of the DataFrame. One of the two columns contains NaN values; that is the one I want to remove.
I tried this method
data_cleaned4.drop('Weather', axis=1)
It dropped both columns, as it should. I tried to pass a condition to the drop method, but it raised an error.
data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)
Can anyone tell me how to remove this column? Note that the second-to-last column contains the NaN values, not the last one.
A general solution: df.isnull().any(axis=0).values marks which columns contain any NaN values, and df.columns.duplicated(keep=False) marks every duplicated column as True. ANDing the two selects the duplicated columns that contain NaNs; negating that mask keeps the columns you want to retain.
General Solution:
df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C
0 1 1 1
1 1 1 1
2 2 3 4
3 1 1 1
Just for column C:
df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
& (df.columns == 'C'))]
Input
A B C C A
0 1 1 1 3.0 NaN
1 1 1 1 2.0 1.0
2 2 3 4 NaN 2.0
3 1 1 1 4.0 1.0
Output
A B C A
0 1 1 1 NaN
1 1 1 1 1.0
2 2 3 4 2.0
3 1 1 1 1.0
Due to the duplicate names you cannot drop by label without hitting both columns, so first rename one copy to make the labels unique (that is what the first lines of the code below do); also note that drop returns a new frame, so assign it back:
cols = list(data_cleaned4.columns)
cols[-2] = 'Weather_dup'  # give the NaN copy a distinct (hypothetical) name
data_cleaned4.columns = cols
checkone = data_cleaned4.iloc[:, -1].isna().any()
if checkone:
    data_cleaned4 = data_cleaned4.drop(data_cleaned4.columns[-1], axis=1)
else:
    data_cleaned4 = data_cleaned4.drop(data_cleaned4.columns[-2], axis=1)
Without a testable sample, and assuming you don't have NaNs anywhere else in your DataFrame,
df = df.dropna(axis=1)
should work.
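A middle ground between the general mask and dropna, sketched on a hypothetical frame (the column names and values here are made up for illustration): keep a duplicated column only if it is not the all-NaN copy.
import numpy as np
import pandas as pd

# Hypothetical frame: the second-to-last column is the all-NaN 'Weather' copy
df = pd.DataFrame([[20, np.nan, 'rain'], [25, np.nan, 'sun']],
                  columns=['Temp', 'Weather', 'Weather'])

# Keep a column unless it is a 'Weather' copy consisting only of NaNs
keep = [not (col == 'Weather' and df.iloc[:, i].isna().all())
        for i, col in enumerate(df.columns)]
df = df.loc[:, keep]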

Get only two values from 4 specified columns and merge valid values into 2 columns

df:
index a b c d
-
0 1 2 NaN NaN
1 2 NaN 3 NaN
2 5 NaN 6 NaN
3 1 NaN NaN 5
df expect:
index one two
-
0 1 2
1 2 3
2 5 6
3 1 5
The expected output above is self-explanatory: basically, I just need to shift the two non-NaN values from columns [a, b, c, d] into another pair of columns ["one", "two"].
Back-fill the missing values along the rows, then select the first two columns:
df = df.bfill(axis=1).iloc[:, :2].astype(int)
df.columns = ["one", "two"]
print (df)
one two
index
0 1 2
1 2 3
2 5 6
3 1 5
Or combine_first + pop (pop already removes b, c and d from df, so no separate drop is needed):
df['two'] = df.pop('b').combine_first(df.pop('c')).combine_first(df.pop('d'))
df.columns = ['index', 'one', 'two']
Or fillna:
df['two'] = df.pop('b').fillna(df.pop('c')).fillna(df.pop('d'))
df.columns = ['index', 'one', 'two']
In both cases:
print(df)
gives:
index one two
0 0 1 2.0
1 1 2 3.0
2 2 5 6.0
3 3 1 5.0
If you want output like jezrael's (both cases work), add:
df = df.set_index('index')
Then print(df) gives:
one two
index
0 1 2.0
1 2 3.0
2 5 6.0
3 1 5.0
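Another route, a sketch that assumes every row holds exactly two non-NaN values among a, b, c, d: drop the NaNs row by row and rebuild the frame.
vals = df[['a', 'b', 'c', 'd']].apply(lambda r: r.dropna().tolist(), axis=1)
out = pd.DataFrame(vals.tolist(), columns=['one', 'two'], index=df.index)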

what is the difference between with or without .loc when using groupby + transform in Pandas

I am new to Python. Here is a question that seems really weird to me.
A simple data frame looks like:
a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})
I need to group a1 by Hash, calculate how many rows are in each group, and then add a column to a1 with that group's row count. So, I want to use groupby + transform.
When I use:
a1['CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is correct:
Card Hash CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 3 2 3
5 3 3 1
6 4 4 2
7 4 4 2
But when I use:
a1.loc[:,'CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is:
Card Hash CustomerCount
0 1 1 NaN
1 1 1 NaN
2 2 2 NaN
3 2 2 NaN
4 3 2 NaN
5 3 3 NaN
6 4 4 NaN
7 4 4 NaN
So, why does this happen?
As far as I know, loc and iloc (as in a1.loc[:, 'CustomerCount']) are usually recommended over plain bracket assignment (as in a1['CustomerCount']). So why does this happen?
Also, I have tried loc and iloc many times to create a new column in a DataFrame, and they usually work. So does this have something to do with groupby + transform?
The difference is in how loc handles assigning a DataFrame object to a single column. The groupby result here is a DataFrame whose only column is Card; when you assign it via loc, pandas tries to align both the index and the column names, and since Card does not line up with CustomerCount you get NaNs. When you assign via direct column access, pandas treats the right-hand side as the values for that one column and just does it.
Reduce to a single column
You can resolve this by reducing the result of the groupby operation to a single column, so there is no column name left to misalign.
a1.loc[:,'CustomerCount'] = a1.groupby(['Hash']).Card.transform('size')
a1
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
Rename the column
Don't really do this; the option above is far simpler.
a1.loc[:, 'CustomerCount'] = a1.groupby('Hash').transform(len).rename(
columns={'Card': 'CustomerCount'})
a1
pd.factorize and np.bincount
What I'd actually do
f, u = pd.factorize(a1.Hash)
a1['CustomerCount'] = np.bincount(f)[f]
a1
Or inline making a copy
a1.assign(CustomerCount=(lambda f: np.bincount(f)[f])(pd.factorize(a1.Hash)[0]))
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
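If factorize/bincount reads as opaque, a plain-pandas sketch with the same result maps the group sizes back onto the Hash column:
a1['CustomerCount'] = a1['Hash'].map(a1['Hash'].value_counts())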
