Append list to pandas DataFrame as new row with index - python

Despite the numerous Stack Overflow questions on appending data to a DataFrame, I could not really find an answer to the following.
I am looking for a straightforward solution to append a list as the last row of a DataFrame.
Imagine I have a simple dataframe:
indexlist=['one']
columnList=list('ABC')
values=np.array([1,2,3])
# note: values is a 1-D array of shape (3,); a single DataFrame row
# has to be shaped (1, 3), so we have to reshape it
values=values.reshape(1,3)
df3=pd.DataFrame(values,index=indexlist,columns=columnList)
print(df3)
A B C
one 1 2 3
After some operations I get the following list:
listtwo=[4,5,6]
I want to append it at the end of the dataframe.
I change that list into a series:
oseries=pd.Series(listtwo)
print(type(oseries))
oseries.name="two"
now, this does not work:
df3.append(oseries)
since it gives:
A B C 0 1 2
one 1.0 2.0 3.0 NaN NaN NaN
two NaN NaN NaN 4.0 5.0 6.0
I would like to have the values under A B and C.
I also tried:
df3.append(oseries, columns=list('ABC')) *** not working ***
df3.append(oseries, ignore_index=True) *** working but wrong result
df3.append(oseries, ignore_index=False) *** working but wrong result
df3.loc[oseries.name] = oseries *** adds a row with NaN values ***
what I am looking for is
a) how can I add a list as a row under a particular index name?
b) how can I simply add a row of values from a list even if I don't have a name for the index (leave it empty)?

Either assign in-place with loc:
df.loc['two'] = [4, 5, 6]
# df.loc['two', :] = [4, 5, 6]
df
A B C
one 1 2 3
two 4 5 6
Or, use df.append with a Series whose index matches the columns and whose name is the new row label:
s = pd.Series(dict(zip(df.columns, [4, 5, 6]))).rename('two')
df2 = df.append(s)
df2
A B C
one 1 2 3
two 4 5 6
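Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, a rough equivalent sketched with pd.concat and the same example data is:
# build the new row as a one-row DataFrame whose index is the desired label
row = pd.DataFrame([[4, 5, 6]], columns=df.columns, index=['two'])
df2 = pd.concat([df, row])
df2
A B C
one 1 2 3
two 4 5 6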
If you are appending to a DataFrame without an explicit index (i.e., with the default numeric index), you can use loc after finding the max of the existing index and incrementing it by 1:
df4 = pd.DataFrame(np.array([1,2,3]).reshape(1,3), columns=list('ABC'))
df4
A B C
0 1 2 3
df4.loc[df4.index.max() + 1, :] = [4, 5, 6]
df4
A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
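A hedged variant of the same idea handles two caveats: df4.index.max() is NaN when the frame is empty, and assignment via loc enlargement upcasts the values to float (as in the output above):
# pick 0 for an empty frame, otherwise one past the current max label
next_label = 0 if len(df4) == 0 else df4.index.max() + 1
df4.loc[next_label, :] = [4, 5, 6]
df4 = df4.astype(int)  # optional: restore the integer dtype lost during enlargement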
Or, using append with ignore_index=True:
df4.append(pd.Series(dict(zip(df4.columns, [4, 5, 6]))), ignore_index=True)
A B C
0 1 2 3
1 4 5 6
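On pandas versions without append, the pd.concat counterpart of this call would be (a sketch):
pd.concat([df4, pd.DataFrame([[4, 5, 6]], columns=df4.columns)], ignore_index=True)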

Without index
lst1 = [1,2,3]
lst2 = [4,5,6]
p1 = pd.DataFrame([lst1])
p2 = p1.append([lst2], ignore_index = True)
p2.columns = list('ABC')
p2
A B C
0 1 2 3
1 4 5 6
With index
lst1 = [1,2,3]
lst2 = [4,5,6]
p1 = pd.DataFrame([lst1], index = ['one'], columns = list('ABC'))
p2 = p1.append(pd.DataFrame([lst2], index = ['two'], columns = list('ABC')))
p2
A B C
one 1 2 3
two 4 5 6

Related

Creating new column taking single value from column of another dataframe

I have two dataframes. The first one is df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]}) i.e
A B
0 5 2
1 0 4
another one is df2 = pd.DataFrame({'C': [1, 1], 'D': [3, 3]}) i.e
C D
0 1 3
1 1 3
I want to grab only the 4 from df1 and make a new column in df2. I have tried df2['E'] = df1['B'][df1['B'] == 4] and got
C D E
0 1 3 NaN
1 1 3 4.0
I want both rows of df2['E'] to be 4. How can I achieve this? Any help would be much appreciated.
If the value 4 appears as the last value in your column (like in your example), you could backfill the NaNs; note that the result has to be assigned back:
df2['E'] = df2['E'].fillna(method='backfill')
For other methods, have a look here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
It is not entirely clear what you want to accomplish here, but I assume you would like to check whether there is any 4 in column B of df1 and, if so, fill all rows of column E in df2 with 4. Then you could do:
import numpy as np
df2['E'] = np.where(df1['B'].isin([4]).any(), 4, np.nan)
Output:
C D E
0 1 3 4.0
1 1 3 4.0
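If the goal is simply to broadcast the single matching value of df1['B'] to every row of df2, another hedged sketch is to pull the scalar out first (this assumes at least one row of df1 matches):
value = df1.loc[df1['B'] == 4, 'B'].iloc[0]  # the scalar 4
df2['E'] = value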

pandas most efficient way to execute arithmetic operations on multiple dataframe columns

my first post!
I'm running python 3.8.5 & pandas 1.1.0 on jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
a b c
0 2 4 6
1 3 6 9
2 4 8 12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
a b c
0 2 2 3
1 3 2 3
2 4 2 3
I tried
df.iloc[: , 1:] = df.iloc[: , 1:] / df['a']
but this gives:
a b c
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
    df[colname] = df[colname] / df['a']
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there: use div with axis=0 so the division aligns on the row index (plain / aligns the Series against the columns, which is why you got NaN):
df.iloc[:,1:] = df.iloc[:,1:].div(df.a, axis=0)
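If only specific columns should be touched, the same idea can be restricted to those columns (a sketch on the frame above):
df[['b', 'c']] = df[['b', 'c']].div(df['a'], axis=0)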
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b', 'c']] = df.apply(lambda x: x[['b', 'c']] / x.a, axis=1)

Filter pandas data frame for NaN value without isnull

I have a list A:
A = [np.nan, 2, 3, 4, 6]
And a pandas data frame df:
index X Y
0 A NaN
1 B 2
2 C 6
3 D 4
4 E 3
I'd like to create a list comprehension that gives me, for each value in A, the index of the row where column Y equals that value. Usually I would do this:
B = [df[df.Y == x].index[0] for x in A]
However, this doesn't work for the first element of A, nan. Obviously I could do the above with a normal for loop and using isnull, as below, but is there a way to do it with a list comprehension?
B = []
for x in A:
    if pd.isnull(x):
        B.append(df[pd.isnull(df.Y)].index[0])
    else:
        B.append(df[df.Y == x].index[0])
Expected result:
B = [0,1,4,3,2]
Giving you exactly what you want (and by essentially just re-purposing your existing if statement), try:
B = [df[pd.isnull(df.Y)].index[0] if pd.isnull(x) else df[df.Y == x].index[0] for x in A]
Using merge (for how this handles NaN, see Why does pandas merge on NaN?):
A = [np.nan, 2, 3, 4, 6]
pd.DataFrame({'Y':A}).merge(df,how='left')
Out[394]:
Y index X
0 NaN 0 A
1 2.0 1 B
2 3.0 4 E
3 4.0 3 D
4 6.0 2 C
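To turn the merged frame back into the list B (assuming, as in the output above, that the original row position is available as an 'index' column; otherwise call reset_index() on df first), a sketch:
merged = pd.DataFrame({'Y': A}).merge(df, how='left')
B = merged['index'].tolist()  # [0, 1, 4, 3, 2]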

sum values in different rows and columns dataframe python

My Data Frame
A B C D
2 3 4 5
1 4 5 6
5 6 7 8
How do I add values from different rows and different columns?
Column A row 2 with column B row 1
Column A row 3 with column B row 2
and similarly for all rows.
If you only need to do this with two columns (and I understand your question correctly), I think you can use the shift function.
Your data frame (pandas?) is something like:
d = {'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D':[5, 6, 8]}
df = pd.DataFrame(data=d)
So, it's possible to create a new Series with column B shifted down by one:
df2 = df['B'].shift(1)
which gives:
0 NaN
1 3.0
2 4.0
Name: B, dtype: float64
and then, merge this new data with the previous df and, for example, sum the values:
df = df.join(df2, rsuffix='shift')
df['out'] = df['A'] + df['Bshift']
The final output is in out column:
A B C D Bshift out
0 2 3 4 5 NaN NaN
1 1 4 5 6 3.0 4.0
2 5 6 7 8 4.0 9.0
But this is only a guess; I'm not sure I've understood your question!
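The join step is not strictly required; the same out column can be computed in one line (a sketch on the frame above):
# each value of A plus the previous row's value of B
df['out'] = df['A'] + df['B'].shift(1)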

Find the column name which has the 2nd maximum value for each row (pandas)

Based on this post: Find the column name which has the maximum value for each row it is clear how to get the column name with the max value of each row using df.idxmax(axis=1).
The question is, how can I get the 2nd, 3rd and so on maximum value per row?
You need numpy.argsort for the positions and can then reorder the column names by indexing:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
arr = np.argsort(-df.values, axis=1)
df1 = pd.DataFrame(df.columns[arr], index=df.index)
print (df1)
0 1 2 3 4
0 A B D E C
1 D B C E A
2 E A B C D
3 C D A E B
4 C A E D B
Verify:
#first column
print (df.idxmax(axis=1))
0 A
1 D
2 E
3 C
4 C
dtype: object
#last column
print (df.idxmin(axis=1))
0 C
1 A
2 D
3 B
4 B
dtype: object
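To pull out only the column holding the 2nd-largest value of each row, select column 1 of df1 (column 2 for the 3rd-largest, and so on), for example:
print (df1[1])
0    B
1    B
2    A
3    D
4    A
Name: 1, dtype: object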
While there is no built-in method to directly return the column holding a specific rank within a row, you can rank the elements of a pandas DataFrame using the rank method.
For example, for a dataframe like this:
df = pd.DataFrame([[1, 2, 4],[3, 1, 7], [10, 4, 2]], columns=['A','B','C'])
>>> print(df)
A B C
0 1 2 4
1 3 1 7
2 10 4 2
You can get the ranks of each row by doing:
>>> df.rank(axis=1,method='dense', ascending=False)
A B C
0 3.0 2.0 1.0
1 2.0 3.0 1.0
2 1.0 2.0 3.0
By default, applying rank to dataframes and using method='dense' will result in float ranks. This can be easily fixed just by doing:
>>> ranks = df.rank(axis=1,method='dense', ascending=False).astype(int)
>>> ranks
A B C
0 3 2 1
1 2 3 1
2 1 2 3
Finding the indices is a little trickier in pandas, but it can be reduced to applying a filter on a condition (e.g. ranks == 2):
>>> ranks.where(ranks==2)
A B C
0 NaN 2.0 NaN
1 2.0 NaN NaN
2 NaN 2.0 NaN
Applying where returns only the elements matching the condition, with the rest set to NaN. We can retrieve the row and column indices by doing:
>>> ranks.where(ranks==2).notnull().values.nonzero()
(array([0, 1, 2]), array([1, 0, 1]))
And for retrieving the column index (the position within a row), which is the answer to your question, take the second array returned by nonzero:
>>> ranks.where(ranks==2).notnull().values.nonzero()[1]
array([1, 0, 1])
For the third element you just need to change the condition in where to ranks.where(ranks==3) and so on for other ranks.
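As a related sketch, the rank table can also be mapped straight to column labels instead of positions: a boolean mask plus idxmax returns, for each row, the first column whose rank equals 2 (with method='dense', tied columns share a rank, and only the first label would be reported):
>>> ranks.eq(2).idxmax(axis=1)
0    B
1    A
2    B
dtype: object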
