Merge two pandas series based on missing data - python

If I have:
col1 col2
0 1 np.nan
1 2 np.nan
2 np.nan 3
4 np.nan 4
How would I efficiently get to:
col1 col2 col3
0 1 np.nan 1
1 2 np.nan 2
2 np.nan 3 3
4 np.nan 4 4
My current solution is:
test = pd.Series([1, 2, np.nan, np.nan])
test2 = pd.Series([np.nan, np.nan, 3, 4])
temp_df = pd.concat([test, test2], axis=1)
init_cols = list(temp_df.columns)
temp_df['test3'] = ""
for col in init_cols:
    # .ix was removed in pandas 1.0; .loc does the same label-based selection here
    temp_df.loc[temp_df[col].fillna("") != "", 'test3'] = list(temp_df.loc[temp_df[col].fillna("") != "", col])
Ideally I would like to avoid the use of loops.

It depends on what you want to happen when both columns have a non-null value in the same row.
Take col1 first, then fill missing values with col2:
df['col3'] = df.col1.fillna(df.col2)
Take col2 first, then fill missing values with col1:
df['col3'] = df.col2.fillna(df.col1)
Average the overlap:
df['col3'] = df.mean(1)
Sum the overlap:
df['col3'] = df.sum(1)
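The four options can be compared side by side in a minimal sketch. The frame named overlap below is a made-up example (not from the question) in which one row has values in both columns, so the differences become visible. The mean and sum are restricted to col1 and col2 explicitly, on the assumption that the frame may already contain other columns.

```python
import numpy as np
import pandas as pd

# Data from the question: the two columns never overlap
df = pd.DataFrame({"col1": [1, 2, np.nan, np.nan],
                   "col2": [np.nan, np.nan, 3, 4]})
df["col3"] = df["col1"].fillna(df["col2"])

# Hypothetical data with an overlap in the first row
overlap = pd.DataFrame({"col1": [1, np.nan], "col2": [10, 20]})
pair = overlap[["col1", "col2"]]

col1_first = overlap["col1"].fillna(overlap["col2"])  # prefer col1
col2_first = overlap["col2"].fillna(overlap["col1"])  # prefer col2
averaged = pair.mean(axis=1)                          # NaN is skipped by default
summed = pair.sum(axis=1)                             # NaN counts as 0 by default
```

With no overlap, all four give the same result, so the choice only matters when a row can have both values.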

Related

Create a column based on a value from another columns values on pandas

I'm new to Python and pandas and I'm struggling with a problem.
Here is a dataset:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None]}
df = pd.DataFrame(data)
I need to create 3 columns based on the unique values of col1 to col4. Whenever col1, col2, col3 or col4 has a value equal to the header of a new column, that column should contain 1; otherwise it should contain 0.
I need an output like this.
Dataset output example:
data = {'col1': ['a','b','a','c'], 'col2': [None,None,'a',None], 'col3': [None,'a',None,'b'], 'col4': ['a',None,'b',None], 'col5': ['b','c','c',None], 'a':[1,1,1,0],'b':[0,1,1,1],'c':[0,1,1,1]}
df = pd.DataFrame(data)
I was able to create a new column and set it to 1 using the code below:
df['a'] = 0
df['a'] = (df['col1'] == 'a').astype(int)
But it only works for the first column; I would have to repeat it for all columns.
Is there a way to make it happen for all columns at once?
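Use pd.get_dummies, then group the duplicated indicator columns by name and take the maximum. (groupby with axis=1 is deprecated in recent pandas, so the frame is transposed instead, and dtype=int keeps the indicators as 0/1.)
df = pd.concat([df,
                pd.get_dummies(df, prefix='', prefix_sep='', dtype=int).T.groupby(level=0).max().T],
               axis=1)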
Out[377]:
col1 col2 col3 col4 col5 a b c
0 a None None a b 1 1 0
1 b None a None c 1 1 1
2 a a None b c 1 1 1
3 c None b None None 0 1 1
Alternatively, stack the values first and group by the original row index:
pd.concat([df, pd.get_dummies(df.stack().droplevel(1)).groupby(level=0).max()], axis=1)
Result:
col1 col2 col3 col4 col5 a b c
0 a None None a b 1 1 0
1 b None a None c 1 1 1
2 a a None b c 1 1 1
3 c None b None None 0 1 1
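A different way to reach the same table, sketched here as an alternative to get_dummies rather than as the method from the answers above: compare the whole frame against each distinct value and check whether any column matches in each row. The names values and indicators are illustrative. Like the answers, this looks at all five columns, col5 included.

```python
import pandas as pd

data = {'col1': ['a', 'b', 'a', 'c'],
        'col2': [None, None, 'a', None],
        'col3': [None, 'a', None, 'b'],
        'col4': ['a', None, 'b', None],
        'col5': ['b', 'c', 'c', None]}
df = pd.DataFrame(data)

# Distinct non-missing values found anywhere in the frame
values = sorted(v for v in pd.unique(df.to_numpy().ravel()) if pd.notna(v))

# One 0/1 column per value: 1 if any column in that row equals the value
indicators = pd.DataFrame(
    {v: df.eq(v).any(axis=1).astype(int) for v in values}
)
result = pd.concat([df, indicators], axis=1)
print(result)
```

To restrict the check to col1 through col4, as the question text describes, replace df with df[['col1', 'col2', 'col3', 'col4']] in both places where it is scanned.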

How to summarize rows on column in pandas dataframe

I have a DataFrame that looks like this:
Col2 Col3
0 5 8
1 1 0
2 3 5
3 4 1
4 0 7
How can I sum the values in each column and get rid of the index, so that it looks like this?
Col2 Col3
13 21
Sample code:
import pandas as pd
df = pd.DataFrame()
df["Col1"] = [0,2,4,6,2]
df["Col2"] = [5,1,3,4,0]
df["Col3"] = [8,0,5,1,7]
df["Col4"] = [1,4,6,0,8]
df_new = df.iloc[:, 1:3]
print(df_new)
Use .sum() to get the sums for each column. It produces a Series where each row contains the sum of a column. Transpose this to turn each row into a column, and then use .to_string(index=False) to print out the DataFrame without the index:
print(pd.DataFrame(df_new.sum()).T.to_string(index=False))
This outputs:
Col2 Col3
13 21
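You can try the following (the result of sum has to be assigned, otherwise it is discarded):
df_sum = df_new.sum(axis=0).to_frame().T
df_sum.reset_index(drop=True, inplace=True)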
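Put together with the sample code from the question, the approach can be sketched as follows. The names totals and text are illustrative. Note that a DataFrame always has an index, so the index is only hidden when printing, not removed.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [0, 2, 4, 6, 2],
                   "Col2": [5, 1, 3, 4, 0],
                   "Col3": [8, 0, 5, 1, 7],
                   "Col4": [1, 4, 6, 0, 8]})
df_new = df.iloc[:, 1:3]          # keep Col2 and Col3 only

# sum() gives a Series indexed by column name; to_frame().T turns it into one row
totals = df_new.sum().to_frame().T

# Hide the index in the printed output
text = totals.to_string(index=False)
print(text)
```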

How to return column&row index of cell of certain value

index col1 col2 col3
0 0 1 0
1 1 0 1
2 1 1 0
I am stuck on a task: finding the locations (indices) of all cells that are equal to 1.
I was trying to use statements like these:
column_result=[]
row_result=[]
for column in df:
    column_result=column_result.append(df.index[df[i] != 0])
for row in df:
    row_result=row_result.append(df.index[df[i]!=0])
My logic is to use loops to traverse the columns and rows separately and concatenate the results later.
However, it returns: 'NoneType' object has no attribute 'append'
Would you please help me debug this and complete the task?
Use numpy.where to get the row and column positions of the matching cells, then map those positions to index and column labels for the idx and cols lists:
i, c = np.where(df.ne(0))
cols = df.columns[c].tolist()
idx = df.index[i].tolist()
print (idx)
[0, 1, 1, 2, 2]
print (cols)
['col2', 'col1', 'col3', 'col1', 'col2']
Or use DataFrame.stack with filtering for final DataFrame:
s = df.stack()
df1 = s[s.ne(0)].rename_axis(['idx','cols']).index.to_frame(index=False)
print (df1)
idx cols
0 0 col2
1 1 col1
2 1 col3
3 2 col1
4 2 col2
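The error in the question comes from list.append, which modifies the list in place and returns None, so assigning its result back replaces the list with None. Below is a self-contained sketch of the numpy.where approach on the question's data. It assumes the column labelled index in the question is the DataFrame index rather than a data column, and it matches cells equal to 1 directly. The name locations is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 1],
                   "col2": [1, 0, 1],
                   "col3": [0, 1, 0]})

# Positions (row number, column number) of every cell equal to 1
rows, cols = np.where(df.eq(1).to_numpy())

# Translate positions into labels and pair them up
locations = list(zip(df.index[rows].tolist(), df.columns[cols].tolist()))
print(locations)
```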

Find next higher value in a python dataframe column

We all know how to find the maximum value of a dataframe column.
But how can I find the next higher value in a column? For example, I have the following dataframe:
d = {'col1': [3, 5, 2], 'col2': [3, 4, 3]}
df = pd.DataFrame(data=d)
col1 col2
0 3 3
1 5 4
2 2 3
Basic question:
I want to find the next higher value in col1 above 0; here the outcome would be 2. Is there something similar to df.loc[df['col1'].idxmax()], which would give:
col1 col2
5 4
And my outcome should be:
col1 col2
2 3
Background: I am using an if-condition to filter this dataframe, because I need to prepare it for further filtering and not every value I pass in exists in the column:
v = 0
if len(df[(df['col1'] == v)]) == 0:
    df2 = df[(df['col1'] == v+1)]
else:
    df2 = df[(df['col1'] == v)]
This would lead to an empty dataframe.
But I would like to go to the next existing entry, not v+1=1. In this case I want to use 2, because it is the next higher value after 0 that has an entry. So the condition would be:
v = 0
if len(df[(df['col1'] == v)]) == 0:
    df2 = df[(df['col1'] == 2)]  # the 2 has to be found automatically, as the next value is not at a fixed distance
else:
    df2 = df[(df['col1'] == v)]
How can I achieve that automatically?
So my desired outcome is:
when I put in v=0:
df2
col1 col2
2 3
when I put in v=2, it jumps to v=3:
df2
col1 col2
3 3
If I put v=3, it stays (else-condition):
df2
col1 col2
3 3
Use np.searchsorted from NumPy on the sorted column:
df=df.sort_values('col1')
df.iloc[np.searchsorted(df.col1.values,[0])]
col1 col2
2 2 3
df.iloc[np.searchsorted(df.col1.values,[3,5])]
col1 col2
0 3 3
1 5 4
Add-on (from the asker): this also removes the need for the if-condition.
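The searchsorted idea can be wrapped in a small helper. This is a sketch under stated assumptions: the function name rows_at_or_above is hypothetical, and returning an empty frame when v is larger than every value in the column is a design choice made here, since np.searchsorted then returns a position one past the end and iloc would raise an IndexError.

```python
import numpy as np
import pandas as pd

def rows_at_or_above(df, column, v):
    """Return the rows whose value in `column` is the smallest one >= v."""
    ordered = df.sort_values(column)
    # Position of the first sorted value that is >= v
    pos = int(np.searchsorted(ordered[column].to_numpy(), v, side="left"))
    if pos == len(ordered):
        return ordered.iloc[0:0]   # nothing at or above v
    target = ordered[column].iloc[pos]
    return df[df[column] == target]

df = pd.DataFrame({"col1": [3, 5, 2], "col2": [3, 4, 3]})
print(rows_at_or_above(df, "col1", 0))   # picks the row with col1 == 2
print(rows_at_or_above(df, "col1", 4))   # picks the row with col1 == 5
```

Filtering on the target value at the end, instead of taking a single position, keeps all rows when the column contains that value more than once.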

How to change values in certain columns according to certain rule in pandas dataframe

Suppose I have a pandas dataframe looks like this:
col1 col2
0 A A60
1 B B23
2 C NaN
The data frame is read from a csv file. Suppose I want to change each non-missing value of 'col2' to its prefix (i.e. 'A' or 'B'). How could I do this without writing a for loop?
The expected output is
col1 col2
0 A A
1 B B
2 C NaN
.str[:1] returns just the first character of each string and leaves NaN as it is:
d = {'col1': ['A', 'B','C'], 'col2': ['A32', 'B60',np.nan]}
df = pd.DataFrame(data=d)
df['col2'] = df['col2'].str[:1]
df
out:
col1 col2
0 A A
1 B B
2 C NaN
You can also use the pandas str.replace method directly on the column.
# sample data
df = pd.DataFrame({'col1':['A','B','C'], 'col2':['A60','B23',np.nan]})
# remove numbers from col2 (regex=True is required in pandas 2.0+, where the default is a literal match)
df['col2'] = df['col2'].str.replace(r'\d+', '', regex=True)
print(df)
col1 col2
0 A A
1 B B
2 C NaN
If you use apply instead, guard against missing values with pd.isnull():
df['col2'] = df['col2'].apply(lambda x: str(x)[0] if not pd.isnull(x) else x)
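The two vectorised approaches can be compared on the question's data in a short sketch. The names first_char and letters_only are illustrative. They agree here because every prefix is a single letter; with a value such as 'AB60' (a made-up example), slicing would give 'A' while the regex would give 'AB'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "C"],
                   "col2": ["A60", "B23", np.nan]})

# Slice: keep only the first character; missing values stay missing
first_char = df["col2"].str[:1]

# Regex: strip every run of digits; missing values stay missing
letters_only = df["col2"].str.replace(r"\d+", "", regex=True)

print(first_char)
print(letters_only)
```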
