I need to add new data to the last column of a DataFrame if it has any empty cells, or create a new column otherwise. I wonder if there is a pythonic way to achieve this through pandas functionality (e.g. concat, join, merge, etc.). The example is as follows:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'0':[8, 9, 3, 5, 0], '1':[9, 6, 6, np.nan, np.nan]})
df2 = pd.DataFrame({'2':[2, 9, 4]}, index = [3,4,0])
desired_output = pd.DataFrame({'0': [8, 9, 3, 5, 0],
                               '1': [9, 6, 6, 2, 9],
                               '2': [4, np.nan, np.nan, np.nan, np.nan]})
# df1
0 1
0 8 9
1 9 6
2 3 6
3 5 NaN
4 0 NaN
# df2
2
3 2
4 9
0 4
# desired_output
0 1 2
0 8 9 4
1 9 6 NaN
2 3 6 NaN
3 5 2 NaN
4 0 9 NaN
Your problem can be broken down into two steps:
Concatenate df1 and df2 based on their indexes.
For each row of the concatenated DataFrame, move the NaNs to the end.
Try this:
# Step 1: concatenate the two dataframes
result = pd.concat([df1, df2], axis=1)
# Step 2a: for each row, sort the elements based on their nan status
# For example: sort [1, 2, nan, 3] based on [False, False, True, False]
# np.argsort will return [0, 1, 3, 2]
# Stable sort is critical here since we don't want to swap elements whose
# sort keys are equal.
arr = result.to_numpy()
idx = np.argsort(np.isnan(arr), kind="stable")
# Step 2b: reconstruct the result dataframe based on the sort order
result = pd.DataFrame(np.take_along_axis(arr, idx, axis=1), columns=result.columns, index=result.index)
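For reference, on the example data above this should print something like the following (the round-trip through NumPy upcasts every column to float, so the values match desired_output up to dtype):
print(result)
#      0    1    2
# 0  8.0  9.0  4.0
# 1  9.0  6.0  NaN
# 2  3.0  6.0  NaN
# 3  5.0  2.0  NaN
# 4  0.0  9.0  NaN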
Related
I have a DataFrame with two columns and a Series. I want to attach this Series to the end of one column and have the values of the other column set to NaN for the appended rows. The appended values should not be merged with what is already in the df. I know there is append, but that is deprecated, so I wondered what the alternatives for this task are nowadays.
This is what I did and roughly how I want it to look in the end (ideally without the conversion of 'one' to float that happens here):
import numpy as np
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')
pd.concat([a, pd.DataFrame({'one': np.nan, 'two': b})], ignore_index=True)
one two
0 1.0 4
1 2.0 5
2 3.0 6
3 NaN 7
4 NaN 8
5 NaN 9
This gets more unwieldy with more columns, and I just wanted a cleaner way for its own sake.
concat works fine without manual padding; just convert your Series with to_frame:
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')
out = pd.concat([a.convert_dtypes(), b.to_frame()], ignore_index=True)
NB: convert_dtypes switches to nullable dtypes, so missing values become <NA> and the 'one' column is not converted to float. Also, the name of the Series determines which column it is assigned to; if the Series doesn't already have a name, use to_frame(name='two') (see the snippet after the output below).
Output:
one two
0 1 4
1 2 5
2 3 6
3 <NA> 7
4 <NA> 8
5 <NA> 9
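For an unnamed Series, a minimal variant could look like this (b_unnamed is just an illustrative name; 'two' is the target column from the example):
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b_unnamed = pd.Series([7, 8, 9])  # no name set
out = pd.concat([a.convert_dtypes(), b_unnamed.to_frame(name='two')], ignore_index=True)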
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')

# convert to a dict of lists, extend the target column, and rebuild;
# from_dict with orient='index' pads the shorter lists with NaN
a = a.to_dict('list')
a['two'].extend(b)
pd.DataFrame.from_dict(a, orient='index').T
If you have more columns, use a for loop over the keys (a sketch follows).
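A minimal sketch of that loop, assuming the new values come in as a dict keyed by column name (new_data is an illustrative name, not part of the original answer):
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
new_data = {'two': [7, 8, 9]}              # columns you actually have new values for
d = a.to_dict('list')
for key, values in new_data.items():
    d[key].extend(values)                  # extend only the listed columns
out = pd.DataFrame.from_dict(d, orient='index').T   # shorter lists are padded with NaN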
import pandas as pd
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')
# Series.append is gone in pandas 2.x, so build the extended column with pd.concat;
# the outer merge on the shared 'two' column works here because its values are unique
c = a.merge(pd.concat([a['two'], b]).reset_index(drop=True), how='outer')
print(c)
Result
one two
0 1.0 4
1 2.0 5
2 3.0 6
3 NaN 7
4 NaN 8
5 NaN 9
There are several answers around rolling unique counts in pandas:
Rolling unique value count in pandas
How to efficiently compute a rolling unique count in a pandas time series?
How do I count unique values across multiple columns?
For one column, I can do:
df[my_col] = df[my_col].rolling(300).apply(lambda x: len(np.unique(x)))
How do I extend this to multiple columns, counting unique values across all values in the rolling window?
Inside a list comprehension, iterate over the rolling windows; for each window, flatten the values of the required columns, then use set to get the distinct elements:
cols = [...] # define your cols here
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(300)]
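As a quick illustration with a window of 3 on a small frame (the data here is only for demonstration; note that iterating over df.rolling also yields the partial windows at the start, so the first window-1 counts are computed on fewer rows instead of being NaN):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})
cols = ['col1', 'col2', 'col3']
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(3)]
print(df['count'].tolist())   # [1, 2, 4, 5, 4, 4, 6]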
I took a DataFrame as an example (a 3-row rolling window taking all the columns into account at the same time).
DataFrame for visualization:
col1 col2 col3
0 1 1 1
1 1 1 4
2 2 5 2
3 3 3 3
4 3 7 3
5 5 3 9
6 8 8 2
Proposed script for checking:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

# rolling().apply runs the lambda per column; each call ignores the column values and
# uses the window's index to fetch the full rows, then counts distinct values across
# all columns, so every column yields the same result and we keep the one for 'col1'
df['count'] = df.rolling(3).apply(lambda w: len(set(df.iloc[w.index].to_numpy().flatten())))['col1']
print(df)
Output
col1 col2 col3 count
0 1 1 1 NaN
1 1 1 4 NaN
2 2 5 2 4.0
3 3 3 3 5.0
4 3 7 3 4.0
5 5 3 9 4.0
6 8 8 2 6.0
Another method
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

# method='table' (pandas >= 1.3) passes the whole window to the function as a 2D
# array; engine="numba" requires numba to be installed
df = df.assign(count=df.rolling(3, method='table')
                       .apply(lambda d: len(set(d.flatten())), raw=True, engine="numba")
                       .iloc[:, -1:])
I have a DF like this one:
col1 col2
1 4
NaN 5
NaN 3
7 2
8 10
9 11
How can I get the first column from df as a list without NaN values:
col1_list = [1, 7, 8, 9]
Use Series.dropna, convert the values to integers (if necessary), and then convert to a list:
col1_list = df['col1'].dropna().astype(int).tolist()
print(col1_list)
[1, 7, 8, 9]
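An equivalent without dropna, just for comparison (a sketch using boolean indexing with notna):
col1_list = df.loc[df['col1'].notna(), 'col1'].astype(int).tolist()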
The numbers in ltlist refer to ID numbers that can change. Is it possible to iterate through multiple columns for the items in ltlist? Assume the elements of ltlist in this example aren't constant. I'd also like to use a loop instead of a vectorized if/else, but I couldn't get it to work.
import pandas as pd, numpy as np
ltlist = [1, 2]
org = {'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]}
ltlist_set = set(ltlist)
org['LT'] = np.where(org['ID'].isin(ltlist_set), org['ID'], 0)
I also need to check the ID2 column and write the matching ID into LT, unless LT already has an ID.
output
ID ID2 LT
1 3 1
3 4 0
4 5 0
5 6 0
6 7 0
7 2 2
Thanks!
Since you are using 0 as the default value, you can initialise LT to 0 and then, column by column, overwrite it wherever a value from ltlist is found, keeping the previous LT otherwise.
import pandas as pd
import numpy as np
ltset = set([1, 2])
org = pd.DataFrame({'ID': [1, 3, 4, 5, 6, 7], 'ID2': [3, 4, 5, 6, 7, 2]})
org['LT'] = 0
for col in org.columns.drop('LT'):
    # take this column's value where it is in ltset, otherwise keep the previous LT
    org['LT'] = np.where(org[col].isin(ltset), org[col], org['LT'])
org
# returns:
ID ID2 LT
0 1 3 1
1 3 4 0
2 4 5 0
3 5 6 0
4 6 7 0
5 7 2 2
This will always keep the value of the right-most column that has a value in ltlist. If you want to keep the left-most column that has a value, you can just iterate over the columns in reverse.
for col in org.columns.drop('LT')[::-1]:
...
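Spelled out, the reversed loop would look like this (same names and body as above; because the left-most column is visited last, its value wins):
org['LT'] = 0
for col in org.columns.drop('LT')[::-1]:
    org['LT'] = np.where(org[col].isin(ltset), org[col], org['LT'])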
How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1" : [0, 0, 1, 2, 5, 3, 7],
"col2" : [0, 1, 2, 3, 3, 3, 4],
"col3" : [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
col1 col2
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
How can I keep the information on which columns were merged? e.g. something like
[col1] [col2, col3]
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
Thanks!
# group columns by their values (note: groupby(..., axis=1) is deprecated in recent
# pandas; the transpose-based answer below avoids it)
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of identical columns
unique_df = df.loc[:, grouped_columns.str[0]]
# build a new column name for each group; a list can't be used as a column name,
# so join the names instead
unique_df.columns = grouped_columns.apply("-".join)
unique_df
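With the example frame from the question, this should give roughly:
   col1  col2-col3
0     0          0
1     0          1
2     1          2
3     2          3
4     5          3
5     3          3
6     7          4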
I also used T and tuple to group by:
def f(x):
    # keep the first row of the group and rename its index to the joined column names
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T
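This gives the same result as the previous answer: one column named col1 and one named col2-col3, with the duplicated data kept once.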