Get values from DataFrame column without NaN pandas - python

I have a DF like this one:
col1  col2
   1     4
 NaN     5
 NaN     3
   7     2
   8    10
   9    11
How can I get the first column from df as a list without NaN values:
col1_list = [1, 7, 8, 9]

Use Series.dropna, convert the values to integers (if necessary), and then convert to a list:
col1_list = df['col1'].dropna().astype(int).tolist()
print(col1_list)
[1, 7, 8, 9]
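Equivalently, you can filter with a boolean mask instead of dropna; a minimal sketch (the astype(int) is only needed because the NaN values forced the column to float):
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, np.nan, np.nan, 7, 8, 9],
                   'col2': [4, 5, 3, 2, 10, 11]})

# keep only the non-missing entries, then restore the integer dtype
col1_list = df.loc[df['col1'].notna(), 'col1'].astype(int).tolist()
print(col1_list)  # [1, 7, 8, 9]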

Related

Is there a non-deprecated, clean way to append a series or an array to just one of a dataframe's columns?

I have a dataframe with two columns and a series. I want to attach the series to the end of one column and have the values of the other column set to NaN. These appended values should not be merged with what is already in the df. I know there is append, but it is deprecated, so I wondered what the alternatives are for this task nowadays.
This is what I did and how I want it to look in the end (ideally without the conversion to float):
import numpy as np
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')
pd.concat([a, pd.DataFrame({'one': np.nan, 'two': b})], ignore_index=True)
   one  two
0  1.0    4
1  2.0    5
2  3.0    6
3  NaN    7
4  NaN    8
5  NaN    9
This would be more problematic with more columns and I just wanted a cleaner way for the sake of it.
concat works fine without needing to manually pad; just convert your Series with to_frame:
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')
out = pd.concat([a.convert_dtypes(), b.to_frame()], ignore_index=True)
NB: convert_dtypes switches to nullable dtypes, so the missing values become <NA> instead of forcing a conversion to float. Also, the name of the Series determines which column it is assigned to; if the Series doesn't already have a name, use to_frame(name='two').
Output:
    one  two
0     1    4
1     2    5
2     3    6
3  <NA>    7
4  <NA>    8
5  <NA>    9
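If your Series has no name, the same approach works by naming the column in to_frame; a minimal sketch:
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9])  # unnamed

out = pd.concat([a.convert_dtypes(), b.to_frame(name='two')], ignore_index=True)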
Another option: round-trip through a dict of lists.
a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')

a = a.to_dict('list')
a['two'].extend(b)
# orient='index' pads the shorter rows with NaN; .T restores the original shape
pd.DataFrame.from_dict(a, orient='index').T
If you have more columns, use a for loop over the keys, as sketched below.
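A sketch of that loop, assuming the new values arrive as a dict mapping column names to lists (new_values here is a hypothetical input):
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
new_values = {'two': [7, 8, 9]}  # hypothetical: column name -> values to append

d = a.to_dict('list')
for key, vals in new_values.items():
    d[key].extend(vals)

out = pd.DataFrame.from_dict(d, orient='index').T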
import pandas as pd

a = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
b = pd.Series([7, 8, 9], name='two')

# Series.append is deprecated (removed in pandas 2.0); pd.concat is the equivalent
c = a.merge(pd.concat([a['two'], b]).reset_index(drop=True), how='outer')
print(c)
Result
   one  two
0  1.0    4
1  2.0    5
2  3.0    6
3  NaN    7
4  NaN    8
5  NaN    9

rolling unique value count in pandas across multiple columns

There are several answers around rolling counts in pandas:
Rolling unique value count in pandas
How to efficiently compute a rolling unique count in a pandas time series?
How do I count unique values across multiple columns?
For one column, I can do:
df[my_col] = df[my_col].rolling(300).apply(lambda x: len(np.unique(x)))
How do I extend this to multiple columns, counting unique values across all values in the rolling window?
Inside a list comprehension, iterate over the rolling windows; for each window, flatten the values in the required columns, then use set to get the distinct elements:
cols = [...] # define your cols here
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(300)]
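As a quick sanity check, a minimal sketch on a toy frame (iterating over df.rolling also yields the partial windows at the start, so the first entries count fewer rows than the full window size):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [1, 4, 5]})
cols = ['col1', 'col2']

# windows: [row 0], [rows 0-1], [rows 1-2]
df['count'] = [len(set(w[cols].values.ravel())) for w in df.rolling(2)]
print(df['count'].tolist())  # [1, 2, 4]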
I took a dataframe as an example (3-row rolling window taking all the columns into account at the same time).
Dataframe for visualization
   col1  col2  col3
0     1     1     1
1     1     1     4
2     2     5     2
3     3     3     3
4     3     7     3
5     5     3     9
6     8     8     2
Proposed script for checking
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

# w.index holds the row labels of the current window (equal to positions here,
# since the index is the default RangeIndex), so we can slice the whole frame;
# the same count lands in every column, so we keep just 'col1'.
df['count'] = df.rolling(3).apply(lambda w: len(set(df.iloc[w.index].to_numpy().flatten())))['col1']
print(df)
Output
   col1  col2  col3  count
0     1     1     1    NaN
1     1     1     4    NaN
2     2     5     2    4.0
3     3     3     3    5.0
4     3     7     3    4.0
5     5     3     9    4.0
6     8     8     2    6.0
Another method
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 1, 2, 3, 3, 5, 8],
                   'col2': [1, 1, 5, 3, 7, 3, 8],
                   'col3': [1, 4, 2, 3, 3, 9, 2]})

# method='table' passes the whole window as a 2D array (requires numba to be
# installed); the count is replicated per column, so keep only the last one
df = df.assign(count=df.rolling(3, method='table')
                       .apply(lambda d: len(set(d.flatten())), raw=True, engine='numba')
                       .iloc[:, -1])

How to find the common elements among the columns of a data-frame when the dimensions are not the same

import pandas as pd

a = [1, 2, 3, 4, 7, 9, 12]
df_1 = pd.DataFrame(a, columns=['X'])
b = [4, 8, 9, 11]
df_2 = pd.DataFrame(b, columns=['Y'])
df_f = pd.concat([df_1, df_2], axis=1)
The final data frame looks like:
    X     Y
0   1   4.0
1   2   8.0
2   3   9.0
3   4  11.0
4   7   NaN
5   9   NaN
6  12   NaN
I need to find the common elements. For example, in the above case, the answer is 4 and 9.
You can use .loc[] and .isin() to look for the values in X that are also in Y:
df_f.loc[df_f['X'].isin(df_f['Y']), 'X']
3    4
5    9
Name: X, dtype: int64
This will output a series of those values from X. You can use .to_list() on the end to output these as a list:
df_f.loc[df_f['X'].isin(df_f['Y']), 'X'].to_list()
[4, 9]
You can use set intersection to get the common values between the two columns:
set(df_f['X']).intersection(df_f['Y'])  # {4, 9}
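If you need the result as a sorted list rather than a set:
common = sorted(set(df_f['X']).intersection(df_f['Y']))
print(common)  # [4, 9]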

Concatenate data frames over finite index otherwise start a new column - pandas

I need to add new data to the last column of a data-frame if it has any empty cells, or create a new column otherwise. I wonder if there is any pythonic way to achieve this through pandas functionality (e.g. concat, join, merge, etc.). The example is as follows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'0': [8, 9, 3, 5, 0], '1': [9, 6, 6, np.nan, np.nan]})
df2 = pd.DataFrame({'2': [2, 9, 4]}, index=[3, 4, 0])
desired_output = pd.DataFrame({'0': [8, 9, 3, 5, 0],
                               '1': [9, 6, 6, 2, 9],
                               '2': [4, np.nan, np.nan, np.nan, np.nan]})
# df1
   0    1
0  8    9
1  9    6
2  3    6
3  5  NaN
4  0  NaN

# df2
   2
3  2
4  9
0  4

# desired_output
   0  1    2
0  8  9    4
1  9  6  NaN
2  3  6  NaN
3  5  2  NaN
4  0  9  NaN
Your problem can be broken down into two steps:
1. Concatenate df1 and df2 based on their indexes.
2. For each row of the concatenated dataframe, move the NaNs to the end.
Try this:
# Step 1: concatenate the two dataframes
result = pd.concat([df1, df2], axis=1)
# Step 2a: for each row, sort the elements based on their nan status
# For example: sort [1, 2, nan, 3] based on [False, False, True, False]
# np.argsort will return [0, 1, 3, 2]
# Stable sort is critical here since we don't want to swap elements whose
# sort keys are equal.
arr = result.to_numpy()
idx = np.argsort(np.isnan(arr), kind="stable")
# Step 2b: reconstruct the result dataframe based on the sort order
result = pd.DataFrame(np.take_along_axis(arr, idx, axis=1), columns=result.columns)
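For reference, running this on the example reproduces desired_output, though every column comes back as float because of the NumPy round-trip:
print(result)
#      0    1    2
# 0  8.0  9.0  4.0
# 1  9.0  6.0  NaN
# 2  3.0  6.0  NaN
# 3  5.0  2.0  NaN
# 4  0.0  9.0  NaN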

Successively filling in a new column of a pandas DataFrame

I would like to extend an existing pandas DataFrame and fill the new column successively:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
df['col3'] = pd.Series(['a' for x in df[:3]])
df['col3'] = pd.Series(['b' for x in df[3:4]])
df['col3'] = pd.Series(['c' for x in df[4:]])
I would expect a result as follows:
   col1  col2 col3
0     1     7    a
1     2     8    a
2     3     9    a
3     4    10    b
4     5    11    c
5     6    12    c
However, my code fails and I get:
   col1  col2 col3
0     1     7    a
1     2     8    a
2     3     9  NaN
3     4    10  NaN
4     5    11  NaN
5     6    12  NaN
What is wrong?
Use the loc accessor:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
df.loc[:2,'col3'] = 'a'
df.loc[3,'col3'] = 'b'
df.loc[4:,'col3'] = 'c'
df
   col1  col2 col3
0     1     7    a
1     2     8    a
2     3     9    a
3     4    10    b
4     5    11    c
5     6    12    c
As #Amirhossein Kiani and #Emma note in the comments, you're never using df itself to assign values, so there is no need to slice it. Since you can assign a list to a DataFrame column, the following suffices:
df['col3'] = ['a'] * 3 + ['b'] + ['c'] * (len(df) - 4)
You can also use numpy.select to assign values. The idea is to build a list of boolean conditions over index ranges and pick values accordingly: if the index is less than 3, select 'a'; if it is between 3 and 4 with inclusive='left' (i.e. equals 3), select 'b'; otherwise fall back to the default 'c'.
import numpy as np
df['col3'] = np.select([df.index<3, df.index.to_series().between(3, 4, inclusive='left')], ['a','b'], 'c')
Output:
   col1  col2 col3
0     1     7    a
1     2     8    a
2     3     9    a
3     4    10    b
4     5    11    c
5     6    12    c
Every time you do something like df['col3'] = pd.Series(['a' for x in df[:3]]), you replace the entire col3 column with that new Series; any rows not covered by the Series' index become NaN. (Note also that iterating over a DataFrame yields its column names, not its rows, so ['a' for x in df[:3]] has only two elements, one per column.) One alternative is to build the new column in full first, then assign it to the df.
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
new_col = ['a' for _ in range(3)] + ['b'] + ['c' for _ in range(4, len(df))]
df['col3'] = pd.Series(new_col)
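A compact alternative sketch using numpy.repeat, which spells out the run lengths directly:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6], 'col2': [7, 8, 9, 10, 11, 12]})
# repeat 'a' 3 times, 'b' once, and 'c' for the remaining rows
df['col3'] = np.repeat(['a', 'b', 'c'], [3, 1, len(df) - 4])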
