Selecting multiple columns in pandas that start with similar letter

Low S0.0 S1.0 S2.0 S3.0 S4.0 S5.0 S6.0 S7.0 S8.0 S9.0 S10.0 S11.0
0 55 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 78 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 77 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have the following code to check whether any of the "S" columns is close to "Low":
level=0.035
cond = np.isclose(df.Low, df['S0.0'], rtol=level) | np.isclose(df.Low, df['S1.0'], rtol=level) | ...
df['ST'] = np.where(cond, 100, 0)
But this is too manual. Is there a way to refer to all the S columns without naming each one? The set of S columns also keeps changing, so naming every column explicitly sometimes raises an error. Thanks!

I think a solution can be as follows:
import numpy as np
from itertools import repeat
from operator import or_

# Column names are case-sensitive: the frame uses 'S0.0', 'S1.0', ...
selected_columns = [c for c in df.columns if c.startswith('S')]
cond = None
for low_serie, sel_serie in zip(repeat(df.Low), [df[selected_column] for selected_column in selected_columns]):
    if cond is None:
        # First S column: start the boolean mask
        cond = np.isclose(low_serie, sel_serie, rtol=level)
        continue
    # OR in the comparison for each further S column
    cond = or_(cond, np.isclose(low_serie, sel_serie, rtol=level))
You have to pay attention to the condition that selects the column names. startswith is case-sensitive, so for the columns shown above it has to be c.startswith('S').
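The loop above can also be written without iterating over columns, by comparing Low against all S columns in one broadcast call. This is a minimal sketch, not the answerer's code: the sample values are made up for illustration, and it assumes every column label is a string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the frame in the question
df = pd.DataFrame({
    "Low": [55.0, 60.0, 78.0, 12.0, 77.0],
    "S0.0": [54.0, np.nan, 100.0, np.nan, 10.0],
    "S1.0": [np.nan, 90.0, 79.0, np.nan, 76.5],
})
level = 0.035

# Pick every column whose name starts with "S", however many there are
s_cols = df.columns[df.columns.str.startswith("S")]

# Low becomes a column vector so it broadcasts against all S columns at once;
# comparisons involving NaN come out False
low = df["Low"].to_numpy()[:, None]
close = np.isclose(low, df[s_cols].to_numpy(), rtol=level)

# A row is flagged when at least one S column is close to Low
df["ST"] = np.where(close.any(axis=1), 100, 0)
```

Because the S columns are looked up by prefix each time, the code keeps working when columns are added or removed.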

Related

How to remove rows that include partially Nan values without taking specific part of the row into account?

I am working with multiple big data frames. I want to remove their NaN parts automatically to ease the data cleansing process. Data is collected from a camera or radar feed, but the part of the data I need is when a specific object comes into the view horizon of the camera/radar. So, the data file (frame) looks like the one below, and has lots of NaN values:
total in seconds datetime(utc) channels AlviraPotentialDronePlots_timestamp AlviraPotentialDronPlot_id ...
0 1601381457 2020-09-29 12:10:57 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1601381459 2020-09-29 12:10:59 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1601381460 2020-09-29 12:11:00 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1601381461 2020-09-29 12:11:01 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1601381463 2020-09-29 12:11:03 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... Useful data is here ... ... ... ... ... ... ... ... ...
623 1601382249 2020-09-29 12:24:09 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
624 1601382250 2020-09-29 12:24:10 NaN NaN NaN NaN NaN NaN NaN NaN ... 51.521264 5.858627 5.0 NaN NaN SearchRadar 0.0 0.0 NaN NaN
625 1601382251 2020-09-29 12:24:11 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have removed the columns with all NaN values using:
df = df.dropna(axis=1, how='all')
Now, I want to remove rows that contain all NaN. However, since total in seconds and datetime(utc) are always present in the file, I cannot use the following command:
df = df.dropna(axis=0, how='all')
Also, I cannot use how='any', because that would remove parts of the useful data too (the useful data contains some NaN values, which I will fill later). I need to call dropna() so that it ignores total in seconds and datetime(utc) and removes a row only if all of its other fields are NaN.
The closest I came to solving this problem was the command mentioned in this link, but I am not familiar enough with Python to formulate the following logic:
if in one row field != [is not] 'total in seconds' | [or] 'datetime(utc)' & [and] other fields == [is] 'NaN' then remove the row
I tried writing this with a for loop too, but without success. Can someone help me with this?
Thanks in advance.

You can check all columns except total in seconds and datetime(utc) by passing them to the subset parameter, using Index.difference to build the list:
cols = ['total in seconds','datetime(utc)']
checked = df.columns.difference(cols)
df = df.dropna(subset=checked, how='all')
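To see the effect of subset on a small frame, here is a minimal sketch. The columns "lat" and "lon" and all values are hypothetical stand-ins for the real radar fields.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the radar frame; "lat" and "lon" are made-up names
df = pd.DataFrame({
    "total in seconds": [1601381457, 1601381459, 1601382250, 1601382251],
    "datetime(utc)": pd.to_datetime([
        "2020-09-29 12:10:57", "2020-09-29 12:10:59",
        "2020-09-29 12:24:10", "2020-09-29 12:24:11",
    ]),
    "lat": [np.nan, np.nan, 51.521264, np.nan],
    "lon": [np.nan, np.nan, 5.858627, 5.858630],
})

# Every column except the two that are always filled
cols = ["total in seconds", "datetime(utc)"]
checked = df.columns.difference(cols)

# Drop a row only when all of the checked columns are NaN
cleaned = df.dropna(subset=checked, how="all")
print(cleaned)
```

Rows 0 and 1 are dropped. Row 3 is kept even though "lat" is NaN, because "lon" still holds a value.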

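If you know how many columns are never empty, you can use the thresh parameter, which is the minimum number of non-NaN values a row must have to be kept.
Let's say you have 2 columns that are never empty: with thresh=3, a row is kept only if at least one of the other columns has a value, however many columns there are in total.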
For more, check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
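Since thresh is the number of non-NaN values a row needs in order to survive, the threshold can be derived from the number of always-filled columns. A minimal sketch with made-up data follows; the names "lat", "source" and always_filled are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first two columns are always filled
df = pd.DataFrame({
    "total in seconds": [1601381457, 1601382250, 1601382251],
    "datetime(utc)": ["2020-09-29 12:10:57", "2020-09-29 12:24:10",
                      "2020-09-29 12:24:11"],
    "lat": [np.nan, 51.521264, np.nan],
    "source": [None, "SearchRadar", None],
})

always_filled = 2  # columns that never contain NaN

# Keep rows with at least one non-NaN value beyond the always-filled columns
kept = df.dropna(thresh=always_filled + 1)
print(kept)
```

Only the middle row remains, because it is the only one with more than two non-NaN values.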

pivot_table requires more memory if dtype is category (MemoryError)

I have the following strange error with pandas (pandas==0.23.1):
import pandas as pd
df = pd.DataFrame({'t1': ["a","b","c"]*10000, 't2': ["x","y","z"]*10000, 'i1': list(range(5000))*6, 'i2': list(range(5000))*6, 'dummy':0})
# works fast with less memory
piv = df.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
d2 = df.copy()
d2.t1 = d2.t1.astype('category')
d2.t2 = d2.t2.astype('category')
# needs > 20GB of memory and takes forever
piv2 = d2.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
I am wondering if this is expected and I am doing something wrong, or if this is a bug in pandas. Shouldn't the category dtype be transparent for strings (for this use case)?

This is not a bug. What's happening is that pandas.pivot_table is calculating the Cartesian product of the grouper categories.
This is known, intended behaviour. Pandas v0.23.0 introduced the observed argument for DataFrame.groupby. Setting observed=True includes only observed combinations; it is False by default. At the time of writing, this argument had not yet been rolled out to related methods such as pandas.pivot_table. In my opinion, it should be.
Let's see what this means in practice, using an example dataframe and printing the result.
Setup
We make the dataframe substantially smaller:
import pandas as pd
n = 10
df = pd.DataFrame({'t1': ["a","b","c"]*n, 't2': ["x","y","z"]*n,
                   'i1': list(range(int(n/2)))*6, 'i2': list(range(int(n/2)))*6,
                   'dummy':0})
Without categories
This is likely what you are looking for. Unobserved combinations of categories are not represented in your pivot table.
piv = df.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
print(piv)
t1     a  b  c
t2     x  y  z
i1 i2
0  0   0  0  0
1  1   0  0  0
2  2   0  0  0
3  3   0  0  0
4  4   0  0  0
With categories
With categories, all combinations of categories, even unobserved combinations, are accounted for in the result. This is expensive computationally and memory-hungry. Moreover, the dataframe is dominated by NaN from unobserved combinations. It's probably not what you want.
Update: you can now set the observed parameter to True to only show observed values for categorical groupers.
d2 = df.copy()
d2.t1 = d2.t1.astype('category')
d2.t2 = d2.t2.astype('category')
piv2 = d2.pivot_table(values='dummy', index=['i1','i2'], columns=['t1','t2'])
print(piv2)
t1       a              b              c
t2       x    y    z    x    y    z    x    y    z
i1 i2
0  0   0.0  NaN  NaN  NaN  0.0  NaN  NaN  NaN  0.0
   1   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   2   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   3   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  0   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   1   0.0  NaN  NaN  NaN  0.0  NaN  NaN  NaN  0.0
   2   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   3   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  0   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   1   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   2   0.0  NaN  NaN  NaN  0.0  NaN  NaN  NaN  0.0
   3   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  0   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   1   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   2   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   3   0.0  NaN  NaN  NaN  0.0  NaN  NaN  NaN  0.0
   4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  0   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   1   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   2   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   3   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
   4   0.0  NaN  NaN  NaN  0.0  NaN  NaN  NaN  0.0
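The observed parameter mentioned in the update can be sketched as follows. This assumes a pandas version in which pivot_table accepts observed; it did not in the 0.23.1 release used in the question. The name piv_obs is only illustrative.

```python
import pandas as pd

n = 10
df = pd.DataFrame({'t1': ["a", "b", "c"] * n, 't2': ["x", "y", "z"] * n,
                   'i1': list(range(n // 2)) * 6, 'i2': list(range(n // 2)) * 6,
                   'dummy': 0})

# Same categorical conversion as in the example above
d2 = df.copy()
d2['t1'] = d2['t1'].astype('category')
d2['t2'] = d2['t2'].astype('category')

# observed=True keeps only the category combinations that occur in the data
piv_obs = d2.pivot_table(values='dummy', index=['i1', 'i2'],
                         columns=['t1', 't2'], observed=True)
print(piv_obs.shape)
```

With observed=True the result should have the same 5 rows and 3 columns as the pivot built without categories, so the memory blow-up from unobserved combinations does not occur.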

How to return df with non-nan values of unique column Pandas DataFrame Pythonically

I have got the following dataframe, in which each column contains a set of values, and each index is only used once. However, I would like to get a completely filled dataframe. In order to do that I need to select, from each column, an X amount of values, in which X is the length of the column with the least non-nan values (in this case column '1.0').
>>> stat_df_iws
iws_w -2.0 -1.0 0.0 1.0
0 0.363567 NaN NaN NaN
1 0.183698 NaN NaN NaN
2 NaN -0.337931 NaN NaN
3 -0.231770 NaN NaN NaN
4 NaN 0.544836 NaN NaN
5 NaN -0.377620 NaN NaN
6 NaN NaN -0.428396 NaN
7 NaN NaN -0.443317 NaN
8 NaN -0.268033 NaN NaN
9 NaN 0.246714 NaN NaN
10 NaN NaN -0.503887 NaN
11 NaN NaN NaN -0.298935
12 NaN -0.252775 NaN NaN
13 NaN -0.447757 NaN NaN
14 -0.650598 NaN NaN NaN
15 -0.660542 NaN NaN NaN
16 NaN -0.952041 NaN NaN
17 -0.667356 NaN NaN NaN
18 -0.920873 NaN NaN NaN
19 NaN -0.537657 NaN NaN
20 NaN NaN -0.525121 NaN
21 NaN NaN NaN -0.619755
22 NaN -0.652138 NaN NaN
23 NaN -0.924181 NaN NaN
24 NaN -0.665720 NaN NaN
25 NaN NaN -0.336841 NaN
26 -0.428931 NaN NaN NaN
27 NaN -0.348248 NaN NaN
28 NaN 0.781024 NaN NaN
29 0.110727 NaN NaN NaN
... ... ... ... ...
I've achieved this with the following code, but it is not a very pythonic way of solving this.
def get_non_null_from_pivot(df):
    # Number of non-NaN values in the shortest column
    lngth = min(len(col.dropna()) for ind, col in df.items())
    df = pd.concat([df.loc[:,-2.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:,-1.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:,0.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:,1.0].dropna().head(lngth).reset_index(drop=True)],
                   axis=1)
    return df
Is there a simpler way to achieve the same goal, so that I can more automatically repeat this step for other dataframes? Preferably without for-loops, for efficiency reasons.

I've made the function a little shorter by looping through the columns, and it seems to work perfectly.
def get_non_null_from_pivot_short(df):
    # Number of non-NaN values in the shortest column
    lngth = min(len(col.dropna()) for ind, col in df.items())
    # First lngth non-NaN values of every column, placed side by side
    df = pd.concat([df.loc[:,col].dropna().head(lngth).reset_index(drop=True) for col in df],
                   axis=1)
    return df
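The same idea can be written without an explicit loop by using DataFrame.count and DataFrame.apply. This is a sketch under assumptions: compact_columns is a hypothetical helper name, and the sample frame is made up to mimic the one-value-per-row layout in the question.

```python
import numpy as np
import pandas as pd


def compact_columns(df):
    """Hypothetical helper: keep the first n non-NaN values of each column."""
    # count() gives the non-NaN count per column; the smallest one sets the length
    n = int(df.count().min())
    # Each column is compacted and re-indexed from 0 so the pieces line up
    return df.apply(lambda col: col.dropna().head(n).reset_index(drop=True))


# Made-up frame in which every row holds exactly one value
stat_df = pd.DataFrame({
    -2.0: [1.0, np.nan, 2.0, np.nan, 3.0],
    -1.0: [np.nan, 4.0, np.nan, 5.0, np.nan],
})
filled = compact_columns(stat_df)
print(filled)
```

Note that apply still visits the columns one by one internally, so this is shorter rather than necessarily faster.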

Broadcasting Error Pandas

I have a dataframe with 4 columns. I want to do an element-wise division of the first 3 columns by the value in the 4th column.
I tried:
df2 = pd.DataFrame(df.loc[:,['col1', 'col2', 'col3']].values / df.col4.values)
And I got this error:
ValueError: operands could not be broadcast together with shapes (19,3) (19,)
My solution was:
df2 = pd.DataFrame(df.loc[:,['col1', 'col2', 'col3']].values / df.col4.values.reshape(19,1))
This worked as I wanted, but to be robust for different numbers of rows I would need to do:
.reshape(len(df),1)
This seems an ugly way to do it. Is there a better way around the array shape being (19,)? It seems odd that it has no second dimension.
Best Regards,
Ben

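You can just use div and pass axis=0, which aligns col4 with the row index so that every column is divided by it row by row:
df2 = df.loc[:,['col1', 'col2', 'col3']].div(df.col4, axis=0)
Your error comes from NumPy: with .values you divide a (19,3) array by a (19,) array, and broadcasting compares the trailing dimensions (3 against 19), which do not match. With pandas objects, / aligns the Series index against the DataFrame's column labels instead, and since none of them match you get only NaN, see this example: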
In [220]:
df = pd.DataFrame(columns=list('abcd'), data = np.random.randn(8,4))
df
Out[220]:
a b c d
0 1.074803 0.173520 0.211027 1.357138
1 1.418757 -1.879024 0.536826 1.006160
2 -0.029716 -1.146178 0.100900 -1.035018
3 0.314665 -0.773723 -1.170653 0.648740
4 -0.179666 1.291836 -0.009614 0.392149
5 0.264599 -0.057409 -1.425638 1.024098
6 -0.106062 1.824375 0.595974 1.167115
7 0.601544 -1.237881 0.106854 -1.276829
In [221]:
df.loc[:,['a', 'b', 'c']]/df['d']
Out[221]:
a b c 0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This isn't obvious until you understand how broadcasting works.
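As a small illustration of the two working approaches, the sketch below divides three columns by a fourth, once with div(axis=0) and once with NumPy after adding a trailing axis. The four-row frame and its values are made up; only the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with the column names from the question
df = pd.DataFrame({
    "col1": [2.0, 4.0, 6.0, 8.0],
    "col2": [1.0, 2.0, 3.0, 4.0],
    "col3": [10.0, 20.0, 30.0, 40.0],
    "col4": [2.0, 2.0, 1.0, 4.0],
})
first_three = df[["col1", "col2", "col3"]]

# pandas: axis=0 matches col4 against the row index
by_div = first_three.div(df["col4"], axis=0)

# NumPy: [:, None] turns shape (4,) into (4, 1), which broadcasts against (4, 3)
by_numpy = first_three.to_numpy() / df["col4"].to_numpy()[:, None]

print(by_div)
```

Indexing with [:, None] does the same job as reshape(len(df), 1) without having to spell out the number of rows.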

How can I refer to a column with a number as its name in a pandas dataframe?

I created a square dataframe in which the columns' names are its indices. See below for an example:
matrix
Out[75]:
24787 24798 24799 24789 24790 24791 24793 24797 24794 24796 24795 24788
24787 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24798 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24799 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24789 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24790 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24791 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
...
I want to refer to each column, but matrix['24787'] returns KeyError: '24787' and matrix.24787 returns SyntaxError: invalid syntax. How do I refer to my column?

If the column names are integers (not strings), you can select a column by passing the integer itself, without quotes:
matrix[24787]
or, using the loc label selector,
matrix.loc[:, 24787]
If you want to select by index number, you can use iloc. For example, matrix.iloc[:, 0] selects the first column.
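The three selectors can be compared on a small frame. This is a sketch with a made-up 3x3 matrix. The final rename to string labels is an optional extra, not part of the answer above, for cases where quoted access such as matrix['24787'] is really wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame whose row and column labels are integers
labels = [24787, 24798, 24799]
matrix = pd.DataFrame(np.arange(9).reshape(3, 3), index=labels, columns=labels)

by_label = matrix[24787]         # integer label, no quotes
by_loc = matrix.loc[:, 24787]    # same column through the label selector
by_position = matrix.iloc[:, 0]  # first column by position

# Optional: convert the column labels to strings once, then use quoted access
as_str = matrix.rename(columns=str)
from_str = as_str["24787"]
```

Checking matrix.columns.dtype is a quick way to find out whether the labels are integers or strings before choosing between matrix[24787] and matrix['24787'].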
