Pairwise calculation on elements in a DataFrame

Pairwise calculation on elements in a DataFrame - python

I have a data frame that is structured similar to the following (but in the real case with many more rows and columns).
In [2]: Ex # The example DataFrame
Out[2]:
NameBef v1B v2B v3B v4B NameAft v1A v2A v3A v4A
Id
422 firstBef 133 145 534 745 FirstAft 212 543 2342 4563
862 secondBef 234 434 345 3453 SecondAft 643 493 3433 234
935 thirdBef 232 343 6454 463 thirdAft 423 753 754 743
For each row I want to calculate the quotient each vXB and vXA value from above (the Xs are variables) to end up with a DataFrame like this one
v1Q v2Q v3Q v4Q
Id
422 1.593985 3.744828 4.385768 6.124832
862 2.747863 1.135945 9.950725 0.067767
935 1.823276 2.195335 0.116827 1.604752
Where each element is the quotient of the corresponding elements of the original data frame.
I haven't been able to figure out how to do this conveniently.
To be convenient it would be good if it will not be required to provide only the names of the first and last columns of the "before" and "after" values i.e. 'v1B', 'v4B' and 'v1A', 'v4A' (i.e. not each of the columns).
The following is what I have come up with.
In [3]: C=Ex.columns
In [4]: C1B=C.get_loc('v1B')
In [5]: C2B=C.get_loc('v4B')
In [6]: C1A=C.get_loc('v1A')
In [7]: C2A=C.get_loc('v4A')
In [8]: FB=Ex.ix[:,C1B:C2B+1]
In [9]: FA=Ex.ix[:,C1A:C2A+1]
In [10]: FB # The FB and FA frames have this structure
Out[10]:
v1B v2B v3B v4B
Id
422 133 145 534 745
862 234 434 345 3453
935 232 343 6454 463
[3 rows x 4 columns]
Then finally produce the required DataFrame. This is done by doing the calculation on numpy arrays produced by DataFrame.values.
This method I got from this question/answer:
In [12]: DataFrame((FA.values*1.0) / FB.values,columns=['v1Q','v2Q','v3Q','v4Q'],index=Ex.index)
Out[12]:
v1Q v2Q v3Q v4Q
Id
422 1.593985 3.744828 4.385768 6.124832
862 2.747863 1.135945 9.950725 0.067767
935 1.823276 2.195335 0.116827 1.604752
[3 rows x 4 columns]
Am I missing something? I was hoping that I could achieve this in some much more direct way by doing some operation on the original DataFrame.
Is there no operation to do elementwise calculation directly on DataFrames instead of going via numpy arrays?

You could always use df.filter to select the relevant column names. It can accept a regular expression so you could specify the after/before columns with something like this:
>>> df.filter(regex=r'^v.A$').values / df.filter(regex=r'^v.B$').values
array([[ 1.59398496, 3.74482759, 4.38576779, 6.12483221],
[ 2.74786325, 1.1359447 , 9.95072464, 0.06776716],
[ 1.82327586, 2.19533528, 0.11682677, 1.60475162]])
Regarding the arithmetic operation, you're not missing anything. It's necessary to use Numpy arrays (.values) here as otherwise Pandas computes values from the common index labels in both DataFrames. If an index is missing the calculation results in NaN. Numpy arrays don't have labeled indexes so the element-wise operation succeeds.

Related

pandas apply with assignment on large dataframe

I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only relevant information. Concretely, keep only entries, where the temperature changed by more than 1 since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate much faster (10x) over the rows than a for-loop over dfWeather.iloc. I tried the following.
dfTempReduced = pd.DataFrame(columns = dfWeather.columns)
dfTempReduced.append(dfWeather.iloc[0])
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(TempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)
unfortunately I get the error
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Is there a fast way to get that desired result? Thanks!
EDIT:
Here is some example data
dfWeather[200:220].TEMP
Out[208]:
200 12.28
201 12.31
202 12.28
203 12.28
204 12.24
205 12.21
206 12.17
207 11.93
208 11.83
209 11.76
210 11.66
211 11.55
212 11.48
213 11.43
214 11.37
215 11.33
216 11.36
217 11.33
218 11.29
219 11.27
The desired result would yield only the first and the last entry, since the absolute difference is larger than 1. The first entry is always included.

If you don't want to call this recursive (so you have [1, 2, 3] and you want to keep [1, 3] because 2 is only 1 degree larger than 1 but 3 is more than 1 degree larger than 1, but not than 2) than you can simply use diff.
However, this doesn't work if the values stay longer below the 1°C threshold. To overcome this limitation, you could round the values (to whatever precision but 1°C suggests that to zero-precision would be a good idea;) )
Let us create an example:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2
so now if you are OK with using diff it can be done very efficiently just by:
# either slice
lg = df['TEMP'].apply(round).diff().abs() > 1
df = df[lg]
# or drop
lg = df['TEMP'].apply(round).diff().abs() < 1
df.drop(index=lg.index, inplace=True)
You even have two options to to the reduction. I guess that drop take a minimal twink longer but is more memory efficient than the slicing way.

Aggregations over specific columns of a large dataframe, with named output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
Above produces a dataframe with one year's worth of monthly data, with 36 columns with following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able aggregate over the dataframe and take certain column combinations and produce named outputs. For example, one rules might be that I will take all 'A.*.E' columns (that have any number in the middle), sum them and produce a named output column called 'A.SUM.E'. And then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 25 named aggregation which allows me to name my outputs but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.

Not a groupby solution and it uses a loop but I think it's nontheless rather elegant: first get a list of unique column from - to combinations using a set and then do the sums using filter:
cols = sorted([(x[0],x[1]) for x in set([(x.split('.')[0], x.split('.')[-1]) for x in df.columns])])
for c0, c1 in cols:
df[f'{c0}.SUM.{c1}'] = df.filter(regex = f'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
result[f'{c0}.SUM.{c1}'] = df.filter(regex = f'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using simple groupby (which is even more simple in this particular case):
def grouper(col):
c = col.split('.')
return f'{c[0]}.SUM.{c[-1]}'
df.groupby(grouper, axis=1).sum()

find value in column and based on it create a new dataframe in pandas

I have a variable in the following format fg = 2017-20. It's a string. And also I have a dataframe:
flag №
2017-18 389
2017-19 390
2017-20 391
2017-21 392
2017-22 393
2017-23 394
...
I need to find this value (fg) in the column "flag" and select the appropriate value (in the example it will be 391) in the column "№". Then create new dataframe, in which there will also be a column "№". Add this value to this dataframe and iterate 53 times. The result should look like this:
№_new
391
392
393
394
395
...
442
443
444
It does not look difficult, but I can not find anything suitable based on other issues. Can someone advise anything, please?

You need boolean indexing with loc for filtering, then convert one item Series to scalar by convert to numpy array by values and select first value by [0].
Last create new DataFrame with numpy.arange.
fg = '2017-20'
val = df.loc[df['flag'] == fg, '№'].values[0]
print (val)
391
df1 = pd.DataFrame({'№_new':np.arange(val, val+53)})
print (df1)
№_new
0 391
1 392
2 393
3 394
4 395
5 396
6 397
7 398
8 399
9 400
10 401
11 402
..
..

is in comparision with pandas series

I have a string variable (sample_id) and am trying to see this element exists in a pandas series.
For example:
sample_id = "HERUSAF000043287899"
and
>>> failed.ID
5 HERUSAF000043287899
175 HERUSAM000043667608
195 HERUSAM000043667594
212 HERUSAF000043733959
213 HERUSAF000043733954
214 HERUSAM000043600074
215 HERUSAF000043733999
216 HERUSAF000043733982
217 HERUSAF000043733983
220 HERUSAM000043733661
221 HERUSAM000043734015
222 HERUSAM000043631768
223 HERUSAM000043733650
224 HERUSAM000043733649
225 HERUSAM000043733665
227 HERUSAM000043734019
Name: ID, dtype: object
Yet, when I do a comparison:
>>> sample_id in failed.ID
False
But, if I compare the values individually, the comparison works:
>>> sample_id == failed.ID.iloc[0]
True
How can I check for the individual value in the series without making individual checks?

The in operator checks the pandas index. Check the values explicitely:
sample_id in failed.ID.values

You can also use the handy series method isin.
failed.ID.isin([ sample_id])

How to shift a column in Pandas DataFrame without losing value

I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it without losing values.
(This post is quite similar to How to shift a column in Pandas DataFrame but the validated answer doesn't give the desired output and I can't comment it).
Does anyone know how to do it?
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 271
##5 nan 291

Use loc to add a new blank row to the DataFrame, then perform the shift.
df.loc[max(df.index)+1, :] = None
df.x2 = df.x2.shift(1)
The code above assumes that your index is integer based, which is the pandas default. If you're using a non-integer based index, replace max(df.index)+1 with whatever you want the new last index to be.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pairwise calculation on elements in a DataFrame - python

Related

pandas apply with assignment on large dataframe

Aggregations over specific columns of a large dataframe, with named output

find value in column and based on it create a new dataframe in pandas

is in comparision with pandas series

How to shift a column in Pandas DataFrame without losing value

Categories

Resources