pandas apply with assignment on large dataframe - python

I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only the relevant information. Concretely, I want to keep only entries where the temperature changed by more than 1 degree since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate over the rows much faster (about 10x) than a for-loop over dfWeather.iloc. I tried the following.
dfTempReduced = pd.DataFrame(columns = dfWeather.columns)
dfTempReduced.append(dfWeather.iloc[0])
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(TempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)
unfortunately I get the error
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Is there a fast way to get that desired result? Thanks!
EDIT:
Here is some example data
dfWeather[200:220].TEMP
Out[208]:
200 12.28
201 12.31
202 12.28
203 12.28
204 12.24
205 12.21
206 12.17
207 11.93
208 11.83
209 11.76
210 11.66
211 11.55
212 11.48
213 11.43
214 11.37
215 11.33
216 11.36
217 11.33
218 11.29
219 11.27
The desired result would yield only the first and the last entry, since only between those two does the absolute difference exceed 1. The first entry is always included.

If you don't need this to be "recursive", i.e. relative to the last kept entry (given [1, 2, 3] you would keep [1, 3], because 2 is only 1 degree larger than 1, while 3 is more than 1 degree larger than 1 though not more than 1 degree larger than 2), then you can simply use diff.
However, this doesn't work if the values drift slowly, staying below the 1°C threshold between consecutive entries. To mitigate that, you could round the values first (to whatever precision; a 1°C threshold suggests rounding to zero decimal places).
Let us create an example:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2
So now, if you are OK with using diff, it can be done very efficiently just by:
# either slice
lg = df['TEMP'].round().diff().abs() > 1
df = df[lg]
# or drop (select only the flagged index values, not the whole index)
lg = df['TEMP'].round().diff().abs() <= 1
df.drop(index=lg[lg].index, inplace=True)
You thus have two options for doing the reduction. I would guess that drop takes marginally longer but is more memory efficient than the slicing way.
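If you do need the "recursive" reduction relative to the last kept entry, as the question asks, a plain loop over the underlying NumPy array is still reasonably fast, since it avoids per-row pandas overhead. Here is a minimal sketch under that assumption (the helper name reduce_by_threshold is my own):
import numpy as np
import pandas as pd

def reduce_by_threshold(df, col='TEMP', threshold=1.0):
    # Keep rows whose value differs by >= threshold from the last
    # row that was kept; the first row is always included.
    values = df[col].to_numpy()
    keep = [0]
    last = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - last) >= threshold:
            keep.append(i)
            last = values[i]
    return df.iloc[keep]

dfWeather = pd.DataFrame({'TEMP': np.random.rand(100) * 2})
dfTempReduced = reduce_by_threshold(dfWeather)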

Related

Maximal Subset of Pandas Column Based on a Cutoff

I am having an algorithmic problem which I am trying to solve in Python. I have a pandas dataframe (say) of two columns as follows (I have kept it sorted in descending order here to make the problem easier to explain):
df:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
LA7 185
LA8 180
LA9 150
LA10 100
I have a threshold value for BCOL, say 215. What I want is the maximal subset of the above dataframe such that the mean of BCOL is greater than or equal to 215.
So in this case, if I keep the BCOL values down to 200, the mean of (234, 230, ..., 200) is 218.67, whereas if I keep down to 185 (234, 230, ..., 200, 185), the mean is 213.86. So my maximal subset with a BCOL mean greater than 215 runs from 234 to 200, and I will drop the rest of the rows. My final output dataframe should be:
dfnew:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
I was trying to put BCOL into a list and use a for/while loop, but that is not Pythonic and also a bit time-consuming for a very large table. Is there a more Pythonic way to achieve this in pandas?
Will appreciate any help. Thanks.
IIUC, you could do:
import numpy as np

# guarantee that the DF is sorted by BCOL, descending
df = df.sort_values(by=['BCOL'], ascending=False)
# cumulative mean, then find where it is greater than 215
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) > 215.0
print(df[mask])
Output
ACOL BCOL
0 LA1 234
1 LA2 230
2 LA3 220
3 LA4 218
4 LA5 210
5 LA6 200
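For completeness, the same running mean can be written with pandas' expanding(), which avoids the explicit np.arange. Here is a minimal, self-contained sketch that rebuilds the example data; note that because the series is sorted descending, the running mean only decreases, so the mask selects a contiguous prefix:
import pandas as pd

df = pd.DataFrame({
    'ACOL': [f'LA{i}' for i in range(1, 11)],
    'BCOL': [234, 230, 220, 218, 210, 200, 185, 180, 150, 100],
})

df = df.sort_values(by=['BCOL'], ascending=False)
# expanding().mean() is the running mean of each prefix of the column
mask = df['BCOL'].expanding().mean() > 215.0
print(df[mask])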

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# that we are iterating over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row,
then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance and average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, the first column of each matrix_df was unnamed, and I renamed it with the following code for ease of understanding, since the file is a header-less matrix output.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
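For what it's worth, here is a small sketch of how the per-cell files could be loaded and cleaned in one pass before the steps below; the cell_*.csv glob pattern is an assumption based on the file names above:
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('cell_*.csv')):  # assumed naming: cell_1.csv, cell_2.csv, ...
    dist_df = pd.read_csv(path, index_col=False)
    dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
    frames.append(dist_df)
# frames can then be fed to pd.concat as in Step 1 below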
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
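As a self-contained sketch of both steps, with the example frames rebuilt inline (the matrix column names are the A values as strings, per the CSV headers above):
import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({'B': [255],
                           '145': [74.9],
                           '121': [77.53],
                           '166': [8.0],
                           'avg_distance': [53.47]})

# Step 1: concat the per-cell matrices and merge on the shared column B
out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
# Step 2: for each row, look up the column named after that row's A value
out_df = out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))
print(out_df[['A', 'B', 'cell', 'distance', 'avg_distance']])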

Pairwise calculation on elements in a DataFrame

I have a data frame that is structured similar to the following (but in the real case with many more rows and columns).
In [2]: Ex # The example DataFrame
Out[2]:
NameBef v1B v2B v3B v4B NameAft v1A v2A v3A v4A
Id
422 firstBef 133 145 534 745 FirstAft 212 543 2342 4563
862 secondBef 234 434 345 3453 SecondAft 643 493 3433 234
935 thirdBef 232 343 6454 463 thirdAft 423 753 754 743
For each row I want to calculate the quotient of each vXA value and the corresponding vXB value from above (the Xs are variables), to end up with a DataFrame like this one:
v1Q v2Q v3Q v4Q
Id
422 1.593985 3.744828 4.385768 6.124832
862 2.747863 1.135945 9.950725 0.067767
935 1.823276 2.195335 0.116827 1.604752
Where each element is the quotient of the corresponding elements of the original data frame.
I haven't been able to figure out how to do this conveniently.
To be convenient, it should only be necessary to provide the names of the first and last columns of the "before" and "after" blocks, i.e. 'v1B', 'v4B' and 'v1A', 'v4A' (not each of the columns individually).
The following is what I have come up with.
In [3]: C = Ex.columns
In [4]: C1B = C.get_loc('v1B')
In [5]: C2B = C.get_loc('v4B')
In [6]: C1A = C.get_loc('v1A')
In [7]: C2A = C.get_loc('v4A')
In [8]: FB = Ex.iloc[:, C1B:C2B+1]
In [9]: FA = Ex.iloc[:, C1A:C2A+1]
In [10]: FB # The FB and FA frames have this structure
Out[10]:
v1B v2B v3B v4B
Id
422 133 145 534 745
862 234 434 345 3453
935 232 343 6454 463
[3 rows x 4 columns]
Then I finally produce the required DataFrame by doing the calculation on the NumPy arrays obtained from DataFrame.values.
This method I got from this question/answer:
In [12]: DataFrame((FA.values * 1.0) / FB.values, columns=['v1Q','v2Q','v3Q','v4Q'], index=Ex.index)
Out[12]:
v1Q v2Q v3Q v4Q
Id
422 1.593985 3.744828 4.385768 6.124832
862 2.747863 1.135945 9.950725 0.067767
935 1.823276 2.195335 0.116827 1.604752
[3 rows x 4 columns]
Am I missing something? I was hoping I could achieve this in a much more direct way, by some operation on the original DataFrame.
Is there no operation to do elementwise calculations directly on DataFrames instead of going via numpy arrays?
You could always use df.filter to select the relevant column names. It can accept a regular expression so you could specify the after/before columns with something like this:
>>> df.filter(regex=r'^v.A$').values / df.filter(regex=r'^v.B$').values
array([[ 1.59398496, 3.74482759, 4.38576779, 6.12483221],
[ 2.74786325, 1.1359447 , 9.95072464, 0.06776716],
[ 1.82327586, 2.19533528, 0.11682677, 1.60475162]])
Regarding the arithmetic operation, you're not missing anything. It's necessary to use Numpy arrays (.values) here as otherwise Pandas computes values from the common index labels in both DataFrames. If an index is missing the calculation results in NaN. Numpy arrays don't have labeled indexes so the element-wise operation succeeds.
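If you want the result back as a labeled DataFrame (as in the v1Q..v4Q output above), you can rebuild one around the array; a small sketch along these lines:
import pandas as pd

fa = df.filter(regex=r'^v.A$')
fb = df.filter(regex=r'^v.B$')
# Rename v1A..v4A to v1Q..v4Q for the quotient columns
quotients = pd.DataFrame(fa.values / fb.values,
                         columns=[c.replace('A', 'Q') for c in fa.columns],
                         index=df.index)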

python pandas add a lower level column to multi_index dataframe

Could someone help me achieve this task?
I have data in a multi-level dataframe obtained through the unstack() operation:
Original df:
Density Length Range Count
15k 0.60 small 555
15k 0.60 big 17
15k 1.80 small 141
15k 1.80 big 21
15k 3.60 small 150
15k 3.60 big 26
20k 0.60 small 5543
20k 0.60 big 22
20k 1.80 small 553
20k 1.80 big 25
20k 3.60 small 422
20k 3.60 big 35
df = df.set_index(['Density','Length','Range']).unstack('Range')
# After unstack:
Count
Range big small
Density Length
15k 0.60 17 555
1.80 21 141
3.60 26 150
20k 0.60 22 5543
1.80 25 553
3.60 35 422
Now I am trying to add an extra column at level 1: the ratio small/big. I have tried the following syntax; none raises an error, but the outcomes differ.
#df[:]['ratio']=df['Count']['small']/df['Count']['big'] ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big'] ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big'] ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0] ## case 4. no error, ratio on column level 0
#After execution above code, df:
Count ratio
Range big small
Density Length
15k 0.60 17 555 32.65
1.80 21 141 6.71
3.60 26 150 5.77
20k 0.60 22 5543 251.95
1.80 25 553 22.12
3.60 35 422 12.06
I don't understand why cases 1 & 2 show no error yet add no ratio column, and why in cases 3 & 4 the ratio column lands on column level 0 instead of the expected level 1. I would also like to know if there is a better/more concise way to achieve this. Case 4 is the best I can do, but I don't like referring to a column by implicit positional indexing instead of by name.
Thanks
Case 1:
df[:]['ratio']=df['Count']['small']/df['Count']['big']
df[:] is a copy of df. They are different objects, each with its own copy of the underlying data:
In [69]: df[:] is df
Out[69]: False
So modifying the copy has no effect on the original df. Since no reference is
maintained for df[:], the object is garbage collected after the assignment,
making the assignment useless.
Case 2:
df['Count']['ratio']=df['Count']['small']/df['Count']['big']
uses chained indexing. Avoid chained indexing when making assignments; the documentation linked below explains why assignments that use chained indexing on the left-hand side may not affect df.
If you set
pd.options.mode.chained_assignment = 'warn'
then Pandas will warn you not to use chain-indexing in assignments:
SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Case 3:
df['ratio']=df['Count']['small']/df['Count']['big']
and Case 4
df['ratio']=df.ix[:,1]/df.ix[:,0]
both work, but it could be done more efficiently using
df['ratio'] = df['Count','small']/df['Count','big']
Here is a microbenchmark showing that using df[tuple_index] is faster than
chain-indexing:
In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop
In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop
If you want ratio to be the level 1 label, then you must tell Pandas that the level 0 label is Count. You can do that by assigning to df['Count','ratio']:
In [96]: df['Count','ratio'] = df['Count','small']/df['Count','big']
In [97]: df
Out[97]:
Count
Range big small ratio
Density Length
15k 0.6 17 555 32.647059
1.8 21 141 6.714286
3.6 26 150 5.769231
20k 0.6 22 5543 251.954545
1.8 25 553 22.120000
3.6 35 422 12.057143
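As a runnable sketch, rebuilding the example frame from the original data and adding the ratio at level 1 with the tuple assignment:
import pandas as pd

df = pd.DataFrame({
    'Density': ['15k'] * 6 + ['20k'] * 6,
    'Length': [0.60, 0.60, 1.80, 1.80, 3.60, 3.60] * 2,
    'Range': ['small', 'big'] * 6,
    'Count': [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})

df = df.set_index(['Density', 'Length', 'Range']).unstack('Range')
# The full (level0, level1) tuple places 'ratio' inside the 'Count' group
df['Count', 'ratio'] = df['Count', 'small'] / df['Count', 'big']
print(df)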

Calculating difference between two rows in Python / Pandas

In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas. I have a dataframe full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS: is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (Note: I'm still in Python beginner mode, so "index" may not be the right term, nor even the correct way to implement this.)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
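As an aside, diff() is shorthand for subtracting a shifted copy of the column, which is handy if you want the difference as a new column on the original frame; a small sketch (the Change column name is my own):
# Equivalent to data['Close'].diff(): subtract the previous row's value
data['Change'] = data['Close'] - data['Close'].shift(1)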
To calculate the difference on just one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only, and then keep the rows where that difference is less than 15 (note that the NaN in the first row compares as False, so that row is dropped too).
df['A_dif'] = df['A'].diff()
df =
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 -19.0
3 32 65 6.0
df = df[df['A_dif'] < 15]
df =
A B A_dif
2 26 48 -19.0
3 32 65 6.0
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you a pure-Python solution that might be of some help even if you end up using pandas:
import csv
import io
import urllib.request

# This retrieves the CSV file and loads it into a list, converting
# all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(io.TextIOWrapper(urllib.request.urlopen(url)), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + [float(x) for x in r[1:]] for r in list(reader)[1:]])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<index>, <item>)
    if i == 0:
        # The first row has no predecessor to diff against, so skip it
        continue
    # Difference of each numeric field with the same field in the previous row
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, 7)])
