I'm trying to select rows out of groups by max value using df.loc[df.groupby(keys)['column'].idxmax()].
I'm finding, however, that df.groupby(keys)['column'].idxmax() takes a really long time on my dataset of about 27M rows. Interestingly, running df.groupby(keys)['column'].max() on my dataset takes only 13 seconds while running df.groupby(keys)['column'].idxmax() takes 55 minutes. I don't understand why returning the indexes of the rows takes 250 times longer than returning a value from the row. Maybe there is something I can do to speed up idxmax?
If not, is there an alternative way of selecting rows out of groups by max value that might be faster than using idxmax?
For additional info, I'm using two keys and sorted the dataframe on those keys prior to the groupby and idxmax operations. Here's what it looks like in Jupyter Notebook:
import pandas as pd
df = pd.read_csv('/data/Broadband Data/fbd_us_without_satellite_jun2019_v1.csv', encoding='ANSI', \
usecols=['BlockCode', 'HocoNum', 'HocoFinal', 'TechCode', 'Consumer', 'MaxAdDown', 'MaxAdUp'])
%%time
df = df[df.Consumer == 1]
df.sort_values(['BlockCode', 'HocoNum'], inplace=True)
print(df)
HocoNum HocoFinal BlockCode TechCode
4631064 130077 AT&T Inc. 10010201001000 10
4679561 130077 AT&T Inc. 10010201001000 11
28163032 130235 Charter Communications 10010201001000 43
11134756 131480 WideOpenWest Finance, LLC 10010201001000 42
11174634 131480 WideOpenWest Finance, LLC 10010201001000 50
... ... ... ... ...
15389917 190062 Broadband VI, LLC 780309900000014 70
10930322 130081 ATN International, Inc. 780309900000015 70
15389918 190062 Broadband VI, LLC 780309900000015 70
10930323 130081 ATN International, Inc. 780309900000016 70
15389919 190062 Broadband VI, LLC 780309900000016 70
Consumer MaxAdDown MaxAdUp
4631064 1 6.0 0.512
4679561 1 18.0 0.768
28163032 1 940.0 35.000
11134756 1 1000.0 50.000
11174634 1 1000.0 50.000
... ... ... ...
15389917 1 25.0 5.000
10930322 1 25.0 5.000
15389918 1 25.0 5.000
10930323 1 25.0 5.000
15389919 1 25.0 5.000
[26991941 rows x 7 columns]
Wall time: 21.6 s
%time df.groupby(['BlockCode', 'HocoNum'])['MaxAdDown'].max()
Wall time: 13 s
BlockCode HocoNum
10010201001000 130077 18.0
130235 940.0
131480 1000.0
10010201001001 130235 940.0
10010201001002 130077 6.0
...
780309900000014 190062 25.0
780309900000015 130081 25.0
190062 25.0
780309900000016 130081 25.0
190062 25.0
Name: MaxAdDown, Length: 20613795, dtype: float64
%time df.groupby(['BlockCode', 'HocoNum'])['MaxAdDown'].idxmax()
Wall time: 55min 24s
BlockCode HocoNum
10010201001000 130077 4679561
130235 28163032
131480 11134756
10010201001001 130235 28163033
10010201001002 130077 4637222
...
780309900000014 190062 15389917
780309900000015 130081 10930322
190062 15389918
780309900000016 130081 10930323
190062 15389919
Name: MaxAdDown, Length: 20613795, dtype: int64
You'll see in the very first rows of data there are two entries for AT&T in the same BlockCode, one with a MaxAdDown of 6Mbps and one with 18Mbps. I want to keep the 18Mbps row and drop the 6Mbps row, so that there is one row per company per BlockCode with the maximum MaxAdDown value. I need the entire row, not just the MaxAdDown value.
Sort and drop duplicates:
df.sort_values('MaxAdDown').drop_duplicates(['BlockCode', 'HocoNum'], keep='last')
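If that is still slow, another option I'd expect to beat idxmax (a sketch, not benchmarked on your 27M rows) is to compute the per-group max with transform and keep the rows that equal it; note this keeps all tied rows rather than exactly one per group:
df[df['MaxAdDown'] == df.groupby(['BlockCode', 'HocoNum'])['MaxAdDown'].transform('max')]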
I need to understand slicing with a MultiIndex, for example:
health_data
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
And by doing the following command:
health_data.iloc[:2, :2]
I get back:
subject Bob
type HR Temp
year visit
2013 1 31.0 38.7
2 44.0 37.7
Can anybody please tell me why the result is like this? Where does the indexing start in a multi-indexed DataFrame?
If I interpret your data correctly, we can rebuild your df as follows, using pd.MultiIndex.from_tuples and the regular df constructor pd.DataFrame, to gain some clarity about its structure:
import pandas as pd
import numpy as np
tuples_columns = [('Bob', 'HR'), ('Bob', 'Temp'), ('Guido', 'HR'),
('Guido', 'Temp'), ('Sue', 'HR'), ('Sue', 'Temp')]
columns = pd.MultiIndex.from_tuples(tuples_columns, names=['subject', 'type'])
tuples_index = [(2013, 1), (2013, 2), (2014, 1), (2014, 2)]
index = pd.MultiIndex.from_tuples(tuples_index, names=['year', 'visit'])
data = np.array([[31. , 38.7, 32. , 36.7, 35. , 37.2],
[44. , 37.7, 50. , 35. , 29. , 36.7],
[30. , 37.4, 39. , 37.8, 61. , 36.9],
[47. , 37.8, 48. , 37.3, 51. , 36.5]])
health_data = pd.DataFrame(data=data, columns=columns, index=index)
print(health_data)
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
As you can see from this snippet, both your columns and index are MultiIndices with names for each level (2 levels for both: 0 and 1), which we find in the top left corner of the print. N.B. The names are not part of the columns/index in the sense that you cannot use them directly to select from the df. E.g. your columns start with (Bob, HR), not with subject and/or type. You can of course select the names if you want to:
print(health_data.columns.names)
['subject', 'type']
Or indeed, you can also reset them to None values, in which case they will disappear, without otherwise affecting the structure of your df:
health_data.columns.names = [None, None]
health_data.index.names = [None, None]
print(health_data)
Bob Guido Sue
HR Temp HR Temp HR Temp
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
The other confusing thing is probably that the values from the first level (0) are not repeated: they become blanks when they appear as duplicates. Not to worry, they are still there. This is just done to provide a better sense of the relation between the different levels. E.g. your actual index values look like this:
print(health_data.index)
MultiIndex([(2013, 1),
(2013, 2),
(2014, 1),
(2014, 2)],
names=['year', 'visit'])
But since 2013 occurs in both (2013, 1), (2013, 2), this is displayed as if they are (2013, 1), ('', 2). When you get used to this notation, it is actually much easier to see, e.g. that you just have two years (2013, 2014) with two sub levels (i.e. visit) for each: 1, 2.
Lastly, let's review your df.iloc example:
health_data.iloc[:2, :2]
subject Bob
type HR Temp
year visit
2013 1 31.0 38.7
2 44.0 37.7
We can see now how this works: we are selecting :2 from the index (so: 0, 1) and same for the columns. subject and type are just the names for the columns, year and visit just the names for the index, while Bob and 2013 are not repeated in the respective levels 0 of both MultiIndices since they are duplicates.
Suppose we want to select the same data using df.loc, we could do this as follows:
health_data.loc[[(2013,1),(2013,2)], [('Bob','HR'),('Bob','Temp')]]
# same result
Or, perhaps more conveniently, we make use of index.get_level_values, and do something like this:
health_data.loc[health_data.index.get_level_values(0) == 2013,
health_data.columns.get_level_values(0) == 'Bob']
# same result
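If you get used to the label structure, pd.IndexSlice can make this kind of selection a bit more readable; a small sketch producing the same selection:
idx = pd.IndexSlice
health_data.loc[idx[2013, :], idx['Bob', :]]
# same result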
Story that pertains to a new design solution
The goal is to use weather data to fit an ARIMA model on each group of like-named 'stations' with their associated precipitation data, then execute a 30-day forward forecast. I'm looking to process one set of same-named stations, then move on to the next unique station name, and so on.
The algorithm question
How do I write an algorithm that runs an ARIMA model for each UNIQUE 'station', perhaps by grouping same-named stations into unique groups and running the ARIMA model on each group, and then fits a 30-day forward forecast? ARIMA(2,1,1) is a working set of order terms from auto.arima().
How do I group same-named 'stations' before running the ARIMA model, fit, and forecast? Or what other approach would process one set of like-named stations and then move on to the next unique station name?
Working code executes but needs broader algorithm
The code was working, but on the last run predict(start=start_date, end=end_date) raised a KeyError. I removed NA values, so that may fix predict(start, end):
wd.weather_data = wd.weather_data[wd.weather_data['date'].notna()]
forecast_models = [50000]
n = 1
df_all_stations = data_prcp.drop(['level_0', 'index', 'prcp'], axis=1)
wd.weather_data.sort_values("date", axis = 0, ascending = True, inplace = True)
for station_name in wd.weather_data['station']:
    start_date = pd.to_datetime(wd.weather_data['date'])
    number_of_days = 31
    end_date = pd.to_datetime(start_date) + pd.DateOffset(days=30)
    model = statsmodels.tsa.arima_model.ARIMA(wd.weather_data['prcp'], order=(2,1,1))
    model_fit = model.fit()
    forecast = model_fit.predict(start=start_date, end=end_date)
    forecast_models.append(forecast)
Data Source
<bound method NDFrame.head of station date tavg tmin tmax prcp snow
0 Anchorage, AK 2018-01-01 -4.166667 -8.033333 -0.30 0.3 80.0
35328 Grand Forks, ND 2018-01-01 -14.900000 -23.300000 -6.70 0.0 0.0
86016 Key West, FL 2018-01-01 20.700000 16.100000 25.60 0.0 0.0
59904 Wilmington, NC 2018-01-01 -2.500000 -7.100000 0.00 0.0 0.0
66048 State College, PA 2018-01-01 -13.500000 -17.000000 -10.00 4.5 0.0
... ... ... ... ... ... ... ...
151850 Kansas City, MO 2022-03-30 9.550000 3.700000 16.55 21.1 0.0
151889 Springfield, MO 2022-03-30 12.400000 4.500000 17.10 48.9 0.0
151890 St. Louis, MO 2022-03-30 14.800000 8.000000 17.60 24.9 0.0
151891 State College, PA 2022-03-30 0.400000 -5.200000 6.20 0.2 0.0
151899 Wilmington, NC 2022-03-30 14.400000 6.200000 20.20 0.0 0.0
wdir wspd pres
0 143.0 5.766667 995.133333
35328 172.0 33.800000 1019.200000
86016 4.0 13.000000 1019.900000
59904 200.0 21.600000 1017.000000
66048 243.0 12.700000 1015.200000
... ... ... ...
151850 294.5 24.400000 998.000000
151889 227.0 19.700000 997.000000
151890 204.0 20.300000 996.400000
151891 129.0 10.800000 1020.400000
151899 154.0 16.400000 1021.900000
Error
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
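One way to approach the grouping (a hedged sketch, not the original code): fit one ARIMA per station by iterating over groupby('station'), and give each precipitation series a DatetimeIndex with a frequency so that date-based start/end arguments can be resolved, which is the likely cause of the KeyError above. This sketch assumes the newer statsmodels.tsa.arima.model.ARIMA API and the column names shown in the data dump:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

forecasts = {}
for station, grp in wd.weather_data.groupby('station'):
    # one daily precipitation series per station, indexed by date
    series = (grp.assign(date=pd.to_datetime(grp['date']))
                 .sort_values('date')
                 .set_index('date')['prcp']
                 .asfreq('D'))            # DatetimeIndex with a daily frequency
    model_fit = ARIMA(series, order=(2, 1, 1)).fit()
    forecasts[station] = model_fit.forecast(steps=30)   # 30-day forward forecast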
I have two dataframes
1st
dt SRNE CRSR GME ... ASO TH DTE ATH
0 2021-04-12 00:00:00 6.940 33.67 141.09 ... 32.29 3.42 135.63 50.80
1 2021-04-13 00:00:00 6.930 33.71 140.99 ... 31.68 3.39 137.63 50.88
2 2021-04-14 00:00:00 7.385 33.93 166.53 ... 30.82 3.23 138.72 53.35
3 2021-04-15 00:00:00 7.440 34.16 156.44 ... 30.54 3.26 139.48 54.14
4 2021-04-16 00:00:00 7.490 32.60 154.69 ... 30.77 2.79 140.68 55.45
2nd
dt text compare
0 2021-03-19 14:59:49+00:00 i only need uxy to hit 20 eod to make up for a... 1
1 2021-03-19 14:59:51+00:00 oh this isn’t good 0
2 2021-03-19 14:59:51+00:00 lads why is my account covered in more red ink... 0
3 2021-03-19 14:59:51+00:00 i'm tempted to drop my last 800 into some stup... 0
4 2021-03-19 14:59:52+00:00 the sell offs will continue until moral improves. 0
I want to remove rows that don't match between both dataframes by looking at the dt column.
I tried
discussion = discussion[discussion['dt'] == price['dt']]
It gives an error ValueError: Can only compare identically-labeled Series objects
I assume it is because the column names don't match
Appreciate your help
import pandas as pd
discussion = pd.DataFrame([['2021-04-12 00:00:00',6.940,33.67,141.09,32.29, 3.42, 135.63, 50.80],
['2021-04-13 00:00:00',6.930,33.71,140.99,31.68, 3.39, 137.63, 50.88],
['2021-04-14 00:00:00',7.385,33.93,166.53,30.82, 3.23, 138.72, 53.35],
['2021-04-15 00:00:00',7.440,34.16,156.44,30.54, 3.26, 139.48, 54.14],
['2021-04-16 00:00:00',7.490,32.60,154.69,30.77, 2.79, 140.68, 55.45]],
columns=['dt', 'SRNE', 'CRSR', 'GME', 'ASO', 'TH', 'DTE', 'ATH'])
discussion['dt'] = pd.to_datetime(discussion['dt'])
price = pd.DataFrame([['2021-04-12 23:30:00','i only need uxy to hit 20 eod to make up for a...', 1],
['2021-03-19 14:59:51+00:00','oh this isn’t good ',0],
['2021-03-19 14:59:51+00:00','lads why is my account covered in more red ink... ', 0],
['2021-03-19 14:59:51+00:00','im tempted to drop my last 800 into some stup... ', 0],
['2021-04-16 12:45:00','the sell offs will continue until moral improves. ', 0]],
columns=['dt', 'text', 'compare'])
price['dt'] = pd.to_datetime(price['dt'], utc=True)
discussion = discussion[discussion['dt'].dt.date.isin(price['dt'].dt.date)]
discussion
Output
dt SRNE CRSR GME ASO TH DTE ATH
0 2021-04-12 6.94 33.67 141.09 32.29 3.42 135.63 50.80
4 2021-04-16 7.49 32.60 154.69 30.77 2.79 140.68 55.45
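If you also want the matching rows joined together (or just prefer a join), an equivalent sketch using an inner merge on the calendar date; the 'day' helper column is introduced here purely for illustration:
d = discussion.assign(day=discussion['dt'].dt.date)
p = price.assign(day=price['dt'].dt.date)
d.merge(p[['day']].drop_duplicates(), on='day').drop(columns='day')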
My raw data is:
def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(range(50, 140, 5), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
The thing I want to do is this: I have a function f(k)
f(k) = (k-100)/100 - ln(k/100)
I want to calculate w, which follows these steps:
get the 1-period forward value of f(k) (call it f1_f(k)), then calculate
tmpw(k) = (f1_f(k) - f(k)) / dk
w is calculated as
w[0] = tmpw[0]
w[n] = tmpw[n] - (w[0] + w[1] + ... + w[n-1])
And the result looks like
nbr date k f(k) f1_f(k) d_k tmpw w
10 2019-02-19 100 0.000000 0.009679 5.0 0.001936 0.001936
11 2019-02-19 105 0.009679 0.037519 5.0 0.005568 0.003632
12 2019-02-19 110 0.037519 0.081904 5.0 0.008877 0.003309
13 2019-02-19 115 0.081904 0.141428 5.0 ...
14 2019-02-19 120 0.141428 0.214852 5.0 ...
15 2019-02-19 125 0.214852 0.301086 5.0
16 2019-02-19 130 0.301086 0.399163 5.0
Question: could anyone help derive quick code for this (not a mathematical derivation) without using a loop?
Thanks a lot!
I don't fully understand your question; all that notation was a bit confusing for me.
If I understood what you want correctly: for every row you want an accumulated value of all previous rows, and then the value of another column in that row is calculated based on this accumulated value.
In this case I would prefer to calculate an accumulated column first and use it later.
for example:
Note: you need to call list(range()) instead of passing range alone, as your example otherwise throws an error.
import pandas as pd
import numpy as np

def f_ST(ST, F, T):
    a = ST/F - 1 - np.log(ST/F)
    return 2*a/T

df = pd.DataFrame(list(range(50, 140, 5)), columns=['K'])
df['f(K0)'] = df.apply(lambda x: f_ST(x.K, 100, 0.25), axis=1)
df['f(K1)'] = df['f(K0)'].shift(-1)
df['dK'] = df['K'].diff(1)
df['accumulate'] = df['K'].shift(1).cumsum()
df['currentVal-accumulated'] = df['K'] - df['accumulate']
print(df)
prints:
K f(K0) ... accumulate currentVal-accumulated
0 50 1.545177 ... NaN NaN
1 55 1.182696 ... 50.0 5.0
2 60 0.886605 ... 105.0 -45.0
3 65 0.646263 ... 165.0 -100.0
4 70 0.453400 ... 230.0 -160.0
5 75 0.301457 ... 300.0 -225.0
6 80 0.185148 ... 375.0 -295.0
7 85 0.100151 ... 455.0 -370.0
8 90 0.042884 ... 540.0 -450.0
9 95 0.010346 ... 630.0 -535.0
10 100 0.000000 ... 725.0 -625.0
11 105 0.009679 ... 825.0 -720.0
12 110 0.037519 ... 930.0 -820.0
13 115 0.081904 ... 1040.0 -925.0
14 120 0.141428 ... 1155.0 -1035.0
15 125 0.214852 ... 1275.0 -1150.0
16 130 0.301086 ... 1400.0 -1270.0
17 135 0.399163 ... 1530.0 -1395.0
[18 rows x 6 columns]
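As for the w column the question actually asks about: since w[0] = tmpw[0] and w[n] = tmpw[n] - (w[0] + ... + w[n-1]), the running sum of w always equals tmpw, so w is simply the first difference of tmpw. A loop-free sketch using the columns built above (slice the frame to K >= 100 first if, as in the sample output, the series should start there):
df['tmpw'] = (df['f(K1)'] - df['f(K0)']) / df['dK']
df['w'] = df['tmpw'].diff().fillna(df['tmpw'])  # first w equals tmpw itself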
I have 2 Python Dataframes:
The first Dataframe contains all data imported to the DataFrame, which consists of "prodcode", "sentiment", "summaryText", "reviewText", etc. of all the initial Review Data.
DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]
which produces:
prodcode summaryText reviewText overall reviewerID ... helpful reviewTime unixReviewTime sentiment textLength
0 B00002243X Work Well - Should Have Bought Longer Ones I needed a set of jumper cables for my new car... 5.0 A3F73SC1LY51OO ... [4, 4] 08 17, 2011 1313539200 2 516
1 B00002243X Okay long cables These long cables work fine for my truck, but ... 4.0 A20S66SKYXULG2 ... [1, 1] 09 4, 2011 1315094400 2 265
2 B00002243X Looks and feels heavy Duty Can't comment much on these since they have no... 5.0 A2I8LFSN2IS5EO ... [0, 0] 07 25, 2013 1374710400 2 1142
3 B00002243X Excellent choice for Jumper Cables!!! I absolutley love Amazon!!! For the price of ... 5.0 A3GT2EWQSO45ZG ... [19, 19] 12 21, 2010 1292889600 2 4739
4 B00002243X Excellent, High Quality Starter Cables I purchased the 12' feet long cable set and th... 5.0 A3ESWJPAVRPWB4 ... [0, 0] 07 4, 2012 1341360000 2 415
The second Dataframe is a grouping of all prodcodes with the ratio of each sentiment score to all reviews made for that product, i.e. the ratio of reviews with that sentiment score over all review scores for that particular product.
df1 = (
    DFF.groupby(["prodcode", "sentiment"]).count()
    .join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]
df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
df1 = df1.pivot("prodcode", 'sentiment', 'result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')
sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)
which produces the following DF:
sentiment 0 1 2
prodcode
B0024E6QOO 80.0 0.0 20.0
B000GPV2QA 67.0 17.0 17.0
B0067DNSUI 67.0 0.0 33.0
B00192JH4S 62.0 12.0 25.0
B0087FSA0C 60.0 20.0 20.0
B0002KM5L0 60.0 0.0 40.0
B000DZBP60 60.0 0.0 40.0
B000PJCBOE 60.0 0.0 40.0
B0033A5PPO 57.0 29.0 14.0
B003POL69C 57.0 14.0 29.0
B0002Z9L8K 56.0 31.0 12.0
What I am now trying to do is filter my first dataframe in two ways. The first, by the results of the second dataframe: I want the first dataframe to be filtered by the prodcodes from the second dataframe where df1.sentiment['0'] > 40. From that list, I want to filter the first dataframe to those rows where 'sentiment' in the first dataframe = 0.
At a high level, I am trying to obtain the prodcode, summaryText and reviewText in the first dataframe for Products that had high ratios in lower sentiment scores, and whose sentiment is 0.
Something like this, assuming all the data you need is in df1 and no merges are needed:
m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0])  # create a list matching your criteria
df.loc[(df['0'] > 40) & (df['sentiment'].isin(m))]  # filter according to your conditions
I figured it out:
DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))
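For completeness, a sketch of the two-step filter described in the question without the merge (whether the pivoted column labels are the strings '0'/'1'/'2' or the integers 0/1/2 depends on the dtype of the sentiment column, so adjust accordingly):
low_prods = df1.index[df1['0'] > 40]  # prodcodes whose sentiment-0 ratio exceeds 40
DFF.loc[(DFF['sentiment'] == 0) & (DFF['prodcode'].isin(low_prods)), ['prodcode', 'summaryText', 'reviewText']]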