How to replace a classic for loop with df.iterrows()? - python

I have a huge data frame.
I am using a for loop in the below sample code:
for i in range(1, len(df_A2C), 1):
    A2C_TT = df_A2C.loc[(df_A2C['TO_ID'] == i)].sort_values('DURATION_H').head(1)
    if A2C_TT.size > 0:
        print(A2C_TT)
This works fine, but I want to use df.iterrows() since it would help me automatically avoid empty-frame issues.
I want to iterate through TO_ID and look for the minimum DURATION_H for each value.
How should I replace my classic i loop counter with df.iterrows()?
Sample Data:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528555556 38.4398
2 26 0.512511111 37.38515
3 71 0.432452778 32.57571
4 83 0.599486111 39.26188
5 98 0.590516667 35.53107
6 108 1.077794444 76.79874
7 139 0.838972222 58.86963
8 146 1.185088889 76.39174
9 158 0.625872222 45.6373
10 208 0.500122222 31.85239
11 209 0.530916667 29.50249
12 221 0.945444444 62.69099
13 224 1.080883333 66.06291
14 240 0.734269444 48.1778
15 272 0.822875 57.5008
16 349 1.171163889 76.43536
17 350 1.080097222 71.16137
18 412 0.503583333 38.19685
19 416 1.144961111 74.35502

As far as I understand your question, you want to group your data by TO_ID and select the row where DURATION_H is the smallest. Is that right?
df.loc[df.groupby('TO_ID').DURATION_H.idxmin()]

Here is one way to do it:
# loop once per unique TO_ID
# instead of using iterrows, which visits every row, or counting up to the size of the frame
import numpy as np
for idx in np.unique(df['TO_ID']):
    A2C_TT = df.loc[(df['TO_ID'] == idx)].sort_values('DURATION_H').head(1)
    print(A2C_TT)
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
Here is another way to do it:
df.loc[df['DURATION_H'].eq(df.groupby('TO_ID')['DURATION_H'].transform('min'))]
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
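The two loop-free approaches in these answers can be compared on a small frame. This is a minimal sketch with made-up values (not the asker's data), assuming the column names from the question. Because groupby only visits TO_ID values that actually occur, neither version can produce an empty frame.

```python
import pandas as pd

# Hypothetical sample: several rows per TO_ID, with a tie for TO_ID 26.
df = pd.DataFrame({
    "FROM_ID": [1, 2, 3, 4, 5],
    "TO_ID": [7, 7, 26, 26, 71],
    "DURATION_H": [0.53, 0.43, 0.51, 0.51, 0.60],
    "DIST_KM": [38.4, 25.5, 37.4, 36.9, 39.3],
})

# One row per TO_ID: idxmin gives the index label of the first minimum in each group.
fastest = df.loc[df.groupby("TO_ID")["DURATION_H"].idxmin()]

# Every row equal to its group's minimum: ties are kept.
group_min = df.groupby("TO_ID")["DURATION_H"].transform("min")
fastest_with_ties = df.loc[df["DURATION_H"].eq(group_min)]

print(fastest)
print(fastest_with_ties)
```

The idxmin version returns exactly one row per TO_ID, while the transform version also keeps rows that tie for the minimum, so the choice depends on how ties should be handled.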

Related

Normalizing rows of pandas DF when there's string columns?

I'm trying to normalize a pandas DataFrame by row, and a column of string values is causing me a lot of trouble. Does anyone have a neat way to make this work?
For example:
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 111 28 219 98 0 133
18 hyp.metricsystem1 97 22 242 84 0 137
22 hyp.metricsystem5 107 11 246 85 0 127
17 hyp.eTranslation 49 30 262 80 0 143
20 hyp.metricsystem3 86 23 263 89 0 118
21 hyp.metricsystem4 74 17 274 70 0 111
I am trying to normalize each row across the numeric columns (Fluency, Terminology, ..., Other) by the row total. In other words, divide each integer entry by the total of its row (Fluency[0]/total_row[0], Terminology[0]/total_row[0], ...).
I tried the following command, but it raises an error because one column contains strings:
bad_models.div(bad_models.sum(axis=1), axis = 0)
Any help would be greatly appreciated...
Use select_dtypes to select numeric only columns:
subset = bad_models.select_dtypes('number')
bad_models[subset.columns] = subset.div(subset.sum(axis=1), axis=0)
print(bad_models)
# Output
system Fluency Terminology No-error Accuracy Locale convention Other
19 hyp.metricsystem2 0.211832 0.21374 0.145418 0.193676 0 0.172952
18 hyp.metricsystem1 0.185115 0.167939 0.160691 0.166008 0 0.178153
22 hyp.metricsystem5 0.204198 0.083969 0.163347 0.167984 0 0.16515
17 hyp.eTranslation 0.093511 0.229008 0.173971 0.158103 0 0.185956
20 hyp.metricsystem3 0.164122 0.175573 0.174635 0.175889 0 0.153446
21 hyp.metricsystem4 0.141221 0.129771 0.181939 0.13834 0 0.144343
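A quick way to confirm that the division really ran per row is to check that every row sums to 1 afterwards. The output printed above appears to have been divided by column totals instead (for example, 0.211832 is 111/524, where 524 is the Fluency column total), so the check is worth doing. The sketch below uses the first two rows of the question's table, reduced to three numeric columns for brevity; the variable names `numeric_cols`, `row_totals` and `normalized` are illustrative.

```python
import pandas as pd

# First two rows of the question's table, with only three of the numeric columns.
bad_models = pd.DataFrame(
    {
        "system": ["hyp.metricsystem2", "hyp.metricsystem1"],
        "Fluency": [111, 97],
        "Terminology": [28, 22],
        "Other": [133, 137],
    },
    index=[19, 18],
)

numeric_cols = bad_models.select_dtypes("number").columns
row_totals = bad_models[numeric_cols].sum(axis=1)

# Work on a copy so the original counts stay available.
normalized = bad_models.copy()
normalized[numeric_cols] = bad_models[numeric_cols].div(row_totals, axis=0)

# Each row of the numeric part should now sum to 1.
print(normalized[numeric_cols].sum(axis=1))
```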

Facebook Prophet: Providing different data sets to build a better model

My data frame looks like this. My goal is to predict event_id 3 based on the data for event_id 1 and event_id 2.
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
This is my model so far. I am not telling the model that these are two separate events, but it would be useful to consider the data from all events, since they belong to the same organizer and therefore provide more information than a single event. Is that kind of fitting possible with Prophet?
import pandas as pd
from prophet import Prophet  # the package was named fbprophet before version 1.0
# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()
# cap must be specified for every row in the dataframe, and it does not have to be constant.
# If the market size is growing, then cap can be an increasing sequence.
# Note that cap is only used when growth='logistic'.
df['cap'] = 500
# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)
# periods is the number of days to look into the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can model this with Prophet's holidays feature by setting the start date of each event as a holiday, which informs Prophet about the events and their peaks.
I noticed that events 1 and 2 overlap. There are several ways to deal with this, and you need to ask yourself how much predictive value each event has for event 3. The main issue is that you don't have much data. If the events are equally valuable, you could shift the dates of one of them, for example 11 days earlier. If they are not, you could drop one event.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
I also noticed that you forecast on the cumulative sum. I think your events are stationary, so Prophet would probably benefit from forecasting the daily ticket sales rather than the cumulative sum.
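Both suggestions (shifting one event so the two no longer overlap, and modeling daily sales instead of the cumulative sum) can be prepared with pandas alone before anything is passed to Prophet. The following is only a sketch: it assumes the column names from the question, uses just three rows per event, and takes the 11-day shift from the example above. The names `train` and `is_event_2` are illustrative.

```python
import pandas as pd

# A few rows from the question: ds, daily tickets_sold, cumulative y, event_id.
df = pd.DataFrame({
    "ds": ["3/12/19", "3/13/19", "3/14/19", "3/12/19", "3/13/19", "3/14/19"],
    "tickets_sold": [90, 40, 13, 30, 23, 43],
    "y": [90, 130, 143, 30, 53, 96],
    "event_id": [1, 1, 1, 2, 2, 2],
})
df["ds"] = pd.to_datetime(df["ds"], format="%m/%d/%y")

# Move event 2 eleven days earlier so the two events no longer overlap.
is_event_2 = df["event_id"] == 2
df.loc[is_event_2, "ds"] = df.loc[is_event_2, "ds"] - pd.Timedelta(days=11)

# Use the daily sales as the target instead of the cumulative sum.
train = (
    df.drop(columns="y")
      .rename(columns={"tickets_sold": "y"})
      .sort_values("ds")
      .reset_index(drop=True)
)

# The first day of each event becomes a row for Prophet's holidays argument.
events = pd.DataFrame({
    "holiday": "event_start",
    "ds": train.groupby("event_id")["ds"].min().to_numpy(),
    "lower_window": 0,
    "upper_window": 1,
})
print(train)
print(events)
```

`train[["ds", "y"]]` and `events` could then be handed to `Prophet(holidays=events)` and `m.fit(...)` in the same way as in the snippet above; the start date of the event being predicted would still need to be added to `events` by hand.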

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like this. Only 5500 numbers are present, but the length is shown as 5602, which means that 102 strings are present.
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? My Python code using pandas is below:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal and select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
Result, specific to column SELECTIO:
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem with how you extract the column. You are using ['Selection No.'], but the name actually has a trailing space, so it should be ['Selection No. ']. That is the reason you get a KeyError when executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tests whether the column value sle is IN a match object. A match object "always has a boolean value of True", and re.match returns None when there is no match, which is what raises the "argument of type 'NoneType' is not iterable" error.
I would suggest using the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
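For completeness, the asker's own function can be repaired by testing the match result against None instead of using `in`. The sketch below assumes the column may mix strings, integers and empty cells, as data read from Excel often does; the sample values are invented, and `numbers_only` is an illustrative name. It also shows `pd.to_numeric` with `errors="coerce"` as a regex-free way to keep only the numeric entries.

```python
import re
import pandas as pd

# Hypothetical column mixing strings, integers and a missing value.
select = pd.DataFrame(
    {"Selection No.": ["N NO", 37001, "37002", "Krishan", 42500, None]}
)

def selection(value):
    """Return 1 if value looks like a five-digit number starting with 3 or 4."""
    # str() copes with cells stored as integers; fullmatch returns None on failure.
    return int(re.fullmatch(r"[3-4][0-9]{4}", str(value).strip()) is not None)

select["status"] = select["Selection No."].apply(selection)

# Alternative without a regex: non-numeric cells become NaN and are dropped.
numbers_only = (
    pd.to_numeric(select["Selection No."], errors="coerce").dropna().astype(int)
)
print(select)
print(numbers_only)
```

The regex version enforces the 3xxxx/4xxxx pattern, while `pd.to_numeric` accepts any value that parses as a number, so the two can disagree on entries such as 123.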

Python assign some part of the list to another list

I have a dataset like below:
In this dataset, the first column is the id of a person, the last column is the person's label, and the remaining columns are the person's features.
101 166 633.0999756 557.5 71.80000305 60.40000153 2.799999952 1 1 -1
101 133 636.2000122 504.3999939 71 56.5 2.799999952 1 2 -1
105 465 663.5 493.7000122 82.80000305 66.40000153 3.299999952 10 3 -1
105 133 635.5999756 495.6000061 89 72 3.599999905 9 6 -1
105 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 -1
105 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 -1
105 99 615.5999756 575.7000122 80 67 3.200000048 0 0 -1
120 399 617.7000122 583.5 95.80000305 82.40000153 3.799999952 8 10 1
120 266 633.9000244 582.2000122 93.59999847 81 3.700000048 2 2 1
120 299 618.4000244 552.4000244 80.19999695 66.59999847 3.200000048 3 64 1
120 99 615.5999756 575.7000122 80 67 3.200000048 0 0 1
My aim is to classify these people, and I want to use the leave-one-person-out method to split the data. So I need to choose one person and all of that person's data as test data, and use the rest of the data for training. But when I tried to select the test data with a list assignment, I got an error. This is my code:
import numpy as np
datasets=["raw_fixationData.txt"]
file_name_array=[101,105,120]
for data in datasets:
    data = np.genfromtxt(data,delimiter="\t")
    data=data[1:,:]
    num_line=len(data[:,1])-1
    num_feat=len(data[1,:])-2
    label=num_feat+1
    X = data[0:num_line+1,1:label]
    y = data[0:num_line+1,label]
    test_prtcpnt=[]; test_prtcpnt_label=[]; train_prtcpnt=[]; train_prtcpnt_label=[];
    for i in range(len(file_name_array)):
        m=0; # test index
        n=0 #train index
        for j in range(num_line):
            if X[j,0]==file_name_array[i]:
                test_prtcpnt[m,0:10]=X[j,0:10];
                test_prtcpnt_label[m]=y[j];
                m=m+1;
            else:
                train_prtcpnt[n,0:10]=X[j,0:10];
                train_prtcpnt_label[n]=y[j];
                n=n+1;
This code gives me this error: test_prtcpnt[m,0:10]=X[j,0:10]; TypeError: list indices must be integers or slices, not tuple
How could I solve this problem?
I think that you are misusing Python's slice notation. Please refer to the following stack overflow post on slicing:
Explain Python's slice notation
In this case, Python treats the index m,0:10 in test_prtcpnt[m,0:10] as a tuple, and a plain list (unlike a NumPy array) accepts only an integer or a slice as its index. Is it possible that you meant the following?
test_prtcpnt[0:10]=X[0:10]
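Since the data is already a NumPy array, another option is to skip the per-row list assignment entirely and select each person's rows with a boolean mask. This is a sketch of a leave-one-person-out split under the assumption that column 0 holds the person id and the last column holds the label; the small array stands in for the asker's file, and the `splits` dictionary is an illustrative structure.

```python
import numpy as np

# Hypothetical stand-in for the file: column 0 = person id, last column = label.
data = np.array([
    [101, 166, 633.1, 1, -1],
    [101, 133, 636.2, 2, -1],
    [105, 465, 663.5, 3, -1],
    [120, 399, 617.7, 10, 1],
    [120, 266, 633.9, 2, 1],
])
person_ids = data[:, 0]
X = data[:, 1:-1]   # features
y = data[:, -1]     # labels

splits = {}
for person in np.unique(person_ids):
    test_mask = person_ids == person
    # A boolean mask selects whole rows at once, so no row-by-row copying is needed.
    splits[int(person)] = {
        "X_test": X[test_mask], "y_test": y[test_mask],
        "X_train": X[~test_mask], "y_train": y[~test_mask],
    }

for person, parts in splits.items():
    print(person, parts["X_test"].shape, parts["X_train"].shape)
```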

python statistic top 10

using python 2.6
I have a large text file.
Below are the first 3 entries, but there are over 50 users I need to check.
html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 5 38 6 109 61 14:42 633 223 25 435:36 182 34 ... continues
I need to be able to find the username, which in this case is the text after the "html_log:" tag.
I also need the rating (the first value after the username).
The program should scan the entire text file and output the top 10 highest-rated players.
Please note that there are not always 16 values per user; some entries contain far fewer.
Expected output:
bob 1217.1
jeff 1153
fred 28.7
In this case I would actually use a regular expression.
Just consider html_log: as a record start marker. The next part, up to the first whitespace, is the name. The part after it is the score, which you can convert to float for comparison:
import re
s = "html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 538 6 109 61 14:42 633 223 25 435:36 182 34"
pattern = re.compile(r"html_log:(?P<name>[^ ]*) (?P<score>[^ ]*)")
print(sorted(pattern.findall(s), key=lambda x: float(x[1]), reverse=True))
# [('bob', '1217.1'), ('jeff', '1153.3'), ('fred', '28.7')]
If you are wondering how to read this file, the straightforward algorithm would be: first, read the whole file into a string. Then use string.split(' ') to split it on spaces, and loop over the pieces, checking whether each element starts with html_log:. If it does, the rest of that element is the username and the next element is the rating. Store these in a dictionary for further sorting or other operations.
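The split-based algorithm described here could look roughly like the following. It is a sketch written for Python 3 (the question uses Python 2.6, where the final print would need adjusting); the function name `top_rated` and the shortened sample string are illustrative, and in practice the text would come from reading the file.

```python
def top_rated(text, limit=10):
    """Return up to `limit` (name, rating) pairs, highest rating first."""
    tokens = text.split()
    ratings = {}
    for position, token in enumerate(tokens[:-1]):
        if token.startswith("html_log:"):
            name = token[len("html_log:"):]
            # The token right after the name is the rating.
            ratings[name] = float(tokens[position + 1])
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)[:limit]

# Shortened version of the question's data; records may have any number of values.
sample = (
    "html_log:jeff 1153.3 1.84 625:54 1 2 "
    "html_log:fred 28.7 1.04 27:34 -10 "
    "html_log:bob 1217.1 1.75 696:48 1 5"
)
for name, rating in top_rated(sample):
    print(name, rating)
```

Storing the ratings in a dictionary means a username that appears twice keeps only its last rating; a list of pairs would be needed if duplicates should be preserved.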
