Python statistics: top 10 players

Using Python 2.6.
I have a large text file.
Below are the first 3 entries, but there are over 50 users I need to check.
html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 5 38 6 109 61 14:42 633 223 25 435:36 182 34 ... continues
I need to be able to find the username, in this case the text after the "html_log:" tag.
I also need the rating (the first value next to the username).
The output should check the entire text file and list the top 10 highest-rated players.
Please note that there are not always 16 values per user; some entries contain far fewer.
producing:
bob 1217.1
jeff 1153
fred 28.7

In this case I would actually use a regular expression.
Just treat html_log: as a record start marker; the part up to the next whitespace is the name, and the part after that is the score, which you can convert to float for comparison:
s = "html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 538 6 109 61 14:42 633 223 25 435:36 182 34"
pattern = re.compile("html_log:(?P<name>[^ ]*) (?P<score>[^ ]*)")
print sorted(pattern.findall(s), key=lambda x: float(x[1]), reverse=True)
# [('bob', '1217.1'), ('jeff', '1153.3'), ('fred', '28.7')]
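Since the question asks for the top 10, you can slice the sorted list; a small hedged addition to the snippet above:
top10 = sorted(pattern.findall(s), key=lambda x: float(x[1]), reverse=True)[:10]
for name, score in top10:
    print name, score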

If you are wondering how to read this file, the straightforward algorithm would be: first, read the whole file into a string; then use string.split(' ') to split it on spaces; then loop over the pieces and check whether an element contains html_log:. If it does, that piece holds the username, and the next element is the rating. Store all of this in a dictionary for sorting or other operations.
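A minimal sketch of that approach, assuming the data lives in a file named data.txt (hypothetical name) and that every rating parses as a float:
ratings = {}
with open('data.txt') as f:
    pieces = f.read().split()
for i, piece in enumerate(pieces):
    if piece.startswith('html_log:'):
        # the element right after the marker is the rating
        ratings[piece[len('html_log:'):]] = float(pieces[i + 1])
top10 = sorted(ratings.items(), key=lambda item: item[1], reverse=True)[:10]
for name, rating in top10:
    print name, rating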

Related

How to replace a classic for loop with df.iterrows()?

I have a huge data frame.
I am using a for loop in the below sample code:
for i in range(1, len(df_A2C), 1):
    A2C_TT = df_A2C.loc[(df_A2C['TO_ID'] == i)].sort_values('DURATION_H').head(1)
    if A2C_TT.size > 0:
        print(A2C_TT)
This is working fine, but I want to use df.iterrows() since it will help me to automatically avoid empty-frame issues.
I want to iterate through TO_ID, looking for the minimum DURATION_H value for each.
How should I replace my classic i loop counter with df.iterrows()?
Sample Data:
FROM_ID TO_ID DURATION_H DIST_KM
1 7 0.528555556 38.4398
2 26 0.512511111 37.38515
3 71 0.432452778 32.57571
4 83 0.599486111 39.26188
5 98 0.590516667 35.53107
6 108 1.077794444 76.79874
7 139 0.838972222 58.86963
8 146 1.185088889 76.39174
9 158 0.625872222 45.6373
10 208 0.500122222 31.85239
11 209 0.530916667 29.50249
12 221 0.945444444 62.69099
13 224 1.080883333 66.06291
14 240 0.734269444 48.1778
15 272 0.822875 57.5008
16 349 1.171163889 76.43536
17 350 1.080097222 71.16137
18 412 0.503583333 38.19685
19 416 1.144961111 74.35502
As far as I understand your question, you want to group your data by TO_ID and select the row where DURATION_H is the smallest? Is that right?
df.loc[df.groupby('TO_ID').DURATION_H.idxmin()]
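The result keeps the original index labels of the selected rows; if you prefer a plain 0..n index, a small hedged follow-up is:
res = df.loc[df.groupby('TO_ID').DURATION_H.idxmin()].reset_index(drop=True)
print(res)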
Here is one way to go about it:
import numpy as np
# run the loop over the unique TO_ID values only,
# instead of iterrows, which runs over the whole DataFrame
for idx in np.unique(df['TO_ID']):
    A2C_TT = df.loc[(df['TO_ID'] == idx)].sort_values('DURATION_H').head(1)
    print(A2C_TT)
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
Here is another way to go about it:
df.loc[df['DURATION_H'].eq(df.groupby('TO_ID')['DURATION_H'].transform(min))]
FROM_ID TO_ID DURATION_H DIST_KM
498660 39 7 0.434833 25.53808
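For completeness, since the question literally asks for df.iterrows(), a hedged sketch of that version (using df_A2C from the question) could look like the following; the groupby approaches above are normally much faster:
# keep the row with the smallest DURATION_H seen so far for each TO_ID
best = {}
for _, row in df_A2C.iterrows():
    to_id = row['TO_ID']
    if to_id not in best or row['DURATION_H'] < best[to_id]['DURATION_H']:
        best[to_id] = row
for to_id in sorted(best):
    print(best[to_id])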

Pandas DataFrame iterate over a window of rows quickly

I've got a time-series dataframe that looks something like:
datetime gesture left-5-x ...30 columns omitted
2022-09-27 19:54:54.396680 gesture0255 533
2022-09-27 19:54:54.403298 gesture0255 534
2022-09-27 19:54:54.408938 gesture0255 535
2022-09-27 19:54:54.413995 gesture0255 523
2022-09-27 19:54:54.418666 gesture0255 522
... 95 000 rows omitted
And I want to create a new column df['cross_correlation'] which is a function of multiple sequential rows. So the cross_correlation of row i depends on the data from rows i-10 to i+10.
I could do this with df.iterrows(), but that seems like the non-idiomatic version. Is there a function like
df.window(-10, +10).apply(lambda rows: calculate_cross_correlation(rows))
or similar?
EDIT:
Thanks @chris, who pointed me towards df.rolling(). However, I now have this example, which better reflects the problem I'm having:
Here's a simplified version of the function I want to apply over the moving window. Note that the actual version requires that the input be the full 2D window of shape (window_size, num_columns) but the toy function below doesn't actually need the input to be 2D. I've added an assertion to make sure this is true:
def sum_over_2d(x):
    assert len(x.shape) == 2, f'shape of input is {x.shape} and not of length 2'
    return x.sum()
And now if I use .rolling with .apply,
df.rolling(window=10, center=True).apply(
    sum_over_2d
)
I get an assertion error:
AssertionError: shape of input is (10,) and not of length 2
and if I print the input x before the assertion, I get:
0 533.0
1 534.0
2 535.0
3 523.0
4 522.0
5 526.0
6 510.0
7 509.0
8 502.0
9 496.0
dtype: float64
which is one column from my many-columned dataset. What I want is for the input x to be a DataFrame or a 2D numpy array.
IIUC, one way is to use pandas.Series.rolling.
Example with sum:
df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).sum()
Output:
datetime gesture left-5-x new explain
0 2022-09-27 19:54:54.396680 gesture0255 533 1067.0 533+534
1 2022-09-27 19:54:54.403298 gesture0255 534 1602.0 533+534+535
2 2022-09-27 19:54:54.408938 gesture0255 535 1592.0 534+535+523
3 2022-09-27 19:54:54.413995 gesture0255 523 1580.0 535+523+522
4 2022-09-27 19:54:54.418666 gesture0255 522 1045.0 523+522
You can see that each left-5-x value is summed with its +1 and -1 neighbors.
Edit:
If you want to use the rolled DataFrame windows, one way would be to iterate over the rolling object:
new_df = pd.concat([sum_over_2d(d) for d in df.rolling(window=10)], axis=1).T
Output:
0 1 2 3
0 0 1 2 3
1 4 6 8 10
2 12 15 18 21
3 24 28 32 36
4 40 45 50 55
5 60 66 72 78
6 84 91 98 105
7 112 120 128 136
8 144 153 162 171
9 180 190 200 210
Or, as per @Sandwichnick's comment, you can use method="table", but only if you pass engine="numba". In other words, your sum_over_2d must be numba-compilable (which is beyond the scope of this question and my knowledge):
df.rolling(window=10, center=True, method="table").sum(engine="numba")

Add percentage column to panda dataframe with percent sign

For the below Excel data, which will be used as a pandas DataFrame:
df['% '] = (df['Code Lines'] / df['Code Lines'].sum())*100
The output comes out as:
Language # of Files Blank Lines Comment Lines Code Lines %
C++ 15 66 35 354 3.064935065
C/C++ Header 1 3 7 4 0.034632035
Markdown 6 73 0 142 1.229437229
Python 110 1998 2086 4982 43.13419913
Tcl/Tk 1 14 18 273 2.363636364
YAML 1 0 6 20 0.173160173
Total 134 2154 2152 5775 50
I am trying to get the % column to have only 2 decimal places, with a percent sign.
Something like this:
Language # of Files Blank Lines Comment Lines Code Lines %
C++ 15 66 35 354 3.06%
C/C++ Header 1 3 7 4 0.03%
Markdown 6 73 0 142 1.22%
Use round, convert to strings and add %:
df['%'] = ((df['Code Lines'] / df['Code Lines'].sum())*100).round(2).astype(str) + '%'
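If you want exactly two decimals even when the rounded value ends in zero (e.g. 50.0 shown as 50.00%), a hedged alternative is to format the numbers directly:
df['%'] = (df['Code Lines'] / df['Code Lines'].sum() * 100).map('{:.2f}%'.format)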

Pandas groupby two columns and only keep records satisfying condition based on count

I am trying to filter the number of actions a user has done in a session once it reaches a threshold.
Here is the data set (only a few records):
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
A list of the items that a user has rated in a session can be obtained with:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I am struggling to go beyond that.
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: (x > 3).count())
Example: from the original df, user 123 should keep only its first three records belonging to session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
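If "first three" should be decided by the time column, as the question hints, a hedged variant of the same idea is to sort by time before grouping:
df.sort_values('time').groupby(['user_id', 'session_id']).head(3)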
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The example below filters for user_id / session_id combinations with at least 3 items and takes only the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25

Replacing value in text file column with string

I'm having a pretty simple issue. I have a dataset (small sample shown below)
22 85 203 174 9 0 362 40 0
21 87 186 165 5 0 379 32 0
30 107 405 306 25 0 756 99 0
6 5 19 6 2 0 160 9 0
21 47 168 148 7 0 352 29 0
28 38 161 114 10 3 375 40 0
27 218 1522 1328 114 0 1026 310 0
21 78 156 135 5 0 300 27 0
The first issue I needed to cover was replacing each space with a comma. I did that with the following code:
import fileinput
with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        line = line.split(None, 8)
        f.write(','.join(line))
The result was the following
22,85,203,174,9,0,362,40,0
21,87,186,165,5,0,379,32,0
30,107,405,306,25,0,756,99,0
6,5,19,6,2,0,160,9,0
21,47,168,148,7,0,352,29,0
28,38,161,114,10,3,375,40,0
27,218,1522,1328,114,0,1026,310,0
21,78,156,135,5,0,300,27,0
My next step is to grab the values from the last column, check if they are less than 2 and, if so, replace them with the string 'nfp'.
I'm able to separate the last column with the following:
for line in open("Data_Sorted.txt"):
    columns = line.split(',')
    print columns[8]
My issue is implementing the conditional to replace the value with the string; I'm also not sure how to put the modified column back into the original dataset.
There's no need to do this in two loops through the file. Also, you can use -1 to index the last element in the line.
import fileinput
with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        # strip newline character and split on whitespace
        line = line.strip().split()
        # check condition for last element (assuming you're using ints)
        if int(line[-1]) < 2:
            line[-1] = 'nfp'
        # write out the line, but you have to add the newline back in
        f.write(','.join(line) + "\n")
Further Reading:
Negative list index?
Understanding Python's slice notation
You need to convert columns[8] to an int and compare if it is less than 2.
for line in open("Data_Sorted.txt"):
    columns = line.split(',')
    if (int(columns[8]) < 2):
        columns[8] = "nfp"
    print columns
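The second part of the question, putting the modified values back into the dataset, is not covered above; a minimal hedged sketch is to write the modified rows to a new file (Data_Modified.txt is a hypothetical name):
with open("Data_Modified.txt", "w") as outfile:
    for line in open("Data_Sorted.txt"):
        columns = line.strip().split(',')
        # replace the last value with 'nfp' when it is below the threshold
        if int(columns[-1]) < 2:
            columns[-1] = 'nfp'
        outfile.write(','.join(columns) + "\n")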
