line_profiler only showing a small number of lines - python

Following the docs on line_profiler, I am able to profile my code just fine, but when I view the output with python -m line_profiler script_to_profile.py.lprof, I only see 27 lines of code. I expect to see about 250, because that's the length of the function that I added the @profile decorator to. My output looks like this:
Timer unit: 1e-06 s
Total time: 831.023 s
File: /Users/will/Repos/Beyond12/src/student_table.py
Function: aggregate_student_data at line 191
Line # Hits Time Per Hit % Time Line Contents
==============================================================
191 # load files
192 files = os.listdir(datadir)
193
194 tabs = ['Contact', 'Costs_Contributions', 'Course', 'EducationalHistory', 'Events', 'FinancialAidScholarship',
195 1 764 764.0 0.0 'PreCollegiateExam', 'QualitativeInformation', 'Student_Performance', 'Task', 'Term']
196 special_contact_id = {'Contact': 'Id',
197 1 7 7.0 0.0 'Events': 'WhoId',
198 1 6 6.0 0.0 'QualitativeInformation': 'Alumni',
199 1 6 6.0 0.0 'Student_Performance': 'Contact',
200 1 6 6.0 0.0 'PreCollegiateExam': 'Contact ID'}
201 1 6 6.0 0.0
202 1 5 5.0 0.0 # todo: make dictionary of the columns that we'll be using, along with their types?
203
204 df = {}
205 for tab in tabs:
206 1 6 6.0 0.0 # match tab titles to files by matching first 5 non-underscore characters
207 12 115 9.6 0.0 filename = filter(lambda x: tab.replace('_', '')[:5] in x.replace('_', '')[:5], files)[0]
It's cut off in the middle of a for loop.

You might have modified the source file after creating the profile result file. Try re-running the profiler.
The source code that line_profiler prints is read from the file on disk when the results are displayed. Reference:
https://github.com/rkern/line_profiler/blob/master/line_profiler.py#L190
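A minimal sketch of the effect described above, assuming the saved results hold only line numbers and counts while the source text is looked up at display time. The `render` helper, the `timings` dictionary and the `demo.py` script are hypothetical stand-ins and are not part of line_profiler.

```python
import linecache
import os
import tempfile

def render(path, timings):
    """Pair each recorded line number with whatever text is on disk now."""
    linecache.checkcache(path)  # discard cached lines if the file has changed
    return [(lineno, hits, linecache.getline(path, lineno).rstrip("\n"))
            for lineno, hits in sorted(timings.items())]

# Hypothetical script that gets "profiled" and then edited.
path = os.path.join(tempfile.mkdtemp(), "demo.py")
with open(path, "w") as f:
    f.write("a = 1\nb = 2\nc = a + b\n")

timings = {1: 1, 2: 1, 3: 1}  # stand-in for the saved per-line hit counts
before = render(path, timings)

# Editing the file afterwards shifts every statement down by one line.
with open(path, "w") as f:
    f.write("# new comment\na = 1\nb = 2\nc = a + b\n")

after = render(path, timings)
print(before[0])
print(after[0])
```

After the edit, the hit recorded for line 1 is shown next to the new comment rather than next to `a = 1`, which is the same kind of misalignment visible in the output in the question.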

Related

Create new column for Dataset within a for loop - Pandas Python

I have a dataframe which contains students' attendance data over the previous year. It looks like this, with many columns for different dates and numbers showing whether each student attended on that date.
date students 2019-09-03 2019-09-04 ... ThisYearPossible ThisYearAttended
0 5bf3e06e9a892068705d8415 2.0 2.0 ... 240 224
1 5bf3e06e9a892068705d8416 2.0 1.0 ... 244 240
2 5bf3e06e9a892068705d8417 2.0 1.0 ... 240 228
3 5bf3e06e9a892068705d8418 2.0 2.0 ... 244 238
4 5bf3e06e9a892068705d8419 2.0 2.0 ... 244 238
.. ... ... ... ... ... ...
207 5d718580a974320c3ddcbb2f NaN 2.0 ... 240 234
208 5d718580a974320c3ddcbb30 NaN 2.0 ... 240 240
209 5d718580a974320c3ddcbb31 NaN NaN ... 230 230
210 5d718580a974320c3ddcbb32 NaN NaN ... 240 236
211 5e13ae04b9b219f0b15bf0c9 NaN 0.0 ... 98 88
However, some of the values are NaN, because those students hadn't started school yet. So I am trying to create another column in the dataset called 'StartDate' which shows the date the child first attended, that is, the first date on which they received a 0, 1 or 2 for attendance.
This is what I have so far:
for i in ThisYeardf.index:
    ThisStudent = ThisYeardf.iloc[i].dropna(axis=0, how='any', inplace=False)
    ThisStudent = ThisStudent.to_frame()
    StartDate = ThisStudent.index[1]
    #ThisYeardf['StartDate'].iloc[i] = StartDate
    print(StartDate)
This retrieves the start date correctly and prints it fine, but I cannot seem to create a column and add each pupil's start date to it. The commented-out line above gives me the following error: KeyError: 'StartDate'
Does anyone know how to do this? Thanks in advance
Found the solution - as follows:
Weirddf = pd.DataFrame(columns = ['students', 'StartDate'])
result = []
names = []
for i in ThisYeardf.index:
    ThisStudent = ThisYeardf.iloc[i].dropna(axis=0, how='any', inplace=False)
    ThisStudent = ThisStudent.to_frame()
    StartDate = ThisStudent.index[1]
    Student = ThisStudent.iloc[0]
    result.append(StartDate)
    names.append(Student)
    print(StartDate)
Weirddf["StartDate"] = result
Weirddf["students"] = names
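The same result can probably be reached without an explicit loop. The sketch below assumes that every column other than the identifier and the yearly totals is an attendance date, and it uses a small made-up frame rather than the asker's data. `Series.first_valid_index` returns the label of the first non-NaN entry in each row.

```python
import numpy as np
import pandas as pd

# Made-up sample data in the same shape as the question's frame.
df = pd.DataFrame({
    "students": ["s1", "s2", "s3"],
    "2019-09-03": [2.0, np.nan, np.nan],
    "2019-09-04": [2.0, 1.0, np.nan],
    "2019-09-05": [1.0, 2.0, 0.0],
    "ThisYearPossible": [240, 240, 98],
})

# Assumption: these are the only columns that are not attendance dates.
non_date = {"students", "ThisYearPossible", "ThisYearAttended"}
date_cols = [c for c in df.columns if c not in non_date]

# For each row, take the label of the first non-NaN attendance value.
df["StartDate"] = df[date_cols].apply(pd.Series.first_valid_index, axis=1)
print(df[["students", "StartDate"]])
```

A student with no attendance values at all would get `None` here, which may or may not be the desired behaviour.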

Reading a multiline record into Pandas dataframe

I have earthquake data that I want to read into a Pandas dataframe. The data for each earthquake is spread over 5 fixed-format lines, and the format of each of the 5 lines is different. Some fields include variable whitespace, so I can't just do a delimited read.
Is there an elegant way to parse that with read_fwf (or something else)? I think nesting loops with chunksize=1 might work, but it's not very clean. Or I could reformat the file to concatenate each 5-line block into a single line, but I'd rather use the original file.
Here's the first earthquake as an example:
MLI 1976/01/01 01:29:39.6 -28.61 -177.64 59.0 6.2 0.0 KERMADEC ISLANDS REGION
M010176A B: 0 0 0 S: 0 0 0 M: 12 30 135 CMT: 1 BOXHD: 9.4
CENTROID: 13.8 0.2 -29.25 0.02 -176.96 0.01 47.8 0.6 FREE O-00000000000000
26 7.680 0.090 0.090 0.060 -7.770 0.070 1.390 0.160 4.520 0.160 -3.260 0.060
V10 8.940 75 283 1.260 2 19 -10.190 15 110 9.560 202 30 93 18 60 88
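One possible approach, sketched under the assumption that every record occupies exactly five consecutive lines with no blank lines in between, is to group the lines in blocks of five and build one row per block without rewriting the file. The `read_blocks` helper and the column names `line1` to `line5` are placeholders. The fixed-width offsets within each line are not given in the question, so each line is kept whole here.

```python
from itertools import islice

def read_blocks(lines, size=5):
    """Yield lists of `size` consecutive lines with newlines stripped."""
    it = iter(lines)
    while True:
        block = [ln.rstrip("\n") for ln in islice(it, size)]
        if not block:
            return
        if len(block) != size:
            raise ValueError("incomplete record of %d lines" % len(block))
        yield block

names = ["line1", "line2", "line3", "line4", "line5"]

# Placeholder input standing in for an open file handle.
sample = ["A1\n", "A2\n", "A3\n", "A4\n", "A5\n",
          "B1\n", "B2\n", "B3\n", "B4\n", "B5\n"]

rows = [dict(zip(names, block)) for block in read_blocks(sample)]
print(len(rows), rows[1]["line3"])
```

The list of dictionaries can be passed to `pandas.DataFrame`, and each whole-line column can then be sliced with `.str[start:stop]` once the offsets for that line type are known.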

Python: Imported csv not being split into proper columns

I am importing a CSV file into Python using pandas, but the data frame has only one column. I copied and pasted the data in comma-separated format from The Player Standing Field table at this link (second one) into an Excel file and saved it as a CSV (originally as MS-DOS, then as both normal and UTF-8 per the recommendation by AllthingsGo42). It still returned a single-column data frame.
Examples of what I tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv')
dataset=pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
                    delimiter=',')
Each of the lines above returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
                    delimiter=',',quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I understand to be the desired output.
My only thought is that there is no need to specify a delimiter for a CSV, because a CSV is a comma-separated values file and the comma is the default, but that should not matter. I suspect something is wrong with your actual data file, and I would make sure it is saved correctly. I would echo previous comments and make sure that the CSV is saved as UTF-8, and not as MS-DOS or Macintosh (both are options when saving in Excel).
Best of luck!
There is no need to specify a delimiter for a CSV. You only have to change the separator in the file from ";" to ",". To do this, open your CSV file in Notepad and use the replace tool.
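The Notepad++ view in the question shows every line wrapped in a single pair of double quotes, so a CSV parser sees one quoted field per row. A minimal sketch of one possible workaround, assuming the quotes only ever wrap whole lines, is to strip them before parsing. The sample rows are copied from the question and truncated to four columns. The `unwrap` helper is hypothetical.

```python
import csv
import io

# Raw string so the backslashes in the player names are kept literally.
raw = r'''"Rk,Name,Age,Tm"
"1,Fernando Abad\abadfe01,30,TOT"
"2,Jose Abreu\abreujo02,29,CHW"
'''

def unwrap(line):
    """Remove one pair of double quotes that wraps the whole line."""
    line = line.rstrip("\r\n")
    if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
        return line[1:-1]
    return line

cleaned = "\n".join(unwrap(ln) for ln in raw.splitlines())
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows[0])
print(rows[1])
```

The cleaned text could equally be handed to pandas with `pd.read_csv(io.StringIO(cleaned))`.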

Dictionary comprehension on a filter object

What's going on with my dictionary comprehension here?
I am parsing a BLAST file and want to create objects for each line in the file. Ideally each object will be stored in a dictionary for use later in the program.
Parsing works fine but I end up with a blank transSwiss dictionary.
Here are a few lines of output as an example:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
I'm trying to make each BLAST line a parse_blast object.
class parse_blast(object):
    def __init__(self, line):
        #Strip end-of-line and split on tabs
        self.fields = line.strip("\n").split("\t")
        self.transcriptId, self.isoform = self.fields[0].split("|")
        self.swissStuff = self.fields[1].split("|")
        self.swissProtId = self.swissStuff[3]
        self.percentId = self.fields[2]
    def filterblast(self):
        return float(self.percentId) > 95
blastmap = map(parse_blast, blast_output.readlines())
filtered = filter(parse_blast.filterblast, blastmap)
transSwiss = {blastmap.transcriptId:blastmap for blastmap.transcriptId in filtered}
When you do this:
for blastmap.transcriptId in filtered
you're trying to assign each element of filtered to blastmap.transcriptId in sequence. blastmap is either a list or an instance of the map type, depending on your Python version, so it has no transcriptId attribute, and your code fails with an AttributeError.
Use a variable. A new variable:
transSwiss = {pb.transcriptId: pb for pb in filtered}
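Put together, the corrected comprehension can be sketched as a self-contained script. The class is the one from the question. The two input lines are adapted from the question's sample, with tab separators assumed, the trailing columns dropped, and the second line's percent identity changed to 90.00 purely to show the filter rejecting a record.

```python
class parse_blast(object):
    def __init__(self, line):
        # Strip end-of-line and split on tabs
        self.fields = line.strip("\n").split("\t")
        self.transcriptId, self.isoform = self.fields[0].split("|")
        self.swissStuff = self.fields[1].split("|")
        self.swissProtId = self.swissStuff[3]
        self.percentId = self.fields[2]

    def filterblast(self):
        return float(self.percentId) > 95

# Illustrative input; the 90.00 value is altered from the original sample.
lines = [
    "c0_g1_i1|m.1\tgi|74665200|sp|Q9HGP0.1|PVG4_SCHPO\t100.00\t372\n",
    "c1000_g1_i1|m.799\tgi|48474761|sp|O94288.1|NOC3_SCHPO\t90.00\t747\n",
]

blastmap = map(parse_blast, lines)
filtered = filter(parse_blast.filterblast, blastmap)

# A fresh loop variable (pb) receives each filtered object in turn.
transSwiss = {pb.transcriptId: pb for pb in filtered}
print(sorted(transSwiss))
```

In Python 3, `map` and `filter` return lazy iterators that can be consumed only once, so `filtered` is empty after the comprehension has run.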

python statistic top 10

using python 2.6
I have a large text file.
Below are the first 3 entries, but there are over 50 users I need to check.
html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 5 38 6 109 61 14:42 633 223 25 435:36 182 34 ... continues
I need to be able to find the username, which in this case is the text after the "html_log:" tag.
I also need the rating (the first value next to the username).
The program should check the entire text file and output the 10 highest-rated players.
Please note that there are not always 16 values per user; some entries contain far fewer.
producing:
bob 1217.1
jeff 1153
fred 28.7
In this case I would actually use a regular expression.
Just consider html_log: as a record start marker, the next part up until a whitespace is the name. The next part after it is the score, which you can convert to float for comparison:
import re
s = "html_log:jeff 1153.3 1.84 625:54 1 2 71 3 2 10 7:58 499 3 5 616:36 241 36 html_log:fred 28.7 1.04 27:34 -10 18 13 0:48 37 18 8 -3.63 html_log:bob 1217.1 1.75 696:48 1 538 6 109 61 14:42 633 223 25 435:36 182 34"
pattern = re.compile("html_log:(?P<name>[^ ]*) (?P<score>[^ ]*)")
print sorted(pattern.findall(s), key=lambda x: float(x[1]), reverse=True)
# [('bob', '1217.1'), ('jeff', '1153.3'), ('fred', '28.7')]
If you are wondering how to read this file, the straightforward algorithm would be: first, read the whole file into a string. Then use string.split(' ') to split it on spaces, and loop over the resulting pieces, checking whether each element contains html_log:. If it does, that element holds the username and the next element is the rating. Store these pairs in a dictionary for later sorting or other operations.
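The split-based approach just described can be sketched roughly as follows. The `top_players` function name is made up, the sample string is a shortened version of the data in the question, and the code assumes that the token after each `html_log:` entry is always a valid number.

```python
def top_players(text, n=10):
    """Return up to n (name, rating) pairs, highest rating first."""
    prefix = "html_log:"
    tokens = text.split()  # split on any run of whitespace
    ratings = {}
    for i, tok in enumerate(tokens):
        # The token after an "html_log:<name>" token is that user's rating.
        if tok.startswith(prefix) and i + 1 < len(tokens):
            ratings[tok[len(prefix):]] = float(tokens[i + 1])
    ranked = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

s = ("html_log:jeff 1153.3 1.84 625:54 "
     "html_log:fred 28.7 1.04 "
     "html_log:bob 1217.1 1.75")
print(top_players(s))
# [('bob', 1217.1), ('jeff', 1153.3), ('fred', 28.7)]
```

To process the real file, the text would be read with something like `open(path).read()` and passed to the function in place of `s`.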
