I'm reading data from a file whose lines have the form time-id-data. When I run it on macOS it works fine, but on Linux it sometimes works and sometimes fails with "IndexError: list index out of range".
The data looks like this:
'
1554196690 0.0 178 180 180 178 178 178 180
1554196690 0.1 178 180 178 180 180 178 178
1554196690 0.2 175 171 178 173 173 178 172
1554196690 0.3 171 175 175 17b 179 177 17e
1554196691 0.4 0 d3
1554196691 0.50 28:10:4:92:a:0:0:d6 395
1554196691 0.58 28:a2:23:93:a:0:0:99 385
'
data = []
boardID = 100  # placeholder; reassigned for every line read below
for i in range(8):
    data.append([[] for x in range(8)])  # one slot per board, each with 8 value lists (7 sensors plus 1 for the boardID)
time_stamp = []
time_xlabel = []
time_second = []
for i in range(8):
    time_stamp.append([])   # board 4's lines carry the input voltage and pressure
    time_xlabel.append([])  # for the x label
    time_second.append([])  # time relative to the first timestamp (start time is 0)
with open("Potting_20190402-111807.txt", "r") as eboardfile:
    for line in eboardfile:
        values = line.strip().split("\t")
        # boards 0-3 are the electron boards, board 4 the pressure sensor,
        # boards 5-6 temperature sensors located inside the house, not on an eboard
        boardID = int(round(float(values[1]) % 1 * 10))
        time_stamp[boardID].append(int(values[0]))
        if boardID >= 0 and boardID < 4:
            for i in range(2, 9):
                data[boardID][i - 2].append(int(values[i], 16) * 0.0625)
        if boardID == 4:  # pressure
            data[boardID][0].append(int(values[2], 16) * 5. / 1024. * 14.2 / 2.2)  # voltage divider: 12k + 2.2k
            data[boardID][1].append((int(values[3], 16) * 5. / 1024. - 0.5) / 4. * 6.9 * 1000.)  # ADC to volt: value * 5V/1024; volt to hPa: (Vout - 0.5V)/4V * 6.9 bar * 1000
        elif boardID > 4 and boardID < 7:  # temperature sensor inside the house, not on the electron boards
            data[boardID][0].append(int(values[4], 10) * 0.0625)  # values[2] is the address, values[3] is empty, values[4] is the value
Traceback (most recent call last):
    boardID = int(round(float(values[1]) % 1 * 10))
IndexError: list index out of range
This error occurs when values ends up with fewer elements than you index, which means values = line.strip().split("\t") got a line with no \t in it at all. Maybe an empty line, or a Linux line-ending/format problem.
You can check the length of values before using it:
if len(values) < 9:
    continue
Or split on any run of whitespace instead of a literal tab. Note that split(string.whitespace) does not do that: str.split treats its argument as one literal separator string. A bare split() is what splits on arbitrary whitespace:
values = line.split()
(Be aware this collapses empty fields, which the board 5/6 lines rely on.)
I can't reproduce your setup, so just give it a try.
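Putting the length check together with the original tab splitting, a guarded version of the reading loop could look like this (a sketch; it keeps split("\t") because, as noted above, a bare split() would collapse the empty field that the board 5/6 lines rely on):

with open("Potting_20190402-111807.txt", "r") as eboardfile:
    for line in eboardfile:
        values = line.rstrip("\r\n").split("\t")  # rstrip also handles Windows \r\n line endings
        if len(values) < 2:
            continue  # blank or malformed line
        boardID = int(round(float(values[1]) % 1 * 10))
        if boardID < 4 and len(values) < 9:
            continue  # sensor line with missing readings
        time_stamp[boardID].append(int(values[0]))
        # ... per-board handling exactly as before ...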
I'm trying to write a Python script where a line must be executed a defined number of times per second (corresponding to an FPS value). I'm having a precision problem that I guess comes from time.time_ns(), and it gets worse at high tick rates.
If I ask for, say, 240 ticks per second, my ticks-per-second counter displays 200, 250, 250, 250, 200; it is impossible to get values between 200 and 250. The same thing happens at lower rates, where it alternates between 112 and 125 (I would have liked 120).
What is wrong with my logic, and how can I improve it?
import time

fps = 120
timeFrame_ns = 1000000000 / fps
timeTickCounter = 0
count = 0
elapsedTime = time.time_ns()
while True:
    if time.time_ns() > timeTickCounter + 1000000000:
        print("Ticks per second %s " % count)
        if count != fps and count != 0:
            diff = fps - count
            ratio = diff / count
            timeFrame_ns -= timeFrame_ns * ratio
        timeTickCounter = time.time_ns()
        count = 0
    count += 1
    while True:
        if timeFrame_ns <= (time.time_ns() - elapsedTime):
            break
    elapsedTime = time.time_ns()
This script will print:
Ticks per second 0
Ticks per second 112
Ticks per second 125
Ticks per second 112
Ticks per second 125
Ticks per second 125
Ticks per second 112
Ticks per second 125
Ticks per second 125
Ticks per second 112
Ticks per second 125
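For what it's worth, the usual cure for this kind of drift (a sketch, not from the original post) is to schedule each tick against an absolute deadline instead of re-reading the clock after each wait; the deadline advances by a fixed amount per tick, so the loop's own overhead cannot accumulate:

import time

fps = 120
frame_ns = 1_000_000_000 // fps
next_tick = time.time_ns() + frame_ns
window_start = time.time_ns()
count = 0

while True:
    # ... do the per-tick work here ...
    count += 1
    if time.time_ns() - window_start >= 1_000_000_000:
        print("Ticks per second %s" % count)
        count = 0
        window_start += 1_000_000_000  # advance by a fixed second so the window never drifts
    next_tick += frame_ns              # absolute deadline for the next tick
    remaining = next_tick - time.time_ns()
    if remaining > 0:
        time.sleep(remaining / 1_000_000_000)

time.sleep is only as precise as the OS scheduler, so for very high tick rates you can replace it with a short busy-wait over the final millisecond; the important part is the absolute next_tick deadline.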
What I want is to work out how I should group my CDs so that each 'bin' gets a similar count, e.g. A+B, C+D, and E+F+G+H. It's more of an exercise than a need, but I don't have enough space for a pile per letter of the alphabet, so I'd rather have, say, 10 piles; the question is how to split them up.
So I have the following, obtained from my DataFrame, showing the cumulative sum of entries across numbers (#) and the alphabet:
In [135]: csum
Out[135]:
key
# 9
A 25
B 43
C 63
D 76
E 82
F 98
G 105
H 116
I 120
J 125
K 130
L 139
M 154
N 160
O 164
P 186
R 221
S 234
T 298
U 302
V 319
W 325
Y 326
Name: count, dtype: int64
I've written a function 'distribution' to get the result I wanted... i.e. 10 separate groups, showing which alphabetic clusters to use.
dist = distribution(byvar, various=True)
dist
Out[138]:
quants
(8.999, 49.0] #AB
(49.0, 79.6] CD
(79.6, 104.3] EF
(104.3, 121.0] GHI
(121.0, 134.5] JK
(134.5, 158.8] LM
(158.8, 189.5] NOP
(189.5, 259.6] RS
(259.6, 313.9] TU
(313.9, 326.0] VWY
dtype: object
The code is here:
import pandas as pd
import numpy as np

def distribution(df, various=False):
    '''
    Parameters
    ----------
    df : dataframe
    various : boolean, optional
        Select if Various df

    Returns
    -------
    df
        Shows how to distribute groupings to get similar size bunches.
    '''
    global gar, csum
    if various:
        df['AZ'] = df['album'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    else:
        df['AZ'] = df['artist'].apply(lambda x: '#' if x[0] in map(str, range(10)) else x[0].upper())
    gar = df.groupby('AZ')
    csum = gar.size().cumsum()  # csum becomes a Series
    sdf = pd.DataFrame(csum.iteritems(), columns=['key', 'count'])
    sdf['quants'] = pd.qcut(sdf['count'], q=np.array(range(11)) * 0.1)
    gsdf = sdf.groupby('quants')
    return gsdf.apply(lambda x: x['key'].sum())
So my question arises from the fact that I couldn't see how to achieve this without converting my Series object (csum) back into a DataFrame before using pd.qcut to split it up.
Can anyone see a more concise approach that bypasses creating the intermediate 'sdf' DataFrame?
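pd.qcut accepts a Series directly, so one option (a sketch, untested against your exact data) is to bin csum itself and concatenate its index labels per bin:

quants = pd.qcut(csum, q=10)  # q=10 gives the same deciles as np.array(range(11)) * 0.1
piles = csum.index.to_series().groupby(quants).apply("".join)

csum.index.to_series() yields the letter keys indexed by themselves, so grouping it by the qcut bins and joining gives the same '#AB', 'CD', ... strings without the intermediate sdf.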
Hi, I'm new to Python and I have no clue how to fix the following error.
I have a DataFrame with around 2 million records and 20 columns of store data. I am grouping the stores by State and trying to run dedupe_dataframe on each state after training it on one.
Here is how my code looks (np is numpy, pd is pandas, dp is pandas_dedupe):
# Read store data
stores = pd.read_csv("storefile.csv", sep=",", encoding='latin1', dtype=str)

# There were \t characters in the first column, so remove them
stores = stores.replace('\t', '', regex=True)
stores = stores.replace(np.nan, '', regex=True)

# Get a lowercase list of states
states = list(stores.State.str.lower().unique())

# Group the data by state
state_data = {state: stores[stores.State.str.lower() == state] for state in states}

# Run dedupe for the state of Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
I'm getting the following error:
importing data ...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-37-e2ed10256338> in <module>
----> 1 dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])

~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in dedupe_dataframe(df, field_properties, canonicalize, config_name, recall_weight, sample_size)
    211     # train or load the model
    212     deduper = _train(settings_file, training_file, data_d, field_properties,
--> 213                      sample_size)
    214
    215     # ## Set threshold

~\anaconda3\lib\site-packages\pandas_dedupe\dedupe_dataframe.py in _train(settings_file, training_file, data, field_properties, sample_size)
     58     # To train dedupe, we feed it a sample of records.
     59     sample_num = math.floor(len(data) * sample_size)
---> 60     deduper.sample(data, sample_num)
     61
     62     # If we have training data saved from a previous run of dedupe,

~\anaconda3\lib\site-packages\dedupe\api.py in sample(self, data, sample_size, blocked_proportion, original_length)
    836                       sample_size,
    837                       original_length,
--> 838                       index_include=examples)
    839
    840         self.active_learner.mark(examples, y)

~\anaconda3\lib\site-packages\dedupe\labeler.py in __init__(self, data_model, data, blocked_proportion, sample_size, original_length, index_include)
    401         data = core.index(data)
    402
--> 403         self.candidates = super().sample(data, blocked_proportion, sample_size)
    404
    405         random_pair = random.choice(self.candidates)

~\anaconda3\lib\site-packages\dedupe\labeler.py in sample(self, data, blocked_proportion, sample_size)
     50         return [(data[k1], data[k2])
     51                 for k1, k2
---> 52                 in blocked_sample_keys | random_sample_keys]
     53

~\anaconda3\lib\site-packages\dedupe\labeler.py in <listcomp>(.0)
     49
     50         return [(data[k1], data[k2])
---> 51                 for k1, k2
     52                 in blocked_sample_keys | random_sample_keys]
     53

KeyError: 2147550487
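For what it's worth, before switching approaches as in the Solution below, one workaround often suggested for KeyErrors raised while dedupe samples record pairs is to hand it a frame with a fresh, contiguous integer index, since the per-state subsets keep the row labels of the original 2-million-row frame (an untested suggestion, not a confirmed fix):

# give the subset a 0..n-1 index before deduping
dp.dedupe_dataframe(
    state_data['oh'].reset_index(drop=True),
    ['StoreBannerName', 'Address', 'City', 'State'],
)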
Solution
Swap the following line:
#Running De-Dupe for state Ohio ('oh')
dp.dedupe_dataframe(state_data['oh'], ['StoreBannerName','Address','City','State'])
WITH
# Remove exact duplicates for the state of Ohio ('oh')
state_data['oh'].drop_duplicates(subset=['StoreBannerName','Address','City','State'], keep='first')
Reference
pandas.DataFrame.drop_duplicates()
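Note that drop_duplicates only removes exact duplicates, whereas dedupe_dataframe does fuzzy matching, so this sidesteps the error rather than reproducing the same behaviour. A minimal illustration with toy values (hypothetical data, not the asker's):

import pandas as pd

stores = pd.DataFrame({
    'StoreBannerName': ['Acme', 'Acme', 'Acme Inc'],
    'Address': ['1 Main St', '1 Main St', '1 Main St'],
    'City': ['Columbus', 'Columbus', 'Columbus'],
    'State': ['OH', 'OH', 'OH'],
})

# row 1 is dropped as an exact duplicate of row 0; the 'Acme Inc'
# variant survives because the comparison is not fuzzy
deduped = stores.drop_duplicates(subset=['StoreBannerName', 'Address', 'City', 'State'], keep='first')
print(deduped)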
I got a KeyError as well. I was using this code example:
https://github.com/dedupeio/dedupe-examples/tree/master/record_linkage_example
I had changed some of the code and had forgotten to remove the data_matching_learned_settings file before running it again.
Another reason you might get this error, especially with the first column, is that your file could contain a BOM (Byte Order Mark) as its first character.
See this for how to remove a BOM:
VS Code: How to save UTF-8 without BOM in Mac?
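If the BOM is the culprit, you may not need to edit the file at all: reading with the utf-8-sig codec strips a leading BOM (a sketch; the original read used latin1, so check what your file's encoding actually is first):

import pandas as pd

# 'utf-8-sig' consumes a leading BOM if present, so the first column
# name comes through clean instead of starting with '\ufeff'
stores = pd.read_csv('storefile.csv', sep=',', encoding='utf-8-sig', dtype=str)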
I am having issues trying to run a collinearity analysis on a simple DataFrame (see below). My problem is that every time I run the function, I get the following error message:
KeyError: "None of [Int64Index([0, 1, 2, 3], dtype='int64')] are in the [columns]"
Below is the code I am using:
read_training_set = pd.read_csv('C:\\Users\\rapha\\Desktop\\New test\\Classeur1.csv', sep=";")
training_set = pd.DataFrame(read_training_set)
print(training_set)
def calculate_vif_(X):
    thresh = 5.0
    variables = range(X.shape[1])
    for i in np.arange(0, len(variables)):
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]
        print(vif)
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
    print('Remaining variables:')
    print(X.columns[variables])
    return X
X = training_set
X2 = calculate_vif_(X)
The DF on which I am trying to run my function looks like this.
Year Age Weight Size
0 2020 10 100 170
1 2021 11 101 171
2 2022 12 102 172
3 2023 13 103 173
4 2024 14 104 174
5 2025 15 105 175
6 2026 16 106 176
7 2027 17 107 177
8 2028 18 108 178
I have two guesses here, but I'm not sure how to fix either:
- Guess 1: np.arange is causing some sort of conflict with the header and columns, which prevents the rest of the function from iterating through each column.
- Guess 2: the problem comes from blank separators, which prevents the function from jumping from one column to the next properly. The odd thing is that my CSV file already has ";" separators (I don't know exactly why, to be honest, as I manually created the file and saved it as a regular CSV with "," separators).
Not sure how to fix the problem at this point; does anyone have insights here?
Best
The error is caused by this snippet: X[variables].values. Convert variables, which is a range, to a list.
As an aside, the code is very confusing. Why call np.arange when variables is already a range? Why use a range over the number of columns to index rows?
It looks from the comments above like you think you are indexing columns by column number, but you are actually indexing rows. Some of this confusion would be cleared up if you used loc or iloc to be explicit about what you are trying to index.
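To make the rows-versus-columns point concrete, here is a toy frame (illustrative only, not the asker's data):

import pandas as pd

df = pd.DataFrame({'Year': [2020, 2021], 'Age': [10, 11]})
# df[[0, 1]] raises the same KeyError: 0 and 1 are not column *labels*
print(df.iloc[:, [0, 1]])   # position-based: selects the first two columns
print(df.loc[:, ['Year']])  # label-based: selects columns by name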
Got it, I revised the whole thing and it seems to be working. See below how it looks now.
Thanks a lot for the help.

variables = list(range(X.shape[1]))
for i in variables:
    vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
           for ix in range(X.iloc[:, variables].shape[1])]
    maxloc = vif.index(max(vif))
    if max(vif) > thresh:
        print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
              '\' at index: ' + str(maxloc))
        del variables[maxloc]
print('Remaining variables:')
print(X.columns[variables])
return X.iloc[:, variables]

X = training_set
X2 = calculate_vif_(X)
I just discovered that the structure of my file can differ, and my regex only works some of the time because of this. My regex is:
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\rACTIVITY.+?',wholefile)
It currently matches the following section of the file.
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
ACTIVITY?
PDEV
ENTER OUTPUT DEVICE CODE:
0 FOR NO OUTPUT
1 FOR PROGRESS WINDOW
However, that section of the file sometimes looks like this:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.742 13.2060 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.916 1.8367 11 EL PASO 110
70187 [FTGARLND69.0] 0.936 19.6099 70 PSCOLORADO 710
73216 [WINDRIVR 115] 0.858 3.6100 73 WAPA R.M. 750
(VFSCAN) AT TIME = 20.0000 UP TO 100 BUSES WITH LOW FREQUENCY BELOW 59.600:
X ----- BUS ------ X FREQ X ----- BUS ------ X FREQ
12063 [ROSEBUD 13.8] 59.506
On both occasions I would like to capture just the section below:
---------- LOW VOLTAGE SUMMARY BY AREA ----------
BUS NAME BASKV VOLT TIME AREA ZONE
12006 [AMISTAD 69.0] 0.971 1.8700 10 NEW MEXICO 121
11223 [WHITESA213.8] 0.918 1.9900 11 EL PASO 110
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
70044 [B.HYDROB4.16] 0.955 2.3233 70 PSCOLORADO 703
79086 [PAGOSA 115] 0.937 2.0333 73 WAPA R.M. 791
How can my regex return the section above regardless of which version of the file I am looking at?
This should work (note the non-capturing group (?:...); with a plain capturing group, re.findall would return only the group's text rather than the whole match):
v6 = re.findall(r'(?s)----------\s*LOW VOLTAGE SUMMARY BY AREA.*?\r(?:ACTIVITY|\(VFSCAN\)).+?', wholefile)
I would not suggest using a regex here, but doing some parsing instead. Let's assume your data is in a string called data:
lines = data.split("\n")

# find the start of the header
for index, line in enumerate(lines):
    if "LOW VOLTAGE SUMMARY BY AREA" in line:
        start_index = index
        break

# find the first data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[start_index:]):
    if line.strip() and line.split()[0].isdigit():
        first_entry_index = start_index + index
        break

# find the last data entry (line starting with whitespace and then a number)
for index, line in enumerate(lines[first_entry_index:]):
    # we don't do this inside the if because it's possible
    # for the data to end with only entries and whitespace
    end_entry_index = first_entry_index + index
    if line.strip() and not line.split()[0].isdigit():
        break

# print all lines between the header and the last data entry
print("\n".join(lines[start_index:end_entry_index]))
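For completeness, a minimal way to feed the snippet (the filename is a placeholder):

# read the whole report into the `data` string the snippet expects
with open("report.out", "r") as f:
    data = f.read()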