I am trying to manipulate a dataframe. The value of in a list which I use to append a column to the dataframe is 161137531201111100. However, I created a dictionary whose keys are the unique values of this column, and I use this dictionary in further operations. This could used to run perfectly before.
However, after trying this code on another data I had the following error:
KeyError: 1.611375312011111e+17
which means that this value is not the of the dictionary; I tried to trace the code, everything seemed to be okay. However, when I opened the csv file of the dataframe I built I found out that the value that is causing the problem is: 161137531201111000 which is not in the list(and ofc not a key in the dictionary) I used to create this column of dataframe. This seems weird. However, I don't know what is the reason? Is there any reason that a number is saved in another way?
And how can I save it as it is in all phases? Also, why did it change in the csv?
No unfortunately, they are not equal
print(1.611375312011111e+17 == 161137531201111000)` # False.
The problem lies in the way floating numbers are handled by computers, in general, and most programming languages, including Python.
Always use integers (and not "too large") when doing computations if you want exact results.
See Is floating point math broken? for generic explanation that you definitely must know as a programmer, even if it's not specific to Python.
(and be aware that Python tries to do a rather good job at keeping precision on integers, that unfortunately won't work on floating-point numbers).
And just for the sake of "fun" with floating point numbers, 1.611375312011111e+17 is actually equal to the integer 161137531201111104!
print(format (1.611375312011111e+17, ".60g")) # shows 161137531201111104
print(1.611375312011111e+17 == 161137531201111104) # True
a = dict()
a[1.611375312011111e+17] = "hello"
#print(a[161137531201111100]) # Key error, as in question
print(a[161137531201111104]) # This one shows "hello" properly!
Related
I am using execute_values to insert a list of lists of values into a postgres database using psycopg2. Sometimes I get "not all arguments converted during string formatting", indicating that one of the values in one of the lists is not the expected data type (and also not NoneType). When it is a long list, it can be a pain to figure out which value in which list was causing the problem.
Is there a way to get postgres/psycopg2 to tell me the specific 'argument which could not be converted'?
If not, what is the most efficient way to look through the list of lists and find any incongruent data types per place in the list, excluding NoneTypes (which obviously are not equal to a value but also are not the cause of the error)?
Please note that I am not asking for help with the specific set of values I am executing it, but trying to find a general method to more quickly inspect the problem query so I can debug it.
I am trying to read sav files using pyreadstat in python but for some rare scenarios I am getting error of UnicodeDecodeError since the string variable has special characters.
To handle this I think instead of loading the entire variable set I will load only variables which do not have this error.
Below is the pseudo-code that I have with me. This is not a very efficient code since I check for error in each item of list using try and except.
# Reads only the medata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
list = meta.column_names # All variables are stored in list
result = []
for var in list:
print(var)
try:
df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
# If no error that means we can store this variable in result
result.append(var)
except:
pass
# This will finally load the sav for non error variables
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables it takes a long amount of time to process this.
I was thinking if there is a way to use divide and conquer approach and do it faster. Below is my suggested approach but I am not very good in implementing recursion algorithm. Can someone please help me with pseudo code it would be very helpful.
Take the list and try to read sav file
In case of no error then output can be stored in result and then we read the sav file
In case of error then split the list into 2 parts and run these again ....
Step 3 needs to run again until we have a list where it does not give any error
Using the second approach 90% of my sav files will get loaded on the first pass itself hence I think recursion is a good method
You can try to reproduce the issue for sav file here
For this specific case I would suggest a different approach: you can give an argument "encoding" to pyreadstat.read_sav to manually set the encoding. If you don't know which one it is, what you can do is iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
# here codes is a list with all the encodings in the link mentioned before
for c in codes:
try:
df, meta = p.read_sav("Test.sav", encoding=c)
print(encoding)
print(df.head())
except:
pass
I did and there were a few that may potentially make sense, assuming that the string is in a non-latin alphabet. However the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with dash and that fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to google translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a way.
The advantage of this approach is that if you find the right encoding, you will not be loosing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
In case you would not find the right encoding, you anyway would be reading the problematic columns very fast, and you can discard them later in pandas by inspecting which character columns do not contain latin characters. This will be much faster than the algorithm you were suggesting.
I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
B = []
for line in lines:
numbers = [c for c in line if c.isdigit()]
current = int(''.join(numbers))
# current is the concatenation of all
# integers found in column A from left to right
B.append(current)
return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straight forward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers correspondent to numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, given that if the title has an int in it, it'll mess some things up. It is likely one of the more clear ways of doing this, though.
I wonder if you can help because I've been looking at this for a good half hour and I'm completely baffled, I think I must be missing something so I hope you can shed some light on this.
In this area of my program I am coding a query which will search a list of tuples for the salary of the person. Each tuple in the list is a separate record of a persons details, hence I have used two indexes; one for the record which is looped over, and one for the salary of the employee. What I am aiming for is for the program to ask you a minimum and maximum salary and for the program to print the names of the employees who are in that salary range.
It all seemed to work fine, until I realised that when entering in the value '100000' as a maximum value the query would output nothing. Completely baffled I tried entering in '999999' which then worked and all records were print. The only thing that I can think of is that the program is ignoring the extra digit, which I could not figure out why this would be?!
Below is my code for that specific section and output for a maximum value of 999999 (I would prefer not to paste the whole program as this is for a coursework project and I want to prevent anyone on the same course potentially copying my work, sorry if this makes my question unclear!):
The maximum salary out of all the records is 55000, hence why it doesnt make sense that a minimum of 0 and maximum of 100000 does not work, but a maximum of 999999 does!
If any more information is need to help, please ask! This probably seems unclear but like I said above, I dont want anyone from the class to plagiarise and my work to be void because of that! So I have tried to ask this without posting all my code on here!
Given your use of the print function (instead of the Python 2 print statement), it looks like you're writing Python 3 code. In Python 3, input returns a str. I'm guessing your data is also storing the salaries as str (otherwise the comparison would raise a TypeError). You need to convert both stored values and the result of input to int so it performs numerical comparisons, not ASCIIbetical comparisons.
When you read in from standard input in Python, no matter what input you get, you receive the input as a string. That means that your comparison function is resulting to:
if tuplist[x][2] > "0" and tuplist[x][2] < "999999" :
Can you see what the problem is now? Because it's a homework assignment, I don't want to give you the answer straight away.
I was working on a dictionary, but I came up with an issue. And the issue is, I want to be able to use more than 1 number in order to reference a key in a dictionary.
For example, I want the range of numbers between 1 and 5 to be all assigned to, let's say, "apple". So I came up with this:
my_dict['apple'] = range(1,5)
At the program's current state, its far from being able to run, so testing is an issue, but I do not receive any issues from my editor. I was just wondering, is this the correct way to go about this? Or is there a better way?
Thanks.
EDIT:
A little more info: I want to make a random integer with the randint function. Then, after Python has generated that number, I want to use it to call for the key assigned to the value of the random integer. Thing is, I want to make some things more common than others, so I want to make the range of numbers I can call it with larger so the chance of the key coming up becomes likelier. Sorry if it doesn't make much sense, but at the current state, I really don't even have code to show what I'm trying to accomplish.
You have the dictionary backwards. If you want to be able to recall, e.g., 'apple' with any of the numbers 1-5, you'd need the numbers to be the keys, not the values.
for i in range(1,6): # range(a,b) gives [a,b)
my_dict[i] = 'apple'
etc. Then, my_dict[4] == 'apple' and the same is true for the other values in the range.
This can create very large dictionaries with many copies of the same value.
Alternately, you can use range objects as dictionary keys, but the testing will be a bit more cumbersome unless you create your own class.
my_dict[range(1,6)] = 'apple'
n = random.randint(1, 5)
for key in my_dict:
if n in key:
print(my_dict[key])
...prints apple.
The value in a dictionary can be any arbitrary object. Whether it makes sense to use a given type or structure as a value only makes sense in the context of the complete script, so it is impossible to tell you whether it is the correct solution with the given information.