I have a file with over 15k lines, each line holding one key and one value. I can modify the file content if any formatting is required for faster reading. Currently I have made the entire file look like a dict and I do an eval on it. Is this the best way to read the file, or is there a better approach we can follow? Please suggest.
File mymapfile.txt:
{
'a':'this',
'b':'that',
.
.
.
.
'xyz':'message can have "special" char %s etc '
}
On this file I am doing eval:
f_read = eval(open('mymapfile.txt', 'r').read())
My concern is that my file keeps growing, and values can have quotes, special characters, etc., where we would need to wrap the value in ''' or """. With the dictionary format, even a small syntax error makes eval fail. So is it better to use readlines() without making the file a dict and then build the dict myself, or is eval faster if the file is already a dict? With readlines I can simply write text on each line, split on ':', and not worry about any special characters.
File for readlines:
a:this
b:that
.
.
.
.
xyz:message can have "special" char %s etc
@Mahesh24's answer returns a set with values that look like dict items but are not. Also, his variable overwrites the builtin dict. Rather use these two lines:
s = {line.strip() for line in open('ss.txt', 'r')}
d = {line.split(':', 1)[0]: line.split(':', 1)[1] for line in s}
d will then be a dict with the read-in values. A bit of thinking could probably get this into a one-liner. There is a csv reader in the Python standard library that will give you some more options and robustness (a quick csv sketch follows below); if your data is in any other standard format, using the appropriate standard library is preferable. The above two-liner will, however, give you a quick and dirty way of doing it. You can change the ':' for commas or whatever separator your data has.
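For reference, here is a minimal csv sketch of that idea (the file name ss.txt is taken from the snippet above); extra ':' fields are rejoined so values containing the separator survive:
import csv

with open('ss.txt', newline='') as f:
    reader = csv.reader(f, delimiter=':')
    d = {row[0]: ':'.join(row[1:]) for row in reader if row}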
Assuming you'll stick to JSON, you might want to take a look at ujson. It seems to be very fast (even if with a memory penalty) at dumping and loading data; a usage sketch follows the links below.
Here are two articles that have some benchmarks and might help you make a decision:
https://medium.com/@jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
http://jmoiron.net/blog/python-serialization/
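If you try ujson, its API mirrors the stdlib json module, so a sketch looks like this (the file name is illustrative):
import ujson  # pip install ujson

with open('mymapfile.json') as f:
    data = ujson.load(f)  # same call shape as json.load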
Please avoid eval if you only want to load data.
What you need is to read the lines and recognize the key and the value, so your proposed file format:
a:this
b:that
...
is fully suitable.
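A minimal sketch of that loader, splitting on the first ':' only so values may themselves contain colons (the file name is taken from the question):
mapping = {}
with open('mymapfile.txt') as f:
    for line in f:
        key, sep, value = line.rstrip('\n').partition(':')
        if sep:  # skip lines without a separator
            mapping[key] = value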
I am trying to read sav files using pyreadstat in Python, but for some rare scenarios I am getting a UnicodeDecodeError since a string variable has special characters.
To handle this, I think that instead of loading the entire variable set I will load only the variables which do not have this error.
Below is the pseudo-code that I have. This is not very efficient code, since I check for the error on each item of the list using try and except.
import pyreadstat

# Read only the metadata to get information about the variables
df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
column_names = meta.column_names  # all variables are stored in this list

result = []
for var in column_names:
    print(var)
    try:
        df, meta = pyreadstat.read_sav('Test.sav', usecols=[str(var)])
        # If there is no error, we can store this variable in result
        result.append(var)
    except Exception:
        pass

# This finally loads the sav file for the non-error variables
df, meta = pyreadstat.read_sav('Test.sav', usecols=result)
For a sav file with 1000+ variables, it takes a long time to process this.
I was thinking there might be a way to use a divide and conquer approach and do it faster. Below is my suggested approach, but I am not very good at implementing recursion algorithms. Can someone please help me with pseudo-code? It would be very helpful.
1. Take the list and try to read the sav file.
2. If there is no error, the output can be stored in result, and then we read the sav file.
3. In case of error, split the list into 2 parts and run these again...
4. Step 3 needs to run again until we have lists that do not give any error.
Using the second approach, 90% of my sav files will get loaded on the first pass itself, hence I think recursion is a good method (a sketch of this bisection follows below).
You can try to reproduce the issue for sav file here
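For what it's worth, a minimal sketch of the bisection described in the steps above; it assumes read_sav raises on any chunk containing a problematic column, and the helper name readable_columns is hypothetical:
import pyreadstat

def readable_columns(path, cols):
    # return the subset of cols that read_sav can load without error
    if not cols:
        return []
    try:
        pyreadstat.read_sav(path, usecols=cols)
        return cols  # the whole chunk reads cleanly
    except Exception:
        if len(cols) == 1:
            return []  # a single bad column: drop it
        mid = len(cols) // 2
        return (readable_columns(path, cols[:mid]) +
                readable_columns(path, cols[mid:]))

df, meta = pyreadstat.read_sav('Test.sav', metadataonly=True)
good = readable_columns('Test.sav', meta.column_names)
df, meta = pyreadstat.read_sav('Test.sav', usecols=good)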
For this specific case I would suggest a different approach: you can pass an "encoding" argument to pyreadstat.read_sav to set the encoding manually. If you don't know which one it is, you can iterate over the list of encodings here: https://gist.github.com/hakre/4188459 to find out which one makes sense. For example:
import pyreadstat

# here codes is a list with all the encodings from the link mentioned before
for c in codes:
    try:
        df, meta = pyreadstat.read_sav("Test.sav", encoding=c)
        print(c)
        print(df.head())
    except Exception:
        pass
I did, and there were a few that might potentially make sense, assuming that the string is in a non-Latin alphabet. However, the most promising one is not in the list: encoding="UTF8" (the list contains UTF-8, with a dash, and that fails). Using UTF8 (no dash) I get this:
నేను గతంలో వాడిన బ
which according to Google Translate means "I used to come b" in Telugu. Not sure if that fully makes sense, but it's a way forward.
The advantage of this approach is that if you find the right encoding, you will not be losing data, and reading the data will be fast. The disadvantage is that you may not find the right encoding.
In case you do not find the right encoding, you would anyway be reading the problematic columns very fast, and you can discard them later in pandas by inspecting which character columns do not contain Latin characters (see the sketch below). This will be much faster than the algorithm you were suggesting.
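A hedged sketch of that inspection step: it flags object (string) columns containing any non-ASCII text via str.isascii, which needs Python 3.7+:
non_latin_cols = [
    col for col in df.select_dtypes(include='object').columns
    if df[col].dropna().astype(str).map(lambda s: not s.isascii()).any()
]
df_clean = df.drop(columns=non_latin_cols)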
I would like to load the items of my JSON file one by one. The file can be up to 3 GB, so loading it in advance and looping over it is not an option.
My JSON file is basically a dictionary of key-value pairs (hundreds of pairs), and there is nothing I want to discard (ijson).
I just want to load one pair at a time to work with it. Is there any way to do that?
So basically I found out from this answer how to do it in a much simpler way:
https://stackoverflow.com/a/17326199/2933485
Using ijson, it looks like you can loop over the file without loading it, by opening the file and using ijson's parse function on it. This is the example I found:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
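If the top level really is one big dict, recent ijson versions also provide kvitems, which yields one key-value pair at a time (this assumes a reasonably current ijson):
import ijson

with open(json_file_name, 'rb') as f:
    for key, value in ijson.kvitems(f, ''):  # '' addresses the top-level object
        print(key, value)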
Why don't you populate a sqlite table with the data once and query the data using the record PK? See https://docs.python.org/3.7/library/sqlite3.html
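A minimal sketch of that idea (the table and file names are illustrative, and pairs stands for an iterable of (key, value) tuples streamed from the JSON file); populate once, then each lookup touches only one row:
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)', pairs)  # one-time population
conn.commit()

# later: fetch a single pair by its primary key
row = conn.execute('SELECT value FROM kv WHERE key = ?', ('some_key',)).fetchone()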
OK, so JSON is a nested format, which means each repeating block (dict or list object) is surrounded by start and end characters. Normally, you read the entire file, and in doing so can confirm the well-formedness, structure, and "closedness" of each object - in other words, it's verifiable that all objects are legally structured. When you load a JSON file into memory using the json library, part of that process is the validation.
If you want to do that for an extra-large file, you have to forgo the normal library and roll your own, loading in a line (or chunk) at a time and processing it under the assumption that validation will retrospectively succeed.
That's achievable (assuming you're able to put your faith in such an assumption) but it's probably something you'll have to write yourself.
One strategy might be to read a line at a time, splitting on the colon (:) character, with commas as record delimiters, which is a crude approximation of how key-value pairs are coded within JSON. Following this method, you'll be able to process all but the first and final key-value pairs cleanly, in sequence.
That just leaves you to write some special conditions for properly parsing the first and final records, which will come through garbled using this strategy.
Crudely, then, call something like this (referencing the csv library) and treat the JSON like a massive, unusually formatted CSV file.
import csv

# open() only accepts '', '\n', '\r', '\r\n' or None for newline, so we read
# physical lines and let the csv reader split each one on ':'
with open('big.json', newline='') as csv_json_franken_file:
    jsonreader = csv.reader(csv_json_franken_file, delimiter=':', quotechar='"')
    for row in jsonreader:  # this reads in a "row" at a time, until finished
        print(', '.join(row))
Then do some edge-case treatment of the first and last rows (more or less, depending on the structure of your JSON) to repair the garbling caused by what is a fairly blatant hack. It's not clean, and it's not robust to changes in the content, but sometimes you just have to play the hand you've been dealt.
To be honest, generating json files of 3GB in size is a little irresponsible, so if anyone comes asking, you've got that in your corner.
I am currently expensively parsing a file, which generates a dictionary of ~400 key-value pairs that is seldom updated. Previously I had a function which parsed the file, wrote it to a text file in dictionary syntax (i.e. dict = {'Adam': 'Room 430', 'Bob': 'Room 404'}), and copied and pasted it into another function whose sole purpose was to return that parsed dictionary.
Hence, in every file where I would use that dictionary, I would import that function and assign it to a variable, which is now that dictionary. I am wondering if there's a more elegant way to do this which does not involve explicitly copying and pasting code around. Using a database seems unnecessary, and the text file gave me the benefit of seeing whether the parsing was done correctly before adding it to the function. But I'm open to suggestions.
Why not dump it to a JSON file, and then load it from there where you need it?
import json
with open('my_dict.json', 'w') as f:
json.dump(my_dict, f)
# elsewhere...
with open('my_dict.json') as f:
my_dict = json.load(f)
Loading from JSON is fairly efficient.
Another option would be to use pickle, but unlike JSON, the files it generates aren't human-readable, so you lose out on the visual verification you liked from your old method.
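For completeness, the pickle equivalent of the JSON snippet above (note the binary file modes):
import pickle

with open('my_dict.pickle', 'wb') as f:
    pickle.dump(my_dict, f)

# elsewhere...
with open('my_dict.pickle', 'rb') as f:
    my_dict = pickle.load(f)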
Why mess with all these serialization methods? It's already written to a file as a Python dict (although with the unfortunate name 'dict'). Change your program to write out the data with a better variable name - maybe 'data', or 'catalog', and save the file as a Python file, say data.py. Then you can just import the data directly at runtime without any clumsy copy/pasting or JSON/shelve/etc. parsing:
from data import catalog
JSON is probably the right way to go in many cases, but there might be an alternative. It looks like your keys and your values are always strings; is that right? You might consider using dbm/anydbm. These are "databases" but they act almost exactly like dictionaries. They're great for cheap data persistence.
>>> import anydbm  # Python 2; on Python 3, use the dbm module instead
>>> dict_of_strings = anydbm.open('data', 'c')
>>> dict_of_strings['foo'] = 'bar'
>>> dict_of_strings.close()
>>> dict_of_strings = anydbm.open('data')
>>> dict_of_strings['foo']
'bar'
If the keys are all strings, you can use the shelve module:
A shelf is a persistent, dictionary-like object. The difference with
“dbm” databases is that the values (not the keys!) in a shelf can be
essentially arbitrary Python objects — anything that the pickle module
can handle. This includes most class instances, recursive data types,
and objects containing lots of shared sub-objects. The keys are
ordinary strings.
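A minimal shelve sketch, reusing the room data from the earlier example (the file name is illustrative):
import shelve

with shelve.open('rooms') as db:
    db['Adam'] = 'Room 430'
    db['Bob'] = 'Room 404'

# elsewhere...
with shelve.open('rooms') as db:
    print(db['Adam'])  # 'Room 430'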
JSON would be a good choice if you need to use the data from other languages.
If storage efficiency matters, use pickle or cPickle (for an execution performance gain). As Amber pointed out, you can also dump/load via JSON. It will be human-readable, but it takes more disk space.
I suggest you consider using the shelve module, since your data structure is a mapping.
That was my answer to a similar question titled 'If I want to build a custom database, how could I?'. There's also a bit of sample code in another answer of mine promoting its use, for the question 'How to get an object database?'.
ActiveState has a highly rated PersistentDict recipe which supports csv, json, and pickle output file formats. It's pretty fast, since all three of those formats are implemented in C (although the recipe itself is pure Python), so the fact that it reads the whole file into memory when it's opened might be acceptable.
JSON (or YAML, or whatever) serialisation is probably better, but if you're already writing the dictionary to a text file in Python syntax, complete with a variable name binding, you could just write it to a .py file instead. Then that Python file would be importable and usable as-is. There's no need for the "function which returns a dictionary" approach, since you can directly use it as a global in that file, e.g.:
# generated.py
please_dont_use_dict_as_a_variable_name = {'Adam': 'Room 430', 'Bob': 'Room 404'}
rather than:
# manually_copied.py
def get_dict():
return {'Adam': 'Room 430', 'Bob': 'Room 404'}
The only difference is that manually_copied.get_dict gives you a fresh copy of the dictionary every time, whereas generated.please_dont_use_dict_as_a_variable_name[1] is a single shared object. This may matter if you're modifying the dictionary in your program after retrieving it, but you can always use copy.copy or copy.deepcopy to create a new copy if you need to modify one independently of the others.
[1] dict, list, str, int, map, etc. are generally viewed as bad variable names. The reason is that these are already defined as built-ins and are used very commonly. So if you give something a name like that, at the least it's going to cause cognitive dissonance for people reading your code (including you, after you've been away for a while), as they have to keep in mind that "dict doesn't mean what it normally does here". It's also quite likely that at some point you'll get an infuriating-to-solve bug report that dict objects aren't callable (or something), because some piece of code is trying to use the type dict but is getting the dictionary object you bound to the name dict instead.
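To illustrate the shared-object point, a small sketch (the module and variable names follow the example above):
import copy
from generated import please_dont_use_dict_as_a_variable_name as catalog

# mutate a deep copy, leaving the shared module-level dict untouched
my_rooms = copy.deepcopy(catalog)
my_rooms['Carol'] = 'Room 101'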
On the JSON front, there is also something called simplejson. The first time I used JSON in Python, the json library didn't work for me / I couldn't figure it out; simplejson was easier to use.
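simplejson is API-compatible with the stdlib json module, so switching is a one-line change; a sketch, assuming simplejson is installed:
import simplejson as json  # drop-in replacement: same load/loads/dump/dumps

data = json.loads('{"a": "this", "b": "that"}')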
How do I load a text file full of 10-digit codes, separated by newlines, into a dictionary in Python?
Then how do I cross-check the values in the dictionary against my own variables?
OK, it is simple really. I have a TXT file containing 1000 or so 10-digit sequences; it looks like this:
121001000
000000000
121212121
I need to load these values into a dictionary and then be able to take a number that I receive and cross-check it against this database so that it does NOT match.
I.e. 0000000001 must not equal any previous entry.
It sounds like you want to store the numbers in a way that makes it easy to look up "Is this other value already there?", but you don't actually have "values" to associate with these "keys" - so you don't really want a dict (associative array), but rather a set.
Python file objects are iterable, and iterating over them gives you each line of the file in turn. Meanwhile, Python's container types (including set) can be constructed from iterables. So making a set of the lines in the file is as simple as set(the_file_object). And since this is Python, checking if some other value is in the set is as simple as some_other_value in the_set.
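Putting those two sentences together, a minimal sketch (the file name is illustrative; newlines are stripped so lookups match bare codes):
with open('codes.txt') as f:
    seen = {line.strip() for line in f}

if '0000000001' not in seen:
    print('new code')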
On reading text from files, try looking over the Python documentation on input/output. Additionally, look through the data structures tutorial.
A dictionary usually has a key and a value that corresponds to the key:
name: "John"
age: 13
If you are just looking for a structure to read the values from the file, a list seems more appropriate, since you did not specify anything about the designation of those values.
If you need the file's contents as numbers and not as strings:
file_data = set()
with open('/some/file/with/sequences.txt') as f:
    for line in f:
        file_data.add(int(line))
then later:
if some_num not in file_data:
do_something_with(some_num)
If you have blank lines or garbage in the file, you'll want to add some error checking.
So let's say I'm using Python's ftplib to retrieve a list of log files from an FTP server. How would I parse that list of files to get just the file names (the last column) inside a list? See the link above for example output.
Using retrlines() probably isn't the best idea there, since it just prints to the console, so you'd have to do tricky things to even get at that output. A likely better bet would be to use the nlst() method, which returns exactly what you want: a list of the file names.
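A minimal sketch (the host and directory are illustrative):
from ftplib import FTP

ftp = FTP('ftp.example.com')  # hypothetical host
ftp.login()
filenames = ftp.nlst('/logs')  # just the names, no permission/size/date columns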
The best answer is this:
You may want to use ftp.nlst() instead of ftp.retrlines(). It will give you exactly what you want.
If you can't, read the following:
Generators for sysadmin processes
In his now-famous review, Generator Tricks for Systems Programmers: An Introduction, David M. Beazley gives a lot of recipes to answer this kind of data problem with quick and reusable code.
E.g.:
# empty list that will receive all the log entries
log = []

# we pass a callback to bypass the printing that retrlines would otherwise do;
# we do that only because we cannot use something better than retrlines
ftp.retrlines('LIST', callback=log.append)

# we use rsplit because it's more efficient in our case if we have a big file
files = (line.rsplit(None, 1)[1] for line in log)

# get your file list
files_list = list(files)
Why don't we generate the list immediately?
Well, it's because doing it this way offers you a lot of flexibility: you can apply any intermediate generator to filter files before turning them into files_list. It's just like a pipe: add a line and you add a processing step, without overhead (since these are generators). And if you get rid of retrlines, it still works, and it's even better, because you don't store the list even once.
EDIT: well, I read the comment on the other answer and it says that this won't work if there is any space in the name.
Cool, this will illustrate why this method is handy. If you want to change something in the process, you just change a line. Swap:
files = (line.rsplit(None, 1)[1] for line in log)
and
# split the line, take all the items from field 8 onwards, then join them
files = (' '.join(line.split()[8:]) for line in log)
OK, this may not be obvious here, but for huge batch-processing scripts, it's nice :-)
And a slightly less optimal method, by the way, if you're stuck using retrlines() for some reason, is to pass a function as the second argument to retrlines(); it'll be called for each item in the list. So something like this (assuming you have an FTP object named 'ftp') would work as well:
filenames = []
ftp.retrlines('LIST', lambda line: filenames.append(line.split()[-1]))
The list 'filenames' will then be a list of the file names.
Is there any reason why ftplib.FTP.nlst() won't work for you? I just checked and it returns only names of the files in a given directory.
Since every filename in the output starts at the same column, all you have to do is get the position of the dot on the first line:
drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .
Then slice the filename out of the other lines using the position of that dot as the starting index.
Since the dot is the last character on the line, you can use the length of the line minus 1 as the index. So the final code is something like this:
# retrlines() doesn't return the listing, so collect the lines via a callback
lines = []
ftp.retrlines('LIST', lines.append)

filename_index = len(lines[0]) - 1  # the '.' is the last character of line one

files = []
for line in lines:
    files.append(line[filename_index:])
If the FTP server supports the MLSD command, then please see section “single directory case” from that answer.
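On Python 3.3+, ftplib also exposes MLSD directly as FTP.mlsd(), which yields (name, facts) pairs; a minimal sketch, assuming the connected server supports the command:
# the 'type' fact distinguishes files from directories on MLSD-capable servers
filenames = [name for name, facts in ftp.mlsd() if facts.get('type') == 'file']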
Use an instance (say ftpd) of the FTPDirectory class, call its .getdata method with a connected ftplib.FTP instance in the correct folder, and then you can:
directory_filenames= [ftpfile.name for ftpfile in ftpd.files]
I believe it should work for you.
file_name_list = [' '.join(each_file_detail.split()).split()[-1] for each_file_detail in file_list_from_log]
NOTES -
Here I am making an assumption that you want the data in the program (as a list), not on the console.
each_file_detail is each line that is being produced by the program.
' '.join(each_file_detail.split())
replaces multiple spaces with a single space.