Keep track of a string in a binary file in Python - python

I have multiple strings and I want to keep track of them when I save them to a binary file. Specifically, I would like to know how many bytes of the binary file each string occupies. I don't know how to do this in Python. Please help me.
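One hedged way to approach this, assuming the strings are UTF-8 encoded and written back to back; the file name, encoding and example data are illustrative, not from the question:

# Write strings to a binary file and record how many bytes each one occupies.
strings = ["first string", "second", "a third one"]   # example data

index = []                                # (string, start offset, size in bytes)
with open("strings.bin", "wb") as f:      # hypothetical file name
    for s in strings:
        data = s.encode("utf-8")          # the bytes actually written
        offset = f.tell()                 # byte position before writing
        f.write(data)
        index.append((s, offset, len(data)))

for s, offset, size in index:
    print(f"{s!r} starts at byte {offset} and occupies {size} bytes")

The (offset, size) pairs can then be stored alongside the data (for example with struct or pickle) so each string can be located and read back later.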

Related

How to read a CSV and write it back exactly the same with pandas, overcoming float imprecision

I would like to read in a CSV and write it back exactly the same as it was, using pandas or similar.
Example CSV:
019-12-12 23:45:00,95480,12.41,-10.19,11.31851,2.1882
When I go to write it back, due to floating point behaviour I might get something like
019-12-12 23:45:00,95480,12.410000009,-10.19,11.31851000000002,2.1822
I've seen suggestions to use float_format, but the format is different for each column and different across the files I'm looping through.
I'm not sure exactly what you're doing, but if you need pandas and want to re-save the file, you presumably want to change data somewhere. If so, I'd recommend using
pd.set_option('display.precision', num_decimals)
as long as your decimals are reasonably close in precision; in the example given, 4 would allow enough precision and remove any straggling floating point inaccuracy. Otherwise, you'll have to look for several zeros in a row and delete all decimal places after that.
If you don't need to change any data, I would go with an alternative solution: shutil
import shutil
shutil.copyfile(path_to_file, path_to_target_file)
This way, there's no mutation that can occur as it's just copying the raw contents.

Converting the endianness of an existing binary file

I have a binary file on my PC that contains data in big-endian. The file is around 121 MB.
I would like to convert the data to little-endian with a Python script.
What is currently giving me headaches is that I don't know how to convert an entire file. If I had a short hex string I could simply use struct.pack to convert it to little-endian, but as far as I can tell I can't give struct.pack a binary file as input.
Is there another function/utility I can use to do this, or what should my approach look like?
We need documentation of, or knowledge about, the file's exact structure.
Suppose there is a 4-byte file. If this file holds a single int, we need to flip its bytes. But if it is a combination of 4 chars, we should leave it as it is.
Above all, you should find out the structure. Then we can talk about the conversion. I don't think there is any tool that converts arbitrary data; you need to parse the binary file according to its structure.
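As an illustration only, here is a minimal sketch for the special case where the whole file is a flat sequence of 4-byte big-endian unsigned integers; the file names are made up, and the format character has to be adapted to the real structure of the data:

import struct

with open("data_be.bin", "rb") as f:      # hypothetical input file
    raw = f.read()

count = len(raw) // 4                     # assumes the file size is an exact multiple of 4 bytes
values = struct.unpack(f">{count}I", raw)        # interpret as big-endian unsigned ints
converted = struct.pack(f"<{count}I", *values)   # repack as little-endian

with open("data_le.bin", "wb") as f:      # hypothetical output file
    f.write(converted)

For a 121 MB file this reads everything into memory at once, which is fine on most machines; for much larger files you would process the data in fixed-size chunks instead.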

Pandas: efficiently write thousands of small files

Here is my problem.
I have a single big CSV file containing a bit more than 100M rows, which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide each chunk, and finally writing (appending) to the files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the total disk space taken by the smaller files was on track to become more than three times the size of the single big CSV.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there file formats better suited to this situation than CSV?
Please note that I don't have extensive programming knowledge; I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to @AshishAcharya and @PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know whether there are any file formats that may be better than CSV for this task.
NEW QUESTION: (maybe stupid question) does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but running it on the folder containing the compressed files or on the one containing the uncompressed files gives the same value of '3.8G' (both are created from the same first 5M rows of the big CSV). From the file explorer (Nautilus), instead, I get about 590 MB for the folder with the uncompressed CSVs and 230 MB for the other. What am I missing?
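To the question about appending and formats, here is only a sketch, not from the thread: it writes each chunk as its own gzip-compressed CSV (so nothing is appended to an already compressed file) and, as one commonly suggested alternative format, Parquet via to_parquet, which requires pyarrow or fastparquet. The file names and chunk size are made up, and the per-file splitting logic from the question is omitted:

import pandas as pd

for i, chunk in enumerate(pd.read_csv("big.csv", chunksize=1_000_000)):
    # gzip-compressed CSV: same to_csv arguments as in the question, one file per chunk
    chunk.to_csv(f"part_{i:05d}.csv.gz", float_format="%.8f",
                 index=False, compression="gzip")
    # Parquet alternative (binary, columnar, compressed by default)
    chunk.to_parquet(f"part_{i:05d}.parquet", index=False)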

Keep formatting of Excel file when loading into DataFrame in Python

I'm trying to load an Excel file into Python and keep the formatting of the columns/data. I have numbers that are stored as text, but Python converts them to numbers, which causes an issue: only 15 significant digits are kept and digits after the 15th place become 0 (as would happen in Excel). I would like to keep the numbers as text so that all digits are preserved.
I'm using:
myContractData = pd.read_excel(Path)
Thanks a lot for your help.
Alex
If your case is based on loading data from an Excel file, you first need to check whether the numbers in the Excel file actually still have more than 15 digits.
As far as I know, in older Office versions Excel only handles 15 significant digits of precision: https://support.microsoft.com/en-us/kb/78113
If you use Excel to load the number, it is rounded to 15 digits automatically. That's why, for numbers in Excel files, we can only see 15 digits of the value.
If possible, I would suggest connecting to the data source directly from Python, or converting the value to a string in the data source.
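If the values really are stored as text in the workbook, one option is to stop pandas from converting them by forcing the columns to be read as strings. This is only a sketch, and the column name 'ContractNumber' is made up:

import pandas as pd

# Read every column as text (no numeric conversion, so no 15-digit rounding in pandas)
myContractData = pd.read_excel(Path, dtype=str)

# Or convert only specific columns; 'ContractNumber' is a hypothetical column name
myContractData = pd.read_excel(Path, converters={"ContractNumber": str})

Note that this only helps if the digits are still present in the file; if Excel itself already stored the value as a 15-digit number, the lost digits cannot be recovered by pandas.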

Implementing a Python list and binary search tree

I am working on making an index for a text file. The index will be a list of every word and symbol like (~!##$%^&*()_-{}:"<>?/.,';[]1234567890|), counting the number of times each token occurs in the text file, and printing all of this in ascending ASCII order.
I am going to read a .txt file, split the words and special characters, and store them in a list. Can anyone give me an idea of how to use binary search in this case?
If your lookup is small (say, up to 1000 records), then you can use a dict; you can either pickle() it, or write it out to a text file. Overhead (for this size) is fairly small anyway.
If your look-up table is bigger, or there are a small number of lookups per run, I would suggest using a key/value database (e.g. dbm).
If it is too complex to use a dbm, use a SQL (e.g. sqlite3, MySQL, Postgres) or NoSQL database. Depending on your application, you can get a huge benefit from the extra features these provide.
In either case, all the hard work is done for you, much better than you can expect to do it yourself. These formats are all standard, so you will get simple-to-use tools to read the data.
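As a concrete illustration of the dict approach, here is a minimal sketch that counts tokens and prints them in ascending ASCII order; the file name and the tokenization rule are assumptions, not from the original question, and no binary search is needed because the dict handles the lookups:

import re
from collections import Counter

with open("input.txt", encoding="utf-8") as f:   # hypothetical file name
    text = f.read()

# Split into runs of letters/digits ("words") and single non-space symbols
tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

counts = Counter(tokens)                         # token -> number of occurrences

for token in sorted(counts):                     # sorted() gives ascending ASCII order for ASCII tokens
    print(token, counts[token])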
