Reformatting a txt file with characters at certain positions using python - python

Very newbie programmer asking a question here. I have searched all over the forums but can't find anything that solves this issue, which I thought there would be a simple function for. Is there a way to do this?
I am trying to reformat a txt file so I can use it with the pandas function but this requires my data to be in a specific format.
Currently my data is in the following format of a txt file:
01/09/21,00:28,7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0
01/09/21,00:58,7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
It is required to be formatted like this for processing using pandas:
["06/09/21","19:58",11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3],
["06/09/21","20:28",10.6,73,6.0,0.0,0.0,0,0.0,0.3,1006.3,82.2,22.4,49,0.0,10.6,10.6,0,0,0.00,0.00,9.7,0,1.5,0,0.0,0.3],
This requires adding a [" at the start and adding a " at the end of the date before the comma, then adding another " after the comma and another " at the end of the time section. At the end of the line, I also need to add a ],
I thought something like this would work, but I get an error when trying to run it.
info =
06/09/21,19:58,11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3
info=info[:1] +"['" +info[1:]
print (info)
I have over 1000 lines of data so doing it manually is out of the question. I've seen other questions like this, but they didn't get helpful answers. Can it be done, preferably with either a method or a loop?

You are confusing the CONTENTS of your data with the REPRESENTATION of your data. You don't really need brackets and quotes at all. What you need is a list that contains strings and integers. What you've shown there is how Python would PRINT a list containing strings and integers. The list doesn't actually contain brackets or quotes.
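To make the contents-versus-representation point concrete, here is a small sketch using a shortened version of one line from the question. The helper name convert is hypothetical; it only illustrates that the brackets and quotes appear when Python prints the list, not in the data itself.

```python
# A shortened sample line (the real lines have 27 fields).
line = "06/09/21,19:58,11.4,69,5.9"

def convert(field):
    """Turn a text field into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

row = [convert(field) for field in line.split(",")]

print(row)     # the list's representation: ['06/09/21', '19:58', 11.4, 69, 5.9]
print(row[0])  # the actual contents of the first item: 06/09/21
```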
You can use pandas.read_csv directly on that data file with no extra processing. You just need to provide the column names.
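A minimal sketch of that approach, assuming pandas is installed. The column names are placeholders (the question never says what the 27 fields mean), and io.StringIO stands in for the path to the real txt file.

```python
import io

import pandas as pd

# Two sample lines from the question; with a real file, pass its path instead.
raw = (
    "01/09/21,00:28,7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,"
    "3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0\n"
    "01/09/21,00:58,7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,"
    "5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0\n"
)

# Placeholder names: "date", "time", then col3 ... col27 (27 columns in total).
names = ["date", "time"] + [f"col{i}" for i in range(3, 28)]

# header=None tells pandas the file has no header row of its own.
df = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(df.shape)
print(df[["date", "time", "col3"]])
```

If a list of lists is still needed afterwards, df.values.tolist() produces one without any manual bracket or quote editing.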

Related

How to reduce repeated code when reusing string format?

For example, if I have lots of lines of coding doing something like:
print('{:=+5d}'.format(my_value))
or perhaps something more involved like:
print('{:04d}-{:04d}|{:03d}'.format(val1, val2, val3))
Is there a good way (and is it good practice) to replace the string format conversion specifier with something so that I:
Reduce the number of times that needs to be typed out
Make it more human readable
Make the format string parametric so it can be changed in one place
Edit for more clarity:
These prints occur throughout the code and aren't just a single list of items in one spot I can loop through. The formats are also used for strings in log messages and other non-print places.
I might want to even specify the string format programmatically
This is running on a legacy python 2 script
Try this:
['{:=+5d}'.format(val) for val in all_vals]
As of Python 3.6 you can write this more idiomatically with an f-string (not available in a legacy Python 2 script):
print(f'{val1:04d}-{val2:04d}|{val3:03d}')
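For the goals listed in the question (less typing, more readable, one place to change), one option that neither answer shows is to name each format string once and reuse its bound format method. This is only a sketch: the constant and function names are hypothetical, and it relies on plain str.format, which also exists in Python 2.7.

```python
# Define each format once, in one place, under a descriptive name.
SIGNED_FMT = '{:=+5d}'
RANGE_FMT = '{:04d}-{:04d}|{:03d}'

# Bound methods act as small reusable formatting functions.
format_signed = SIGNED_FMT.format
format_range = RANGE_FMT.format

signed_text = format_signed(42)
range_text = format_range(1, 2, 3)

print(signed_text)  # '+  42'
print(range_text)   # '0001-0002|003'
```

Because the formats are ordinary strings, they can also be built or swapped programmatically, and the same functions work in log messages as well as in print calls.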

cleaning the format of the printed data in python

I am trying to compare two lists in python and produce two arrays that contain matching rows and non-matching rows, but the program prints the data in an ugly format. How can I go about cleaning it up?
If you want to read the file without the \n characters, you might consider doing the following
lines = list1.read().splitlines()
lines2 = list2.read().splitlines()
which would read your file without the "\n" characters (readlines() keeps them)
Alternatively, for each line, you can do .strip("\n")
The "ugly format" might be because you are using print(match), which shows each element of the list via repr(): something that is more useful for debugging or as input back to Python, but not 'nice'.
If you want it printed 'nicely', you'd have to decide what format that would be and write the code for it. In the simplest case, you might do:
for i in match:
    print(i)
(Note that your original list contains \n characters; that's what iterating over an open text file gives you. They will get printed as well, together with the '\n' added by print() itself. I don't know if you want them removed or not. See the other answer for possible ways of getting rid of them.)
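Putting the two answers together, a minimal sketch might look like the following. The question's own code is not shown, so the sample data is made up, and io.StringIO stands in for the two open files (named list1 and list2 as in the answer above).

```python
import io

# Stand-ins for the two open files; the contents are invented for illustration.
list1 = io.StringIO("alpha\nbeta\ngamma\n")
list2 = io.StringIO("beta\ngamma\ndelta\n")

# splitlines() drops the trailing newline from every row.
lines = list1.read().splitlines()
lines2 = list2.read().splitlines()

match = [row for row in lines if row in lines2]
no_match = [row for row in lines if row not in lines2]

# Print one row per line instead of the list's repr().
for row in match:
    print(row)
```

For large files, turning lines2 into a set first would make the membership tests faster.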

Django import-export leading zeros for numerical values in excel

I am faced with the following problem: when I generate .csv files in python using django-import-export even though the field is a string, when I open it in Excel the leading zeros are omitted. E.g. 000123 > 123.
This is a problem, because if I'd like to display a zipcode I need the zeros the way they are. I can cover it in quotes, but that's not desirable since it will grab unnecessary attention and it just looks bad. I'm also aware that you can do it in Excel files manually by changing the data type, but I don't want to explain that to people who are using my software.
Any suggestions?
Thanks in advance.
I've tried this solution. It's the solution suggested by @jquijano but it hasn't worked.
After generating the CSV, I opened it with 'open office' and 'excel' and in both cases I could see the (') character at the beginning of each string. However, if I added a new value to the CSV in the editor, for example '0895, the (') disappeared and the leading 0 wasn't removed.
Luckily, I found a workaround. I just added a non-printing control character at the beginning.
value = chr(24) + unidecode('00123')
An easy fix would be adding an apostrophe (') at the beginning of each number when exporting with import-export. This way Excel will recognize those numbers as text.

Fastest way to extract part of a long string in Python

I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a sub string like this:
my_token:[
"key_of_interest"
],
This is the only part in each string it says my_token. I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
Is there a better or more efficient way of doing this? I'll be doing this for string of length ~10,000 and sets of size 100,000.
Edit: The file is a .ion file. From my understanding it can be treated as a flat file - as it is text based and used for describing metadata.
How can this possibly be done the "dumbest and simplest way"?
find the starting position
look on for the ending position
grab everything indiscriminately between the two
This is indeed what you're doing. Thus any further improvement can only come from the optimization of each step. Possible ways include:
narrow down the search region (requires additional constraints/assumptions, as noted in the comments)
speed up the search operation bits, which include:
extracting raw data from the format
you already did this by disregarding the format altogether, so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token), as noted in the comments
elementary pattern comparison operation
unlikely to attain in pure Python since str.index is implemented in C already and the implementation is probably already as simple as can possibly be
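The three steps listed above (find the start, find the end, grab what is between) can be sketched with str.index. The function name and the sample text are hypothetical, and the sketch assumes the markers never appear anywhere else in the string.

```python
def extract_between(text, start_marker="my_token:[", end_marker="]"):
    """Return the text between the markers, without whitespace or quotes."""
    # Step 1: find the starting position, just past the opening marker.
    start = text.index(start_marker) + len(start_marker)
    # Step 2: look on for the ending position, searching only after the start.
    end = text.index(end_marker, start)
    # Step 3: grab everything between the two, then trim whitespace and quotes.
    return text[start:end].strip().strip('"')

sample = 'other: 1,\nmy_token:[\n"key_of_interest"\n],\nmore: 2'
key = extract_between(sample)
print(key)  # key_of_interest
```

str.index raises ValueError when a marker is missing, which is usually preferable to silently returning a wrong slice.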
The underlying requirement shows through when you clarify:
I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.
There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.
So, use libraries written by people who have dealt with the issues before you.
If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.
So to make a good choice you need to know what is the data format (this is not answered by “what are the file names”; rather, you need to know what is the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
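As an illustration of the parser route only: the question's files are .ion, not JSON, so the document below is a made-up JSON stand-in. If the content really were JSON, the standard library would handle quoting and escaping without any index arithmetic.

```python
import json

# Hypothetical JSON stand-in for the real data; the actual files use another format.
document = '{"my_token": ["key_of_interest"], "other": 1}'

data = json.loads(document)
key_of_interest = data["my_token"][0]
print(key_of_interest)  # key_of_interest
```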
Well, as already mentioned - a parser seems the best option.
But to answer your question without all this extra advice ... if you're just looking at speed, a parser isn't really the best method of doing this. The faster method, if you already have a string like this, would be to use a regex.
import re
matches = re.search(r'my_token:\[\s*"(.*?)"\s*\],', text)  # text is the string to search
key_of_interest = matches.group(1)
There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. And therefore this gets a bit too complicated.
And JSON is not regex parsable in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions regex would be faster than a json parser.
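A runnable sketch of the regex approach for a large set of strings, compiling the pattern once and reusing it. The names and sample strings are hypothetical, and the pattern assumes the key contains no escaped quote characters, which is exactly the restriction mentioned above.

```python
import re

# Compile once; non-greedy (.*?) stops at the first closing quote.
TOKEN_RE = re.compile(r'my_token:\[\s*"(.*?)"\s*\]')

def extract_key(text):
    """Return the quoted key after my_token:[ or None if it is absent."""
    match = TOKEN_RE.search(text)
    return match.group(1) if match else None

samples = [
    'a: 1,\nmy_token:[\n"first_key"\n],\nb: 2',
    'no token here',
]
keys = [extract_key(sample) for sample in samples]
print(keys)  # ['first_key', None]
```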

wxPython/TextCtrl replacing a character within the first x lines of a string

I've scanned the questions here as well as the web and haven't found my answer, this is my first question and I'm a noobie to (wx)Python so go easy on me.
Using TextCtrl I'm trying to remove a single character within a string, this string will always start with the same set of characters but the rest of the string is freely editable by the user.
e.g
self.text = wx.TextCtrl(panel, -1, "hello world,, today we're asking a question on stackoverflow, what would you ask?")
poor example but how would I find and remove the 11th(',') character so the sentence is more formatted without affecting the rest of the string?
I've tried standard python indexing but I get an error for that. I can successfully remove chunks of the string from the start outwards or the end inwards, but I need only a single character removed.
Again, sorry for the poor terminology, as I said I'm fairly new to python so some of my terms may be a bit iffy.
value = self.text.GetValue()
self.text.SetValue(value[:11] + value[12:])  # drops the character at index 11, the first comma
maybe??
self.text.SetValue(self.text.GetValue().replace(",,", ","))
maybe?
It's not really clear what you are trying to accomplish here ...
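The slicing idea can be checked without wx at all, since GetValue() returns an ordinary string. A minimal sketch with a hypothetical helper name; the index is zero-based, so the first comma in the example text sits at index 11.

```python
def remove_char_at(text, index):
    """Return text with the single character at the given zero-based index removed."""
    return text[:index] + text[index + 1:]

original = "hello world,, today we're asking a question"
fixed = remove_char_at(original, 11)
print(fixed)  # hello world, today we're asking a question
```

In the wx code, the result would then be passed to SetValue(), as in the first suggestion above.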
