Very newbie programmer asking a question here. I have searched all over the forums but can't find something to solve this issue I thought there would be a simple function for. Is there a way to do this?
I am trying to reformat a txt file so I can use it with the pandas function but this requires my data to be in a specific format.
Currently my data is in the following format of a txt file:
01/09/21,00:28,7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0
01/09/21,00:58,7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
it is required to be formatted like this for processing using pandas:
["06/09/21","19:58",11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3],
["06/09/21","20:28",10.6,73,6.0,0.0,0.0,0,0.0,0.3,1006.3,82.2,22.4,49,0.0,10.6,10.6,0,0,0.00,0.00,9.7,0,1.5,0,0.0,0.3],
This requires adding a [" at the start and adding a " at the end of the date before the comma, then adding another " after the comma and another " at the end of the time section. At the end of the line, I also need to add a ],
I thought something like this would work but i get an error when trying to run it.
info =
06/09/21,19:58,11.4,69,5.9,0.0,0.0,0,0.0,0.3,1006.6,82.2,21.8,52,0.0,11.4,11.4,0,0,0.00,0.00,10.5,0,1.5,0,0.0,0.3
info=info[:1] +"['" +info[1:]
print (info)
I have over 1000 lines of data so doing it manually is out of the question. I've seen other questions like this, but they didn't get helpful answers. Can it be done, preferably with either a method or a loop?
You are confusing the CONTENTS of your data with the REPRESENTATION of your data. You don't really need brackets and quotes at all. What you need is a list that contains strings and integers. What you've shown there is how Python would PRINT a list containing strings and integers. The list doesn't actually contain brackets or quotes.
You can use pandas.read_csv directly on that data file with no extra processing. You just need to provide the column names.
I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2/77 of the items on the list have encoding, giving me characters of "u00c2" and "u00a2" which means they are not comparing correctly and being missed. I have no idea why these 2/77 have this encoding, but the other 75 are fine, and I don't know how to get rid of the encoding, in python.
Question:
In Python, How can I get rid of the encoding to ensure none of them have any special/weird characters and are just plain text?
Is there a method I can use to do this upon reading the file in?
Here is how I am reading the text file into python:
with open("name_list2.txt", "r") as myfile:
policy_match_list = myfile.readlines()
policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
for policy_name in policy_match_list:
for us_policy in us_policies:
if policy_name == us_policy["name"]:
print(f"Match #{match} | {policy_name}")
match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
Which is resulting in 75/77 expected matches, due to the other 2 policies comparing e.g. "text12 - text" to "text12u00c2-u00a2text" rather than "text12 - text" to "text12 - text"
I hope this makes sense, let me know if I can add any further info
Cheers!
Did you try to open the file while decoding from utf8? because I can't see the file I can't tell this is the problem, but the file might have characters that the default decoding option (which I think is Latin) can't process.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
Also, you can watch this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.
Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.
I have my script doing the following for the headers of the text file:
file.write("{:50} {:50} {:50}\n".format("Name", "Count", "Price"))
Then a function is called through multiple threads that write out a list:
file.write("{:50}{:50}{:50}".format(*map(str, input)).strip()+ "\n")
This gets the formatting close to the header but they all seem to be off by a bit, i'm assuming it has to do with the variable length of the contents of the list. But I'm not sure how to change the 50's in the list to make it align correctly.
My output looks like this currently:
Any help would be appreciated.
EDIT
Problem has been fixed.
The variance is caused by your inputs containing extra whitespace after, pushing the line lengths past the fifty characters.
You need to strip the inputs, not the str.format() output:
file.write("{:50}{:50}{:50}\n".format(*(str(i).strip() for i in input)))
Next, open your file in a fixed-width font. Your text is not aligning because different letters take up different widths; m is wider than i.
I'm outputting pretty huge XML structure to file and I want user to be able to enable/disable pretty print.
I'm working with approximately 150MB of data,when I tried xml.etree.ElementTree and build tree structure from it's element objects, it used awfully lot of memory, so I do this manually by storing raw strings and outputing by .write(). My output sequence looks like this:
ofile.write(pretty_print(u'\
\t\t<LexicalEntry id="%s">\n\
\t\t\t<feat att="languageCode" val="cz"/>\n\
\t\t\t<Lemma>\n\
\t\t\t\t<FormRepresentation>\n\
\t\t\t\t\t<feat att="writtenForm" val="%s"/>\n\
\t\t\t\t</FormRepresentation>\n\
\t\t\t</Lemma>\n\
\t\t\t<Sense>%s\n' % (str(lex_id), word['word'], '' if word['pos']=='' else '\n\t\t\t\t<feat att="partOfSpeech" val="%s"/>' % word['pos'])))
inside the .write() I call my function pretty_print which, depending on command line option, SHOULD strip all tab and newline characters
o_parser = OptionParser()
# ....
o_parser.add_option("-p", "--prettyprint", action="store_true", dest="pprint", default=False)
# ....
def pretty_print(string):
if not options.pprint:
return string.strip('\n\t')
return string
I wrote 'should', because it does not, in this particular case it does not strip any of the characters.
BUT in this case, it works fine:
for ss in word['synsets']:
ofile.write(pretty_print(u'\t\t\t\t<Sense synset="%s-synset"/>\n' % ss))
First thing that came on my mind was that there might be some issues with the substitution, but when i print passed string inside the pretty_print function it looks perfectly fine.
Any suggestiones what might cause that .strip() does not work?
Or if there is any better way to do this, I'll accept any advice
Your issue is that str.strip() only removes from the beginning and end of a string.
You either want str.replace() to remove all instances, or to split it into lines and strip each line, if you want to remove them from the beginning and end of lines.
Also note that for your massive string, Python supports multi-line strings with triple quotes that will make it a lot easier to type out, and the old style string formatting with % has been superseded by str.format() - which you probably want to use instead in new code.
I'm using python 2.6 on linux.
I have two text files
first.txt has a single string of text on each line. So it looks like
lorem
ipus
asfd
The second file doesn't quite have the same format.
it would look more like this
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine if there's a match in the second text. If there isn't a match then I would like to save the missing text to a third file. I would like to ignore case but not completely necessary. This is why I was looking at regex but didn't have much luck.
So I'm opening the files, using readlines() to create a list.
Iterating through the lists and printing out the matches.
Here's my code
first_file=open('first.txt', "r")
first=first_file.readlines()
first_file.close()
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
while i < len(first):
j=search[i]
while k < len(second):
m=compare[k]
if not j.find(m):
print m
i=i+1
k=k+1
exit()
It's definitely not elegant. Anyone have suggestions how to fix this or a better solution?
My approach is this: Read the second file, convert it into lowercase and then create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, and with extra whitespace removed) is not in the set we created, write it to the third file.
with open("second.txt") as second_file:
second_values = set(second_file.read().lower().split())
with open("first.txt") as first_file:
with open("third.txt", "wt") as third_file:
for line in first_file:
if line.lower().strip() not in second_values:
third_file.write(line + "\n")
set objects are a simple container type that is unordered and cannot contain duplicate value. It is designed to allow you to quickly add or remove items, or tell if an item is already in the set.
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards, in Python 2.5 they require that you put the line from __future__ import with_statements at the top of your file.
The in operator does what it sounds like: tell you if a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
Assuming that you're looking for the entire line in the second file:
second_file=open('second.txt',"r")
second=second_file.readlines()
second_file.close()
first_file=open('first.txt', "r")
for line in first_file:
if line not in second:
print line
first_file.close()