How to remove brackets and the contents inside from a file - python

I have a file named sample.txt which looks like below
ServiceProfile.SharediFCList[1].DefaultHandling=1
ServiceProfile.SharediFCList[1].ServiceInformation=
ServiceProfile.SharediFCList[1].IncludeRegisterRequest=n
ServiceProfile.SharediFCList[1].IncludeRegisterResponse=n
My requirement is to remove the brackets and the integer inside them, and then build OS commands from the result:
ServiceProfile.SharediFCList.DefaultHandling=1
ServiceProfile.SharediFCList.ServiceInformation=
ServiceProfile.SharediFCList.IncludeRegisterRequest=n
ServiceProfile.SharediFCList.IncludeRegisterResponse=n
I am quite a newbie in Python. This is my first attempt. I have used this code to remove the brackets:
#!/usr/bin/python
import re
import os
import sys
f = os.open("sample.txt", os.O_RDWR)
ret = os.read(f, 10000)
os.close(f)
print ret
var1 = re.sub("[\(\[].*?[\)\]]", "", ret)
print var1
f = open("removed.cfg", "w+")
f.write(var1)
f.close()
After this, using the file as input, I want to form application-specific commands which look like this:
cmcli INS "DefaultHandling=1 ServiceInformation="
and the next set as
cmcli INS "IncludeRegisterRequest=n IncludeRegisterRequest=y"
So basically I now want all the output bunched into sets of two so I can execute the commands on the operating system.
Is there any way that I could group them up in sets of two?

Reading 10,000 bytes of text into a string is really not necessary when your file is line-oriented text, and isn't scalable either. And you need a very good reason to be using os.open() instead of open().
So, treat your data as the lines of text that it is, and every two lines, compose a single line of output.
from __future__ import print_function
import re

command = [None, None]
cmd_id = 1
bracket_re = re.compile(r".+\[\d\]\.(.+)")
# This doesn't just remove the brackets: what you actually seem to want is
# to pick out everything after [1]. and ignore the rest.

with open("removed_cfg", "w") as outfile:
    with open("sample.txt") as infile:
        for line in infile:
            m = bracket_re.match(line)
            cmd_id = 1 - cmd_id  # gives 0, 1, 0, 1
            command[cmd_id] = m.group(1)
            if cmd_id == 1:  # we have a pair
                output_line = """cmcli INS "{0} {1}" """.format(*command)
                print(output_line, file=outfile)
This gives the output
cmcli INS "DefaultHandling=1 ServiceInformation="
cmcli INS "IncludeRegisterRequest=n IncludeRegisterResponse=n"
The second line doesn't correspond to your sample output. I don't know how the input IncludeRegisterResponse=n is supposed to become the output IncludeRegisterRequest=y. I assume that's a mistake.
Note that this code depends on your input data being precisely as you describe it and has no error checking whatsoever. So if the format of the input is in reality more variable than that, then you will need to add some validation.
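For example, a minimal sketch of such validation: skip (and report) any line the regex doesn't recognize, so a stray blank line or comment can't crash the script on m.group(1):

import re
import sys

bracket_re = re.compile(r".+\[\d\]\.(.+)")

with open("sample.txt") as infile:
    for line in infile:
        m = bracket_re.match(line)
        if m is None:
            # skip anything that isn't a Key[n].Subkey=value line
            print("skipping unrecognized line: " + line.rstrip(), file=sys.stderr)
            continue
        # ... pair up m.group(1) exactly as in the loop above ...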

Related

How to solve problem decoding from wrong json format

Hi everyone. I need help opening and reading a file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written into the txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all one string....
I need to open it, check for repeated IDs, delete them, and save the file again.
But json.loads gives ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I'm still getting that error, just in a different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()
before_log = before_log.replace('}{', ', ').split(', ')
mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/jsons in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution), you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best bet is to separate }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'
data = ''

with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file's structure and actual JSON format is the missing commas between objects, and that the objects are not enclosed within [ and ]. So the same can be achieved with the snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing brackets; drop the last comma added in the previous step
b = '[' + b[:-1] + ']'
x = json.loads(b)
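For what it's worth, the json module can also walk a string of back-to-back objects directly via JSONDecoder.raw_decode, with no string surgery at all; a sketch (file names taken from the question):

import json

def iter_concatenated_json(text):
    # yield each JSON object from a string of back-to-back objects
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
        # tolerate stray whitespace between objects
        while idx < len(text) and text[idx].isspace():
            idx += 1

merged = {}
with open('111111111.txt', encoding='utf8') as f:
    for chunk in iter_concatenated_json(f.read()):
        for key, value in chunk.items():
            merged.setdefault(key, value)  # first occurrence wins

with open('new_file.txt', 'w', encoding='utf8') as out:
    json.dump(merged, out)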

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns; column 4 for entity 2 would be "column\"this is, a test\"4b", and column 2 for entity 3 would be "column2c". Each column begins and ends with a quote, but you must be careful because some columns contain escaped quotes. Thanks in advance!
You could do it like this, i.e.:
Read the whole file.
Split the input on each newline character that is not preceded by a comma.
Iterate over the split elements and split again on any comma (plus an optional following newline) that is both preceded and followed by a double quote.
Code:
import re

with open('f') as f:
    fil = f.read()

m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check:
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here, f is the input file name and f.py is the file containing the Python script.
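Since the data is essentially CSV with backslash-escaped quotes and records that span physical lines, the standard csv module is an alternative worth knowing. Note that it returns the fields with the outer quotes stripped and \" unescaped, which may or may not suit you; a hedged sketch against the comment-free file f above:

import csv

entities, pending = [], []
with open('f', newline='') as fh:
    # escapechar makes the reader treat \" as a literal quote;
    # quoted fields are allowed to span physical lines
    for row in csv.reader(fh, escapechar='\\'):
        # drop the empty field each trailing comma produces
        # (this assumes no real column is ever empty)
        pending.extend(field for field in row if field != '')
        while len(pending) >= 4:
            entities.append(pending[:4])
            pending = pending[4:]

for e in entities:
    print(e)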
Your problem is strikingly similar to one I have to deal with three times every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print; if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print matches
        else:
            # use line.strip() if you need to trim whitespace
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck if we get 4 columns
        # use line.strip() if you need to trim whitespace
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete, so print it.
        # We must not forget to empty buffer now that we got a whole line
        if len(matches) == columns:
            print matches
            buffer = ""
        # Optional; always good to have a safety backdoor though.
        # If there is a problem with the csv itself, like a weird unescaped
        # quote, you send it somewhere else
        elif len(matches) > columns:
            print "Error: cannot parse line:\n" + buffer
            buffer = ""
ideone demo
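To run the same buffering idea over a real file instead of the hardcoded text, only the loop source changes; a condensed sketch (the file name data.csv is an assumption, Python 3 print):

import re

check = re.compile(r'"(?:[^"\\]*|\\.)*"')
columns = 4
buffer = ""

with open("data.csv") as f:  # file name assumed
    for line in f:
        buffer += line.strip()
        matches = check.findall(buffer)
        if len(matches) == columns:
            print(matches)
            buffer = ""
        elif len(matches) > columns:
            print("Error: cannot parse line:\n" + buffer)
            buffer = ""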

python reading lines in file and joining them

I have this code
with open('ip.txt') as ip:
    ips = ip.readlines()
with open('user.txt') as user:
    usrs = user.readlines()
with open('pass.txt') as passwd:
    passwds = passwd.readlines()
with open('prefix.txt') as pfx:
    pfxes = pfx.readlines()
with open('time.txt') as timer:
    timeout = timer.readline()
with open('phone.txt') as num:
    number = num.readline()
which opens all those files. I then join the values in this shape:
result = ('Server:{0} # U:{1} # P:{2} # Pre:{3} # Tel:{4}\n{5}\n'.format(b,c,d,a,number,ctime))
print (result)
cmd = ("{0}{1}#{2}".format(a,number,b))
print (cmd)
I supposed it would print like this:
Server:x.x.x.x # U:882 # P:882 # Pre:900 # Tel:456123456789
900456123456789#x.x.x.x
but the output was like this
Server:x.x.x.x
# U:882 # P:882 # Pre:900
# Tel:456123456789
900
456123456789#187.191.45.228
New output:
Server:x.x.x.x # U:882 # P:882 # Pre:900 # Tel:['456123456789']
900['456123456789']#x.x.x.x
How can I solve this?
Maybe you should remove the newline using strip().
Example
with open('ip.txt') as ip:
    ips = ip.readline().strip()
readline() will read one line at a time, whereas readlines() will read the entire file as a list of lines.
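Since the original code uses readlines(), a list comprehension that strips each element keeps the list structure intact; a minimal sketch:

with open('ip.txt') as ip:
    ips = [line.strip() for line in ip]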
I am guessing from your limited example that b has a newline embedded. That's because of readlines(). The Python idiom to use here is ip.read().splitlines(), where ip is one of your file handles.
See more splitlines options at python docs
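For comparison, a quick demonstration of the difference (the contents shown are just an example):

with open('ip.txt') as ip:
    with_newlines = ip.readlines()             # e.g. ['1.2.3.4\n']
with open('ip.txt') as ip:
    without_newlines = ip.read().splitlines()  # e.g. ['1.2.3.4']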
Apart from the other great answers, for completeness' sake here is an alternative using string.translate. It also covers the case where a \n has been accidentally inserted into the middle of your string, like '123\n456\n78', a corner case that rstrip or strip would miss.
Server:x.x.x.x # U:882 # P:882 # Pre:900 # Tel:['456123456789']
900['456123456789']#x.x.x.x
You get this because you're printing a list; to resolve it, you need to join the strings in your list number.
Altogether, the solution will be something like this:
import string
# prepare for string translation to get rid of new lines
tbl = string.maketrans("","")
result = ('Server:{0} # U:{1} # P:{2} # Pre:{3} # Tel:{4}\n{5}\n'.format(b,c,d,a,''.join(number),ctime))
# this will translate all new lines to ""
print (result.translate(tbl, "\n"))
cmd = ("{0}{1}#{2}".format(a,''.join(number),b))
print (cmd.translate(tbl, "\n"))
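Note that the two-argument translate above is Python 2 only. In Python 3, the same newline stripping would look like this (a minimal sketch):

# Python 3: str.maketrans's third argument lists characters to delete
cleaned = '123\n456\n78'.translate(str.maketrans('', '', '\n'))
print(cleaned)  # -> 12345678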

Copy pieces of data from a .txt into another file for a spreadsheet

I have a bunch of data in a .txt file and I need it in a format that I can use in Fusion Tables or a spreadsheet. I assume that format would be a csv that I can write to another file and then import into a spreadsheet to work with.
The data is in this format with multiple entries separated by a blank line.
Start Time
8/18/14, 11:59 AM
Duration
15 min
Start Side
Left
Fed on Both Sides
No
Start Time
8/18/14, 8:59 AM
Duration
13 min
Start Side
Right
Fed on Both Sides
No
(etc.)
but I need it ultimately in this format (or whatever I can use to get it into a spreadsheet):
StartDate, StartTime, Duration, StartSide, FedOnBothSides
8/18/14, 11:59 AM, 15, Left, No
- , -, -, -, -
The problems I have come across are:
-I don't need all the info or every line, but I'm not sure how to automatically separate them. I don't even know if the way I am going about sorting each line is smart.
-I have been getting an error that says "argument 1 must be string or read-only character buffer, not list" when I use .read() or .readlines() sometimes (although it did work at first); also, both of my arguments are .txt files.
-The dates and times are not in set formats with regular lengths (the file has 8/4/14, 5:14 AM instead of 08/04/14, 05:14 AM), which I'm not sure how to deal with.
This is what I have tried so far:
from sys import argv
from os.path import exists

def filework():
    script, from_file, to_file = argv
    print "copying from %s to %s" % (from_file, to_file)
    in_file = open(from_file)
    indata = in_file.readlines()  # .read() .readline .readlines .read().splitlines() .xreadlines()
    print "the input file is %d bytes long" % len(indata)
    print "does the output file exist? %r" % exists(to_file)
    print "ready, hit RETURN to continue, CTRL-C to abort."
    raw_input()
    # do stuff section----------------BEGIN
    for i in indata:
        if i == "Start Time":
            pass  # do something
        elif i == '{date format}':
            pass  # do something
        else:
            pass  # do something
    # do stuff section----------------END
    out_file = open(to_file, 'w')
    out_file.write(indata)
    print "alright, all done."
    out_file.close()
    in_file.close()

filework()
So I'm relatively unversed in scripts like this that have multiple complex parts. Any help and suggestions would be greatly appreciated. Sorry if this is a jumble.
Thanks
This code should work, although it's not exactly optimal; I'm sure you'll figure out how to make it better!
What this code basically does is:
Get all the lines from the input data.
Loop through all the lines and try to recognize the different keys (the start time etc.).
If a key is recognized, get the line beneath it and apply an appropriate function to it.
If a blank line is found, add the current entry to a list so that further entries can be read.
Write the data to a file.
In case you haven't seen string formatting done this way before:
in "{0:} {1:}".format(arg0, arg1), the {0:} is just a way of defining a placeholder for a variable (here: arg0), and the 0 defines which argument to use.
Find out more here:
Python .format docs
Python OrderedDict docs
If you are using a version of Python < 2.7, you might have to install a backport of OrderedDict using pip install ordereddict. If that doesn't work, just change data = OrderedDict() to data = {} and it should still work; the output will then look somewhat different each time it is generated, but it will still be correct.
from sys import argv
from os.path import exists
# since we want to have a somewhat standardized format
# and dicts are unordered by default
try:
    from collections import OrderedDict
except ImportError:
    # python 2.6 or earlier, use backport
    from ordereddict import OrderedDict

def get_time_and_date(time):
    date, time = time.split(",")
    time, time_indic = time.split()
    date = pad_time(date)
    time = "{0:} {1:}".format(pad_time(time), time_indic)
    return time, date

def pad_time(time):
    """Make all the time values look the same, e.g. turn 5:30 AM into 05:30 AM."""
    # if it's a time
    if ":" in time:
        separator = ":"
    # if it's a date
    else:
        separator = "/"
    time = time.split(separator)
    for index, num in enumerate(time):
        if len(num) < 2:
            time[index] = "0" + time[index]
    return separator.join(time)

def filework():
    from_file, to_file = argv[1:]
    data = OrderedDict()
    print "copying from %s to %s" % (from_file, to_file)
    # by using open(...) the file closes automatically
    with open(from_file, "r") as inputfile:
        indata = inputfile.readlines()
    entries = []
    print "the input file is %d bytes long" % len(indata)
    print "does the output file exist? %r" % exists(to_file)
    print "ready, hit RETURN to continue, CTRL-C to abort."
    raw_input()
    for line_num in xrange(len(indata)):
        # make the entire string lowercase to be more flexible,
        # and then remove whitespace
        line_lowered = indata[line_num].lower().strip()
        if "start time" == line_lowered:
            time, date = get_time_and_date(indata[line_num + 1].strip())
            data["StartTime"] = time
            data["StartDate"] = date
        elif "duration" == line_lowered:
            duration = indata[line_num + 1].strip().split()
            # only keep the amount of minutes
            data["Duration"] = duration[0]
        elif "start side" == line_lowered:
            data["StartSide"] = indata[line_num + 1].strip()
        elif "fed on both sides" == line_lowered:
            data["FedOnBothSides"] = indata[line_num + 1].strip()
        elif line_lowered == "":
            # if a blank line is found, prepare for reading a new entry
            entries.append(data)
            data = OrderedDict()
    entries.append(data)
    # create the outfile if it does not exist
    with open(to_file, "w+") as outfile:
        headers = entries[0].keys()
        outfile.write(", ".join(headers) + "\n")
        for entry in entries:
            outfile.write(", ".join(entry.values()) + "\n")

filework()
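One caveat with the hand-rolled ", ".join(...) write step: if a value ever contains a comma, the columns will shift. The standard csv module quotes such values automatically; a minimal Python 3 sketch of the same write step, with a made-up entries list for illustration:

import csv

entries = [
    {"StartTime": "11:59 AM", "StartDate": "08/18/14",
     "Duration": "15", "StartSide": "Left", "FedOnBothSides": "No"},
]

with open("out.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=list(entries[0].keys()))
    writer.writeheader()
    writer.writerows(entries)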

grep in python properly

I am used to scripting in bash, but I am also learning Python.
So, as a way of learning, I am trying to rewrite a few of my old bash scripts in Python. Say I have a file with lines like:
TOTDOS= 0.38384E+02n_Ef= 0.81961E+02 Ebnd 0.86883E+01
To get the value of TOTDOS in bash, I just do:
grep "TOTDOS=" 630/out-Dy-eos2|head -c 19|tail -c 11
but in Python, I am doing:
#!/usr/bin/python3
import re
import os.path
import sys
f1 = open("630/out-Dy-eos2", "r")
re1 = r'TOTDOS=\s*(.*)n_Ef=\s*(.*)\sEbnd'
for line in f1:
    match1 = re.search(re1, line)
    if match1:
        TD = match1.group(1)
f1.close()
print(TD)
This surely gives the correct result, but it seems like much more work than bash (not to mention the trouble with the regex).
The question is: am I overworking this in Python, or am I missing something?
A python script that matches your bash line would be more like this:
with open('630/out-Dy-eos2', 'r') as f1:
    for line in f1:
        if "TOTDOS=" in line:
            print line[8:19]
Looks a little bit better now.
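If the byte offsets ever shift, matching on the field name instead of fixed positions is less brittle; a hedged sketch (same file name; the numeric character class is an assumption about the data):

import re

with open('630/out-Dy-eos2') as f1:
    for line in f1:
        m = re.search(r'TOTDOS=\s*([-+0-9.Ee]+)', line)
        if m:
            print(m.group(1))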
[...] but seems to be much more than bash
Maybe (?) generators are the closest Python concept to the "pipe filtering" used in shell.
import itertools

# Simple generator to iterate through a file;
# the equivalent of line-by-line reading from an input file
def source(fname):
    with open(fname, "r") as f:
        for l in f:
            yield l

src = source("630/out-Dy-eos2")

# First filter, to keep only lines containing the required word
# equivalent to `grep -F`
filter1 = (l for l in src if "TOTDOS=" in l)

# Second filter, to keep only lines in the required range
# equivalent of `head -n ... | tail -n ...`
filter2 = itertools.islice(filter1, 10, 20, 1)

# Finally, output
output = "".join(filter2)
print(output)
Concerning your specific example, if you need it, you could use regexp in a generator:
re1 = r'TOTDOS=\s*(.*)n_Ef=\s*(.*)\sEbnd'
filter1 = (m.group(1) for m in (re.match(re1, l) for l in src) if m)
Those are only (some of) the basic building blocks available to you.
