Python - formatting output with varying numbers of spaces

I'm new to Python and am struggling to find a way to format output like the below into CSV.
The following code runs an expect script that yields columns separated by varying numbers of spaces.
out = subprocess.check_output([get_script, "|", "grep Up"], shell=True)
print out
1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356
What I want is to remove all whitespace and add a "," separator.
1501,4122:1501,Mesh,1.2.3.4,Up,262075,261927
1502,4121:1502,Mesh,1.2.3.5,Up,262089,261552
1502,4122:1502,Spok,1.2.3.6,Up,262074,261784
701000703,4121:701000703,Mesh,1.2.3.7,Up,262081,261356
I can achieve this via awk with awk -v OFS=',' '{$1=$1};1' but am struggling to find a python equivalent.
Any guidance is most appreciated!

Split each line on whitespace and then join the result with a comma:
# The commented out step is needed if out is not a list of lines already
# out=out.strip().split('\n')
for line in out:
    print ','.join(line.split())
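If you also want the result in a file rather than on stdout, here is a minimal sketch, assuming out is the raw multi-line string returned by check_output (output.csv is a name picked for illustration):
# Hedged sketch: split the raw output into lines, then write one CSV row per line
lines = out.strip().split('\n')
with open('output.csv', 'w') as f:
    for line in lines:
        f.write(','.join(line.split()) + '\n')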

You can convert out as follows:
import csv
import StringIO
out = """1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356"""
csv_input = csv.reader(StringIO.StringIO(out), delimiter=' ', skipinitialspace=True)
with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(csv_input)
Giving you an output.csv file containing:
1501,4122:1501,Mesh,1.2.3.4,Up,262075,261927
1502,4121:1502,Mesh,1.2.3.5,Up,262089,261552
1502,4122:1502,Spok,1.2.3.6,Up,262074,261784
701000703,4121:701000703,Mesh,1.2.3.7,Up,262081,261356
StringIO is used to make your out string appear as a file object for the csv module to use.
Tested using Python 2.7.9
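On Python 3 the same approach should work with io.StringIO and a text-mode file; this is a hedged sketch, not part of the tested answer above:
import csv
import io

out = "1501   4122:1501   Mesh   1.2.3.4   Up   262075   261927"

# io.StringIO plays the file-object role that StringIO.StringIO plays on Python 2
csv_input = csv.reader(io.StringIO(out), delimiter=' ', skipinitialspace=True)
with open('output.csv', 'w', newline='') as f_output:
    csv.writer(f_output).writerows(csv_input)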

This should be possible by using the sub function in the re (regex) library.
import re
table_str = """
1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356
"""
re.sub(" +", ",", table_str)
The " +" string literal in the sub function does a greedy match on all whitespace.

Related

Convert string to multidimensional array in Python

I'm having a problem managing some data that are saved in a really awful format.
I have data for points that correspond to the edges of a polygon. The data for each polygon is separated by the string >, while the x and y values of the points are separated inconsistently, sometimes by a number of spaces, sometimes by some spaces and a tab. I've tried to load such data into an array of arrays with the following code:
f = open('/Path/Data.lb','r')
data = f.read()
splat = data.split('>')
region = []
for number, polygon in enumerate(splat[1:len(splat)], 1):
    region.append(float(polygon))
But I keep getting an error trying to run the float() function (I've cut it as it's much longer):
ValueError: could not convert string to float: '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n ...
... -73.324\t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
Is there a way to convert the data to float without modifying the source file? If not, is there a way to easily modify the source file so that all columns are separated the same way? I guess that way the same code should run smoothly.
Thanks!
Try:
with open("PataIce.lb", "r") as file:
polygons = file.read().strip(">").strip().split(">")
region =[]
for polygon in polygons:
sides = polygon.strip().split("\n")
points = [[float(num) for num in side.split()[:2]] for side in sides]
region.append(points)
Some of the points contain more than 2 values and I've restricted the script to only read the first two numbers in these cases.
You can use regex to match decimal numbers.
import re
PATH = <path_to_file>
coords = []
with open(PATH) as f:
    for line in f:
        nums = re.findall(r'-?\d+\.\d+', line)
        if len(nums) > 0:
            coords.append(nums)
print(coords)
Note: this solution ignores the trailing 0 at the end of some lines.
Be aware that the results in coords are still strings. You'll need to convert them to float using float().
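For example, a small sketch of that conversion on the coords list built above:
# Convert the matched strings to floats, row by row
coords = [[float(num) for num in line] for line in coords]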
In [79]: astr = '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n -73.324\t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
In [80]: lines = astr.splitlines()
In [81]: lines
Out[81]:
['',
' -73.311 -48.328',
' -73.311 -48.326',
' -73.318 -48.321',
' -73.324\t -48.353',
' -73.315\t -48.344',
' -73.313\t -48.337']
splitlines deals with the \n separator; split() handles the tab and spaces.
In [82]: [line.split() for line in lines]
Out[82]:
[[],
['-73.311', '-48.328'],
['-73.311', '-48.326'],
['-73.318', '-48.321'],
['-73.324', '-48.353'],
['-73.315', '-48.344'],
['-73.313', '-48.337']]
The initial [] needs to be removed one way or another:
In [84]: np.array(Out[82][1:], dtype=float)
Out[84]:
array([[-73.311, -48.328],
[-73.311, -48.326],
[-73.318, -48.321],
[-73.324, -48.353],
[-73.315, -48.344],
[-73.313, -48.337]])
This works only if each line has the same number of elements, in this case 2. As long as the list of strings in Out[82] is clean enough, you can let np.array do the conversion from string to float.
Your actual file may require some further handling, but this should give you an idea of the basics.

How to reconstruct and change structure of a dataset using python?

I have a dataset and I need to restructure some of its data into a new style.
My dataset is something like the below (stored in a file named train1.txt):
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
I need to convert it to the below style (and store it in a new file as train.txt):
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….
My python version is 2.7.13
My operating system is Ubuntu 14.04 LTS
I would appreciate any help.
Thank you so much.
I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.
import re

def return_no_commas(string):
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)
numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what everything does.
import re
just imports regular expressions. The regular expression I wrote is
regex = r'\d+'
the "r" at the beginning marks a raw string so the backslash reaches the regex engine intact; "\d" matches a digit and "+" lets it repeat one or more times, so each unbroken run of digits comes back as one match. (With "*", which means zero or more, findall would also return an empty match at every non-digit position and print blank lines.) Then we print out all the matches.
I saved your numbers in a string called numbers, but you could just as easily read in a file and work with those contents.
You'll get something like:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
It sounds to me like your original data is separated by commas. However, you want the data separated by new-line characters (\n) instead. This is very easy to do.
def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read from
    wfilename -- name of file to write to
    """
    assert rfilename != wfilename
    # open two files, one in read-mode,
    # the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    # rstryng == "1,2,3,4"
    # lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    wfile.writelines(wstryng)
    rfile.close()
    wfile.close()
    return

convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
Since others have added answers, I will include one using numpy.
If you are ok using numpy, it is as simple as:
data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of a numpy array:
data.tolist()
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]
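If you then need the numbers written one per line into train.txt as the question asks, np.savetxt should handle it; a hedged sketch:
import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
# a 1-D array is written one value per line; '%d' keeps plain integers
np.savetxt('train.txt', data, fmt='%d')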

How to solve problem decoding from wrong json format

Hi everyone. I need help opening and reading the file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written into a txt file using json.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it, check for repeated IDs, delete the duplicates, and save it again.
But json.loads keeps raising ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I still get that error, just in a different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{', ', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/jsons in a row.
Maybe there is a better way to do this?
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (found in virtually any GNU/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best bet is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json
input_file = '111111111.txt'
output_file = 'new_file.txt'
data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key has not been seen then add it for further manipulation,
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file's structure and actual JSON format is the missing commas between objects, and that the objects are not enclosed within [ ]. So the same can be achieved with the below code snippet:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing brackets and drop the trailing comma added in the previous step
b = '[' + b[:-1] + ']'
x = json.loads(b)
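As an alternative to the string surgery in both answers above, the standard library's json.JSONDecoder.raw_decode can walk a stream of concatenated JSON objects directly; a hedged sketch (the merge rule of keeping the first occurrence of each ID is an assumption):
import json

def iter_json_objects(text):
    # raw_decode returns (object, end_index), so we can hop from object to object
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
        # skip any whitespace between concatenated objects
        while idx < len(text) and text[idx].isspace():
            idx += 1

with open('111111111.txt', encoding='utf8') as f:
    merged = {}
    for chunk in iter_json_objects(f.read()):
        for key, value in chunk.items():
            merged.setdefault(key, value)  # first occurrence of an ID wins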

Import .dat file into 2d array

I need to import a .dat file into a 2D array. There are 15 lines of text in the file before the data begins, and the values are separated by a space. Once imported, I have to plot the data and eliminate noise from it; however, I can't seem to get the data to import properly.
This is my attempt:
import numpy as np
data = np.array
def ProcessData(data):
    data = np.loadtxt("myfile.dat", delimiter = " ", skiprows = 15)
    print data
ProcessData(data)
>>>ValueError: invalid literal for float(): 10.0 0.0
10.0 and 0.0 are the first two values for each column of data.
Can anyone point out what might be wrong?
Don't use delimiter=" ". Use the default, delimiter=None. Then any block of whitespace is treated as a separator of the data fields.
When you use delimiter=" ", each space is significant, so a line such as "1   2" contains the fields '1', '', '', '2'. When you use delimiter=None, those three spaces are treated as a single separator.
In your case, I suspect the fields in your file are separated by tabs, not spaces. A tab is also whitespace, so it will be treated as a separator when you use delimiter=None.
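A hedged sketch of the suggested fix, reusing the file name from the question:
import numpy as np

# delimiter=None (the default) splits on any run of whitespace, spaces or tabs
data = np.loadtxt("myfile.dat", delimiter=None, skiprows=15)
print(data)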

Process each line of text file using Python

I am fairly new to Python. I have a text file containing many blocks of data in the following format, along with other unnecessary blocks.
NOT REQUIRED :: 123
Connected Part-1:: A ~$
Connected Part-3:: B ~$
Connector Location:: 100 200 300 ~$
NOT REQUIRED :: 456
Connected Part-2:: C ~$
I wish to extract the info (A, B, C, 100 200 300) corresponding to each property (Connected Part-1, Connector Location) and store it as a list to use later. I have prepared the following code, which reads the file, cleans each line, and stores the result as a list.
import fileinput
with open('C:/Users/file.txt') as f:
    content = f.readlines()

for line in content:
    if 'Connected Part-1' in line or 'Connected Part-3' in line:
        if 'Connected Part-1' in line:
            connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
            print ('PART_1:', connected_part_1)
        if 'Connected Part-3' in line:
            connected_part_3 = [s.strip(' \n ~ $ Connected Part -3 ::') for s in content]
            print ('PART_3:', connected_part_3)
    if 'Connector Location' in line:
        # removing unwanted characters and converting into the list
        content_clean_1 = [s.strip('\n ~ $ Connector Location::') for s in content]
        # converting a single string item in list to a string
        s = " ".join(content_clean_1)
        # splitting the string and converting into a list
        weld_location = s.split(" ")
        print ('POSITION', weld_location)
here is the output
PART_1: ['A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
POSITION ['d', 'Part-1::', 'A', '\t\tConnector', 'Location::', '100.00', '200.00', '300.00', '\t\tConnected', 'Part-3::', 'C~\t']
PART_3: ['1:: A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
From the output of this program, I may conclude that since 'content' is a string consisting of all the characters in the file, the program is not reading individual lines; instead it is treating all the text as a single string. Could anyone please help in this case?
I am expecting following output:
PART_1: ['A']
PART_3: ['C']
POSITION: ['100.00', '200.00','300.00']
(Note: when I use individual files containing a single line of data, it works fine.) Sorry for such a long question.
I will try to make this clear and show how I would do it without regex. First of all, the biggest issue with the code presented is that the string.strip call is applied to the entire content list:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
content holds all of the file's lines; I think you simply want something like:
connected_part_1 = [line.strip(' \n ~ $ Connected Part -1 ::')]
How to parse the file is a bit subjective, but given the file format posted as input, I would do it like this:
templatestr = "{}: {}"

with open('inputreadlines.txt') as f:
    content = f.readlines()

for line in content:
    label, value = line.split('::')
    ltokens = label.split()
    if ltokens[0] == 'Connected':
        print(templatestr.format(
            ltokens[-1],            # the last word of the label
            value.split()[:-1]))    # the split value without the last word '~$'
    elif ltokens[0] == 'Connector':
        print(value.split()[:-1])   # the split value without the last word '~$'
    else:  # NOT REQUIRED
        pass
You can use the string.strip function to remove the funny characters '~$' instead of removing the last token as in the example.
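For instance, a small sketch of that strip variant on one value from the sample input (the sample value is hypothetical):
value = ' A ~$\n'                     # what line.split('::')[1] yields for "Connected Part-1:: A ~$"
cleaned = value.strip().strip('~$ ')  # peel whitespace, then the trailing '~$'
print(cleaned)                        # -> 'A'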
