I'm having a problem managing some data that are saved in a really awful format.
I have data for points that correspond to the edges of a polygon. The data for each polygon is separated by the string >, while the x and y values of each point are separated inconsistently: sometimes by several spaces, sometimes by spaces and a tab. I've tried to load the data into an array of arrays with the following code:
f = open('/Path/Data.lb','r')
data = f.read()
splat = data.split('>')
region = []
for number, polygon in enumerate(splat[1:len(splat)], 1):
    region.append(float(polygon))
But I keep getting an error trying to run the float() function (I've cut it as it's much longer):
ValueError: could not convert string to float: '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n ...
... -73.324\t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
Is there a way to convert the data to float without modifying the source file? If not, is there a way to easily modify the source file so that all columns are separated the same way? I guess that way the same code should run smoothly.
Thanks!
Try:
with open("PataIce.lb", "r") as file:
    polygons = file.read().strip(">").strip().split(">")
region = []
for polygon in polygons:
    sides = polygon.strip().split("\n")
    points = [[float(num) for num in side.split()[:2]] for side in sides]
    region.append(points)
Some of the points contain more than 2 values and I've restricted the script to only read the first two numbers in these cases.
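On the second part of the question (making all columns use the same separator), here is a minimal sketch. It assumes every data line holds whitespace-separated numbers and that separator lines start with >. The function name normalise_columns and the sample text are made up for illustration.

```python
def normalise_columns(text, sep="\t"):
    """Rewrite each data line so its fields are joined by a single separator."""
    out_lines = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):
            # keep polygon separator lines, minus surrounding whitespace
            out_lines.append(line.strip())
        elif line.strip():
            # str.split() with no argument splits on any run of spaces/tabs
            out_lines.append(sep.join(line.split()))
        # blank lines are dropped
    return "\n".join(out_lines) + "\n"


# made-up sample mixing spaces and tabs, as in the question
sample = ">\n -73.311   -48.328\n -73.324\t -48.353\n>\n -73.315\t -48.344\n"
cleaned = normalise_columns(sample)
print(cleaned)
```

Writing cleaned to a new file leaves the source file untouched.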
You can use regex to match decimal numbers.
import re
PATH = <path_to_file>
coords = []
with open(PATH) as f:
    for line in f:
        # raw string so that \d and \. reach the regex engine unchanged
        nums = re.findall(r'-?\d+\.\d+', line)
        if len(nums) > 0:
            coords.append(nums)
print(coords)
Note: this solution ignores the trailing 0 at the end of some lines.
Be aware that the results in coords are still strings. You'll need to convert them to float using float().
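A small sketch of that conversion for a single line. The sample line is taken from the error message in the question; the variable names are only for illustration.

```python
import re

line = " -73.324\t -48.353\n"
nums = re.findall(r'-?\d+\.\d+', line)   # list of strings
point = [float(n) for n in nums]         # list of floats
print(point)
```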
In [79]: astr = '\n -73.311 -48.328\n -73.311 -48.326\n -73.318 -48.321\n -73.324\t -48.353\n -73.315\t -48.344\n -73.313\t -48.337\n'
In [80]: lines = astr.splitlines()
In [81]: lines
Out[81]:
['',
' -73.311 -48.328',
' -73.311 -48.326',
' -73.318 -48.321',
' -73.324\t -48.353',
' -73.315\t -48.344',
' -73.313\t -48.337']
splitlines deals with the \n separator; split() handles the tab and spaces.
In [82]: [line.split() for line in lines]
Out[82]:
[[],
['-73.311', '-48.328'],
['-73.311', '-48.326'],
['-73.318', '-48.321'],
['-73.324', '-48.353'],
['-73.315', '-48.344'],
['-73.313', '-48.337']]
The initial [] needs to be removed one way or other:
In [84]: np.array(Out[82][1:], dtype=float)
Out[84]:
array([[-73.311, -48.328],
[-73.311, -48.326],
[-73.318, -48.321],
[-73.324, -48.353],
[-73.315, -48.344],
[-73.313, -48.337]])
This works only if each line has the same number of elements, here 2. As long as the lists of strings in Out[82] are clean enough, you can let np.array do the conversion from string to float.
Your actual file may require some further handling, but this should give you an idea of the basics.
I have a dataset and I need to convert its data to a new layout.
My dataset is something like below (Stored in a file named train1.txt):
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
I need to convert it to the style below and store it in a new file named train.txt:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….
My python version is 2.7.13
My operating system is Ubuntu 14.04 LTS
I would appreciate any help.
Thank you so much.
I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.
import re
def return_no_commas(string):
    regex = r'\d*'
    matches = re.findall(regex, string)
    for match in matches:
        # \d* also matches the empty string, so skip empty matches
        if match:
            print(match)
numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what everything does.
import re
just imports regular expressions. The regular expression I wrote is
regex = r'\d*'
the "r" at the beginning makes it a raw string, so the backslash reaches the regex engine unchanged. The "\d" part matches any digit and the "*" part says it can repeat any number of times, including zero. Then we print out all the matches.
I saved your numbers in a string called numbers, but you could just as easily read in a file and work with its contents.
You'll get something like:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
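One caveat about the pattern: * means zero or more repetitions, so \d* can also match the empty string at positions where there is no digit, and those empty matches print as blank lines. A small sketch of the difference, using a shortened sample string rather than the full dataset:

```python
import re

text = "2342728, 2414939, 971"
with_star = re.findall(r'\d*', text)   # also contains '' entries
with_plus = re.findall(r'\d+', text)   # only runs of one or more digits
print(with_plus)
```

With \d+ there is nothing to filter out before printing or writing the numbers.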
It sounds to me like your original data is separated by commas. However, you want the data separated by new-line characters (\n) instead. This is very easy to do.
def covert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read-from
    wfilename -- name of file to write-to
    """
    assert(rfilename != wfilename)
    # open two files, one in read-mode
    # the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    # rstryng == "1,2,3,4"
    # lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    wfile.write(wstryng)
    rfile.close()
    wfile.close()
    return
covert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
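The same idea can also be written with with blocks, so both files are closed even if an error occurs part-way. This is only a sketch: comma_to_newline is a hypothetical name, and the temporary files stand in for train1.txt and train.txt.

```python
import os
import tempfile


def comma_to_newline(rfilename, wfilename):
    """Copy comma-separated values from one file to another, one per line."""
    with open(rfilename, "r") as rfile:
        # drop empty items such as the one produced by a trailing comma
        items = [s.strip() for s in rfile.read().split(",") if s.strip()]
    with open(wfilename, "w") as wfile:
        wfile.write("\n".join(items) + "\n")
    return items


# demo with temporary files (stand-ins for train1.txt / train.txt)
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "train1.txt")
dst = os.path.join(tmpdir, "train.txt")
with open(src, "w") as f:
    f.write("2342728, 2414939, 2397722\n")
comma_to_newline(src, dst)
with open(dst) as f:
    result = f.read()
print(result)
```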
Since others have added answers, I will include one using numpy.
If you are ok using numpy, it is as simple as:
import numpy as np
data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of numpy array,
data.tolist()
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]
Hi everyone. I need help opening and reading a file.
I have this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written to a txt file using json.
# This is how I dump the data into the txt file
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all one string....
I need to open it, check for repeated IDs, delete them and save it again.
But json.loads raises ValueError: Extra data
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But still getting that error, just in different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()
before_log = before_log.replace('}{',', ').split(', ')
mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/jsons in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file is 9.5 MB, so it would take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see its structure and what is causing the decode problem,
# but I prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON objects, and the simplest fix is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json
input_file = '111111111.txt'
output_file = 'new_file.txt'
data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')
seen, total_keys, to_write = set(), 0, {}
# split the lines of the in memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})
# write the dict's keys & values into a new file as a JSON format
# ('w' so that re-running the script does not append a second object)
with open(output_file, mode='w', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')
print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
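To avoid the }{ problem at the source, one option is to append a newline after every dump so that each object sits on its own line. A sketch under that assumption: append_record and read_records are hypothetical helper names, and the temporary file stands in for before_log.txt.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one dict to the file as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Read the file back, one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# demo: two separate appends, then read everything back
log_path = os.path.join(tempfile.mkdtemp(), "before_log.txt")
append_record(log_path, {"id0": "url0"})
append_record(log_path, {"id1": "url1"})
records = read_records(log_path)
print(records)
```

Each line can then be parsed on its own with json.loads, so no string replacement is needed when reading.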
The basic difference between the file structure and valid JSON is that the objects are not separated by commas and are not enclosed in [ ]. So the same can be achieved with the code snippet below. Note that it replaces every }, so it assumes no } appears inside the values.
import json
with open('json_file.txt') as f:
    # Read complete file
    a = (f.read())
    # Convert into single line string
    b = ''.join(a.splitlines())
    # Add , after each object
    b = b.replace("}", "},")
    # Add opening and closing brackets and ignore last comma added in prev step
    b = '[' + b[:-1] + ']'
x = json.loads(b)
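Another option that avoids string replacement altogether is json.JSONDecoder.raw_decode from the standard library, which parses one object and returns the index where it ended. The sketch below is not taken from the answers above; iter_concatenated_json and merge_unique are hypothetical names, the sample string is made up, and keeping the first value seen for a repeated ID is an assumption about what "delete repeated" should mean.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON object from a string like '{...}{...}{...}'."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def merge_unique(text):
    """Merge all objects into one dict, keeping the first value per key."""
    merged = {}
    for obj in iter_concatenated_json(text):
        for key, value in obj.items():
            merged.setdefault(key, value)
    return merged


sample = '{"id0": "url0", "id1": "url1"}{"id1": "url1-again", "id2": "url2"}'
merged = merge_unique(sample)
print(json.dumps(merged))
```

Because the decoder tracks where each object ends, a } or }{ inside a URL string would not confuse it.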