Splitting a data record into key-value pairs - python

I have a record as below:
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355
0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103
0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I want to split the data into key-value pairs, ignoring the first row (29 16).
The output should be something like this:
x = A , B
y = 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I am able to skip the first line using the code below:
f = open(fileName, 'r')
lines = f.readlines()[1:]
Now how do I split the rest of the record in Python?

So here's my take :D I expect you'd want to have the numbers parsed as well?
def generate_kv(fileName):
    with open(fileName, 'r') as file:
        # ignore first line
        file.readline()
        for line in file:
            if '' == line.strip():
                # empty line
                continue
            values = line.split(' ')
            try:
                yield values[0], [float(x) for x in values[1:]]
            except ValueError:
                print(f'one of the elements was not a float: {line}')
if __name__ == '__main__':
    x = []
    y = []
    for key, value in generate_kv('sample.txt'):
        x.append(key)
        y.append(value)
    print(x)
    print(y)
This assumes that the values in sample.txt look like this:
% cat sample.txt
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
and the output:
% python sample.py
['A', 'B']
[[1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]]
Alternatively, if you want a dictionary, do:
if __name__ == '__main__':
    print(dict(generate_kv('sample.txt')))
That will convert the generated key-value pairs into a dictionary and output:
{'A': [1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], 'B': [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]}
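If you still need the x and y lists from the question rather than a dictionary, a minimal sketch built on the dict above:
d = dict(generate_kv('sample.txt'))
x = list(d)            # ['A', 'B']
y = list(d.values())   # the two lists of floats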

You can use this script if your data is in a text file:
filename = 'file.text'
with open(filename) as f:
    data = f.readlines()[1:]                # skip the first line ("29 16")
x = [line.split()[0] for line in data]      # the keys, e.g. ['A', 'B']
y = [line.split()[1:] for line in data]     # the remaining values of each line

If you're happy to store the data in a dictionary, here is what you can do:
records = dict()
with open(filename, 'r') as f:
    f.readline()  # skip the first line
    for line in f:
        key, value = line.split(maxsplit=1)
        records[key] = value.split()
The structure of records would be:
{
    'A': ['1.2595034', '0.82587254', '0.7375044', ... ],
    'B': ['1.2467299', '0.78651106', '0.4702038', ... ]
}
What's happening
with ... as f opens the file within a context manager. This allows the file to be closed automatically when the block finishes.
Because the open file keeps track of where it is in the file, we can use f.readline() to move the pointer down a line.
line.split() turns a string into a list of strings. With the maxsplit=1 argument it will only split on the first space.
e.g. x, y = 'foo bar baz'.split(maxsplit=1) gives x = 'foo' and y = 'bar baz'
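Note that this keeps the values as strings. If you want floats instead, a minimal sketch that converts the records dictionary built above:
records = {key: [float(v) for v in values] for key, values in records.items()}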

If I understood correctly, you want the numbers to be collected in a list. One way of doing this is:
import string
text = '''
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
'''
lines = text.split('\n')
x = [
    line[1:].strip().split()
    for i, line in enumerate(lines)
    if line and line[0].lower() in string.ascii_letters]
This will produce a list of lists: one inner list per labelled line (A, B, etc.), each containing the number strings associated with that label.
This code assumes that you are interested in lines starting with any single letter (case-insensitive).
For more elaborated conditions you may want to look into regular expressions.
Obviously, if your text is in a file, you could substitute lines = ... with:
with open(filepath, 'r') as lines:
    x = ...
Also, if the items in x should not be split apart but kept as a single string, you may want to replace line[1:].strip().split() with line[1:].strip().
If instead you want the numbers as floats rather than strings, you should replace line[1:].strip().split() with [float(value) for value in line[1:].strip().split()].
EDIT:
Alternatively to line[1:].strip().split() you may want to do:
line.split(maxsplit=1)[1].split()
as suggested in some other answer. This would generalize better if the first token is not a single character.
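For example, with a hypothetical two-character label the two expressions differ:
>>> line = 'AB 1.0 2.0'
>>> line[1:].strip().split()
['B', '1.0', '2.0']
>>> line.split(maxsplit=1)[1].split()
['1.0', '2.0']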

Related

Extract data between two lines from text file

Say I have hundreds of text files like this example:
NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
without ANY structure
HOBBIES
//..etc..
NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.
I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.
This is what I came up with:
lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1]  # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2]  # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20):  # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        #...
This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.
I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, and BIO would store everything until HOBBIES, etc., with the intention of cleaning up and removing extra blank lines later.
Is it possible?
Edit: While I was reading through the answers, I realized I forgot a really significant detail: the keys will sometimes be repeated (in the same order).
That is, a single text file can contain more than one person. A list of persons should be created. The key Name signals the start of a new person.
I did it by storing everything in a dictionary; see the code below.
f = open("test.txt")
lines = f.readlines()
dict_text = {"NAME":[], "DATEOFBIRTH":[], "BIO":[]}
for line_number, line in enumerate(lines):
if not ("NAME" in line or "DATE OF BIRTH" in line or "BIO" in line):
text = line.replace("\n","")
dict_text[location].append(text)
else:
location = "".join((line.split()))
You could use a regular expression:
import re
keys = """
NAME
DATE OF BIRTH
BIO
HOBBIES
""".strip().splitlines()
key_pattern = '|'.join(f'{key.strip()}' for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)
# uncomment to see the pattern
# print(pattern)
with open(filename) as f:
    text = f.read()
parts = pattern.split(text)
... process parts ...
parts will be a list of strings. The odd indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc.) and the even indexed positions (parts[2], parts[4], ...) will be the text in between the keys. parts[0] will be whatever was before the first key.
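For example, if each key appears only once, one way to pair each key with its text is (a minimal sketch, assuming the parts list above):
data = dict(zip(parts[1::2], parts[2::2]))
# data['NAME'] holds the text between 'NAME' and 'DATE OF BIRTH', and so on
If the keys repeat, as mentioned in the edit, you would iterate over the key/text pairs instead of building a single dict.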
Instead of reading lines you could read the whole file into one long string. Use str.index() to find the start index of each trigger word, then take everything from that index up to the index of the next trigger word.
Something like:
def get_data(string, previous_end_index, current_start_index):
    usable_data = string[previous_end_index: current_start_index]
    return usable_data

string = f.read()
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        data = get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end
Better/shorter variable names should probably be used
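A minimal sketch (assuming string, important_words and get_data from above) that collects each section into a dictionary keyed by the trigger word:
sections = {}
last_phrase, last_end = None, None
for phrase in important_words:
    phrase_start = string.index(phrase)
    if last_phrase is not None:
        sections[last_phrase] = get_data(string, last_end, phrase_start)
    last_phrase, last_end = phrase, phrase_start + len(phrase)
sections[last_phrase] = string[last_end:]  # everything after the last trigger word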
You can just read the text in as one long string and then make use of .split().
This will only work if the categories are in order and don't repeat.
Like so:
Categories = ["NAME", "DATE OF BIRTH", "BIO", "HOBBIES"]  # in the order they appear in the text
Output = {}
Text = f.read()
for i in range(1, len(Categories)):
    SplitText = Text.split(Categories[i])
    Output.update({Categories[i-1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})
You can try the following.
keys = ["NAME","DATE OF BIRTH","BIO","HOBBIES"]
f = open("data.txt", "r")
result = {}
for line in f:
line = line.strip('\n')
if any(v in line for v in keys):
last_key = line
else:
result[last_key] = result.get(last_key, "") + line
print(result)
Output
{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}

Adding words between lines to an array

This is the content of my file:
david C001 C002 C004 C005 C006 C007
* C008 C009 C010 C011 C016 C017 C018
* C019 C020 C021 C022 C023 C024 C025
anna C500 C521 C523 C547 C555 C556
* C557 C559 C562 C563 C566 C567 C568
* C569 C571 C572 C573 C574 C575 C576
* C578
charlie C701 C702 C704 C706 C707 C708
* C709 C712 C715 C716 C717 C718
I want my output to be:
david=[C001,C002,C004,C005,C006,C007,C008,C009,C010,C011,C016,C017,C018,C019,C020,C021,C022,C023,C024,C025]
anna=[C500,C521,C523,C547,C555,C556,C557,C559,C562,C563,C566,C567,C568,C569,C571,C572,C573,C574,C575,C576,C578]
charlie=[C701,C702,C704,C706,C707,C708,C709,C712,C715,C716,C717,C718]
I am able to create:
david=[C001,C002,C004,C005,C006,C007]
anna=[C500,C521,C523,C547,C555,C556]
charlie=[C701,C702,C704,C706,C707,C708]
by counting the number of words in a line, using line[0] as the array name, and adding the remaining words to the array.
However, I don't know how to add the continuation words from the following lines starting with "*" to the same array.
Can anyone help?
NOTE: This solution relies on dictionaries preserving insertion order, which is an implementation detail of CPython 3.6 and only guaranteed from Python 3.7 onwards.
Somewhat naive approach:
from collections import defaultdict

# Create a dictionary of people
people = defaultdict(list)
# Open up your file in read-only mode
with open('your_file.txt', 'r') as f:
    # Iterate over all lines, stripping them and splitting them into words
    for line in filter(bool, map(str.split, map(str.strip, f))):
        # Retrieve the name of the person
        # either from the current line or use the name of the last person processed
        name, words = list(people)[-1] if line[0] == '*' else line[0], line[1:]
        # Add all remaining words to that person's record
        people[name].extend(words)
print(people['anna'])
# ['C500', 'C521', 'C523', 'C547', 'C555', 'C556', 'C557', 'C559', 'C562', 'C563', 'C566', 'C567', 'C568', 'C569', 'C571', 'C572', 'C573', 'C574', 'C575', 'C576', 'C578']
It also has the additional benefit of returning an empty list for unknown names:
print(people['matt'])
# []
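If you would rather not rely on dictionary ordering at all, a minimal variant that tracks the current name explicitly (same file layout assumed):
people = {}
current = None
with open('your_file.txt', 'r') as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue                        # skip blank lines
        if parts[0] == '*':                 # continuation line
            people[current].extend(parts[1:])
        else:                               # a new person starts
            current = parts[0]
            people[current] = parts[1:]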
You could read the lists into a dictionary using regular expressions:
import re
with open('file_name') as file:
    contents = file.read()
res_list = re.findall(r"[a-z]+\s+[^a-z]+", contents)
res_dict = {}
for p in res_list:
    elt = p.split()
    res_dict[elt[0]] = [e for e in elt[1:] if e != '*']
print(res_dict)
I figured out a way myself. Thanks to the ones who gave their own solution. It gave me new perspective.
Below is my code:
persons_library = {}
persons = ['david', 'anna', 'charlie']
for i, person in enumerate(persons, start=0):
    persons_library[person] = []
with open('data.txt', 'r') as f:
    for line in f:
        line = line.replace('*', "")
        line = line.split()
        for i, val in enumerate(line, start=0):
            if val in persons_library:
                key = val
            else:
                persons_library[key].append(val)
print(persons_library)

A loop that is supposed to write lines to a file isn't working

I have a very large file that looks like this:
[original file][1]
field number 7 (info) contains ~100 pairs of X=Y separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int- I want to put them them in a dictionary.
After finishing going through the pairs I want to check if the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code that is supposed to do that:
for row in reader:
    m = re.split(';', row[7])  # split the info field by ';'
    d = {}
    nl = []
    for c in m:  # for each info field, split by '=', if it is one of the 4 fields wanted and the value is int- add it to a dict
        t = re.split('=', c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d:  # if the dict contain all 4 wanted fields- make a new line for the final file
        total_ac = int(d['AC_MALE']) + int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE']) + int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0], row[1], row[3], row[4], total_ac, total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is a string, always, because you just split that object from another string. It doesn't matter here if the string contains only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=', c)
    if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE'):
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need to use re.split(); use the standard str.split() method instead. You also don't need to test whether all keys are present in your dictionary afterwards; just test whether the dictionary contains 4 elements, i.e. has a length of 4. You can also simplify the test on the key name:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):
        key, value = key_value.split('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])

Creating a Dictionary from txt

I have a couple of lines inside a text file, and I'm looking to turn the first word of each line into a key (the words are separated by spaces) with a function, with the rest following as values.
This is what the text contains:
FFFB10 11290 Charlie
1A9345 37659 Delta
221002 93323 Omega
The idea is to turn the first word into a key, but also to arrange it visually (row underneath row), so the first word (FFFB10) is the key and the rest are values, meaning:
Entered: FFFB10
Location: 11290
Name: Charlie
I tried with this as a beginning:
def code(codeenter, file):
    for line in file.splitlines():
        if codeenter in line:
            parts = line.split(' ')
But I don't know how to continue (I erased most of the code), any suggestions?
Assuming you managed to extract a list of lines without the newline character at the end.
def MakeDict(lines):
    return {key: (location, name) for key, location, name in (line.split() for line in lines)}
This is an ordinary dictionary comprehension with a generator expression: the former is everything inside the curly braces and the latter is inside the innermost parentheses. line.split() splits a line with whitespace as the delimiter.
Example run:
>>> data = '''FFFB10 11290 Charlie
... 1A9345 37659 Delta
... 221002 93323 Omega'''
>>> lines = data.split('\n')
>>> lines
['FFFB10 11290 Charlie', '1A9345 37659 Delta', '221002 93323 Omega']
>>> def MakeDict(lines):
...     return {key: (location, name) for key, location, name in (line.split() for line in lines)}
...
>>>
>>> MakeDict(lines)
{'FFFB10': ('11290', 'Charlie'), '1A9345': ('37659', 'Delta'), '221002': ('93323', 'Omega')}
How to format the output:
for key, values in MakeDict(lines).items():
    print("Key: {}\nLocation: {}\nName: {}".format(key, *values))
See ForceBru's answer on how to construct the dictionary. Here's the printing part:
for k, (v1, v2) in your_dict.items():
    print("Entered: {}\nLocation: {}\nName: {}\n".format(k, v1, v2))
You can try this:
f = [i.strip('\n').split() for i in open('filename.txt')]
final_dict = {i[0]:i[1:] for i in f}
Assuming the data is structured like this:
FFFB10 11290 Charlie
1A9345 37659 Delta
221002 93323 Omega
Your output will be:
{'FFFB10': ['11290', 'Charlie'], '221002': ['93323', 'Omega'], '1A9345': ['37659', 'Delta']}
You may want to consider using namedtuple.
from collections import namedtuple
code = {}
Code = namedtuple('Code', 'Entered Location Name')
filename = '/Users/ca_mini/Downloads/junk.txt'
with open(filename, 'r') as f:
    for row in f:
        row = row.split()
        code[row[0]] = Code(*row)
>>> code
{'1A9345': Code(Entered='1A9345', Location='37659', Name='Delta'),
'221002': Code(Entered='221002', Location='93323', Name='Omega'),
'FFFB10': Code(Entered='FFFB10', Location='11290', Name='Charlie')}
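A nice side effect of the namedtuple is that the fields can then be read by name:
>>> code['FFFB10'].Location
'11290'
>>> code['FFFB10'].Name
'Charlie'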

Importing data from a text file using python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.
Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
import struct
def parsefile(filename):
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            fields = struct.unpack('11s11s8s8s5s', line)
            if 'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))
Usage:
if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field
Test data:
1234567890a1234567890a123456781234567812345
something maybe OW d 111111118888888855555
aaaaa bbbbb 1234 1212121233333
other thinganother OW 121212 6666666644444
Output:
(88888888, 55555)
(66666666, 44444)
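Note that this answer is written in Python 2. In Python 3, struct.unpack expects bytes rather than str and print is a function, so a minimal adaptation of the same parser (a sketch) would be:
import struct

def parsefile(filename):
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            fields = struct.unpack('11s11s8s8s5s', line.encode())  # unpack wants bytes in Python 3
            if b'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))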
In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11,22), slice(30,38), slice(38,44)]
myfile = open('some/file/path')
for line in myfile:
    fields = [line[column].strip() for column in columns]
    if "OW" in fields[0]:
        value1 = int(fields[1])
        value2 = int(fields[2])
        ....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
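For reference, indexing with a slice object is exactly equivalent to the usual colon syntax:
>>> s = 'hello world'
>>> col = slice(6, 11)
>>> s[col]  # same as s[6:11]
'world'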
Here's a function which might help you:
def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size:  # EOF
                return
            row[key] = value
        yield row
for an example of how it's used:
from StringIO import StringIO
sample = StringIO("""aaabbbccc
d e f
g h i
""")
for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)
Note that unlike the other answers, this example is not line-delimited (it uses the file purely as a provider of bytes, not as an iterator of lines). Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either; the newline is accounted for explicitly.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
    stuff()
but you can obviously test any field for any value as you see fit.
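As with the struct answer above, this example is written in Python 2. Under Python 3 the only changes needed are the import and the print call (a sketch, reusing the rows() helper above):
from io import StringIO  # StringIO lives in the io module in Python 3

sample = StringIO("""aaabbbccc
d e f
g h i
""")
for row in rows(sample, [('first', 3), ('second', 3), ('third', 4)]):
    print(repr(row))  # print is a function in Python 3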
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])
for num1, num2 in entries:
    ...  # whatever
entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))
entries is a list containing tuples of the last two columns.
edit: oops
