Python: parsing texts in a .txt file

I have a text file like this.
1 firm A Manhattan (company name) 25,000
SK Ventures 25,000
AEA investors 10,000
2 firm B Tencent collaboration 16,000
id TechVentures 4,000
3 firm C xxx 625
(and so on)
I want to make a matrix form and put each item into the matrix.
For example, the first row of matrix would be like:
[[1,Firm A,Manhattan,25,000],['','',SK Ventures,25,000],['','',AEA investors,10,000]]
or,
[[1,'',''],[Firm A,'',''],[Manhattan,SK Ventures,AEA Investors],[25,000,25,000,10,000]]
To do so, I want to parse the text on each line of the file. For example, from the first line, I can create [1,firm A, Manhattan, 25,000]. However, I can't figure out how exactly to do it. Every field starts at the same position, but ends at a different position. Is there a good way to do this?
Thank you.

Well if you know all of the start positions:
# 0123456789012345678901234567890123456789012345678901234567890
# 1 firm A Manhattan (company name) 25,000
# SK Ventures 25,000
# AEA investors 10,000
# 2 firm B Tencent collaboration 16,000
# id TechVentures 4,000
# 3 firm C xxx 625
# Field #1 is 8 wide (0 -> 7)
# Field #2 is 15 wide (8 -> 22)
# Field #3 is 19 wide (23 -> 41)
# Field #4 is arbitrarily wide (42 -> end of line)
field_lengths = [8, 15, 19]
data = []
with open('/path/to/file', 'r') as f:
    for row in f:
        # Strip only the newline: leading spaces mark the column position
        row = row.rstrip('\n')
        pieces = []
        for x in field_lengths:
            piece = row[:x].strip()
            pieces.append(piece)
            row = row[x:]
        pieces.append(row.strip())
        data.append(pieces)
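A quick way to check the slicing logic without a file is to run it on a few in-memory lines. The sketch below is only an illustration, not the asker's real layout: the widths 4, 8 and 16, the helper name split_fixed, and the sample records are all made up for the example.

```python
# Hypothetical widths of the first three columns; the last runs to end of line.
WIDTHS = [4, 8, 16]


def split_fixed(line, widths):
    """Cut one fixed-width line into len(widths) + 1 stripped fields."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    fields.append(line[start:].strip())
    return fields


# Build sample lines by padding each value to its column width (assumed data).
records = [
    ("1", "firm A", "Manhattan", "25,000"),
    ("", "", "SK Ventures", "25,000"),
    ("2", "firm B", "id TechVentures", "4,000"),
]
lines = [
    "".join(value.ljust(width) for value, width in zip(record, WIDTHS)) + record[-1]
    for record in records
]

rows = [split_fixed(line, WIDTHS) for line in lines]
for row in rows:
    print(row)
```

Continuation lines come back with empty strings in the first two fields, which matches the first arrangement asked for in the question.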

From what you've given as data*, the input changes depending on whether a line starts with a number or a space, and the data can be separated as
(numbers)(spaces)(letters with 1 space)(spaces)(letters with 1 space)(spaces)(numbers+commas)
or
(spaces)(letters with 1 space)(spaces)(numbers+commas)
That's what the two regexes below look for, and they build a dictionary with indexes from the leading numbers, each having a firm name and a list of company and value pairs.
I can't really tell what your matrix arrangement is.
import re
import pprint
data = {}
with open('data.txt') as f:
    for line in f:
        if re.match(r'^\d', line):
            # Line starts with a number: index, firm, first company, value
            matches = re.findall(r'^(\d+)\s+((\S\s|\s\S|\S)+)\s\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            idx, firm, x, company, y, value = matches[0]
            data[idx] = {}
            data[idx]['firm'] = firm.strip()
            data[idx]['company'] = [(company.strip(), value)]
        else:
            # Continuation line: another company and value for the current index
            matches = re.findall(r'\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            company, x, value = matches[0]
            data[idx]['company'].append((company.strip(), value))
pprint.pprint(data)
->
{'1': {'company': [('Manhattan (company name)', '25,000'),
('SK Ventures', '25,000'),
('AEA investors', '10,000')],
'firm': 'firm A'},
'2': {'company': [('Tencent collaboration', '16,000'),
('id TechVentures', '4,000')],
'firm': 'firm B'},
'3': {'company': [('xxx', '625')],
'firm': 'firm C'}
}
* This works on your example, but it may not work on your real data very well. YMMV.
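If the first arrangement from the question is what's wanted, a dictionary of this shape can be flattened back into rows afterwards. This is a rough sketch under that assumption; the function name to_rows is made up, and the sample dictionary is just a shortened copy of the output above.

```python
def to_rows(data):
    """Flatten {idx: {'firm': ..., 'company': [(name, value), ...]}} into rows."""
    rows = []
    for idx, entry in data.items():
        for position, (company, value) in enumerate(entry['company']):
            if position == 0:
                # First company of a firm carries the index and firm name
                rows.append([idx, entry['firm'], company, value])
            else:
                # Later companies leave the first two cells blank
                rows.append(['', '', company, value])
    return rows


# Shortened copy of the dictionary printed above (sample data only).
data = {
    '1': {'firm': 'firm A',
          'company': [('Manhattan (company name)', '25,000'),
                      ('SK Ventures', '25,000')]},
    '3': {'firm': 'firm C', 'company': [('xxx', '625')]},
}

rows = to_rows(data)
for row in rows:
    print(row)
```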

If I understand you correctly (although I'm not totally sure I do), this will produce the output I think you're looking for.
import re
with open('data.txt', 'r') as f:
    f_txt = f.read() # Change file object to text
f_lines = re.split(r'\n(?=\d)', f_txt)
matrix = []
for line in f_lines:
    inner1 = line.split('\n')
    inner2 = [re.split(r'\s{2,}', l) for l in inner1]
    matrix.append(inner2)
print(matrix)
print('')
for row in matrix:
    print(row)
Output of the program:
[[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']], [['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']], [['3', 'firm C', 'xxx', '625']]]
[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']]
[['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']]
[['3', 'firm C', 'xxx', '625']]
I am basing this on the fact that you wanted the first row of your matrix to be:
[[1,Firm A,Manhattan,25,000],['',SK Ventures,25,000],['',AEA investors,10,000]]
However, to handle more rows this way, we end up with a list nested three levels deep, which is the output of print(matrix). This can be a little unwieldy to use, which is why TessellatingHeckler's answer uses a dictionary to store the data; I think that is a much better way to access what you need. But if a list of lists of "matrices" is what you're after, then the code I wrote above does that.
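For the second arrangement in the question (one list per column rather than per line), each group can be transposed. A minimal sketch, assuming the short continuation rows should be padded on the left with empty strings; the name transpose_group is made up for the example.

```python
def transpose_group(group):
    """Turn a group of row lists into a list of column lists."""
    width = max(len(row) for row in group)
    # Left-pad short continuation rows so every row has the same length
    padded = [[''] * (width - len(row)) + row for row in group]
    return [list(column) for column in zip(*padded)]


# First group from the program output shown earlier in this answer.
group = [
    ['1', 'firm A', 'Manhattan (company name)', '25,000'],
    ['', 'SK Ventures', '25,000'],
    ['', 'AEA investors', '10,000'],
]

columns = transpose_group(group)
for column in columns:
    print(column)
```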

Related

Python: Separate text file data into tuples?

I'm currently working on trying to separate values inside of a .txt file into tuples. This is so that, later on, I want to create a simple database using these tuples to look up the data. Here is my current code:
with open("data.txt") as load_file:
    data = [tuple(line.split()) for line in load_file]
c = 0
pts = []
while c < len(data):
    pts.append(data[c][0])
    c += 1
    print(pts)
    pts = []
Here is the text file:
John|43|123 Apple street|514 428-3452
Katya|26|49 Queen Mary Road|514 234-7654
Ahmad|91|1888 Pepper Lane|
I want to store each value that is separated with a "|" and store these into my tuple in order for this database to work. Here is my current output:
['John|43|123']
['Katya|26|49']
['Ahmad|91|1888']
So it is storing some of the data as a single string, and I can't figure out how to make this work. My desired end result is something like this:
['John', 43, '123 Apple street', 514 428-3452]
['Katya', 26, '49 Queen Mary Road', 514 234-7654]
['Ahmad', 91, '1888 Pepper Lane', ]

Try this:
with open("data.txt") as load_file:
    data = [line.strip('\n').split('|') for line in load_file]
for elem in data:
    print(elem)

Try the csv module with a custom delimiter:
import csv
with open("your_file.txt", "r", newline="") as f_in:
    reader = csv.reader(f_in, delimiter="|")
    for a, b, c, d in reader:
        print([a, int(b), c, d])
Prints:
['John', 43, '123 Apple street', '514 428-3452']
['Katya', 26, '49 Queen Mary Road', '514 234-7654']
['Ahmad', 91, '1888 Pepper Lane', '']
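Since the goal in the question is a simple lookup "database", the parsed rows could go into a dictionary keyed by name. The sketch below is one possible shape, not the only one: the Person tuple, its field names, and load_people are assumptions made for the example, and it reads from an in-memory string instead of a real file.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type; the field names are guesses based on the sample data.
Person = namedtuple("Person", "name age address phone")

SAMPLE = (
    "John|43|123 Apple street|514 428-3452\n"
    "Katya|26|49 Queen Mary Road|514 234-7654\n"
    "Ahmad|91|1888 Pepper Lane|\n"
)


def load_people(stream):
    """Read pipe-delimited rows into a dict keyed by name."""
    people = {}
    for name, age, address, phone in csv.reader(stream, delimiter="|"):
        # An empty phone field is stored as None
        people[name] = Person(name, int(age), address, phone or None)
    return people


people = load_people(io.StringIO(SAMPLE))
print(people["Katya"].age)
print(people["Ahmad"].phone)
```

Keying by name assumes names are unique; with duplicates, later rows would overwrite earlier ones.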

Using regex to break string into a dictionary with key and values

I am attempting to turn this list into a list of dictionaries. Below I included a snippet of the full list I'm using.
With row 4.1 as an example, I want:
the key to be the row number ('4.1'),
the values to include the title ('Properties occupied by the company (less $ 43,332,898 \nencumbrances)'),
and the four numbers after it as a list ['68,122,291', '0', '68,122,291', '64,237,046'].
I have the general loop for how I'd put together each separate dictionary. Where I am struggling is coming up with regex patterns to get the row title and row values. It's difficult since some of the row titles also include numbers. Another issue is that not all of the rows have four numbers at the end. For these rows, I just want the available numbers. Any help figuring out the regex to grab these would be appreciated.
clean = ['4.1 Properties occupied by the company (less $ 43,332,898 \nencumbrances) 68,122,291 0 68,122,291 64,237,046 \n',
         '4.2 Properties held for the production of income (less \n $ encumbrances) 0 0 0 0 \n',
         '4.3 Properties held for sale (less $ \nencumbrances) 0 0 \n',]
clean_list = []
for n in clean:
    row_num = re.findall(r'\d+\.',n)
    row_title =
    row_values =
    new_dict = {row_num: row_title, row_values}
    clean_list.append(new_dict)

Not sure why you want a separate dictionary for each line, each with just one key. I would think it more useful to end up with one dictionary with several keys.
import re
d = {}
for line in clean:
    # code, lazy title, then one to four trailing numbers
    parts = re.match(r"^([\d.]+)\s+(.*?)\s+(\d[\d,.]*)\s*(?:(\d[\d,.]*)\s*)?(?:(\d[\d,.]*)\s*)?(?:(\d[\d,.]*)\s*)?$",
                     line, re.DOTALL)
    code, title, *values = parts.group(1,2,3,4,5,6)
    # Unmatched optional groups are None; filter them out
    d[code] = (title, list(filter(None, values)))
For the sample data, the value of d would be:
{
'4.1': (
'Properties occupied by the company (less $ 43,332,898 \nencumbrances)',
['68,122,291', '0', '68,122,291', '64,237,046']
),
'4.2': (
'Properties held for the production of income (less \n $ encumbrances)',
['0', '0', '0', '0']
),
'4.3': (
'Properties held for sale (less $ \nencumbrances)',
['0', '0']
)
}

Following the examples shared by you
"4.1 Properties occupied by the company (less $ 43,332,898 \nencumbrances) 68,122,291 0 68,122,291 64,237,046"
"4.2 Properties held for the production of income (less \n $ encumbrances) 0 0 0 0"
"4.3 Properties held for sale (less $ \nencumbrances) 0 0 "
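for n in clean:
    n = n.strip()  # drop the trailing " \n" so $ sits right after the last number
    row_num = re.search(r'^\d+\.\d+', n).group()
    row_values = re.search(r'(?:\d[\d,]* ?)+$', n).group()
    # Python's re rejects variable-width lookbehind, so slice the title out instead
    row_title = n[len(row_num):len(n) - len(row_values)].strip()
    new_dict = {row_num: (row_title, row_values.split())}
    clean_list.append(new_dict)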
I would seriously recommend going through https://regexr.com/ and trying these patterns and messing with them to learn. The solution is not important. It's how you end up at it.

you can use split() to accommodate variable number of the last values into a list. if your data always has the parenthesis you can use those as part of the pattern:
for n in clean:
    row_num, row_title, row_values = re.findall(
        r'^(\d+\.\d+)\s+(.*\))\s+(.*)$', n, re.DOTALL)[0]
    new_dict = {row_num: (row_title, row_values.split())}
    clean_list.append(new_dict)
keeping your use of re.findall(). output looks like:
[{'4.1': ('Properties occupied by the company (less $ 43,332,898 \n'
'encumbrances)',
['68,122,291', '0', '68,122,291', '64,237,046'])},
{'4.2': ('Properties held for the production of income (less \n'
' $ encumbrances)',
['0', '0', '0', '0'])},
{'4.3': ('Properties held for sale (less $ \nencumbrances)', ['0', '0'])}]
this retains the newlines, if you want those.
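If the newlines are not wanted, the title can be normalised after matching, and the values converted to integers at the same time. A small sketch of that post-processing step; the helper name tidy is made up, and it assumes every value is a whole number with optional thousands separators.

```python
def tidy(title, values):
    """Collapse whitespace in the title and turn '68,122,291' into 68122291."""
    # split() with no argument breaks on any whitespace run, including '\n'
    clean_title = " ".join(title.split())
    numbers = [int(value.replace(",", "")) for value in values]
    return clean_title, numbers


title, numbers = tidy(
    'Properties occupied by the company (less $ 43,332,898 \nencumbrances)',
    ['68,122,291', '0', '68,122,291', '64,237,046'],
)
print(title)
print(numbers)
```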

How to check how close a number is to another number?

I have a TEXT FILE that looks like:
John: 27
Micheal8483: 160
Mary Smith: 57
Adam 22: 68
Patty: 55
etc. These are usernames, which is why the names occasionally contain numbers. What I want to do is check each number (the one after the ":") and get the 3 names whose numbers are closest in value to an integer (specifically one named targetNum). It will always be positive.
I have tried multiple things but I am new to Python and I am not really sure how to go about this problem. Any help is appreciated!

You can parse the file into a list of name/number pairs. Then sort the list by difference between a number and targetNum. The first three items of the list will then contain the desired names:
users = []
with open("file.txt") as f:
    for line in f:
        name, num = line.split(":")
        users.append((name, int(num)))
targetNum = 50
users.sort(key=lambda pair: abs(pair[1] - targetNum))
print([pair[0] for pair in users[:3]]) # ['Patty', 'Mary Smith', 'Adam 22']
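When only the closest few names are needed, heapq.nsmallest can pick them without sorting the whole list. The sketch below is a variation on the code above, not a replacement for it: closest_names is a made-up helper, it works on a list of lines rather than a file, and it splits on the last colon in case a username itself contains one.

```python
import heapq


def closest_names(lines, target, count=3):
    """Return the names whose numbers are nearest to target, closest first."""
    users = []
    for line in lines:
        name, sep, num = line.rpartition(":")
        if not sep:
            continue  # skip blank or malformed lines that have no colon
        users.append((name.strip(), int(num)))
    nearest = heapq.nsmallest(count, users, key=lambda pair: abs(pair[1] - target))
    return [name for name, _ in nearest]


sample = [
    "John: 27\n",
    "Micheal8483: 160\n",
    "Mary Smith: 57\n",
    "Adam 22: 68\n",
    "Patty: 55\n",
]
result = closest_names(sample, 50)
print(result)
```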

You could use a regex here:
import re
pattern = r'(\w.+)?:\s(\d+)'
data_1 = []
targetNum = 50
with open('new_file.txt', 'r') as f:
    for line in f:
        data = re.findall(pattern, line)
        for i in data:
            # keep the absolute distance so values below targetNum rank correctly
            data_1.append((abs(int(i[1]) - targetNum), i[0]))
data_1.sort()  # smallest distance first
print([name for _, name in data_1[:3]])
output:
['Patty', 'Mary Smith', 'Adam 22']

How to get rid of \n and split the values surrounding it in a Python list

So I'm new to Python and I'm having difficulty understanding how to manipulate files and such. Currently I've been trying to assign the lines in my file into a list by splitting it at commas. I'm using this code:
with open('grades.txt','r') as f:
    data=f.read()
    data=data.split(',')
    print(data)
The problem I have now is the output is this:
['22223333', ' Michael Gill', ' 49\n23232323', ' Nicholas Smith', ' 62\n18493214', ' Kerri Morgan', ' 75\n00015542', ' Donald Knuth', ' 90\n00000001', ' Alan Turing', ' 100']
My question is: how do I remove the \n from my output, and how would I go about splitting the values separated by the \n (for example, 49\n23232323, which I would like split into '49','23232323')? It is my understanding (which is not a lot) that you can't split a list, nor can you assign 2 variables for splitting a file, so how would I split the file by commas and '\n'?
The ideal output would be:
['22223333', 'Michael Gill', '49', '23232323', 'Nicholas Smith', '62', '18493214', 'Kerri Morgan', '75', '00015542', 'Donald Knuth', '90', '00000001', 'Alan Turing', '100']
The grades.txt file consists of:
22223333, Michael Gill, 49
23232323, Nicholas Smith, 62
18493214, Kerri Morgan, 75
00015542, Donald Knuth, 90
00000001, Alan Turing, 100
Also, is it possible to split only certain lines/words in a file into a list? (i.e. a file containing (1,2,3,4,a,b,c,d,5,4,3,d,r) and splitting the numbers into one list and the letters into another?)

I'd do something like this:
with open('grades.txt','r') as f:
    data = f.read()
data = data.replace("\n", ",").split(',')
print(data)
This replaces every \n with a comma before splitting.
If you want numbers in one list and words in another, create two lists and sort the elements into them using .isdigit(), like this:
words = []
numbers = []
for element in data:
    if element.replace(" ", "").isdigit():
        numbers.append(element)
    else:
        words.append(element)
Another way to do it is with try and except:
for element in data:
    try:
        int(element.replace(" ", ""))
        numbers.append(element)
    except ValueError:
        words.append(element)

As someone mentioned in the comments, perhaps the better approach would be to use the csv module. But that requires you to learn/understand Python dictionaries - however dictionaries are a great data structure and very useful in many cases.
from csv import DictReader as dr
data_from_file = []
with open('my_file.csv', newline='') as fh:
    my_reader = dr(fh)
    column_headings = my_reader.fieldnames
    for row in my_reader:
        data_from_file.append(row)
The result is a list of dictionaries. Each item in the list corresponds to a row in the initial file. Instead of the data being an object without specific identity, each value is labelled: assuming you have column headings id, name and age in your original file, the results would look like
[{'id': '22223333', 'name': 'Michael Gill', 'age': '49'} . . .]
the column_headings object is a list of the original column headings from the file if you wanted to manipulate/explore those. Of course the next question is how to save your data as a CSV file. There are a number of Q&A here on how to use the DictWriter method.
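Note that the grades.txt shown in the question has no header line, so DictReader would take the first student as the column names. One way around that, sketched here under assumptions, is to pass fieldnames explicitly; the names id, name and grade are made up for the example, and the data is read from an in-memory string rather than the real file.

```python
import csv
import io

SAMPLE = (
    "22223333, Michael Gill, 49\n"
    "23232323, Nicholas Smith, 62\n"
)

# Hypothetical column names, since the file itself carries no header row.
reader = csv.DictReader(
    io.StringIO(SAMPLE),
    fieldnames=["id", "name", "grade"],
    skipinitialspace=True,  # drop the space that follows each comma
)
rows = list(reader)
for row in rows:
    print(row["name"], row["grade"])
```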

You can also do it this way:
list1 = ['22223333', ' Michael Gill', ' 49\n23232323', ' Nicholas Smith', ' 62\n18493214', ' Kerri Morgan', ' 75\n00015542', ' Donald Knuth', ' 90\n00000001', ' Alan Turing', ' 100']
list2 = []
for x in range(len(list1)):
    list1[x] = list1[x].split('\n')
list2 = sum(list1, [])
print(list2)
output would be
['22223333', ' Michael Gill', ' 49', '23232323', ' Nicholas Smith', ' 62', '18493214', ' Kerri Morgan', ' 75', '00015542', ' Donald Knuth', ' 90', '00000001', ' Alan Turing', ' 100']

You could use itertools.chain as follows:
from itertools import chain
with open('grades.txt','r') as f:
    data = list(chain.from_iterable(line.split() for line in f.readlines()))
print(data)
This would display data as:
['22223333,', 'Michael', 'Gill,', '49', '23232323,', 'Nicholas', 'Smith,', '62', '18493214,', 'Kerri', 'Morgan,', '75', '00015542,', 'Donald', 'Knuth,', '90', '00000001,', 'Alan', 'Turing,', '100']
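This uses readlines() to read the file into a list of lines. Each line still ends with its newline; it is the argument-less split() that discards it, because it splits on any run of whitespace. The per-line lists are then flattened into a single list with chain.from_iterable(). Note that splitting on whitespace leaves the commas attached and breaks names such as 'Michael Gill' in two, as the output above shows; splitting each line on ',' and stripping every piece would match the output asked for in the question.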

I suspect those newlines are there separating rows and you would be better off:
with open('grades.txt', 'r') as f:
    for row in f.readlines():
        data = row.split(',')
        print(data)
If you want a single long tuple instead, you can build it by concatenating the results of each split.
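Both shapes, one tuple per row and one long tuple, can be sketched together. This is only an illustration of the idea: parse_rows is a made-up name, the sample lines stand in for the file, and each field is stripped so the leading spaces and newlines disappear.

```python
def parse_rows(lines):
    """Return one tuple of stripped fields per non-blank line."""
    rows = []
    for line in lines:
        if line.strip():
            rows.append(tuple(part.strip() for part in line.split(",")))
    return rows


sample = [
    "22223333, Michael Gill, 49\n",
    "23232323, Nicholas Smith, 62\n",
]
rows = parse_rows(sample)

# Concatenate the per-row tuples into a single long tuple
flat = sum(rows, ())
print(rows)
print(flat)
```

Using sum() to concatenate is fine for a handful of rows; for large files itertools.chain.from_iterable avoids building many intermediate tuples.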

Handling file reading and multiple values to a key in dictionary

How can I read the first line of a file and use its fields as the keys of a dictionary, then keep reading the following lines and add their fields as values under the key whose column they fall in?
Like example:
Item      Quality  Cost  Place
Ball      1        $12   TX
Umbrella  5        $35   NY
sweater   89       $100  LA
This represents my file. When I read it, I want the dictionary to be created so that the things in bold (the first line) become the keys, and as I keep reading the lines below, their fields are added as multiple values under the respective keys.
thanks

Looks like you are describing a csv file with a space delimiter. Something like this should work (from the Python help).
>>> import csv
>>> spamReader = csv.reader(open('eggs.csv', 'rb'), delimiter=' ', quotechar='|')
>>> for row in spamReader:
...     print ', '.join(row)
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam
In fact, the csv.DictReader would be better in order to have each row as a dictionary with keys defined by the first row.
I am assuming that there is a newline inserted after each group of values.
Edit: Using the example above, we get:
In [1]: import csv
In [2]: f = csv.DictReader(open('test.txt', 'r'), delimiter = ' ', skipinitialspace = True)
In [3]: for row in f: print row
{'Item': 'Ball', 'Cost': '$12', 'Quality': '1', 'Place': 'TX'}
{'Item': 'Umbrella', 'Cost': '$35', 'Quality': '5', 'Place': 'NY'}
{'Item': 'sweater', 'Cost': '$100', 'Quality': '89', 'Place': 'LA'}
Passing the parameter skipinitialspace = True to the DictReader is needed to be able to get rid of multiple spaces without creating spurious entries in each row.

You can't have "multiple values" for a given key, but you can of course have one value per key that's a list.
For example (Python 2.6 or better -- simply because I use the next function for generality rather than methods such as readline, but you can of course tweak that!):
def makedictwithlists(f):
    keys = next(f).split()
    d = {}
    for k in keys: d[k] = []
    for line in f:
        for k, v in zip(keys, line.split()):
            d[k].append(v)
    return d
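To see what this returns for the file in the question, here is a self-contained run of the same idea on an in-memory copy of the data. It is a sketch for illustration: io.StringIO stands in for the opened file, and the function is restated (with a dict comprehension) only so the example runs on its own.

```python
import io


def make_dict_with_lists(f):
    """Map each header word to the list of values found in its column."""
    keys = next(f).split()  # the first line supplies the keys
    d = {k: [] for k in keys}
    for line in f:
        # split() with no argument copes with any number of spaces between fields
        for k, v in zip(keys, line.split()):
            d[k].append(v)
    return d


sample = io.StringIO(
    "Item Quality Cost Place\n"
    "Ball 1 $12 TX\n"
    "Umbrella 5 $35 NY\n"
    "sweater 89 $100 LA\n"
)
result = make_dict_with_lists(sample)
print(result["Item"])
print(result["Cost"])
```

This relies on no field containing a space; a value such as "New York" would shift the later columns.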
