I have a keyword list and an input list of lists. My task is to find the inner lists that contain a keyword (even as a partial match). I am able to extract the lists that contain a keyword using the following code:
t_list = [['Subtotal: ', '1,292.80 '], ['VAT ', ' 64.64 '], ['RECEIPT TOTAL ', 'AED1,357.44 '],
          ['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '],
          ['NOT_SELECTED, upto500 ', 'amount 160.58 ', '', '3.03 '],
          ['Learn', 'Bectricity total ', '', '', '63.61 ']]
keyword = ['total ', 'amount ']
for string_list in t_list:
    # drop the empty strings in place
    string_list[:] = [item for item in string_list if item != '']
    for element in string_list:
        element = element.lower()
        if any(s in element for s in keyword):
            print(string_list)
The output is:
[['Subtotal: ', '1,292.80 '], ['RECEIPT TOTAL ', 'AED1,357.44 '], ['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '], ['NOT_SELECTED, upto500 ', 'amount 160.58 ', '3.03 '],
['Learn', 'Bectricity total ', '63.61 ']]
The required output keeps only the string that matched the keyword and the corresponding number in each list.
Required output:
[['Subtotal: ', '1,292.80 '], ['RECEIPT TOTAL ', 'AED1,357.44 '], ['Sub total ', '60.58 '], ['amount 160.58 ', '3.03 '],['Bectricity total ', '63.61 ']]
If I can have the output as a dictionary with the string matched to the keyword as the key and the number as the value, that would be perfect.
Thanks a ton in advance!
Here is the answer from our chat, slightly modified, with some comments added to explain the code. Feel free to ask me to clarify or change anything.
import re

t_list = [
    ['Subtotal: ', '1,292.80 '],
    ['VAT ', ' 64.64 '],
    ['RECEIPT TOTAL ', 'AED1,357.44 '],
    ['NOT_SELECTED, upto2,000 ', 'Sub total ', '60.58 '],
    ['NOT_SELECTED, upto500 ', 'amount 160.58 ', '', '3.03 '],
    ['Learn', 'Bectricity total ', '', '', '63.61 ']
]
keywords = ['total ', 'amount ']

output = {}
for sub_list in t_list:
    # Becomes the string that matched a keyword, if one is found
    matched = None
    for item in sub_list:
        for keyword in keywords:
            if keyword in item.lower():
                matched = item
    # If a match was found, we go through the list again,
    # this time looking for the numbers
    if matched:
        for item in sub_list:
            # Split the string so that, for example, 'amount 160.58 ' becomes
            # ['amount', '160.58']. This makes it easier to extract just the number.
            split_items = item.split()
            for split_item in split_items:
                # Simple regex to match a '.' with a digit on either side
                re_search = re.search(r'[0-9][.][0-9]', split_item)
                if re_search:
                    # Try block because we are building a list. If the list already exists,
                    # just append the value; otherwise create the list with the item in it.
                    try:
                        output[matched.strip()].append(split_item)
                    except KeyError:
                        output[matched.strip()] = [split_item]

print(output)
You mentioned wanting to match a string such as 'AED 63.61'. My solution is using .split() to separate strings and make it easier to grab just the number. For example, for a string like 'amount 160.58' it becomes much easier to just grab the 160.58. I'm not sure how to go about matching a string like the one you want to keep but not matching the one I just mentioned (unless, of course, it is just 'AED' in which case we could just add some more logic to match anything with 'aed').
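If the prefix really is just a currency code like 'AED', one possible tweak (a sketch of my own, not part of the answer above; extract_number is a hypothetical helper) is to strip a leading alphabetic prefix from each token before running the number check:

import re

def extract_number(token):
    # Drop a leading alphabetic prefix such as 'AED', then apply the same
    # digit-dot-digit check used in the answer above.
    token = re.sub(r'^[A-Za-z]+', '', token.strip())
    if re.search(r'[0-9][.][0-9]', token):
        return token
    return None

print(extract_number('AED1,357.44 '))  # -> '1,357.44'
print(extract_number('amount'))        # -> None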
I want to split a string into words on whitespace or any special character. But if the words before AND after the separator both contain a number, and the separator is not a whitespace character, then I DON'T want to split there.
"abc abc-def a2b-def a2b-d3f"
Should become - (notice the last word)
"abc", " ", "abc", "-", "def", " ", "a2b", "-", "def", " ", "a2b-d3f"
I tried
b = "abc abc-def a2b-def a2b-d3f ab2-3cd"
print(re.split(r"((?<=\D)[\W]|[\W](?=\D)|\s)",b))
print(re.split(r"((?<!\b\w*\d\w*\b)[\W]|[\W](?!\b\w*\d\w*\b)|\s)",b))
The first one sort of works, but it only considers the last character of the previous word and the first character of the next word. It keeps "ab2-3cd" as a single word, but it wouldn't work for "a2b-c3d".
The second one gives me the error "look-behind requires fixed-width pattern", because it doesn't allow me to use * in the look-behind.
Please help me out!
EDIT: the words can be of arbitrary length, "abcdef".
You can split each whitespace-separated word into the pieces matching r'\w+|\W+', except for the words that match the pattern r'\d\w*\W+\w*\d', which are kept whole; a ' ' is appended after each word:
>>> import re
>>> txt = "abc abc-def a2b-def a2b-d3f"
>>> [w for s in txt.split() for w in ([s] if re.search(r'\d\w*\W+\w*\d', s) else re.findall(r'\w+|\W+', s)) + [' ']]
['abc', ' ', 'abc', '-', 'def', ' ', 'a2b', '-', 'def', ' ', 'a2b-d3f', ' ']
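If the one-liner is hard to read, here it is unrolled into a function (my own restating of the same logic; split_keep_numeric_pairs is just an illustrative name):

import re

def split_keep_numeric_pairs(text):
    result = []
    for word in text.split():
        if re.search(r'\d\w*\W+\w*\d', word):
            # digits on both sides of the separator: keep the word intact
            result.append(word)
        else:
            # otherwise split into alternating word / non-word chunks
            result.extend(re.findall(r'\w+|\W+', word))
        result.append(' ')
    return result

print(split_keep_numeric_pairs("abc abc-def a2b-def a2b-d3f"))
# ['abc', ' ', 'abc', '-', 'def', ' ', 'a2b', '-', 'def', ' ', 'a2b-d3f', ' ']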
import re
s = "abc abc-def a2b-def a2b-d3f"
s = re.split(r'(?:(?<=[\da-z]{3})(\s|-)(?=[a-z]{3})|(?:(?<=[a-z]{3})(\s|-)(?=[a-z\d]{3})))', s)
s = [i for i in s if i is not None]
print(s)
Prints:
['abc', ' ', 'abc', '-', 'def', ' ', 'a2b', '-', 'def', ' ', 'a2b-d3f']
EDIT:
import re
s = "a2dc abc axx2b-dss3f abc-def a2b-abc a2b-d3f"
s = re.split(r'(\s|-)(?=[a-z]+(?:-|\s))', s)
out = []
for w in s:
    out.extend(re.split(r'(?<=[a-z\d])(\s)(?=[a-z\d])', w))

print(out)
Prints:
['a2dc', ' ', 'abc', ' ', 'axx2b-dss3f', ' ', 'abc', '-', 'def', ' ', 'a2b', '-', 'abc', ' ', 'a2b-d3f']
I want to split strings based on whitespace and punctuation, but the whitespace and punctuation should still be in the result.
For example:
Input: text = "This is a text; this is another text.,."
Output: ['This', ' ', 'is', ' ', 'a', ' ', 'text', '; ', 'this', ' ', 'is', ' ', 'another', ' ', 'text', '.,.']
Here is what I'm currently doing:
import string

def classify(b):
    """
    Classify a character.
    """
    separators = string.whitespace + string.punctuation
    if b in separators:
        return "separator"
    else:
        return "letter"

def tokenize(text):
    """
    Split strings to words, but do not remove white space.
    The input must be of type str, not bytes.
    """
    if len(text) == 0:
        return []
    current_word = "" + text[0]
    previous_mode = classify(text[0])
    offset = 1
    results = []
    while offset < len(text):
        current_mode = classify(text[offset])
        if current_mode == previous_mode:
            current_word += text[offset]
        else:
            results.append(current_word)
            current_word = text[offset]
            previous_mode = current_mode
        offset += 1
    results.append(current_word)
    return results
It works, but it's so C-style. Is there a better way in Python?
You can use a regular expression:
import re
re.split('([\s.,;()]+)', text)
This splits on arbitrary-width whitespace (including tabs and newlines) plus a selection of punctuation characters, and by putting the separator pattern in a capturing group you tell re.split() to include it in the output:
>>> import re
>>> text = "This is a text; this is another text.,."
>>> re.split('([\s.,;()]+)', text)
['This', ' ', 'is', ' ', 'a', ' ', 'text', '; ', 'this', ' ', 'is', ' ', 'another', ' ', 'text', '.,.', '']
If you only wanted to match spaces (and not other whitespace), replace \s with a space:
>>> re.split('([ .,;()]+)', text)
['This', ' ', 'is', ' ', 'a', ' ', 'text', '; ', 'this', ' ', 'is', ' ', 'another', ' ', 'text', '.,.', '']
Note the extra trailing empty string; a split always has a head and a tail, so text starting or ending in a split group will always have an extra empty string at the start or end. This is easily removed.
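For example, one simple way to drop those empty strings (my addition, not part of the answer above):

>>> [part for part in re.split('([\s.,;()]+)', text) if part]
['This', ' ', 'is', ' ', 'a', ' ', 'text', '; ', 'this', ' ', 'is', ' ', 'another', ' ', 'text', '.,.']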
I am trying to write this code for readability, but the last 'for x in measurements' comprehension clearly doesn't do what I want.
The following prints ' t', but I don't want it to match the ' t' inside ' test'.
I do want it to match the ' t' at the end of 'this is a t', if that were the string being checked.
Is this possible without resorting to regex?
measurements = ['t', 'tsp', 'T', 'tbl', 'tbs', 'tbsp', 'c']
measurements = ([' ' + x + ' ' for x in measurements] + #space on either side
[' ' + x + '.' for x in measurements] + #space in front, period in back
[' ' + x + '' for x in measurements]) #space in front, nothing in back???
string_to_check = 'this is a test'
for measurement in measurements:
    if measurement in string_to_check:
        print(measurement)
Here you could use re.search
>>> import re
>>> measurements = ['t', 'tsp', 'T', 'tbl', 'tbs', 'tbsp', 'c']
>>> measurements = ([' ' + x + ' ' for x in measurements] + [' ' + x + '\.' for x in measurements] + [' ' + x + r'\b' for x in measurements])
>>> measurements
[' t ', ' tsp ', ' T ', ' tbl ', ' tbs ', ' tbsp ', ' c ', ' t\\.', ' tsp\\.', ' T\\.', ' tbl\\.', ' tbs\\.', ' tbsp\\.', ' c\\.', ' t\\b', ' tsp\\b', ' T\\b', ' tbl\\b', ' tbs\\b', ' tbsp\\b', ' c\\b']
>>> string_to_check = 'this is a test'
>>> for measurement in measurements:
...     if re.search(measurement, string_to_check):
...         print(measurement)
...
>>>
I have done two things here.
[' ' + x + '\.' for x in measurements] escapes the dot in order to match a literal dot, since the dot is a special metacharacter in regex that matches any character.
[' ' + x + r'\b' for x in measurements] adds the word boundary \b; since \b matches between a word character and a non-word character, it won't pick up ' t' from ' test'.
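A quick illustration of the word-boundary behaviour (my own example):

>>> import re
>>> bool(re.search(r' t\b', 'this is a test'))  # 't' is followed by 'e': no word boundary
False
>>> bool(re.search(r' t\b', 'this is a t'))     # ' t' at the end of the string
True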
The problem is that you've coded for a different meaning of 'nothing behind it' than the one you're thinking of.
You've included the string ' t' in your array which is a substring of the string 'this is a test' [namely, it's sitting there at the front of the word test].
If you want 'nothing behind it' to mean 'at the end of the string' then you'll have to check what's at the end of the string instead of using substring search.
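A sketch of that idea (mine, not from the answer above): treat 'nothing behind it' as 'at the end of the string' and check with endswith instead of a substring search:

measurements = ['t', 'tsp', 'T', 'tbl', 'tbs', 'tbsp', 'c']

string_to_check = 'this is a t'
for m in measurements:
    if string_to_check.endswith(' ' + m):
        print(m)   # prints 't' here, but prints nothing for 'this is a test'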
measurements
[' t ', ' tsp ', ' T ', ' tbl ', ' tbs ', ' tbsp ', ' c ', ' t.', ' tsp.', ' T.', ' tbl.', ' tbs.', ' tbsp.', ' c.', ' t', ' tsp', ' T', ' tbl', ' tbs', ' tbsp', ' c']
You can find ' t' in measurements, so ' t' is found in your check string "this is a[ t]est".
So it is correct for it to return ' t'.
If you want to match exactly ' t' and not ' txxx', you need to use:
[' ' + x + r'\b' for x in measurements]
A possible non-regex approach is to split string_to_check into a list of words. Then the in check looks for an exact word match.
measurements = ['t', 'tsp', 'T', 'tbl', 'tbs', 'tbsp', 'c']
string_to_check = 'this is a test'
words_to_check = string_to_check.replace('.', ' ').split()
for measurement in measurements:
    if measurement in words_to_check:
        print(measurement)
I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.
Furthermore, I need for the separators to remain in the output list, in between the items they separated in the string.
For example,
"Now is the winter of our discontent"
should output:
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
I'm not sure how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?
A different non-regex approach from the others:
>>> import string
>>> from itertools import groupby
>>>
>>> special = set(string.punctuation + string.whitespace)
>>> s = "One two three tab\ttabandspace\t end"
>>>
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
>>> split_combined
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
>>> split_separated
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
Could use dict.fromkeys and .get instead of the lambda, I guess.
[edit]
Some explanation:
groupby accepts two arguments, an iterable and an (optional) key function. It loops through the iterable and groups its elements by the value of the key function:
>>> groupby("sentence", lambda c: c in 'nt')
<itertools.groupby object at 0x9805af4>
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
where consecutive terms with the same key-function value are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be adjacent; see the example below.)
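A small illustration of that pitfall (my own example), grouping words by their first letter with and without sorting first:

>>> from itertools import groupby
>>> data = ['apple', 'avocado', 'banana', 'apricot']
>>> [(k, list(g)) for k, g in groupby(data, key=lambda w: w[0])]
[('a', ['apple', 'avocado']), ('b', ['banana']), ('a', ['apricot'])]
>>> [(k, list(g)) for k, g in groupby(sorted(data, key=lambda w: w[0]), key=lambda w: w[0])]
[('a', ['apple', 'avocado', 'apricot']), ('b', ['banana'])]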
As @JonClements guessed, what I had in mind was
>>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
>>> s = "One two three tab\ttabandspace\t end"
>>> [''.join(g) for k,g in groupby(s, special.get)]
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
for the case where we were combining the separators. .get returns None if the value isn't in the dict.
import re
import string
p = re.compile("[^{0}]+|[{0}]+".format(
    re.escape(string.punctuation + string.whitespace)))
print(p.findall("Now is the winter of our discontent"))
I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.
I'll explain the regexp since you're not familiar with it:
[...] means any of the characters inside the square brackets
[^...] means any of the characters not inside the square brackets
+ behind means one or more of the previous thing
x|y means to match either x or y
So the regexp matches 1 or more characters where either all must be punctuation and whitespace, or none must be. The findall method finds all non-overlapping matches of the pattern.
Try this:
import re
import string

re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")
Explanation from the Python documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
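For reference, on the example sentence (which contains no punctuation, so only the spaces are captured) this should produce:

>>> import re, string
>>> re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']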
Solution in linear (O(n)) time:
Let's say you have a string:
original = "a, b...c d"
First convert all separators to space:
import string

splitters = string.punctuation + string.whitespace
trans = string.maketrans(splitters, ' ' * len(splitters))  # Python 2; in Python 3 use str.maketrans
s = original.translate(trans)
Now s == 'a  b   c d' (every separator replaced by a single space). Now you can use itertools.groupby to alternate between runs of spaces and non-spaces:
import itertools

result = []
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    letter_count = len(list(letters))
    result.append(original[position:position + letter_count])
    position += letter_count
Now result == ['a', ', ', 'b', '...', 'c', ' ', 'd'], which is what you need.
My take:
from string import whitespace, punctuation
import re
pattern = re.escape(whitespace + punctuation)
print(re.split('([' + pattern + '])', 'now is the winter of'))
Depending on the text you are dealing with, you may be able to simplify your concept of delimiters to "anything other than letters and numbers". If this will work, you can use the following regex solution:
re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)
This assumes that you want to split on each individual delimiter character even if they occur consecutively, so 'foo..bar' would become ['foo', '.', '.', 'bar']. If instead you expect ['foo', '..', 'bar'], use [a-zA-Z\d]+|[^a-zA-Z\d]+ (only difference is adding + at the very end).
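A quick check of the two variants (my own example):

>>> import re
>>> re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', 'foo..bar')
['foo', '.', '.', 'bar']
>>> re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]+', 'foo..bar')
['foo', '..', 'bar']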
from functools import reduce  # reduce is a builtin in Python 2, but must be imported in Python 3
from string import punctuation, whitespace

s = "..test. and stuff"
f = lambda s, c: s + ' ' + c + ' ' if c in punctuation else s + c
l = sum([reduce(f, word).split() for word in s.split()], [])
print(l)
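For reference, with the sample string above this appears to print the following (note that the whitespace between words is not preserved by this approach):

['.', '.', 'test', '.', 'and', 'stuff']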
For any arbitrary collection of separators:
def separate(myStr, seps):
    answer = []
    temp = []
    for char in myStr:
        if char in seps:
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer
In [4]: print separate("Now is the winter of our discontent", set(' '))
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
In [5]: print separate("Now, really - it is the winter of our discontent", set(' ,-'))
['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
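If the empty strings that appear between adjacent separators are unwanted, one possible tweak (my addition) is to filter them out afterwards:

result = [piece for piece in separate("Now, really - it is the winter of our discontent", set(' ,-')) if piece != '']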
Hope this helps
from itertools import chain, cycle, izip
s = "Now is the winter of our discontent"
words = s.split()
wordsWithWhitespace = list( chain.from_iterable( izip( words, cycle([" "]) ) ) )
# result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']
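A Python 3 sketch of the same idea (my adaptation; itertools.izip is gone in Python 3 and the built-in zip is already lazy), with the trailing separator dropped:

from itertools import chain, cycle

s = "Now is the winter of our discontent"
words = s.split()
# interleave each word with a space, then drop the trailing space with [:-1]
wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))[:-1]
print(wordsWithWhitespace)
# ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']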