Extract Keys and Values from text using regular expressions - python

I have a large number of strings that I need to parse. Each string contains information stored as key-value pairs.
Sample input text:
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
Key information:
A key starts either from the beginning of the string or after \.
A key ends always with :
The key is immediately followed by a value
This value continues until the next key or until the last symbol in the string
There are multiple key-value pairs, and I don't know in advance how many.
Expected Output
{
"Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
"Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}
The regex that I have so far is ([üöä\w\s]*)\: (.*?)\.. Suffice it to say it doesn't provide the expected output.

This regex ([^:.]+):\s*([^:]+)(?=\.\s+|$) does the job.
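As a minimal sketch of how the pattern above could be turned into a dictionary, the snippet below runs it with re.findall on a shortened, made-up sample. The sample text and the helper name parse_pairs are illustrative assumptions, not part of the question. Keys are stripped because the first group also captures the space that follows the preceding period.

```python
import re

# Pattern from the answer above: key up to ':', value up to a '. ' or the end.
PAIR_RE = re.compile(r'([^:.]+):\s*([^:]+)(?=\.\s+|$)')

def parse_pairs(text):
    """Return a dict of stripped keys and values (hypothetical helper)."""
    return {key.strip(): value.strip() for key, value in PAIR_RE.findall(text)}

# Shortened, made-up sample with the same shape as the question's input.
sample = (
    "Intro sentence here. Key one: value one, more. Still value one. "
    "Key two: value 2 at 31.12.2012, end"
)
pairs = parse_pairs(sample)
for key, value in pairs.items():
    print(f"{key!r}: {value!r}")
```

Note that a value containing a colon would be cut short by this pattern, since the second group excludes ':' entirely.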

You can match the following regular expression, which saves the keys and values to capture groups 1 and 2.
r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'
Python's regex engine performs the following operations.
(?<![^.])   : negative lookbehind asserts current location is not
              preceded by a character other than '.'
\ *         : match 0+ spaces
(           : begin capture group 1
  [^.]+?    : match 1+ characters other than '.', lazily
  :         : match ':'
)           : end capture group 1
\ *         : match 0+ spaces
(           : begin capture group 2
  (?:       : begin non-capture group
    (?!\. ) : negative lookahead asserts current position is not
              followed by a period followed by a space
    .       : match any character other than a line terminator
  )+        : end non-capture group and execute 1+ times
)           : end capture group 2
This uses the tempered greedy token technique, which matches a series of individual characters that do not begin an unwanted string. For example, if the string were "concatenate", (?:(?!cat).)+ would match the first three letters but not the second 'c', so the match would be 'con'.
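The "concatenate" example can be checked directly. This is only a small illustration of the technique, using a pattern that consumes one character at a time as long as the text ahead does not start with "cat":

```python
import re

# Tempered greedy token: consume one character at a time, but only
# while the text ahead does not start with "cat".
tempered = re.compile(r'(?:(?!cat).)+')

match = tempered.match("concatenate")
print(match.group())  # con
```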

Just for fun, here's a python, non-regex solution:
latin = """[the sample input text]"""
new_lat = latin.replace(":","xxx:").split('xxx')
for l in new_lat:
    if ":" in l:
        curr_ind = new_lat.index(l)
        cur_brek = l.rfind('. ')
        prev_brek = new_lat[curr_ind-1].rfind('. ')
        stub = new_lat[curr_ind-1][prev_brek+2:]
        new_l = stub+l[:cur_brek]
        print(new_l)
Output is the two text blocks starting from the key.

Related

What's the proper way to exclude uppercase word/s in regex python

Let's say I've scraped this from a website.
PARIS - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua (2015). Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat 22/05/2015. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
I can just use .replace('PARIS - ', '') and then get the text with a regex, but what if the place changes from one article to another?
How do I exclude the leading "PARIS" and " - " and get the rest of the text?
Should I separate the location from the content with a regex?
What should I think about or do first when facing a problem like this?
Here's my code to get the first word, for my third question; assume that text is a variable containing the text above.
location = re.findall(r'^\w+', text)
Use a regular expression that matches a sequence of uppercase letters and spaces followed by a hyphen at the beginning, and replaces it with an empty string.
text = re.sub(r'^[A-Z\s]+\s-\s*', '', text)
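If the location is needed as well as the body, one option is to capture both in a single match instead of discarding the prefix. The sketch below is an assumption about what such a helper could look like; the name split_dateline and the rule "uppercase letters and spaces, then a hyphen" are illustrative, and real articles may need a looser pattern.

```python
import re

# Uppercase location, then " - ", then the rest of the article.
DATELINE_RE = re.compile(r'^([A-Z][A-Z\s]*?)\s+-\s+(.*)$', re.DOTALL)

def split_dateline(text):
    """Return (location, body); location is None when no dateline is found."""
    m = DATELINE_RE.match(text)
    if m is None:
        return None, text
    return m.group(1), m.group(2)

print(split_dateline("PARIS - Lorem ipsum dolor sit amet (2015)."))
print(split_dateline("NEW YORK - Duis aute irure dolor."))
print(split_dateline("Lorem ipsum - no dateline here."))
```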

Python user paragraph input

How can I copy and paste this as user input in Python?
"Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet,
consectetur, adipisci velit, sed quia non numquam eius modi tempora
incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut
enim ad minima veniam, quis nostrum exercitationem ullam corporis
suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?
Quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse quam nihil molestiae consequatur, vel illum qui dolorem eum
fugiat quo voluptas nulla pariatur?"
Edit:
input() doesn't work, I am getting this in console:
https://pastebin.com/raw/Pc55u0KX
After pasting the multiline text, press Enter on an empty line; the generator then stops, and each line is returned split into a list of words.
def multiline_input(sentinel=''):
    for inp in iter(input, sentinel):
        yield inp.split()
lis = list(multiline_input())
print(lis)
Source: Python: Multiline input converted into a list
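If the goal is to get the pasted paragraph back as one string rather than a list of word lists, the same iter(input, sentinel) idea can be reused. The following is a sketch under that assumption; read_paragraph and its read_line parameter are hypothetical names, and the parameter exists only so the function can be demonstrated without a console.

```python
def read_paragraph(read_line=input, sentinel=""):
    """Read lines until `sentinel` and join them into a single string.

    `read_line` is any zero-argument callable returning one line;
    it defaults to the built-in input().
    """
    lines = iter(read_line, sentinel)
    return " ".join(line.strip() for line in lines)

# Interactive use: paste the text, then press Enter on an empty line.
# paragraph = read_paragraph()

# Non-interactive demonstration with canned lines standing in for the console.
canned = iter(["Neque porro quisquam est,", "consectetur, adipisci velit.", ""])
paragraph = read_paragraph(lambda: next(canned))
print(paragraph)
```

Because an empty line is the sentinel, pasted text that itself contains a blank line would be cut off at that point.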

Python two text side by side with precise width

I've got an exercise where I have two texts, "left" and "right".
I need to write a function that places them side by side, given a width as a parameter, using only itertools and textwrap.
Here's my code :
import textwrap
import itertools
def sidebyside(left,right,width=79):
    width = round((width+1)/2)
    leftwrapped = textwrap.wrap(left,width = width-1)
    for i in range(0,len(leftwrapped)):
        leftwrapped[i] = leftwrapped[i].ljust(width)
    rightwrapped = textwrap.wrap(right,width = width-1)
    for i in range(0,len(rightwrapped)):
        rightwrapped[i] = rightwrapped[i].ljust(width)
    pipes = ["|"]*max(len(leftwrapped),len(rightwrapped))
    paragraph = itertools.zip_longest(leftwrapped,pipes,rightwrapped, fillvalue="".ljust(width))
    result = ""
    for a in paragraph:
        result = result + a[0] + a[1] + a[2] + "\n"
    return(result)
Here's a sample of "left" & "right" :
left = (
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
"Sed non risus. "
"Suspendisse lectus tortor, dignissim sit amet, "
"adipiscing nec, utilisez sed sin dolor."
)
right = (
"Morbi venenatis, felis nec pretium euismod, "
"est mauris finibus risus, consectetur laoreet "
"sem enim sed arcu. Maecenas sit amet eleifend sem. "
"Nullam ac libero metus. Praesent ac finibus nulla, vitae molestie dolor."
" Aliquam vestibulum viverra nisl, id porta mi viverra hendrerit."
" Ut et porta augue, et convallis ante."
)
My problem is that I'm getting some spacing issues. For example, for the first line, with a width of 20, I get this output:
'Lorem |Morbi ven '
But I need this output :
'Lorem |Morbi ven'
Found it: my use of round was wrong. I had to compute two widths, the first being round(width/2) and the second being width - round(width/2).
Talk is cheap, code is better :
from itertools import zip_longest
import textwrap
def sidebyside(left, right, width=79):
    mid_width = (width - (1 - width%2)) // 2
    return "\n".join(
        f"{l.ljust(mid_width)}|{r.ljust(mid_width)}"
        for l, r in zip_longest(
            *map(lambda t: textwrap.wrap("".join(t), mid_width), (left, right)),
            fillvalue=""
        )
    )
The goal of the original post was to solve a programming puzzle that required the sidebyside() method be implemented using only itertools.zip_longest() and textwrap.wrap().
This method works but has some disadvantages. For instance, it does not support line breaks within the left and right texts (all spacing is removed before the texts are reflowed to be side by side).
Since this is a really useful method, I wrote an improved version of it, see the Gist itself for more information.
For example:
# some random text
LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
# split into paragraphs
LOREM_PARA = LOREM.replace(". ", ".\n\n").split("\n")
# arbitrarily truncate the first two lines for text B
TEXT_A = LOREM_PARA[:]
TEXT_B = LOREM_PARA[2:]
# reflow as side-by-side
print(side_by_side(TEXT_A, TEXT_B, width=50, as_string=True))
will output:
Lorem ipsum dolor sit   | Ut enim ad minim
amet, consectetur       | veniam, quis nostrud
adipiscing elit, sed do | exercitation ullamco
eiusmod tempor          | laboris nisi ut aliquip
incididunt ut labore et | ex ea commodo
dolore magna aliqua.    | consequat.
                        |
Ut enim ad minim        | Duis aute irure dolor
veniam, quis nostrud    | in reprehenderit in
exercitation ullamco    | voluptate velit esse
laboris nisi ut aliquip | cillum dolore eu fugiat
ex ea commodo           | nulla pariatur.
consequat.              |
                        | Excepteur sint occaecat
Duis aute irure dolor   | cupidatat non proident,
in reprehenderit in     | sunt in culpa qui
voluptate velit esse    | officia deserunt mollit
cillum dolore eu fugiat | anim id est laborum.
nulla pariatur.         |
                        |
Excepteur sint occaecat |
cupidatat non proident, |
sunt in culpa qui       |
officia deserunt mollit |
anim id est laborum.    |
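The Gist's implementation is not reproduced here. As a rough sketch of how paragraph breaks could be preserved, the function below wraps each paragraph separately and keeps empty paragraphs as blank rows. The name side_by_side_paragraphs, the ' | ' separator and the column-width formula are assumptions made for this illustration, not the Gist's API.

```python
from itertools import zip_longest
import textwrap

def side_by_side_paragraphs(left, right, width=79):
    """Lay out two lists of paragraphs in two columns separated by ' | '."""
    col = (width - 3) // 2  # 3 characters are reserved for the ' | ' separator

    def wrap_all(paragraphs):
        lines = []
        for paragraph in paragraphs:
            # textwrap.wrap('') returns [], so keep empty paragraphs as blank lines
            lines.extend(textwrap.wrap(paragraph, col) or [""])
        return lines

    rows = zip_longest(wrap_all(left), wrap_all(right), fillvalue="")
    # rstrip() drops the trailing space when the right column is empty
    return "\n".join(f"{l.ljust(col)} | {r}".rstrip() for l, r in rows)

demo = side_by_side_paragraphs(["aa bb", "", "cc"], ["dd"], width=13)
print(demo)
```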

Counting how many different words have exactly the letter 'o' twice

I'm new to Apache Spark. Currently, I'm performing some statistical analysis on a text. I start by reading the text and storing it in a variable as follows:
loremRDD = sc.textFile(fileName, 8).map(removePunctuation).filter(lambda x: len(x)>0)
#First 10 lines
loremRDD.take(10)
The result is a PythonRDD[66] at RDD at PythonRDD.scala:53 variable. Below there are the first 10 lines of the text:
['aut minima deleniti et autem minus illo esse dolores eligendi corrupti dolore minima nostrum eos nobis nam nihil aspernatur nam ut quae sint laborum ut dolores error possimus aperiam consequatur',
'pariatur sed quo non itaque qui pariatur saepe ad quis consequatur nihil iste molestias et eos ut expedita vel reiciendis dolorem enim doloribus quam architecto aperiam',
'sed repudiandae pariatur similique est aut sequi animi in aperiam enim ipsa enim dolorem inventore aut quo odio in consequatur et',
'aspernatur ad esse et aliquid itaque dolores rerum quia commodi explicabo non magnam nostrum consectetur non sint eum nulla et aut quis doloribus itaque nulla molestiae quis est est quo facilis incidunt a ipsa in itaque sed aut nobis facere dignissimos atque unde cum ea vero',
'tenetur vel quod voluptatum laudantium dolores neque aut est modi qui aperiam itaque aperiam quae ratione doloremque aut delectus quas qui',
'qui placeat vel ipsam praesentium sint recusandae dicta minus praesentium omnis sequi a sed veritatis porro ab et officia esse commodi pariatur sequi cumque',
'mollitia facilis amet deleniti quia laborum commodi et molestias maxime quia dignissimos inventore neque libero deleniti ad quo corrupti numquam quis accusantium',
'architecto harum sunt et enim nisi commodi et id reprehenderit illum molestias illo facilis fuga eum illum quasi fugit qui',
'modi voluptatem quia et saepe inventore sed quo ea debitis explicabo vel perferendis commodi exercitationem sequi eum dolor cupiditate ab molestiae nemo ullam neque hic ipsa cupiditate dolor molestiae neque nam nobis nihil mollitia unde',
'voluptates quod in ipsum dicta fuga voluptatibus sint consequatur quod optio molestias nostrum repellendus consequatur aliquam fugiat provident omnis minus est quisquam exercitationem eum voluptas fugit quae eveniet perspiciatis assumenda maxime']
I need to know how many different words contain the letter 'o' exactly twice. For example, the word dolorem contains the letter 'o' twice.
So far, I've created distintWordsRDD, which stores all the distinct words contained in the text, as follows:
loremWordsRDD = loremRDD.flatMap(lambda x: x.split(' '))
distintWordsMapRDD = loremWordsRDD.map(lambda word: (word,1)).reduceByKey(lambda a,b:a+b)
distintWordsRDD=distintWordsMapRDD.keys().distinct()
# Showing 8 first words
print(distintWordsRDD.take(8))
The result of the 8 first words is:
['tempora', 'sapiente', 'vitae', 'nisi', 'quidem', 'consectetur', 'perferendis', 'debitis']
My problem is that I don't know how to retrieve from distintWordsRDD a list of the words that contain two 'o's.
The following should work:
your_text=' '.join(your_original_list_of_texts)
result=[i for i in your_text.split() if i.count('o')==2]
print(result)
['dolores', 'dolore', 'dolores', 'dolorem', 'doloribus', 'dolorem', 'odio']
However, the text that you provided is split into many subtexts ('sometext1', 'sometext2', 'sometext3', etc.), so it needs some additional work to bring it into a single text ('alltext').
If you provide the exact details of your input text, I will adjust the code so that it works with the input without additional manual work.
If you only have one string sentence:
results = set(word for word in sentence.split() if word.count('o') == 2)
If you have a list sentences of strings (which is what you show in your question):
results = set(
    word
    for sentence in sentences
    for word in sentence.split()
    if word.count('o') == 2
)
I'm using a set to remove duplicates from the results.
Output for the list of sentences in your example:
{'odio', 'dolorem', 'dolore', 'doloremque', 'dolor', 'doloribus', 'optio', 'commodi', 'porro', 'dolores'}
If you need a list, just convert the set to a list: results = list(results).
I managed to solve this problem doing the following:
results = distintWordsRDD.filter(lambda word: word.count('o')==2)
print (results.collect())
print(results.count())
Result:
['porro', 'odio', 'laboriosam', 'doloremque', 'doloribus', 'dolores', 'dolor', 'corporis', 'commodi', 'optio', 'dolorum', 'dolore', 'dolorem']
13
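The predicate used in all of the answers above is word.count('o') == 2, which means exactly two occurrences; a word with three 'o's is excluded. As a plain-Python sketch that can be run without Spark, the hypothetical helper below (the name words_with_letter_count and its parameters are assumptions) generalizes the letter and the count, and is tried on text taken from the third line of the question's sample:

```python
def words_with_letter_count(sentences, letter="o", times=2):
    """Return the distinct words in which `letter` occurs exactly `times` times."""
    return {
        word
        for sentence in sentences
        for word in sentence.split()
        if word.count(letter) == times
    }

# Text from the third line of the question's sample, split in two.
sample = [
    "sed repudiandae pariatur similique est aut sequi animi in aperiam",
    "enim ipsa enim dolorem inventore aut quo odio in consequatur et",
]
found = words_with_letter_count(sample)
print(sorted(found))  # ['dolorem', 'odio']
```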

How to create a word wrapping program in Python 3.6

I am trying to create a program that simulates word wrapping text found in programs like Word or Notepad. If I have a long text, I would like to print out 64 characters (or less) per line, followed by a newline return, without truncating words. Using Windows 10, PyCharm 2018.2.4 and Python 3.6, I've tried the following code:
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
concat_str = long_str[:64] # The first 64 characters
rest_str = long_str[65:] # The rest of the string
rest_str_len = len(rest_str)
while rest_str_len > 64:
    print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
    concat_str = rest_str[:64]
    rest_str = rest_str[65:]
    rest_str_len = len(rest_str)
print(concat_str.lstrip() + " (" + str(len(concat_str)) + ")" + "\n")
print(rest_str.lstrip() + " (" + str(len(rest_str)) + ")")
This is so close, but there are two problems. First, the code truncates off letters at the end or beginning of lines, such as the following output:
# I've added the total len() at the end of each line just to check-sum.
'Lorem ipsum dolor sit amet, consectetur adipiscing elit,sed do e (64)'
'usmod tempor incididunt ut labore et dolore magna aliqua. Ut enim (64)'
'ad minim veniam, quis nostrud exercitation ullamco laborisnisi u (64)'
'aliquip ex ea commodo consequat. Duis aute irure dolor inrepreh (64)'
'nderit in voluptate velit esse cillum dolore eu fugiat nulla par (64)'
'atur. Excepteur sint occaecat cupidatat non proident, sunt in cul (64)'
'a quiofficia deserunt mollit anim id est laborum. (49)'
The second problem is that I need the code to print a newline only after a whole word (or punctuation), instead of chopping up the word at 64 characters.
Use textwrap.wrap:
import textwrap
long_str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit," \
"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." \
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris" \
"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in" \
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur." \
"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui" \
"officia deserunt mollit anim id est laborum."
lines = textwrap.wrap(long_str, 64, break_long_words=False)
print('\n'.join(lines))
This takes a long string and splits it into lines of at most the given width. Setting break_long_words to False prevents words from being split (a single word longer than the width is then left whole on its own line).
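Two details of the question's code explain the lost letters: the slices [:64] and [65:] skip the character at index 64 on every pass, and adjacent string literals are joined as-is, so pieces without a trailing space run together ("elit,sed"). The sketch below uses a shortened text with the spaces added (an assumption about the intended input) and prints each wrapped line with its length, mirroring the check in the question:

```python
import textwrap

# Note the trailing space inside each literal: adjacent string literals are
# concatenated as-is, so without it words would run together ("elit,sed").
text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
    "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
    "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris "
    "nisi ut aliquip ex ea commodo consequat."
)

lines = textwrap.wrap(text, width=64, break_long_words=False)
for line in lines:
    # Print the length next to each line as a quick sanity check.
    print(f"{line} ({len(line)})")
```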
