Counting how many different words have exactly the letter 'o' twice - python

I'm new to Apache Spark. I'm currently performing some statistical analysis on a text. I start by reading the text and storing it in a variable as follows:
loremRDD = sc.textFile(fileName, 8).map(removePunctuation).filter(lambda x: len(x)>0)
#First 10 lines
loremRDD.take(10)
The result is a variable of type PythonRDD[66] at RDD at PythonRDD.scala:53. The first 10 lines of the text are shown below:
['aut minima deleniti et autem minus illo esse dolores eligendi corrupti dolore minima nostrum eos nobis nam nihil aspernatur nam ut quae sint laborum ut dolores error possimus aperiam consequatur',
'pariatur sed quo non itaque qui pariatur saepe ad quis consequatur nihil iste molestias et eos ut expedita vel reiciendis dolorem enim doloribus quam architecto aperiam',
'sed repudiandae pariatur similique est aut sequi animi in aperiam enim ipsa enim dolorem inventore aut quo odio in consequatur et',
'aspernatur ad esse et aliquid itaque dolores rerum quia commodi explicabo non magnam nostrum consectetur non sint eum nulla et aut quis doloribus itaque nulla molestiae quis est est quo facilis incidunt a ipsa in itaque sed aut nobis facere dignissimos atque unde cum ea vero',
'tenetur vel quod voluptatum laudantium dolores neque aut est modi qui aperiam itaque aperiam quae ratione doloremque aut delectus quas qui',
'qui placeat vel ipsam praesentium sint recusandae dicta minus praesentium omnis sequi a sed veritatis porro ab et officia esse commodi pariatur sequi cumque',
'mollitia facilis amet deleniti quia laborum commodi et molestias maxime quia dignissimos inventore neque libero deleniti ad quo corrupti numquam quis accusantium',
'architecto harum sunt et enim nisi commodi et id reprehenderit illum molestias illo facilis fuga eum illum quasi fugit qui',
'modi voluptatem quia et saepe inventore sed quo ea debitis explicabo vel perferendis commodi exercitationem sequi eum dolor cupiditate ab molestiae nemo ullam neque hic ipsa cupiditate dolor molestiae neque nam nobis nihil mollitia unde',
'voluptates quod in ipsum dicta fuga voluptatibus sint consequatur quod optio molestias nostrum repellendus consequatur aliquam fugiat provident omnis minus est quisquam exercitationem eum voluptas fugit quae eveniet perspiciatis assumenda maxime']
I need to know how many different words contain the letter 'o' exactly twice. For example, the word dolorem contains the letter 'o' twice.
So far, I've created distintWordsRDD, which stores all the different words contained in the text, as follows:
loremWordsRDD = loremRDD.flatMap(lambda x: x.split(' '))
distintWordsMapRDD = loremWordsRDD.map(lambda word: (word,1)).reduceByKey(lambda a,b:a+b)
distintWordsRDD=distintWordsMapRDD.keys().distinct()
# Showing 8 first words
print(distintWordsRDD.take(8))
The first 8 words are:
['tempora', 'sapiente', 'vitae', 'nisi', 'quidem', 'consectetur', 'perferendis', 'debitis']
My problem is that I don't know how to retrieve from distintWordsRDD a list of the words that have exactly two 'o's.

The following should work:
your_text=' '.join(your_original_list_of_texts)
result=[i for i in your_text.split() if i.count('o')==2]
print(result)
['dolores', 'dolore', 'dolores', 'dolorem', 'doloribus', 'dolorem', 'odio']
However, the text you provided is split into many substrings ('sometext1', 'sometext2', 'sometext3', etc.), so it needs some additional work to bring it into a single-string format ('alltext').
If you provide the exact details of your input text, I will adjust the code so that it works with your input without additional manual work.

If you only have one string sentence:
results = set(word for word in sentence.split() if word.count('o') == 2)
If you have a list sentences of strings (which is what you show in your question):
results = set(
    word
    for sentence in sentences
    for word in sentence.split()
    if word.count('o') == 2
)
I'm using a set to remove duplicates from the results.
Output for the list of sentences in your example:
{'odio', 'dolorem', 'dolore', 'doloremque', 'dolor', 'doloribus', 'optio', 'commodi', 'porro', 'dolores'}
If you need a list, just convert the set to a list: results = list(results).
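For reference, the set-based approach above can be wrapped into a small self-contained function. This is only a sketch: the name words_with_letter_count and its parameters are chosen here for illustration and do not come from the question, and the sample sentences are shortened excerpts.

```python
def words_with_letter_count(sentences, letter="o", times=2):
    """Return the distinct words containing `letter` exactly `times` times."""
    return {
        word
        for sentence in sentences
        for word in sentence.split()
        if word.count(letter) == times
    }


# Shortened sample sentences, used only for illustration.
sentences = [
    "aut minima deleniti et autem minus illo esse dolores eligendi",
    "sed quo non itaque dolorem enim doloribus quam architecto",
    "dolores rerum quia commodi explicabo non magnam nostrum",
]
matches = words_with_letter_count(sentences)
print(sorted(matches))  # the distinct matching words
print(len(matches))     # how many distinct words matched
```

Because the result is a set, a word such as dolores that appears in more than one sentence is counted only once.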

I managed to solve this problem doing the following:
results = distintWordsRDD.filter(lambda word: word.count('o')==2)
print (results.collect())
print(results.count())
Result:
['porro', 'odio', 'laboriosam', 'doloremque', 'doloribus', 'dolores', 'dolor', 'corporis', 'commodi', 'optio', 'dolorum', 'dolore', 'dolorem']
13

Related

How to split a text file into chunks?

I have tried many methods, but none has worked for me. I want to split the lines of a text file into multiple chunks, specifically 50 lines per chunk.
Like this: [['Line1', 'Line2', ... up to 50 lines], ...] and so on.
data.txt (example):
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Python code:
with open('data.txt', 'r') as file:
    sample = file.readlines()
chunks = []
for i in range(0, len(sample), 3): # replace 3 with 50 in your case
    chunks.append(sample[i:i+3]) # replace 3 with 50 in your case
chunks (in my example, chunks of 3 lines):
[['Line1\n', 'Line2\n', 'Line3\n'], ['Line4\n', 'Line5\n', 'Line6\n'], ['Line7\n', 'Line8']]
You can call rstrip('\n') on each of those lines to remove the \n at the end.
Alternative:
Without reading the whole file into memory at once (better):
chunks = []
with open('data.txt', 'r') as file:
    while True:
        chunk = []
        for i in range(3): # replace 3 with 50 in your case
            line = file.readline()
            if not line:
                break
            chunk.append(line)
            # or chunk.append(line.rstrip('\n')) to remove the '\n' at the end
        if not chunk:
            break
        chunks.append(chunk)
print(chunks)
This produces the same result.
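The same line-by-line idea can also be written as a generator, so that each chunk is handed out as soon as it is complete instead of being collected in a list. The sketch below uses itertools.islice; the function name read_in_chunks is an assumption made for this example, and a plain list stands in for the open file.

```python
from itertools import islice


def read_in_chunks(lines, chunk_size):
    """Yield lists of up to `chunk_size` lines, with trailing newlines removed."""
    iterator = iter(lines)
    while True:
        # islice pulls at most chunk_size items and leaves the rest untouched.
        chunk = [line.rstrip("\n") for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


# Any iterable of lines works; an open file object behaves the same way.
sample = ["Line1\n", "Line2\n", "Line3\n", "Line4\n", "Line5\n"]
for chunk in read_in_chunks(sample, 2):  # use 50 instead of 2 in your case
    print(chunk)
```

With a real file this would be used as for chunk in read_in_chunks(file, 50) inside the with block.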
A good way to do it would be to create a generic generator function that could break any sequence up into chunks of any size. Here's what I mean:
from itertools import zip_longest
def grouper(n, iterable): # Generator function.
    "s -> (s0, s1, ...sn-1), (sn, sn+1, ...s2n-1), (s2n, s2n+1, ...s3n-1), ..."
    FILLER = object() # Unique object
    for group in zip_longest(*([iter(iterable)]*n), fillvalue=FILLER):
        limit = group.index(FILLER) if group[-1] is FILLER else len(group)
        yield group[:limit] # Sliced to remove any filler.
if __name__ == '__main__':
    from pprint import pprint
    with open('lorem ipsum.txt') as inf:
        for chunk in grouper(3, inf):
            pprint(chunk, width=90)
If the lorem ipsum.txt file contained these lines:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.
In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In
elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,
dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor
facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.
Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras
pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex
arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.
The result will be the following chunks each composed of 3 lines or less:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.\n',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In\n',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,\n')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor\n',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.\n',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras\n')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex\n',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.\n')
Update
If you want to remove the newline characters from the end of the lines of the file, you could do it with the same generic grouper() function by passing it a generator expression to preprocess the lines being read without needing to read them all into memory first:
if __name__ == '__main__':
    from pprint import pprint
    with open('lorem ipsum.txt') as inf:
        lines = (line.rstrip() for line in inf) # Generator expression (note the outer parentheses).
        for chunk in grouper(3, lines):
            pprint(chunk, width=90)
Output using generator expression:
('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis ut volutpat sem.',
'In felis nibh, efficitur id orci tristique, sollicitudin rhoncus nibh. In',
'elementum suscipit est, et varius mi aliquam ac. Duis fringilla neque urna,')
('dapibus volutpat ex ullamcorper eget. Duis in mauris vitae neque porttitor',
'facilisis. Nulla ornare leo ac nibh facilisis, in feugiat eros accumsan.',
'Suspendisse elementum elementum libero, sed tempor ex sollicitudin ac. Cras')
('pharetra, neque eu porttitor mattis, odio quam interdum diam, quis aliquam ex',
'arcu non nisl. Duis consequat lorem metus. Mauris vitae ex ante. Duis vehicula.')
You can split the text by each newline using the str.splitlines() method. Then, using a list comprehension, you can use list slices to slice the list at increments of the chunk_size (50 in your case). Below, I used 3 as the chunk_size variable, but you can replace that with 50:
text = '''Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.'''
lines = text.splitlines()
chunk_size = 3
chunks = [lines[i: i + chunk_size] for i in range(0, len(lines), chunk_size)]
print(chunks)
Output:
[['Lorem ipsum dolor sit amet, consectetur adipiscing elit,', 'sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'Ut enim ad minim veniam,'],
['quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate', 'velit esse cillum dolore eu fugiat nulla pariatur.'],
['Excepteur sint occaecat cupidatat non proident,', 'sunt in culpa qui officia deserunt mollit anim id est laborum.']]

Python user paragraph input

How can I copy and paste this as user input in Python?
"Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet,
consectetur, adipisci velit, sed quia non numquam eius modi tempora
incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut
enim ad minima veniam, quis nostrum exercitationem ullam corporis
suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?
Quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse quam nihil molestiae consequatur, vel illum qui dolorem eum
fugiat quo voluptas nulla pariatur?"
Edit:
input() doesn't work, I am getting this in console:
https://pastebin.com/raw/Pc55u0KX
After pasting the multiline text, press Enter on an empty line; the input is then read and each line is split into a list of words.
def multiline_input(sentinel=''):
    for inp in iter(input, sentinel):
        yield inp.split()
lis = list(multiline_input())
print(lis)
Source: Python: Multiline input converted into a list
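If the goal is one string for the whole paragraph rather than a list of word lists, the same iter(input, sentinel) idea can be adapted. The following is a minimal sketch under that assumption; read_paragraph and its read_line parameter are hypothetical names introduced here so the function can be tried without a real prompt.

```python
def read_paragraph(read_line=input, sentinel=""):
    """Collect lines until `sentinel` is entered, then join them into one string."""
    lines = []
    # iter(callable, sentinel) keeps calling read_line() until it returns sentinel.
    for line in iter(read_line, sentinel):
        lines.append(line.strip())
    return " ".join(lines)


# Simulate a pasted paragraph followed by an empty line (no real prompt involved).
pasted = iter(["Neque porro quisquam est,", "qui dolorem ipsum", ""])
text = read_paragraph(read_line=lambda: next(pasted))
print(text)
```

In interactive use, calling read_paragraph() with no arguments reads from input(). Note that a blank line inside the pasted text would end the input early, since the empty string is the sentinel.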

Extract Keys and Values from text using regular expressions

I have a big number of strings that I need to parse. These strings contain information that is put in key-value pairs.
Sample input text:
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
Key information:
A key starts either at the beginning of the string or after a period (\.)
A key ends always with :
The key is immediately followed by a value
This value continues until the next key or until the last symbol in the string
There can be multiple key-value pairs, and I don't know how many in advance
Expected Output
{
"Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
"Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}
The regex that I have so far is ([üöä\w\s]*)\: (.*?)\.. Suffice it to say it doesn't provide the expected output.
This regex ([^:.]+):\s*([^:]+)(?=\.\s+|$) does the job.
Demo & explanation
You can match the following regular expression, which saves the keys and values to capture groups 1 and 2.
r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'
Start your engine! | Python code
Python's regex engine performs the following operations.
(?<![^.])    : negative lookbehind asserts current location is not
               preceded by a character other than '.'
\ *          : match 0+ spaces
(            : begin capture group 1
  [^.]+?     : match 1+ characters other than '.', lazily
  :          : match ':'
)            : end capture group 1
\ *          : match 0+ spaces
(            : begin capture group 2
  (?:        : begin non-capture group
    (?!\. )  : negative lookahead asserts current position is not
               followed by a period followed by a space
    .        : match any character other than a line terminator
  )+         : end non-capture group and execute 1+ times
)            : end capture group 2
This uses the tempered greedy token technique, which matches a series of individual characters that do not begin an unwanted string. For example, if the string were "concatenate", (?:(?!cat).)+ would match the first three letters but not the second 'c', so the match would be 'con'.
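As a rough illustration of how the pattern explained above could be applied, the sketch below feeds it to re.findall and builds a dictionary. The helper name extract_pairs and the shortened sample string are assumptions made for this example; the pattern itself is copied unchanged.

```python
import re

# Pattern copied from the answer; group 1 is the key (including ':'), group 2 the value.
PATTERN = re.compile(r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)')


def extract_pairs(text):
    """Return {key: value} for every match of PATTERN in `text`."""
    return {key.rstrip(":"): value for key, value in PATTERN.findall(text)}


# Shortened sample for illustration, not the full input from the question.
sample = (
    "Sed ut perspiciatis unde omnis. "
    "Nemo enim: ipsam voluptatem quia voluptas. "
    "Ut enim: ad minima veniam, quis nostrum"
)
pairs = extract_pairs(sample)
for key, value in pairs.items():
    print(f"{key!r} -> {value!r}")
```

Note that with this pattern a value ends at the first period followed by a space, so a value that spans several sentences would be cut off after its first sentence.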
Just for fun, here's a python, non-regex solution:
latin = """[the sample input text]"""
new_lat = latin.replace(":","xxx:").split('xxx')
for l in new_lat:
    if ":" in l:
        curr_ind = new_lat.index(l)
        cur_brek = l.rfind('. ')
        prev_brek = new_lat[curr_ind-1].rfind('. ')
        stub = new_lat[curr_ind-1][prev_brek+2:]
        new_l = stub+l[:cur_brek]
        print(new_l)
Output is the two text blocks starting from the key.

How to print very long string without wrapping in Jupyter notebook?

Example:
print("Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.")
Outputs:
Sed ut perspiciatis unde omnis iste natus error sit voluptatem
accusantium doloremque laudantium, totam rem aperiam, eaque
ipsa quae ab illo inventore veritatis et quasi architecto
beatae vitae dicta sunt explicabo.
Desired output:
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.
See this GitHub issue:
https://github.com/jupyter/notebook/issues/106
Run jupyter --config-dir in your command prompt (or equivalent) to find your Jupyter settings location.
Create a folder named nbconfig inside the settings location (/.jupyter). Inside it, create a file notebook.json with the following content:
{
  "MarkdownCell": {
    "cm_config": {
      "lineWrapping": false
    }
  },
  "CodeCell": {
    "cm_config": {
      "lineWrapping": false
    }
  }
}
Restart Jupyter, reload the notebook, and try again.
You need HTML output that has some CSS applied. Here is the code that you may try:
import IPython.display as dp
long_txt = "Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo."
outp = dp.HTML("<style>.nowrap{white-space:nowrap;}</style><span class='nowrap'>" +long_txt+ "</span>")
Now you get outp as an HTML object. You can render it and get the long one-line text.
outp
The output will be:
Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.
Hope it helps.
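One caveat with building the HTML by string concatenation is that characters such as '<' or '&' in the text would be interpreted as markup. A small variation, sketched below, escapes the text first with the standard html module; the function name nowrap_html is an assumption made for this example, and the rendering step is left as a comment because it needs a running notebook.

```python
import html


def nowrap_html(text):
    """Build an HTML snippet that keeps `text` on a single, unwrapped line."""
    escaped = html.escape(text)  # so characters like '<' or '&' display literally
    return (
        "<style>.nowrap{white-space:nowrap;}</style>"
        f"<span class='nowrap'>{escaped}</span>"
    )


snippet = nowrap_html("Sed ut perspiciatis <unde> omnis & iste natus error")
print(snippet)
# In a notebook cell, the snippet could then be rendered with:
#   from IPython.display import HTML
#   HTML(snippet)
```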

.dropna() doesn't remove all NaN from pandas Dataframe

I have some code in which I filter out stop words and special characters. dropna() filters out most of the existing NaN values, but the line cleaner = clean.str.replace('\#|\|\_|\!|\.|\^|\:|\(|\)|\-|\?|\!|\,','') creates some new NaN values in the CSV file (some lines consist only of special characters), and these aren't filtered out. How can I filter these out as well?
import pandas as pd
from stop_words import get_stop_words
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1")
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']]
lowertext = usertext['Data'].map(lambda x: x if type(x)!=str else x.lower())
nl_stop_words = get_stop_words('dutch')
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in nl_stop_words])
clean = lowertext.str.replace(stop_words_pat, '')
cleaner = clean.str.replace('\#|\|\_|\!|\.|\^|\:|\(|\)|\-|\?|\!|\,','')
render = pd.concat([cleaner, usertext['chatid']], axis=1)
#print(render)
#print(type(render))
final= render.dropna(how='any')
final.to_csv("F:/textclustering/data/filteredtext.csv", sep=',',index=False, encoding="iso-8859-1")
df2 = pd.read_csv("F:/textclustering/data/filteredtext.csv", encoding="iso-8859-1")
print(df2)
UPDATE: Raw Data
"Agent","Chat.Event","Role","Data","chatid"
Chat ID: ^^^^^^,,,"",1
x,Agent Accepted,Lead,"Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur",1
x,Engagement Participant Entered,Lead,,1
No Value,End-user Post,End-user,"At vero eos et accusamus et iusto odio dignissimos ducimus",1
x,Agent Post,Lead,"Itaque earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur aut perferendis doloribus asperiores repellat.",1
No Value,End-user Post,End-user,"Et harum quidem rerum!",1
x,Agent Post,Lead,"omnis voluptas assumenda est",1
No Value,End-user Post,End-user,"assumenda est",1
x,Agent Post,Lead,"Nam libero tempore?",1
x,Agent Post,Lead,"Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",1
x,Agent Post,Lead,"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed?",1
No Value,End-user Post,End-user,"^^########",1
(I have replaced the Dutch text with lorem ipsum for privacy reasons.)
The last line stays NaN.
