Reformat Beautiful Soup Output to include CSS - python

I am trying to parse the text of emails with Python to speed up my workflow. I first save the email as a .htm file on my local drive. Then I want to pull certain pieces of information out of a table within the email using Jupyter Notebook. Whenever I create the soup, the result is text with a space between every character, so I cannot use the soup to make proper HTML calls to pull data. How can I reformat the soup?
The .htm file is already text, but I would still like to use Beautiful Soup to help me parse through the text field. Should I be trying a different parse method?
from bs4 import BeautifulSoup
raw_file = open(r"C:\Users\Desktop\Example.htm").read()
soup = BeautifulSoup(raw_file, 'lxml')
print(soup)
I expected a nicely formatted soup; instead, this is what the print statement returns:
<html><body>
<p>ÿþh t m l x m l n s : v = " u r n : s c h e m a s - m i c r o s o f t - c o m : v m l "
x m l n s : o = " u r n : s c h e m a s - m i c r o s o f t - c o m : o f f i c e : o f f i c e "
x m l n s : w = " u r n : s c h e m a s - m i c r o s o f t - c o m : o f f i c e : w o r d "
x m l n s : m = " h t t p : / / s c h e m a s . m i c r o s o f t . c o m / o f f i c e / 2 0 0 4 / 1 2 / o m m l "
x m l n s = " h t t p : / / w w w . w 3 . o r g / T R / R E C - h t m l 4 0 " >
h e a d >
m e t a h t t p - e q u i v = C o n t e n t - T y p e c o n t e n t = " t e x t / h t m l ; c h a r s e t = u n i c o d e " >
m e t a n a m e = P r o g I d c o n t e n t = W o r d . D o c u m e n t >
m e t a n a m e = G e n e r a t o r c o n t e n t = " M i c r o s o f t W o r d 1 5 " >
m e t a n a m e = O r i g i n a t o r c o n t e n t = " M i c r o s o f t W o r d 1 5 " >
b a s e
When I call -
print(raw_file)
the following returns:
ÿþ<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 15">
<meta name=Originator content="Microsoft Word 15">
<base
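The ÿþ at the start of the raw text is what the two bytes FF FE, the UTF-16 little-endian byte order mark, look like when decoded with a single-byte codec, and the apparent spaces between the characters are the NUL bytes of the two-byte encoding. The sketch below reproduces this on an in-memory stand-in for the saved email (the `html` string and all variable names are illustrative assumptions, not the actual file) and shows that decoding as UTF-16 recovers the markup.

```python
import codecs

# Illustrative stand-in for the saved email; the real file is not available here.
html = '<html><head><meta name=Generator content="Microsoft Word 15"></head></html>'

# A "unicode" export of this kind: UTF-16 LE preceded by a byte order mark (FF FE).
raw_bytes = codecs.BOM_UTF16_LE + html.encode("utf-16-le")

# Decoding with a single-byte codec turns the BOM into "ÿþ" and leaves a NUL
# character after every ASCII character, which displays as spaced-out text.
misread = raw_bytes.decode("latin-1")

# Decoding as UTF-16 consumes the BOM and restores the original markup.
decoded = raw_bytes.decode("utf-16")

print(repr(misread[:12]))
print(decoded)
```

If the file really is UTF-16, the equivalent change to the code in the question would probably be to pass `encoding="utf-16"` to `open()` before handing the text to BeautifulSoup.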

Related

Different slices give different inequalities for same elements

import numpy as np
a = np.array([.4], dtype='float32')
b = np.array([.4, .6])
print(a > b)
print(a > b[0], a > b[1])
print(a[0] > b[0], a[0] > b[1])
[ True False]
[False] [False]
True False
What's the deal? Yes, b.dtype == 'float64', but so are its slices b[0] & b[1], and a remains 'float32'.
Note: I'm asking why this occurs, not how to circumvent it, which I know (e.g. cast both to 'float64').
As I've noted in another answer, type casting in numpy is pretty complicated, and this is the root cause of the behaviour you are seeing. The documents linked in that answer make it clear that scalars (and 0d arrays) are treated differently from 1d arrays in type conversions, since the latter aren't considered value by value.
The first half of the problem you already know: the problem is that type conversion happens differently for your two cases:
>>> (a + b).dtype
dtype('float64')
>>> (a + b[0]).dtype
dtype('float32')
>>> (a[0] + b[0]).dtype
dtype('float64')
There's also a helper called numpy.result_type() that can tell you the same information without having to perform the binary operation:
>>> np.result_type(a, b)
dtype('float64')
>>> np.result_type(a, b[0])
dtype('float32')
>>> np.result_type(a[0], b[0])
dtype('float64')
I believe we can understand what's happening in your example if we consider the type conversion tables:
>>> from numpy.testing import print_coercion_tables
>>> print_coercion_tables()
can cast
[...]
In these tables, ValueError is '!', OverflowError is '@', TypeError is '#'
scalar + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b h i l q l h i l d d d e f d g F D G # # # O ! m
h h h h i l q l h i l d d d f f d g F D G # # # O ! m
i i i i i l q l i i l d d d d d d g D D G # # # O ! m
l l l l l l q l l l l d d d d d d g D D G # # # O ! m
q q q q q q q q q q q d d d d d d g D D G # # # O ! m
p l l l l l q l l l l d d d d d d g D D G # # # O ! m
B B h h i l q l B H I L Q L e f d g F D G # # # O ! m
H H i i i l q l H H I L Q L f f d g F D G # # # O ! m
I I l l l l q l I I I L Q L d d d g D D G # # # O ! m
L L d d d d d d L L L L Q L d d d g D D G # # # O ! m
Q Q d d d d d d Q Q Q Q Q Q d d d g D D G # # # O ! m
P L d d d d d d L L L L Q L d d d g D D G # # # O ! m
e e e f d d d d e f d d d d e f d g F D G # # # O ! #
f f f f d d d d f f d d d d f f d g F D G # # # O ! #
d d d d d d d d d d d d d d d d d g D D G # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F D D D D F F D D D D F F D G F D G # # # O ! #
D D D D D D D D D D D D D D D D D G D D G # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
scalar + neg scalar
[...]
array + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b b b b b b b b b b b b e f d g F D G # # # O ! m
h h h h h h h h h h h h h h f f d g F D G # # # O ! m
i i i i i i i i i i i i i i d d d g D D G # # # O ! m
l l l l l l l l l l l l l l d d d g D D G # # # O ! m
q q q q q q q q q q q q q q d d d g D D G # # # O ! m
p l l l l l l l l l l l l l d d d g D D G # # # O ! m
B B B B B B B B B B B B B B e f d g F D G # # # O ! m
H H H H H H H H H H H H H H f f d g F D G # # # O ! m
I I I I I I I I I I I I I I d d d g D D G # # # O ! m
L L L L L L L L L L L L L L d d d g D D G # # # O ! m
Q Q Q Q Q Q Q Q Q Q Q Q Q Q d d d g D D G # # # O ! m
P L L L L L L L L L L L L L d d d g D D G # # # O ! m
e e e e e e e e e e e e e e e e e e F F F # # # O ! #
f f f f f f f f f f f f f f f f f f F F F # # # O ! #
d d d d d d d d d d d d d d d d d d D D D # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F F F F F F F F F F F F F F F F F F # # # O ! #
D D D D D D D D D D D D D D D D D D D D D # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
[...]
The above is part of the current promotion tables for value-based promotion. It denotes how differing types contribute to a result type when pairing two numpy objects of a given kind (see the first column and first row for the specific types). The types are to be understood according to the single-character dtype specifications (below "One-character strings"), in particular np.dtype('f') corresponds to np.float32 (f for C-style float) and np.dtype('d') (d for C-style double) to np.float64 (see also np.typename('f') and the same for 'd').
Two entries in the above tables matter here (row f, column d in each table):
scalar f + scalar d --> d
array f + scalar d --> f
Now let's look at your cases. The premise is that you have an 'f' array a and a 'd' array b. The fact that a only has a single element doesn't matter: it's a 1d array with length 1 rather than a 0d array.
When you do a > b you are comparing two arrays, a case the above tables do not cover. I'm not sure what the behaviour is here; my guess is that a gets broadcast to b's shape and then its type is cast to 'd'. The reason I think this is that np.can_cast(a, np.float64) is True and np.can_cast(b, np.float32) is False. But this is just a guess; a lot of this machinery in numpy is not intuitive to me.
When you do a > b[0] you are comparing a 'f' array to a 'd' scalar, so according to the above you get a 'f' array. That's what (a + b[0]).dtype told us. (When you use a > b[0] you don't see the conversion step, because the result is always a bool.)
When you do a[0] > b[0] you are comparing a 'f' scalar to a 'd' scalar, so according to the above you get a 'd' scalar. That's what (a[0] + b[0]).dtype told us.
So I believe this is all consistent with the quirks of type conversion in numpy. While it might seem like an unfortunate corner case with the value of 0.4 in double and single precision, this feature goes deeper and the problem serves as a big red warning that you should be very careful when mixing different dtypes.
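To see why the precision in which the comparison is carried out changes the answer for 0.4, here is a minimal sketch that uses only the standard library's `struct` module to round to single precision, so it does not depend on the promotion rules of any particular NumPy version. The helper name `to_float32` is an assumption made for illustration.

```python
import struct

def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

a0 = to_float32(0.4)   # the value actually stored in the float32 array
b0 = 0.4               # the value stored in the float64 array

# Comparison carried out in double precision ('d'): the float32 value is
# widened unchanged, and it is slightly larger than the double 0.4.
compared_as_double = a0 > b0

# Comparison carried out in single precision ('f'): the double is first
# rounded to float32, which yields exactly the same value as a0.
compared_as_single = a0 > to_float32(b0)

print(a0, compared_as_double, compared_as_single)
```

The first result corresponds to the cases where the result type is 'd', the second to the case where it is 'f'.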
The safest course of action is to convert your types yourself in order to control what happens in your code. Especially since there's discussion about reconsidering some aspects of type promotion.
As a side note (for now), there's a work-in-progress NEP 50 created in May 2021 that explains how confusing type promotion can be when scalars are involved, and plans to simplify some of the rules eventually. Since this also involves breaking changes, its implementation in NumPy proper won't happen overnight.

Printing the inverse pyramid using alphabets

I need to print a pattern like this:
C E G I K
D F H J
E G I
F H
G
Here is my code:
Someone please correct this code for me.
alpha=ord('C')
for i in range(5,0,-1):
    for j in range(i):
        print(chr(alpha+2),end="")
    print('')
My current output is:
E E E E E
E E E E
E E E
E E
E
You can add an offset to the ordinal number of C based on the line number and the character number:
for i in range(5):
    print(*(chr(ord('C') + i + j * 2) for j in range(5 - i)), sep=' ')
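The original loop printed only E because `alpha+2` depends on neither `i` nor `j`. As a sketch of the same offset idea in a more spelled-out form, the function below builds the rows as strings; the name `inverse_pyramid` and its parameters are made up for illustration.

```python
def inverse_pyramid(start="C", rows=5, step=2):
    """Build the pattern rows; each row starts one letter later and is one item shorter."""
    lines = []
    for i in range(rows):
        # Row i starts at start + i; column j adds step * j on top of that.
        letters = [chr(ord(start) + i + step * j) for j in range(rows - i)]
        lines.append(" ".join(letters))
    return lines

pattern_lines = inverse_pyramid()
print("\n".join(pattern_lines))
```

Returning the rows instead of printing them directly makes the pattern easy to check or reuse.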

Why is my file seemingly being read incorrectly?

In Python I want to read from a large file:
def aggregate(file_input):
    import fileinput
    reviews = []
    with open(file_input.replace(".txt", "_aggregated.txt"), "w") as outp:
        currComp = ""
        outp.write("Business;Stars_In_Sequence")
        for line in fileinput.input(file_input):
            reviews.append(MyReview(line))
            if(currComp != reviews[-1].getCompany()):
                currComp = reviews[-1].getCompany()
                outp.write("\n" + currComp + ";" + reviews[-1].getStars())
                outp.flush()
            else:
                outp.write(reviews[-1].getStars())
                outp.flush()
The file looks like this:
Business;User;Review_Stars;Date;Length;Votes_Cool;Votes_Funny;Votes_Useful;
0DI8Dt2PJp07XkVvIElIcQ;jkrzTC5P5QGJRoKECzcleQ;5;2014-03-11;421;0;1;0
0DI8Dt2PJp07XkVvIElIcQ;cK78PTjb65kdmRL9BnEdoQ;5;2014-03-29;190;0;1;0
and works fine if I use only a small part of the file, returning the right output:
Business;Stars_In_Sequence
Business;R
0DI8Dt2PJp07XkVvIElIcQ;55555455555555515
LTlCaCGZE14GuaUXUGbamg;555555555
EDqCEAGXVGCH4FJXgqtjqg;3324133
However, if I use the original file it returns this, and I can't figure out why:
Business;Stars_In_Sequence
ÿþB u s i n e s s ;
0 D I 8 D t 2 P J p 0 7 X k V v I E l I c Q ;
L T l C a C G Z E 1 4 G u a U X U G b a m g ;
E D q C E A G X V G C H 4 F J X g q t j q g ;
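The leading ÿþ and the gap after every character suggest that the original file is UTF-16 with a byte order mark, while the smaller excerpt was presumably saved in a single-byte encoding. Here is a hedged sketch of reading such a file with an explicit encoding. The function name `aggregate_stars` and the dictionary-based grouping are assumptions for illustration and replace the `MyReview` class, which is not shown in the question.

```python
import os
import tempfile

def aggregate_stars(path, encoding="utf-16"):
    """Map each business ID to its review stars, in file order."""
    stars = {}
    with open(path, encoding=encoding) as handle:
        next(handle)  # skip the header row
        for line in handle:
            fields = line.strip().split(";")
            if len(fields) < 3:
                continue  # ignore blank or malformed lines
            business, review_stars = fields[0], fields[2]
            stars[business] = stars.get(business, "") + review_stars
    return stars

# Build a small UTF-16 sample from the rows shown in the question.
sample = (
    "Business;User;Review_Stars;Date;Length;Votes_Cool;Votes_Funny;Votes_Useful;\n"
    "0DI8Dt2PJp07XkVvIElIcQ;jkrzTC5P5QGJRoKECzcleQ;5;2014-03-11;421;0;1;0\n"
    "0DI8Dt2PJp07XkVvIElIcQ;cK78PTjb65kdmRL9BnEdoQ;5;2014-03-29;190;0;1;0\n"
)
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "reviews.txt")
    with open(sample_path, "w", encoding="utf-16") as handle:
        handle.write(sample)
    result = aggregate_stars(sample_path)

print(result)
```

Unlike the original loop, this groups by business ID even when the rows of one business are not adjacent; if the file's sequence matters, the original change-detection logic can be kept and only the `open(..., encoding="utf-16")` part adopted.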

regex pattern won't return in python script

Why does the first snippet return digits, but the second does not? I have tried more complicated expressions without success. The expressions I use are valid according to pythex.org, but do not work in the script.
((\d{6}-){7}\d{6}) is one such expression. I've tested it against this string: 123138-507716-007469-173316-534644-033330-675057-093280
import re
pattern = re.compile('(\d{1})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
==============
import re
pattern = re.compile('(\d{6})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
When I put the string into a variable and then search the variable it works just fine. This should work as is. But it doesn't help if I want to read a text file. I've tried to read each line of the file and that seems to be where the script breaks down.
import re
pattern = re.compile('((\d{6}-){7})')
#pattern = re.compile('(\d{6})')
#load_file = open('demo.txt', 'r')
#search_file = load_file.read()
test_string = '123138-507716-007469-173316-534644-033330-675057-093280'
result = pattern.findall(test_string)
print(result)
=========
printout,
Search File:
ÿþB i t L o c k e r D r i v e E n c r y p t i o n R e c o v e r y K e y
T h e r e c o v e r y k e y i s u s e d t o r e c o v e r t h e d a t a o n a B i t L o c k e r p r o t e c t e d d r i v e .
T o v e r i f y t h a t t h i s i s t h e c o r r e c t r e c o v e r y k e y c o m p a r e t h e i d e n t i f i c a t i o n w i t h w h a t i s p r e s e n t e d o n t h e r e c o v e r y s c r e e n .
R e c o v e r y k e y i d e n t i f i c a t i o n : f f s d f a - f s d f - s f
F u l l r e c o v e r y k e y i d e n t i f i c a t i o n : 8 8 8 8 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 8 8 8 8 8 8 8
B i t L o c k e r R e c o v e r y K e y :
1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1
6 6 6 6 6 6
Search Results:
[]
Process finished with exit code 0
================
This is where I ended up. It finds the string just fine and without the commas.
import re
pattern = re.compile('(\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6})')
load_file = open('demo3.txt', 'r')
for line in load_file:
    print(pattern.findall(line))
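The printout of the search file starts with ÿþ and shows a gap after every character, which points to a UTF-16 file being read with a single-byte encoding. In that case every digit is followed by a NUL character, so `\d{1}` still matches while `\d{6}` never finds six digits in a row. The sketch below reproduces this in memory with the test string from the question; the byte string is a stand-in for demo.txt, not its real content.

```python
import codecs
import re

key = "123138-507716-007469-173316-534644-033330-675057-093280"

# Stand-in for demo.txt: UTF-16 LE text preceded by a byte order mark.
text = "BitLocker Recovery Key:\r\n" + key + "\r\n"
raw_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

pattern = re.compile(r"(?:\d{6}-){7}\d{6}")

# Read with a single-byte codec: a NUL separates all characters.
misread = raw_bytes.decode("latin-1")
single_digits = re.findall(r"\d", misread)   # individual digits still match
found_misread = pattern.findall(misread)     # no run of six digits exists

# Read as UTF-16: the digits are adjacent again and the pattern matches.
decoded = raw_bytes.decode("utf-16")
found_decoded = pattern.findall(decoded)

print(len(single_digits), found_misread, found_decoded)
```

For a real file, passing `encoding="utf-16"` to `open()` would be the corresponding change, assuming the file is in fact UTF-16.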

Python: Split a mixed String

I read some lines from a file in the following form:
line = a b c d,e,f g h i,j,k,l m n
What I want is lines without the ","-separated elements, e.g.,
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
. . . . . . . . .
. . . . . . . . .
First I would split line
sline = line.split()
Now I would iterate over sline and look for elements that can be split with "," as the separator. The problem is that I don't always know how many of those elements to expect.
Any ideas?
Using regex, itertools.product and some string formatting:
This solution preserves the initial spacing as well.
>>> import re
>>> from itertools import product
>>> line = 'a b c d,e,f g h i,j,k,l m n'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w+)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
>>> for prod in product(*items):
...     print(strs.format(*prod))
...
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
Another example:
>>> line = 'a b c d,e,f g h i,j,k,l m n q,w,e,r f o o'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w+)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
>>> for prod in product(*items):
...     print(strs.format(*prod))
...
a b c d g h i m n q f o o
a b c d g h i m n w f o o
a b c d g h i m n e f o o
a b c d g h i m n r f o o
a b c d g h j m n q f o o
a b c d g h j m n w f o o
a b c d g h j m n e f o o
a b c d g h j m n r f o o
a b c d g h k m n q f o o
a b c d g h k m n w f o o
a b c d g h k m n e f o o
a b c d g h k m n r f o o
a b c d g h l m n q f o o
a b c d g h l m n w f o o
a b c d g h l m n e f o o
a b c d g h l m n r f o o
a b c e g h i m n q f o o
a b c e g h i m n w f o o
a b c e g h i m n e f o o
a b c e g h i m n r f o o
a b c e g h j m n q f o o
a b c e g h j m n w f o o
a b c e g h j m n e f o o
a b c e g h j m n r f o o
a b c e g h k m n q f o o
a b c e g h k m n w f o o
a b c e g h k m n e f o o
a b c e g h k m n r f o o
a b c e g h l m n q f o o
a b c e g h l m n w f o o
a b c e g h l m n e f o o
a b c e g h l m n r f o o
a b c f g h i m n q f o o
a b c f g h i m n w f o o
a b c f g h i m n e f o o
a b c f g h i m n r f o o
a b c f g h j m n q f o o
a b c f g h j m n w f o o
a b c f g h j m n e f o o
a b c f g h j m n r f o o
a b c f g h k m n q f o o
a b c f g h k m n w f o o
a b c f g h k m n e f o o
a b c f g h k m n r f o o
a b c f g h l m n q f o o
a b c f g h l m n w f o o
a b c f g h l m n e f o o
a b c f g h l m n r f o o
Your question is not really clear. If you want to strip off any part after commas (as your text suggests), then a fairly readable one-liner should do:
cleaned_line = " ".join([field.split(",")[0] for field in line.split()])
If you want to expand lines containing comma-separated fields into multiple lines (as your example suggests), then you should use the itertools.product function:
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in line.split()]
for expanded_line_fields in itertools.product(*line_fields):
    print(" ".join(expanded_line_fields))
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
If it's important to keep the original spacing for some reason, then you can replace line.split() with re.findall("([^ ]+| +)", line):
import re
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in re.findall("([^ ]+| +)", line)]
for expanded_line_fields in itertools.product(*line_fields):
    print("".join(expanded_line_fields))
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
If I have understood your example correctly, you need the following:
import itertools
sss = "a b c d,e,f g h i,j,k,l m n d,e,f "
coma_separated = [i for i in sss.split() if ',' in i]
spited_coma_separated = [i.split(',') for i in coma_separated]
symbols = (i for i in itertools.product(*spited_coma_separated))
# use a generator expression to save memory
for s in symbols:
    st = sss
    for part, symb in zip(coma_separated, s):
        # Replace only the first occurrence, so that a repeated
        # comma-separated group is not replaced twice
        st = st.replace(part, symb, 1)
    print (st.split()) # for python3 compatibility
Most other answers only produce one line instead of the multiple lines you seem to want.
To achieve what you want, you can work in several ways.
The recursive solution seems the most intuitive to me:
def dothestuff(l):
    for n, i in enumerate(l):
        if ',' in i:
            # found a "," entry
            items = i.split(',')
            for j in items:
                for rest in dothestuff(l[n+1:]):
                    yield l[:n] + [j] + rest
            return
    yield l
line = "a b c d,e,f g h i,j,k,l m n"
for i in dothestuff(line.split()):
    print(i)
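# Strip every ",x" pair; re-check the same index after each removal,
# because the line gets shorter
i = 0
while i < len(line) - 1:
    if line[i] == ',':
        line = line.replace(line[i]+line[i+1], '')
    else:
        i += 1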
import itertools
line_data = 'a b c d,e,f g h i,j,k,l m n'
comma_fields_indices = [i for i,val in enumerate(line_data.split()) if "," in val]
comma_fields = [i.split(",") for i in line_data.split() if "," in i]
all_comb = []
for val in itertools.product(*comma_fields):
    sline_data = line_data.split()
    for index,word in enumerate(val):
        sline_data[comma_fields_indices[index]] = word
    all_comb.append(" ".join(sline_data))
print(all_comb)
