Different slices give different inequalities for the same elements - python

import numpy as np
a = np.array([.4], dtype='float32')
b = np.array([.4, .6])
print(a > b)
print(a > b[0], a > b[1])
print(a[0] > b[0], a[0] > b[1])
Output:
[ True False]
[False] [False]
True False
What's the deal? Yes, b.dtype == 'float64', but its slices b[0] and b[1] are 'float64' too, while a remains 'float32'.
Note: I'm asking why this occurs, not how to circumvent it, which I know (e.g. cast both to 'float64').

As I've noted in another answer, type casting in numpy is pretty complicated, and this is the root cause of the behaviour you are seeing. The documents linked in that answer make it clear that scalars (or 0d arrays) and 1d arrays differ in type conversions: only the former are considered value by value.
You already know the first half of the story: type conversion happens differently for your cases:
>>> (a + b).dtype
dtype('float64')
>>> (a + b[0]).dtype
dtype('float32')
>>> (a[0] + b[0]).dtype
dtype('float64')
There's also a helper called numpy.result_type() that can tell you the same information without having to perform the binary operation:
>>> np.result_type(a, b)
dtype('float64')
>>> np.result_type(a, b[0])
dtype('float32')
>>> np.result_type(a[0], b[0])
dtype('float64')
I believe we can understand what's happening in your example if we consider the type conversion tables:
>>> from numpy.testing import print_coercion_tables
>>> print_coercion_tables()
can cast
[...]
In these tables, ValueError is '!', OverflowError is '@', TypeError is '#'
scalar + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b h i l q l h i l d d d e f d g F D G # # # O ! m
h h h h i l q l h i l d d d f f d g F D G # # # O ! m
i i i i i l q l i i l d d d d d d g D D G # # # O ! m
l l l l l l q l l l l d d d d d d g D D G # # # O ! m
q q q q q q q q q q q d d d d d d g D D G # # # O ! m
p l l l l l q l l l l d d d d d d g D D G # # # O ! m
B B h h i l q l B H I L Q L e f d g F D G # # # O ! m
H H i i i l q l H H I L Q L f f d g F D G # # # O ! m
I I l l l l q l I I I L Q L d d d g D D G # # # O ! m
L L d d d d d d L L L L Q L d d d g D D G # # # O ! m
Q Q d d d d d d Q Q Q Q Q Q d d d g D D G # # # O ! m
P L d d d d d d L L L L Q L d d d g D D G # # # O ! m
e e e f d d d d e f d d d d e f d g F D G # # # O ! #
f f f f d d d d f f d d d d f f d g F D G # # # O ! #
d d d d d d d d d d d d d d d d d g D D G # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F D D D D F F D D D D F F D G F D G # # # O ! #
D D D D D D D D D D D D D D D D D G D D G # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
scalar + neg scalar
[...]
array + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b b b b b b b b b b b b e f d g F D G # # # O ! m
h h h h h h h h h h h h h h f f d g F D G # # # O ! m
i i i i i i i i i i i i i i d d d g D D G # # # O ! m
l l l l l l l l l l l l l l d d d g D D G # # # O ! m
q q q q q q q q q q q q q q d d d g D D G # # # O ! m
p l l l l l l l l l l l l l d d d g D D G # # # O ! m
B B B B B B B B B B B B B B e f d g F D G # # # O ! m
H H H H H H H H H H H H H H f f d g F D G # # # O ! m
I I I I I I I I I I I I I I d d d g D D G # # # O ! m
L L L L L L L L L L L L L L d d d g D D G # # # O ! m
Q Q Q Q Q Q Q Q Q Q Q Q Q Q d d d g D D G # # # O ! m
P L L L L L L L L L L L L L d d d g D D G # # # O ! m
e e e e e e e e e e e e e e e e e e F F F # # # O ! #
f f f f f f f f f f f f f f f f f f F F F # # # O ! #
d d d d d d d d d d d d d d d d d d D D D # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F F F F F F F F F F F F F F F F F F # # # O ! #
D D D D D D D D D D D D D D D D D D D D D # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
[...]
The above is part of the current promotion tables for value-based promotion. It shows how differing types contribute to a result type when two numpy objects of the given kinds are paired (see the first column and the first row for the specific types). The types are to be understood according to the single-character dtype specifications (under "One-character strings"); in particular np.dtype('f') corresponds to np.float32 ('f' for C-style float) and np.dtype('d') to np.float64 ('d' for C-style double); see also np.typename('f') and the same for 'd'.
I have highlighted the two relevant entries from the above tables:
scalar f + scalar d --> d
array f + scalar d --> f
Now let's look at your cases. The premise is that you have an 'f' array a and a 'd' array b. The fact that a only has a single element doesn't matter: it's a 1d array with length 1 rather than a 0d array.
When you do a > b you are comparing two arrays; this case is not covered by the above tables. I'm not sure exactly what happens here; my guess is that a gets broadcast to b's shape and its type is then cast to 'd'. The reason I think this is that np.can_cast(a, np.float64) is True while np.can_cast(b, np.float32) is False. But this is just a guess; a lot of this machinery in numpy is not intuitive to me.
When you do a > b[0] you are comparing a 'f' array to a 'd' scalar, so according to the above you get a 'f' array. That's what (a + b[0]).dtype told us. (When you use a > b[0] you don't see the conversion step, because the result is always a bool.)
When you do a[0] > b[0] you are comparing a 'f' scalar to a 'd' scalar, so according to the above you get a 'd' scalar. That's what (a[0] + b[0]).dtype told us.
So I believe this is all consistent with the quirks of type conversion in numpy. While it might seem like an unfortunate corner case arising from the value 0.4 having different representations in single and double precision, the issue goes deeper, and it serves as a big red warning to be very careful when mixing different dtypes.
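The corner case itself is easy to verify: 0.4 rounds to slightly different values in single and double precision, and the float32 rounding happens to land above the float64 one, which is why the promoted scalar comparison comes out True:

```python
import numpy as np

# nearest representable values of 0.4 in each precision
f32 = np.float32(0.4)
f64 = np.float64(0.4)

# when both are promoted to float64, the float32 value is strictly larger
print(float(f32))        # 0.4000000059604645
print(bool(f32 > f64))   # True
```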
The safest course of action is to convert your types yourself in order to control what happens in your code, especially since there's discussion about reconsidering some aspects of type promotion.
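As a minimal sketch of that explicit-conversion approach (casting b down to 'float32' here; casting a up to 'float64' works equally well):

```python
import numpy as np

a = np.array([.4], dtype='float32')
b = np.array([.4, .6])

# both operands now share one precision, so all three comparisons agree
b32 = b.astype(np.float32)
print(a > b32)                 # [False False]
print(a > b32[0], a > b32[1])  # [False] [False]
```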
As a side note (for now), there's a work-in-progress NEP 50 created in May 2021 that explains how confusing type promotion can be when scalars are involved, and plans to simplify some of the rules eventually. Since this also involves breaking changes, its implementation in NumPy proper won't happen overnight.

Related

Slow NetworkX graph creation

I have to create a graph from a documents-terms matrix loaded into a pandas dataframe: nodes are terms, and edges are weighted by the number of documents in which the two terms appear together.
The code works, but it is really, really slow.
edges = []
edges_attrs = {}
columns = list(dtm.columns)
for key in dtm.columns:
    for key1 in columns:
        # skip the same node
        if key == key1:
            continue
        df = dtm.loc[(dtm[key] != 0) & (dtm[key1] != 0), [key, key1]]
        docs = df.shape[0]
        edges.append((key, key1))
        edges_attrs[(key, key1)] = {'docs': docs}
    # no duplicate edges: (u, v) == (v, u)
    columns.remove(key)
graph.add_edges_from(edges)
nx.set_edge_attributes(graph, edges_attrs)
For a dtm with 2k terms (columns) it takes more than 3 hours, which sounds like far too much for that size.
Any hints on how to speed this up?
Don't use for loops. Learn about inner and outer joins in databases; an introductory course in SQL would cover these concepts. Applying them to a pandas dataframe is then pretty straightforward:
#!/usr/bin/env python
"""
https://stackoverflow.com/q/62406586/2912349
"""
import numpy as np
import pandas as pd
# simulate some data
x = pd.DataFrame(np.random.normal(0, 1, (4,4)), index=['a', 'b', 'c', 'd'], columns=['e', 'f', 'g', 'h'])
x[:] = x > 0
# e f g h
# a False False True False
# b False False False True
# c True True True True
# d False True True True
sparse = pd.DataFrame(x[x > 0].stack().index.tolist(), columns=['Documents', 'Terms'])
# Documents Terms
# 0 a g
# 1 b h
# 2 c e
# 3 c f
# 4 c g
# 5 c h
# 6 d f
# 7 d g
# 8 d h
cooccurrences = pd.merge(sparse, sparse, how='inner', on='Documents')
# Documents Terms_x Terms_y
# 0 a g g
# 1 b h h
# 2 c e e
# 3 c e f
# 4 c e g
# 5 c e h
# 6 c f e
# 7 c f f
# 8 c f g
# 9 c f h
# 10 c g e
# 11 c g f
# 12 c g g
# 13 c g h
# 14 c h e
# 15 c h f
# 16 c h g
# 17 c h h
# 18 d f f
# 19 d f g
# 20 d f h
# 21 d g f
# 22 d g g
# 23 d g h
# 24 d h f
# 25 d h g
# 26 d h h
# remove self loops and repeat pairings such as the second tuple in (u, v), (v, u)
valid = cooccurrences['Terms_x'] > cooccurrences['Terms_y']
valid_cooccurrences = cooccurrences[valid]
# Documents Terms_x Terms_y
# 6 c f e
# 10 c g e
# 11 c g f
# 14 c h e
# 15 c h f
# 16 c h g
# 21 d g f
# 24 d h f
# 25 d h g
counts = valid_cooccurrences.groupby(['Terms_x', 'Terms_y']).count()
# Documents
# Terms_x Terms_y
# f e 1
# g e 1
# f 2
# h e 1
# f 2
# g 2
documents = valid_cooccurrences.groupby(['Terms_x', 'Terms_y']).aggregate(lambda x : set(x))
# Documents
# Terms_x Terms_y
# f e {c}
# g e {c}
# f {d, c}
# h e {c}
# f {d, c}
# g {d, c}

How to leave only one defined sub-string in a string in Python

Say I have one of the strings:
"a b c d e f f g" || "a b c f d e f g"
And I want there to be only one occurrence of a substring (f in this instance) throughout the string so that it is somewhat sanitized.
The result of each string would be:
"a b c d e f g" || "a b c d e f g"
An example of the use would be:
str = "a b c d e f g g g g g h i j k l"
str.leaveOne("g")
#// a b c d e f g h i j k l
If it doesn't matter which instance you leave, you can use str.replace, which takes a parameter signifying the number of replacements you want to perform:
def leave_one_last(source, to_remove):
    return source.replace(to_remove, '', source.count(to_remove) - 1)
This will leave the last occurrence.
We can modify it to leave the first occurrence by reversing the string twice:
def leave_one_first(source, to_remove):
    return source[::-1].replace(to_remove, '', source.count(to_remove) - 1)[::-1]
However, that is ugly, not to mention inefficient. A more elegant way might be to take the substring that ends with the first occurrence of the character to find, replace occurrences of it in the rest, and finally concatenate them together:
def leave_one_first_v2(source, to_remove):
    first_index = source.index(to_remove) + 1
    return source[:first_index] + source[first_index:].replace(to_remove, '')
If we try this:
string = "a b c d e f g g g g g h i j k l g"
print(leave_one_last(string, 'g'))
print(leave_one_first(string, 'g'))
print(leave_one_first_v2(string, 'g'))
Output:
a b c d e f h i j k l g
a b c d e f g h i j k l
a b c d e f g h i j k l
If you don't want to keep spaces, then you should use a version based on split:
def leave_one_split(source, to_remove):
    chars = source.split()
    first_index = chars.index(to_remove) + 1
    return ' '.join(chars[:first_index] + [char for char in chars[first_index:] if char != to_remove])
string = "a b c d e f g g g g g h i j k l g"
print(leave_one_split(string, 'g'))
Output:
'a b c d e f g h i j k l'
If I understand correctly, you can just use a regex and re.sub to look for groups of two or more of your letter with or without a space and replace it by a single instance:
import re
def leaveOne(s, char):
    return re.sub(r'((%s\s?)){2,}' % char, r'\1', s)
leaveOne("a b c d e f g g g h i j k l", 'g')
# 'a b c d e f g h i j k l'
leaveOne("a b c d e f ggg h i j k l", 'g')
# 'a b c d e f g h i j k l'
leaveOne("a b c d e f g h i j k l", 'g')
# 'a b c d e f g h i j k l'
EDIT
If the goal is to get rid of all occurrences of the letter except one, you can still use a regex with a lookahead to select all letters followed by the same:
import re
def leaveOne(s, char):
    return re.sub(r'(%s)\s?(?=.*?\1)' % char, '', s)
print(leaveOne("a b c d e f g g g h i j k l g", 'g'))
# 'a b c d e f h i j k l g'
print(leaveOne("a b c d e f ggg h i j k l gg g", 'g'))
# 'a b c d e f h i j k l g'
print(leaveOne("a b c d e f g h i j k l", 'g'))
# 'a b c d e f g h i j k l'
This should even work with more complicated patterns like:
leaveOne("a b c ffff d e ff g", 'ff')
# 'a b c d e ff g'
Given the string:
mystr = 'defghhabbbczasdvakfafj'
cache = {}
seq = 0
for i in mystr:
    if i not in cache:
        cache[i] = seq
    print(cache[i])
    seq += 1
mylist = []
Here I have ordered the dictionary keys by their values:
for key, value in sorted(cache.items(), key=lambda x: x[1]):
    mylist.append(key)
print("".join(mylist))
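The same order-preserving de-duplication can be written more compactly with dict.fromkeys, since dicts remember insertion order (guaranteed from Python 3.7 on):

```python
mystr = 'defghhabbbczasdvakfafj'

# dict keys keep first-seen order; duplicates collapse automatically
deduped = "".join(dict.fromkeys(mystr))
print(deduped)  # defghabczsvkj
```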

Alternative solution for printing pattern using python

I want to print this pattern using Python. I have done it, but I'd like to
know what other solutions are possible:
A B C D E F G F E D C B A
A B C D E F F E D C B A
A B C D E E D C B A
......
....
A A
and here is my code:
n = 0
for i in range(71, 64, -1):
    for j in range(65, i+1):
        a = chr(j)
        print(a, end=" ")
    if n > 0:
        for l in range(1, 3+(n-1)*4):
            print(end=" ")
    if i < 71:
        j = j+1
    for k in range(j-1, 64, -1):
        b = chr(k)
        print(b, end=" ")
    n = n+1
    print()
Here's an alternative method using 3rd party library numpy. I use this library specifically because it allows vectorised assignment, which I use instead of an inner loop.
from string import ascii_uppercase
import numpy as np
n = 7
# extract first n letters from alphabet
letters = ascii_uppercase[:n]
res = np.array([list(letters + letters[-2::-1])] * (n-1))
# generate indices that are removed per line
idx = (range(n-i-1, n+i) for i in range(n-1))
# printing logic
print(' '.join(res[0]))
for i, j in enumerate(idx):
    # vectorised assignment
    res[i, j] = ' '
    print(' '.join(res[i]))
Result:
A B C D E F G F E D C B A
A B C D E F F E D C B A
A B C D E E D C B A
A B C D D C B A
A B C C B A
A B B A
A A
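For comparison, here is a plain-Python sketch of the same idea without numpy: build the full first row once, then blank out a growing middle slice on each subsequent row (the names here are my own):

```python
n = 7
letters = [chr(ord('A') + i) for i in range(n)]
full = letters + letters[-2::-1]  # A..G..A, 2*n-1 cells

lines = []
for i in range(n):
    row = full.copy()
    # on row i, blank the 2*i-1 middle cells (none on the first row)
    for j in range(n - i, n - 1 + i):
        row[j] = ' '
    lines.append(' '.join(row))

print('\n'.join(lines))
```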

Why is my file seemingly being read incorrectly?

In Python I want to read from a large file:
def aggregate(file_input):
    import fileinput
    reviews = []
    with open(file_input.replace(".txt", "_aggregated.txt"), "w") as outp:
        currComp = ""
        outp.write("Business;Stars_In_Sequence")
        for line in fileinput.input(file_input):
            reviews.append(MyReview(line))
            if currComp != reviews[-1].getCompany():
                currComp = reviews[-1].getCompany()
                outp.write("\n" + currComp + ";" + reviews[-1].getStars())
                outp.flush()
            else:
                outp.write(reviews[-1].getStars())
                outp.flush()
The file looks like this:
Business;User;Review_Stars;Date;Length;Votes_Cool;Votes_Funny;Votes_Useful;
0DI8Dt2PJp07XkVvIElIcQ;jkrzTC5P5QGJRoKECzcleQ;5;2014-03-11;421;0;1;0
0DI8Dt2PJp07XkVvIElIcQ;cK78PTjb65kdmRL9BnEdoQ;5;2014-03-29;190;0;1;0
and works fine if I use only a small part of the file, returning the right output:
Business;Stars_In_Sequence
Business;R
0DI8Dt2PJp07XkVvIElIcQ;55555455555555515
LTlCaCGZE14GuaUXUGbamg;555555555
EDqCEAGXVGCH4FJXgqtjqg;3324133
However, if I use the original file it returns this, and I can't figure out why:
Business;Stars_In_Sequence
ÿþB u s i n e s s ;
0 D I 8 D t 2 P J p 0 7 X k V v I E l I c Q ;
L T l C a C G Z E 1 4 G u a U X U G b a m g ;
E D q C E A G X V G C H 4 F J X g q t j q g ;
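Though the question doesn't say so explicitly, the ÿþ prefix and the spaces between every character strongly suggest the large file is UTF-16-encoded: ÿþ is how the UTF-16 little-endian byte-order mark (bytes FF FE) renders when read as 8-bit text, and the apparent spaces are the NUL bytes. A quick sketch of that effect:

```python
import codecs

# simulate a UTF-16-LE file header, as the original file likely has
raw = codecs.BOM_UTF16_LE + "Business;".encode("utf-16-le")

print(raw.decode("latin-1"))  # 'ÿþB\x00u\x00s...' - the mangled header
print(raw.decode("utf-16"))   # 'Business;' - decoded with the right codec
```

If that is the cause, opening the file with an explicit encoding (e.g. `open(file_input, encoding='utf-16')`, or `fileinput.input(file_input, openhook=fileinput.hook_encoded('utf-16'))`) should produce the expected text.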

Python: Split a mixed String

I read some lines from a file in the following form:
line = a b c d,e,f g h i,j,k,l m n
What I want is lines without the ","-separated elements, e.g.,
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
. . . . . . . . .
. . . . . . . . .
First I would split line
sline = line.split()
Now I would iterate over sline and look for elements that can be split with "," as a separator. The problem is that I don't always know how many of those elements to expect.
Any ideas?
Using regex, itertools.product and some string formatting:
This solution preserves the initial spacing as well.
>>> import re
>>> from itertools import product
>>> line = 'a b c d,e,f g h i,j,k,l m n'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w+)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
>>> for prod in product(*items):
...     print(strs.format(*prod))
...
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
Another example:
>>> line = 'a b c d,e,f g h i,j,k,l m n q,w,e,r f o o'
>>> items = [x[0].split(',') for x in re.findall(r'((\w+,)+\w+)',line)]
>>> strs = re.sub(r'((\w+,)+\w+)','{}',line)
>>> for prod in product(*items):
...     print(strs.format(*prod))
...
a b c d g h i m n q f o o
a b c d g h i m n w f o o
a b c d g h i m n e f o o
a b c d g h i m n r f o o
a b c d g h j m n q f o o
a b c d g h j m n w f o o
a b c d g h j m n e f o o
a b c d g h j m n r f o o
a b c d g h k m n q f o o
a b c d g h k m n w f o o
a b c d g h k m n e f o o
a b c d g h k m n r f o o
a b c d g h l m n q f o o
a b c d g h l m n w f o o
a b c d g h l m n e f o o
a b c d g h l m n r f o o
a b c e g h i m n q f o o
a b c e g h i m n w f o o
a b c e g h i m n e f o o
a b c e g h i m n r f o o
a b c e g h j m n q f o o
a b c e g h j m n w f o o
a b c e g h j m n e f o o
a b c e g h j m n r f o o
a b c e g h k m n q f o o
a b c e g h k m n w f o o
a b c e g h k m n e f o o
a b c e g h k m n r f o o
a b c e g h l m n q f o o
a b c e g h l m n w f o o
a b c e g h l m n e f o o
a b c e g h l m n r f o o
a b c f g h i m n q f o o
a b c f g h i m n w f o o
a b c f g h i m n e f o o
a b c f g h i m n r f o o
a b c f g h j m n q f o o
a b c f g h j m n w f o o
a b c f g h j m n e f o o
a b c f g h j m n r f o o
a b c f g h k m n q f o o
a b c f g h k m n w f o o
a b c f g h k m n e f o o
a b c f g h k m n r f o o
a b c f g h l m n q f o o
a b c f g h l m n w f o o
a b c f g h l m n e f o o
a b c f g h l m n r f o o
Your question is not entirely clear. If you want to strip off everything after the commas (as your text suggests), then a fairly readable one-liner will do:
cleaned_line = " ".join([field.split(",")[0] for field in line.split()])
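On the sample line, this keeps only the first comma-separated alternative of each field:

```python
line = "a b c d,e,f g h i,j,k,l m n"

# keep the part before the first comma in each whitespace-separated field
cleaned_line = " ".join([field.split(",")[0] for field in line.split()])
print(cleaned_line)  # a b c d g h i m n
```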
If you want to expand lines containing comma-separated fields into multiple lines (as your example suggests), then you should use the itertools.product function:
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in line.split()]
for expanded_line_fields in itertools.product(*line_fields):
    print(" ".join(expanded_line_fields))
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
If it's important to keep the original spacing, for some reason, then you can replace line.split() by re.findall("([^ ]+| +)", line):
import re
import itertools
line = "a b c d,e,f g h i,j,k,l m n"
line_fields = [field.split(",") for field in re.findall("([^ ]+| +)", line)]
for expanded_line_fields in itertools.product(*line_fields):
    print("".join(expanded_line_fields))
This is the output:
a b c d g h i m n
a b c d g h j m n
a b c d g h k m n
a b c d g h l m n
a b c e g h i m n
a b c e g h j m n
a b c e g h k m n
a b c e g h l m n
a b c f g h i m n
a b c f g h j m n
a b c f g h k m n
a b c f g h l m n
If I have understood your example correctly, you need the following:
import itertools
sss = "a b c d,e,f g h i,j,k,l m n d,e,f "
coma_separated = [i for i in sss.split() if ',' in i]
spited_coma_separated = [i.split(',') for i in coma_separated]
# use a generator expression to save memory
symbols = (i for i in itertools.product(*spited_coma_separated))
for s in symbols:
    st = sss
    for part, symb in zip(coma_separated, s):
        # replace only the first occurrence, to avoid replacing
        # the same comma-separated group again
        st = st.replace(part, symb, 1)
    print(st.split())  # for python3 compatibility
Most other answers only produce one line instead of the multiple lines you seem to want.
To achieve what you want, you can work in several ways.
The recursive solution seems the most intuitive to me:
def dothestuff(l):
    for n, i in enumerate(l):
        if ',' in i:
            # found a "," entry
            items = i.split(',')
            for j in items:
                for rest in dothestuff(l[n+1:]):
                    yield l[:n] + [j] + rest
            return
    yield l

line = "a b c d,e,f g h i,j,k,l m n"
for i in dothestuff(line.split()):
    print(i)
for i in range(len(line)-1):
    if line[i] == ',':
        line = line.replace(line[i]+line[i+1], '')
import itertools

line_data = 'a b c d,e,f g h i,j,k,l m n'
comma_fields_indices = [i for i, val in enumerate(line_data.split()) if "," in val]
comma_fields = [i.split(",") for i in line_data.split() if "," in i]
all_comb = []
for val in itertools.product(*comma_fields):
    sline_data = line_data.split()
    for index, word in enumerate(val):
        sline_data[comma_fields_indices[index]] = word
    all_comb.append(" ".join(sline_data))
print(all_comb)
