Why does pandas read_excel shift columns with each iteration? - python

I'm trying to read a bunch of Excel files (700+) and compile them into a single database using a for loop. However, each iteration of the loop shifts the first four columns to the end of the data set in a bizarre repeating pattern. I'm not practiced in Python, and I can't figure out what's causing this.
excel_files = glob.glob("/State Report_2020****308.xls")
list1 = [pd.read_excel(filename, sheet_name="Raw Data", usecols="A:S", skiprows=9, nrows=33-10) for filename in excel_files]
raw_data = pd.concat(list1, axis=0, ignore_index=True)
For example the data extracted from the first sheet looks like:
A B C ... Q R S
a b c ... q r s
a b c ... q r s
a b c ... q r s
a b c ... q r s
Then data from the second sheet is extracted and appended to the bottom of the data frame, which then looks like:
A B C ... Q R S T U V W
e f g ... q r S a b c d
e f g ... q r S a b c d
e f g ... q r S a b c d
e f g ... q r S a b c d
Then data from the third sheet is appended, and the frame looks like:
A B C ... Q R S T U V W X Y Z AA
e f g ... q r S a b c d
e f g ... q r S a b c d
e f g ... q r S a b c d
e f g ... q r S a b c d
This pattern repeats with every iteration shifting the first columns of data further to the right.

The error came from "skiprows = 9" among the pd.read_excel inputs: row 9 held the column headers for the table, so skipping it discarded the headers as well. I had thought that if I left the headers in, they would be added as a data row with each subsequent iteration, like:
A B C ... Q R S
a b c ... q r s
a b c ... q r s
...............
A B C ... Q R S
a b c ... q r s
a b c ... q r s
...............
A B C ... Q R S
a b c ... q r s
a b c ... q r s
...............
Instead, I left the headers in with "skiprows = 8" and got the result I originally wanted:
A B C ... Q R S
a b c ... q r s
a b c ... q r s
...............
a b c ... q r s
a b c ... q r s
...............
a b c ... q r s
a b c ... q r s
...............
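A minimal, hypothetical sketch (toy frames, not the actual report files) of why the staircase appears: pd.concat aligns frames by column name, and with skiprows=9 the header row itself was skipped, so each file's first data row was promoted to headers. Frames with mismatched headers then land in different columns.

```python
import pandas as pd

# first file: header row read correctly
first = pd.DataFrame([['a', 'b']], columns=['A', 'B'])
# second file: a data row mistaken for the header
second = pd.DataFrame([['e', 'f']], columns=['a', 'b'])

# concat aligns on column *names*, so the mismatched frame's values
# shift into brand-new columns and the remaining cells fill with NaN
combined = pd.concat([first, second], axis=0, ignore_index=True)
print(list(combined.columns))  # ['A', 'B', 'a', 'b']
```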

Slow NetworkX graph creation

I have to create a graph from a document-term matrix loaded into a pandas dataframe, where nodes are terms and edges carry the number of documents in which the two terms appear together.
The code works, but it is really, really slow.
edges = []
edges_attrs = {}
columns = list(dtm.columns)
for key in dtm.columns:
    for key1 in columns:
        # skip the same node
        if key == key1:
            continue
        df = dtm.loc[(dtm[key] != 0) & (dtm[key1] != 0), [key, key1]]
        docs = df.shape[0]
        edges.append((key, key1))
        edges_attrs[(key, key1)] = {'docs': docs}
    # no double arches (u, v) == (v, u)
    columns.remove(key)
graph.add_edges_from(edges)
nx.set_edge_attributes(graph, edges_attrs)
For a dtm with 2k terms (columns) it takes more than 3 hours, which sounds like far too much for that size.
Any hints on how to speed it up?
Don't use for loops. Learn about inner and outer joins in databases; an introductory course in SQL would cover these concepts. Applying them to a pandas dataframe is then pretty straightforward:
#!/usr/bin/env python
"""
https://stackoverflow.com/q/62406586/2912349
"""
import numpy as np
import pandas as pd
# simulate some data
x = pd.DataFrame(np.random.normal(0, 1, (4,4)), index=['a', 'b', 'c', 'd'], columns=['e', 'f', 'g', 'h'])
x[:] = x > 0
# e f g h
# a False False True False
# b False False False True
# c True True True True
# d False True True True
sparse = pd.DataFrame(x[x > 0].stack().index.tolist(), columns=['Documents', 'Terms'])
# Documents Terms
# 0 a g
# 1 b h
# 2 c e
# 3 c f
# 4 c g
# 5 c h
# 6 d f
# 7 d g
# 8 d h
cooccurrences = pd.merge(sparse, sparse, how='inner', on='Documents')
# Documents Terms_x Terms_y
# 0 a g g
# 1 b h h
# 2 c e e
# 3 c e f
# 4 c e g
# 5 c e h
# 6 c f e
# 7 c f f
# 8 c f g
# 9 c f h
# 10 c g e
# 11 c g f
# 12 c g g
# 13 c g h
# 14 c h e
# 15 c h f
# 16 c h g
# 17 c h h
# 18 d f f
# 19 d f g
# 20 d f h
# 21 d g f
# 22 d g g
# 23 d g h
# 24 d h f
# 25 d h g
# 26 d h h
# remove self loops and repeat pairings such as the second tuple in (u, v), (v, u)
valid = cooccurrences['Terms_x'] > cooccurrences['Terms_y']
valid_cooccurrences = cooccurrences[valid]
# Documents Terms_x Terms_y
# 6 c f e
# 10 c g e
# 11 c g f
# 14 c h e
# 15 c h f
# 16 c h g
# 21 d g f
# 24 d h f
# 25 d h g
counts = valid_cooccurrences.groupby(['Terms_x', 'Terms_y']).count()
# Documents
# Terms_x Terms_y
# f e 1
# g e 1
# f 2
# h e 1
# f 2
# g 2
documents = valid_cooccurrences.groupby(['Terms_x', 'Terms_y']).aggregate(lambda x : set(x))
# Documents
# Terms_x Terms_y
# f e {c}
# g e {c}
# f {d, c}
# h e {c}
# f {d, c}
# g {d, c}
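The answer stops at the aggregation step. As a hypothetical final step (not part of the original answer), the grouped counts can be fed straight into networkx; `counts` is rebuilt inline here so the sketch is self-contained, but in the real code it would be the groupby result above.

```python
import networkx as nx
import pandas as pd

# rebuild the grouped counts from above so this sketch runs standalone
counts = pd.DataFrame(
    {'Documents': [1, 1, 2, 1, 2, 2]},
    index=pd.MultiIndex.from_tuples(
        [('f', 'e'), ('g', 'e'), ('g', 'f'), ('h', 'e'), ('h', 'f'), ('h', 'g')],
        names=['Terms_x', 'Terms_y'],
    ),
)

# each MultiIndex entry is a (term, term) pair; the count becomes the
# 'docs' edge attribute, matching the original loop's edges_attrs
graph = nx.Graph()
graph.add_edges_from(
    (u, v, {'docs': docs}) for (u, v), docs in counts['Documents'].items()
)
print(graph.number_of_edges())  # 6
print(graph['h']['g'])          # {'docs': 2}
```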

Different slices give different inequalities for same elements

import numpy as np
a = np.array([.4], dtype='float32')
b = np.array([.4, .6])
print(a > b)
print(a > b[0], a > b[1])
print(a[0] > b[0], a[0] > b[1])
[ True False]
[False] [False]
True False
What's the deal? Yes, b.dtype == 'float64', but so are its slices b[0] & b[1], and a remains 'float32'.
Note: I'm asking why this occurs, not how to circumvent it, which I know (e.g. cast both to 'float64').
As I've noted in another answer, type casting in numpy is pretty complicated, and this is the root cause of the behaviour you are seeing. The documents linked in that answer make it clear that scalars (0d arrays) and 1d arrays differ in type conversions, since the latter aren't considered value by value.
You already know the first half of the problem: type conversion happens differently in your two cases:
>>> (a + b).dtype
dtype('float64')
>>> (a + b[0]).dtype
dtype('float32')
>>> (a[0] + b[0]).dtype
dtype('float64')
There's also a helper called numpy.result_type() that can tell you the same information without having to perform the binary operation:
>>> np.result_type(a, b)
dtype('float64')
>>> np.result_type(a, b[0])
dtype('float32')
>>> np.result_type(a[0], b[0])
dtype('float64')
I believe we can understand what's happening in your example if we consider the type conversion tables:
>>> from numpy.testing import print_coercion_tables
can cast
[...]
In these tables, ValueError is '!', OverflowError is '@', TypeError is '#'
scalar + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b h i l q l h i l d d d e f d g F D G # # # O ! m
h h h h i l q l h i l d d d f f d g F D G # # # O ! m
i i i i i l q l i i l d d d d d d g D D G # # # O ! m
l l l l l l q l l l l d d d d d d g D D G # # # O ! m
q q q q q q q q q q q d d d d d d g D D G # # # O ! m
p l l l l l q l l l l d d d d d d g D D G # # # O ! m
B B h h i l q l B H I L Q L e f d g F D G # # # O ! m
H H i i i l q l H H I L Q L f f d g F D G # # # O ! m
I I l l l l q l I I I L Q L d d d g D D G # # # O ! m
L L d d d d d d L L L L Q L d d d g D D G # # # O ! m
Q Q d d d d d d Q Q Q Q Q Q d d d g D D G # # # O ! m
P L d d d d d d L L L L Q L d d d g D D G # # # O ! m
e e e f d d d d e f d d d d e f d g F D G # # # O ! #
f f f f d d d d f f d d d d f f d g F D G # # # O ! #
d d d d d d d d d d d d d d d d d g D D G # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F D D D D F F D D D D F F D G F D G # # # O ! #
D D D D D D D D D D D D D D D D D G D D G # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
scalar + neg scalar
[...]
array + scalar
+ ? b h i l q p B H I L Q P e f d g F D G S U V O M m
? ? b h i l q l B H I L Q L e f d g F D G # # # O ! m
b b b b b b b b b b b b b b e f d g F D G # # # O ! m
h h h h h h h h h h h h h h f f d g F D G # # # O ! m
i i i i i i i i i i i i i i d d d g D D G # # # O ! m
l l l l l l l l l l l l l l d d d g D D G # # # O ! m
q q q q q q q q q q q q q q d d d g D D G # # # O ! m
p l l l l l l l l l l l l l d d d g D D G # # # O ! m
B B B B B B B B B B B B B B e f d g F D G # # # O ! m
H H H H H H H H H H H H H H f f d g F D G # # # O ! m
I I I I I I I I I I I I I I d d d g D D G # # # O ! m
L L L L L L L L L L L L L L d d d g D D G # # # O ! m
Q Q Q Q Q Q Q Q Q Q Q Q Q Q d d d g D D G # # # O ! m
P L L L L L L L L L L L L L d d d g D D G # # # O ! m
e e e e e e e e e e e e e e e e e e F F F # # # O ! #
f f f f f f f f f f f f f f f f f f F F F # # # O ! #
d d d d d d d d d d d d d d d d d d D D D # # # O ! #
g g g g g g g g g g g g g g g g g g G G G # # # O ! #
F F F F F F F F F F F F F F F F F F F F F # # # O ! #
D D D D D D D D D D D D D D D D D D D D D # # # O ! #
G G G G G G G G G G G G G G G G G G G G G # # # O ! #
S # # # # # # # # # # # # # # # # # # # # # # # O ! #
U # # # # # # # # # # # # # # # # # # # # # # # O ! #
V # # # # # # # # # # # # # # # # # # # # # # # O ! #
O O O O O O O O O O O O O O O O O O O O O O O O O ! #
M ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
m m m m m m m m m m m m m m # # # # # # # # # # # ! m
[...]
The above is part of the current promotion tables for value-based promotion. It denotes how differing types contribute to a result type when pairing two numpy objects of a given kind (see the first column and first row for the specific types). The types are to be understood according to the single-character dtype specifications (below "One-character strings"), in particular np.dtype('f') corresponds to np.float32 (f for C-style float) and np.dtype('d') (d for C-style double) to np.float64 (see also np.typename('f') and the same for 'd').
Note two entries in the above tables in particular:
scalar f + scalar d --> d
array f + scalar d --> f
Now let's look at your cases. The premise is that you have an 'f' array a and a 'd' array b. The fact that a only has a single element doesn't matter: it's a 1d array with length 1 rather than a 0d array.
When you do a > b you are comparing two arrays; this case is not covered by the above tables. I'm not sure what the behaviour is here, but my guess is that a gets broadcast to b's shape and its type is then cast to 'd'. The reason I think this is that np.can_cast(a, np.float64) is True while np.can_cast(b, np.float32) is False. But this is just a guess; a lot of this machinery in numpy is not intuitive to me.
When you do a > b[0] you are comparing a 'f' array to a 'd' scalar, so according to the above you get a 'f' array. That's what (a + b[0]).dtype told us. (When you use a > b[0] you don't see the conversion step, because the result is always a bool.)
When you do a[0] > b[0] you are comparing a 'f' scalar to a 'd' scalar, so according to the above you get a 'd' scalar. That's what (a[0] + b[0]).dtype told us.
So I believe this is all consistent with the quirks of type conversion in numpy. While it might seem like an unfortunate corner case involving the value 0.4 in double versus single precision, the issue goes deeper, and it serves as a big red warning to be very careful when mixing different dtypes.
The safest course of action is to convert your types yourself in order to control what happens in your code. Especially since there's discussion about reconsidering some aspects of type promotion.
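A minimal sketch of that advice applied to the example above: cast the float32 array up to float64 once, so every comparison happens at the same precision and the three cases agree.

```python
import numpy as np

a = np.array([.4], dtype='float32')
b = np.array([.4, .6])

# float32(0.4) is 0.4000000059604645 exactly, which is slightly larger
# than float64(0.4); the upcast preserves that value, so the results
# below are consistent with each other (if not with naive intuition)
a64 = a.astype(np.float64)
print(a64 > b)                        # [ True False]
print(a64[0] > b[0], a64[0] > b[1])   # True False
```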
As a side note (for now), there's a work-in-progress NEP 50 created in May 2021 that explains how confusing type promotion can be when scalars are involved, and plans to simplify some of the rules eventually. Since this also involves breaking changes, its implementation in NumPy proper won't happen overnight.

Alternative solution for printing pattern using python

I want to print a pattern using Python. I have done it, but I want to know what other solutions are possible for the same:
A B C D E F G F E D C B A
A B C D E F F E D C B A
A B C D E E D C B A
......
....
A A
and here is my code:-
n=0
for i in range(71,64,-1):
    for j in range(65,i+1):
        a=chr(j)
        print(a, end=" ")
    if n>0:
        for l in range(1,3+(n-1)*4):
            print(end=" ")
    if i<71:
        j=j+1
    for k in range(j-1,64,-1):
        b=chr(k)
        print(b, end=" ")
    n=n+1
    print()
Here's an alternative method using the 3rd-party library numpy. I use this library specifically because it allows vectorised assignment, which replaces the inner loop.
from string import ascii_uppercase
import numpy as np

n = 7
# extract first n letters from alphabet
letters = ascii_uppercase[:n]
res = np.array([list(letters + letters[-2::-1])] * (n-1))
# generate indices that are removed per line
idx = (range(n-i-1, n+i) for i in range(n-1))
# printing logic
print(' '.join(res[0]))
for i, j in enumerate(idx):
    # vectorised assignment
    res[i, j] = ' '
    print(' '.join(res[i]))
Result:
A B C D E F G F E D C B A
A B C D E F F E D C B A
A B C D E E D C B A
A B C D D C B A
A B C C B A
A B B A
A A
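For comparison, here is a dependency-free sketch of the same pattern built from list slices instead of vectorised assignment; the letter count n is a parameter, and the widening centre gap is filled with explicit space items.

```python
n = 7
letters = [chr(ord('A') + i) for i in range(n)]

lines = []
for k in range(n, 0, -1):
    asc = letters[:k]
    if k == n:
        # full first line: mirror without repeating the middle letter
        row = asc + asc[-2::-1]
    else:
        # later lines: repeat the last letter, with a gap where
        # letters were removed, so every line has the same width
        gap = [' '] * (2 * (n - k) - 1)
        row = asc + gap + asc[::-1]
    lines.append(' '.join(row))

print('\n'.join(lines))
```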

regex pattern won't return in python script

Why does the first snippet return digits, but the latter does not? I have tried more complicated expressions without success. The expressions I use are valid according to pythex.org, but they do not work in the script.
(\d{6}-){7}\d{6} is one such expression. I've tested it against this string: 123138-507716-007469-173316-534644-033330-675057-093280
import re
pattern = re.compile('(\d{1})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
==============
import re
pattern = re.compile('(\d{6})')
load_file = open('demo.txt', 'r')
search_file = load_file.read()
result = pattern.findall(search_file)
print(result)
When I put the string into a variable and then search the variable, it works just fine, so the pattern should work as-is. But that doesn't help when I want to read from a text file. I've tried reading the file line by line, and that seems to be where the script breaks down.
import re
pattern = re.compile('((\d{6}-){7})')
#pattern = re.compile('(\d{6})')
#load_file = open('demo.txt', 'r')
#search_file = load_file.read()
test_string = '123138-507716-007469-173316-534644-033330-675057-093280'
result = pattern.findall(test_string)
print(result)
=========
Printout:
Search File:
ÿþB i t L o c k e r D r i v e E n c r y p t i o n R e c o v e r y K e y
T h e r e c o v e r y k e y i s u s e d t o r e c o v e r t h e d a t a o n a B i t L o c k e r p r o t e c t e d d r i v e .
T o v e r i f y t h a t t h i s i s t h e c o r r e c t r e c o v e r y k e y c o m p a r e t h e i d e n t i f i c a t i o n w i t h w h a t i s p r e s e n t e d o n t h e r e c o v e r y s c r e e n .
R e c o v e r y k e y i d e n t i f i c a t i o n : f f s d f a - f s d f - s f
F u l l r e c o v e r y k e y i d e n t i f i c a t i o n : 8 8 8 8 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 - 8 8 8 8 8 8 8 8 8 8 8
B i t L o c k e r R e c o v e r y K e y :
1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1 - 1 1 1 1 1 1
6 6 6 6 6 6
Search Results:
[]
Process finished with exit code 0
================
This is where I ended up. It finds the string just fine and without the commas.
import re
pattern = re.compile('(\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6}-\w{6})')
load_file = open('demo3.txt', 'r')
for line in load_file:
    print(pattern.findall(line))
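For what it's worth, the "ÿþ" at the start of the printout is a UTF-16 byte-order mark: the file is UTF-16 encoded, so read as plain text every character is interleaved with a null byte and \d{6} can never match six digits in a row. A sketch (with a hypothetical file name and sample content) of decoding it explicitly instead:

```python
import re

key = '123138-507716-007469-173316-534644-033330-675057-093280'

# write a small UTF-16 sample file (hypothetical name and content,
# mimicking the BitLocker key dump in the printout)
with open('demo_utf16.txt', 'w', encoding='utf-16') as fh:
    fh.write('BitLocker Recovery Key:\n' + key + '\n')

# decoding with the right codec removes the interleaved null bytes,
# so the original six-digit pattern matches again
pattern = re.compile(r'\d{6}(?:-\d{6}){7}')
with open('demo_utf16.txt', encoding='utf-16') as fh:
    found = pattern.findall(fh.read())
print(found)  # the full key, as a single match
```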

Program to output letter pyramid

To print the output
A
A B
A B C
A B C D
A B C D E
I used the following code, but it does not work correctly.
strg = "A B C D E F"
i = 0
while i < len(strg):
    print strg[0:i+1]
    print "\n"
    i = i + 1
For this code the obtained output is:
A
A
A B
A B
A B C
A B C
A B C D
A B C D
A B C D E
A B C D E
A B C D E F
Why does each line get printed twice?
Whitespace. You need to increment i by 2 instead of 1. Try:
strg = "A B C D E F"
i = 0
while i < len(strg):
    print strg[0:i+2]
    print "\n"
    i = i+2
This lets you skip over the whitespace "indices" of the string.
A little more pythonic:
>>> strg = "ABCDEF"
>>> for index, _ in enumerate(strg):
...     print " ".join(strg[:index+1])
A
A B
A B C
A B C D
A B C D E
A B C D E F
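Both snippets above are Python 2. A Python 3 sketch of the same idea, assuming the letters are stored without the embedded spaces:

```python
strg = "ABCDEF"

# build each line by slicing the prefix and re-inserting the spaces
lines = []
for index in range(1, len(strg) + 1):
    lines.append(" ".join(strg[:index]))

print("\n".join(lines))
```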
