How to handle SystemError in Python?

When I run the following code:

import pandas as pd

web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45, 56]}
df = pd.DataFrame(web_states)
print(df)
I get the following error:

File "C:\Users\Python36-32\lib\site-packages\numpy\core\__init__.py", line 16, in <module>
    from . import multiarray
SystemError: initialization of multiarray raised unreported exception

Please advise.

Your BounceRates list is too short.
Your code:
web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45]}
df = pd.DataFrame(web_states)
Produces:
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 5446, in extract_index
raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length
Lengthen BounceRates:
web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45, 0]}
df = pd.DataFrame(web_states)
print(df)
Produces:
BounceRates Day Visitors
0 65 1 43
1 74 2 53
2 99 3 46
3 98 4 78
4 45 5 88
5 0 6 24
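If you genuinely have columns of unequal length, one workaround (a minimal sketch, not part of the original answer) is to wrap each list in a pd.Series; pandas then aligns on the index and pads the short column with NaN instead of raising:

import pandas as pd

# shorter columns are padded with NaN when the dict values are Series
web_states = {'Day': pd.Series([1, 2, 3, 4, 5, 6]),
              'Visitors': pd.Series([43, 53, 46, 78, 88, 24]),
              'BounceRates': pd.Series([65, 74, 99, 98, 45])}
df = pd.DataFrame(web_states)
print(df)  # the missing BounceRates entry shows up as NaN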


How to select a different element from each row in numpy?

Let's say there is a matrix A (parentheses mark the elements to select):

A = [[ 34  61  29  74 (17) 32  72  92  93  57 ]
     [(46) 10  23  84  74  57  56  88  90  36 ]
     [ 23 (83) 58  42  93  54  82  48  63  73 ]]

and a vector b of size 3:

b = [4, 0, 1]

Does numpy have any function that would do the following job?

>>> A.choose_row_wise(b)

would output:

[17, 46, 83]
With numpy.take_along_axis:

import numpy as np

A = np.array([[34, 61, 29, 74, 17, 32, 72, 92, 93, 57], [46, 10, 23, 84, 74, 57, 56, 88, 90, 36], [23, 83, 58, 42, 93, 54, 82, 48, 63, 73]])
b = np.array([4, 0, 1])
# pick one column index per row, then drop the singleton dimension
res = np.take_along_axis(A, b[:, None], axis=1).flatten()
print(res)
# [17 46 83]
This can be done using integer-array indexing.

import numpy as np

A = np.array([[34, 61, 29, 74, 17, 32, 72, 92, 93, 57], [46, 10, 23, 84, 74, 57, 56, 88, 90, 36], [23, 83, 58, 42, 93, 54, 82, 48, 63, 73]])
b = np.array([4, 0, 1])
result = A[np.arange(len(b)), b]  # row i paired with column b[i]
print(result)
# [17 46 83]
Maybe you can use the following code:

import numpy as np

A = np.array([[34, 61, 29, 74, 17, 32, 72, 92, 93, 57], [46, 10, 23, 84, 74, 57, 56, 88, 90, 36], [23, 83, 58, 42, 93, 54, 82, 48, 63, 73]])
b = [4, 0, 1]
# enumerate pairs each row with its column index; the alternative
# b.index(i) would break if b contained duplicate values
print([int(A[row, col]) for row, col in enumerate(b)])
# [17, 46, 83]

Expand a list-like column in dask DF across several columns

This is similar to previous questions about how to expand a list-based column across several columns, but the solutions I'm seeing don't seem to work for Dask. Note that the true DFs I'm working with are too large to hold in memory, so converting to pandas first is not an option.
I have a df with a column that contains lists:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [np.random.randint(100, size=4) for _ in range(20)]})
dask_df = dd.from_pandas(df, chunksize=10)
dask_df['a'].compute()
0 [52, 38, 59, 78]
1 [79, 71, 13, 63]
2 [15, 81, 79, 76]
3 [53, 4, 94, 62]
4 [91, 34, 26, 92]
5 [96, 1, 69, 27]
6 [84, 91, 96, 68]
7 [93, 56, 45, 40]
8 [54, 1, 96, 76]
9 [27, 11, 79, 7]
10 [27, 60, 78, 23]
11 [56, 61, 88, 68]
12 [81, 10, 79, 65]
13 [34, 49, 30, 3]
14 [32, 46, 53, 62]
15 [20, 46, 87, 31]
16 [89, 9, 11, 4]
17 [26, 46, 19, 27]
18 [79, 44, 45, 56]
19 [22, 18, 31, 90]
Name: a, dtype: object
According to this solution, if this were a pd.DataFrame I could do something like this:

new_dask_df = dask_df['a'].apply(pd.Series)

but with Dask it raises:

ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: [1, 2, 3]
Missing: []
There's another solution listed here:
import dask.array as da
import dask.dataframe as dd
x = da.ones((4, 2), chunks=(2, 2))
df = dd.io.from_dask_array(x, columns=['a', 'b'])
df.compute()
So for dask I tried:
df = dd.io.from_dask_array(dask_df.values)

but that just spits out the same DF I have from before (screenshot: https://i.stack.imgur.com/T099A.png).
Not really sure why, as the types of the example x and of the values in my df are the same:

print(type(dask_df.values), type(x))
<class 'dask.array.core.Array'> <class 'dask.array.core.Array'>
print(type(dask_df.values.compute()[0]), type(x.compute()[0]))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Edit: I kind of have a working solution, but it involves iterating through each groupby object. It feels like there should be a better way:

dask_groups = dask_df.explode('a').reset_index().groupby('index')
final_df = []
for idx in dask_df.index.values.compute():
    group = dask_groups.get_group(idx).drop(columns='index').compute()
    group_size = list(range(len(group)))
    row = group.transpose()
    row.columns = group_size
    row['index'] = idx
    final_df.append(dd.from_pandas(row, chunksize=10))
final_df = dd.concat(final_df).set_index('index')
In this case dask doesn't know what to expect from the outcome, so it's best to specify meta explicitly:
# this is a short-cut to use the existing pandas df
# in actual code it is sufficient to provide an
# empty series with the expected dtype
meta = df['a'].apply(pd.Series)
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
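For reference, a minimal sketch of the meta-only variant (an illustration, assuming every list has exactly four integer entries; the 0..3 column names are what pd.Series produces):

import pandas as pd

# an empty frame with the expected columns and dtypes serves as meta,
# so the large pandas df never has to be materialized
meta = pd.DataFrame({i: pd.Series(dtype=int) for i in range(4)})
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)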
I got a working solution. My original function created a list which resulted in the column of lists, as above. Changing the applied function to return a dask bag seems to do the trick:
import random
import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()

test_df = dd.from_pandas(pd.DataFrame({'a': [random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()
mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()
But I'm not sure this solves the in-memory issue, as I'm now holding a list of groupby results.
Here's how to expand a list-like column across multiple columns manually:
dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]
print(dask_df.head())
a a0 a1 a2 a3
0 [71, 16, 0, 10] 71 16 0 10
1 [59, 65, 99, 74] 59 65 99 74
2 [83, 26, 33, 38] 83 26 33 38
3 [70, 5, 19, 37] 70 5 19 37
4 [0, 59, 4, 80] 0 59 4 80
SultanOrazbayev's answer seems more elegant.
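If the lists are longer, the same manual pattern can be generated in a loop (a sketch, assuming every list in column 'a' has exactly n elements):

n = 4  # assumed, known list length
for i in range(n):
    dask_df[f"a{i}"] = dask_df["a"].str[i]
print(dask_df.head())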

Regarding ndarray creation in julia: Stacking in extra dimension

I would like to convert the following Python code into Julia:

import numpy as np

x = np.random.random([4, 5, 6])
y = np.array([[x, x, x],
              [2*x, 3*x, 4*x]])
print(y.shape)
-> (2, 3, 4, 5, 6)

In Julia, the analogous syntax seems to me to be:

x = rand(4, 5, 6)
y = [x x x; 2x 3x 4x]
println(size(y))
-> (8, 15, 6)

These results are different. Can you tell me how to do it?
Using random numbers and multipliers obscures the details which you seek. Let's do consecutive numbering and try to get Python and Julia to display alike:
Python:

>>> z = np.reshape(np.array(range(1, 121)), [4, 5, 6])
>>> z
array([[[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12],
[ 13, 14, 15, 16, 17, 18],
[ 19, 20, 21, 22, 23, 24],
[ 25, 26, 27, 28, 29, 30]],
[[ 31, 32, 33, 34, 35, 36],
[ 37, 38, 39, 40, 41, 42],
[ 43, 44, 45, 46, 47, 48],
[ 49, 50, 51, 52, 53, 54],
[ 55, 56, 57, 58, 59, 60]],
[[ 61, 62, 63, 64, 65, 66],
[ 67, 68, 69, 70, 71, 72],
[ 73, 74, 75, 76, 77, 78],
[ 79, 80, 81, 82, 83, 84],
[ 85, 86, 87, 88, 89, 90]],
[[ 91, 92, 93, 94, 95, 96],
[ 97, 98, 99, 100, 101, 102],
[103, 104, 105, 106, 107, 108],
[109, 110, 111, 112, 113, 114],
[115, 116, 117, 118, 119, 120]]])
Julia:

julia> z = reshape(1:120, 6, 5, 4)
6×5×4 reshape(::UnitRange{Int64}, 6, 5, 4) with eltype Int64:
[:, :, 1] =
1 7 13 19 25
2 8 14 20 26
3 9 15 21 27
4 10 16 22 28
5 11 17 23 29
6 12 18 24 30
[:, :, 2] =
31 37 43 49 55
32 38 44 50 56
33 39 45 51 57
34 40 46 52 58
35 41 47 53 59
36 42 48 54 60
[:, :, 3] =
61 67 73 79 85
62 68 74 80 86
63 69 75 81 87
64 70 76 82 88
65 71 77 83 89
66 72 78 84 90
[:, :, 4] =
91 97 103 109 115
92 98 104 110 116
93 99 105 111 117
94 100 106 112 118
95 101 107 113 119
96 102 108 114 120
So, if you want things to print similarly on the screen, you need to swap the first and last dimension sizes (reverse the order of dimensions) between Julia and Python. In addition, since Julia concatenates arrays when you put them in the same brackets, while Python just nests its arrays at greater depth, you need to use np.reshape in Python or reshape in Julia to get the shape you want. I suggest you check the resulting arrays on a consecutive set of integers, to be sure they print alike, before going back to your random floating-point numbers. Remember that the indexing order is different when you access elements, too. Consider:
>>> zzz = np.array([[z, z, z], [z, z, z]])           # Python
julia> zzz = reshape([z z z; z z z], 6, 5, 4, 3, 2)  # Julia
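For a quick check on the Python side (a sketch, not from the original thread): nesting in np.array is just stacking along new leading axes, which is exactly the structure the Julia reshape has to reproduce in reverse dimension order.

import numpy as np

x = np.random.random([4, 5, 6])
# nesting [[x, x, x], [2*x, 3*x, 4*x]] adds two leading axes of sizes 2 and 3
y = np.array([[x, x, x], [2*x, 3*x, 4*x]])
y2 = np.stack([np.stack([x, x, x]), np.stack([2*x, 3*x, 4*x])])
assert y.shape == y2.shape == (2, 3, 4, 5, 6)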

Looping through an array based on the first index

I have two arrays, and I want to loop through the second one, returning only the sub-arrays whose first element equals an element of the first array.

a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54],
     [10, 8, 52, 30, 15, 47, 109],
     [11, 81, 152, 54, 112, 78, 167],
     [13, 82, 84, 63, 24, 26, 78],
     [18, 182, 25, 63, 96, 104, 74]]

That is, I would like to look through each of the sub-arrays within b and keep those whose first value is equal to one of the values in array a, creating a new array, c.
The result I am looking for is:
c = [[10, 8, 52, 30, 15, 47, 109],[11, 81, 152, 54, 112, 78, 167],[13, 82, 84, 63, 24, 26, 78]]
Does Python have a tool to do this, in the way Excel has MATCH()?
I tried looping in a manner such as:

for i in a:
    if i in b:
        print(b)

But because there are other elements within each sub-array, this does not work. Any help would be greatly appreciated.
Further explanation of the problem:
a = [5, 6, 7, 9, 12]
I read in an Excel file using xlrd (b_csv_data):
Start Count Error Constant Result1 Result2 Result3 Result4
5 41 0 45 23 54 66 19
5.4 44 1 21 52 35 6 50
6 16 1 42 95 39 1 13
6.9 50 1 22 71 86 59 97
7 38 1 43 50 47 83 67
8 26 1 29 100 63 15 40
9 46 0 28 85 9 27 81
12 43 0 21 74 78 20 85
Next, I created a loop to read in a select number of rows. For simplicity, the file above only has a few rows; my current file has about 100 rows.

for r in range(1, 7):  # skipping headers and only wanting the first few rows to start
    b_raw = b_csv_data.row_values(r)
    b = np.array(b_raw)  # I created this b numpy array from the line of code above
Use np.isin -
In [8]: b[np.isin(b[:,0],a)]
Out[8]:
array([[ 10, 8, 52, 30, 15],
[ 11, 81, 152, 54, 112],
[ 13, 82, 84, 63, 24]])
With sorted a, we can also use np.searchsorted -
idx = np.searchsorted(a,b[:,0])
idx[idx==len(a)] = 0
out = b[a[idx] == b[:,0]]
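A quick check of the searchsorted route on the sample data (a sketch added for illustration; note that a must be a sorted numpy array here, since a[idx] relies on array indexing):

import numpy as np

a = np.array([10, 11, 12, 13, 14])  # already sorted
b = np.array([[9, 23, 45, 67, 56, 23, 54],
              [10, 8, 52, 30, 15, 47, 109],
              [11, 81, 152, 54, 112, 78, 167],
              [13, 82, 84, 63, 24, 26, 78],
              [18, 182, 25, 63, 96, 104, 74]])
idx = np.searchsorted(a, b[:, 0])
idx[idx == len(a)] = 0  # clamp out-of-range positions before comparing
print(b[a[idx] == b[:, 0]])
# [[ 10   8  52  30  15  47 109]
#  [ 11  81 152  54 112  78 167]
#  [ 13  82  84  63  24  26  78]]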
If you have an array with a different number of elements per row, which is essentially an array of lists, you need to modify the slicing part. In that case, get the first elements with:

b0 = [bi[0] for bi in b]

Then use b0 to replace all instances of b[:,0] in the earlier posted methods, as sketched below.
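A minimal sketch of that substitution (an illustration, assuming b is a plain list of unequal-length lists):

import numpy as np

b0 = np.array([bi[0] for bi in b])  # first element of every row
mask = np.isin(b0, a)               # same membership test as before, on b0
c = [bi for bi, keep in zip(b, mask) if keep]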
Use list comprehension:
c = [l for l in b if l[0] in a]
Output:
[[10, 8, 52, 30, 15], [11, 81, 152, 54, 112], [13, 82, 84, 63, 24]]
If your lists or arrays are considerably large, using numpy.isin can be significantly faster:

b[np.isin(b[:, 0], a), :]

Benchmark:

import timeit
import numpy as np

a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56], [10, 8, 52, 30, 15], [11, 81, 152, 54, 112],
     [13, 82, 84, 63, 24], [18, 182, 25, 63, 96]]
list_comp, np_isin = [], []
for i in range(1, 100):
    a_test = a * i
    b_test = b * i
    list_comp.append(timeit.timeit('[l for l in b_test if l[0] in a_test]', number=10, globals=globals()))
    a_arr = np.array(a_test)
    b_arr = np.array(b_test)
    np_isin.append(timeit.timeit('b_arr[np.isin(b_arr[:, 0], a_arr), :]', number=10, globals=globals()))
Though the list comprehension is clearer and more concise, I would only recommend it if b has fewer than about 100 rows; otherwise, numpy is the way to go.
You are doing it in reverse. It is better to loop through the elements of b and check whether each one's first element is present in a; if yes, print that element of b. See the answer below.
a = [10, 11, 12, 13, 14]
b = [[9, 23, 45, 67, 56, 23, 54], [10, 8, 52, 30, 15, 47, 109], [11, 81, 152, 54, 112, 78, 167], [13, 82, 84, 63, 24, 26, 78], [18, 182, 25, 63, 96, 104, 74]]
for bb in b:  # if you want to check only whether the first element of bb is in a
    if bb[0] in a:
        print(bb)

for bb in b:  # if you want to check whether any element of bb is in a
    for bbb in bb:
        if bbb in a:
            print(bb)
            break  # stop after the first match so the row prints only once
Output:
[10, 8, 52, 30, 15, 47, 109]
[11, 81, 152, 54, 112, 78, 167]
[13, 82, 84, 63, 24, 26, 78]

Removing all duplicates of max numbers in a correlation table

I need code for processing a .csv of a correlation table; a sample of the table is posted here:
AA bb cc dd ff
AA 100 87 71 71 78
bb 87 100 73 74 81
cc 71 73 100 96 69
dd 71 74 96 100 71
ee 78 81 69 100 100
ff 72 73 68 68 71
Pg 68 69 62 62 64
Ph 68 69 69 62 64
Pi 68 69 62 62 64
Pj 68 69 63 63 64
Pk 70 71 65 65 67
I have read the .csv file with Python's csv module as a list of lists, then removed the first row and column. I am now trying to take these int values and find the max values of each row; if there are multiple max values in a row, I want all of them.
Then I intend to place that output into a table:
file1values col row %
group1 AA AA 100
...
group1 dd ee 100
group1 ff ee 100
The issue I have so far is getting the max values for each row. I am also a bit confused about how to get the address (the column and row) of each max value.
Here is my code so far:

from io import StringIO
import csv
import sys
import numpy as np

with open('/home/group1.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    data_as_list = list(reader)

a = np.array(data_as_list)
a = np.delete(a, 0, axis=0)  # drop the header row
a = np.delete(a, 0, axis=1)  # drop the label column
np.set_printoptions(threshold=sys.maxsize)  # threshold=np.nan fails on modern numpy
print(a)
print('')

b = a.astype(int)
maxArr = []
count = 0
while count < b.shape[0]:
    print(b[count])
    count = count + 1
    maxArr.append(max(b[count - 1]))
print(maxArr)
There are easier ways. Create a random matrix for tests:

> import numpy as np
> m = np.random.randint(100, size=(10, 10))

Set the diagonal to zero (or to an out-of-range negative number), so an entry never pairs with itself:

> np.fill_diagonal(m, 0)
> m
array([[ 0, 35, 52, 40, 54, 1, 20, 41, 62, 92],
[45, 0, 75, 71, 85, 86, 83, 39, 52, 69],
[29, 21, 0, 78, 32, 14, 13, 27, 31, 26],
[99, 90, 16, 0, 28, 36, 30, 45, 85, 41],
[29, 21, 48, 31, 0, 86, 18, 7, 70, 76],
[96, 97, 34, 82, 51, 0, 69, 22, 27, 85],
[71, 58, 98, 42, 3, 51, 0, 19, 41, 93],
[54, 97, 86, 75, 62, 91, 78, 0, 55, 89],
[87, 44, 44, 54, 94, 94, 57, 24, 0, 81],
[94, 32, 1, 92, 34, 46, 96, 38, 75, 0]])
Find the maximum value per column/row (since your matrix is symmetric, it doesn't matter which):

> cm = np.argmax(m, 1)
> cm
array([9, 5, 3, 0, 5, 1, 2, 1, 4, 6])

You will need to map the row/column indices to your labels, as sketched after this output:

> for r in range(10):
      print(r, cm[r], m[r, cm[r]])
0 9 92
1 5 86
2 3 78
3 0 99
4 5 86
5 1 97
6 2 98
7 1 97
8 4 94
9 6 96
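Since argmax returns only one position per row, here is a hedged sketch (the `labels` list is illustrative, not from the original answer) that maps indices to labels and also reports ties, which the question asked for:

import numpy as np

labels = ['AA', 'bb', 'cc', 'dd', 'ee', 'ff', 'Pg', 'Ph', 'Pi', 'Pj']  # illustrative row/column names
for r in range(m.shape[0]):
    row_max = m[r].max()
    # np.flatnonzero returns every column that ties for the row maximum
    for c in np.flatnonzero(m[r] == row_max):
        print('group1', labels[c], labels[r], row_max)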
