Manipulating a single-row DataFrame object into a 6x6 DataFrame - python

What I want to do is pretty much explained in the title, but for good measure here is my problem:
For the sake of the example, let's say I have a Google Form with 36 questions and I want to reshape that single row of answers into a 6x6 dataframe using Python 3. The problem is that I get an error, but I'm getting ahead of myself. Here is what I've tried:
from flask import Flask, render_template, request
import pandas as pd
import numpy as np
io_table=pd.DataFrame(np.random.random_sample((1,36)))
fctr_column=pd.DataFrame(np.random.random_sample((6)))
io_table=pd.DataFrame(io_table) #Convert list to DataFrame
io_t=io_table
factor=fctr_column
test=pd.DataFrame()
for i in range(0,io_table.shape[1]+1):
    test=io_table.loc[0,i+1:i+6], ignore_index=True
    i=i+6
print(test)
And, as I mentioned before, I got an error:
File "path/to/temp.py", line 29, in <module>
test=io_table.loc[0,i+1:i+6], ignore_index=True
TypeError: cannot unpack non-iterable bool object
Now, I don't know what to do. Can anyone provide a solution?
EDIT: Expected input and output

Not sure if I got you right, but if you have a DataFrame with 36 values you can reshape it with something like the following example:
import pandas as pd
a = range(1, 37)
df = pd.DataFrame(a).T
df.values.reshape((6,6))
#[[ 1 2 3 4 5 6]
# [ 7 8 9 10 11 12]
# [13 14 15 16 17 18]
# [19 20 21 22 23 24]
# [25 26 27 28 29 30]
# [31 32 33 34 35 36]]
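If the goal is an actual 6x6 DataFrame rather than a plain array, the reshaped values can be wrapped back into a DataFrame; a minimal sketch building on the example above:
import numpy as np
import pandas as pd
row = pd.DataFrame(np.arange(1, 37)).T           # single-row DataFrame with 36 values
df_6x6 = pd.DataFrame(row.values.reshape(6, 6))  # 6x6 DataFrame
print(df_6x6)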

Related

How to merge an itertools generated dataframe and a normal dataframe in pandas?

I have generated a dataframe containing all possible two-lead combinations of electrocardiogram (ECG) leads using itertools, with the code below:
source = [ 'I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s' ]
target = [ 'I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t' ]
import pandas as pd
from itertools import product
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows covering all possible two-lead combinations.
The value for each combination is initially zero:
test['value'] = 0
The test df looks like this:
I have another dataframe called diagramDF that contains the combinations where the value column is non-zero. The diagramDF is significantly smaller than the test dataframe.
source target value
0 I-s II-t 137
1 II-s I-t 3
2 II-s III-t 81
3 II-s IILong-t 13
4 II-s V1-t 21
5 III-s II-t 3
6 III-s aVF-t 19
7 IILong-s II-t 13
8 IILong-s V1Long-t 353
9 V1-s aVL-t 11
10 V1Long-s IILong-t 175
11 V1Long-s V3-t 4
12 V1Long-s aVF-t 4
13 V2-s V3-t 8
14 V3-s V2-t 6
15 V3-s V6-t 2
16 V5-s aVR-t 5
17 V6-s III-t 4
18 aVF-s III-t 79
19 aVF-s V1Long-t 235
20 aVL-s I-t 1
21 aVL-s aVF-t 16
22 aVR-s aVL-t 1
Note that the first two columns, source and target, use the same notation in both dataframes.
I have tried to replace the zero values of the test dataframe with the nonzero values of the diagramDF using merge like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a
multi-index, the label must be a tuple with elements corresponding to
each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
This might help:
pd.merge(test, diagramDF, how='left', on=['source', 'target'], right_index=True, left_index=True)
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
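If the end goal is the full 256-row table where combinations missing from diagramDF keep a value of 0, one way (a sketch, assuming the column names shown above) is to merge on the key columns only and fill the gaps afterwards:
filled = pd.merge(test[['source', 'target']], diagramDF, how='left', on=['source', 'target'])
filled['value'] = filled['value'].fillna(0).astype(int)   # combinations absent from diagramDF stay 0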

How to implement fast numpy array computation with multiple occurring slice indices?

I was recently wondering how I could bypass the following numpy behavior.
Starting with a simple example:
import numpy as np
a = np.array([[1,2,3,4,5,6,7,8,9,0], [11, 12, 13, 14, 15, 16, 17, 18, 19, 10]])
then:
b = a.copy()
b[:, [0,1,4,8]] = b[:, [0,1,4,8]] + 50
print(b)
...results in printing:
[[51 52 3 4 55 6 7 8 59 0]
[61 62 13 14 65 16 17 18 69 10]]
but when one index appears twice in the fancy index:
c = a.copy()
c[:, [0,1,4,4,8]] = c[:, [0,1,4,4,8]] + 50
print(c)
giving:
[[51 52 3 4 55 6 7 8 59 0]
[61 62 13 14 65 16 17 18 69 10]]
(in short: they do the same thing)
Could I make the addition be applied twice for index 4, since it appears twice?
Or, more generally: if an index i appears r times in the index list, can the expression be applied r times instead of numpy taking it into account only once? And what if "50" is replaced by something that differs for every occurrence of i?
For my current code, I used:
w[p1] = w[p1] + D[pix]
where "pix" and "p1" are integer numpy arrays of the same length, in which some values may appear multiple times.
(So one may have pix = [..., 1,1,1,2,2,3,...] together with p1 = [..., 21,32,13,23,11,78,...]; numpy then effectively applies the update for each repeated index only once and discards the remaining occurrences.)
Of course using a for loop would solve the problem easily. The point is that both the integers and the sizes of the arrays are huge, so it would cost a lot of computational resources to use for-loops instead of efficient numpy-array routines. Any ideas, links to existing documentation etc.?
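For reference, numpy ufuncs have an unbuffered at method that applies the operation once per occurrence of an index, which is the usual way to get the accumulate-per-occurrence behaviour described above; a minimal sketch with made-up data:
import numpy as np
w = np.zeros(5)
p1 = np.array([1, 1, 1, 2, 4])             # index 1 occurs three times
D = np.array([10., 20., 30., 40., 50.])
pix = np.array([0, 1, 2, 3, 4])
np.add.at(w, p1, D[pix])                   # every occurrence contributes, unlike w[p1] += D[pix]
print(w)                                   # [ 0. 60. 40.  0. 50.]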

Splitting a DataFrame into an Array Using Numpy

I have a file called data that looks like this:
Some Text Information (lines 1-6 in file)
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
What I'm trying to achieve is something like this:
[[ 22. 23.]
[ 44. 44.]
[ 55. 55.]
[ 66. 66.]
[ 77. 77.]]
The issue I'm having is that the code I'm using doesn't properly split the data from the file. It ends up looking like this:
[ 1 22 23
0 2 44 44
1 3 55 55, Empty DataFrame
Columns: [1 6734 1453]
Index: [], 1 22 23
2 4 44 44
3 5 55 55
4 6 66 66
5 7 77 77
EOF]
Here's the code I'm using:
def loadFile(filename):
    df1 = pd.read_fwf(filename, skiprows=6)
    df1 = np.split(df, [2,2])
    print('The data points:\n {}'.format(df1[:5]))
I understand the parameters of the split function. For instance, [2,2] should create two sub arrays from my dataframe and my axis is 0. However, why does it not properly split the array?
You can read the file into a pandas DataFrame and access its values attribute. Assuming "Some Text Information" is not the header:
import pandas as pd
df = pd.read_table(filepath, sep='\t', index_col= 0, skiprows = 6, header = None)
df.values # gives you the numpy ndarray
This should use the first column as the index. You might also need to remove the sep argument and let read_table figure it out, or try other separators. If the row index still ends up in your data, slice it off to get the desired result, with something like:
df.iloc[:,1:].values
Do not use read_fwf, let pandas figure out the structure of your table:
df = pd.read_csv("yourfile", skiprows=6, header=None, sep='\s+')
To elaborate on ManKind_008's answer:
Your explicit line numbers are the problem. Pandas interprets these as valid data.
Using ManKind_008's solution does properly set the index column, but since your line numbers start at zero you end up with a DataFrame like:
pd.read_fwf('test.csv', header=None, index_col=0, skiprows=6)
1 2
0
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
Instead I suggest you read in all of your data using:
pd.read_fwf('test.csv', header=None, skiprows=6).iloc[:, 1:]
1 2
0 22 23
1 44 44
2 55 55
3 66 66
4 77 77
This leaves you with what you seem to need. The iloc call is ignoring the first row of data (your line numbers).
From here the df.values command will give you:
array([[22, 23],
[44, 44],
[55, 55],
[66, 66],
[77, 77]])
If you don't want a np.array, you can explicitly cast this to a list using the list() function.
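Since the expected output in the question shows floats (22., 23., ...), the result can also be cast explicitly; a small sketch building on the previous answer (still assuming the data lives in 'test.csv'):
import pandas as pd
arr = pd.read_fwf('test.csv', header=None, skiprows=6).iloc[:, 1:].values.astype(float)
print(arr)   # e.g. [[22. 23.] [44. 44.] ...]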

Why does numpy.concatenate work along axis=1 for small one dimensional arrays, but not larger ones?

I couldn't get my 4 arrays of year, day of year, hour, and minute to concatenate the way I wanted, so I decided to test several variations on shorter arrays than my data.
I found that it worked using method "t" from my test code:
import numpy as np
a=np.array([[1, 2, 3, 4, 5, 6]])
b=np.array([[11, 12, 13, 14, 15, 16]])
c=np.array([[21, 22, 23, 24, 25, 26]])
d=np.array([[31, 32, 33, 34, 35, 36]])
print a
print b
print c
print d
q=np.concatenate((a, b, c, d), axis=0)
#concatenation along 1st axis
print q
t=np.concatenate((a.T, b.T, c.T, d.T), axis=1)
#transpose each array before concatenation along 2nd axis
print t
x=np.concatenate((a, b, c, d), axis=1)
#concatenation along 2nd axis
print x
But when I tried this with the larger arrays it behaved the same as method "q".
I found an alternative approach of using vstack over here that did what I wanted, but I am trying to figure out why concatenation sometimes works for this, but not always.
Thanks for any insights.
Also, here are the outputs of the code:
q:
[[ 1 2 3 4 5 6]
[11 12 13 14 15 16]
[21 22 23 24 25 26]
[31 32 33 34 35 36]]
t:
[[ 1 11 21 31]
[ 2 12 22 32]
[ 3 13 23 33]
[ 4 14 24 34]
[ 5 15 25 35]
[ 6 16 26 36]]
x:
[[ 1 2 3 4 5 6 11 12 13 14 15 16 21 22 23 24 25 26 31 32 33 34 35 36]]
EDIT: I added method t to the end of a section of the code that was already fixed with vstack, so you can compare how vstack will work with this data but not concatenate. Again, to clarify, I found a workaround already, but I don't know why the concatenate method doesn't seem to be consistent.
Here is the code:
import numpy as np
BAO10m=np.genfromtxt('BAO_010_2015176.dat', delimiter=",", usecols=range(0, 6), dtype=[('h', int), ('year', int), ('day', int), ('time', int), ('temp', float)])
#10 meter weather readings at BAO tower site for June 25, 2015
hourBAO=BAO10m['time']/100
minuteBAO=BAO10m['time']%100
#print hourBAO
#print minuteBAO
#time arrays
dayBAO=BAO10m['day']
yearBAO=BAO10m['year']
#date arrays
datetimeBAO=np.vstack((yearBAO, dayBAO, hourBAO, minuteBAO))
#t=np.concatenate((a.T, b.T, c.T, d.T), axis=1) <this gave desired results in simple tests
#not working for this data, use vstack instead, with transposition after stack
print datetimeBAO
test=np.transpose(datetimeBAO)
#rotate array
print test
#this prints something that can be used for datetime
t=np.concatenate((yearBAO.T, dayBAO.T, hourBAO.T, minuteBAO.T), axis=1)
print t
#this prints a 1D array of all the year values, then all the day values, etc...
#but this method worked for shorter 1D arrays
The file I used can be found at this site.
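For what it's worth, the difference most likely comes from dimensionality: in the small test a, b, c and d have shape (1, 6), so a.T has shape (6, 1) and axis=1 concatenation places them side by side, whereas the fields returned by genfromtxt are 1-D, where .T does nothing. A sketch of getting the same column layout from 1-D arrays, with made-up values:
import numpy as np
year = np.array([2015, 2015, 2015])
day = np.array([176, 176, 176])
hour = np.array([0, 0, 1])
minute = np.array([0, 30, 0])
# column_stack treats 1-D inputs as columns, equivalent to np.vstack(...).T
datetimeBAO = np.column_stack((year, day, hour, minute))
print(datetimeBAO)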

Python Pandas qcut behavior with # of observations not divisible by # of bins

Suppose I had a pandas series of dollar values and wanted to discretize into 9 groups using qcut. The # of observations is not divisible by 9. SQL Server's ntile function has a standard approach for this case: it makes the first n out of 9 groups 1 observation larger than the remaining (9-n) groups.
I noticed in pandas that the assignment of which groups have x observations vs x + 1 observations seemed random. I tried to decipher the code in pandas' algos module to figure out how the quantile function deals with this issue, but could not work it out.
I have three related questions:
Any pandas developers out there that can explain qcut's behavior? Is it random which groups get the larger number of observations?
Is there a way to force qcut to behave similarly to NTILE (i.e., first groups get x + 1 observations)?
If the answer to #2 is no, any ideas on a function that would behave like NTILE? (If this is a complicated endeavor, just an outline to your approach would be helpful.)
Here is an example of SQL Server's NTILE output.
Bin |# Observations
1 26
2 26
3 26
4 26
5 26
6 26
7 26
8 25
9 25
Here is pandas:
Bin |# Observations
1 26
2 26
3 26
4 25 (Why is this 25 vs others?)
5 26
6 26
7 25 (Why is this 25 vs others?)
8 26
9 26
qcut behaves like this because it is more accurate. Here is an example:
for the ith bin, the boundary starts at the (i-1)*10% quantile:
import pandas as pd
import numpy as np
a = np.random.rand(26*10+3)
r = pd.qcut(a, 10)
np.bincount(r.codes)  # .codes on current pandas (older versions exposed .labels)
the output is:
array([27, 26, 26, 26, 27, 26, 26, 26, 26, 27])
If you want NTILE, you can calculate the quantiles yourself:
n = len(a)
ngroup = 10
counts = np.ones(ngroup, int)*(n//ngroup)
counts[:n%ngroup] += 1
q = np.r_[0, np.cumsum(counts / float(n))]
q[-1] = 1.0
r2 = pd.qcut(a, q)
np.bincount(r2.codes)
the output is:
array([27, 27, 27, 26, 26, 26, 26, 26, 26, 26])
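If an exact NTILE-style split is needed (the first groups get the extra observation), it can also be done without quantiles at all by splitting the sort order directly; a sketch, not taken from the original answer:
import numpy as np

def ntile(values, n):
    # NTILE-like 1-based bins: the first len(values) % n bins get one extra observation
    values = np.asarray(values)
    order = np.argsort(values, kind="mergesort")   # stable sort keeps ties in input order
    bins = np.empty(len(values), dtype=int)
    for tile, idx in enumerate(np.array_split(order, n), start=1):
        bins[idx] = tile
    return bins

a = np.random.rand(26 * 9 + 7)
print(np.bincount(ntile(a, 9))[1:])   # [27 27 27 27 27 27 27 26 26]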
