I have the following code that tries to create a 3D plot for the dataset YearlyKeywordsFrequency. I cannot figure out why this error is occurring:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits import mplot3d

# Load the keyword/frequency table: column 0 = keyword,
# columns 1..4 = per-year frequency counts.
myData = pd.read_csv('counted-JOSAIC.csv', delimiter=',', skiprows=0,usecols=range(0,5))
print(myData)
item_list = list(myData.columns) #Names of Columns
item_list = item_list[1:]
print(item_list)
myData = np.array(myData) #Convert to numpy
keywords = np.asarray(myData[:,0]) #Get the Keywords
print(keywords)
data = np.asarray(myData[:,1:]) #remove Keywords from data
print(data.shape)
print(data)
##################################################################################
###x=keyword
###y=year
###z=freq
# NOTE(review): x and y are plain `range` objects here, while
# contour3D expects coordinate arrays compatible with data's 2-D
# shape — this is what produces the ValueError in the traceback
# below. Build np.arange(...) vectors and pass them through
# np.meshgrid before calling contour3D.
y=range(len(keywords))
x=range(len(item_list))
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(x, y, data, 50, cmap='binary')
ax.set_yticklabels(keywords)
ax.set_xticklabels(item_list)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
plt.show()
This code gives the following Results with error
Kewords freq-2015 ... freq-2017 freq-2018
0 energy 526 ... 89 97
1 power 246 ... 170 125
2 wireless 194 ... 121 144
3 transmission 157 ... 77 106
4 optimal 153 ... 100 110
5 interference 136 ... 100 78
6 spectrum 132 ... 126 29
7 allocation 125 ... 143 101
8 harvesting 123 ... 5 11
9 node 114 ... 25 63
10 capacity 106 ... 92 67
11 cellular 102 ... 72 39
12 relay 98 ... 20 35
13 access 97 ... 138 98
14 control 94 ... 50 87
15 link 91 ... 62 105
16 radio 91 ... 78 55
17 localization 89 ... 11 3
18 receiver 84 ... 20 38
19 sensor 82 ... 4 21
20 optical 80 ... 6 50
21 simulation 79 ... 90 94
22 probability 79 ... 51 44
23 the 78 ... 59 64
24 mimo 78 ... 192 49
25 signal 76 ... 38 38
26 sensing 76 ... 33 0
27 throughput 73 ... 65 39
28 packet 73 ... 8 38
29 heterogeneous 71 ... 36 42
... ... ... ... ...
8348 rated 0 ... 0 1
8349 150 0 ... 0 1
8350 highdefinition 0 ... 0 1
8351 facilitated 0 ... 0 1
8352 750 0 ... 0 1
8353 240 0 ... 0 1
8354 supplied 0 ... 0 1
8355 robotic 0 ... 0 1
8356 confinement 0 ... 0 1
8357 jam 0 ... 0 1
8358 8x6 0 ... 0 1
8359 megahertz 0 ... 0 1
8360 rotations 0 ... 0 1
8361 sudden 0 ... 0 1
8362 fades 0 ... 0 1
8363 marine 0 ... 0 1
8364 habitat 0 ... 0 1
8365 probes 0 ... 0 1
8366 uowcs 0 ... 0 1
8367 uowc 0 ... 0 1
8368 manchestercoded 0 ... 0 1
8369 avalanche 0 ... 0 1
8370 apd 0 ... 0 1
8371 pin 0 ... 0 1
8372 shallow 0 ... 0 1
8373 harbor 0 ... 0 1
8374 waters 0 ... 0 1
8375 focal 0 ... 0 1
8376 lcd 0 ... 0 1
8377 display 0 ... 0 1
[8378 rows x 5 columns]
[' freq-2015', ' freq-2016', ' freq-2017', ' freq-2018']
['energy' 'power' 'wireless' ... 'focal' 'lcd' 'display']
(8378, 4)
[[526 747 89 97]
[246 457 170 125]
[194 248 121 144]
...
[0 0 0 1]
[0 0 0 1]
[0 0 0 1]]
Traceback (most recent call last):
File "<ipython-input-5-7d351bf710cc>", line 1, in <module>
runfile('C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal/PlotingCharts.py', wdir='C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal')
File "e:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "e:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal/PlotingCharts.py", line 111, in <module>
ax.contour3D(x, y, data, 50, cmap='binary')
File "e:\ProgramData\Anaconda3\lib\site-packages\mpl_toolkits\mplot3d\axes3d.py", line 2076, in contour
self.auto_scale_xyz(X, Y, Z, had_data)
File "e:\ProgramData\Anaconda3\lib\site-packages\mpl_toolkits\mplot3d\axes3d.py", line 494, in auto_scale_xyz
self.xy_dataLim.update_from_data_xy(np.array([x, y]).T, not had_data)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\transforms.py", line 913, in update_from_data_xy
path = Path(xy)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\path.py", line 127, in __init__
vertices = _to_unmasked_float_array(vertices)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\cbook\__init__.py", line 1365, in _to_unmasked_float_array
return np.asarray(x, float)
File "e:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence
I am trying to make a chart something like this, but I can't understand why it gives an error when a 2D array is required and my array shape is (8378, 4) — so what is the problem?
You must change your x and y types to arrays:
# Corrected version: frequency surface over (year, keyword) as a
# 3-D contour plot.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits import mplot3d

frame = pd.read_csv('counted-JOSAIC.csv', delimiter=',', skiprows=0,usecols=range(0,5))

# Column names minus the leading keyword column -> the year labels.
item_list = list(frame.columns)[1:]
print(item_list)

raw = np.array(frame)              # whole table as one ndarray
keywords = np.asarray(raw[:, 0])   # first column holds the keywords
print(keywords)
data = np.asarray(raw[:, 1:])      # remaining columns hold frequencies
print(data.shape)
print(data)

##################################################################################
###x=keyword
###y=year
###z=freq
# contour3D needs 2-D coordinate grids matching data's shape, so the
# index vectors are expanded with meshgrid.
y = np.arange(len(keywords))
x = np.arange(len(item_list))
X, Y = np.meshgrid(x, y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, data, 50, cmap='binary')
ax.set_yticklabels(keywords)
ax.set_xticklabels(item_list)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()
This gives:
Obviously you must change colors and x and y accordingly to get your desired output.
x, y, and data are not compatible.
type(data) gives <class 'numpy.ndarray'>
type(x) gives <class 'range'>
type(y) gives <class 'range'>
On the other hand, in this example, all of X, Y, and Z are of <class 'numpy.ndarray'> type. I think that you should try converting x and y to arrays.
Related
I have changed the column names and have added new columns too.
I am having a numpy array that I have to fill in the respective dataframe columns.
I am getting a delayed response in filling the dataframe using the following code:
import pandas as pd
import numpy as np

# Keep only the most recent 1000 rows of the sample data.
df = pd.read_csv("sample.csv")
df = df.tail(1000)

# Filled on the first Generate_DataSet call with the column names.
DISPLAY_IN_TRAINING = []
# Window slicing scheme used by Get_Data_Chopped: the inputs are every
# row of a window except the last, the target is the last row; both
# keep all columns.
Slice_Middle_Piece_X = slice(None,-1, None)
Slice_Middle_Piece_Y = slice(-1, None)
input_slicer = slice(None, None)
output_slice = slice(None, None)
seq_len = 15 # choose sequence length
n_steps = seq_len - 1
Disp_Data = df
def Generate_DataSet(stock,
                     df_clone,
                     seq_len
                     ):
    """Build sliding-window training data from *stock*.

    Cuts `stock` into overlapping windows of length `seq_len`, splits
    the first 70% of windows into (inputs, targets) via
    Get_Data_Chopped, and aligns a display frame from `df_clone`.

    Returns [x_train, y_train, df_train_candle].
    """
    global DISPLAY_IN_TRAINING
    data_raw = stock.values # convert to numpy array
    data = []
    len_data_raw = data_raw.shape[0]
    # Every contiguous window of seq_len rows.
    for index in range(0, len_data_raw - seq_len + 1):
        data.append(data_raw[index: index + seq_len])
    data = np.array(data);
    # 30% of the windows are reserved for testing.
    test_set_size = int(np.round(30 / 100 * data.shape[0]));
    train_set_size = data.shape[0] - test_set_size;
    x_train, y_train = Get_Data_Chopped(data[:train_set_size])
    print("Training Sliced Successful....!")
    # Rows of df_clone that line up with the training targets
    # (each target is the last row of its window, hence the n_steps shift).
    df_train_candle = df_clone[n_steps : train_set_size + n_steps]
    # Cache the column names once, on the first call.
    if len(DISPLAY_IN_TRAINING) == 0:
        DISPLAY_IN_TRAINING = list(df_clone)
    df_train_candle.columns = DISPLAY_IN_TRAINING
    return [x_train, y_train, df_train_candle]
def Get_Data_Chopped(data_related_to,
                     x_slice=slice(None, -1, None),
                     y_slice=slice(-1, None),
                     x_cols=slice(None, None),
                     y_cols=slice(None, None)):
    """Split each sequence window into model inputs and targets.

    The original read four module-level slice globals; they are now
    keyword parameters (with the same default values) so the function
    is reusable and self-contained.

    Parameters
    ----------
    data_related_to : iterable of 2-D arrays
        Each item is one sliding window, shape (seq_len, n_features).
    x_slice, x_cols : slice, optional
        Rows/columns kept as the input piece (default: all rows but
        the last, all columns).
    y_slice, y_cols : slice, optional
        Rows/columns kept as the target piece (default: the last row,
        all columns), flattened to one value-list per window.

    Returns
    -------
    list
        [x_values, y_values] as numpy arrays.
    """
    x_values = []
    y_values = []
    for window in data_related_to:
        x_values.append(window[x_slice, x_cols])
        # Flatten the (1, n_features) target row to a plain list.
        y_values.append([item for row in window[y_slice, y_cols] for item in row])
    return [np.asarray(x_values), np.asarray(y_values)]
# Build the training slices plus a display frame aligned to the targets.
x_train, y_train, df_train_candle = Generate_DataSet(df,
                                                     Disp_Data,
                                                     seq_len
                                                     )
df_train_candle.reset_index(drop = True, inplace = True)

# Rename each source column "<name>_orig" and collect a matching
# "<name>_pred" name for the model outputs.
df_columns = list(df_train_candle)
df_outputs_name = []
OUTPUT_COLUMN = df.columns
for output_column_name in OUTPUT_COLUMN:
    df_outputs_name.append(output_column_name + "_pred")
    for i in range(len(df_columns)):
        if df_columns[i] == output_column_name:
            df_columns[i] = output_column_name + "_orig"
            break
df_train_candle.columns = df_columns

# Attach the (empty) *_pred columns, then fill *_orig and *_pred
# cell by cell from y_train.
# NOTE(review): this per-cell .loc loop is the slow part the question
# complains about; it looks replaceable by two vectorized block
# assignments (df_train_candle[orig_cols] = y_train, likewise for
# pred) — confirm the column order matches OUTPUT_COLUMN first.
df_pred_names = pd.DataFrame(columns = df_outputs_name)
df_train_candle = df_train_candle.join(df_pred_names, how="outer")
for row_index, row_value in enumerate(y_train):
    for valueindex, output_label in enumerate(OUTPUT_COLUMN):
        df_train_candle.loc[row_index, output_label + "_orig"] = row_value[valueindex]
        df_train_candle.loc[row_index, output_label + "_pred"] = row_value[valueindex]
print(df_train_candle.head())
The shape of my y_train is (195, 24) and the dataframe shape is (195, 48). Now I am trying to optimize and make the process work faster. The y_train may change shape to say (195, 1) or (195, 5).
So please can someone tell me another (optimized) way of doing the above process? I want a general solution that could fit anything without losing data integrity and is faster too.
If the data size increases from 1000 to 2000 rows, the process becomes slow. Please advise how to make it faster.
Sample Data df looks like this with shape (1000, 8)
A B C D E F G H
64272 195 215 239 272 22 11 33 55
64273 196 216 240 273 22 11 33 55
64274 197 217 241 274 22 11 33 55
64275 198 218 242 275 22 11 33 55
64276 199 219 243 276 22 11 33 55
The output looks like this:
A_orig B_orig C_orig D_orig E_orig F_orig G_orig H_orig A_pred B_pred C_pred D_pred E_pred F_pred G_pred H_pred
0 10 30 54 87 22 11 33 55 10 30 54 87 22 11 33 55
1 11 31 55 88 22 11 33 55 11 31 55 88 22 11 33 55
2 12 32 56 89 22 11 33 55 12 32 56 89 22 11 33 55
3 13 33 57 90 22 11 33 55 13 33 57 90 22 11 33 55
4 14 34 58 91 22 11 33 55 14 34 58 91 22 11 33 55
Please generate csv columns with 1000 or more lines and see that the program becomes slower. I want to make it faster. I hope this is good to go for understanding.
I have 2 Dataframes
DF1-
Name X Y
0 Astonished 0.430 0.890
1 Excited 0.700 0.720
2 Expectant 0.320 0.067
3 Passionate 0.333 0.127
[47 rows * 3 columns]
DF2-
Id X Y
0 1 -0.288453 0.076105
1 4 -0.563453 -0.498895
2 5 -0.788453 -0.673895
3 6 -0.063453 -0.373895
4 7 0.311547 0.376105
[767 rows * 3 columns]
Now what I want to achieve is -
Take the X,Y from first entry from DF2, iterate it over DF1, calculate Euclidean Distance between each value of X,Y in DF2.
Find the minimum of all the Euclidean Distance obtained between the two points, save the minimum result somewhere along with the corresponding entry under the name column.
Example-
Say for any tuple of X,Y in DF2, the minimum Euclidean distance is corresponding to the X,Y value in the row 0 of DF1, then the result should be, the distance and name Astonished.
My Attempt-
import pandas as pd
import numpy as np
import csv

# DF1: named mood points; DF2: per-song points.
mood = pd.read_csv("C:/Users/Desktop/DF1.csv")
song_value = pd.read_csv("C:/Users/Desktop/DF2.csv")
# NOTE(review): df_temp/df_temp1 are computed but never used below.
# NOTE(review): the frames shown in the question have columns X/Y,
# while the code selects 'Arousal'/'Valence' — confirm the real
# column names in the CSV files.
df_temp = mood.loc[:, ['Arousal','Valence']]
df_temp1 = song_value.loc[:, ['Arousal','Valence']]
import scipy
from scipy import spatial
# Full pairwise Euclidean distance matrix:
# rows = DF1 entries, columns = DF2 entries.
ary = scipy.spatial.distance.cdist(mood.loc[:, ['Arousal','Valence']], song_value.loc[:, ['Arousal','Valence']], metric='euclidean')
print (ary)
Result Obtained -
[[1.08563344 1.70762362 1.98252253 ... 0.64569366 0.47426051 0.83656989]
[1.17967807 1.75556794 2.03922435 ... 0.59326275 0.2469077 0.79334076]
[0.60852124 1.04915517 1.33326431 ... 0.1848471 0.53293637 0.08394834]
...
[1.26151359 1.5500629 1.81168766 ... 0.74070027 0.70209658 0.75277205]
[0.69085994 1.03764923 1.31608627 ... 0.33265268 0.61928227 0.21397822]
[0.84484398 1.11428893 1.38222899 ... 0.48330291 0.69288125 0.3886008 ]]
I have no clue how I should proceed now.
Please suggest something.
EDIT - 1
I converted the array in another data frame using
new_series = pd.DataFrame(ary)
print (new_series)
Result -
0 1 2 ... 764 765 766
0 1.085633 1.707624 1.982523 ... 0.645694 0.474261 0.836570
1 1.179678 1.755568 2.039224 ... 0.593263 0.246908 0.793341
2 0.608521 1.049155 1.333264 ... 0.184847 0.532936 0.083948
3 0.623534 1.093331 1.378075 ... 0.124156 0.479393 0.109057
4 0.791926 1.352785 1.636748 ... 0.197403 0.245908 0.398619
5 0.740038 1.260768 1.545785 ... 0.092072 0.304926 0.281791
6 0.923284 1.523395 1.803676 ... 0.415540 0.293217 0.611312
7 1.202447 1.679660 1.962823 ... 0.554256 0.247391 0.703298
8 0.824898 1.343684 1.628727 ... 0.177560 0.222666 0.360980
9 1.191411 1.604942 1.883150 ... 0.570771 0.395957 0.668736
10 0.822236 1.456863 1.708469 ... 0.706252 0.787271 0.823542
11 0.741683 1.371996 1.618916 ... 0.704496 0.835235 0.798964
12 0.346244 0.967891 1.240839 ... 0.376504 0.715617 0.359700
13 0.526096 1.163209 1.421820 ... 0.520190 0.748265 0.579333
14 0.435992 0.890291 1.083229 ... 0.937048 1.254437 0.884499
15 0.600338 1.162469 1.375755 ... 0.876228 1.116301 0.891714
16 0.634254 1.059083 1.226407 ... 1.088393 1.373536 1.058550
17 0.712227 1.284502 1.498187 ... 0.917272 1.117806 0.956957
18 0.194387 0.799728 1.045745 ... 0.666713 1.013563 0.597524
19 0.456000 0.708741 0.865870 ... 1.068296 1.420654 0.973234
20 0.633776 0.632060 0.709202 ... 1.277083 1.645173 1.157765
21 0.192291 0.597749 0.826602 ... 0.831713 1.204117 0.716746
22 0.522033 0.526969 0.645998 ... 1.170316 1.546040 1.041762
23 0.668148 0.504480 0.547920 ... 1.316602 1.698041 1.176933
24 0.718440 0.285718 0.280984 ... 1.334008 1.727796 1.166364
25 0.759187 0.265412 0.217165 ... 1.362786 1.757580 1.190132
26 0.598326 0.113459 0.380513 ... 1.087573 1.479296 0.896239
27 0.676841 0.263613 0.474246 ... 1.074911 1.456515 0.875707
28 0.865641 0.365394 0.462742 ... 1.239941 1.612779 1.038790
29 0.463623 0.511737 0.786284 ... 0.719525 1.099122 0.519226
30 0.780386 0.550483 0.750532 ... 0.987863 1.336760 0.788449
31 1.077559 0.711697 0.814205 ... 1.274933 1.602953 1.079529
32 1.020408 0.497152 0.522999 ... 1.372444 1.736938 1.170889
33 0.963911 0.367018 0.336035 ... 1.398444 1.778496 1.198905
34 1.092763 0.759612 0.873457 ... 1.256086 1.574565 1.063570
35 0.903631 0.810449 1.018501 ... 0.921287 1.219046 0.740134
36 0.728728 0.795942 1.045868 ... 0.695317 1.009043 0.512147
37 0.738314 0.600405 0.822742 ... 0.895225 1.239125 0.697393
38 1.206901 1.151385 1.343654 ... 1.064721 1.273002 0.922962
39 1.248530 1.293525 1.508517 ... 0.988508 1.137608 0.880669
40 0.988777 1.205968 1.463036 ... 0.622495 0.776919 0.541414
41 0.941001 1.043940 1.285215 ... 0.732293 0.960420 0.595174
42 1.242508 1.321327 1.544222 ... 0.947970 1.080069 0.851396
43 1.262534 1.399453 1.633948 ... 0.900340 0.989603 0.830024
44 1.261514 1.550063 1.811688 ... 0.740700 0.702097 0.752772
45 0.690860 1.037649 1.316086 ... 0.332653 0.619282 0.213978
46 0.844844 1.114289 1.382229 ... 0.483303 0.692881 0.388601
[47 rows x 767 columns]
Moreover, is this the best approach? Sorry, but I am not sure — that's why I am putting this up.
Say df_1 and df_2 are your dataframes, first extract your pairs as shown below:
# Pack the coordinate columns into lists of (X, Y) tuples. zip already
# yields tuples, so the tuple(...) wrapper in the original was a
# redundant extra conversion.
pairs_1 = list(zip(df_1.X, df_1.Y))
pairs_2 = list(zip(df_2.X, df_2.Y))
Then iterate over pairs as per your use case and get the index of minimum distance for the iterated points:
from scipy import spatial

# For each DF2 point, find the closest DF1 point, its distance, and
# its name.
min_distances = []
closest_pairs = []
names = []
for point in pairs_2:
    # Compute the distance row once and reuse it for both min and
    # argmin (the original called cdist twice per point). Also note:
    # only `spatial` is imported, so the original's
    # `scipy.spatial.distance.cdist` would raise NameError.
    dists = spatial.distance.cdist([point], pairs_1, metric='euclidean')
    index_min = dists.argmin()
    min_distances.append(dists.min())
    closest_pairs.append(df_1.loc[index_min, ['X', 'Y']])
    names.append(df_1.loc[index_min, 'Name'])
Insert results to df_2:
df_2['min_distance'] = min_distances
df_2['closest_pairs'] = [tuple(i.values) for i in closest_pairs]
df_2['name'] = names
df_2
Output:
Id X Y min_distance closest_pairs name
0 1 -0.288453 0.076105 0.608521 (0.32, 0.067) Expectant
1 4 -0.563453 -0.498895 1.049155 (0.32, 0.067) Expectant
2 5 -0.788453 -0.673895 1.333264 (0.32, 0.067) Expectant
3 6 -0.063453 -0.373895 0.584316 (0.32, 0.067) Expectant
4 7 0.311547 0.376105 0.250027 (0.33, 0.127) Passionate
I have added min_distance and closest_pairs as well, you can exclude these columns if you want to.
I am trying to find the x value at each maximum of a data set and the width of the peak each maximum belongs to. I have tried the code below; the first part correctly returns the peak x positions, but once I add the second section it fails with the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
The code is below:
import matplotlib.pyplot as plt
import csv
# find_peaks was listed twice in the original import; once is enough.
from scipy.signal import find_peaks, peak_widths
import numpy

# Read the two-column (x, y) data file.
# NOTE(review): the sample data in the question looks whitespace-
# separated — confirm the actual file really uses commas.
x = []
y = []
with open('data.csv','r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        x.append(float(row[0]))
        y.append(float(row[1]))

# Indices of peaks above the background threshold.
peaks = find_peaks(y, height=10000,) # set the height to remove background
# x positions of those peaks. Renamed from `list`, which shadowed the
# builtin of the same name.
peak_positions = numpy.array(x)[peaks[0]]
print("Optimum values")
print(peak_positions)
This next part fails:
# Find all peaks (no height filter this time) and their widths at half
# and full relative height.
peaks, _ = find_peaks(y)
results_half = peak_widths(y, peaks, rel_height=0.5)
print(results_half)
results_full = peak_widths(y, peaks, rel_height=1)
plt.plot(y)
# NOTE(review): y is a plain Python list here, so y[peaks] — indexing
# with an integer ndarray — raises the TypeError quoted above.
# Convert with numpy.array(y) first, as the answer below shows.
plt.plot(peaks, y[peaks], "y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
plt.show()
I have read the scipy documentation but I think the issue is more fundamental than that. How can I make the second part work? I'd like it to return the peak widths and show on a plot which peaks it has picked.
Thanks
Example data:
-7 16
-6.879 14
-6.759 20
-6.638 31
-6.518 33
-6.397 28
-6.276 17
-6.156 17
-6.035 30
-5.915 50
-5.794 64
-5.673 77
-5.553 96
-5.432 113
-5.312 112
-5.191 113
-5.07 123
-4.95 151
-4.829 173
-4.709 207
-4.588 328
-4.467 590
-4.347 1246
-4.226 3142
-4.106 7729
-3.985 18015
-3.864 40097
-3.744 85164
-3.623 167993
-3.503 302845
-3.382 499848
-3.261 761264
-3.141 1063770
-3.02 1380165
-2.899 1644532
-2.779 1845908
-2.658 1931555
-2.538 1918458
-2.417 1788508
-2.296 1586322
-2.176 1346871
-2.055 1086383
-1.935 831396
-1.814 590559
-1.693 398865
-1.573 261396
-1.452 174992
-1.332 139774
-1.211 154694
-1.09 235406
-0.97 388021
-0.849 616041
-0.729 911892
-0.608 1248544
-0.487 1579659
-0.367 1859034
-0.246 2042431
-0.126 2120969
-0.005 2081017
0.116 1925716
0.236 1684327
0.357 1372293
0.477 1064307
0.598 766824
0.719 535333
0.839 346882
0.96 217215
1.08 125673
1.201 68861
1.322 35618
1.442 16286
1.563 7361
1.683 2572
1.804 1477
1.925 1072
2.045 977
2.166 968
2.286 1030
2.407 1173
2.528 1398
2.648 1586
2.769 1770
2.889 1859
3.01 1980
3.131 2041
3.251 2084
3.372 2069
3.492 2012
3.613 1937
3.734 1853
3.854 1787
3.975 1737
4.095 1643
4.216 1548
4.337 1399
4.457 1271
4.578 1143
4.698 1022
4.819 896
4.94 762
5.06 663
5.181 587
5.302 507
5.422 428
5.543 339
5.663 277
5.784 228
5.905 196
6.025 158
6.146 122
6.266 93
6.387 76
6.508 67
6.628 63
6.749 58
6.869 43
6.99 27
7.111 13
7.231 7
7.352 3
7.472 3
7.593 2
7.714 2
7.834 2
7.955 3
8.075 2
8.196 1
8.317 1
8.437 2
8.558 3
8.678 2
8.799 1
8.92 2
9.04 4
9.161 7
9.281 4
9.402 3
9.523 2
9.643 3
9.764 4
9.884 6
10.005 7
10.126 4
10.246 2
10.367 0
10.487 0
10.608 0
10.729 0
10.849 0
10.97 0
11.09 1
11.211 2
11.332 3
11.452 2
11.573 1
11.693 0
11.814 0
11.935 0
12.055 0
12.176 0
12.296 0
12.417 0
12.538 0
12.658 0
12.779 0
12.899 0
13.02 0
13.141 0
13.261 0
13.382 0
13.503 0
13.623 0
13.744 0
13.864 0
13.985 0
14.106 0
14.226 0
14.347 0
14.467 0
14.588 0
14.709 0
14.829 0
14.95 0
15.07 0
15.191 0
15.312 0
15.432 0
15.553 0
15.673 0
15.794 0
15.915 0
16.035 0
16.156 0
16.276 0
16.397 1
16.518 2
16.638 3
16.759 2
16.879 2
17 4
I think your problem is that y is a list, not a numpy array.
The slicing operation y[peaks] will only work if both y and peaks are numpy arrays.
So you should convert y before doing the slicing, e.g. as follows
# Convert y once so integer-array indexing (y_arr[peaks]) works;
# plain Python lists do not support numpy fancy indexing.
# Uses `numpy.` to match the question's `import numpy` (there is no
# `np` alias in the asker's script).
y_arr = numpy.array(y)
plt.plot(y_arr)
plt.plot(peaks, y_arr[peaks], 'o', color="y")
plt.hlines(*results_half[1:], color="C2")
plt.hlines(*results_full[1:], color="C3")
# The original called plt.show() twice; once is enough.
plt.show()
This yields the following plot:
I have a sample pandas data frame:
import pandas as pd

# Sample frame: each row is an (ID, CloneID, VGene) record; CloneID
# groups the rows into four clones.
df = pd.DataFrame({
    'ID': [73, 68, 1, 94, 42, 22, 28, 70, 47, 46, 17, 19, 56, 33],
    'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311,
              311, 311, 311, 311],
})
it looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
I want to write a simple script to output each CloneID to a different output file, so in this case there would be 4 different files.
the first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
and last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
the code i found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
# NOTE(review): append mode 'a' with one fixed file name concatenates
# every group into the same results.csv — that is why the output lands
# in a single file. Build a per-group file name from group_name
# instead (see the answer below).
for group_name, data in data.groupby('CloneID'):
    with open('results.csv', 'a') as f:
        data.to_csv(f)
but it outputs everything to one file instead of multiple files.
You can do something like the following:
In [19]:
# Show, per CloneID group, the file name that would be written and the
# CSV text it would contain.
gp = df.groupby('CloneID')
for clone_id, group in gp:
    print('CloneID' + str(clone_id) + '.txt')
    print(group.to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
So here we iterate over the groups in for g in gp.groups: and we use this to create the result file path name and call to_csv on the group so the following should work for you:
# Write each CloneID group to its own CloneID<key>.txt file.
gp = df.groupby('CloneID')
for clone_id, group in gp:
    group.to_csv('CloneID' + str(clone_id) + '.txt')
Actually the following would be even simpler:
# One-liner alternative: apply writes each group to its own file;
# x.name inside the lambda is the group key (the CloneID value).
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))
I am working on translating the code for the lmeSplines tutorial to RPy.
I am now stuck at the following line:
fit1s <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))
I have worked with nlme.lme before, and the following works just fine:
from rpy2.robjects.packages import importr

# Fit lme(y ~ time, random = ~1|ID) through rpy2.
# NOTE(review): assumes `r` is rpy2.robjects.r and `some_data` is an R
# data.frame already in scope — confirm in the full script.
nlme = importr('nlme')
nlme.lme(r.formula('y ~ time'), data=some_data, random=r.formula('~1|ID'))
But this has an other random assignment. I am wondering how I can translate this bit and put it into my RPy code as well list(all=pdIdent(~Zt - 1)).
The structure of the (preprocessed) example data smSplineEx1 looks like this (with Zt.* up to 98):
time y y.true all Zt.1 Zt.2 Zt.3
1 1 5.797149 4.235263 1 1.168560e+00 2.071261e+00 2.944953e+00
2 2 5.469222 4.461302 1 1.487859e-01 1.072013e+00 1.948857e+00
3 3 4.567237 4.678477 1 -5.449190e-02 7.276623e-02 9.527613e-01
4 4 3.645763 4.887137 1 -5.364552e-02 -1.359115e-01 -4.333438e-02
5 5 5.094126 5.087615 1 -5.279913e-02 -1.337708e-01 -2.506194e-01
6 6 4.636121 5.280233 1 -5.195275e-02 -1.316300e-01 -2.466158e-01
7 7 5.501538 5.465298 1 -5.110637e-02 -1.294892e-01 -2.426123e-01
8 8 5.011509 5.643106 1 -5.025998e-02 -1.273485e-01 -2.386087e-01
9 9 6.114037 5.813942 1 -4.941360e-02 -1.252077e-01 -2.346052e-01
10 10 5.696472 5.978080 1 -4.856722e-02 -1.230670e-01 -2.306016e-01
11 11 6.615363 6.135781 1 -4.772083e-02 -1.209262e-01 -2.265980e-01
12 12 8.002526 6.287300 1 -4.687445e-02 -1.187854e-01 -2.225945e-01
13 13 6.887444 6.432877 1 -4.602807e-02 -1.166447e-01 -2.185909e-01
14 14 6.319205 6.572746 1 -4.518168e-02 -1.145039e-01 -2.145874e-01
15 15 6.482771 6.707130 1 -4.433530e-02 -1.123632e-01 -2.105838e-01
16 16 7.938015 6.836245 1 -4.348892e-02 -1.102224e-01 -2.065802e-01
17 17 7.585533 6.960298 1 -4.264253e-02 -1.080816e-01 -2.025767e-01
18 18 7.560287 7.079486 1 -4.179615e-02 -1.059409e-01 -1.985731e-01
19 19 7.571020 7.194001 1 -4.094977e-02 -1.038001e-01 -1.945696e-01
20 20 8.922418 7.304026 1 -4.010338e-02 -1.016594e-01 -1.905660e-01
21 21 8.241394 7.409737 1 -3.925700e-02 -9.951861e-02 -1.865625e-01
22 22 7.447076 7.511303 1 -3.841062e-02 -9.737785e-02 -1.825589e-01
23 23 7.317292 7.608886 1 -3.756423e-02 -9.523709e-02 -1.785553e-01
24 24 7.077333 7.702643 1 -3.671785e-02 -9.309633e-02 -1.745518e-01
25 25 8.268601 7.792723 1 -3.587147e-02 -9.095557e-02 -1.705482e-01
26 26 8.216013 7.879272 1 -3.502508e-02 -8.881481e-02 -1.665447e-01
27 27 8.968495 7.962427 1 -3.417870e-02 -8.667405e-02 -1.625411e-01
28 28 9.085605 8.042321 1 -3.333232e-02 -8.453329e-02 -1.585375e-01
29 29 9.002575 8.119083 1 -3.248593e-02 -8.239253e-02 -1.545340e-01
30 30 8.763187 8.192835 1 -3.163955e-02 -8.025177e-02 -1.505304e-01
31 31 8.936370 8.263695 1 -3.079317e-02 -7.811101e-02 -1.465269e-01
32 32 9.033403 8.331776 1 -2.994678e-02 -7.597025e-02 -1.425233e-01
33 33 8.248328 8.397188 1 -2.910040e-02 -7.382949e-02 -1.385198e-01
34 34 5.961721 8.460035 1 -2.825402e-02 -7.168873e-02 -1.345162e-01
35 35 8.400489 8.520418 1 -2.740763e-02 -6.954797e-02 -1.305126e-01
36 36 6.855125 8.578433 1 -2.656125e-02 -6.740721e-02 -1.265091e-01
37 37 9.798931 8.634174 1 -2.571487e-02 -6.526645e-02 -1.225055e-01
38 38 8.862758 8.687729 1 -2.486848e-02 -6.312569e-02 -1.185020e-01
39 39 7.282970 8.739184 1 -2.402210e-02 -6.098493e-02 -1.144984e-01
40 40 7.484208 8.788621 1 -2.317572e-02 -5.884417e-02 -1.104949e-01
41 41 8.404670 8.836120 1 -2.232933e-02 -5.670341e-02 -1.064913e-01
42 42 8.880734 8.881756 1 -2.148295e-02 -5.456265e-02 -1.024877e-01
43 43 8.826189 8.925603 1 -2.063657e-02 -5.242189e-02 -9.848418e-02
44 44 9.827906 8.967731 1 -1.979018e-02 -5.028113e-02 -9.448062e-02
45 45 8.528795 9.008207 1 -1.894380e-02 -4.814037e-02 -9.047706e-02
46 46 9.484073 9.047095 1 -1.809742e-02 -4.599961e-02 -8.647351e-02
47 47 8.911947 9.084459 1 -1.725103e-02 -4.385885e-02 -8.246995e-02
48 48 10.201343 9.120358 1 -1.640465e-02 -4.171809e-02 -7.846639e-02
49 49 8.908016 9.154849 1 -1.555827e-02 -3.957733e-02 -7.446283e-02
50 50 8.202368 9.187988 1 -1.471188e-02 -3.743657e-02 -7.045927e-02
51 51 7.432851 9.219828 1 -1.386550e-02 -3.529581e-02 -6.645572e-02
52 52 8.063268 9.250419 1 -1.301912e-02 -3.315505e-02 -6.245216e-02
53 53 10.155756 9.279810 1 -1.217273e-02 -3.101429e-02 -5.844860e-02
54 54 7.905281 9.308049 1 -1.132635e-02 -2.887353e-02 -5.444504e-02
55 55 9.688337 9.335181 1 -1.047997e-02 -2.673277e-02 -5.044148e-02
56 56 9.437176 9.361249 1 -9.633582e-03 -2.459201e-02 -4.643793e-02
57 57 9.165873 9.386295 1 -8.787198e-03 -2.245125e-02 -4.243437e-02
58 58 9.120195 9.410358 1 -7.940815e-03 -2.031049e-02 -3.843081e-02
59 59 9.955840 9.433479 1 -7.094432e-03 -1.816973e-02 -3.442725e-02
60 60 9.314230 9.455692 1 -6.248048e-03 -1.602897e-02 -3.042369e-02
61 61 9.706852 9.477035 1 -5.401665e-03 -1.388821e-02 -2.642014e-02
62 62 9.615765 9.497541 1 -4.555282e-03 -1.174746e-02 -2.241658e-02
63 63 7.918843 9.517242 1 -3.708898e-03 -9.606695e-03 -1.841302e-02
64 64 9.352892 9.536172 1 -2.862515e-03 -7.465935e-03 -1.440946e-02
65 65 9.722685 9.554359 1 -2.016132e-03 -5.325176e-03 -1.040590e-02
66 66 9.186888 9.571832 1 -1.169748e-03 -3.184416e-03 -6.402346e-03
67 67 8.652299 9.588621 1 -3.233650e-04 -1.043656e-03 -2.398788e-03
68 68 8.681421 9.604751 1 5.230184e-04 1.097104e-03 1.604770e-03
69 69 10.279181 9.620249 1 1.369402e-03 3.237864e-03 5.608328e-03
70 70 9.314963 9.635140 1 2.215785e-03 5.378623e-03 9.611886e-03
71 71 6.897151 9.649446 1 3.062168e-03 7.519383e-03 1.361544e-02
72 72 9.343135 9.663191 1 3.908552e-03 9.660143e-03 1.761900e-02
73 73 9.273135 9.676398 1 4.754935e-03 1.180090e-02 2.162256e-02
74 74 10.041796 9.689086 1 5.601318e-03 1.394166e-02 2.562612e-02
75 75 9.724713 9.701278 1 6.447702e-03 1.608242e-02 2.962968e-02
76 76 8.593517 9.712991 1 7.294085e-03 1.822318e-02 3.363323e-02
77 77 7.401988 9.724244 1 8.140468e-03 2.036394e-02 3.763679e-02
78 78 10.258688 9.735057 1 8.986852e-03 2.250470e-02 4.164035e-02
79 79 10.037192 9.745446 1 9.833235e-03 2.464546e-02 4.564391e-02
80 80 9.637510 9.755427 1 1.067962e-02 2.678622e-02 4.964747e-02
81 81 8.887625 9.765017 1 1.152600e-02 2.892698e-02 5.365102e-02
82 82 9.922013 9.774230 1 1.237239e-02 3.106774e-02 5.765458e-02
83 83 10.466709 9.783083 1 1.321877e-02 3.320850e-02 6.165814e-02
84 84 11.132830 9.791588 1 1.406515e-02 3.534926e-02 6.566170e-02
85 85 10.154038 9.799760 1 1.491154e-02 3.749002e-02 6.966526e-02
86 86 10.433068 9.807612 1 1.575792e-02 3.963078e-02 7.366881e-02
87 87 9.666781 9.815156 1 1.660430e-02 4.177154e-02 7.767237e-02
88 88 9.478004 9.822403 1 1.745069e-02 4.391230e-02 8.167593e-02
89 89 10.002749 9.829367 1 1.829707e-02 4.605306e-02 8.567949e-02
90 90 7.593259 9.836058 1 1.914345e-02 4.819382e-02 8.968305e-02
91 91 10.915754 9.842486 1 1.998984e-02 5.033458e-02 9.368660e-02
92 92 8.855580 9.848662 1 2.083622e-02 5.247534e-02 9.769016e-02
93 93 8.884683 9.854596 1 2.168260e-02 5.461610e-02 1.016937e-01
94 94 9.757451 9.860298 1 2.252899e-02 5.675686e-02 1.056973e-01
95 95 10.222361 9.865775 1 2.337537e-02 5.889762e-02 1.097008e-01
96 96 9.090410 9.871038 1 2.422175e-02 6.103838e-02 1.137044e-01
97 97 8.837872 9.876095 1 2.506814e-02 6.317914e-02 1.177080e-01
98 98 9.413135 9.880953 1 2.591452e-02 6.531990e-02 1.217115e-01
99 99 9.295531 9.885621 1 2.676090e-02 6.746066e-02 1.257151e-01
100 100 9.698118 9.890106 1 2.760729e-02 6.960142e-02 1.297186e-01
You can put list(all=pdIdent(~Zt - 1)) in the R's global environment using reval() method:
In [55]:
import rpy2.robjects as ro
# NOTE(review): pandas.rpy.common was removed from modern pandas;
# rpy2.robjects.pandas2ri is the current replacement — confirm the
# rpy2/pandas versions in use.
import pandas.rpy.common as com
# Shorthand Python handles to common R functions.
mydata = ro.r['data.frame']
read = ro.r['read.csv']
head = ro.r['head']
summary = ro.r['summary']
library = ro.r['library']
In [56]:
formula = '~ time'
library('lmeSplines')
# Build the example data, the spline basis, and the random-effects
# list entirely inside the R workspace; `rnd` then holds
# list(all=pdIdent(~Zt - 1)) on the R side.
ro.reval('data(smSplineEx1)')
ro.reval('smSplineEx1$all <- rep(1,nrow(smSplineEx1))')
ro.reval('smSplineEx1$Zt <- smspline(~ time, data=smSplineEx1)')
ro.reval('rnd <- list(all=pdIdent(~Zt - 1))')
#result = ro.r.smspline(formula=ro.r(formula), data=ro.r.smSplineEx1) #notice: data=ro.r.smSplineEx1
# R-side objects are passed by attribute access: ro.r.rnd is the list
# built by the reval above.
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
In [57]:
print com.convert_robj(result.rx('coefficients'))
{'coefficients': {'random': {'all': Zt1 Zt2 Zt3 Zt4 Zt5 Zt6 Zt7 \
1 0.000509 0.001057 0.001352 0.001184 0.000869 0.000283 -0.000424
Zt8 Zt9 Zt10 ... Zt89 Zt90 Zt91 \
1 -0.001367 -0.002325 -0.003405 ... -0.001506 -0.001347 -0.000864
Zt92 Zt93 Zt94 Zt95 Zt96 Zt97 Zt98
1 -0.000631 -0.000569 -0.000392 -0.000049 0.000127 0.000114 0.000071
[1 rows x 98 columns]}, 'fixed': (Intercept) 6.498800
time 0.038723
dtype: float64}}
Be careful, the result is a little bit out of shape. Basically it is nested dictionary which can not be converted into a pandas.DataFrame.
You can access y in smsSplineEx by ro.r.smSplineEx1.rx('y'), similar to smsSplineEx1$y as you would do so in R.
Now say if you have the result variable in python, generated by
result = ro.r.lme(ro.r('y~time'), data=ro.r.smSplineEx1, random=ro.r.rnd)
and you want to plot it using R, (instead of plotting it using, say, matplotlib), you need to assign it to a variable in R workspace:
ro.R().assign('result', result)
Now there is a variable named result in R workspace, you can access it using ro.r.result.
Plotting it using R:
In [17]:
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
Out[17]:
rpy2.rinterface.NULL
In [21]:
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
Out[21]:
rpy2.rinterface.NULL
Or you can do everything in R:
ro.reval('result <- lme(y ~ time, data=smSplineEx1,random=list(all=pdIdent(~Zt - 1)))')
ro.reval('plot(smSplineEx1$time,smSplineEx1$y,pch="o",type="n", \
main="Spline fits: lme(y ~ time, random=list(all=pdIdent(~Zt-1)))", \
xlab="time",ylab="y")')
ro.reval('lines(smSplineEx1$time, fitted(result),col=2)')
and access the R variables using:ro.r.smSplineEx1.rx2('time') or ro.r.result
Edit
Notice some R objects can not be converted to pandas.dataFrame as-is due to mixture of data structure:
In [62]:
ro.r["smSplineEx1"]
Out[62]:
<DataFrame - Python:0x108525518 / R:0x109e5da38>
[FloatVe..., FloatVe..., FloatVe..., FloatVe..., Matrix]
time: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x10807e518 / R:0x1022599e0>
[1.000000, 2.000000, 3.000000, ..., 98.000000, 99.000000, 100.000000]
y: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x108525a70 / R:0x102259d30>
[5.797149, 5.469222, 4.567237, ..., 9.413135, 9.295531, 9.698118]
y.true: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085257a0 / R:0x10225dfb0>
[4.235263, 4.461302, 4.678477, ..., 9.880953, 9.885621, 9.890106]
all: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1085258c0 / R:0x10225e300>
[1.000000, 1.000000, 1.000000, ..., 1.000000, 1.000000, 1.000000]
Zt: <class 'rpy2.robjects.vectors.Matrix'>
<Matrix - Python:0x108525908 / R:0x103e8ba00>
[1.168560, 0.148786, -0.054492, ..., -0.030141, -0.030610, 0.757597]
Notice that we have a few vectors but the last one is a Matrix. We have to convert smSplineEx to python in two parts.
In [63]:
ro.r["smSplineEx1"].names
Out[63]:
<StrVector - Python:0x108525dd0 / R:0x1042ca7c0>
['time', 'y', 'y.true', 'all', 'Zt']
In [64]:
print com.convert_robj(ro.r["smSplineEx1"].rx(ro.IntVector(range(1, 5)))).head()
time y y.true all
1 1 5.797149 4.235263 1
2 2 5.469222 4.461302 1
3 3 4.567237 4.678477 1
4 4 3.645763 4.887137 1
5 5 5.094126 5.087615 1
In [65]:
print com.convert_robj(ro.r["smSplineEx1"].rx2('Zt')).head(2)
0 1 2 3 4 5 6 \
1 1.168560 2.071261 2.944953 3.782848 4.584037 5.348937 6.078121
2 0.148786 1.072013 1.948857 2.789264 3.593423 4.361817 5.095016
7 8 9 ... 88 89 90 \
1 6.772184 7.431719 8.057321 ... 0.933947 0.769591 0.619420
2 5.793601 6.458153 7.089255 ... 0.904395 0.745337 0.599976
91 92 93 94 95 96 97
1 0.484029 0.36401 0.259959 0.172468 0.102133 0.049547 0.015305
2 0.468893 0.35267 0.251890 0.167135 0.098986 0.048026 0.014836
[2 rows x 98 columns]
com.convert_robj(ro.r["smSplineEx1"]) will not work due to the mixed data structure issue.