How to plot a large dataset as a matplotlib bar graph - Python

My SQL query returns 141 rows and two columns, and I want to plot them as a horizontal bar graph. When I do, the bars and labels overlap each other. How can I make the graph crisp and clear?
The Y axis should show the report name and the X axis should show its value. How can I achieve that?
import os
import cx_Oracle
import matplotlib.pyplot as plt
import numpy as np

reportid_count = []
count_ID = []

# conn is an existing cx_Oracle connection created earlier
c = conn.cursor()
query = "select distinct (LTRIM(REGEXP_SUBSTR(ID, '[0-9]{3,}'), '0')) as ReportID, count(ID) from dev_user.RECORD_TABLE group by ID"
c.execute(query)

# loop through the fetched rows and store the two columns in lists
for row in c:
    reportid_count.append(row[0])
    print(row[0])
    count_ID.append(row[1])
    print(row[1])

fig = plt.figure(figsize=(13.5, 5))

# plot the horizontal bar chart
plt.barh(reportid_count, count_ID)  # ,color=['red', 'blue', 'purple']
for i, v in enumerate(count_ID):
    plt.text(v, i, str(v), color='blue', fontweight='bold')
plt.title('Report_Details')
plt.xlabel('Report Count')
plt.ylabel("Report ID's")

path = r"\\dev_server.com\View\Foldert\uidDocuments\Store_Img"
os.chdir(path)
plt.savefig(os.path.join(path, 'squares.png'))
plt.show()
conn.close()
Sample dataset:
Report 1 1200
Report 2 0
Report 3 0
Report 4 0
Report 5 0
Report 6 0
Report 7 0
Report 8 0
Report 9 0
Report 10 0
Report 11 0
Report 12 0
Report 13 0
Report 14 0
Report 15 0
Report 16 0
Report 17 0
Report 18 0
Report 19 0
Report 20 0
Report 21 0
Report 22 0
Report 23 1
Report 24 2
Report 25 3
Report 26 4
Report 27 5
Report 28 6
Report 29 100
Report 30 101
Report 31 102
Report 32 103
Report 33 104
Report 34 105
Report 35 106
Report 36 107
Report 37 108
Report 38 109
Report 39 110
Report 40 111
Report 41 112
Report 42 113
Report 43 114
Report 44 115
Report 45 116
Report 46 117
Report 47 118
Report 48 119
Report 49 120
Report 50 121
Report 51 122
Report 52 123
Report 53 124
Report 54 125
Report 55 126
Report 56 127
Report 57 128
Report 58 129
Report 59 130
Report 60 131
Report 61 132
Report 62 133
Report 63 134
Report 64 135
Report 65 136
Report 66 137
Report 67 138
Report 68 139
Report 69 140
Report 70 141
Report 71 142
Report 72 143
Report 73 144
Report 74 145
Report 75 146
Report 76 147
Report 77 148
Report 78 149
Report 79 150
Report 80 151
Report 81 152
Report 82 153
Report 83 154
Report 84 155
Report 85 156
Report 86 157
Report 87 158
Report 88 159
Report 89 160
Report 90 161
Report 91 162
Report 92 163
Report 93 164
Report 94 165
Report 95 166
Report 96 167
Report 97 168
Report 98 169
Report 99 170
Report 100 171
Report 101 172
Report 102 173
Report 103 174
Report 104 175
Report 105 176
Report 106 177
Report 107 178
Report 108 179
Report 109 180
Report 110 181
Report 111 182
Report 112 183
Report 113 184
Report 114 185
Report 115 186
Report 116 187
Report 117 188
Report 118 189
Report 119 190
Report 120 191
Report 121 192
Report 122 193
Report 123 194
Report 124 195
Report 125 196
Report 126 197
Report 127 198
Report 128 199
Report 129 200
Report 130 201
Report 131 202
Report 132 203
Report 133 204
Report 134 205
Report 135 206
Report 136 207
Report 137 208
Report 138 209
Report 139 210
Report 140 211
Report 141 212
Report 142 0
Report 143 0
Report 144 0
Report 145 0
Report 146 0
Report 147 0
Report 148 0
Report 149 0
Report 150 0
Report 151 0
Report 152 700
Report 153 701
Report 154 702
Report 155 703
Report 156 704
Report 157 705
Report 158 706
Report 159 707
Report 160 708
Report 161 709
Report 162 710
Report 163 711
Report 164 712
Report 165 713
Report 166 714
Report 167 715
Report 168 716
Report 169 717
Report 170 718
Report 171 719
Report 172 720
Report 173 721
Report 174 722
Report 175 723
Report 176 724
Report 177 725
Report 178 726
Report 179 727
Report 180 728
Report 181 729
Report 182 730
Report 183 731
Report 184 732
Report 185 733
Report 186 734
Report 187 735
Report 188 736
Report 189 737
Report 190 738
Report 191 739
Report 192 740
Report 193 741
Report 194 742
Report 195 743
Report 196 744
Report 197 745
Report 198 746
Report 199 747
Report 200 748
Report 201 749
Report 202 750
Report 203 751
Report 204 752
Report 205 753
Report 206 754
Report 207 755
Report 208 756
Report 209 757
Report 210 758
Report 211 759
Report 212 760
Report 213 761
Report 214 762
Report 215 763
Report 216 764
Report 217 765
Report 218 766
Report 219 767
Report 220 768
Report 221 769
Report 222 770
Report 223 771
Report 224 772
Report 225 773
Report 226 774
Report 227 775
Report 228 776
Report 229 777
Report 230 778
Report 231 779
Report 232 780
Report 233 781
Report 234 782
Report 235 0
Report 236 0
Report 237 1300
Report 238 1400
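One way to keep 141 horizontal bars from overlapping is to scale the figure height with the number of rows, shrink the tick and value-label fonts, and call tight_layout before saving. A minimal sketch of that idea; the report IDs and counts below are placeholders standing in for the query results, not the real data:
import matplotlib.pyplot as plt

# placeholder data standing in for the 141 query rows
reportid_count = ['Report %d' % i for i in range(1, 142)]
count_ID = list(range(141))

# give each bar roughly a quarter inch of height so the labels don't collide
fig, ax = plt.subplots(figsize=(10, 0.25 * len(reportid_count)))
ax.barh(reportid_count, count_ID)
ax.tick_params(axis='y', labelsize=6)   # smaller y tick labels
ax.invert_yaxis()                       # first report at the top
for i, v in enumerate(count_ID):
    ax.text(v, i, str(v), va='center', fontsize=6, color='blue')
ax.set_title('Report_Details')
ax.set_xlabel('Report Count')
ax.set_ylabel("Report ID's")
plt.tight_layout()
plt.savefig('squares.png', dpi=200)
With this many categories, a taller figure saved at a higher dpi, or splitting the report IDs across two or three panels, tends to read better than squeezing everything into one 5-inch-tall chart.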

Related

How can we run Scipy Optimize after doing a group by?

I have a dataframe that looks like this.
SectorID MyDate PrevMainCost OutageCost
0 123 10/31/2022 332 193
1 123 9/30/2022 308 269
2 123 8/31/2022 33 440
3 123 7/31/2022 230 147
4 123 6/30/2022 264 184
5 123 5/31/2022 290 46
6 123 4/30/2022 51 150
7 123 3/31/2022 69 253
8 123 2/28/2022 257 308
9 123 1/31/2022 441 349
10 456 10/31/2022 280 188
11 456 9/30/2022 432 150
12 456 8/31/2022 357 307
13 456 7/31/2022 425 45
14 456 6/30/2022 101 278
15 456 5/31/2022 62 240
16 456 4/30/2022 407 46
17 456 3/31/2022 35 218
18 456 2/28/2022 403 113
19 456 1/31/2022 295 200
20 456 12/31/2021 20 235
21 456 11/30/2021 440 403
22 789 10/31/2022 145 181
23 789 9/30/2022 320 259
24 789 8/31/2022 485 472
25 789 7/31/2022 59 24
26 789 6/30/2022 345 64
27 789 5/31/2022 34 480
28 789 4/30/2022 260 162
29 789 3/31/2022 46 399
30 999 10/31/2022 491 346
31 999 9/30/2022 77 212
32 999 8/31/2022 316 112
33 999 7/31/2022 106 351
34 999 6/30/2022 481 356
35 999 5/31/2022 20 269
36 999 4/30/2022 246 268
37 999 3/31/2022 377 173
38 999 2/28/2022 426 413
39 999 1/31/2022 341 168
40 999 12/31/2021 144 471
41 999 11/30/2021 358 393
42 999 10/31/2021 340 197
43 999 9/30/2021 119 252
44 999 8/31/2021 470 203
45 999 7/31/2021 359 163
46 999 6/30/2021 410 383
47 999 5/31/2021 200 119
48 999 4/30/2021 230 291
I am trying to find the minimum of PrevMainCost and OutageCost, after grouping by SectorID. Here's my primitive code.
import numpy as np
import pandas
df = pandas.read_clipboard(sep='\\s+')
df
df_sum = df.groupby('SectorID').sum()
df_sum
df_sum.loc[df_sum['PrevMainCost'] <= df_sum['OutageCost'], 'Result'] = 'Main'
df_sum.loc[df_sum['PrevMainCost'] > df_sum['OutageCost'], 'Result'] = 'Out'
Result (the Result column flags whether PrevMainCost or OutageCost is lower):
PrevMainCost OutageCost Result
SectorID
123 2275 2339 Main
456 3257 2423 Out
789 1694 2041 Main
999 5511 5140 Out
I am trying to figure out how to use Scipy Optimization to solve this problem. I Googled this problem and came up with this simple code sample.
from scipy.optimize import *
df_sum.groupby(by=['SectorID']).apply(lambda g: minimize(equation, g.Result, options={'xtol':0.001}).x)
When I run that, I get an error saying 'NameError: name 'equation' is not defined'.
How can I find the minimum of either the preventative maintenance cost or the outage cost, after grouping by SectorID? Also, how can I add some kind of constraint, such as no more than 30% of all resources being used by any one particular SectorID?
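The NameError happens because minimize expects a callable objective as its first argument and no function named equation was ever defined. A minimal sketch of one possible objective, assuming the goal is to choose, per sector, what fraction x of the work goes to preventative maintenance; the function name, arguments, cost model, and the MainFraction column are illustrative assumptions, not part of the original code:
from scipy.optimize import minimize

# hypothetical objective: x[0] is the fraction assigned to preventative
# maintenance, the remainder goes to outage; minimize the blended cost
def equation(x, prev_cost, outage_cost):
    return x[0] * prev_cost + (1 - x[0]) * outage_cost

def optimize_sector(row):
    res = minimize(
        equation,
        x0=[0.5],                      # start from an even split
        args=(row['PrevMainCost'], row['OutageCost']),
        bounds=[(0.0, 1.0)],           # keep the fraction in [0, 1]
    )
    return res.x[0]

# df_sum is the per-sector sum from the groupby above
# df_sum['MainFraction'] = df_sum.apply(optimize_sector, axis=1)
A cap such as "no more than 30% of all resources for any one SectorID" couples the sectors together, so it cannot be expressed inside a per-group apply; it would need a single optimization over all sectors at once, for example scipy.optimize.linprog with inequality constraints if the cost model is linear.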

Printing 10 numbers per line always prints the first value alone

I have this loop which prints 10 numbers per line and then moves to the next line:
for i in range(100, 201):
    if i % 10 == 0:
        print(i)
    else:
        print(i, end=" ")
and this is the result:
100
101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120
121 122 123 124 125 126 127 128 129 130
131 132 133 134 135 136 137 138 139 140
141 142 143 144 145 146 147 148 149 150
151 152 153 154 155 156 157 158 159 160
161 162 163 164 165 166 167 168 169 170
171 172 173 174 175 176 177 178 179 180
181 182 183 184 185 186 187 188 189 190
191 192 193 194 195 196 197 198 199 200
It prints the first number of each line alone, but I want the opposite, the last number alone, something like this:
100 101 102 103 104 105 106 107 108 109
110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129
130 131 132 133 134 135 136 137 138 139
140 141 142 143 144 145 146 147 148 149
150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169
170 171 172 173 174 175 176 177 178 179
180 181 182 183 184 185 186 187 188 189
190 191 192 193 194 195 196 197 198 199
200
You have the correct code. The only thing is that you are breaking at mod 0. You should break at mod 9.
for i in range(100, 201):
    if i % 10 == 9:
        print(i)
    else:
        print(i, end=" ")
Try this:
for i in range(100, 201):
    if (i + 1) % 10 == 0:
        print(i)
    else:
        print(i, end=" ")
Well, 100 % 10 equals zero.
That means it prints 100 on its own line, and 101 starts the next line.
If you want the line to start from 100, all you have to do is make the statement i % 10 == 0 false there, which you can do by adding one to i before taking the modulus.
So just try this code instead:
for i in range(100, 201):
    if (i + 1) % 10 == 0:
        print(i)
    else:
        print(i, end=" ")
Or you could try changing i % 10 == 0 to i % 10 == 9.
Looking at your code, you are using two print statements: one inside the if branch and one inside the else branch.
Python adds a newline after each print call by default, so whenever i % 10 == 0 is true the number is printed and a new line begins.
Because the else branch passes end=" ", the other numbers stay on the same line.
What you can do is use if (i + 1) % 10 == 0: instead of if i % 10 == 0:.
Then the if branch fires when the ones digit of i is 9 (109, 119, etc.), so that number is printed at the end of its line and the newline comes after it.

Values in pandas dataframe not getting sorted

I have a dataframe as shown below:
Category 1 2 3 4 5 6 7 8 9 10 11 12 13
A 424 377 161 133 2 81 141 169 297 153 53 50 197
B 231 121 111 106 4 79 68 70 92 93 71 65 66
C 480 379 159 139 2 116 148 175 308 150 98 82 195
D 88 56 38 40 0 25 24 55 84 36 24 26 36
E 1084 1002 478 299 7 256 342 342 695 378 175 132 465
F 497 246 283 206 4 142 151 168 297 224 194 198 148
H 8 5 4 3 0 2 3 2 7 5 3 2 0
G 3191 2119 1656 856 50 826 955 739 1447 1342 975 628 1277
K 58 26 27 51 1 18 22 42 47 35 19 20 14
S 363 254 131 105 6 82 86 121 196 98 81 57 125
T 54 59 20 4 0 9 12 7 36 23 5 4 20
O 554 304 207 155 3 130 260 183 287 204 98 106 195
P 756 497 325 230 5 212 300 280 448 270 201 140 313
PP 64 43 26 17 1 15 35 17 32 28 18 9 27
R 265 157 109 89 1 68 68 104 154 96 63 55 90
S 377 204 201 114 5 112 267 136 209 172 147 90 157
St 770 443 405 234 5 172 464 232 367 270 290 136 294
Qs 47 33 11 14 0 18 14 19 26 17 5 6 13
Y 1806 626 1102 1177 14 625 619 1079 1273 981 845 891 455
W 123 177 27 28 0 18 62 34 64 27 14 4 51
Z 2770 1375 1579 1082 17 900 1630 1137 1465 1383 861 755 1201
I want to sort the dataframe by values in each row. Once done, I want to sort the index also.
For example the values in first row corresponding to category A, should appear as:
2 50 53 81 133 141 153 161 169 197 297 377 424
I have tried df.sort_values(by=df.index.tolist(), ascending=False, axis=1) but this doesn't work; the values don't appear in sorted order at all.
np.sort + sort_index
You can use np.sort along axis=1, then sort_index:
cols, idx = df.columns[1:], df.iloc[:, 0]
res = pd.DataFrame(np.sort(df.iloc[:, 1:].values, axis=1), columns=cols, index=idx)\
.sort_index()
print(res)
1 2 3 4 5 6 7 8 9 10 11 12 \
Category
A 2 50 53 81 133 141 153 161 169 197 297 377
B 4 65 66 68 70 71 79 92 93 106 111 121
C 2 82 98 116 139 148 150 159 175 195 308 379
D 0 24 24 25 26 36 36 38 40 55 56 84
E 7 132 175 256 299 342 342 378 465 478 695 1002
F 4 142 148 151 168 194 198 206 224 246 283 297
G 50 628 739 826 856 955 975 1277 1342 1447 1656 2119
H 0 0 2 2 2 3 3 3 4 5 5 7
K 1 14 18 19 20 22 26 27 35 42 47 51
O 3 98 106 130 155 183 195 204 207 260 287 304
P 5 140 201 212 230 270 280 300 313 325 448 497
PP 1 9 15 17 17 18 26 27 28 32 35 43
Qs 0 5 6 11 13 14 14 17 18 19 26 33
R 1 55 63 68 68 89 90 96 104 109 154 157
S 6 57 81 82 86 98 105 121 125 131 196 254
S 5 90 112 114 136 147 157 172 201 204 209 267
St 5 136 172 232 234 270 290 294 367 405 443 464
T 0 4 4 5 7 9 12 20 20 23 36 54
W 0 4 14 18 27 27 28 34 51 62 64 123
Y 14 455 619 625 626 845 891 981 1079 1102 1177 1273
Z 1 17 755 861 900 1082 1137 1375 1383 1465 1579 1630
One way is to apply sorted with axis=1, apply pd.Series to get a dataframe back instead of a column of lists, and finally set and sort the index by Category:
(df.loc[:, '1':].apply(sorted, axis=1).apply(pd.Series)
   .set_index(df.Category).sort_index())
Category 0 1 2 3 4 5 6 7 8 9 10 ...
0 A 2 50 53 81 133 141 153 161 169 197 297 ...
1 B 4 65 66 68 70 71 79 92 93 106 111 ...

How can I group by index and take the mean of each column

I have following df
Blades & Razors & Foam Diaper Empty Fem Care HairCare Irrelevant Laundry Oral Care Others Personal Cleaning Care Skin Care
retailer
RTM 158 486 193 2755 3490 1458 889 2921 69 1543 645
RTM 39 0 28 2305 80 27 0 0 0 1207 414
RTM 98 276 121 1090 2359 717 561 911 293 1286 528
RTM 107 484 54 2136 2777 151 80 2191 7 1096 673
RTM 156 465 254 2972 2802 763 867 1065 8 2777 728
RTM 126 326 142 2126 2035 581 575 753 45 1768 292
RTM 0 0 181 1816 1455 598 579 0 2 749 451
RTM 86 374 308 2197 2075 576 698 693 26 1398 212
RTM 132 61 153 2094 1508 180 590 785 66 1519 486
RTM 90 303 8 0 0 18 0 60 0 358 0
RTM 0 14 6 190 198 21 131 75 18 171 0
I want to do a groupby() on my index and then get the average of every column within each group. Any idea how to do that?
To group on index, use:
df.groupby(level=0).mean()
or
df.groupby(df.index).mean()
Sample:
df = pd.DataFrame(data=np.random.random((10, 5)), columns=list('CDEFG'), index=list('AB')*5)
df.head()
C D E F G
A 0.230504 0.830818 0.560533 0.266903 0.745196
B 0.996806 0.861006 0.257780 0.258976 0.738617
A 0.409191 0.688814 0.214247 0.309678 0.565571
B 0.805192 0.940919 0.707562 0.772370 0.122562
A 0.596964 0.935662 0.493612 0.108362 0.673538
Either of the above yields:
C D E F G
A 0.328301 0.560188 0.632549 0.491101 0.343343
B 0.405996 0.490331 0.540921 0.394136 0.466504

Add a new column to csv files and manipulate the records

I have 4 csv files named PV.csv, Dwel.csv, Sess.csv, and Elap.csv, each with 15 columns and around 2000 rows. First I would like to add a new column named Var to each file and fill the cells of the new column with that file's name; so the new 'Var' column in PV.csv will be filled with PV, and likewise for the other 3 files.
After that I would like to manipulate all the files as follows.
Finally I would like to merge / join these 4 files on A_ID and B_ID and write the records into a new csv file named finalFile.csv.
Any suggestion and help is appreciated.
PV.csv is as follows:
A_ID B_ID LO UP LO UP
103 321 0 402
103 503 192 225 433 608
106 264 104 258 334 408
107 197 6 32 113 258
Dwell.csv is as follows:
A_ID B_ID LO UP LO UP
103 321 40 250 517 780
103 503 80 125 435 585
106 264 192 525 682
107 197 324 492 542 614
Session.csv is as follows:
A_ID B_ID LO UP LO UP
103 321 75 350 370 850
106 264 92 225 482 608
107 197 24 92 142
Elapsed.csv is as follows:
A_ID B_ID LO UP LO UP
103 321 5 35 75
103 503 100 225 333 408
106 264 102 325 582
107 197 24 92 142 214
The first output file, for PV.csv, will be as follows.
In the same way, the rest of the three files will be filled with the new column holding their file name (Dwell, Session, and Elapsed):
A_ID B_ID Var LO UP LO UP
103 321 PV 0 402
103 503 PV 192 225 433 608
106 264 PV 104 258 334 408
107 197 PV 6 32 113 258
Final output file will be as follows:
finalFile.csv.
A_ID B_ID Var LO UP
103 321 PV 0 402
103 321 Dwel 40 250
103 321 Dwel 251 517
103 321 Dwel 518 780
103 321 Sess 75 350
103 321 Sess 351 370
103 321 Sess 371 850
103 321 Elap 5 35
103 321 Elap 36 75
103 503 PV 192 225
103 503 PV 226 433
103 503 PV 434 608
103 503 Dwel 80 125
103 503 Dwel 126 435
103 503 Dwel 436 585
103 503 Elap 100 225
103 503 Elap 226 333
103 503 Elap 334 408
106 264 PV 104 258
106 264 PV 259 334
106 264 PV 335 408
106 264 Dwel 192 525
106 264 Dwel 526 682
106 264 Sess 92 225
106 264 Sess 226 482
106 264 Sess 483 608
106 264 Elap 102 325
106 264 Elap 326 582
107 197 PV 6 32
107 192 PV 33 113
107 192 PV 114 258
107 192 Dwel 324 492
107 192 Dwel 493 542
107 192 Dwel 543 614
107 192 Sess 24 92
107 192 Sess 93 142
107 192 Elap 24 92
107 192 Elap 93 142
107 192 Elap 143 214
You should use Python's built-in csv module.
To create the final csv file you can do something like this: read through each file, add the new column value to every row, and write it to the new file.
import csv

with open('finalcsv.csv', 'w') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerow(['a', 'b', 'c', 'etc', 'Var'])  # write final headers
    for filename in ['PV.csv', 'Dwel.csv', 'Sess.csv', 'Elap.csv']:
        with open(filename) as incsv:
            val = filename.split('.csv')[0]
            reader = csv.reader(incsv)  # create reader object
            next(reader)  # skip the headers
            for row in reader:
                writer.writerow(row + [val])
The following script should get you started:
from collections import defaultdict
from itertools import groupby
import csv

entries = defaultdict(list)
csv_files = [(0, 'PV.csv', 'PV'), (1, 'Dwell.csv', 'Dwel'), (2, 'Session.csv', 'Sess'), (3, 'Elapsed.csv', 'Elap')]

for index, filename, shortname in csv_files:
    f_input = open(filename, 'rb')
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for row in csv_input:
        row[:] = [col for col in row if col]
        entries[(row[0], row[1])].append((index, shortname, row[2:]))
    f_input.close()

f_output = open('finalFile.csv', 'wb')
csv_output = csv.writer(f_output)
csv_output.writerow(header[:2] + ['Var'] + header[2:4])

for key in sorted(entries.keys()):
    for k, g in groupby(sorted(entries[key]), key=lambda x: x[1]):
        var_group = list(g)
        if len(var_group[0][2]):
            up = var_group[0][2][0]
            for entry in var_group:
                for pair in zip(*[iter(entry[2])]*2):
                    csv_output.writerow([key[0], key[1], entry[1], up, pair[1]])
                    up = int(pair[1]) + 1
f_output.close()
Using the data you have provided, this gives the following output:
A_ID,B_ID,Var,LO,UP
103,321,PV,0,402
103,321,Dwel,40,250
103,321,Dwel,251,780
103,321,Sess,75,350
103,321,Sess,351,850
103,321,Elap,5,35
103,503,PV,192,225
103,503,PV,226,608
103,503,Dwel,80,125
103,503,Dwel,126,585
103,503,Elap,100,225
103,503,Elap,226,408
106,264,PV,104,258
106,264,PV,259,408
106,264,Dwel,192,525
106,264,Sess,92,225
106,264,Sess,226,608
106,264,Elap,102,325
107,197,PV,6,32
107,197,PV,33,258
107,197,Dwel,324,492
107,197,Dwel,493,614
107,197,Sess,24,92
107,197,Elap,24,92
107,197,Elap,93,214
To work with all csv files in a folder, you could add the following to the top of the script:
import os
import glob
csv_files = [(index, file, os.path.splitext(file)[0]) for index, file in enumerate(glob.glob('*.csv'))]
You should also change the location of the output file, otherwise it will be read in the next time the script is run.
Tested using Python 2.6.6 (which I believe is what the OP is using)
There's a standard library module for these manipulations
https://docs.python.org/2/library/csv.html#module-csv
Not a full answer by any means, but your full implementation will almost certainly start there. The python docs above include several working examples which will get you started.
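As a concrete starting point, here is a minimal sketch of just the first step from the question, adding the Var column to each file with the csv module. It uses Python 3 syntax (unlike the Python 2.6 answer above), and the _with_var.csv output names are an arbitrary choice for illustration:
import csv
import os

for filename in ['PV.csv', 'Dwel.csv', 'Sess.csv', 'Elap.csv']:
    var = os.path.splitext(filename)[0]          # e.g. 'PV'
    with open(filename, newline='') as f_in:
        rows = list(csv.reader(f_in))
    header, data = rows[0], rows[1:]
    with open(var + '_with_var.csv', 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header[:2] + ['Var'] + header[2:])   # insert Var after B_ID
        for row in data:
            writer.writerow(row[:2] + [var] + row[2:])
The merging and range-splitting steps would then build on the same reader/writer pattern, as the answers above show.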
