ValueError shape mismatch: objects cannot be broadcast to a single shape - python

This is the code I plan to use for creating a bar chart:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def bar1():
    df = pd.read_csv('C:\\Users\\Bhuwan Bhatt\\Desktop\\IP PROJECT\\Book1.csv', encoding='unicode_escape')
    x = np.arange(11)
    Countries = df['Country']
    STotalMed = df['SummerTotal']
    WTotalMed = df['WinterTotal']
    plt.bar(x-0.25, STotalMed, width=.2, label='Total Medals by Countries in Summer', color='g')
    plt.bar(x+0.25, WTotalMed, width=.2, label='Total Medals by Countries in Winter', color='r')
    plt.xticks(np.arange(11), Countries, rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries', color='red', fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()
bar1()
I get this error for some reason:
Traceback (most recent call last):
File "C:/Users/Bhuwan Bhatt/Desktop/dsd.py", line 19, in <module>
bar1()
File "C:/Users/Bhuwan Bhatt/Desktop/dsd.py", line 10, in bar1
plt.bar(x-0.25,STotalMed,width=.2, label='Total Medals by Countries in Summer',color='g')
File "C:\Users\Bhuwan Bhatt\AppData\Local\Programs\Python\Python38-32\lib\site-packages\matplotlib\pyplot.py", line 2471, in bar
return gca().bar(
File "C:\Users\Bhuwan Bhatt\AppData\Local\Programs\Python\Python38-32\lib\site-packages\matplotlib\__init__.py", line 1438, in inner
return func(ax, *map(sanitize_sequence, args), **kwargs)
File "C:\Users\Bhuwan Bhatt\AppData\Local\Programs\Python\Python38-32\lib\site-packages\matplotlib\axes\_axes.py", line 2430, in bar
x, height, width, y, linewidth = np.broadcast_arrays(
File "<__array_function__ internals>", line 5, in broadcast_arrays
File "C:\Users\Bhuwan Bhatt\AppData\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\stride_tricks.py", line 264, in broadcast_arrays
shape = _broadcast_shape(*args)
File "C:\Users\Bhuwan Bhatt\AppData\Local\Programs\Python\Python38-32\lib\site-packages\numpy\lib\stride_tricks.py", line 191, in _broadcast_shape
b = np.broadcast(*args[:32])
ValueError: shape mismatch: objects cannot be broadcast to a single shape
This is the CSV file I've been using:
Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
 Afghanistan  14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
 Algeria  13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
 Argentina  24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
 Armenia  6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
  Australia  26 147 163 187 497 19 5 5 5 15 45 152 168 192 512
  Austria  27 18 33 36 87 23 64 81 87 232 50 82 114 123 319
  Azerbaijan  6 7 11 24 42 6 0 0 0 0 12 7 11 24 42
  Bahamas  16 6 2 6 14 0 0 0 0 0 16 6 2 6 14
  Bahrain  9 2 1 0 3 0 0 0 0 0 9 2 1 0 3
  Barbados 12 0 0 1 1 0 0 0 0 0 12 0 0 1 1
  Belarus  6 12 27 39 78 7 8 5 5 18 13 20 32 44 96
  Belgium  26 40 53 55 148 21 1 2 3 6 47 41 55 58 154
  Bermuda  18 0 0 1 1 8 0 0 0 0 26 0 0 1 1
  Bohemia  3 0 1 3 4 0 0 0 0 0 3 0 1 3 4
  Botswana  10 0 1 0 1 0 0 0 0 0 10 0 1 0 1
  Brazil  22 30 36 63 129 8 0 0 0 0 30 30 36 63 129
  British WestIndies  1 0 0 2 2 0 0 0 0 0 1 0 0 2 2
  Bulgaria  20 51 87 80 218 20 1 2 3 6 40 52 89 83 224
  Burundi  6 1 1 0 2 0 0 0 0 0 6 1 1 0 2
  Cameroon 14 3 1 2 6 1 0 0 0 0 15 3 1 2 6
  Canada 26 64 102 136 302 23 73 64 62 199 49 137 166 198 501
  Chile  23 2 7 4 13 17 0 0 0 0 40 2 7 4 13
  China  10 224 167 155 546 11 13 28 21 62 21 237 195 176 608
  Colombia  19 5 9 14 28 2 0 0 0 0 21 5 9 14 28
  Costa Rica  15 1 1 2 4 6 0 0 0 0 21 1 1 2 4
  Ivory Coast  13 1 1 1 3 0 0 0 0 0 13 1 1 1 3
  Croatia  7 11 10 12 33 8 4 6 1 11 15 15 16 13 44
  Cuba  20 78 68 80 226 0 0 0 0 0 20 78 68 80 226
INFO-----> SummerTimesPart : No. of times participated in summer by each country
WinterTimesPart : No. of times participated in winter by each country

Your CSV has more than 11 rows, so x (length 11) and the medal columns (one value per country) have different lengths and cannot be broadcast together by plt.bar. Just change
x = np.arange(11)
to
x = np.arange(len(df))
and
plt.xticks(np.arange(11), Countries, rotation=30)
to
plt.xticks(x, Countries, rotation=30)
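For reference, a corrected version of the whole function (a sketch assuming the same CSV path and column names as in the question) would be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def bar1():
    # Path, encoding and column names are copied from the question; adjust as needed.
    df = pd.read_csv('C:\\Users\\Bhuwan Bhatt\\Desktop\\IP PROJECT\\Book1.csv', encoding='unicode_escape')
    x = np.arange(len(df))  # one bar position per row of the CSV
    plt.bar(x - 0.25, df['SummerTotal'], width=.2, label='Total Medals by Countries in Summer', color='g')
    plt.bar(x + 0.25, df['WinterTotal'], width=.2, label='Total Medals by Countries in Winter', color='r')
    plt.xticks(x, df['Country'], rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries', color='red', fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()

bar1()

If you only want the top 10 countries, slice the frame first (for example df = df.head(10)) and the bar positions and tick labels will still line up.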

Related

Counting values in data frame rows against another df to see how many values are higher

I have two data frames:
df2022fl - one with 24 rows
df - one with a single row of values
There are 1759 columns in each df.
For every row of the 24-row dataframe, I want to count how many columns are above the corresponding column in the one-row df.
I used the code below, but keep getting the error shown underneath it.
( df2022fl > df.T[df2022fl.columns].values ).sum(axis=1)
KeyError: "None of [Index(['id', 'table_position', 'performance_rank', 'risk', 'competition_id',\n 'suspended_matches', 'homeAttackAdvantage', 'homeDefenceAdvantage',\n 'homeOverallAdvantage', 'seasonGoals_overall',\n ...\n 'freekicks_total_over275_away', 'freekicks_total_over285_overall',\n 'freekicks_total_over285_home', 'freekicks_total_over285_away',\n 'freekicks_total_over295_overall', 'freekicks_total_over295_home',\n 'freekicks_total_over295_away', 'freekicks_total_over305_overall',\n 'freekicks_total_over305_home', 'freekicks_total_over305_away'],\n dtype='object', length=1759)] are in the [columns]"
I have no idea why this is happening, as I also removed all object-dtype columns so that only float64 dtypes remain.
Any ideas to help, please?
df in text format (this is the dataframe with one row):
234 5 5 42 32 0 4 -33 -2 54 30 84 55 29 54 31 19 30 20 10 35 31 34 56 49 58 74 71 71 3 4 -4 16 8 7 13 5 6 7 3 4 38 19 19 4 3 3 1 13 5 5 28 26 21 22 10 9 48 50 39 10 23 9 13 2 3 19 50 42 42 9 10 18 10 6 47 42 32 6 2 2 13 9 9 1 1 1 2 2 1 1 1 1 0 0 0 35 35 30 27 26 25 18 13 21 10 6 2 21 26 8 8 8 17 35 33 39 2 3 8 9 16 17 51 26 17 1 1 0 0 0 0 0 0 0 0 0 0 20 12 7 16 7 5 37 19 14 -8 -2 -9 0 3 5 14 27 34 0 8 13 37 60 81 96 85 67 44 26 4 37 35 32 21 11 2 0 8 25 48 67 79 0 2 5 11 16 18 92 78 65 37 16 0 18 16 14 7 3 0 0 0 0 11 50 83 0 0 0 2 11 16 92 83 67 48 21 4 19 19 16 11 5 1 1 8 24 3 17 52 4 25 48 1 6 11 0 9 57 0 2 12 38 19 19 25 25 22 15 14 9 5 3 2 66 64 49 39 36 19 13 8 4 12 12 9 6 5 3 1 1 1 63 63 47 32 26 16 5 5 5 13 13 12 9 9 4 3 2 0 68 63 50 46 39 17 13 9 0 31 24 19 13 8 5 2 82 63 50 34 21 13 5 16 14 11 6 3 2 1 84 74 56 32 16 11 4 15 10 8 7 5 2 1 78 53 42 37 26 9 5 26 21 15 10 5 3 2 57 47 32 21 11 6 4 12 9 3 2 1 0 0 52 41 14 9 5 0 0 14 12 9 6 2 1 0 61 52 38 25 8 4 0 37 34 25 18 12 4 2 0 0 94 81 57 46 28 8 4 0 0 18 15 11 8 5 1 1 0 0 88 74 46 39 21 4 4 0 0 19 19 14 10 5 3 1 0 0 96 83 63 42 26 13 4 0 0 29 19 7 2 0 0 0 75 40 15 4 0 0 0 12 8 3 2 0 0 0 63 33 13 8 0 0 0 17 10 4 0 0 0 0 83 42 17 0 0 0 0 33 21 11 5 0 0 0 77 55 25 11 0 0 0 17 10 4 2 0 0 0 71 46 17 8 0 0 0 16 11 4 1 0 0 0 83 57 21 4 0 0 0 5 6 2 7 7 14 30 29 30 176 91 85 66 27 35 8 8 9 4 4 4 161 63 94 3 3 3 10 0 4 0 1 1 1 374 229 145 9 12 7 177 88 75 197 127 70 4 4 3 5 6 3 48 51 46 9 9 8 377 182 195 151 66 79 62 28 31 32 16 16 3 2 3 1 1 1 32 31 27 19 12 6 4 96 78 59 41 26 13 9 16 15 12 7 5 1 1 96 78 52 30 22 4 4 16 16 14 10 7 4 1 96 75 57 46 28 17 4 30 18 4 1 0 0 0 78 38 9 2 0 0 0 16 9 0 0 0 0 0 78 39 0 0 0 0 0 14 9 4 1 0 0 0 75 38 17 4 0 0 0 8 4 3 17 17 13 38 19 19 6 4 1 13 17 5 7 3 3 15 13 13 1 1 0 3 5 0 1 1 0 0 0 0 0 0 0 46 31 15 14 9 5 32 19 10 4 10 26 11 26 67 3 7 14 13 37 70 0 3 12 0 16 63 57 33 24 1 1 1 14 8 5 30 35 22 14 5 9 33 26 39 5 2 3 13 11 13 0 2 0 6 4 2 16 21 9 6 2 2 13 9 9 18 9 9 39 39 39 17 6 10 38 26 42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 3 3 1 3 4 4 3 6 3 5 1 5 4 8 2 5 5 5 7 7 10 7 10 11 9 9 16 13 12 16 3 4 5 7 9 11 5 3 6 5 3 5 1 1 0 2 3 2 4 3 4 1 1 2 4 5 6 0 0 0 1 0 2 0 2 1 2 1 2 2 1 3 1 2 0 4 6 7 5 6 5 3 2 9 9 9 9 0 0 0 0 0 2 0 1 2 0 0 3 2 1 3 1 2 0 2 1 1 0 0 1 2 1 2 2 1 1 2 2 1 3 1 3 2 2 6 3 3 7 4 3 7 42 0 0 0 0 0 0 -2 -1 -1 -2 -1 -1 1 5 17 28 2 11 37 67 1 3 9 15 4 13 39 65 0 1 6 12 0 5 28 56 0 1 5 24 0 3 13 63 0 1 5 13 0 4 21 54 0 0 0 10 0 0 0 53 37 18 19 38 19 19 44 21 21 92 44 48 19 8 7 40 19 20 22 11 11 47 23 22 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 2 2 2 50 52 48 23 22 25 0 0 0 27 27 21 40 28 39 15 9 18 14 11 9 56 52 61 34 36 25 50 43 56 66 27 35 73 38 34 1 1 1 1 1 1 139 66 73 3 2 3 4 1 1 3 1 1 3 1 1 0 0 0 38 19 19 10 11 9 3 5 3 2 3 6 4 1 12 8 7 4 8 5 5 3 4 2 2 1 31 21 16 9 42 25 23 14 17 9 9 4 19 16 13 5 4 3 11 9 4 42 36 28 23 18 14 57 47 21 19 13 11 8 4 4 13 11 9 7 7 7 6 2 2 1 1 1 50 34 28 21 11 11 61 56 47 37 37 37 32 11 9 5 5 5 28 18 12 6 11 7 5 2 13 7 3 1 62 40 31 13 50 32 23 9 68 37 16 5 1 0 1 32 15 17 1 0 0 2 0 4 78 74 74 2 0 0 4 1 2 0 0 0 1 0 0 8 4 11 0 0 0 2 0 0 32 16 16 350 172 178 10 10 11 23 27 20 6 9 21 13 8 13 26 17 4 2 5 6 7 2 2 0 0 0 1 1 3 2 0 0 0 1 4 3 1 0 0 0 0 38 19 19 10 8 1 1 1 0 26 35 5 2 4 0 0 0 0 3 1 0 26 11 13 6 4 1 7 3 3 10 7 3 12 3 5 15 13 13 22 13 25 13 21 0 0 0 7 4 0 57 48 54 15 17 4 30 6 6 6 2 2 3 9 10 7 18 12 6 0 0 0 0 0 0 41 52 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 2 7 2 3 17 8 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
df2022fl is the dataframe with 24 rows.
Compare both dataframes using
df2022fl.ge(df.iloc[0]).sum()
This gives, for each column, the number of values in df2022fl that are greater than or equal to the value in df.
Output :
id 24
table_position 20
performance_rank 20
risk 23
competition_id 24
..
freekicks_total_over295_home 24
freekicks_total_over295_away 24
freekicks_total_over305_overall 24
freekicks_total_over305_home 24
freekicks_total_over305_away 24
Length: 1759, dtype: int64
To get, for each row, the number of columns that are greater than or equal to the values in df, use axis=1.
df2022fl['stats'] = df2022fl.ge(df.iloc[0]).sum(axis=1)
This gives you the expected output:
id table_position ... freekicks_total_over305_away stats
1 234.0 6.0 ... 0.0 1688
2 235.0 18.0 ... 0.0 1529
3 236.0 16.0 ... 0.0 1565
4 237.0 24.0 ... 0.0 1409
5 242.0 3.0 ... 0.0 1566
6 244.0 4.0 ... 0.0 1681
7 246.0 23.0 ... 0.0 1607
8 247.0 5.0 ... 0.0 1642
9 248.0 14.0 ... 0.0 1603
10 253.0 15.0 ... 0.0 1575
11 254.0 12.0 ... 0.0 1554
12 255.0 13.0 ... 0.0 1593
13 257.0 20.0 ... 0.0 1533
14 258.0 21.0 ... 0.0 1537
15 259.0 9.0 ... 0.0 1585
16 262.0 17.0 ... 0.0 1488
17 265.0 11.0 ... 0.0 1647
18 267.0 7.0 ... 0.0 1628
19 268.0 2.0 ... 0.0 1615
20 1020.0 1.0 ... 0.0 1601
21 1827.0 8.0 ... 0.0 1603
22 1833.0 22.0 ... 0.0 1587
23 3124.0 19.0 ... 0.0 1594
24 3141.0 10.0 ... 0.0 1623
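As a self-contained illustration of the same idea (hypothetical miniature frames and column names, not the real 1759-column data), the row-wise count works like this:

import pandas as pd

# Toy stand-ins for the real frames, just to show the semantics of ge + sum(axis=1).
df = pd.DataFrame([[10, 5, 3]], columns=['a', 'b', 'c'])        # one-row reference frame
df2022fl = pd.DataFrame([[12, 4, 3],
                         [9, 6, 7]], columns=['a', 'b', 'c'])   # multi-row frame

# .ge aligns on column labels; .sum(axis=1) counts the True comparisons per row.
df2022fl['stats'] = df2022fl.ge(df.iloc[0]).sum(axis=1)
print(df2022fl)
#     a  b  c  stats
# 0  12  4  3      2
# 1   9  6  7      2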

how to replace specific data using pandas python or excel

I have a CSV file containing some data, part of which is stored in semicolon-delimited fields. Within these semicolon-delimited fields there are specific id numbers, and I need to replace each of them with the corresponding location name.
Available data
24CFA4A-12L - GF Electrical corridor
Semicolon-delimited data in which the id number needs to be replaced:
1;1;35;;1/2/1/37 24CFA4A;;;0;;;
Files with data - https://gofile.io/d/bQDppz
Thank you if anyone has a solution.
[Screenshot: main data to be replaced after finding the id number and replacing it with the location name]
Supposing you have dataframes:
df1 = pd.read_excel("ID_list.xlsx", header=None)
df2 = pd.read_excel("location.xlsx", header=None)
df1:
0
0 1;1;27;;1/2/1/29 25BAB3D;;;0;;;
1 1;1;27;1;;;;0;;;
2 1;1;28;;1/2/1/30 290E6D2;;;0;;;
3 1;1;28;1;;;;0;;;
4 1;1;29;;1/2/1/31 28BA737;;;0;;;
5 1;1;29;1;;;;0;;;
6 1;1;30;;1/2/1/32 2717823;;;0;;;
7 1;1;30;1;;;;0;;;
8 1;1;31;;1/2/1/33 254DEAA;;;0;;;
9 1;1;31;1;;;;0;;;
10 1;1;32;;1/2/1/34 28AE041;;;0;;;
11 1;1;32;1;;;;0;;;
12 1;1;33;;1/2/1/35 254DE82;;;0;;;
13 1;1;33;1;;;;0;;;
14 1;1;34;;1/2/1/36 2539D70;;;0;;;
15 1;1;34;1;;;;0;;;
16 1;1;35;;1/2/1/37 24CFA4A;;;0;;;
17 1;1;35;1;;;;0;;;
18 1;1;36;;1/2/1/39 28F023E;;;0;;;
19 1;1;36;1;;;;0;;;
20 1;1;37;;1/2/1/40 2717831;;;0;;;
21 1;1;37;1;;;;0;;;
22 1;1;38;;1/2/1/41 2397D75;;;0;;;
23 1;1;38;1;;;;0;;;
24 1;1;39;;1/2/1/42 287844C;;;0;;;
25 1;1;39;1;;;;0;;;
26 1;1;40;;1/2/1/43 28784F0;;;0;;;
27 1;1;40;1;;;;0;;;
28 1;1;41;;1/2/1/44 2865B67;;;0;;;
29 1;1;41;1;;;;0;;;
30 1;1;42;;1/2/1/45 2865998;;;0;;;
31 1;1;42;1;;;;0;;;
32 1;1;43;;1/2/1/46 287852F;;;0;;;
33 1;1;43;1;;;;0;;;
34 1;1;44;;1/2/1/47 287AC43;;;0;;;
35 1;1;44;1;;;;0;;;
36 1;1;45;;1/2/1/48 287ACF8;;;0;;;
37 1;1;45;1;;;;0;;;
38 1;1;46;;1/2/1/49 2878586;;;0;;;
39 1;1;46;1;;;;0;;;
40 1;1;47;;1/2/1/50 2878474;;;0;;;
41 1;1;47;1;;;;0;;;
42 1;1;48;;1/2/1/51 2846315;;;0;;;
df2:
0 1
0 GF General Dining TC 254DEAA-02L
1 GF General Dining TC 2717823-26L
2 GF General Dining FC 28BA737-50L
3 GF Preparation FC 25BAB3D-10L
4 GF Preparation TC 290E6D2-01M
5 GF Hospital Kitchen FC 25BAB2F-10L
6 GF Hospital Kitchen TC 2906F5C-01M
7 GF Food Preparation FC 25F5723-10L
8 GF Food Preparation TC 29070D6-01M
9 GF KITCHEN Corridor 254DF5D-02L
Then:
df1 = df1[0].str.split(";", expand=True)                             # one column per semicolon-delimited field
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")   # keep only the id part, e.g. "25BAB3D"
df2[1] = df2[1].apply(lambda x: x.split("-")[0])                     # strip the "-10L" style suffix from the id
df1:
0 1 2 3 4 5 6 7 8 9 10
0 1 1 27 25BAB3D 0
1 1 1 27 1 0
2 1 1 28 290E6D2 0
3 1 1 28 1 0
4 1 1 29 28BA737 0
5 1 1 29 1 0
6 1 1 30 2717823 0
7 1 1 30 1 0
8 1 1 31 254DEAA 0
9 1 1 31 1 0
10 1 1 32 28AE041 0
11 1 1 32 1 0
12 1 1 33 254DE82 0
13 1 1 33 1 0
14 1 1 34 2539D70 0
15 1 1 34 1 0
16 1 1 35 24CFA4A 0
17 1 1 35 1 0
18 1 1 36 28F023E 0
19 1 1 36 1 0
20 1 1 37 2717831 0
21 1 1 37 1 0
22 1 1 38 2397D75 0
23 1 1 38 1 0
24 1 1 39 287844C 0
25 1 1 39 1 0
26 1 1 40 28784F0 0
27 1 1 40 1 0
28 1 1 41 2865B67 0
29 1 1 41 1 0
30 1 1 42 2865998 0
31 1 1 42 1 0
32 1 1 43 287852F 0
33 1 1 43 1 0
34 1 1 44 287AC43 0
35 1 1 44 1 0
36 1 1 45 287ACF8 0
37 1 1 45 1 0
38 1 1 46 2878586 0
39 1 1 46 1 0
40 1 1 47 2878474 0
41 1 1 47 1 0
42 1 1 48 2846315 0
df2:
0 1
0 GF General Dining TC 254DEAA
1 GF General Dining TC 2717823
2 GF General Dining FC 28BA737
3 GF Preparation FC 25BAB3D
4 GF Preparation TC 290E6D2
5 GF Hospital Kitchen FC 25BAB2F
6 GF Hospital Kitchen TC 2906F5C
7 GF Food Preparation FC 25F5723
8 GF Food Preparation TC 29070D6
9 GF KITCHEN Corridor 254DF5D
To replace the values:
m = dict(zip(df2[1], df2[0]))   # map id -> location name
df1[4] = df1[4].replace(m)
df1:
0 1 2 3 4 5 6 7 8 9 10
0 1 1 27 GF Preparation FC 0
1 1 1 27 1 0
2 1 1 28 GF Preparation TC 0
3 1 1 28 1 0
4 1 1 29 GF General Dining FC 0
5 1 1 29 1 0
6 1 1 30 GF General Dining TC 0
7 1 1 30 1 0
8 1 1 31 GF General Dining TC 0
9 1 1 31 1 0
10 1 1 32 28AE041 0
11 1 1 32 1 0
12 1 1 33 254DE82 0
13 1 1 33 1 0
14 1 1 34 2539D70 0
15 1 1 34 1 0
16 1 1 35 24CFA4A 0
17 1 1 35 1 0
18 1 1 36 28F023E 0
19 1 1 36 1 0
20 1 1 37 2717831 0
21 1 1 37 1 0
22 1 1 38 2397D75 0
23 1 1 38 1 0
24 1 1 39 287844C 0
25 1 1 39 1 0
26 1 1 40 28784F0 0
27 1 1 40 1 0
28 1 1 41 2865B67 0
29 1 1 41 1 0
30 1 1 42 2865998 0
31 1 1 42 1 0
32 1 1 43 287852F 0
33 1 1 43 1 0
34 1 1 44 287AC43 0
35 1 1 44 1 0
36 1 1 45 287ACF8 0
37 1 1 45 1 0
38 1 1 46 2878586 0
39 1 1 46 1 0
40 1 1 47 2878474 0
41 1 1 47 1 0
42 1 1 48 2846315 0
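Putting the steps together, a minimal end-to-end sketch (using two hard-coded sample rows instead of the Excel files, so it can be run on its own; requires Python 3.8+ for the walrus operator) would be:

import pandas as pd

# Hypothetical inline data standing in for ID_list.xlsx and location.xlsx.
df1 = pd.DataFrame({0: ["1;1;35;;1/2/1/37 24CFA4A;;;0;;;",
                        "1;1;27;;1/2/1/29 25BAB3D;;;0;;;"]})
df2 = pd.DataFrame({0: ["GF Electrical corridor", "GF Preparation FC"],
                    1: ["24CFA4A-12L", "25BAB3D-10L"]})

df1 = df1[0].str.split(";", expand=True)                             # split into columns
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")   # isolate the id
df2[1] = df2[1].apply(lambda x: x.split("-")[0])                     # drop the "-12L" suffix

m = dict(zip(df2[1], df2[0]))   # id -> location name
df1[4] = df1[4].replace(m)
print(df1[4].tolist())
# ['GF Electrical corridor', 'GF Preparation FC']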

Pandas code to get the count of each values

Here I'm sharing some sample data (I'm dealing with big data); the "counts" value varies from 1 to 3000+, sometimes more than that.
The sample data looks like:
ID counts
41 44 17 16 19 52 6
17 30 16 19 4
52 41 44 30 17 16 6
41 44 52 41 41 41 6
17 17 17 17 41 5
I was trying to split the "ID" column into multiple columns and get the counts from those:
data = pd.read_csv(...)  # reading the csv file
split_data = data.ID.apply(lambda x: pd.Series(str(x).split(" ")))  # separating columns
As I mentioned, I'm dealing with big data, so this method is not very effective, and I'm having trouble getting the "ID" counts.
I want to collect the total count of each ID and map it to the corresponding ID column.
Expected output:
ID counts 16 17 19 30 41 44 52
41 41 17 16 19 52 6 1 1 1 0 2 0 1
17 30 16 19 4 1 1 1 1 0 0 0
52 41 44 30 17 16 6 1 1 0 1 1 1 1
41 44 52 41 41 41 6 0 0 0 0 4 1 1
17 17 17 17 41 5 0 4 0 0 1 0 0
If you have any idea, please let me know.
Thank you
Use Counter to get the counts of the values split by space, in a list comprehension:
from collections import Counter
import pandas as pd
L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print (df)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
Another idea, but probably slower:
df1 = df.assign(a = df['ID'].str.split()).explode('a')
df1 = df.join(pd.crosstab(df1['ID'], df1['a']), on='ID')
print (df1)
ID counts 16 17 19 30 41 44 52
0 41 44 17 16 19 52 6 1 1 1 0 1 1 1
1 17 30 16 19 4 1 1 1 1 0 0 0
2 52 41 44 30 17 16 6 1 1 0 1 1 1 1
3 41 44 52 41 41 41 6 0 0 0 0 4 1 1
4 17 17 17 17 41 5 0 4 0 0 1 0 0
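To try the Counter approach standalone, a small sketch that first rebuilds the sample frame from the question could look like this:

from collections import Counter
import pandas as pd

# Rebuild the sample data shown in the question.
df = pd.DataFrame({'ID': ['41 44 17 16 19 52', '17 30 16 19', '52 41 44 30 17 16',
                          '41 44 52 41 41 41', '17 17 17 17 41'],
                   'counts': [6, 4, 6, 6, 5]})

L = [{int(k): v for k, v in Counter(x.split()).items()} for x in df['ID']]
df1 = pd.DataFrame(L, index=df.index).fillna(0).astype(int).sort_index(axis=1)
df = df.join(df1)
print(df)

The Counter version avoids creating a pandas Series per row, which is what makes the apply-based split slow on large data.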

Reading XML (NIST .n42 file) and data extraction

I have an XML file and I need to extract the data from 'ChannelData' in the XML below.
from xml.dom import minidom
xmldoc = minidom.parse('Annex_B_n42.xml')
itemlist = xmldoc.getElementsByTagName('ChannelData')
print(len(itemlist))
print(itemlist[0].attributes['compressionCode'].value)
for s in itemlist:
    print(s.attributes['compressionCode'].value)
This doesn't return the data, just the value 'None'.
I also tried another approach from a different example:
import xml.etree.ElementTree as ET
root = ET.parse('Annex_B_n42.xml').getroot()
#value=[]
for type_tag in root.findall('Spectrum'):
    value = type_tag.get('id')
    print(value)
    print("data from file " + str(value))
This did not work at all, and value is never populated. I really don't understand how to parse XML.
Here is the xml file:
<?xml version="1.0"?>
<?xml-model href="http://physics.nist.gov/N42/2011/N42/schematron/n42.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<RadInstrumentData xmlns="http://physics.nist.gov/N42/2011/N42" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://physics.nist.gov/N42/2011/N42 file:///d:/Data%20Files/ANSI%20N42%2042/V2/Schema/n42.xsd" n42DocUUID="d72b7fa7-4a20-43d4-b1b2-7e3b8c6620c1">
<RadInstrumentInformation id="RadInstrumentInformation-1">
<RadInstrumentManufacturerName>RIIDs R Us</RadInstrumentManufacturerName>
<RadInstrumentModelName>iRIID</RadInstrumentModelName>
<RadInstrumentClassCode>Radionuclide Identifier</RadInstrumentClassCode>
<RadInstrumentVersion>
<RadInstrumentComponentName>Software</RadInstrumentComponentName>
<RadInstrumentComponentVersion>1.1</RadInstrumentComponentVersion>
</RadInstrumentVersion>
</RadInstrumentInformation>
<RadDetectorInformation id="RadDetectorInformation-1">
<RadDetectorCategoryCode>Gamma</RadDetectorCategoryCode>
<RadDetectorKindCode>NaI</RadDetectorKindCode>
</RadDetectorInformation>
<EnergyCalibration id="EnergyCalibration-1">
<CoefficientValues>-21.8 12.1 6.55e-03</CoefficientValues>
</EnergyCalibration>
<RadMeasurement id="RadMeasurement-1">
<MeasurementClassCode>Foreground</MeasurementClassCode>
<StartDateTime>2003-11-22T23:45:19-07:00</StartDateTime>
<RealTimeDuration>PT60S</RealTimeDuration>
<Spectrum id="RadMeasurement-1Spectrum-1" radDetectorInformationReference="RadDetectorInformation-1" energyCalibrationReference="EnergyCalibration-1">
<LiveTimeDuration>PT59.61S</LiveTimeDuration>
<ChannelData compressionCode="None">
0 0 0 22 421 847 1295 1982 2127 2222 2302 2276
2234 1921 1939 1715 1586 1469 1296 1178 1127 1047 928 760
679 641 542 529 443 423 397 393 322 272 294 227
216 224 208 191 189 163 167 173 150 137 136 129
150 142 160 159 140 103 90 82 83 85 67 76
73 84 63 74 70 69 76 61 49 61 63 65
58 62 48 75 56 61 46 56 43 37 55 47
50 40 38 54 43 41 45 51 32 35 29 33
40 44 33 35 20 26 27 17 19 20 16 19
18 19 18 20 17 45 55 70 62 59 32 30
21 23 10 9 5 13 11 11 6 7 7 9
11 4 8 8 14 14 11 9 13 5 5 6
10 9 3 4 3 7 5 5 4 5 3 6
5 0 5 6 3 1 4 4 3 10 11 4
1 4 2 11 9 6 3 5 5 1 4 2
6 6 2 3 0 2 2 2 2 0 1 3
1 1 2 3 2 4 5 2 6 4 1 0
3 1 2 1 1 0 1 0 0 2 0 1
0 0 0 1 0 0 0 0 0 0 0 2
0 0 0 1 0 1 0 0 2 1 0 0
0 0 1 3 0 0 0 1 0 1 0 0
0 0 0 0
</ChannelData>
</Spectrum>
</RadMeasurement>
</RadInstrumentData>
You can use BeautifulSoup to get the ChannelData tag value like the following:
from bs4 import BeautifulSoup
with open('Annex_B_n42.xml') as f:
    xml = f.read()
bs_obj = BeautifulSoup(xml, 'html.parser')  # html.parser lowercases tag names, hence "channeldata" below
print(bs_obj.find_all("channeldata")[0].text)
That will print:
' 0 0 0 22 421 847 1295 1982 2127 2222 2302 2276 2234 1921 1939 1715 1586 1469 1296 1178 1127 1047 928 760 679 641 542 529 443 423 397 393 322 272 294 227 216 224 208 191 189 163 167 173 150 137 136 129 150 142 160 159 140 103 90 82 83 85 67 76 73 84 63 74 70 69 76 61 49 61 63 65 58 62 48 75 56 61 46 56 43 37 55 47 50 40 38 54 43 41 45 51 32 35 29 33 40 44 33 35 20 26 27 17 19 20 16 19
18 19 18 20 17 45 55 70 62 59 32 30 21 23 10 9 5 13 11 11 6 7 7 9 11 4 8 8 14 14 11 9 13 5 5 6 10 9 3 4 3 7 5 5 4 5 3 6 5 0 5 6 3 1 4 4 3 10 11 4 1 4 2 11 9 6 3 5 5 1 4 2 6 6 2 3 0 2 2 2 2 0 1 3 1 1 2 3 2 4 5 2 6 4 1 0 3 1 2 1 1 0 1 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 0 0 2 1 0 0 0 0 1 3 0 0 0 1 0 1 0 0 0 0 0 0 '
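If you need the channel values as numbers rather than one long string, a short follow-up (reusing bs_obj from above) could be:

counts = [int(v) for v in bs_obj.find_all("channeldata")[0].text.split()]
print(len(counts), counts[:12])   # number of channels and the first few values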
Try this:
import xml.etree.ElementTree as ET
root = ET.parse('Annex_B_n42.xml').getroot()
elems = root.findall(".//*[@compressionCode='None']")
print(elems[0].text)
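Note that the first attempt in the question prints 'None' because the compressionCode attribute's value in this file is literally the string "None"; the channel counts live in the element's text, not in an attribute. The findall('Spectrum') attempt finds nothing because of the default namespace declared in the file: every element is really named '{http://physics.nist.gov/N42/2011/N42}Spectrum'. A namespace-aware sketch (namespace URI and tag names taken from the file above) that pulls the channel data out as integers:

import xml.etree.ElementTree as ET

ns = {'n42': 'http://physics.nist.gov/N42/2011/N42'}   # default namespace declared in the file
root = ET.parse('Annex_B_n42.xml').getroot()

for spectrum in root.findall('.//n42:Spectrum', ns):
    print(spectrum.get('id'))                           # e.g. RadMeasurement-1Spectrum-1
    channel_data = spectrum.find('n42:ChannelData', ns)
    counts = [int(v) for v in channel_data.text.split()]
    print(len(counts), counts[:12])                     # channel count and the first few values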

Reshape data to new format for object detection

I have a data set in this format in a dataframe:
0--Parade/0_Parade_marchingband_1_849.jpg
2
449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg
1
361 98 263 339 0 0 0 0 0 0
0--Parade/0_Parade_marchingband_1_799.jpg
45
78 221 7 8 0 0 0 0 0
78 238 14 17 2 0 0 0 0 0
3 232 11 15 2 0 0 0 2 0
20 215 12 16 2 0 0 0 2 0
0--Parade/0_Parade_marchingband_1_117.jpg
23
69 359 50 36 1 0 0 0 0 1
227 382 56 43 1 0 1 0 0 1
296 305 44 26 1 0 0 0 0 1
353 280 40 36 2 0 0 0 2 1
885 377 63 41 1 0 0 0 0 1
819 391 34 43 2 0 0 0 1 0
727 342 37 31 2 0 0 0 0 1
598 246 33 29 2 0 0 0 0 1
740 308 45 33 1 0 0 0 2 1
0--Parade/0_Parade_marchingband_1_778.jpg
35
27 226 33 36 1 0 0 0 2 0
63 95 16 19 2 0 0 0 0 0
64 63 17 18 2 0 0 0 0 0
88 13 16 15 2 0 0 0 1 0
231 1 13 13 2 0 0 0 1 0
263 122 14 20 2 0 0 0 0 0
367 68 15 23 2 0 0 0 0 0
198 98 15 18 2 0 0 0 0 0
293 161 52 59 1 0 0 0 1 0
412 36 14 20 2 0 0 0 1 0
Can anyone tell me how to put these into a dataframe where the first column contains the .jpg path, the next column contains the count, and the third column contains the coordinates, with every coordinate row kept in correspondence with its .jpg path?
e.g.
Column1 | Column2 | Column3
0--Parade/0_Parade_marchingband_1_849.jpg | 2 | 449 330 122 149 0 0 0 0 0 0
0--Parade/0_Parade_Parade_0_904.jpg | 1 | 361 98 263 339 0 0 0 0 0 0
0--Parade/0_Parade_marchingband_1_799.jpg | 45 | 78 221 7 8 0 0 0 0 0
| | 78 238 14 17 2 0 0 0 0 0
| | 3 232 11 15 2 0 0 0 2 0
| | 20 215 12 16 2 0 0 0 2 0
I have tried this
count1 = 0
count2 = 0
dict1 = {}
dict2 = {}
dict3 = {}
for i in data[0]:
    if (i.find('.jpg') == -1):
        dict1[count1] = i
        count1 += 1
    else:
        dict2[count2] = i
        count2 += 1
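One possible way to build the dataframe (a sketch, assuming the annotations sit in a plain-text file, here called 'annotations.txt', in the layout shown above: a .jpg path line followed by a count line and then the coordinate lines) is to walk the lines and group them by image:

import pandas as pd

rows = []
current_path, current_count = None, None

with open('annotations.txt') as f:                       # hypothetical file name
    for line in (l.strip() for l in f):
        if not line:
            continue
        if line.endswith('.jpg'):                        # a new image block starts
            current_path, current_count = line, None
        elif current_count is None:                      # the line right after the path is the count
            current_count = int(line)
        else:                                            # everything else is a coordinate row
            rows.append({'path': current_path, 'count': current_count, 'coords': line})

df = pd.DataFrame(rows)
print(df.head())

Each coordinate row then carries its own path and count, matching the layout in the example output above.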
