How to keep pandas group by column when applying transform function? - python

This is what my pandas DataFrame looks like:
sampling_time MQ2_LPG MQ2_CO MQ2_SMOKE MQ2_ALCOHOL MQ2_CH4 MQ2_H2 MQ2_PROPANE
0 2018-07-15 08:41:49.028 4.41 32.87 19.12 7.70 10.29 7.59 4.49
1 2018-07-15 08:41:49.028 2.98 19.08 12.47 4.72 6.34 5.15 3.02
2 2018-07-15 08:41:49.028 2.73 16.88 11.33 4.22 5.69 4.72 2.76
3 2018-07-15 08:41:49.028 2.69 16.47 11.11 4.13 5.57 4.64 2.71
4 2018-07-15 08:41:49.028 2.66 16.26 11.00 4.09 5.50 4.60 2.69
When I do a groupby (split-apply-combine), my sampling_time column is removed.
transformed = dataframe.groupby('sampling_time').transform(lambda x: (x - x.mean()) / x.std())
transformed.head()
MQ2_LPG MQ2_CO MQ2_SMOKE MQ2_ALCOHOL MQ2_CH4 MQ2_H2 MQ2_PROPANE
0 15.710127 15.975636 15.773724 15.876433 15.874190 15.694674
1 3.519619 3.313661 3.494836 3.408578 3.404160 3.563717
2 1.388411 1.293621 1.389884 1.316656 1.352130 1.425885
3 1.047418 0.917159 0.983665 0.940110 0.973294 1.028148
4 0.791673 0.724337 0.780556 0.772756 0.752306 0.829280
Any help or suggestion about how to keep the sampling_time column would be much appreciated.

You can do this by setting 'sampling_time' as the index; then when you run groupby with transform, you get the transformed columns back together with the index.
df1 = df.set_index('sampling_time')
df1.groupby('sampling_time').transform(lambda x: x-x.std())
output:
MQ2_LPG MQ2_CO MQ2_SMOKE MQ2_ALCOHOL \
sampling_time
2018-07-15 08:41:49.028 3.663522 25.760508 15.652432 6.154209
2018-07-15 08:41:49.028 2.233522 11.970508 9.002432 3.174209
2018-07-15 08:41:49.028 1.983522 9.770508 7.862432 2.674209
2018-07-15 08:41:49.028 1.943522 9.360508 7.642432 2.584209
2018-07-15 08:41:49.028 1.913522 9.150508 7.532432 2.544209
MQ2_CH4 MQ2_H2 MQ2_PROPANE
sampling_time
2018-07-15 08:41:49.028 8.243523 6.313227 3.7205
2018-07-15 08:41:49.028 4.293523 3.873227 2.2505
2018-07-15 08:41:49.028 3.643523 3.443227 1.9905
2018-07-15 08:41:49.028 3.523523 3.363227 1.9405
2018-07-15 08:41:49.028 3.453523 3.323227 1.9205
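If you want to keep the original z-score transform, the same trick applies; a minimal sketch, assuming df is the sensor DataFrame shown at the top:
import pandas as pd

# Move sampling_time into the index so transform carries it along
df1 = df.set_index('sampling_time')
transformed = df1.groupby('sampling_time').transform(lambda x: (x - x.mean()) / x.std())
# reset_index() turns sampling_time back into a regular column if you prefer
transformed = transformed.reset_index()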

Related

Getting Empty DataFrame in pandas from table data

I'm getting data when using the print command, but the pandas DataFrame throws the result as: Empty DataFrame, Columns: [], Index: []
Script:
from bs4 import BeautifulSoup
import requests
import re
import json
import pandas as pd

url = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1640132253903&t=XNAS:AAPL'
req = requests.get(url).text
#print(req)
data = re.search(r'jsonp1640132253903\((\{.*\})\)', req).group(1)
json_data = json.loads(data)['componentData']
#print(json_data)
# with open('index.html','w') as f:
#     f.write(json_data)
soup = BeautifulSoup(json_data, 'lxml')
for tr in soup.select('tr'):
    row_data = [td.get_text(strip=True) for td in tr.select('td,th') if td.text]
    if not row_data:
        continue
    if len(row_data) < 12:
        row_data = ['Particulars'] + row_data
    #print(row_data)
    df = pd.DataFrame(row_data)
print(df)
Print result:
['Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM']
['RevenueUSD Mil', '156,508', '170,910', '182,795', '233,715', '215,639', '229,234', '265,595', '260,174', '274,515', '365,817', '365,817']
['Gross Margin %', '43.9', '37.6', '38.6', '40.1', '39.1', '38.5', '38.3', '37.8', '38.2', '41.8', '41.8']
['Operating IncomeUSD Mil', '55,241', '48,999', '52,503', '71,230', '60,024', '61,344', '70,898', '63,930', '66,288', '108,949', '108,949']
['Operating Margin %', '35.3', '28.7', '28.7', '30.5', '27.8', '26.8', '26.7', '24.6', '24.1', '29.8', '29.8']
['Net IncomeUSD Mil', '41,733', '37,037', '39,510', '53,394', '45,687', '48,351', '59,531', '55,256', '57,411', '94,680', '94,680']
['Earnings Per ShareUSD', '1.58', '1.42', '1.61', '2.31', '2.08', '2.30', '2.98', '2.97', '3.28', '5.61', '5.61']
Expected output:
2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM
Revenue USD Mil 156,508 170,910 182,795 233,715 215,639 229,234 265,595 260,174 274,515 365,817 365,817
Gross Margin % 43.9 37.6 38.6 40.1 39.1 38.5 38.3 37.8 38.2 41.8 41.8
Operating Income USD Mil 55,241 48,999 52,503 71,230 60,024 61,344 70,898 63,930 66,288 108,949 108,949
Operating Margin % 35.3 28.7 28.7 30.5 27.8 26.8 26.7 24.6 24.1 29.8 29.8
Net Income USD Mil 41,733 37,037 39,510 53,394 45,687 48,351 59,531 55,256 57,411 94,680 94,680
Earnings Per Share USD 1.58 1.42 1.61 2.31 2.08 2.30 2.98 2.97 3.28 5.61 5.61
Dividends USD 0.09 0.41 0.45 0.49 0.55 0.60 0.68 0.75 0.80 0.85 0.85
Payout Ratio % * — 27.4 28.5 22.3 24.8 26.5 23.7 25.1 23.7 16.3 15.2
Shares Mil 26,470 26,087 24,491 23,172 22,001 21,007 20,000 18,596 17,528 16,865 16,865
Book Value Per Share * USD 4.25 4.90 5.15 5.63 5.93 6.46 6.04 5.43 4.26 3.91 3.85
Operating Cash Flow USD Mil 50,856 53,666 59,713 81,266 65,824 63,598 77,434 69,391 80,674 104,038 104,038
Cap Spending USD Mil -9,402 -9,076 -9,813 -11,488 -13,548 -12,795 -13,313 -10,495 -7,309 -11,085 -11,085
Free Cash Flow USD Mil 41,454 44,590 49,900 69,778 52,276 50,803 64,121 58,896 73,365 92,953 92,953
Free Cash Flow Per Share * USD 1.58 1.61 1.93 2.96 2.24 2.41 2.88 3.07 4.04 5.57 —
Working Capital USD Mil 19,111 29,628 5,083 8,768 27,863 27,831 14,473 57,101 38,321 9,355
Expected columns:
'Particulars', '2012-09', '2013-09', '2014-09', '2015-09', '2016-09', '2017-09', '2018-09', '2019-09', '2020-09', '2021-09', 'TTM'
@QHarr's answer is by far the most straightforward, but in case you are wondering what is wrong with your code: you are overwriting row_data, and thus df, on every iteration of the loop instead of accumulating the rows.
To make your code work, you can instead store each row as an element in a list. Then to build a DataFrame, you can pass this list of rows and the column names to pd.DataFrame:
data = []
soup = BeautifulSoup(json_data, 'lxml')
for tr in soup.select('tr'):
    row_data = [td.get_text(strip=True) for td in tr.select('td,th') if td.text]
    if not row_data:
        continue
    elif len(row_data) < 12:
        columns = ['Particulars'] + row_data
    else:
        data.append(row_data)

df = pd.DataFrame(data, columns=columns)
Result:
>>> df
Particulars 2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM
0 RevenueUSD Mil 156,508 170,910 182,795 233,715 215,639 229,234 265,595 260,174 274,515 365,817 365,817
1 Gross Margin % 43.9 37.6 38.6 40.1 39.1 38.5 38.3 37.8 38.2 41.8 41.8
2 Operating IncomeUSD Mil 55,241 48,999 52,503 71,230 60,024 61,344 70,898 63,930 66,288 108,949 108,949
3 Operating Margin % 35.3 28.7 28.7 30.5 27.8 26.8 26.7 24.6 24.1 29.8 29.8
4 Net IncomeUSD Mil 41,733 37,037 39,510 53,394 45,687 48,351 59,531 55,256 57,411 94,680 94,680
5 Earnings Per ShareUSD 1.58 1.42 1.61 2.31 2.08 2.30 2.98 2.97 3.28 5.61 5.61
6 DividendsUSD 0.09 0.41 0.45 0.49 0.55 0.60 0.68 0.75 0.80 0.85 0.85
7 Payout Ratio % * — 27.4 28.5 22.3 24.8 26.5 23.7 25.1 23.7 16.3 15.2
8 SharesMil 26,470 26,087 24,491 23,172 22,001 21,007 20,000 18,596 17,528 16,865 16,865
9 Book Value Per Share *USD 4.25 4.90 5.15 5.63 5.93 6.46 6.04 5.43 4.26 3.91 3.85
10 Operating Cash FlowUSD Mil 50,856 53,666 59,713 81,266 65,824 63,598 77,434 69,391 80,674 104,038 104,038
11 Cap SpendingUSD Mil -9,402 -9,076 -9,813 -11,488 -13,548 -12,795 -13,313 -10,495 -7,309 -11,085 -11,085
12 Free Cash FlowUSD Mil 41,454 44,590 49,900 69,778 52,276 50,803 64,121 58,896 73,365 92,953 92,953
13 Free Cash Flow Per Share *USD 1.58 1.61 1.93 2.96 2.24 2.41 2.88 3.07 4.04 5.57 —
14 Working CapitalUSD Mil 19,111 29,628 5,083 8,768 27,863 27,831 14,473 57,101 38,321 9,355 —
Use read_html for the DataFrame creation and then drop the all-NaN rows:
json_data=json.loads(data)['componentData']
pd.read_html(json_data)[0].dropna(axis=0, how='all')
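Note: newer pandas versions deprecate passing a literal HTML string to read_html; wrapping the string in StringIO is a small sketch of the workaround:
from io import StringIO

# read_html accepts a file-like object, so this avoids the deprecation warning
df = pd.read_html(StringIO(json_data))[0].dropna(axis=0, how='all')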

Finding highest values "zone" in a 2d matrix in Python

I have a 2d matrix in Python like this (a 10 rows/20 columns list I use to later do an imshow):
[[-20.17 -12.88 -20.7 -25.69 -21.69 -34.22 -32.65 -31.74 -36.36 -37.65
-41.42 -41.14 -44.01 -43.19 -41.85 -39.25 -40.15 -41.31 -39.73 -28.66]
[ 14.18 53.86 70.03 64.39 72.37 39.95 30.44 28.14 20.77 17.98
25.74 25.66 27.56 37.61 42.39 42.39 35.79 41.65 41.65 41.84]
[ 33.71 68.35 69.39 66.7 59.99 40.08 40.08 40.8 26.19 19.82
19.82 18.07 20.32 19.51 24.77 22.81 21.45 21.45 21.45 23.7 ]
[103.72 55.11 32.3 29.47 16.53 15.54 9.4 8.11 5.06 5.06
13.07 13.07 12.99 13.47 13.47 13.47 12.92 12.92 14.27 20.63]
[ 59.02 18.6 37.53 24.5 13.01 34.35 8.16 13.66 12.57 8.11
8.11 8.11 8.11 8.11 8.11 5.66 5.66 5.66 5.66 7.41]
[ 52.69 14.17 7.25 -5.79 3.19 -1.75 -2.43 -3.98 -4.92 -6.68
-6.68 -6.98 -6.98 -8.89 -8.89 -9.15 -9.15 -9.15 -9.15 -9.15]
[ 29.24 10.78 0.6 -3.15 -12.55 3.04 -1.68 -1.68 -1.41 -6.15
-6.15 -6.15 -10.59 -10.59 -10.59 -10.59 -10.59 -9.62 -10.29 -10.29]
[ 6.6 0.11 2.42 0.21 -5.68 -10.84 -10.84 -13.6 -16.12 -14.41
-15.28 -15.28 -15.28 -18.3 -5.55 -13.16 -13.16 -13.16 -13.16 -14.15]
[ 3.67 -11.69 -6.99 -16.75 -19.31 -20.28 -21.5 -21.5 -34.02 -37.16
-25.51 -25.51 -26.36 -26.36 -26.36 -26.36 -29.38 -29.38 -29.59 -29.38]
[ 31.36 -2.87 0.34 -8.06 -12.14 -22.7 -24.39 -25.51 -26.36 -27.37
-29.38 -31.54 -31.54 -31.54 -32.41 -33.26 -33.26 -15.54 -15.54 -15.54]]
I'm trying to find a way to detect the "zone" of this matrix that contains the highest density of high values. It might not contain the single highest value of the whole matrix, obviously.
I suppose that to do so I should define how big this zone is, so let's say it should be 2x2 (I want to find the 2x2 square of items containing the highest values).
I always think I have a logical solution, but then I fail to follow through on how it could actually work!
Does anyone have a suggestion I could start from?
I know there might be easier ways to do this, but here is the approach that was easiest for me. I've created the following function to perform the task; it takes two arguments:
arr: a 2D numpy array.
zone_size: the size of the square zone.
And the function goes like so:
import numpy as np

def get_highest_zone(arr, zone_size):
    max_sum = float("-inf")
    row_idx, col_idx = 0, 0
    # +1 fixes an off-by-one so windows starting at the last valid
    # row/column positions are also considered
    for row in range(arr.shape[0] - zone_size + 1):
        for col in range(arr.shape[1] - zone_size + 1):
            curr_sum = np.sum(arr[row:row+zone_size, col:col+zone_size])
            if curr_sum > max_sum:
                row_idx, col_idx = row, col
                max_sum = curr_sum
    return arr[row_idx:row_idx+zone_size, col_idx:col_idx+zone_size]
Assuming arr is the numpy array posted in your question, applying this function over different zone_sizes will return these values:
>>> get_highest_zone(arr, 2)
[[70.03 64.39]
[69.39 66.7 ]]
>>> get_highest_zone(arr, 3)
[[53.86 70.03 64.39]
[68.35 69.39 66.7 ]
[55.11 32.3 29.47]]
>>> get_highest_zone(arr, 4)
[[ 14.18 53.86 70.03 64.39]
[ 33.71 68.35 69.39 66.7 ]
[103.72 55.11 32.3 29.47]
[ 59.02 18.6 37.53 24.5 ]]
If the zone doesn't have to be square, you will need to modify the code a little (see the sketch below). Also, you should assert that zone_size is not larger than the array dimensions.
Hopefully, this is what you were looking for!
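For reference, a sketch of a rectangular variant of the same brute-force search; zone_height and zone_width are hypothetical parameters replacing the single zone_size:
def get_highest_zone_rect(arr, zone_height, zone_width):
    # hypothetical rectangular variant; assumes arr is a 2D numpy array
    assert zone_height <= arr.shape[0] and zone_width <= arr.shape[1]
    max_sum = float("-inf")
    row_idx, col_idx = 0, 0
    for row in range(arr.shape[0] - zone_height + 1):
        for col in range(arr.shape[1] - zone_width + 1):
            curr_sum = np.sum(arr[row:row+zone_height, col:col+zone_width])
            if curr_sum > max_sum:
                row_idx, col_idx = row, col
                max_sum = curr_sum
    return arr[row_idx:row_idx+zone_height, col_idx:col_idx+zone_width]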

extract only certain rows in a dataframe

I have a dataframe like this:
Code Date Open High Low Close Volume VWAP TWAP
0 US_GWA_BTC 2014-04-01 467.28 488.62 467.28 479.56 74,776.48 482.76 482.82
1 GWA_BTC 2014-04-02 479.20 494.30 431.32 437.08 114,052.96 460.19 465.93
2 GWA_BTC 2014-04-03 437.33 449.74 414.41 445.60 91,415.08 432.29 433.28
...
316 MWA_XRP_US 2018-01-19 1.57 1.69 1.48 1.53 242,563,870.44 1.59 1.59
317 MWA_XRP_US 2018-01-20 1.54 1.62 1.49 1.57 140,459,727.30 1.56 1.56
I want to filter the rows whose Code starts with GWA.
I tried this code, but it's not working:
df.set_index("Code").filter(regex='[GWA_]*', axis=0)
Try using startswith:
df[df.Code.str.startswith('GWA')]
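If you'd rather keep the filter approach: the character class [GWA_]* matches zero or more characters, so it matches every label; anchoring the pattern gives the same behavior as startswith (a sketch):
# '^GWA' keeps only index labels that begin with GWA
df.set_index('Code').filter(regex='^GWA', axis=0)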

Columns error while converting to float

df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]
print(type(df['ID']))
#--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
#-->True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headings, but I only wanted to convert the values.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because, like looping over a dictionary and its keys, looping over a DataFrame only yields the column headings.
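You can see this directly; a minimal sketch, assuming the new_list DataFrame from the question:
# Iterating over a DataFrame yields its column labels, not its values,
# so float(i) is called on strings like 'Elevation'
for i in new_list:
    print(i)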
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
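Both fixes can also be combined when reading the original file; a sketch assuming the same TestdataNALA.csv with ';' separators, ',' decimals, and 'ND' marking missing values:
# decimal=',' parses the comma decimals, na_values='ND' turns ND into NaN
df = pd.read_csv('TestdataNALA.csv', sep=';', decimal=',', na_values='ND')
df.fillna(0, inplace=True)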

Remove values when reading a csv and return to a list

I have a csv file of subjects' XY coordinates. Some XYs have been removed if the X coordinate is less than 5. This can happen for any player and changes over time (see example dataset).
At the start of this file P2, P7, P12 and P17 have removed data, although throughout the file each player will have data missing; for about 90% of the file there will be at least 4 players with missing data at any time point.
Frame Time P1_X P1_Y P2_X P2_Y P3_X P3_Y P4_X P4_Y P5_X P5_Y P6_X P6_Y P7_X P7_Y P8_X P8_Y P9_X P9_Y P10_X P10_Y P11_X P11_Y P12_X P12_Y
0 10:39.2 65.75 45.10 73.74 -3.52 61.91 41.80 67.07 -24.62 77.14 -22.98 93.95 3.51 56.52 28.44 70.21 11.06 73.08 -35.54 69.79 45.73 73.34 29.26 64.73 -40.69 70.90 6.11 70.94 -45.11 42.78 3.00 61.77 -1.05 72.07 38.62
1 10:39.3 65.77 45.16 73.69 -3.35 61.70 41.79 67.19 -24.59 77.17 -23.03 93.90 3.53 56.54 28.38 70.20 11.00 73.15 -35.48 69.79 45.86 73.20 29.30 64.96 -40.77 70.91 6.10 71.04 -45.29 42.84 3.02 61.82 -0.99 72.12 38.71
2 10:39.4 65.78 45.24 73.63 -3.17 61.70 41.79 67.32 -24.56 77.20 -23.05 93.83 3.55 56.59 28.31 70.20 10.92 73.20 -35.41 69.79 45.86 73.03 29.36 65.19 -40.84 70.91 6.10 71.15 -45.50 42.91 3.04 61.89 -0.91 72.16 38.80
3 10:39.5 65.78 45.33 73.57 -3.00 61.49 41.78 67.45 -24.50 77.25 -23.07 93.75 3.57 56.59 28.31 70.21 10.83 73.25 -35.33 69.77 46.01 72.86 29.43 65.45 -40.86 70.90 6.09 71.15 -45.50 43.01 3.08 61.98 -0.81 72.19 38.86
4 10:39.6 65.78 45.33 73.51 -2.86 61.32 41.76 67.45 -24.50 77.31 -23.09 93.64 3.60 56.65 28.22 70.23 10.72 73.29 -35.22 69.72 46.17 72.69 29.51 65.75 -40.84 70.88 6.08 71.24 -45.71 43.11 3.12 62.06 -0.70 72.22 38.90
5 10:39.7 65.75 45.44 73.51 -2.86 61.20 41.73 67.59 -24.37 77.38 -23.10 93.52 3.63 56.73 28.09 70.25 10.59 73.29 -35.22 69.68 46.33 72.49 29.60 66.06 -40.84 70.86 6.05 71.31 -45.91 43.22 3.14 62.13 -0.59 72.26 38.92
6 10:39.8 65.72 45.56 73.45 -2.72 61.08 41.71 67.72 -24.19 77.44 -23.12 93.39 3.69 56.80 27.91 70.27 10.45 73.34 -35.08 69.66 46.48 72.27 29.67 66.36 -40.87 70.86 6.01 71.39 -46.09 43.35 3.17 62.20 -0.47 72.29 38.93
7 10:39.9 65.72 45.56 73.34 -2.48 60.97 41.72 67.92 -23.76 77.51 -23.13 93.23 3.75 56.80 27.91 70.30 10.31 73.40 -34.76 69.64 46.63 72.01 29.74 66.62 -40.93 70.85 5.96 71.39 -46.09 43.51 3.18 62.27 -0.35 72.31 38.93
8 10:40.0 65.73 45.90 73.34 -2.48 60.86 41.72 67.92 -23.76 77.51 -23.13 93.05 3.80 56.91 27.47 70.30 10.31 73.40 -34.76 69.63 46.76 72.01 29.74 66.82 -41.06 70.83 5.88 71.53 -46.45 43.68 3.20 62.27 -0.35 72.29 38.92
9 10:40.1 65.73 46.09 73.29 -2.39 60.74 41.70 68.00 -23.52 77.60 -23.12 92.83 3.86 56.99 27.23 70.35 10.17 73.43 -34.58 69.64 46.88 71.72 29.80 66.99 -41.22 70.80 5.79 71.60 -46.63 43.86 3.23 62.34 -0.22 72.22 38.89
10 10:40.2 65.76 46.27 73.22 -2.32 60.60 41.65 68.07 -23.24 77.71 -23.05 92.83 3.86 57.14 26.98 70.43 10.05 73.47 -34.38 69.68 46.96 71.42 29.85 67.16 -41.38 70.77 5.70 71.64 -46.80 44.04 3.28 62.43 -0.08 72.13 38.86
11 10:40.3 65.81 46.43 73.12 -2.28 60.43 41.60 68.12 -22.93 77.83 -22.94 92.58 3.89 57.32 26.72 70.54 9.92 73.50 -34.16 69.75 46.99 71.08 29.89 67.16 -41.38 70.74 5.62 71.67 -46.96 44.21 3.33 62.54 0.09 72.03 38.84
12 10:40.4 65.87 46.58 72.98 -2.29 60.24 41.55 68.15 -22.57 77.94 -22.76 92.30 3.93 57.52 26.45 70.67 9.78 73.50 -33.91 69.85 47.00 70.72 29.91 67.31 -41.57 70.70 5.52 71.73 -47.15 44.37 3.40 62.66 0.24 72.03 38.84
13 10:40.5 65.91 46.69 72.80 -2.32 60.07 41.49 68.17 -22.18 78.01 -22.53 91.99 3.98 57.71 26.18 70.81 9.60 73.49 -33.68 69.97 47.03 70.33 29.92 67.45 -41.78 70.64 5.38 71.81 -47.35 44.37 3.40 62.80 0.40 71.96 38.81
14 10:40.6 65.94 46.80 72.60 -2.34 59.93 41.43 68.19 -21.77 78.05 -22.27 91.69 4.03 57.89 25.90 70.96 9.42 73.47 -33.47 70.10 47.09 69.93 29.93 67.54 -41.96 70.56 5.20 71.86 -47.53 44.54 3.50 62.98 0.58 71.91 38.77
15 10:40.7 65.95 46.93 72.36 -2.36 59.80 41.38 68.18 -21.32 78.08 -21.99 91.38 4.09 58.11 25.63 71.11 9.26 73.41 -33.26 70.24 47.15 69.50 29.91 67.58 -42.15 70.56 5.20 71.86 -47.69 44.54 3.50 63.16 0.77 71.91 38.77
16 10:40.8 65.93 47.09 72.10 -2.34 59.65 41.36 68.16 -20.86 78.11 -21.68 91.09 4.17 58.35 25.38 71.23 9.13 73.31 -33.05 70.38 47.20 69.07 29.84 67.56 -42.32 70.44 5.00 71.81 -47.84 45.00 3.79 63.34 0.97 71.80 38.60
17 10:40.9 65.92 47.23 71.85 -2.28 59.47 41.37 68.11 -20.41 78.11 -21.37 90.81 4.27 58.59 25.12 71.33 9.00 73.22 -32.84 70.52 47.26 68.63 29.75 67.47 -42.51 70.28 4.78 71.75 -47.97 45.26 3.94 63.52 1.14 71.73 38.46
Because there is missing data, I tried to read the csv file as below. If I removed the try/except block I received a ValueError stating a string couldn't be converted to float.
with open('NoBench.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    n = 0
    for row in readCSV:
        if n == 0:
            n += 1
        try:
            visuals[0].append([float(row[3]),float(row[5]),float(row[7]),float(row[9]),float(row[11]),float(row[13]),float(row[15]),float(row[17]),float(row[19]),float(row[21]),float(row[23]),float(row[25]),float(row[27]),float(row[29]),float(row[31]),float(row[33]),float(row[35]),float(row[37]),float(row[39]),float(row[41]),float(row[43])])
            visuals[1].append([float(row[2]),float(row[4]),float(row[6]),float(row[8]),float(row[10]),float(row[12]),float(row[14]),float(row[16]),float(row[18]),float(row[20]),float(row[22]),float(row[24]),float(row[26]),float(row[28]),float(row[30]),float(row[32]),float(row[34]),float(row[36]),float(row[38]),float(row[40]),float(row[42])])
        except ValueError:
            continue
However, when I use this code, it only appends values to the list when every field in the row is present. As mentioned, this only occurs for about 10% of the file. I am using the XYs to create a scatter plot at each point, so I cannot change missing values to 0,0 as that would create false data points. How do I alter the code so it still returns the XY values for the players whose data isn't removed?
You can define your own converter before the loop:
def convert_float(x):
    if x:  # equivalent to: if x != ''
        return float(x)
    else:
        return 0.0  # or whatever default you want for the missing data
In combination with @juanpa.arrivillaga's excellent suggestion, change the visuals.append lines to these:
visuals[0].append(list(map(convert_float, row[3::2])))
visuals[1].append(list(map(convert_float, row[2::2])))
Also, I'm not sure what your n += 1 line is supposed to do... if you merely wanted to skip the first row (the headers), simply do this:
def convert_float(x):
    if x:
        return float(x)
    else:
        return 0.0

for i, row in enumerate(readCSV):
    if i > 0:  # skip the header row
        visuals[0].append(list(map(convert_float, row[3::2])))
        visuals[1].append(list(map(convert_float, row[2::2])))
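Since filling missing coordinates with 0.0 would create false points on a scatter plot, an alternative sketch (not from the original answer) is to return NaN instead; matplotlib skips NaN coordinates when plotting, so missing samples simply aren't drawn:
import math

def convert_float(x):
    # NaN coordinates are not drawn by plt.scatter, so no false (0, 0) points
    return float(x) if x else math.nan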
