I'm cycling through points in a GeoDataFrame by index in such a way that I am comparing index 0 and 1, then 1 and 2, then 2 and 3, and so on. The purpose is to compare the two points: if the points occupy the same location, pass; else, draw a line between the two points and summarize some stats. I figured that if I compared the distance between the two points and got 0, that pair would be skipped. What I have done before was to pass the two points in a single GeoDataFrame into a function that returns a value for distance. They are in a projected CRS with units of metres.
def getdist(pt_pair):
    shift_pt = pt_pair.shift()
    return pt_pair.distance(shift_pt)[1]
When I pass my 2 points to the function, the first pair returns 0.0, the next pair returns nan, and then I get this error:
Traceback (most recent call last):
File "C:/.../PycharmProjects/.../vessel_track_builder.py", line 33, in <module>
print(getdist(set_pts))
File "C:/.../PycharmProjects/.../vessel_track_builder.py", line 19, in getdist
if math.isnan(mdist1.distance(shift_pt)[1]):
File "C:\OSGEO4~1\apps\Python37\lib\site-packages\pandas\core\series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "C:\OSGEO4~1\apps\Python37\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 997, in
pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1004, in
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1
Process finished with exit code 1
I thought this might be an error in the point geometry, so I added an "if nan, return 0" check to the function.
def getdist(pt_pair):
    shift_pt = pt_pair.shift()
    if math.isnan(pt_pair.distance(shift_pt)[1]):
        return 0
    else:
        return pt_pair.distance(shift_pt)[1]
The result is 0.0, then 0, then the aforementioned error.
I added a print statement for my geodataframes but didn't see anything out of the ordinary:
index ... MMSI MONTH geometry
0 92 ... 123 4 POINT (2221098.494 1668358.870)
1 39 ... 123 4 POINT (2221098.494 1668358.870)
[2 rows x 12 columns]
index ... MMSI MONTH geometry
1 39 ... 456 4 POINT (2221098.494 1668358.870)
2 3231 ... 456 4 POINT (2221098.494 1668358.870)
[2 rows x 12 columns]
index ... MMSI MONTH geometry
2 3231 ... 789 4 POINT (2221098.494 1668358.870)
3 1032 ... 789 4 POINT (2221098.494 1668358.870)
I tried it on some test data with simple points and it went through them fine, so I am wondering if there is something with how I am passing the geodataframe to the function. Since I am trying to compare each point to the one after it, I am using the index to keep the order. Could that be the issue?
for mmsi in points_gdf.MMSI.unique():
    track_pts = points_gdf[(points_gdf.MMSI == mmsi)].sort_values(['POSITION_UTC_DATE']).reset_index()
    print(track_pts.shape[0])
    for index, row in track_pts.iterrows():
        if index + 1 < track_pts.shape[0]:
            set_pts = track_pts[(track_pts.index == index) | (track_pts.index == index + 1)]
            print(set_pts)
            print(getdist(set_pts))
        else:
            sys.exit()
I notice the index header, but when I look at the data in QGIS there is no index column; the first column is OBJECTID, and the data is stored in a file geodatabase. Could the index column be causing my issue?
The KeyError comes from the hard-coded label lookup in getdist: pt_pair.distance(shift_pt)[1] selects the row with index label 1, which exists only in the first pair. The pair (1, 2) returns the nan at label 1 (the shifted-in row), and the pair (2, 3) has no label 1 at all, hence KeyError: 1. Instead of looping through each pair of points, do this once:
dist_to_next_point = track_pts.distance(track_pts.shift()).dropna()
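A minimal sketch of how this could slot into the per-MMSI loop from the question (column names come from the question; the stats step is left as a hypothetical placeholder):
from shapely.geometry import LineString

for mmsi in points_gdf.MMSI.unique():
    track_pts = (points_gdf[points_gdf.MMSI == mmsi]
                 .sort_values('POSITION_UTC_DATE')
                 .reset_index(drop=True))
    # Distance from each point to the one before it; the first value is NaN
    # because nothing is shifted into the first row, so drop it.
    dist_to_next_point = track_pts.geometry.distance(track_pts.geometry.shift()).dropna()
    # Zero distance means the pair occupies the same location, so skip those.
    moved = dist_to_next_point[dist_to_next_point > 0]
    for idx, dist in moved.items():
        line = LineString([track_pts.geometry[idx - 1], track_pts.geometry[idx]])
        # summarize_stats(line, dist)  # hypothetical stats step
Since the CRS is projected with metre units, dist is already in metres.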
I am making a project for a class, and I am trying to predict NFL game scores using linear regression and the predict function from sklearn. My problem comes when I want to fit the training data with the fit function. Here is my code:
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
# Create the linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
This is the structure of the dataframe (goal_model_data):
team opponent goals home
NE KC 27 1
BUF NYJ 21 1
CHI ATL 17 1
CIN BAL 0 1
CLE PIT 18 1
DET ARI 35 1
HOU JAX 7 1
TEN OAK 16 1
And this is the error that I get when I run the program:
Traceback (most recent call last):
File "predictnflgames.py", line 76, in <module>
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['team' 'opponent'] not in index"
The problem is that after pd.get_dummies there are no team and opponent columns.
I use this data in txt format for my example: https://ufile.io/e2vtv (same as yours).
Try this and see:
import pandas as pd
from sklearn.linear_model import LinearRegression
goal_model_data = pd.read_table('goal_model_data.txt', delim_whitespace=True)
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
regr = LinearRegression()
#see the columns in onehotdata_x1
onehotdata_x1.columns
#see the data (only 2 rows of the data for the example)
onehotdata_x1.head(2)
Results:
Index([u'goals', u'home', u'team_BUF', u'team_CHI', u'team_CIN', u'team_CLE',
u'team_DET', u'team_HOU', u'team_NE', u'team_TEN', u'opponent_ARI',
u'opponent_ATL', u'opponent_BAL', u'opponent_JAX', u'opponent_KC',
u'opponent_NYJ', u'opponent_OAK', u'opponent_PIT'],
dtype='object')
goals home team_BUF team_CHI team_CIN team_CLE team_DET team_HOU \
0 27 1 0 0 0 0 0 0
1 21 1 1 0 0 0 0 0
team_NE team_TEN opponent_ARI opponent_ATL opponent_BAL opponent_JAX \
0 1 0 0 0 0 0
1 0 0 0 0 0 0
opponent_KC opponent_NYJ opponent_OAK opponent_PIT
0 1 0 0 0
1 0 1 0 0
EDIT 1
Based on the original code, you might want to do something like the following:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_table('data.txt', delim_whitespace=True)
onehotdata = pd.get_dummies(data,columns=['team','opponent'])
regr = LinearRegression()
#in x get all columns except goals column
x = onehotdata.loc[:, onehotdata.columns != 'goals']
#use goals column as target variable
y = onehotdata['goals']
regr.fit(x,y)
regr.predict(x)
Hope this helps.
When you use pd.get_dummies(goal_model_data, columns=['team','opponent']), the team and opponent columns are dropped from your dataframe, so onehotdata_x1 won't contain these two columns.
Then, when you do onehotdata_x1[['home','team','opponent']], you get a KeyError simply because team and opponent do not exist as columns in the onehotdata_x1 dataframe.
Using a toy dataframe, here's what happens (a minimal sketch; the two-row frame below is illustrative):
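import pandas as pd

toy = pd.DataFrame({'team': ['NE', 'BUF'],
                    'opponent': ['KC', 'NYJ'],
                    'goals': [27, 21],
                    'home': [1, 1]})

dummies = pd.get_dummies(toy, columns=['team', 'opponent'])
print(dummies.columns.tolist())
# ['goals', 'home', 'team_BUF', 'team_NE', 'opponent_KC', 'opponent_NYJ']
# 'team' and 'opponent' are gone, replaced by the one-hot columns, so
# dummies[['home', 'team', 'opponent']] raises a KeyError.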
I want to perform some operations on a pandas data frame that is split into chunks. After splitting the data frame, I then try to iterate over the chunks, but after the first iteration runs well, I get an error (see below). I have gone through some questions like these: 1 and 2, but they don't quite address my issue. Kindly help me resolve this, as I don't fully understand it.
import pandas as pd

tupList = [('Eisenstadt', 'Paris','1', '2'), ('London', 'Berlin','1','3'), ('Berlin', 'stuttgat','1', '4'),
           ('Liverpool', 'Southampton','1', '5'), ('Tirana', 'Blackpool', '1', '6'), ('blackpool', 'tirana','1','7'),
           ('Paris', 'Lyon','1','8'), ('Manchester', 'Nice','1','10'), ('Orleans', 'Madrid','1', '12'),
           ('Lisbon','Stockholm','1','12')]
cities = pd.DataFrame(tupList, columns=['Origin', 'Destination', 'O_Code', 'D_code'])

# purpose - splits the DataFrame into smaller DataFrames of max size chunkSize (last is smaller)
def splitDataFrameIntoSmaller(df, chunkSize=3):
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf

citiesChunks = splitDataFrameIntoSmaller(cities)
for ind, cc in enumerate(citiesChunks):
    cc["distance"] = 0
    cc["time"] = 0
    for i in xrange(len(cc)):
        al = cc['Origin'][i]
        bl = cc['Destination'][i]
        '...' #truncating to make it readable
    cc.to_csv('out.csv', sep=',', encoding='utf-8')
Traceback (most recent call last):
File ..., line 39, in <module>
al = cc['Origin'][i]
File ..., line 603, in __getitem__
result = self.index.get_value(self, key)
File ..., line 2169, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\index.pyx", line 98, in pandas.index.IndexEngine.get_value (pandas\index.c:3557)
File "pandas\index.pyx", line 106, in pandas.index.IndexEngine.get_value (pandas\index.c:3240)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8564)
File "pandas\src\hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8508)
KeyError: 0L
Your KeyError happens because each chunk keeps its original index labels, so label 0 does not exist in any chunk after the first. You can first floor-divide the index values and then use a list comprehension: loop over the unique values and select with loc, and finally reset_index to remove the duplicated index:
cities.index = cities.index // 3
print (cities)
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
0 London Berlin 1 3
0 Berlin stuttgat 1 4
1 Liverpool Southampton 1 5
1 Tirana Blackpool 1 6
1 blackpool tirana 1 7
2 Paris Lyon 1 8
2 Manchester Nice 1 10
2 Orleans Madrid 1 12
3 Lisbon Stockholm 1 12
citiesChunks = [cities.loc[[x]].reset_index(drop=True) for x in cities.index.unique()]
#print (citiesChunks)
print (citiesChunks[0])
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
1 London Berlin 1 3
2 Berlin stuttgat 1 4
Finally, use iterrows if you need to loop over each DataFrame:
#write columns to file first
cols = ['Origin', 'Destination', 'O_Code', 'D_code', 'distance', 'time']
df = pd.DataFrame(columns=cols)
df.to_csv('out.csv', encoding='utf-8', index=False)

for ind, cc in enumerate(citiesChunks):
    cc["distance"] = 0
    cc["time"] = 0
    for i, val in cc.iterrows():
        al = cc.loc[i, 'Origin']
        bl = cc.loc[i, 'Destination']
        '...' #truncating to make it readable
    cc.to_csv('out.csv', encoding='utf-8', mode='a', header=None, index=False)
print (cc.to_csv(encoding='utf-8'))
,Origin,Destination,O_Code,D_code,distance,time
0,Eisenstadt,Paris,1,2,0,0
1,London,Berlin,1,3,0,0
2,Berlin,stuttgat,1,4,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Liverpool,Southampton,1,5,0,0
1,Tirana,Blackpool,1,6,0,0
2,blackpool,tirana,1,7,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Paris,Lyon,1,8,0,0
1,Manchester,Nice,1,10,0,0
2,Orleans,Madrid,1,12,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Lisbon,Stockholm,1,12,0,0
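For reference, an alternative fix that keeps the original loop is to reset each chunk's index, since the slices returned by splitDataFrameIntoSmaller keep their original labels and that is exactly what triggers KeyError: 0L on the second chunk. A minimal sketch against the same cities frame:
citiesChunks = [chunk.reset_index(drop=True)
                for chunk in splitDataFrameIntoSmaller(cities)]

for ind, cc in enumerate(citiesChunks):
    cc["distance"] = 0
    cc["time"] = 0
    for i in xrange(len(cc)):    # range on Python 3
        al = cc['Origin'][i]     # labels 0..len(cc)-1 now exist in every chunk
        bl = cc['Destination'][i]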
I want to pass each cell of a column in a dataframe to a function which then creates a new cell.
I've looked here and here, but these don't address my issue.
I'm using an obscure package, so I'll simplify the method using the base packages to ask the question; hopefully the issue will be clear.
Method:
Load the data
import pandas as pd
import numpy as np
import math
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
Pass the values of one column to a variable
lat = df['A']
Create a new column by applying the function to the variable
df['sol'] = df.apply(math.sqrt(lat))
This gives the error
TypeError: cannot convert the series to <type 'float'>
The error I'm getting using the pyeto package is actually
Traceback (most recent call last):
File "<ipython-input-10-b160408e9808>", line 1, in <module>
data['sol_dec'] = data['dayofyear'].apply(pyeto.sol_dec(data['dayofyear']), axis =1) # Solar declination
File "build\bdist.win-amd64\egg\pyeto\fao.py", line 580, in sol_dec
_check_doy(day_of_year)
File "build\bdist.win-amd64\egg\pyeto\_check.py", line 36, in check_doy
if not 1 <= doy <= 366:
File "C:\Users\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think the issue is the same in both cases: the function will not apply to every cell in the dataframe column, and it produces an error.
I want to be able to apply a function to each cell of a dataframe column (i.e. get the square root of each cell in column 'A'), then store the result as a variable or another column in the dataframe (i.e. have a 'sqrtA' column), then apply a function to that variable or column, and so on (i.e. have a new column which is 'sqrtA*100').
I can't figure out how to do this and would really appreciate guidance.
EDIT
@EdChum's answer df['A'].apply(math.sqrt), or data['dayofyear'].apply(pyeto.sol_dec) for the package function, helped a lot.
I'm now having issues with another function in the package which takes multiple arguments:
sha = pyeto.sunset_hour_angle(lat, sol_dec)
This function doesn't apply to a dataframe column, and I have lat and sol_dec stored as Series variables, but when I try to create a new column in the dataframe using these, like so:
data['sha'] = pyeto.sunset_hour_angle(lat, sol_dec)
I get the same error as before.
Attempting to apply the function to multiple columns:
data['sha'] = data[['lat'],['sol_dec']].apply(pyeto.sunset_hour_angle)
gives the error:
Traceback (most recent call last):
File "<ipython-input-28-7b603745af93>", line 1, in <module>
data['sha'] = data[['lat'],['sol_dec']].apply(pyeto.sunset_hour_angle)
File "C:\Users\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.py", line 1969, in __getitem__
return self._getitem_column(key)
File "C:\Users\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\frame.py", line 1976, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\pflattery\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\core\generic.py", line 1089, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type: 'list'
Use np.sqrt, as this understands arrays:
In [86]:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['sol'] = np.sqrt(df['A'])
df
Out[86]:
A B C D sol
0 52 38 4 71 7.211103
1 59 4 36 15 7.681146
2 37 28 33 73 6.082763
3 58 26 4 96 7.615773
4 31 48 47 78 5.567764
5 43 58 45 4 6.557439
6 69 35 27 39 8.306624
.. .. .. .. .. ...
98 42 6 40 36 6.480741
99 22 44 11 24 4.690416
[100 rows x 5 columns]
To apply a function you can do:
In [87]:
import math
df['A'].apply(math.sqrt)
Out[87]:
0 7.211103
1 7.681146
2 6.082763
3 7.615773
4 5.567764
5 6.557439
6 8.306624
7 7.483315
8 7.071068
9 9.486833
...
95 3.464102
96 6.855655
97 5.385165
98 6.480741
99 4.690416
Name: A, dtype: float64
What you tried was to pass a Series to math.sqrt, but math.sqrt doesn't understand non-scalar values, hence the error. Also, you should avoid using apply when a vectorised method exists, as the vectorised method will be faster. For a 10K-row df:
In [90]:
%timeit df['A'].apply(math.sqrt)
%timeit np.sqrt(df['A'])
100 loops, best of 3: 2.15 ms per loop
10000 loops, best of 3: 99.7 µs per loop
Here you can see that the numpy version is ~22x faster.
With respect to what you're trying to do, the following should work:
data['dayofyear'].apply(pyeto.sol_dec)
Edit
To pass multiple columns as args to a method:
data.apply(lambda x: pyeto.sunset_hour_angle(x['lat'],x['sol_dec']), axis=1)
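Putting it together, a minimal sketch of the whole chain the question describes (two_arg_fn is a hypothetical stand-in for a scalar two-argument function like pyeto.sunset_hour_angle):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# Step 1: derive a column from each cell of 'A' with a vectorised function.
df['sqrtA'] = np.sqrt(df['A'])

# Step 2: derive a further column from the first derived one.
df['sqrtA_x100'] = df['sqrtA'] * 100

# Step 3: apply a two-argument scalar function row-wise, as in the edit above.
def two_arg_fn(a, b):  # hypothetical stand-in for pyeto.sunset_hour_angle
    return a + b

df['combined'] = df.apply(lambda x: two_arg_fn(x['sqrtA'], x['sqrtA_x100']), axis=1)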