Pandas groupby based on column value

Pandas groupby based on column value - python

I have following dataframe - dfgeo:
x y z zt n k pv geometry dist
0 6574878.210 4757530.610 1152.588 1 8 4 90 POINT (6574878.210 4757530.610) 0.000000
1 6574919.993 4757570.314 1174.724 0 POINT (6574919.993 4757570.314) 57.638760
2 6575020.518 4757665.839 1177.339 0 POINT (6575020.518 4757665.839) 138.673362
3 6575239.548 4757873.972 1160.156 1 8 4 90 POINT (6575239.548 4757873.972) 302.148120
4 6575351.603 4757980.452 1202.418 0 POINT (6575351.603 4757980.452) 154.577856
5 6575442.780 4758067.093 1199.297 0 POINT (6575442.780 4758067.093) 125.777217
6 6575538.217 4758157.782 1192.914 1 8 4 90 POINT (6575538.217 4758157.782) 131.653772
7 6575594.625 4758240.033 1217.442 0 POINT (6575594.625 4758240.033) 99.735096
8 6575738.820 4758450.289 1174.477 0 POINT (6575738.820 4758450.289) 254.950551
9 6575850.937 4758613.772 1123.852 1 8 4 90 POINT (6575850.937 4758613.772) 198.234490
10 6575984.323 4758647.118 1131.761 0 POINT (6575984.323 4758647.118) 137.491020
11 6576204.312 4758702.115 1119.407 0 POINT (6576204.312 4758702.115) 226.759410
12 6576303.976 4758727.031 1103.064 0 POINT (6576303.976 4758727.031) 102.731300
13 6576591.496 4758798.910 1114.06 0 POINT (6576591.496 4758798.910) 296.368590
14 6576736.965 4758835.277 1120.285 1 8 4 90 POINT (6576736.965 4758835.277) 149.945952
I am trying to group by zt values an summarize dist column. I have tried this:
def summarize(group):
s = group['zt'].eq(1).cumsum()
return group.groupby(s).agg(
D=('dist', 'sum')
)
dfzp=dfgeo.apply(summarize)
But i get following errors on last line of code
s = group['zt'].eq(1).cumsum()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 135, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index_class_helper.pxi", line 109, in pandas._libs.index.Int64Engine._check_type
KeyError: 'zt'
Any help in resolving this appreciated.

If need pass Dataframe to function use:
dfzp=summarize(dfgeo)
Or DataFrame.pipe:
dfzp=dfgeo.pipe(summarize)
If use DataFrame.apply then is used function per columns or per rows if axis=1.

Related

Finding the distance between 2 points fails

I'm cycling through points in a geodataframe by index in such away where I am comparing index 0 and 1, then 1 and 2, then 3 and 4 etc... The purpose is to compare the 2 points. if the points are occupying the same location pass, else draw a line between the 2 points and summarize some stats. I figured if I compared the distance between the 2 points and got 0 then that would be skipped. What I have done before was to pass the 2 points in a single geodataframe into a function that returns a value for distance. They are in a projected crs units metres.
def getdist(pt_pair):
shift_pt = pt_pair.shift()
return pt_pair.distance(shift_pt)[1]
When I pass my 2 points to the function the first 2 return 0.0 the next return nan then I get this error.
Traceback (most recent call last):
File "C:/.../PycharmProjects/.../vessel_track_builder.py", line 33, in <module>
print(getdist(set_pts))
File "C:/.../PycharmProjects/.../vessel_track_builder.py", line 19, in getdist
if math.isnan(mdist1.distance(shift_pt)[1]):
File "C:\OSGEO4~1\apps\Python37\lib\site-packages\pandas\core\series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "C:\OSGEO4~1\apps\Python37\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 997, in
pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1004, in
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 1
Process finished with exit code 1
Thought this might be an error in the point geometry so I added an if nan return 0 to the function.
def getdist(pt_pair):
shift_pt = pt_pair.shift()
if math.isnan(pt_pair.distance(shift_pt)[1]):
return 0
else:
return pt_pair.distance(shift_pt)[1]
The result is 0.0, 0, then the aforementioned error.
I added a print statement of my geodataframes but didn't see anything out of the ordinary.
index ... MMSI MONTH geometry
0 92 ... 123 4 POINT (2221098.494 1668358.870)
1 39 ... 123 4 POINT (2221098.494 1668358.870)
[2 rows x 12 columns]
index ... MMSI MONTH geometry
1 39 ... 456 4 POINT (2221098.494 1668358.870)
2 3231 ... 456 4 POINT (2221098.494 1668358.870)
[2 rows x 12 columns]
index ... MMSI MONTH geometry
2 3231 ... 789 4 POINT (2221098.494 1668358.870)
3 1032 ... 789 4 POINT (2221098.494 1668358.870)
I tried it on some test data with simple points and it went through them fine so I am wondering if there is something with how I am passing the geodataframe to the function. Since I am trying to compare each point to the one after it I am using the index to keep the order, could that be the issue?
for mmsi in points_gdf.MMSI.unique():
track_pts = points_gdf[(points_gdf.MMSI == mmsi)].sort_values(['POSITION_UTC_DATE']).reset_index()
print(track_pts.shape[0])
for index, row in track_pts.iterrows():
if index + 1 < track_pts.shape[0]:
set_pts = track_pts[(track_pts.index == index) | (track_pts.index == index + 1)]
print(set_pts)
print(getdist(set_pts))
else:
sys.exit()
I am noticing the index header which when I look at the data in QGIS there is no index column the first column is OBJECTID and the data is stored in a filegeodatabase. Could the index column be causing me the issue?

Instead of looping through each pair of points, do this once:
dist_to_next_point = track_pts.distance(track_pts.shift()).dropna()

Pandas apply getting KeyError: [duplicate]

This question already has an answer here:
Why do I get a KeyError when using pandas apply?
(1 answer)
Closed 13 days ago.
I was looking at this answer by Roman Pekar for using apply. I initially copied the code exactly and it worked fine. Then I used it on my df3 that is created from a csv file and I got a KeyError. I checked datatypes the columns I was using are int64, so that is okay. I don't have nulls. If I can get this working then I will make the function more complex. How do I get this working?
def fxy(x, y):
return x * y
df3 = pd.read_csv(path + 'test_data.csv', usecols=[0,1,2])
print(df3.dtypes)
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
Trace back
Traceback (most recent call last):
File "f:\...\my_file.py", line 54, in <module>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\...\apply.py", line 727, in apply
return self.apply_standard()
File "C:\...\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\...\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "f:\...\my_file.py", line 54, in <lambda>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\...\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\...\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'Len'
I don't see a way to attach the csv file. Below is Sample df3 if I save the below with excel as "CSV (Comma delimited)(*.csv) I get the same results.
ID
Len
Width
A
170
4
B
362
5
C
12
15
D
42
7
E
15
3
F
46
49
G
71
74

I think you miss the axis=1 on apply:
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
But in your case, you can just do:
df3['Area'] = df3['Len'] * df3['Width']
print(df3)
# Output
ID Len Width Area
0 A 170 4 680
1 B 362 5 1810
2 C 12 15 180
3 D 42 7 294
4 E 15 3 45
5 F 46 49 2254
6 G 71 74 5254

Python convert conditional array to dataframe

I have a dataframe with the following information:
ticker date close gap
0 BHP 1981-07-31 0.945416 -0.199458
1 BHP 1981-08-31 0.919463 -0.235930
2 BHP 1981-09-30 0.760040 -0.434985
3 BHP 1981-10-30 0.711842 -0.509136
4 BHP 1981-11-30 0.778578 -0.428161
.. ... ... ... ...
460 BHP 2019-11-29 38.230000 0.472563
461 BHP 2019-12-31 38.920000 0.463312
462 BHP 2020-01-31 39.400000 0.459691
463 BHP 2020-02-28 33.600000 0.627567
464 BHP 2020-03-31 28.980000 0.784124
I developed the following code to find where the rows are when it crosses 0:
zero_crossings =np.where(np.diff(np.sign(BHP_data['gap'])))[0]
This returns:
array([ 52, 54, 57, 75, 79, 86, 93, 194, 220, 221, 234, 235, 236,
238, 245, 248, 277, 379, 381, 382, 383, 391, 392, 393, 395, 396],
dtype=int64)
I need to be able to do the following:
calculate the number of months between points where 'gap' crosses 0
remove items where the number of months is <12
average the remaining months
However, I don't know how to turn this nd.array into something useful that I can make the calculations from. When I try:
pd.DataFrame(zero_crossings)
I get the following df, which only returns the index:
0
0 52
1 54
2 57
3 75
4 79
5 86
.. ..
Please help...

Just extended your code a bit to get the zero_crossings into the original dataframe as required.
import pandas as pd
import numpy as np
BHP_data = pd.DataFrame({'gap': [-0.199458, 0.472563, 0.463312, 0.493318, -0.509136, 0.534985, 0.784124]})
BHP_data['zero_crossings'] = 0
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
print(zero_crossings) # [0 3 4]
# Updates the column to 1 based on the 0 crossing
BHP_data.loc[zero_crossings, 'zero_crossings'] = 1
print(BHP_data)
Output
gap zero_crossings
0 -0.199458 1
1 0.472563 0
2 0.463312 0
3 0.493318 1
4 -0.509136 1
5 0.534985 0
6 0.784124 0

How to draw a boxplot correctly by seaborn in python3.x

I've loaded a csv file, and printed correctly, but I get an error when drawing boxplot with a Series.
Loaded my data and printed correctly
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data2 = pd.read_csv(...)
print(data2)
ax = sns.boxplot(x=data2['2'])
plt.show()
and the formation of my datas are followed:
0 1 2 3 4 5 6 7 ... 29 30 31 32 33 34 35 36
0 2016-06-06 04:07:42 0 26.0 0 1 101 0 0 ... 0 0 0 0 0 0 0
1 2016-06-08 12:34:10 0 25.0 0 1 101 0 0 ... 0 0 0 0 0 0 0
....
I want to draw a boxplot with the 2 columns (26.0、25.0), but I got this error:
Traceback (most recent call last):
File "D:\Python-Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/work/fLUTE/Solve-52/练习/sns练习/boxplot.py", line 16, in
ax = sns.boxplot(x=data2['2'])
File "D:\Python-Anaconda\lib\site-packages\pandas\core\frame.py", line 2927, in getitem
indexer = self.columns.get_loc(key)
File "D:\Python-Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2'
When changing
ax = sns.boxplot(x=data2['2'])
to
ax = sns.boxplot(x=data2[2])
another error occurs:
TypeError: cannot perform reduce with flexible type

First, change ax = sns.boxplot(x=data2['2']) to ax = sns.boxplot(x=data2[2])
Second, add such codes data2[2] = data2[2].astype(float)

Pandas KeyError when working on split data frame

I want to perform some operations on a pandas data frame that is split into chunks. After splitting the data frame, I then try to iterate over the chunks, but after the first iterations runs well, I get an error (see below). I have gone through some questions like these: 1 and 2 but they don't quite address my issue. Kindly help me resolve this as I don't fully understand it.
import pandas as pd
tupList = [('Eisenstadt', 'Paris','1', '2'), ('London', 'Berlin','1','3'), ('Berlin', 'stuttgat','1', '4'),
('Liverpool', 'Southampton','1', '5'),('Tirana', 'Blackpool', '1', '6'),('blackpool', 'tirana','1','7'),
('Paris', 'Lyon','1','8'), ('Manchester', 'Nice','1','10'),('Orleans', 'Madrid','1', '12'),
('Lisbon','Stockholm','1','12')]
cities = pd.DataFrame(tupList, columns=['Origin', 'Destination', 'O_Code', 'D_code'])
# purpose - splits the DataFrame into smaller of max size chunkSize (last is smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 3):
listOfDf = list()
numberChunks = len(df) // chunkSize + 1
for i in range(numberChunks):
listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
return listOfDf
citiesChunks = splitDataFrameIntoSmaller(cities)
for ind, cc in enumerate(citiesChunks):
cc["distance"] = 0
cc["time"] = 0
for i in xrange(len(cc)):
al = cc['Origin'][i]
bl = cc['Destination'][i]
'...' #trucating to make it readable
cc.to_csv('out.csv', sep=',', encoding='utf-8')
Traceback (most recent call last):
File ..., line 39, in <module>
al = cc['Origin'][i]
File ..., line 603, in __getitem__
result = self.index.get_value(self, key)
File ..., line 2169, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\index.pyx", line 98, in pandas.index.IndexEngine.get_value (pandas\index.c:3557)
File "pandas\index.pyx", line 106, in pandas.index.IndexEngine.get_value (pandas\index.c:3240)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8564)
File "pandas\src\hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8508)
KeyError: 0L

You can first floor divide index values and then use list comprehension - loop by unique values and select by loc, last reset_index for remove duplicated index:
cities.index = cities.index // 3
print (cities)
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
0 London Berlin 1 3
0 Berlin stuttgat 1 4
1 Liverpool Southampton 1 5
1 Tirana Blackpool 1 6
1 blackpool tirana 1 7
2 Paris Lyon 1 8
2 Manchester Nice 1 10
2 Orleans Madrid 1 12
3 Lisbon Stockholm 1 12
citiesChunks = [cities.loc[[x]].reset_index(drop=True) for x in cities.index.unique()]
#print (citiesChunks)
print (citiesChunks[0])
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
1 London Berlin 1 3
2 Berlin stuttgat 1 4
Last need iterrows if need loop in DataFrame:
#write columns to file first
cols = ['Origin', 'Destination', 'O_Code', 'D_code', 'distance', 'time']
df = pd.DataFrame(columns=cols)
df.to_csv('out.csv', encoding='utf-8', index=False)
for ind, cc in enumerate(citiesChunks):
cc["distance"] = 0
cc["time"] = 0
for i, val in cc.iterrows():
al = cc.loc[i, 'Origin']
bl = cc.loc[i, 'Destination']
'...' #trucating to make it readable
cc.to_csv('out.csv', encoding='utf-8', mode='a', header=None, index=False)
print (cc.to_csv(encoding='utf-8'))
,Origin,Destination,O_Code,D_code,distance,time
0,Eisenstadt,Paris,1,2,0,0
1,London,Berlin,1,3,0,0
2,Berlin,stuttgat,1,4,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Liverpool,Southampton,1,5,0,0
1,Tirana,Blackpool,1,6,0,0
2,blackpool,tirana,1,7,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Paris,Lyon,1,8,0,0
1,Manchester,Nice,1,10,0,0
2,Orleans,Madrid,1,12,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Lisbon,Stockholm,1,12,0,0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas groupby based on column value - python

If need pass Dataframe to function use: dfzp=summarize(dfgeo) Or DataFrame.pipe: dfzp=dfgeo.pipe(summarize) If use DataFrame.apply then is used function per columns or per rows if axis=1.

Related

Finding the distance between 2 points fails

Pandas apply getting KeyError: [duplicate]

Python convert conditional array to dataframe

How to draw a boxplot correctly by seaborn in python3.x

Pandas KeyError when working on split data frame

Categories

Resources