Is there a way to identify which GPS coordinates represent the same location? E.g., given the following DataFrame, how can I tell that Ids 1 and 5 are from the same source location?
+-----+--------------+-------------+
| Id | VehLat | VehLong |
+-----+--------------+-------------+
| 66 | 63.3917005 | 10.4264724 |
| 286 | 63.429603 | 10.4167367 |
| 61 | 33.6687838 | 73.0755573 |
| 67 | 63.4150316 | 10.3980401 |
| 5 | 64.048128 | 10.083776 |
| 8 | 63.4332386 | 10.3971859 |
| 9 | 63.4305769 | 10.3927124 |
| 6 | 63.4293578 | 10.4164764 |
| 1 | 64.048254 | 10.084230 |
+-----+--------------+-------------+
Now, Ids 5 and 1 are basically the same location, but what's the best approach to classify these two locations as the same?
IIUC, you need this.
df[['VehLat','VehLong']].round(3).duplicated(keep=False)
You can change the number passed to round to adjust what you consider "the same".
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
If you want the df itself with the duplicate rows, you can do either of the below:
df[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
OR
df.loc[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
Output
id VehLat VehLong
1 286 63.429603 10.416737
4 5 64.048128 10.083776
7 6 63.429358 10.416476
8 1 64.048254 10.084230
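If you also want a shared label per location, one possible follow-up sketch (the LocGroup column name is just illustrative, not from the answer above) groups on the rounded coordinates:
df['LocGroup'] = (df[['VehLat', 'VehLong']]
                  .round(3)
                  .groupby(['VehLat', 'VehLong'], sort=False)
                  .ngroup())
# rows sharing a LocGroup value (e.g. Ids 5 and 1) count as the same location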
Use DataFrame.sort_values + Series.between:
This gives you greater flexibility when establishing the criteria for considering two coordinates equivalent:
# sort so that nearly identical coordinates end up on adjacent rows
df2 = df[['VehLat', 'VehLong']].sort_values(['VehLong', 'VehLat'])
# a row is "equivalent" to the previous one if both lat and long differ by at most 0.001
eq = df2.apply(lambda x: x.diff().between(-0.001, 0.001)).all(axis=1)
# keep rows that are close to either the previous or the next row
df2[eq | eq.shift(-1)]
This returns a DataFrame with the equivalent coordinates:
VehLat VehLong
4 64.048128 10.083776
8 64.048254 10.084230
7 63.429358 10.416476
1 63.429603 10.416737
df2[~(eq|eq.shift(-1))]
This returns the unique coordinates:
VehLat VehLong
6 63.430577 10.392712
5 63.433239 10.397186
3 63.415032 10.398040
0 63.391700 10.426472
2 33.668784 73.075557
You can restore the original order using DataFrame.sort_index:
df_noteq=df2[~(eq|eq.shift(-1))].sort_index()
print(df_noteq)
VehLat VehLong
0 63.391700 10.426472
2 33.668784 73.075557
3 63.415032 10.398040
5 63.433239 10.397186
6 63.430577 10.392712
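If you would rather express the tolerance in metres instead of degrees, a rough sketch (an extra illustration, not part of either answer above) using the haversine formula with plain numpy could look like this:
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in metres between two lat/long points
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * np.arcsin(np.sqrt(a))

# Ids 5 and 1 from the question come out a few tens of metres apart,
# so a threshold such as 100 m would classify them as the same location
print(haversine_m(64.048128, 10.083776, 64.048254, 10.084230))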
I have a CSV which has data that looks like this:
| id | code | date                       |
|----|------|----------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00::00 |
| 1  | 0    | 2022-11-05 02:22:35+00::00 |
| 2  | 3    | 2021-01-05 10:10:15+00::00 |
| 2  | 0    | 2019-01-11 10:05:21+00::00 |
| 2  | 1    | 2022-01-11 10:05:22+00::00 |
| 3  | 2    | 2022-10-10 11:23:43+00::00 |
I want to remove duplicate ids based on the following conditions:
For the code column, choose the value which is not equal to 0, and among those choose the one with the latest timestamp.
Add another column, prev_code, which contains a list of all the remaining code values that did not end up in the code column.
Something like this:
| id | code | prev_code |
|----|------|-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0,2]     |
| 3  | 2    | []        |
There is probably a sleeker solution but something along the following lines should work.
import pandas as pd

df = pd.read_csv('file.csv')
# for each id, the non-zero code(s) on the latest date
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])
# for each id, every code that is not the kept one
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']
pd.DataFrame({'id': [idx[0] for idx in lastcode.index], 'code': lastcode.values, 'prev_code': prev_codes.values})
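A possible alternative sketch (just one way to do it, assuming the columns are literally id, code and date, and that the ISO-like date strings sort chronologically when compared as plain strings):
import pandas as pd

df = pd.read_csv('file.csv')

# for each id, keep the non-zero code with the latest timestamp
kept = (df[df['code'] != 0]
        .sort_values('date')
        .groupby('id', as_index=False)
        .last()[['id', 'code']])

# gather every other code of the same id into prev_code
prev = (df.merge(kept, on='id', suffixes=('', '_kept'))
          .loc[lambda d: d['code'] != d['code_kept']]
          .groupby('id')['code'].agg(list)
          .rename('prev_code')
          .reset_index())

out = kept.merge(prev, on='id', how='left')
# ids with nothing left over get an empty list instead of NaN
out['prev_code'] = out['prev_code'].apply(lambda v: v if isinstance(v, list) else [])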
I have two DataFrames. One of them has data about railways and coordinates; in the other I have the city code and its coordinates. These coordinates don't match perfectly, so I need to calculate the difference between all the coordinates of DataFrame b and the lines from DataFrame a, and choose the city code with the smallest difference.
Dataframe a:
| FROMNODENO | TONODENO | LON | LAT |
| 3 | 4 | -46.720863 | -23.653625 |
| 3 | 5 | -46.868323 | -23.270917 |
| 4 | 6 | -46.869839 | -23.274121 |
Dataframe b:
| COD | LON | LAT |
| 5200050 | -16.75730 | -49.4412 |
| 3100104 | -18.48310 | -47.3916 |
| 5200100 | -16.19700 | -48.7057 |
I need the final dataframe to be something like this:
| FROMNODENO | TONODENO | LON | LAT | COD |
| 3 | 4 | -46.720863 | -23.653625 | 5200050 |
I imagine I need to do a for loop, but I don't know how I can do that.
You can use a package like geopandas to solve this problem efficiently. However, if you can't or don't want to install another third-party dependency, then you can:
cross join these DataFrames.
calculate the abs LAT/LON distance for each.
then filter that data down to the minimum for each node.
print(cities)
COD LON LAT
0 5200050 -16.7573 -49.4412
1 3100104 -18.4831 -47.3916
2 5200100 -16.1970 -48.7057
print(nodes)
FROMNODENO TONODENO LON LAT
0 3 4 -46.720863 -23.653625
1 3 5 -46.868323 -23.270917
2 4 6 -46.869839 -23.274121
out = (
    # pair every city with every node
    pd.merge(cities, nodes, how="cross", suffixes=("_city", "_node"))
    # absolute lat/lon difference (in degrees) for each city/node pair
    .eval("combined_abs_dist = abs(LON_city - LON_node) + abs(LAT_city - LAT_node)")
    # for each node, keep only the row with the smallest difference
    .loc[lambda df:
        df.groupby(["FROMNODENO", "TONODENO"])["combined_abs_dist"].idxmin()
    ]
)
print(out)
COD LON_city LAT_city FROMNODENO TONODENO LON_node LAT_node combined_abs_dist
3 3100104 -18.4831 -47.3916 3 4 -46.720863 -23.653625 51.975738
4 3100104 -18.4831 -47.3916 3 5 -46.868323 -23.270917 52.505906
5 3100104 -18.4831 -47.3916 4 6 -46.869839 -23.274121 52.504218
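For completeness, here is a sketch of the geopandas route mentioned at the top (an illustration only; it assumes geopandas >= 0.10 for sjoin_nearest and uses the cities/nodes frames shown above):
import geopandas as gpd

nodes_gdf = gpd.GeoDataFrame(
    nodes, geometry=gpd.points_from_xy(nodes.LON, nodes.LAT), crs="EPSG:4326")
cities_gdf = gpd.GeoDataFrame(
    cities, geometry=gpd.points_from_xy(cities.LON, cities.LAT), crs="EPSG:4326")

# attach the nearest city to each node; the distance is computed on raw
# lon/lat degrees here, so project to a metric CRS if you need true metres
nearest = gpd.sjoin_nearest(nodes_gdf, cities_gdf, how="left", distance_col="dist")
print(nearest[["FROMNODENO", "TONODENO", "COD", "dist"]])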
I have a column in a dataframe as follows:
| Category |
------------
| B5050.88
| 5051.90
| B5050.97Q
| 5051.23B
| 5051.78E
| B5050.11
| 5051.09
| Z5052
I want to extract the text after the period. For example, from B5050.88 I want only "88"; from 5051.78E I want only "78E"; for Z5052 it would be nothing, as there's no period.
Expected output:
| Category | Digits |
---------------------
| B5050.88 | 88 |
| 5051.90 | 90 |
| B5050.97Q| 97Q |
| 5051.23B | 23B |
| 5051.78E | 78E |
| B5050.11 | 11 |
| 5051.09 | 09 |
| Z5052 | - |
I tried using this
df['Digits'] = df.Category.str.extract('.(.*)')
But I'm not getting the right answer. Using the above, for B5050.88, I'm getting the same B5050.88; for 5051.09, I'm getting NaN. Basically NaN if there's no text.
You can do
splits = [str(p).split(".") for p in df["Category"]]
df["Digits"] = [p[1] if len(p)>1 else "-" for p in splits]
i.e.
df = pd.DataFrame({"Category":["5050.88","5051.90","B5050.97","5051.23B","5051.78E",
"B5050.11","5051.09","Z5052"]})
#df
# Category
# 0 5050.88
# 1 5051.90
# 2 B5050.97
# 3 5051.23B
# 4 5051.78E
# 5 B5050.11
# 6 5051.09
# 7 Z5052
splits = [str(p).split(".") for p in df["Category"]]
splits
# [['5050', '88'],
# ['5051', '90'],
# ['B5050', '97'],
# ['5051', '23B'],
# ['5051', '78E'],
# ['B5050', '11'],
# ['5051', '09'],
# ['Z5052']]
df["Digits"] = [p[1] if len(p)>1 else "-" for p in splits]
df
# Category Digits
# 0 5050.88 88
# 1 5051.90 90
# 2 B5050.97 97
# 3 5051.23B 23B
# 4 5051.78E 78E
# 5 B5050.11 11
# 6 5051.09 09
# 7 Z5052 -
Not so pretty, but it works.
EDIT:
Added the "-" instead of NaN and the code snippet
Another way
df.Category.str.split(r'[\.]').str[1]
0 88
1 90
2 97Q
3 23B
4 78E
5 11
6 09
7 NaN
Alternatively
df.Category.str.extract(r'(?<=[.])(\w+)')
You need to escape your first . and do fillna:
df["Digits"] = df["Category"].astype(str).str.extract("\.(.*)").fillna("-")
print(df)
Output:
Category Digits
0 B5050.88 88
1 5051.90 90
2 B5050.97Q 97Q
3 5051.23B 23B
4 5051.78E 78E
5 B5050.11 11
6 5051.09 09
7 Z5052 -
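As an aside (not part of the answer above), extract also accepts expand=False, which returns a Series directly and skips the intermediate one-column DataFrame:
df["Digits"] = df["Category"].astype(str).str.extract(r"\.(.*)", expand=False).fillna("-")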
Try out the below:
df['Category'].apply(lambda x: x.split(".")[-1] if "." in x else "-")
I'm working with a panel dataset that contains many days' worth of info for each ID number. There is one variable that records the number of months in which the clients did something.
I want to find the clients that only reached 1 month, so the clients that never reached months 2, 3, etc.
Here is a sample of my data. The date column is in str format.
Client| Date | Months
1 | 04/01/2019 | 1
1 | 05/01/2019 | 1
1 | 06/01/2019 | 2
2 | 11/01/2019 | 1
2 | 12/01/2019 | 1
2 | 13/01/2019 | 1
2 | 14/01/2019 | 1
3 | 20/01/2019 | 1
3 | 21/01/2019 | 2
3 | 22/01/2019 | 2
3 | 23/01/2019 | 2
3 | 24/01/2019 | 3
3 | 25/01/2019 | 3
3 | 26/01/2019 | 3
In this example only client 2 would be selected. I would like to make a list, or something like that, to store the client numbers that follow the rule.
The code I tried was
df.loc[df["MONTHS"]==1, "CLIENT"].unique()
which didn't give me what I wanted (it includes all client ids that ever had month 1, not just the ones that only ever had month 1).
Any ideas are very much appreciated!
Perhaps something like this:
s = df.set_index('Client')['Months'].eq(1).groupby(level=0).all()
s[s].index
Result:
Int64Index([2], dtype='int64', name='Client')
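Since you mentioned wanting a list, the index above can be turned into a plain Python list (a small usage note, using the s from the snippet above):
client_list = s[s].index.tolist()
print(client_list)  # [2]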
Get the rows where there is only one unique month and filter:
df.loc[df.groupby(["Client"]).Months.transform("nunique").eq(1)]
Client Date Months
3 2 11/01/2019 1
4 2 12/01/2019 1
5 2 13/01/2019 1
6 2 14/01/2019 1
If you just want the Client number :
df.loc[df.groupby(["Client"]).Months.transform("nunique").eq(1), "Client"].unique()[0]
OR
df.groupby("Client").Months.nunique().loc[lambda x: x == 1].index[0]
I am trying to select a grouped average.
a1_avg = session.query(func.avg(Table_A.a1_value).label('a1_avg'))\
.filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))\
.group_by(Table_A.a1_group)
I have tried a few different iterations of this query and this is as close as I can get to what I need. I am fairly certain the group_by is creating the issue, but I am unsure how to correctly implement the query using SQLA. The table structure and expected output:
TABLE A
A1_ID | A1_VALUE | A1_DATE | A1_LOC | A1_GROUP
1 | 5 | 2011-10-05 | 5 | 6
2 | 15 | 2011-10-14 | 5 | 6
3 | 2 | 2011-10-21 | 6 | 7
4 | 20 | 2011-11-15 | 4 | 8
5 | 6 | 2011-10-27 | 6 | 7
EXPECTED OUTPUT
A1_LOC | A1_GROUP | A1_AVG
5 | 6 | 10
6 | 7 | 4
I would guess that you are just missing the group identifier (a1_group) in the result. Also (if I understand your model correctly), you need to add a group by clause for the a1_loc column as well:
edit-1: updated the query due to the question's specification
a1_avg = (
    session.query(Table_A.a1_loc, Table_A.a1_group,
                  func.avg(Table_A.a1_value).label('a1_avg'))
    .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))
    # .filter(Table_A.a1_id == '12')  # note: you do NOT need this
    .group_by(Table_A.a1_loc)         # note: you NEED this
    .group_by(Table_A.a1_group)
)
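To actually pull the rows out of that query, a small usage sketch (assuming the session and model above) could be:
# Query objects are lazy; .all() runs the SQL and returns the result rows
for a1_loc, a1_group, avg_value in a1_avg.all():
    print(a1_loc, a1_group, avg_value)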