I am trying to select a grouped average.
a1_avg = session.query(func.avg(Table_A.a1_value).label('a1_avg'))\
.filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))\
.group_by(Table_A.a1_group)
I have tried a few different iterations of this query and this is as close as I can get to what I need. I am fairly certain the group_by is creating the issue, but I am unsure how to correctly implement the query using SQLA. The table structure and expected output:
TABLE A
A1_ID | A1_VALUE | A1_DATE | A1_LOC | A1_GROUP
1 | 5 | 2011-10-05 | 5 | 6
2 | 15 | 2011-10-14 | 5 | 6
3 | 2 | 2011-10-21 | 6 | 7
4 | 20 | 2011-11-15 | 4 | 8
5 | 6 | 2011-10-27 | 6 | 7
EXPECTED OUTPUT
A1_LOC | A1_GROUP | A1_AVG
5 | 6 | 10
6 | 7 | 4
I would guess that you are just missing the group identifier (a1_group) in the result. Also, if I understand your model correctly, you need to add a GROUP BY clause for the a1_loc column as well:
edit-1: updated the query based on the clarified question specification
a1_avg = (
    session.query(Table_A.a1_loc, Table_A.a1_group,
                  func.avg(Table_A.a1_value).label('a1_avg'))
    .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))
    # .filter(Table_A.a1_id == '12')  # note: you do NOT need this
    .group_by(Table_A.a1_loc)         # note: you DO need this
    .group_by(Table_A.a1_group)
)
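If it helps, here is a minimal sketch of consuming the result; it assumes session is an active Session and the query above is used unchanged:
# a1_avg is a Query object; nothing is executed until you iterate it or call .all()
for loc, group, avg_value in a1_avg.all():
    print(loc, group, avg_value)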
Related
Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that for each specific ID there are only types A and B. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B; this should be applied to both value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1  | 0.5       | C    | 1         |
| 2  | 3         | C    | 0.66      |
I am new to PySpark and so far have not been able to achieve the described result; I would appreciate any tips or solutions.
You can consider dividing the original DataFrame into two parts according to type and then using a SQL statement to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
reuslt_df = spark.sql(sql)
reuslt_df.show(truncate=False)
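If you prefer to stay in the DataFrame API, here is a rough sketch of the same join-and-divide logic (the alias names a_one, a_two, b_one, b_two are just illustrative and not part of your data):
from pyspark.sql import functions as F

a = df.filter(F.col("type") == "A").select(
    "id",
    F.col("value_one").alias("a_one"),
    F.col("value_two").alias("a_two"),
)
b = df.filter(F.col("type") == "B").select(
    "id",
    F.col("value_one").alias("b_one"),
    F.col("value_two").alias("b_two"),
)

# One row per id: divide A's values by B's values and label the result as type C
result_df = a.join(b, "id").select(
    "id",
    (F.col("a_one") / F.col("b_one")).alias("value_one"),
    F.lit("C").alias("type"),
    (F.col("a_two") / F.col("b_two")).alias("value_two"),
)
result_df.show(truncate=False)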
I have two DataFrames: one has data about railways and their coordinates, and the other has city codes and coordinates. These coordinates don't match exactly, so I need to calculate the difference between all the coordinates in DataFrame b and the lines from DataFrame a, and choose the city code with the smallest difference.
Dataframe a:
| FROMNODENO | TONODENO | LON | LAT |
|------------|----------|------------|------------|
| 3 | 4 | -46.720863 | -23.653625 |
| 3 | 5 | -46.868323 | -23.270917 |
| 4 | 6 | -46.869839 | -23.274121 |
Dataframe b:
| COD | LON | LAT |
|---------|-----------|----------|
| 5200050 | -16.75730 | -49.4412 |
| 3100104 | -18.48310 | -47.3916 |
| 5200100 | -16.19700 | -48.7057 |
I need the final dataframe to be something like this:
| FROMNODENO | TONODENO | LON | LAT | COD |
|------------|----------|------------|------------|---------|
| 3 | 4 | -46.720863 | -23.653625 | 5200050 |
I imagine I need to do a for loop, but I don't know how I can do that.
You can use a package like geopandas to solve this problem efficiently. However, if you can't or don't want to install another third-party dependency, you can:
1. Cross join these DataFrames.
2. Calculate the absolute LAT/LON distance for each pair.
3. Filter that data down to the minimum for each node.
print(cities)
COD LON LAT
0 5200050 -16.7573 -49.4412
1 3100104 -18.4831 -47.3916
2 5200100 -16.1970 -48.7057
print(nodes)
FROMNODENO TONODENO LON LAT
0 3 4 -46.720863 -23.653625
1 3 5 -46.868323 -23.270917
2 4 6 -46.869839 -23.274121
import pandas as pd

out = (
    pd.merge(cities, nodes, how="cross", suffixes=("_city", "_node"))
    .eval("combined_abs_dist = abs(LON_city - LON_node) + abs(LAT_city - LAT_node)")
    .loc[lambda df:
        df.groupby(["FROMNODENO", "TONODENO"])["combined_abs_dist"].idxmin()
    ]
)
print(out)
COD LON_city LAT_city FROMNODENO TONODENO LON_node LAT_node combined_abs_dist
3 3100104 -18.4831 -47.3916 3 4 -46.720863 -23.653625 51.975738
4 3100104 -18.4831 -47.3916 3 5 -46.868323 -23.270917 52.505906
5 3100104 -18.4831 -47.3916 4 6 -46.869839 -23.274121 52.504218
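If you then want a layout closer to the expected output in the question, you can select and rename the node columns from out, for example:
# Keep only the node identifiers, the node coordinates and the matched city code
final = out[["FROMNODENO", "TONODENO", "LON_node", "LAT_node", "COD"]].rename(
    columns={"LON_node": "LON", "LAT_node": "LAT"}
)
print(final)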
I created an array full of numbers. I want to add the data from the array to the PrettyTable, basically using add_column.
number=[1,2,3,4,5,6,7,8,9,.....,100]
I want the output of the pretty table to be like this below
+-----------+------+------------+-----------------+
| Number |1 | 2 | 3 | 4 | ..... |
+-----------+------+------------+-----------------+
My code is shown below.
from prettytable import PrettyTable
x = PrettyTable()
x.add_column(["number", print(number)])
print(x)
When I run the Python script, it generates an error:
TypeError: add_column() missing 1 required positional argument: 'column'
How can I achieve the desired output?
You should use add_row like this, and if you need to add 'number' as text, you should add it to the list:
from prettytable import PrettyTable

number = ['number', 1, 2, 3, 4, 5, 6, 7, 8, 9]

x = PrettyTable()
x.add_row(number)
print(x)
OUTPUT:
+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+
| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | Field 9 | Field 10 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+
| number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+
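If you want the output to look more like the layout in the question, without the auto-generated Field 1, Field 2, ... header row, one option (a small sketch using PrettyTable's header switch) is to turn the header off:
from prettytable import PrettyTable

number = ['Number'] + list(range(1, 10))

x = PrettyTable()
x.header = False  # hide the auto-generated "Field N" header row
x.add_row(number)
print(x)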
I'm trying to translate a window function from SQL to pandas. It is only applied under the condition that a match is possible; otherwise a NULL (None) value is inserted.
SQL code (example)
SELECT
    [ID_customer],
    [cTimestamp],
    [TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
    SELECT * FROM (
        SELECT [ID_req], [ID_customer], [rTimestamp],
               RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) AS rnk
        FROM [table].[Customer_request]
    ) AS [Q]
    WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if exists) to the customer.
table:Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1 | 2014 |
| 2 | 2014 |
| 3 | 2015 |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1 | 1 | 2012 |
| 2 | 1 | 2013 |
| 3 | 1 | 2014 |
| 4 | 2 | 2014 |
+--------+-------------+------------+
Result: table:merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1 | 2014 | 3 |
| 2 | 2014 | 4 |
| 3 | 2015 | None/NULL |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
Instead of using the RANK() function, you can simply use the query below, which is easy to convert:
SELECT A.ID_Customer, A.cTimeStamp, B.ID_req
FROM Customer A
LEFT JOIN (
    SELECT ID_Customer, MAX(ID_req) AS ID_req
    FROM Customer_request
    GROUP BY ID_Customer
) B
    ON A.ID_Customer = B.ID_Customer
Try this query, and if you face any issues, ask me in the comments.
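Since the question asks for the pandas equivalent, here is a minimal pandas sketch of the same LEFT JOIN logic, using MAX(ID_req) per customer like the query above (the DataFrames are built from the example tables in the question):
import pandas as pd

customer = pd.DataFrame({"ID_customer": [1, 2, 3], "cTimestamp": [2014, 2014, 2015]})
customer_request = pd.DataFrame({
    "ID_req": [1, 2, 3, 4],
    "ID_customer": [1, 1, 1, 2],
    "rTimestamp": [2012, 2013, 2014, 2014],
})

# MAX(ID_req) per customer, like the GROUP BY subquery above
latest = (
    customer_request.groupby("ID_customer", as_index=False)["ID_req"]
    .max()
    .rename(columns={"ID_req": "ID of Latest request"})
)

# LEFT JOIN: customers without any request get NaN (pandas' NULL)
merged = customer.merge(latest, on="ID_customer", how="left")
print(merged)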
Is there a way to identify which GPS coordinates represent the same location? For example, given the following DataFrame, how can I tell that Id 1 and 2 are from the same source location?
+-----+--------------+-------------+
| Id | VehLat | VehLong |
+-----+--------------+-------------+
| 66 | 63.3917005 | 10.4264724 |
| 286 | 63.429603 | 10.4167367 |
| 61 | 33.6687838 | 73.0755573 |
| 67 | 63.4150316 | 10.3980401 |
| 5 | 64.048128 | 10.083776 |
| 8 | 63.4332386 | 10.3971859 |
| 9 | 63.4305769 | 10.3927124 |
| 6 | 63.4293578 | 10.4164764 |
| 1 | 64.048254 | 10.084230 |
+-----+--------------+-------------+
Now, Ids 5 and 1 are basically the same location, but what's the best approach to classify these two locations as the same?
IIUC, you need this.
df[['VehLat','VehLong']].round(3).duplicated(keep=False)
You can change the number passed to round to adjust what you consider as "same".
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
If you want the df itself with the duplicate values, you can do as below:
df[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
OR
df.loc[df[['VehLat','VehLong']].round(2).duplicated(keep=False)]
Output
id VehLat VehLong
1 286 63.429603 10.416737
4 5 64.048128 10.083776
7 6 63.429358 10.416476
8 1 64.048254 10.084230
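If you also want an explicit label for each location instead of just a duplicate flag, one possible follow-up (a sketch using the same rounding idea; the loc_id column name is just an example) is to number the rounded-coordinate groups:
# Rows whose rounded coordinates match get the same group id
df["loc_id"] = df.groupby(
    [df["VehLat"].round(2), df["VehLong"].round(2)]
).ngroup()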
Use DataFrame.sort_values + Series.between:
This allows you greater flexibility when establishing the criteria to consider two coordinates as equivalent.
df2 = df[['VehLat', 'VehLong']].sort_values(['VehLong', 'VehLat'])
eq = df2.apply(lambda x: x.diff().between(-0.001, 0.001)).all(axis=1)
df2[eq | eq.shift(-1)]
This returns a DataFrame with the equivalent coordinates:
VehLat VehLong
4 64.048128 10.083776
8 64.048254 10.084230
7 63.429358 10.416476
1 63.429603 10.416737
df2[~(eq|eq.shift(-1))]
This returns the unique coordinates:
VehLat VehLong
6 63.430577 10.392712
5 63.433239 10.397186
3 63.415032 10.398040
0 63.391700 10.426472
2 33.668784 73.075557
You can restore the original order using DataFrame.sort_index:
df_noteq=df2[~(eq|eq.shift(-1))].sort_index()
print(df_noteq)
VehLat VehLong
0 63.391700 10.426472
2 33.668784 73.075557
3 63.415032 10.398040
5 63.433239 10.397186
6 63.430577 10.392712