I am fairly new to Python. I am working with the nycflights13 DataFrame.
My goal is to determine the most common and least common origin for every destination, e.g.:
+------+--------------------+-----------+---------------------+-----------+
| dest | most_common_origin | max_count | least_common_origin | min_count |
+------+--------------------+-----------+---------------------+-----------+
| MIA  | LGA                | 5781      | EWR                 | 2633      |
+------+--------------------+-----------+---------------------+-----------+
I managed to do it for one destination, but I'm having trouble with the loop and with binding the rows together, because sometimes I get a Series instead of a DataFrame.
My working example:
from nycflights13 import flights
flights_to_MIA = flights[flights['dest']=='MIA']
flights_to_MIA.groupby(flights_to_MIA['origin']).size().reset_index(name='counts')
Note that I used size and then reset_index to make it a DataFrame.
You can use agg + value_counts:
s = flights.groupby('dest').agg(
    most_common_origin=('origin', lambda x: x.mode()[0]),
    max_count=('origin', lambda x: x.value_counts().sort_values().iloc[-1]),
    least_common_origin=('origin', lambda x: x.value_counts().sort_values().index[0]),  # index[0] gives a scalar label, not an Index
    min_count=('origin', lambda x: x.value_counts().sort_values().iloc[0]),
)
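If you want the result shaped exactly like the table above, reset the index afterwards so dest becomes a regular column again:
s = s.reset_index()
s[s['dest'] == 'MIA']   # should reproduce the MIA row shown in the question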
I am using an API that returns a dictionary with nested lists inside; let's name it coins_best. The result looks like this:
{'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]]}
The first value in each inner list is a timestamp, while the second is a price in dollars. I want to create a DataFrame with the prices, using the timestamps as the index. I tried this code to do it in just one step:
import pandas as pd

d = pd.DataFrame()
for id, obj in coins_best.items():
    for i in range(0, len(obj)):
        temp = pd.DataFrame({
            obj[i][1]
        })
        d = pd.concat([d, temp])
d
This attempt gave me a DataFrame with just one column instead of the two required, because using the columns argument threw an error (TypeError: Index(...) must be called with a collection of some kind, 'bitcoin' was passed) when I tried it with id.
Then I tried comprehensions to preprocess the dictionary and its lists:
for k in coins_best.keys():
    inner_lists = (coins_best[k] for inner_dict in coins_best.values())
    items = (item[1] for ls in inner_lists for item in ls)
I could not obtain both elements of the dictionary, just the last one.
I know it is possible to try:
df = pd.DataFrame(coins_best, columns=coins_best.keys())
Which gives me:
bitcoin ethereum
0 [1603782192402, 13089.646908288987] [1603782053064, 393.6741989091851]
1 [1603785693143, 13146.275972229188] [1603785731599, 394.6174435303511]
And then try to remove the first element in every list of every row, but that was even harder for me. The required result is:
               bitcoin             ethereum
1603782192402  13089.646908288987  393.6741989091851
1603785693143  13146.275972229188  394.6174435303511
Do you know how to process the dictionary before creating the DataFrame in order to get this result?
This is my first question; I tried to be as clear as possible. Thank you very much.
Update #1
The answer by Sander van den Oord also solved the problem of the timestamps and is useful for its purpose. However, the sample code, while correct (it used the info provided), was limited to those two keys. This is the final code, which solves the problem for every key in the dictionary:
df_coins = pd.DataFrame()  # start with an empty frame so the first concat works
for k in coins_best:
    df_coins1 = pd.DataFrame(data=coins_best[k], columns=['timestamp', k])
    df_coins1['timestamp'] = pd.to_datetime(df_coins1['timestamp'], unit='ms')
    df_coins = pd.concat([df_coins1, df_coins], sort=False)
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
Thank you very much for your answers.
I think you shouldn't ignore the fact that values of coins are taken at different times. You could do something like this:
import pandas as pd
import hvplot.pandas
coins_best = {
'bitcoin': [[1603782192402, 13089.646908288987],
[1603865643028, 13712.070136258053]],
'ethereum': [[1603782053064, 393.6741989091851],
[1603865024078, 404.86117057956386]],
}
df_bitcoin = pd.DataFrame(data=coins_best['bitcoin'], columns=['timestamp', 'bitcoin'])
df_bitcoin['timestamp'] = pd.to_datetime(df_bitcoin['timestamp'], unit='ms')
df_ethereum = pd.DataFrame(data=coins_best['ethereum'], columns=['timestamp', 'ethereum'])
df_ethereum['timestamp'] = pd.to_datetime(df_ethereum['timestamp'], unit='ms')
df_coins = pd.concat([df_ethereum, df_bitcoin], sort=False)
Your df_coins will now look like this:
+----+----------------------------+------------+-----------+
| | timestamp | ethereum | bitcoin |
|----+----------------------------+------------+-----------|
| 0 | 2020-10-27 07:00:53.064000 | 393.674 | nan |
| 1 | 2020-10-28 06:03:44.078000 | 404.861 | nan |
| 0 | 2020-10-27 07:03:12.402000 | nan | 13089.6 |
| 1 | 2020-10-28 06:14:03.028000 | nan | 13712.1 |
+----+----------------------------+------------+-----------+
Now if you want the values to be on the same line, you could use resampling. Here I do it per day: all values of the same day for a coin type are averaged:
df_coins_resampled = df_coins.set_index('timestamp').resample('d').mean()
df_coins_resampled will look like this:
+---------------------+------------+-----------+
| timestamp | ethereum | bitcoin |
|---------------------+------------+-----------|
| 2020-10-27 00:00:00 | 393.674 | 13089.6 |
| 2020-10-28 00:00:00 | 404.861 | 13712.1 |
+---------------------+------------+-----------+
I like to use hvplot to get an interactive plot of the result:
df_coins_resampled.hvplot.scatter(
x='timestamp',
y=['bitcoin', 'ethereum'],
s=20, padding=0.1
)
Resulting plot:
The timestamps differ, so the correct output looks different from what you presented, but other than that, it's a one-liner (where d is your input dictionary):
pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp') for key, val in d.items()], axis=1)
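If you also want human-readable datetimes, as in the other answer, you can convert the index afterwards; for example:
result = pd.concat([pd.DataFrame(val, columns=['timestamp', key]).set_index('timestamp')
                    for key, val in d.items()], axis=1)
result.index = pd.to_datetime(result.index, unit='ms')  # millisecond timestamps -> DatetimeIndex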
For each set of coordinates in a PySpark dataframe, I need to find the closest set of coordinates in another dataframe.
I have one pyspark dataframe with coordinate data like so (dataframe a):
+------------------+-------------------+
| latitude_deg| longitude_deg|
+------------------+-------------------+
| 40.07080078125| -74.93360137939453|
| 38.704022| -101.473911|
| 59.94919968| -151.695999146|
| 34.86479949951172| -86.77030181884766|
| 35.6087| -91.254898|
| 34.9428028| -97.8180194|
And another like so (dataframe b; only a few rows are shown for clarity):
+-----+------------------+-------------------+
|ident| latitude_deg| longitude_deg|
+-----+------------------+-------------------+
| 00A| 30.07080078125| -24.93360137939453|
| 00AA| 56.704022| -120.473911|
| 00AK| 18.94919968| -109.695999146|
| 00AL| 76.86479949951172| -67.77030181884766|
| 00AR| 10.6087| -87.254898|
| 00AS| 23.9428028| -10.8180194|
Is it possible to somehow merge the dataframes so that dataframe a gets the closest ident from dataframe b for each of its rows:
+------------------+-------------------+-------------+
| latitude_deg| longitude_deg|closest_ident|
+------------------+-------------------+-------------+
| 40.07080078125| -74.93360137939453| 12A|
| 38.704022| -101.473911| 14BC|
| 59.94919968| -151.695999146| 278A|
| 34.86479949951172| -86.77030181884766| 56GH|
| 35.6087| -91.254898| 09HJ|
| 34.9428028| -97.8180194| 09BV|
What I have tried so far:
I have a PySpark UDF defined to calculate the haversine distance between two pairs of coordinates:
udf_get_distance = F.udf(get_distance)
It works like this:
df = df.withColumn("ABS_DISTANCE", udf_get_distance(
    df.latitude_deg_a, df.longitude_deg_a,
    df.latitude_deg_b, df.longitude_deg_b,
))
I'd appreciate any kind of help. Thanks so much
You need to do a crossJoin first. Something like this:
joined_df = source_df1.crossJoin(source_df2)
Then you can call your UDF as you mentioned, generate a row number per row of dataframe a ordered by distance, and filter out the closest one:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# one window partition per row of dataframe a, ordered by the computed distance
rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")
udf_result_df = (joined_df
    .withColumn("ABS_DISTANCE", udf_get_distance(
        joined_df.latitude_deg_a, joined_df.longitude_deg_a,
        joined_df.latitude_deg_b, joined_df.longitude_deg_b))
    .withColumn("rownum", row_number().over(rwindow))
    .filter("rownum = 1"))
Note: add a return type to your UDF.
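For reference, here is a minimal sketch of what the get_distance function and the typed UDF could look like; the function body is an assumption, since the question does not show it:
import math
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def get_distance(lat_a, lon_a, lat_b, lon_b):
    # Haversine distance in kilometres between two points given in degrees
    lat_a, lon_a, lat_b, lon_b = map(math.radians, (lat_a, lon_a, lat_b, lon_b))
    d_lat, d_lon = lat_b - lat_a, lon_b - lon_a
    a = math.sin(d_lat / 2) ** 2 + math.cos(lat_a) * math.cos(lat_b) * math.sin(d_lon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean Earth radius ~ 6371 km

udf_get_distance = F.udf(get_distance, DoubleType())  # DoubleType() is the explicit return type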
I made a grid search that contains 36 models.
For each model, the confusion matrix is available with:
grid_search.get_grid(sort_by='a_metrics', decreasing=True)[index].confusion_matrix(valid=valid_set)
My problem is that I only want to access some parts of this confusion matrix in order to build my own ranking, which is not natively available in h2o.
Let's say we have the confusion_matrix of the first model of the grid_search below:
+---+-------+--------+--------+--------+------------------+
|   |       |      0 |      1 |  Error | Rate             |
+---+-------+--------+--------+--------+------------------+
| 0 | 0     |  766.0 | 2718.0 | 0.7801 | (2718.0/3484.0)  |
| 1 | 1     |  351.0 | 6412.0 | 0.0519 | (351.0/6763.0)   |
| 2 | Total | 1117.0 | 9130.0 | 0.2995 | (3069.0/10247.0) |
+---+-------+--------+--------+--------+------------------+
Actually, the only thing that really interests me is the precision of class 0, i.e. 766/1117 = 0.685765443, whereas h2o computes precision metrics over all classes, which is done to the detriment of what I am looking for.
I tried to convert it to a dataframe with:
model = grid_search.get_grid(sort_by='a_metrics', decreasing=True)[0]
model.confusion_matrix(valid=valid_set).as_data_frame()
Even though some topics on the internet suggest this works, it actually does not (or doesn't anymore):
AttributeError: 'ConfusionMatrix' object has no attribute 'as_data_frame'
I searched for a way to return a list of attributes of the confusion_matrix, without success.
According to the H2O documentation, there is no as_data_frame method on the ConfusionMatrix object: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/confusion_matrix.html
I assume the easiest way is to call to_list().
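A minimal sketch of that approach, assuming to_list() returns the raw counts in the same row/column order as the printed table:
cm = model.confusion_matrix(valid=valid_set).to_list()
# cm should look like [[766.0, 2718.0], [351.0, 6412.0]]
precision_class_0 = cm[0][0] / (cm[0][0] + cm[1][0])  # 766.0 / 1117.0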
The .table attribute gives an object that does have an as_data_frame method:
model.confusion_matrix(valid=valid_set).table.as_data_frame()
If you need to access the table header, you can do
model.confusion_matrix(valid=valid_set).table._table_header
Hint: you can use dir() to check the valid attributes of a Python object.
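From there, extracting the class-0 precision from the question is plain pandas indexing (a sketch; inspect the actual column labels first, since they can vary across H2O versions):
df_cm = model.confusion_matrix(valid=valid_set).table.as_data_frame()
print(df_cm.columns)  # check the real labels before relying on positions
# With the layout shown in the question (rows: actual 0, actual 1, Total):
precision_class_0 = df_cm.iloc[0, 1] / df_cm.iloc[2, 1]  # 766.0 / 1117.0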
I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and the latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our dataset. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity):
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude/longitude file, coordinates:
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is that the coordinates data frame has one row per glacier id with its latitude/longitude, whereas my glims data frame has multiple rows for each glacier id, each with different data.
I need every entry in my main data file to have latitude/longitude values added to it, based on the matching glacier id between the two data frames.
Here's what I've tried so far:
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] ==
glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work:
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
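Afterwards you may want a quick sanity check (a sketch using the column names from the question):
glims = glims.drop(columns='glacier_id')              # GLAC_ID and glacier_id now duplicate each other
print(glims['lat'].isna().sum(), "rows had no matching glacier_id")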
This is a classic merging problem. One way to solve it is using straight loc and index matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coord.set_index('glacier_id').lat
glims.loc[:, 'long'] = coord.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
coord.rename(columns={'glacier_id': 'GLAC_ID'}),
on='GLAC_ID')
Let's say I have a model like this:
+-----------+--------+--------------+
| Name | Amount | Availability |
+-----------+--------+--------------+
| Milk | 100 | True |
+-----------+--------+--------------+
| Chocolate | 200 | False |
+-----------+--------+--------------+
| Honey | 450 | True |
+-----------+--------+--------------+
Now in a second model I want to have a field (also named 'Amount') which is always equal to the sum of the amounts of the rows which have Availability = True. For example like this:
+-----------+-----------------------------------------------+
| Inventory | Amount |
+-----------+-----------------------------------------------+
| Groceries | 550 #this is the field I want to be dependent |
+-----------+-----------------------------------------------+
Is that possible? Or is there a better way of doing this?
Of course that is possible. I would recommend one of two things:
1. Do this "on the fly", as one person commented, and then store the result in Django's cache mechanism so that it is only recalculated once in a while (saving database/computation resources); see the sketch below.
2. Create a database view that does the summation; again, this lets the database cache the results, etc. to save resources.
That said, I think option 1 or 2 is only needed for a very large record set on a very busy site.
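A minimal sketch of option 1, assuming the first model is called Product, the second Inventory, and a 5-minute cache timeout (these names and numbers are assumptions, not from the question):
from django.core.cache import cache
from django.db import models
from django.db.models import Sum

class Product(models.Model):
    name = models.CharField(max_length=100)
    amount = models.IntegerField()
    availability = models.BooleanField(default=True)

class Inventory(models.Model):
    name = models.CharField(max_length=100)

    @property
    def amount(self):
        # Compute the sum "on the fly", but cache it so the aggregate
        # query only runs once per timeout interval.
        key = f"inventory-amount-{self.pk}"
        total = cache.get(key)
        if total is None:
            total = (Product.objects
                     .filter(availability=True)
                     .aggregate(total=Sum('amount'))['total']) or 0
            cache.set(key, total, timeout=300)  # recompute at most every 5 minutes
        return total
When a Product row changes, you can either let the cache entry expire on its own or delete the key from Product.save().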