Create list of dictionary items from lists - python

I am working on a project that involves going through two columns of latitude and longitude values. If the destination lat/long in a row are blank, then I need to figure out which pair of non-blank destination lat/long values is (geographically) closest to that row's origin. The dataframe looks like this:
origin_lat | origin_lon  | destination_lat | destination_lon
-----------|-------------|-----------------|----------------
20.291326  | -155.838488 | 25.145242       | -98.491404
25.611236  | -80.551706  | 25.646763       | -81.466360
26.897654  | -75.867564  | nan             | nan
I am trying to build two dictionaries, one with the origin lat and long, and the other with the destination lat and long, in this format:
tmplist = [{'origin_lat': 39.7612992, 'origin_lon': -86.1519681},
{'origin_lat': 39.762241, 'origin_lon': -86.158436 },
{'origin_lat': 39.7622292, 'origin_lon': -86.1578917}]
What I want to do is, for every row where the destination lat/lon are blank, compare the origin lat/lon in the same row to a dictionary of all the non-nan destination lat/lon values, then write the geographically closest lat/lon from that dictionary into the row in place of the nan values. I've been playing around with creating lists of dictionary objects but can't seem to build a dictionary in the correct format. Any help would be appreciated!

If df is your pandas.DataFrame, you can generate the requested dictionaries by iterating through the rows of df:
origin_dicts = [{'origin_lat': row['origin_lat'], 'origin_lon': row['origin_lon']} for _, row in df.iterrows()]
and analogously for destination_dicts.
Remark: if the only reason for creating the dictionaries is the calculation of values replacing the nan-entries, it might be easier to do this directly on the data frame, e.g.
df['destination_lon'] = df.apply(find_closest_lon, axis=1)
df['destination_lat'] = df.apply(find_closest_lat, axis=1)
where find_closest_lon and find_closest_lat are functions that receive a data frame row as an argument and have access to the values of the origin columns of the data frame.
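A minimal sketch of what such functions could look like, assuming the non-nan destination coordinates are collected into an array first; dest_points is an illustrative helper name, and the plain squared Euclidean distance on lat/lon is a simplification (a haversine/great-circle distance would be more accurate geographically):

import numpy as np
import pandas as pd

# collect all non-NaN destination coordinates once (illustrative helper)
dest_points = df[['destination_lat', 'destination_lon']].dropna().to_numpy()

def find_closest_lat(row):
    """Return the destination latitude closest to this row's origin point."""
    if not np.isnan(row['destination_lat']):
        return row['destination_lat']  # already filled in, keep it
    # squared Euclidean distance is enough for picking the minimum
    d = (dest_points[:, 0] - row['origin_lat'])**2 + (dest_points[:, 1] - row['origin_lon'])**2
    return dest_points[d.argmin(), 0]

def find_closest_lon(row):
    """Return the destination longitude closest to this row's origin point."""
    if not np.isnan(row['destination_lon']):
        return row['destination_lon']
    d = (dest_points[:, 0] - row['origin_lat'])**2 + (dest_points[:, 1] - row['origin_lon'])**2
    return dest_points[d.argmin(), 1]

Because both functions take the argmin over the same dest_points array, the filled-in latitude and longitude come from the same destination point.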

The format that you want is the built-in 'records' format:
df[['origin_lat','origin_lon']].to_dict(orient = 'records')
produces
[{'origin_lat': 20.291326, 'origin_lon': -155.83848799999998},
{'origin_lat': 25.611235999999998, 'origin_lon': -80.55170600000001},
{'origin_lat': 26.897654, 'origin_lon': -75.867564}]
and of course you can equally have
df[['destination_lat','destination_lon']].to_dict(orient = 'records')
But I agree with @ctenar that you do not need to generate dictionaries for your ultimate task; pandas provides enough functionality for that.
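If you only want the rows where the destination is actually filled in, as the question describes, you can drop the NaN rows before converting; a small sketch, assuming the same df:

# keep only rows with non-NaN destination coordinates before building the dict list
dest_dicts = (df[['destination_lat', 'destination_lon']]
              .dropna()
              .to_dict(orient='records'))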

Related

Function for adding rows if they aren't already present from a unique list/csv?

I downloaded a .csv file with columns: Genus, species, Region, and Distribution. (image below).
Each Genus + species has a different variation of Regions, none exactly the same, and in Distribution it says 'Present' in every single row because the species are all present.
I created a list of unique regions in my dataframe called unique_regions (and a .csv file containing a single column of all regions in my dataset). This file also has the corresponding Latitude and Longitude for each unique region.
My goal is to use this unique_regions variable (or .csv file) to systematically go through each Genus + species and add the countries that were not included (or in other words, not present) into the Region column and then add 'Absent' into the Distribution column.
For example:
Here is a species that is only present in 20 Regions of the world (out of the 324 total unique regions I have in my list):
I need there to be 304 new rows (just for this species alone), with same Family, Genus, and species entry. The regions that were not included should be added along with the corresponding Latitude and Longitude from the unique_regions list or .csv file, and next to those regions it should say 'Absent'.
Assuming that you turn the file into a python list first, you could do something like this:
List = [{"Family":"family name here","Genus":"genus name here","Species":"species name here"}]
#This list contains all of the information from your file. With this
#script, each entry is a dictionary so that you can access a column by going:
#
#item = List[Index_Number]
#data_you_need = item["type of data you need"]
#
#However, you could also just use a list and remember which list index corresponds
#to which kind of data
item = List[0]
for i in range(304):
List.append({"Family":item["Species"], "Genus":item["Genus"],
"Species":item["Species"]}
#This bit here uses the keys that you define to access each type of data,
#grabs the data entered in the previous entries, and copies it into a new
#entry.
Figured it out.
import numpy as np
import pandas as pd

# load csv with species
df = pd.read_csv('./Erysiphaceae_combined/Thekopsora_areolata.csv')
# load csv with unique regions
unique_regions = pd.read_csv('./unique_regions.csv')
# add all unique_regions to Region column of df, replace NA values with desired values
combined = pd.concat([unique_regions, df], sort=False)
combined['Family'] = combined['Family'].replace(np.nan, 'Erysiphaceae')
combined['Genus'] = combined['Genus'].replace(np.nan, 'Thekopsora')    # change per species
combined['Species'] = combined['Species'].replace(np.nan, 'areolata')  # change per species
combined['Distribution'] = combined['Distribution'].replace(np.nan, 'Absent')
# keep the original ('Present') row when a region already exists for this species
combined = combined.drop_duplicates(subset='Region', keep='last')
# write new csv with desired outputs
combined.to_csv("./Kriging/Thekopsora_areolata.csv")

Automating the creation of dataframes from subsets of an existing dataframe

I'm working with the Kaggle New York City Airbnb Open Data, which is available here:
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
The data contains a column 'neighbourhood_group', consisting of the 5 boroughs of NYC, and a column 'neighbourhood', consisting of the neighbourhoods within each neighbourhood group.
I have created a subset of the Manhattan neighbourhood with the following code:
airbnb_manhattan = airbnb[airbnb['neighbourhood_group'] == 'Manhattan']
I would like to create further subsets of this dataframe by neighbourhood. However, there are 32 neighbourhoods, so I'd like to automate the process.
This is the code that I tried:
manhattan_neighbourhoods = list(airbnb_manhattan['neighbourhood'].unique())
neighbourhoods = pd.DataFrame()
for n in manhattan_neighbourhoods:
    neighbourhoods[n] = pd.DataFrame(affordable_manhattan[affordable_manhattan['neighbourhood'] == manhattan_neighbourhoods[n]])
Which produces the following error message:
TypeError: list indices must be integers or slices, not str
Thanks.
You should not copy into new dfs unless strictly necessary. Try to do your analysis with the full df as much as possible. Use .groupby as in
by_neigh = airbnb.groupby('neighbourhood_group')
Then use .agg, .apply, or .transform as needed. Or as a last resort you can iterate with
for neigh, rows in by_neigh:
    ...  # neigh is the group key, rows is the corresponding sub-DataFrame
Or get just one group with
by_neigh.get_group('Manhattan')
The advantage of all this is that the underlying data is not copied until absolutely necessary, and pandas can just view the same array with different filters and slices as needed.
Read more in the docs
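A minimal sketch tying those pieces together; the file name and the 'price' column are taken from the Kaggle dataset and are assumptions here:

import pandas as pd

airbnb = pd.read_csv('AB_NYC_2019.csv')  # assumed file name from the Kaggle dataset

# one aggregated value per neighbourhood, without materialising 32 separate frames
manhattan = airbnb[airbnb['neighbourhood_group'] == 'Manhattan']
median_price = manhattan.groupby('neighbourhood')['price'].median()

# if separate frames are really needed, a dict keyed by neighbourhood avoids
# inventing 32 variable names
frames = {name: grp for name, grp in manhattan.groupby('neighbourhood')}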

For loop does not stop

for lat, lng, value in zip(location_saopaulo_df['geolocation_lat'],
                           location_saopaulo_df['geolocation_lng'],
                           location_saopaulo_df['municipality']):
    coordinates = (lat, lng)
    items = rg.search(coordinates)
    value = items[0]['admin2']
I am trying to iterate over 3 columns of the dataframe, get the latitude and longitude values from the first two columns, use them to get the address, and then add the city name to the last column I mentioned, which is an empty column consisting of NaN values.
However, my for loop is not stopping. I would be grateful if you can tell me why it doesn't stop or better way to do what I'm trying to do.
Thank you in advance.
If rg is reverse_geocoder, there is a better way to query several coordinates at once than looping. Try this:
res = rg.search(tuple(zip(location_saopaulo_df['geolocation_lat'],
location_saopaulo_df['geolocation_lng'])))
And then extract just the admin2 value by constructing a dataframe, for example:
df_ = pd.DataFrame(res)
and see what it looks like. You may be able to perform a merge or index alignment to put it back into your original dataframe location_saopaulo_df.
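A sketch of how the batched result could be written back, assuming the rows of location_saopaulo_df are in the same order as the coordinates passed to rg.search (the result list preserves input order):

import pandas as pd
import reverse_geocoder as rg

coords = tuple(zip(location_saopaulo_df['geolocation_lat'],
                   location_saopaulo_df['geolocation_lng']))
res = rg.search(coords)          # one batched lookup instead of a Python loop

df_ = pd.DataFrame(res)          # columns include 'name', 'admin1', 'admin2', 'cc'
# positional assignment, so a non-default index on the original frame is not a problem
location_saopaulo_df['municipality'] = df_['admin2'].to_numpy()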

Integrating two diff dataframes using contains condition and creating a new dataframe

First data frame looks like below:
OSIED geometry
257005 POLYGON ((311852.712 178933.993, 312106.023 17...
017049 POLYGON ((272943.107 137755.159, 272647.627 13...
017032 POLYGON ((276637.425 146141.397, 276601.509 14...
Second data frame looks like below:
small_area Median_BER
217099001/217112002/217112003/2570052005/217112... 212.9
047041005/047041001/2570051004/047041002/047041... 271.3
157041002/157041004/157041003/157041001/157129... 222.5
I need to search col1 of df1 (OSIED) within col1 of df2 (small_area) using a "contains" condition.
If it matches/contains the string, then fetch the corresponding values from df1 and df2.
I tried merge, df.get and str.contains.
str.contains works, but I am unable to fetch the other records.
Output should look like this:
OSIED   geometry                                            small_area                                           Median_BER
257005  POLYGON ((311852.712 178933.993, 312106.023 17...  217099001/217112002/217112003/2570052005/217112...  212.9
017049  POLYGON ((272943.107 137755.159, 272647.627 13...  047041005/047041001/2570051004/047041002/047041...  222.5
Playing around with some code, I was able to generate the following:
small_area_oseid_df = pd.DataFrame(
    [
        {'OSIED': oseid[:6], 'Median_BER': row['Median_BER']}
        for row in df.to_dict(orient='records')   # df here is the second data frame (small_area / Median_BER)
        for oseid in row['small_area'].split('/')
    ]
)
Then you can join this table with the first table on the OSIED key. Note that this explodes the rows: the size of small_area_oseid_df depends on how many elements each small_area value produces in the split.
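A hedged sketch of that join, assuming the polygon frame from the question is called df1 (the name is not given in the original post):

# join the exploded lookup table back onto the polygon frame on the shared OSIED key
result = df1.merge(small_area_oseid_df, on='OSIED', how='inner')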

Splitting DataFrame into two DataFrames and filter these two DataFrames in order to have the same dimensions

I have the following problem and had an idea to solve it, but it didn't work:
I have the data on DAX Call and Put Options for every trading day in a month. After transforming and some calculations I have the following DataFrame:
DaxOpt. The goal is now to get rid of every row (either Call or Put Option) which does not have the respective pair. With pair I mean a Call and Put Option with the same 'EXERCISE_PRICE' and 'TAU', where 'TAU' = the time to maturity in years. The red boxes in the picture are examples for a pair. So either having a DataFrame with only the pairs or having two DataFrames with Call and Put Options where the rows are the respective pairs.
My idea was creating two new DataFrames one which contains only the Call Options and the other the Put Options, sort them after 'TAU' and 'EXERCISE_PRICE' and working my way through with pandas isin function, in order to get rid of the Call or Put Options which do not have the respective pair.
DaxOptCall = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C']
DaxOptPut = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P']
The problem is that DaxOptCall and DaxOptPut have different dimensions, so the isin function is not directly applicable. I am trying to find the most efficient way, since the data I am using now is just a fraction of the real data.
Would appreciate any help or idea.
See if this works for you:
Once you separated your df into two dfs by CALL/PUT options, convert the column(s) that are unique to your pairs into index columns:
# Assuming your unique columns are TAU and EXERCISE_PRICE
df_call = df_call.set_index(["EXERCISE_PRICE", "TAU"])
df_put = df_put.set_index(["EXERCISE_PRICE", "TAU"])
Next, take the intersection of the indexes, which will return a pandas MultiIndex object
mtx = df_call.index.intersection(df_put.index)
Then use the mtx object to extract the common elements from the dfs
df_call.loc[mtx]
df_put.loc[mtx]
You can merge these if you want them in the same df, and reset the index to restore the original columns.
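For that last step, a short sketch, assuming the indexed frames from above; the _call/_put suffixes are only there to keep the overlapping column names apart:

# keep only the matched pairs and put call and put legs side by side
pairs = df_call.loc[mtx].merge(df_put.loc[mtx],
                               left_index=True, right_index=True,
                               suffixes=('_call', '_put')).reset_index()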
