I was following this tutorial on nearest neighbour analysis:
https://automating-gis-processes.github.io/2017/lessons/L3/nearest-neighbour.html
I get this error:
('id', 'occurred at index 0')
after I run this:
from shapely.ops import nearest_points

def nearest(row, geom_union, df1, df2, geom1_col='geometry',
            geom2_col='geometry', src_column=None):
    # Find the nearest point and return the corresponding value from the specified column.
    # Find the geometry that is closest
    nearest = df2[geom2_col] == nearest_points(row[geom1_col], geom_union)[1]
    # Get the corresponding value from df2 (matching is based on the geometry)
    value = df2[nearest][src_column].get_values()[0]
    return value

df1['nearest_id'] = df1.apply(nearest, geom_union=unary_union, df1=df1,
                              df2=df2, geom1_col='centroid',
                              src_column='id', axis=1)
I am using my own data for this. It is similar to the one given in the example, but I have the addresses, geometry, latitude and longitude in a .shp file, so I am not using a .kml file. I can't figure out this error.
Did you follow the code literally?

df1['nearest_id'] = df1.apply(nearest, geom_union=unary_union,
                              df1=df1, df2=df2, geom1_col='centroid',
                              src_column='id', axis=1)
Then the problem is likely src_column: the function returns the value from the column named by the src_column argument, which the sample code sets to id. If you get an error about column id, most likely your data doesn't have a column with that name, and you should pass the name of an existing column in your dataset.
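For example, a quick sanity check is to list the columns your shapefile actually has and pass one of those instead ('osm_id' below is just a hypothetical stand-in for whatever your ID column is really called):

# Inspect which columns df2 actually has
print(df2.columns)

# Pass the name of an existing column, e.g. 'osm_id' (hypothetical)
df1['nearest_id'] = df1.apply(nearest, geom_union=unary_union, df1=df1,
                              df2=df2, geom1_col='centroid',
                              src_column='osm_id', axis=1)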
I have data that is grouped and has a value associated with it. I'm fine with working out the rank of each value within its subgroup -
df['Rank'] = df.groupby('Group')['Value'].rank(ascending=True)
However, I'd also like to create additional columns that show the top-ranked value and second-ranked value within each group - i.e. as in the image attached below (I've not worked out how to draw a table on this website yet!). Many thanks.
[Image: Data Table]
# Get unique (Group, Rank, Value) combinations
group_data = df[['Group', 'Rank', 'Value']].drop_duplicates()

# Get the rank 1 and rank 2 rows, keeping Group for the join
group_rank_1 = group_data.loc[group_data.Rank == 1, ['Group', 'Value']]
group_rank_1.columns = ['Group', 'GroupRank1Value']
group_rank_2 = group_data.loc[group_data.Rank == 2, ['Group', 'Value']]
group_rank_2.columns = ['Group', 'GroupRank2Value']

# Join with the original data frame
res = df.merge(group_rank_1, how='inner', on='Group')
res = res.merge(group_rank_2, how='inner', on='Group')
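If every group has at least two rows and the ranks are unique, a merge-free sketch using groupby/transform gives the same two columns (with ascending=True, rank 1 is the smallest value in each group):

# Rank 1 = smallest value per group (ascending=True)
df['GroupRank1Value'] = df.groupby('Group')['Value'].transform('min')
# Rank 2 = second-smallest value per group
df['GroupRank2Value'] = df.groupby('Group')['Value'].transform(
    lambda s: s.nsmallest(2).iloc[-1])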
I am a beginner at Python and dataframes, and now I have encountered a problem. I have made a dataframe containing addresses. I want to create a calculated column that shows the distance from my house. I have gotten this far:
import googlemaps

API_key = 'Very secret api key'
gmaps = googlemaps.Client(key=API_key)

def distance(destination):
    origin = '44100, Kuhnamontie 1, Finland'
    # Driving distance in metres from the Distance Matrix response, converted to km
    distance = (gmaps.distance_matrix(origin, destination, mode='driving')
                ["rows"][0]["elements"][0]["distance"]["value"]) / 1000
    return distance

df["distance"] = distance(df.Adress)
This solution works for the first row in the dataframe, but all rows in the column get assigned the same value. I guess the calculation and API request must be made per row. I could loop through the data frame, but as I understand it, there are better ways.
Can you help me?
Using pandas apply should be slightly more efficient than a loop:

df["distance"] = df.apply(lambda row: distance(row["Adress"]), axis=1)
I'm using the "LGBT_Survey_DailyLife.csv" dataset from Kaggle (link), without the question_code and notes columns.
I want each question (question_label) and country (CountryCode) combination to be on its own line, and to have each column be a combination of group (subset) and response (answer) with the values being those given in the percentage column.
It seems like this should be pretty straightforward, but when I run the following:
daily_life.pivot(index=['CountryCode', 'question_label'], columns=['subset', 'answer'], values='percentage')
I get this error:
ValueError: Length of passed values is 34020, index implies 2
You have to first clean up the percentage column, as it contains non-numeric values, and then use pivot_table:
# Replace the ':' placeholders so the column can be cast to float
df.percentage = df.percentage.replace(':', 0).astype('float')
df1 = df.pivot_table(values="percentage",
                     index=["CountryCode", "question_label"],
                     columns=["subset", "answer"])
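Note that pivot_table defaults to aggfunc='mean', so if any (CountryCode, question_label, subset, answer) combination occurs more than once, those percentages get averaged. A quick sanity-check sketch to count such duplicates before pivoting:

# Count rows that duplicate an (index, columns) combination
dupes = df.duplicated(
    subset=['CountryCode', 'question_label', 'subset', 'answer']).sum()
print(dupes)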
My dataset looks like this:
Paste_Values AB_IDs AC_IDs AD_IDs
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2
AE-2182-4 AB-2182-6 AC-2182-7 AD-2182-5
I need to compare the value in the Paste_Values column with the other three values in the same row.
For example:
AE-1001-4 is split into two parts, AE and 1001-4, and we need to check whether 1001-4 is present in the other columns or not.
If it is not present, we need to create a new column and put the same AE-1001-4 there.
If 1001-4 matches one of the other columns, we need to change it to, say, 'AE-1001-5' and put that in the new column.
After:
If there is no match, I need to write the value from Paste_values as-is into a newly created column named new_paste_value.
If there is a match (the same value) in another column within the same row, then I need to change the last digit of the value from the Paste_values column so that the whole value no longer equals any other value in the row, and write that newly generated value into the new_paste_value column.
I need to do this with every row in the data frame.
So the result should look like:
Paste_Values AB_IDs AC_IDs AD_IDs new_paste_value
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2 AE-1001-4
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1 AE-1964-3
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2 AE-2211-4
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5 AE-2182-1
How can I do it?
Start by defining a function to be applied to each row of your DataFrame:
def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')  # First value
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1  # No match
    v1a = v1[3:-1]  # Central part of v1
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch
    return '????'  # No candidate found
A bit of explanation:
The row argument is actually a Series, with index values taken from the column names. So rr.pop('Paste_Values') removes the first value, which is saved in v1, while the rest remains in rr. Then v1[3:] extracts the "rest" of v1 (without "AE-"), and str.contains checks for each element of rr whether it contains this string at the end position (hence the $ anchor). With this explanation, the rest of the function should be quite understandable. If not, execute each individual instruction and print its result.
Then the only thing left to do is to apply this function to your DataFrame, assigning the result to a new column:
df['new_paste_value'] = df.apply(fn, axis=1)
To run a test, I created the following DataFrame:
import pandas as pd

df = pd.DataFrame(data=[
        ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2'],
        ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1'],
        ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2'],
        ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5']],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs'])
I received no error on this data, so perform a test on the data above. Maybe the source of your error is somewhere else? Perhaps your DataFrame also contains other (float) columns that you didn't include in your question. If that is the case, run my function on a copy of your DataFrame with these "other" columns removed.
I downloaded the OpenStreetMap data for a single city. I want to get the maximum and minimum values for latitude and longitude. How can I do that?
My approach is the following:
import osmread as osm
#...
def _parse_map(self):
    geo = [(entity.lat, entity.lon)
           for entity in osm.parse_file('map.osm')
           if isinstance(entity, osm.Node)]
    return max(geo[0]), min(geo[0]), max(geo[1]), min(geo[1])
But when I print these values, I don't think they are right. When I downloaded the area from the OpenStreetMap site, there was a white area indicating which region I was exporting, and for that area the site also showed minimum and maximum values for latitude and longitude. Those values don't fit with the ones I get from my simple script.
What am I doing wrong?
return max(geo[0]), min(geo[0]), max(geo[1]), min(geo[1])
You are taking the extrema of the first and second elements of geo. But geo is a list of 2-tuples, so the first element geo[0] is a 2-tuple consisting of entity.lat and entity.lon for the first node. Therefore you are just choosing the min/max of latitude and longitude for one node.
If you want to feed the first (or second) element of each tuple in the list to the aggregate function, then you have to select those elements specifically, for example with a generator expression:
return max(x[0] for x in geo), min(x[0] for x in geo), max(x[1] for x in geo), min(x[1] for x in geo)
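Equivalently, zip(*geo) unpacks the pairs into separate latitude and longitude tuples, so each aggregate runs over a plain tuple; a sketch of the corrected function:

def _parse_map(self):
    # Collect (lat, lon) for every Node in the file
    geo = [(entity.lat, entity.lon)
           for entity in osm.parse_file('map.osm')
           if isinstance(entity, osm.Node)]
    lats, lons = zip(*geo)  # unzip the pairs into two tuples
    return max(lats), min(lats), max(lons), min(lons)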