I have been doing some investigation to find a package to install and use for Geospatial Analytics
The closest I got to was https://github.com/harsha2010/magellan - This however has only scala interface and no doco how to use it with Python.
I was hoping if you someone knows of a package I can use?
What I am trying to do is analyse Uber's data and map it to the actual postcodes/suburbs and run it though SGD to predict the number of trips to a particular suburb.
There is already lots of data info here - http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/#comment-606532 and I am looking for ways to do it in Python.
In Python I'd take a look at GeoPandas. It provides a data structure called GeoDataFrame: it's a list of features, each one having a geometry and some optional attributes. You can join two GeoDataFrames together based on geometry intersection, and you can aggregate the numbers of rows (say, trips) within a single geometry (say, postcode).
I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
Likewise postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc, and coerced into a GeoDataFrame.
Join #1 to #2 based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another StackOverflow post discusses how do to this, and it's currently harder than it ought to be.
Aggregate this by postcode and count the trips in each. The code will look like joined_dataframe.groupby('postcode').count().
My fear for the above process is if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!) but I'm not the person to help you with this component.
Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/
I realize this is an old questions, but to build on Jeff G's answer.
If you arrive at this page looking for help putting together a suite of geospatial analytics tools in python - I would highly recommend this tutorial.
https://geohackweek.github.io/vector
It really picks up steam in the 3rd section.
It shows how to integrate
GeoPandas
PostGIS
Folium
rasterstats
add in scikit-learn, numpy, and scipy and you can really accomplish a lot. You can grab information from this nDarray tutorial as well
Related
Let's say I want to create maps of crime, education, traffic or etc on a street or a city. What are the modules I need to learn or the best ones?
For some data, I will be using excell like documents where I will have street names or building numbers unlinked to Google Maps directly and will be combined later through codes. For some, I want to obtain data directly from Google Maps, such as names of the stores or street numbers. I'm a beginner and a sociologist and this is the main reason I want to learn programming. Maybe painting on a map picture can be a lot easier but on the long term my aim is using Google Maps since it can obtain data by itself. Thanks in advance.
I'm a beginner, need a long shot plan and an advice. I watched some numpy and pandas videos and they seem ok and doable so far.
There are several Python modules that can be used to work with Google Maps data. Some of the most popular ones include:
Google Maps API: This is the official API for working with Google Maps data. It allows you to access a wide range of data, including street maps, satellite imagery, and places of interest. You can use the API to search for addresses, get directions, and even create custom maps.
gmaps: This is a Python wrapper for the Google Maps API. It makes it easy to work with the API by providing a simple, Pythonic interface. It also includes support for several popular Python libraries, such as Pandas and Numpy.
folium: This is a library for creating leaflet maps in Python. It allows you to create interactive maps, add markers and other data, and customize the appearance of your maps.
geopandas: This library allows you to work with geospatial data in Python. It is built on top of the popular Pandas library and includes support for working with shapefiles, geojson, and more.
geopy: This is a Python library for working with geocoding and distance calculations. It can be used to convert addresses to latitude and longitude coordinates, as well as to perform distance calculations between two points.
In general, it's recommended to start with Google Maps API and gmaps and folium, you can also use geopandas and geopy later when you need more advanced functionalities. Try to start with simple examples and gradually increase the complexity of your projects.
Situation
Our company generates waste from various locations in US. The waste is taken to different locations based on suppliers' treatment methods and facilities placed nationally.
Consider a waste stream A which is being generated from location X. Now the overall costs to take care of Stream A includes Transportation cost from our site as well treatment method. This data is tabulated.
What I want to achieve
I would like my python program to import excel table containing this data and plot the distance between our facility and treatment facility and also show in a hub-spoke type picture just like airlines do as well show data regarding treatment method as a color or something just like on google maps.
Can someone give me leads on where should I start or which python API or module that might best suite my scenario?
This is a rather broad question and perhaps not the best for SO.
Now to answer it: you can read excel's csv files with the csv module. Plotting is best done with matplotlib.pyplot.
I have a large collection (and growing) of geospatial data (lat, lon) points (stored in mongodb, if that helps).
I'd like to generate a choropleth map (http://vis.stanford.edu/protovis/ex/choropleth.html), which requires knowing the state which contains that point. Is there a database or algorithm that can do this without requiring call to external APIs (i.e. I'm aware of things like geopy and the google maps API).
Actually, the web app you linked to contains the data you need -
If you look at http://vis.stanford.edu/protovis/ex/us_lowres.js for each state, borders[] contains a [lat,long] polyline which outlines the state. Load this data and check for point-in-polygon - http://en.wikipedia.org/wiki/Point_in_polygon
Per Reverse Geocoding Without Web Access you can speed it up a lot by pre-calculating a bounding box on each state and only testing point-in-polygon if point-in-bounding-box.
Here's how to do it in FORTRAN. Remember FORTRAN? Me neither. Anyway, it looks pretty simple, as every state has its own range.
EDIT It's been point out to me that your starting point is LAT-LONG, not the zipcode.
The algorithm for converting a lat-long to a political division is called "a map". Seriously, that's allan ordinary map is, a mapping of every point in some range to the division it belongs to. A detailed digital map of all 48 contiguous states would be a big database, and then you would need some (fairly simple) code to determine for each state (described as a series of line segments outlining the border) whether a given point was inside it or out.
you can try using Geonames database. It has long/lat as well as city, postal and other location type data. It is free as well.
but If you need to host it locally or import it into your own database , the USGS and NGA provide a comprehensive list of cities with lat/lon. It's updated reguarly, free, and reliable.
http://geonames.usgs.gov
http://earth-info.nga.mil/gns/html/index.html
Not sure the quality of the data, but give this a shot: http://www.boutell.com/zipcodes/
If you don't mind a very crude solution, you could adapt the click-map here.
I'm new to data mining and experimenting a bit.
Let's say I have N twitter users and what I want to find
is the overall theme they're writing about (based on tweets).
Then I want to give higher weight to each theme if that user has higher followers.
Then I want to merge all themes if there're similar enough but still
retain the weighting by twitter count.
So basically a list of "important" themes ranked by authority (user's twitter count)
For instance, like news.google.com but ranking would be based on twitter followers that are responsible for theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with diff data)
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically analyzing various data and their correlation to each other: work categories and each persons age or word categories and friend count as in this example.
Where would I begin to solve this and generate such graphs?
Generally speaking : R has some packages specifically directed at text mining and datamining, offering a wide range of techniques. I have no knowledge of that kind of packages in Python, but that doesn't mean they don't exist. I just wouldn't implement it all myself, it's a bit more complicated than it looks at first sight.
Some things you have to consider :
define "theme" : Is that the tags they use? Do you group tags? Do you have a small list with a limited set, or is the set unlimited?
define "general theme" : Is that the most used theme? How do you deal with ties? If a user writes about 10 themes about as much, what then?
define "weight" : Is that equivalent to the number of users? The square root? Some category?
If you have a general idea about this, you can start using the tm package for extracting all the information in a workable format. The package is based on matrices, and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than it is about programming.
I have no specific code but I believe the methodology you want to employ is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quote often classify text.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 months ago.
Improve this question
I am working on an application where one of the requirements is that I be able to perform realtime reverse geocoding operations based on GPS data. In particular, I must be able to determine the state/province to which a latitude, longitude pair maps and detect when we have moved from one state/province to another.
I have a couple ideas so far but wondered if anyone had any ideas on either of the following:
What is the best approach for tackling this problem in an efficient manner?
Where is a good place to find and what is the appropriate format for North American state/province boundaries
As a starter, here are the two main ideas I have:
Break North America into a grid with each rectangle in the grid mapping to a particular state province. Do a lookup on this table (which grows quickly the more precise you would like to be) based on the latitude and then the longitude (or vice versa).
Define polygons for each of the states and do some sort of calculation to determine in which polygon a lat/lon pair lies. I am not sure exactly how to go about this. HTML image maps come to mind as one way of defining the bounds for a state/province.
I am working in python for the interested or those that might have a nice library they would like to suggest.
To be clear... I do not have web access available to me, so using an existing reverse geocoding service is not an option at runtime
I created an offline reverse geocoding module for countries: https://github.com/richardpenman/reverse_geocode
>>> import reverse_geocode
>>> coordinates = (-37.81, 144.96), (31.76, 35.21)
>>> reverse_geocode.search(coordinates)
[{'city': 'Melbourne', 'code': 'AU', 'country': 'Australia'},
{'city': 'Jerusalem', 'code': 'IL', 'country': 'Israel'}]
I will see if I can add data for states.
I suggest using a variant of your first idea: Use a spatial index. A spatial index is a data structure built from rectangles, mapping lat/long to the payload. In this case you will probably map rectangles to state-province pairs. An R-tree may be a good option. Here's an R-tree python package. You could detect roaming by comparing the results of consecutive searches.
I would stay away from implementing your own solution from scratch. This is a pretty big undertaking and there are already tools out there to do this. If you're looking for an open source approach (read: free), take a look at this blog post: Using PostGIS to Reverse Geocode.
If you can get hold of state boundaries as polygons (for example, via OpenStreetMap), determining the current state is just a point-in-polygon test.
If you need address data, an offline solution would be to use Microsoft Mappoint.
You can get data for the entire united states from open street map You could then extract the data you need such as city or state locations into what ever format works best for your application. Note although data quality is good it isn't guaranteed to be completely accurate so if you need complete accuracy you may have to look somewhere else.
I have a database with all of this data and some access tools. I made mine from the census tiger data. I imagine it'd basically be an export of my database to sqlite and a bit of code translation.
The free reverse geocoding service I developed (www.feroeg.com) is based on spatialite, a sqlite library implementing SQL spatial capabilities (r-tree).
The data are imported from OpenStreetMap (nation, cities, streets,street number) and OpenAddresses (street numbers) using proprietary tools.
The entire world consumes about 250GB.
There is a paper describing the architecture of the service:
https://feroeg.com/Feroeg_files/Feroeg Presentation.pdf
At the moment the project (importer and server) is closed source.
Reverse Geocoding Library (C++) and converting tools are availabile on request.