Situation
Our company generates waste from various locations in US. The waste is taken to different locations based on suppliers' treatment methods and facilities placed nationally.
Consider a waste stream A which is being generated from location X. Now the overall costs to take care of Stream A includes Transportation cost from our site as well treatment method. This data is tabulated.
What I want to achieve
I would like my python program to import excel table containing this data and plot the distance between our facility and treatment facility and also show in a hub-spoke type picture just like airlines do as well show data regarding treatment method as a color or something just like on google maps.
Can someone give me leads on where should I start or which python API or module that might best suite my scenario?
This is a rather broad question and perhaps not the best for SO.
Now to answer it: you can read excel's csv files with the csv module. Plotting is best done with matplotlib.pyplot.
Related
I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that not are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: It might be consistent ‘within’ a produced dataset, but not ‘across’ data sets. And I imagine on largish sets would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but Im struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, 1-2$ etc). I could then use that frequency to define the probability that a given transaction's cost would fall within one those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
Trying to figure out a good way of solving this problem but wanted to ask for the best way of doing this.
In my project, I am looking at multiple instrument note pairs for a neural network. The only problem is that there are multiple instruments with the same name and just because they have the same name doesn't mean that they are the same instrument 100% of the time. (It should be but I want to be sure.)
I personally would like to analyze the instrument itself (like metadata on just the instrument in question) and not the notes associated with it. Is that possible?
I should also mention that I am using pretty-midi to collect the musical instruments.
In MIDI files, bank and program numbers uniquely identity instruments.
In General MIDI, drums are on channel 10 (and, in theory, should not use a Program Change message).
In GM2/GS/XG, the defaults for drums are the same, but can be changed with bank select messages.
I would like to develop a trend following strategy via back-testing a universe of stocks; lets just say all NYSE or S&P500 equities. I am asking this question today because I am unsure how to handle the storage/organization of the massive amounts of historical price data.
After multiple hours of research I am here, asking for your experience and awareness. I would be extremely grateful for any information/awareness you can share on this topic
Personal Experience background:
-I know how to code. Was a Electrical Engineering major, not a CS major.
-I know how to pull in stock data for individual tickers into excel.
Familiar with using filtering and custom studies on ThinkOrSwim.
Applied Context:
From 1995 to today lets evaluate the best performing equities on a relative strength/momentum basis. We will look to compare many technical characteristics to develop a strategy. The key to this is having data for a universe of stocks that we can run backtests on using python, C#, R, or any other coding language. We can then determine possible strategies by assesing the returns, the omega ratio, median excess returns, and Jensen's alpha (measured weekly) of entries and exits that are technical driven.
Here's where I am having trouble figuring out what the next step is:
-Loading data for all S&P500 companies into a single excel workbook is just not gonna work. Its too much data for excel to handle I feel like. Each ticker is going to have multiple MB of price data.
-What is the best way to get and then store the price data for each ticker in the universe? Are we looking at something like SQL or Microsoft access here? I dont know; I dont have enough awareness on the subject of handling lots of data like this. What are you thoughts?
I have used ToS to filter stocks based off of true/false parameters over a period of time in the past; however the capabilities of ToS are limited.
I would like a more flexible backtesting engine like code written in python or C#. Not sure if Rscript is of any use. - Maybe, there are libraries out there that I do not have awareness of that would make this all possible? If there are let me know.
I am aware that Quantopia and other web based Quant platforms are around. Are these my best bets for backtesting? Any thoughts on them?
Am I making this too complicated?
Backtesting a strategy on a single equity or several equities isnt a problem in excel, ToS, or even Tradingview. But with lots of data Im not sure what the best option is for storing that data and then using a python script or something to perform the back test.
Random Final thought:-Ultimately would like to explore some AI assistance with optimizing strategies that were created based off parameters. I know this is a thing but not sure where to learn more about this. If you do please let me know.
Thank you guys. I hope this wasn't too much. If you can share any knowledge to increase my awareness on the topic I would really appreciate it.
Twitter:#b_gumm
The amout of data is too much for EXCEL or CALC. Even if you want to screen only 500 Stocks from S&P 500, you will get 2,2 Millions of rows (approx. 220 days/year * 20 years * 500 stocks). For this amount of data, you should use a SQL Database like MySQL. It is performant enough to handle this amount of data. But you have to find a way for updating. If you get the complete time series daily and store it into your database, this process can take approx. 1 hour. You could also use delta downloads but be aware of corporate actions (e.g. splits).
I don't know Quantopia, but I know a similar backtesting service where I have created a python backtesting script last year. The outcome was quite different to what I have expected. The research result was that the backtesting service was calculating wrong results because of wrong data. So be cautious about the results.
I have been doing some investigation to find a package to install and use for Geospatial Analytics
The closest I got to was https://github.com/harsha2010/magellan - This however has only scala interface and no doco how to use it with Python.
I was hoping if you someone knows of a package I can use?
What I am trying to do is analyse Uber's data and map it to the actual postcodes/suburbs and run it though SGD to predict the number of trips to a particular suburb.
There is already lots of data info here - http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/#comment-606532 and I am looking for ways to do it in Python.
In Python I'd take a look at GeoPandas. It provides a data structure called GeoDataFrame: it's a list of features, each one having a geometry and some optional attributes. You can join two GeoDataFrames together based on geometry intersection, and you can aggregate the numbers of rows (say, trips) within a single geometry (say, postcode).
I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
Likewise postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc, and coerced into a GeoDataFrame.
Join #1 to #2 based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another StackOverflow post discusses how do to this, and it's currently harder than it ought to be.
Aggregate this by postcode and count the trips in each. The code will look like joined_dataframe.groupby('postcode').count().
My fear for the above process is if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!) but I'm not the person to help you with this component.
Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/
I realize this is an old questions, but to build on Jeff G's answer.
If you arrive at this page looking for help putting together a suite of geospatial analytics tools in python - I would highly recommend this tutorial.
https://geohackweek.github.io/vector
It really picks up steam in the 3rd section.
It shows how to integrate
GeoPandas
PostGIS
Folium
rasterstats
add in scikit-learn, numpy, and scipy and you can really accomplish a lot. You can grab information from this nDarray tutorial as well
I'm new to data mining and experimenting a bit.
Let's say I have N twitter users and what I want to find
is the overall theme they're writing about (based on tweets).
Then I want to give higher weight to each theme if that user has higher followers.
Then I want to merge all themes if there're similar enough but still
retain the weighting by twitter count.
So basically a list of "important" themes ranked by authority (user's twitter count)
For instance, like news.google.com but ranking would be based on twitter followers that are responsible for theme.
I'd prefer something in python since that's the language I'm most familiar with.
Any ideas?
Thanks
EDIT:
Here's a good example of what I'm trying to do (but with diff data)
http://www.facebook.com/notes/facebook-data-team/whats-on-your-mind/477517358858
Basically analyzing various data and their correlation to each other: work categories and each persons age or word categories and friend count as in this example.
Where would I begin to solve this and generate such graphs?
Generally speaking : R has some packages specifically directed at text mining and datamining, offering a wide range of techniques. I have no knowledge of that kind of packages in Python, but that doesn't mean they don't exist. I just wouldn't implement it all myself, it's a bit more complicated than it looks at first sight.
Some things you have to consider :
define "theme" : Is that the tags they use? Do you group tags? Do you have a small list with a limited set, or is the set unlimited?
define "general theme" : Is that the most used theme? How do you deal with ties? If a user writes about 10 themes about as much, what then?
define "weight" : Is that equivalent to the number of users? The square root? Some category?
If you have a general idea about this, you can start using the tm package for extracting all the information in a workable format. The package is based on matrices, and metadata objects. These allow you to get weighted frequencies for the different themes, provided you have defined what you consider a theme. You can also use different weighting functions to obtain what you want. The manual is here. But please also visit crossvalidated.com for extra guidance if you're not sure about what you're doing. This is actually more a question about data mining than it is about programming.
I have no specific code but I believe the methodology you want to employ is TF-IDF. It is explained here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf and is used quote often classify text.