Best Way to Clean Up Cell Values in Pandas Dataframe - python

I'm pulling an RSS feed ('http://www.reddit.com/new/.rss?sort=new') and uploading it to a SQL database. From that URL I was able to create a pandas dataframe, which I will then upload to the database. The column names in the dataframe are title, link, summary, author and tags. What is the best way to clean up the summary column and get rid of all the HTML tags? A raw summary value looks like this:
'<!-- SC_OFF --><div class="md"><p>The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid. </p> <p>I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift. </p> <p>I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint. </p> <p>The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off.</p> </div><!-- SC_ON --> submitted by /u/Logical_penguin to r/running <br/> <span>[link]</span> <span>[comments]</span>'
I was able to use the line below to strip one portion:
df['summary'] = df['summary'].map(lambda x: x.lstrip('<!-- SC_OFF -->'))
However, this approach is too tedious to repeat for every tag in the summary column.

import re
df['summary'] = df['summary'].map(lambda x: re.sub('<[^<]+?>', '', x))
This removes all the HTML tags; the result for your example will be:
'The title says most of it, I’m running about a 12-13 min mile. I haven’t run in about 4.5 years and I need to get to my fastest 1.5 with more in the tank afterward, and I need it to be solid. I’ve read blogs and running guides, but I thought I’d get it from the source, people who just love to run, just like the way I used to love to lift. I guess my question is, where do I start? Some say football conditioning, others say just run… Some even say just walk. I’m trying to slim down fast and have a solid mile and a half to 2-mile sprint. The only other conditioning I’m doing right now is three days of fight sports (2 Krav/kickboxing, 1 combat fitness style). Looking at running 3ish days and taking Sunday off. submitted by /u/Logical_penguin to r/running [link] [comments]'
You can time re against lstrip/rstrip both on a single call with the same example string and on a dataframe of 10 identical rows. re is more powerful than lstrip/rstrip in this scenario; the comparison is only meant to give a sense of how much time re needs. Moreover, to save time, it is faster to run re directly over df.values (for example with a list comprehension) than to go through apply/map.
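A minimal sketch of that, assuming df['summary'] holds plain strings (the precompiled pattern and the list comprehension over df.values are my additions, not part of the original answer):
import re
import pandas as pd

df = pd.DataFrame({'summary': ['<p>example</p> text']})   # stand-in for the RSS dataframe

# Compile the tag pattern once instead of recompiling it per row
tag_re = re.compile(r'<[^<]+?>')

# Work over the underlying values rather than going through map/apply
df['summary'] = [tag_re.sub('', text) for text in df['summary'].values]

# pandas' vectorized string method is another option:
# df['summary'] = df['summary'].str.replace(r'<[^<]+?>', '', regex=True)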

Related

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug into a PRNG to produce a series of data points that are not only distributed like the original's, but also fall within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is OK, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
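For what it's worth, here is a rough sketch of option b) using scipy.stats: fit a handful of candidate distributions to a 1-D array of values, keep the closest fit, and sample from it. The candidate list and variable names are illustrative only.
import numpy as np
from scipy import stats

# amounts: stand-in for the real-world values pulled from the transactional data
amounts = np.random.lognormal(mean=1.0, sigma=0.5, size=10_000)

candidates = [stats.lognorm, stats.gamma, stats.expon]
best_dist, best_params, best_ks = None, None, np.inf
for dist in candidates:
    params = dist.fit(amounts)                           # maximum-likelihood fit
    ks_stat, _ = stats.kstest(amounts, dist.name, args=params)
    if ks_stat < best_ks:                                # smaller KS statistic = closer fit
        best_dist, best_params, best_ks = dist, params, ks_stat

# Draw synthetic values from whichever distribution fit best
synthetic = best_dist.rvs(*best_params, size=1_000)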
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: it might be consistent ‘within’ a produced dataset but not ‘across’ datasets, and I imagine on largish sets it would double-count quite a bit.
So my next thought was that I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I'm struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out which product to select. I thought maybe I could build some sort of bucketed histogram to work out the frequency of purchases within a range of costs (say $0-1, $1-2, etc.). I could then use that frequency to define the probability that a given transaction's cost falls within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy-to-understand tool (or at least one that’s documented in plain English :))?
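One minimal sketch of that bucketed lookup, purely illustrative (the product table, bin edges and stand-in costs below are placeholders, not from the real dataset):
import numpy as np
import pandas as pd

# Hypothetical persistent product lookup table
products = pd.DataFrame({
    "product_id": range(6),
    "unit_cost": [0.55, 0.80, 1.20, 1.85, 2.55, 3.39],
})

# Bucket the *real* transaction costs and turn the counts into probabilities
bins = [0, 1, 2, 3, 4, np.inf]
real_costs = np.array([0.5, 0.9, 1.1, 1.8, 2.5, 2.7, 3.4, 3.4])   # stand-in data
counts, _ = np.histogram(real_costs, bins=bins)
probs = counts / counts.sum()

# Tag each product with its bucket, then sample a bucket and a product within it
products["bucket"] = pd.cut(products["unit_cost"], bins=bins, labels=False)
rng = np.random.default_rng()

def sample_product():
    bucket = rng.choice(len(probs), p=probs)          # pick a cost range by observed frequency
    pool = products[products["bucket"] == bucket]     # assumes every chosen bucket has products
    return pool.sample(1, random_state=rng).iloc[0]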
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Backtesting a Universe of Stocks

I would like to develop a trend-following strategy by back-testing a universe of stocks; let's just say all NYSE or S&P 500 equities. I am asking this question today because I am unsure how to handle the storage/organization of the massive amount of historical price data.
After multiple hours of research I am here, asking for your experience. I would be extremely grateful for any information you can share on this topic.
Personal Experience background:
-I know how to code. I was an Electrical Engineering major, not a CS major.
-I know how to pull stock data for individual tickers into Excel.
-I'm familiar with using filtering and custom studies on ThinkOrSwim.
Applied Context:
From 1995 to today, let's evaluate the best-performing equities on a relative strength/momentum basis. We will look to compare many technical characteristics to develop a strategy. The key to this is having data for a universe of stocks that we can run backtests on using Python, C#, R, or any other language. We can then determine possible strategies by assessing the returns, the omega ratio, median excess returns, and Jensen's alpha (measured weekly) of technically driven entries and exits.
Here's where I am having trouble figuring out what the next step is:
-Loading data for all S&P 500 companies into a single Excel workbook is just not going to work; it's too much data for Excel to handle, and each ticker will have multiple MB of price data.
-What is the best way to get and then store the price data for each ticker in the universe? Are we looking at something like SQL or Microsoft Access here? I don't know; I don't have enough awareness on the subject of handling this much data. What are your thoughts?
I have used ToS in the past to filter stocks based on true/false parameters over a period of time; however, the capabilities of ToS are limited.
I would like a more flexible backtesting engine, like code written in Python or C#. Not sure if Rscript is of any use. Maybe there are libraries out there that I'm not aware of that would make this all possible? If there are, let me know.
I am aware that Quantopia and other web-based quant platforms are around. Are these my best bets for backtesting? Any thoughts on them?
Am I making this too complicated?
Backtesting a strategy on a single equity or several equities isn't a problem in Excel, ToS, or even TradingView. But with lots of data I'm not sure what the best option is for storing it and then using a Python script or something to perform the backtest.
Random final thought: ultimately I would like to explore some AI assistance with optimizing strategies that were created based on parameters. I know this is a thing, but I'm not sure where to learn more about it. If you do, please let me know.
Thank you guys. I hope this wasn't too much. If you can share any knowledge to increase my awareness on the topic I would really appreciate it.
Twitter:#b_gumm
The amount of data is too much for Excel or Calc. Even if you only want to screen the 500 stocks in the S&P 500, you end up with about 2.2 million rows (approx. 220 trading days/year * 20 years * 500 stocks). For this amount of data you should use a SQL database like MySQL; it is performant enough to handle it. But you have to work out an update strategy: if you download the complete time series daily and store it in the database, that process can take approx. 1 hour. You could also use delta downloads, but be aware of corporate actions (e.g. splits).
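As a rough sketch of that setup with pandas and SQLAlchemy, assuming a MySQL database and a daily CSV download (the connection string, file name, table and column names below are all made up):
import pandas as pd
import sqlalchemy as sa

# Placeholder connection string; requires the PyMySQL driver to be installed
engine = sa.create_engine("mysql+pymysql://user:password@localhost/marketdata")

# prices.csv is a stand-in for whatever the daily download produces
prices = pd.read_csv("prices.csv", parse_dates=["date"])

# Append the new rows; an index on (ticker, date) keeps later queries fast
prices.to_sql("daily_prices", engine, if_exists="append", index=False)

# Pull a single ticker back out for a backtest
spy = pd.read_sql(
    "SELECT date, close FROM daily_prices WHERE ticker = 'SPY' ORDER BY date",
    engine,
    parse_dates=["date"],
)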
I don't know Quantopia, but I do know a similar backtesting service for which I wrote a Python backtesting script last year. The outcome was quite different from what I expected: it turned out the service was calculating wrong results because of bad data. So be cautious about the results.

FORE! Choosing a data type for my horrendous golf game

I've started learning Python and decided to give myself a golf-related project to work on. My question revolves around choosing the best data type to use. Now, I know the answer to this depends on requirements, but that isn't helping me.
Besides simple data like name, date, name of course, etc., I'll also be generating 9- and 18-hole scores for multiple players in my local society.
While keeping a historical record of past scores is nice, I may want to perform some analytics across my dataset to find handicaps, hardest holes, etc. And, yes, I know there are apps out there already; I'm doing this to learn. ;)
So... which data structure should I use? Lists, dictionaries, numpy arrays, objects, or a combination?
Many thanks!
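Purely as an illustrative sketch of one possible combination (the class and field names are invented, not prescribed anywhere above): a small dataclass per round, collected in a plain list, which can later be loaded into pandas for the analytics part.
from dataclasses import dataclass
from datetime import date

@dataclass
class Round:
    player: str
    course: str
    played_on: date
    holes: list            # per-hole scores, 9 or 18 integers

    def total(self):
        return sum(self.holes)

rounds = [
    Round("Alice", "St Andrews", date(2024, 5, 4), [5, 4, 6, 3, 5, 4, 5, 4, 4]),
]

# Later, for analytics across the whole history:
# import pandas as pd
# df = pd.DataFrame([{**vars(r), "total": r.total()} for r in rounds])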

Parsing specific content using BeautifulSoup

I am looking to extract fantasy football information from a website.
I can write enough code to get the following output but all I really want is the following information:
"fullName":"Justin Forsett"
"pointsSEASON":75
Can anyone help explain how to isolate these items and write them to, for example, a csv file?
[<div class="mod-content" id="fantasy-content">{"averagePoints":9.4,"percentOwned":98.6,"pointsSEASON":75,"seasonOutlook":{"outlook":"Forsett finished 2014 as fantasy's No. 8 RB, so why aren't we higher on him? Well, it's difficult to reconcile what we know about his size (5-8, 197), age (30 in October) and career with the 1,529 scrimmage yards he racked up as Baltimore's surprise starter. Forsett had never even eclipsed 1,000 total yards in any of his six previous seasons. Yet his quickness and vision were consistently excellent last year, and new OC Marc Trestman loves throwing to RBs. Lorenzo Taliaferro and rookie Javorius Allen loom as heftier options, and some kind of rotation could develop. But Forsett will get the benefit of the doubt in Week 1.","seasonId":2015,"date":"Wed May 20"},"positionRank":18,"playerId":11467,"percentChange":-0.2,"averageDraftPosition":42.5,"fullName":"Justin Forsett","mostRecentNews":{"news":null,"spin":"The Jaguars have allowed the second-fewest yards per carry (3.4) in the league, but have ceded one rushing score per game in the process. Forsett will need a good deal of volume to overcome a quietly tough matchup, but we're trusting the workload will be enough.","date":"Tue Nov 10"},"totalPoints":75,"projectedPoints":13.957546548,"projectedDifference":4.582546548}</div>]
It looks like the text of the tag you are looking for is JSON. You have successfully gotten the div tag, but now you have to extract the JSON and then pull out the fields you want. Here is what you will need to add to your code:
import json
raw_json_string = {originaltag}.get_text()   # {originaltag} is the div tag you already found
data = json.loads(raw_json_string)           # parse the JSON text into a Python dict
print(data['fullName'])
print(data['pointsSEASON'])
{originaltag} is the tag you printed above. Since you didn't show your code, I couldn't run it; instead I ran the following:
string = '{"averagePoints":9.4,"percentOwned":98.6,"pointsSEASON":75,"seasonOutlook":{"outlook":"Forsett finished 2014 as fantasys No. 8 RB, so why arent we higher on him? Well, its difficult to reconcile what we know about his size (5-8, 197), age (30 in October) and career with the 1,529 scrimmage yards he racked up as Baltimores surprise starter. Forsett had never even eclipsed 1,000 total yards in any of his six previous seasons. Yet his quickness and vision were consistently excellent last year, and new OC Marc Trestman loves throwing to RBs. Lorenzo Taliaferro and rookie Javorius Allen loom as heftier options, and some kind of rotation could develop. But Forsett will get the benefit of the doubt in Week 1.","seasonId":2015,"date":"Wed May 20"},"positionRank":18,"playerId":11467,"percentChange":-0.2,"averageDraftPosition":42.5,"fullName":"Justin Forsett","mostRecentNews":{"news":null,"spin":"The Jaguars have allowed the second-fewest yards per carry (3.4) in the league, but have ceded one rushing score per game in the process. Forsett will need a good deal of volume to overcome a quietly tough matchup, but were trusting the workload will be enough.","date":"Tue Nov 10"},"totalPoints":75,"projectedPoints":13.957546548,"projectedDifference":4.582546548}'
s = json.loads(string)
print(s['fullName'])
print(s['pointsSEASON'])
And got this output
Justin Forsett
75
Edited to add: Here is information on writing to a csv file.
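A minimal sketch with the standard library, assuming you have collected (name, points) pairs into a list (the rows variable below is a stand-in for whatever you scrape per player):
import csv

rows = [("Justin Forsett", 75)]   # stand-in for the values extracted from each div

with open("players.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fullName", "pointsSEASON"])   # header row
    writer.writerows(rows)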

Data modeling/forecasting in Python?

How do I go about this problem I've thought up? I don't even know if it's possible in Python, but anyway. Basically, I want to give Python some data, have it look for patterns, and then display the most likely result. I thought pocket money would make a good example:
A gets $7 a week, a dollar a day; he spends $5 at the weekend, so his balance on Monday is $2 + $1 ($7 a week, $1 a day). This continues for three weeks. What will A get in the fourth week?
Balance in account:
week1 = 1,2,3,4,5,6,7
week2 = 3,4,5,6,7,8,9
week3 = 5,6,7,8,9,10,11
week4 = ??????????
Now, instead of basic math, I was wondering if it is possible to create a model, one that looks for patterns and then creates new data using the existing data and those patterns. The script should be able to see that A gets $7 a week and loses $5 a week. Is this possible?
Can the model be flexible, as in: if I give it other data of the same nature, will it be able to find patterns?
(I'm using Python 3.2.)
What you're describing is classified as a regression problem. Armed with this term, you should be able to find lots of information online and in books.
To point you in the right direction, I'd suggest investigating "linear regression" and perhaps moving on to a more complex regression model like the "random forests regressor".
The "scikit-learn" Python package has a number of different regression models and the documentation is excellent.
