Data Scrubbing using python


I am cleaning a data set that has a timestamp as the first column, and prices and volumes as the next two columns. I am trying to remove bad rows using several rules and then output one file with all the bad ticks and another file with all the good ticks. Every rule seems to work except the one where I remove duplicates. I end up with fewer rows than I started with:
from datetime import datetime
to_sort = list()
noise= list()
line_counter=0
with open("hi.txt", 'r') as f:
    for line in f:
        #splitting the lines using the delimiter comma
        splitted_line = line.strip().split(",")
        #parsing the time using the strptime function from the datetime library
        date1 = datetime.strptime(splitted_line[0],'%Y%m%d:%H:%M:%S.%f')
        #creating columns for volume and price
        price = float(splitted_line[1])
        volume = int(splitted_line[2])
        #creating a tuple using date as first column, price as second and volume as third
        my_tuple=(date1,price,volume)
        #EDA shows that the prices are between 0 and 3000 and volume must be greater than zero
        if price > 0 and price<3000 and volume >0:
            to_sort.append(my_tuple)
        else:
            noise.append(my_tuple)
        line_counter +=1
        if line_counter %13==0:
            #removing duplicates using the set function
            sorted_signal=sorted(set(to_sort))
            with open("true.txt","a") as s:
                for line in sorted_signal:
                    s.write(str(line[0])+","+ str(line[1])+","+str(line[2])+"\n")
            to_sort=list()
            with open("noise.txt","a") as n:
                for line in noise:
                    n.write(str(line[0])+","+ str(line[1])+","+str(line[2])+"\n")
            noise=list()
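Part of the drop in row count is expected, since set() discards exact duplicates by design. Two other things in the listing above may also lose rows, depending on how the original was indented: the files are only written when line_counter is a multiple of 13, so any rows left over after the last full batch are never written, and duplicates are only detected within one batch. The following is a minimal sketch, not a confirmed fix from the thread. It assumes the whole file fits in memory, and the function name scrub and the sample ticks are made up for illustration.

```python
from datetime import datetime
import io

def scrub(lines):
    """Split tick lines into (good, bad); exact duplicate good ticks are dropped."""
    good, bad, seen = [], [], set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        stamp, price, volume = line.split(",")
        tick = (datetime.strptime(stamp, "%Y%m%d:%H:%M:%S.%f"),
                float(price), int(volume))
        # same rule as the question: 0 < price < 3000 and volume > 0
        if 0 < tick[1] < 3000 and tick[2] > 0:
            if tick not in seen:  # de-duplicate across the whole input
                seen.add(tick)
                good.append(tick)
        else:
            bad.append(tick)
    return sorted(good), bad

# Hypothetical sample data in the question's timestamp format.
sample = io.StringIO(
    "20140101:09:30:00.000100,1500.25,10\n"
    "20140101:09:30:00.000100,1500.25,10\n"   # exact duplicate
    "20140101:09:30:00.000050,1500.00,5\n"
    "20140101:09:30:01.000000,-1.0,3\n"       # bad price
)
good, bad = scrub(sample)
print(len(good), "good ticks,", len(bad), "bad ticks")
```

With this layout, the number of good ticks plus bad ticks plus dropped duplicates always equals the number of input rows, which makes the counts easy to check.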

Related

How to arrange data week wise csv file in python

I have generated a CSV file in the format shown in the image below:
In this image the data is arranged week by week, but in some places I couldn't arrange it that way. If you look at the image, you will see a red mark and a blue mark. I want to separate these two marks. How can I do that?
Note: If there is a holiday on Friday, the week should run from Monday to Thursday.
Image: Please click here to see image
Current logic:
import csv
blank_fields=[' ']
fields=[' ','Weekly Avg']
# Read csv file
file1 = open('test.csv', 'rb')
reader = csv.reader(file1)
new_rows_list = []
# Read data row by row and store into new list
for row in reader:
    new_rows_list.append(row)
    if 'Friday' in row:
        new_rows_list.append(fields)
file1.close()

Overall you are heading in the right direction, but your condition is a little too error-prone, and things can get worse (e.g., when only one day of a week appears in your list). So testing for the weekday string isn't the best choice here.
I would suggest "understanding" the dates in your table and solving this with weekdays, like this:
from datetime import datetime as dt, timedelta as td
# remember the date of the previous row
last = None
# each item in 'weeks' represents one week, while _cur_week is a temp-var
weeks = []
_cur_week = []
for row in reader:
    # assuming the date is in row[1] (split, convert to int, pass as args)
    _cur_date = dt(*map(int, row[1].split("-")))
    # weekday() will be in [0 .. 6]
    # now if _cur_date.weekday() <= last.weekday(), a week is complete
    # (also catching the corner case of 7 or more days between two entries)
    if last and (_cur_date.weekday() <= last.weekday() or (_cur_date - last) >= td(days=7)):
        # append collected rows to 'weeks', empty _cur_week for next rows
        weeks.append(_cur_week)
        _cur_week = []
    # regular invariant, append row and set 'last' to '_cur_date'
    _cur_week.append(row)
    last = _cur_date
# the loop never sees a "next week" after the final rows, so flush them here
if _cur_week:
    weeks.append(_cur_week)
Pretty verbose and extensive, but I hope it conveys the pattern used here:
parse the existing date and use the weekday to distinguish one week from another (i.e., the weekday increases monotonically within a week, so any decrease or equality tells you the current date belongs to the next week).
store rows in a temporary list during one week
append _cur_week to weeks once the condition for the next week is triggered
empty _cur_week for the rows of the next week
Finally, the last thing to do is to "concat" the data back into one flat list, adding the 'Weekly Avg' row after each week, e.g. like this:
new_rows_list = [row for week in weeks for row in week + [fields]]
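The weekday pattern described above can be tried without a CSV file. This is a minimal sketch under a few assumptions: the rows are in chronological order, dates are formatted as YYYY-MM-DD, and the date sits in the column given by date_index. The function name split_into_weeks and the sample rows are made up for illustration.

```python
from datetime import date, timedelta

def split_into_weeks(rows, date_index=0):
    """Group chronologically ordered rows into one list per calendar week."""
    weeks, current, last = [], [], None
    for row in rows:
        cur = date(*map(int, row[date_index].split("-")))
        # a weekday that does not increase, or a gap of 7+ days, starts a new week
        if last is not None and (cur.weekday() <= last.weekday()
                                 or cur - last >= timedelta(days=7)):
            weeks.append(current)
            current = []
        current.append(row)
        last = cur
    if current:  # do not forget the final, still open week
        weeks.append(current)
    return weeks

sample = [
    ["2017-03-01", "10"],  # Wednesday
    ["2017-03-02", "11"],  # Thursday (no Friday row, e.g. a holiday)
    ["2017-03-06", "12"],  # Monday
    ["2017-03-10", "13"],  # Friday
    ["2017-03-13", "14"],  # Monday
]
weeks = split_into_weeks(sample)
for week in weeks:
    print([row[0] for row in week])
```

Because the split depends only on the dates, a week that ends on Thursday because of a Friday holiday is handled the same way as a full week.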

I have another approach to the same problem. It worked for me and is a simpler solution.
import csv
import datetime
fields={'':' ', 'Date':'Weekly Avg'}
#Read csv file
file1 = open('temp.csv', 'rb')
reader = csv.DictReader(file1)
new_rows_list = []
last_date = None
# Read data row by row and store into new list
for row in reader:
    cur_date = datetime.datetime.strptime(row['Date'], '%Y-%m-%d').date()
    if last_date and ((last_date.weekday() > cur_date.weekday()) or (cur_date.weekday() == 0)):
        new_rows_list.append(fields)
    last_date = cur_date
    new_rows_list.append(row)
file1.close()

AttributeError: 'list' object has no attribute 'strftime' python 3 csv - Socratica

I'm a newbie and I'm using Socratica to get my feet wet. I love the channel and I've made it to the video "CSV Files in Python || Python Tutorial || Learn Python Programming". I copied and pasted my notes below.
################################## C.omma S.eparated V.alue ################################################
# https:// ten years of googles stock prices to practise manipulating data.
# Most small amounts of data are stored in a spread sheet and larger amounts of data are stored in databases.
# However CSV's have a place. They are easy, no drivers or api's are needed.
# Header - first entry of data in the list
# Data - all the rest of the data
# - Each row of data (like a record in a database) is separated by commas and everything is a string.
# - Two commas in a row ',,' means there is a missing piece of data.
import time
import csv
from datetime import datetime
path = (r"C:\Users\mexib\Socratica Learning\Google Stock Market Data - google_stock_data.csv.csv")
file = open(path, newline='')
# ^ adding empty brackets makes this keyword arg(kwarg) universal for different pc's
reader = csv.reader(file) # Using the reader function here parse the data
header = next(reader) # The first line is the header so we use the next function to get to the data.
# at this point the list has been parsed but each element of this list is still a string.
data = []
for row in reader:
    # row = [Date, Open, High, Low, Close, Volume, Adj. Close]
    date = datetime.strptime(row[0], '%m/%d/%Y')
    open_price = float(row[1]) # 'open' is a builtin function
    high = float(row[2])
    low = float(row[3])
    close = float(row[4])
    volume = int(row[5])
    adj_close = float(row[6])
    data.append([data, open_price, high, low, close, volume, adj_close])
# Compute the daily stock return which is percent change in price
returns_path = (r"C:\Users\mexib\Socratica Learning\google_returns.csv") # Writes a file called google_returns.csv
file = open(returns_path, 'w') # The 'w' opens this file in write mode
writer = csv.writer(file) # writer is an object that stores the results of our next computations
writer.writerow(["Date","Return"]) # To write a row we use the writerow method and give it a list of values; this one is the header
# Since this list is in chronological order we can loop through the data
for i in range(len(data) - 1):
    todays_row = data[i]
    todays_date = todays_row[0]
    todays_price = todays_row[-1]
    yesterdays_row = data[i+1]
    yesterdays_price = yesterdays_row[-1]
    daily_return = (todays_price - yesterdays_price)/ yesterdays_price
    formatted_date = todays_date.datetime.strftime('%m/%d/%Y')
    writer.writerow([formatted_date, daily_return])
#for i in range(len(data) - 1): # we stopped at the second to last row because the first row has no previous day to compare
I get the error AttributeError: 'list' object has no attribute 'datetime'. I've tried a couple of things and nothing seems to change that outcome. For the most part Socratica has been an awesome resource.
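Two lines in the listing above look like the likely cause. This is a reading of the posted code, not a confirmed answer from the thread. First, data.append([data, ...]) stores the list data itself in the first column, where the parsed date was presumably intended, so todays_date ends up being a list. Second, strftime is a method of the datetime object itself, so todays_date.strftime(...) is enough and the extra .datetime is not needed. A minimal sketch with made-up prices, using an in-memory file in place of the real CSV:

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the same column layout, newest row first.
raw = io.StringIO(
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "01/03/2017,102.0,103.0,101.0,102.5,1000,110.0\n"
    "01/02/2017,100.0,101.0,99.0,100.5,1200,100.0\n"
)
reader = csv.reader(raw)
header = next(reader)  # skip the header row

data = []
for row in reader:
    date = datetime.strptime(row[0], '%m/%d/%Y')
    prices = [float(x) for x in row[1:5]]  # open, high, low, close
    # store the parsed date (not the list 'data') as the first element
    data.append([date] + prices + [int(row[5]), float(row[6])])

returns = []
for i in range(len(data) - 1):
    todays_date, todays_price = data[i][0], data[i][-1]
    yesterdays_price = data[i + 1][-1]
    daily_return = (todays_price - yesterdays_price) / yesterdays_price
    # strftime is called directly on the datetime object
    returns.append([todays_date.strftime('%m/%d/%Y'), daily_return])

print(returns)
```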

Arrays' Length exploding when appending Values to it from CSV

Below is some code written to open a CSV file. Its values are stored like this:
03/05/2017 09:40:19,21.2,35.0
03/05/2017 09:40:27,21.2,35.0
03/05/2017 09:40:38,21.1,35.0
03/05/2017 09:40:48,21.1,35.0
This is just a snippet of the code I use in a real-time plotting program. The program fully works, but the arrays grow very large, which is unclean. Normally new values get added to the CSV while the program is running, so the arrays become very long. Is there a way to avoid exploding arrays like this?
To see the problem, create a CSV with the values above and run the program.
from datetime import datetime
import time
y = [] #temperature
t = [] #time object
h = [] #humidity
def readfile():
    readFile = open('document.csv', 'r')
    sepFile = readFile.read().split('\n')
    readFile.close()
    for idx, plotPair in enumerate(sepFile):
        if plotPair in '. ':
            # skip. or space
            continue
        if idx > 1: # to skip the first line
            xAndY = plotPair.split(',')
            time_string = xAndY[0]
            time_string1 = datetime.strptime(time_string, '%d/%m/%Y %H:%M:%S')
            t.append(time_string1)
            y.append(float(xAndY[1]))
            h.append(float(xAndY[2]))
            print([y])
while True:
    readfile()
    time.sleep(2)
This is the output I get:
[[21.1]]
[[21.1, 21.1]]
[[21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1]]
[[21.1, 21.1, 21.1, 21.1, 21.1]]
Any help is appreciated.

You can use Python's deque if you also want to limit the total number of entries you keep. A deque created with maxlen behaves like a list with a maximum length: once it is full, each new entry pushes the oldest entry off the start.
Your lists keep growing because every call re-reads the whole file and appends every row again. Instead, re-read the file only up to your last stored entry and then continue adding new entries from there. Assuming your timestamps are unique, you can use takewhile() to do this; it reads entries until a condition is no longer met.
from itertools import takewhile
from collections import deque
from datetime import datetime
import csv
import time
max_length = 1000 # keep this many entries
t = deque(maxlen=max_length) # time object
y = deque(maxlen=max_length) # temperature
h = deque(maxlen=max_length) # humidity
def read_file():
    with open('document.csv', newline='') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input) # skip over the header line
        # If there are existing entries, read until the last read item is found again
        if len(t):
            list(takewhile(lambda row: datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S') != t[-1], csv_input))
        for row in csv_input:
            print(row)
            t.append(datetime.strptime(row[0], '%d/%m/%Y %H:%M:%S'))
            y.append(float(row[1]))
            h.append(float(row[2]))
while True:
    read_file()
    print(t)
    time.sleep(1)
Also, it is easier to work with the entries using Python's built-in csv library, which reads the values of each row into a list. As you have a header row, read it with next() before starting the loop.
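The two building blocks used above can be tried in isolation. This is a small sketch with made-up values; the names recent, rows and last_seen are only for illustration.

```python
from collections import deque
from itertools import takewhile

# A deque with maxlen silently drops the oldest item once it is full.
recent = deque(maxlen=3)
for value in [1, 2, 3, 4, 5]:
    recent.append(value)
print(list(recent))

# takewhile() pulls items from an iterator while the condition holds.
# The first item that fails the condition is consumed as well, so the
# iterator continues right after the last entry already seen.
rows = iter(["a", "b", "c", "d", "e"])
last_seen = "c"
list(takewhile(lambda r: r != last_seen, rows))  # skips a, b and c
new_rows = list(rows)
print(new_rows)
```

This is why the answer needs unique timestamps: with a repeated timestamp, takewhile() would stop at the first match instead of the last row already stored.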

String Replace for Multiple Lines In A CSV

Below is a snippet from a csv file. The first column is the product number, 2 is the stock level, 3 is the target level, and 4 is the distance from target (target minus stock level.)
34512340,0,95,95
12395675,3,95,92
56756777,70,95,25
90673412,2,95,93
When the stock level gets to 5 or below, I want to have the stock levels updated from python when a user requests it.
I am currently using this piece of code which I have adapted from just updating one line in the CSV. It isn't working though. The first line is written back to the file as 34512340,0,95,95 and the rest of the file is deleted.
choice = input("\nTo update the stock levels of the above products, type 1. To cancel, enter anything else.")
if choice == '1':
    with open('stockcontrol.csv',newline='') as f:
        for line in f:
            data = line.split(",")
            productcode = int(data[0])
            target = int(data[2])
            stocklevel = int(data[1])
            if stocklevel <= 5:
                target = str(target)
                import sys
                import csv
                data=[]
                newval= target
                newtlevel = "0"
                f=open("stockcontrol.csv")
                reader=csv.DictReader(f,fieldnames=['code','level', 'target', 'distancefromtarget'])
                for line in reader:
                    line['level']= newval
                    line['distancefromtarget']= newtlevel
                    data.append('%s,%s,%s,%s'%(line['code'],line['level'],line['target'],line['distancefromtarget']))
                f.close()
                f=open("stockcontrol.csv","w")
                f.write("\n".join(data))
                f.close()
    print("The stock levels were updated successfully")
else:
    print("Goodbye")
Here is the code I had for changing one line in the CSV file, which works:
with open('stockcontrol.csv',newline='') as f:
    for line in f:
        if code in line:
            data = line.split(",")
            target = (data[2])
            newlevel = stocklevel - quantity
            updatetarget = int(target) - int(newlevel)
            stocklevel = str(stocklevel)
            newlevel = str(newlevel)
            updatetarget = str(updatetarget)
            import sys
            import csv
            data=[]
            code = code
            newval= newlevel
            newtlevel = updatetarget
            f=open("stockcontrol.csv")
            reader=csv.DictReader(f,fieldnames=['code','level', 'target', 'distancefromtarget'])
            for line in reader:
                if line['code'] == code:
                    line['level']= newval
                    line['distancefromtarget']= newtlevel
                data.append('%s,%s,%s,%s'%(line['code'],line['level'],line['target'],line['distancefromtarget']))
            f.close()
            f=open("stockcontrol.csv","w")
            f.write("\n".join(data))
            f.close()
What can I change to make the code work? I basically want the program to loop through each line of the CSV file, and if the stock level (column 2) is equal to or less than 5, update the stock level to the target number in column 3, and then set the number in column 4 to zero.
Thanks,

The code below reads each line and checks the value in column 2. If it is less than or equal to 5, column 2 is set to the value of column 3 and the last column is set to 0; otherwise all the columns are left unchanged.
import sys
import csv
data=[]
f=open("stockcontrol.csv")
reader=csv.DictReader(f,fieldnames=['code','level','target','distancefromtarget'])
for line in reader:
    if int(line['level']) <= 5:
        line['level']= line['target']
        line['distancefromtarget']= 0
    data.append("%s,%s,%s,%s"%(line['code'],line['level'],line['target'],line['distancefromtarget']))
f.close()
f=open("stockcontrol.csv","w")
f.write("\n".join(data))
f.close()
Coming to the issues in your code:
You first read the file without the csv module, splitting each line to get the column values, and then read the same values again with the csv module's DictReader. The inner DictReader loop also has no condition, so it overwrites the level of every row, and it rewrites the file while the outer loop is still reading it.
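The same update can also be written as a single pass with csv.reader and csv.writer. This is a minimal sketch rather than the answerer's code: the function name restock and the threshold parameter are made up for illustration, and an in-memory string stands in for stockcontrol.csv.

```python
import csv
import io

def restock(rows, threshold=5):
    """Return new rows; refill any product at or below the threshold."""
    updated = []
    for code, level, target, distance in rows:
        if int(level) <= threshold:
            # stock level goes back to the target, distance from target becomes 0
            level, distance = target, "0"
        updated.append([code, level, target, distance])
    return updated

# Sample rows taken from the question.
sample = (
    "34512340,0,95,95\n"
    "12395675,3,95,92\n"
    "56756777,70,95,25\n"
    "90673412,2,95,93\n"
)
rows = list(csv.reader(io.StringIO(sample)))  # read everything first
updated = restock(rows)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(updated)
result = out.getvalue()
print(result)
```

With a real file, the same order applies: read all rows, close the file, and only then open it again in write mode (with newline='') to write the updated rows.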

Python: Reading files and editing its content

I have the following problem: I would like to read a text data file which consists of two columns, year and temperature, and be able to calculate the minimum temperature etc. for each year. The whole file starts like this:
1995.0012 -1.34231
1995.3030 -3.52533
1995.4030 -7.54334
and so on, until year 2013. I had the following idea:
f=open('munich_temperatures_average.txt', 'r')
for line in f:
    line = line.strip()
    columns = line.split()
    year = float(columns[0])
    temperature=columns[1]
    if year-1995<1 and year-1995>0:
        print 1995, min(temperature)
With this I get only the year 1995 data which is what I want in a first step. In a second step I would like to calculate the minimal temperature of the whole dataset in year 1995. By using the script above, I however get the minimum temperature for every line in the datafile. I tried building a list and then appending the temperature but I run into trouble if I want to transform the year into an integer or the temperature into a float etc.
I feel like I am missing the right idea how to calculate the minimum value of a set of values in a column (but not of the whole column).
Any ideas how I could approach this problem? I am trying to learn Python but am still at a beginner's stage, so if there is a way to do the whole thing without using "advanced" commands, I'd be ecstatic!

I would do this using a regular expression:
import re
from collections import defaultdict
REGEX = re.compile(ur"(\d{4})\.\d+ ([0-9\-\.\+]+)")
f = open('munich_temperatures_average.txt', 'r')
data = defaultdict(list)
for line in f:
    year, temperature = REGEX.findall(line)[0]
    temperature = float(temperature)
    data[year].append(temperature)
print min(data["1995"])

You could use the csv module which would make it very easy to read and manipulate each row of your file:
import csv
with open('munich_temperatures_average.txt', 'r') as temperatures:
    for row in csv.reader(temperatures, delimiter=' '):
        print "year", row[0], "temp", row[1]
Afterwards it is just a matter of finding the minimum temperature in the rows. See the csv module documentation.

If you just want the years and the temps:
years,temp =[],[]
with open("f.txt") as f:
    for line in f:
        spl = line.rstrip().split()
        years.append(int(spl[0].split(".")[0]))
        temp.append(float(spl[1]))
print years,temp
[1995, 1995, 1995] [-1.34231, -3.52533, -7.54334]

I previously submitted another approach using the numpy library, which could be confusing given that you are new to Python. Sorry for that. As you mentioned yourself, you need to keep some kind of record for the year 1995, but you don't need a list for that:
mintemp1995 = None
for line in f:
    line = line.strip()
    columns = line.split()
    year = int(float(columns[0]))
    temp = float(columns[1])
    if year == 1995 and (mintemp1995 is None or temp < mintemp1995):
        mintemp1995 = temp
print "1995:", mintemp1995
Note the cast of the year to int, so you can compare it directly to 1995, and the condition after it:
If the variable mintemp1995 has never been set (it is None, so this is the first 1995 entry in the dataset), or the current temperature is lower than the stored one, the current temperature replaces it, so you keep a record of only the lowest temperature.
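The same running-minimum idea extends to every year at once by keeping one entry per year in a dictionary. This is a minimal sketch in Python 3 syntax (note the print() function). The sample values after the first three rows are made up, and an in-memory file stands in for munich_temperatures_average.txt.

```python
import io

# The first three rows come from the question; the 1996 rows are hypothetical.
sample = io.StringIO(
    "1995.0012 -1.34231\n"
    "1995.3030 -3.52533\n"
    "1995.4030 -7.54334\n"
    "1996.0010 2.5\n"
    "1996.5000 -0.75\n"
)

min_by_year = {}
for line in sample:
    columns = line.split()
    if not columns:
        continue  # skip blank lines
    year = int(float(columns[0]))  # 1995.0012 -> 1995
    temp = float(columns[1])
    # keep the temperature if it is the first or the lowest seen for that year
    if year not in min_by_year or temp < min_by_year[year]:
        min_by_year[year] = temp

for year in sorted(min_by_year):
    print(year, min_by_year[year])
```

Only one number per year is kept in memory, so this also works for files that are too long to load into a list.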
