Converting a huge txt file

Converting a huge txt file - python

I have a huge CSV file (524 MB; Notepad takes 4 minutes to open it) whose formatting I need to change. Right now it looks like this:
1315922016 5.800000000000 1.000000000000
1315922024 5.830000000000 3.000000000000
1315922029 5.900000000000 1.000000000000
1315922034 6.000000000000 20.000000000000
1315924373 5.950000000000 12.452100000000
The lines are separated by newline characters; when I paste the data into Excel, it splits into rows. I would have done this with Excel functions, but the file is too big to open.
The first value is the number of seconds since 01-01-1970, the second is the price, and the third is the volume.
I need it to look like this:
01-01-2009 13:55:59 5.800000000000 1.000000000000 01-01-2009 13:56:00 5.830000000000 3.000000000000
etc.
Records need to be separated by a space. Sometimes there are multiple prices from the same second, like this:
1328031552 6.100000000000 2.000000000000
1328031553 6.110000000000 0.342951630000
1328031553 6.110000000000 0.527604200000
1328031553 6.110000000000 0.876088370000
1328031553 6.110000000000 0.971026920000
1328031553 6.100000000000 0.965781090000
1328031589 6.150000000000 0.918752490000
1328031589 6.150000000000 0.940974100000
When this happens, I need the code to take the average price for that second and save just one price per second.
These are bitcoin transactions, which didn't happen every second when BTC started.
When there is no record for some second, a new record needs to be created for that second, with the price and volume copied from the last known record.
Then everything should be saved to a new txt file.
I can't seem to do it. I've been trying to write a converter in Python for hours; please help.

shlex is a lexical analyzer. We use it to pick the numbers from the input one at a time. The function Records groups these into lists where the first element is an integer and the other elements are floats.
The loop reads the results of Records and averages records that share a timestamp. It also prints two records per line.
from shlex import shlex
lexer = shlex(instream=open('temp.txt'), posix=False)
lexer.wordchars = r'0123456789.\n'
lexer.whitespace = ' \n'
lexer.whitespace_split = True
import time
def Records():
    record = []
    while True:
        token = lexer.get_token()
        if token:
            token = token.strip()
            if token:
                record.append(token)
                if len(record)==3:
                    record[0] = int(record[0])
                    record[1] = float(record[1])
                    record[2] = float(record[2])
                    yield record
                    record=[]
            else:
                break
        else:
            break
def conv_time(t):
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(t))
records = Records()
pos = 1
current_date, price, volume = next(records)
price_sum = price
volume_sum = volume
count = 1
for raw_date, price, volume in records:
    if raw_date == current_date:
        price_sum += price
        volume_sum += volume
        count += 1
    else:
        print (conv_time(current_date), price_sum/count, volume_sum/count, end=' ' if pos else '\n')
        pos = (pos+1)%2
        current_date = raw_date
        price_sum = price
        volume_sum = volume
        count = 1
print (conv_time(current_date), price_sum/count, volume_sum/count, end=' ' if pos else '\n')
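Here are the results (times are in the local time zone of the machine that runs the script, because conv_time uses time.localtime). You might need to do something about the significant digits to the right of the decimal points.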
2011-09-13 09:53:36 5.8 1.0 2011-09-13 09:53:44 5.83 3.0
2011-09-13 09:53:49 5.9 1.0 2011-09-13 09:53:54 6.0 20.0
2011-09-13 10:32:53 5.95 12.4521 2012-01-31 12:39:12 6.1 2.0
2012-01-31 12:39:13 6.108 0.736690442 2012-01-31 12:39:49 6.15 0.9298632950000001

1) Reading a single line from a file
from datetime import datetime

data = {}
with open("<path to file>") as fh:
    for line in fh:
        values = line.split()
        for n in range(0, len(values), 3):
            dt, price, volumen = values[n:n+3]
2) Checking if it's the next second after the last record's
If so, adding the price and volumen values to a variable and increasing a counter for later use in calculating the average
3) If the second is not the next second, copy values of last price and volumen.
            if dt not in data:
                data[dt] = []
            # store numbers, not strings, so they can be summed later
            data[dt].append((float(price), float(volumen)))
4) Divide timestamps like "1328031552" into seconds, minutes, hours, days, months, years.
Somehow take care of leap years.
for dt in data:
    # datetime takes care of leap years when it converts the timestamp
    stamp = datetime.fromtimestamp(int(dt)).strftime('%d-%m-%Y %H:%M:%S')
... for later use in calculating the average
    p_sum, v_sum = 0, 0
    for p, v in data[dt]:
        p_sum += p
        v_sum += v
    n = len(data[dt])
    price = p_sum / n
    volumen = v_sum / n
5) Arrange values in the 01-01-2009 13:55:59 1586.12 220000 order
6) Add the record to the end of the new database file.
    print(stamp, price, volumen)
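The steps listed above (average the trades within one second, carry the last known values across missing seconds, format the timestamp) can be sketched as a single streaming pass that never holds the whole file in memory. This is only a sketch under stated assumptions: the input is sorted by timestamp, the volume is averaged the same way as the price, and times are rendered in UTC. The names per_second, format_record and convert are made up for illustration.

```python
from datetime import datetime, timezone


def per_second(lines):
    """Yield (timestamp, price, volume) once per second.

    Assumes the lines are sorted by timestamp in ascending order.
    """
    current = None           # second currently being accumulated
    p_sum = v_sum = 0.0
    n = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue         # skip blank or malformed lines
        ts, price, volume = int(parts[0]), float(parts[1]), float(parts[2])
        if ts == current:
            # another trade in the same second: accumulate for the average
            p_sum += price
            v_sum += volume
            n += 1
            continue
        if current is not None:
            avg_p, avg_v = p_sum / n, v_sum / n
            # emit the finished second, then repeat its values for every
            # missing second up to (not including) the new timestamp
            for t in range(current, ts):
                yield t, avg_p, avg_v
        current, p_sum, v_sum, n = ts, price, volume, 1
    if current is not None:
        yield current, p_sum / n, v_sum / n


def format_record(ts, price, volume):
    # UTC is an assumption; the question does not name a time zone
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%d-%m-%Y %H:%M:%S')
    return f"{stamp} {price:.12f} {volume:.12f}"


def convert(in_path, out_path):
    # A file object is iterated line by line, so memory use stays small
    with open(in_path) as src, open(out_path, 'w') as dst:
        first = True
        for record in per_second(src):
            dst.write(('' if first else ' ') + format_record(*record))
            first = False


sample = ["1315922016 5.8 1.0", "1315922018 6.0 2.0", "1315922018 7.0 4.0"]
filled = list(per_second(sample))
print(' '.join(format_record(*r) for r in filled))
```

Note that filling every missing second can make the output far larger than the input when the data has long gaps, so it may be worth checking the expected size before running convert on the full file.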

Related

Extracting data from a .txt file without using modules

I am taking a course in python and one of the problem sets is as follows:
Read in the contents of the file SP500.txt which has monthly data for 2016 and 2017 about the S&P 500 closing prices as well as some other financial indicators, including the “Long Term Interest Rate”, which is interest rate paid on 10-year U.S. government bonds.
Write a program that computes the average closing price (the second column, labeled SP500) and the highest long-term interest rate. Both should be computed only for the period from June 2016 through May 2017. Save the results in the variables mean_SP and max_interest.
SP500.txt:
Date,SP500,Dividend,Earnings,Consumer Price Index,Long Interest Rate,Real Price,Real Dividend,Real Earnings,PE10
1/1/2016,1918.6,43.55,86.5,236.92,2.09,2023.23,45.93,91.22,24.21
2/1/2016,1904.42,43.72,86.47,237.11,1.78,2006.62,46.06,91.11,24
3/1/2016,2021.95,43.88,86.44,238.13,1.89,2121.32,46.04,90.69,25.37
4/1/2016,2075.54,44.07,86.6,239.26,1.81,2167.27,46.02,90.43,25.92
5/1/2016,2065.55,44.27,86.76,240.23,1.81,2148.15,46.04,90.23,25.69
6/1/2016,2083.89,44.46,86.92,241.02,1.64,2160.13,46.09,90.1,25.84
7/1/2016,2148.9,44.65,87.64,240.63,1.5,2231.13,46.36,91,26.69
8/1/2016,2170.95,44.84,88.37,240.85,1.56,2251.95,46.51,91.66,26.95
9/1/2016,2157.69,45.03,89.09,241.43,1.63,2232.83,46.6,92.19,26.73
10/1/2016,2143.02,45.25,90.91,241.73,1.76,2214.89,46.77,93.96,26.53
11/1/2016,2164.99,45.48,92.73,241.35,2.14,2241.08,47.07,95.99,26.85
12/1/2016,2246.63,45.7,94.55,241.43,2.49,2324.83,47.29,97.84,27.87
1/1/2017,2275.12,45.93,96.46,242.84,2.43,2340.67,47.25,99.24,28.06
2/1/2017,2329.91,46.15,98.38,243.6,2.42,2389.52,47.33,100.89,28.66
3/1/2017,2366.82,46.38,100.29,243.8,2.48,2425.4,47.53,102.77,29.09
4/1/2017,2359.31,46.66,101.53,244.52,2.3,2410.56,47.67,103.74,28.9
5/1/2017,2395.35,46.94,102.78,244.73,2.3,2445.29,47.92,104.92,29.31
6/1/2017,2433.99,47.22,104.02,244.96,2.19,2482.48,48.16,106.09,29.75
7/1/2017,2454.1,47.54,105.04,244.79,2.32,2504.72,48.52,107.21,30
8/1/2017,2456.22,47.85,106.06,245.52,2.21,2499.4,48.69,107.92,29.91
9/1/2017,2492.84,48.17,107.08,246.82,2.2,2523.31,48.76,108.39,30.17
10/1/2017,2557,48.42,108.01,246.66,2.36,2589.89,49.05,109.4,30.92
11/1/2017,2593.61,48.68,108.95,246.67,2.35,2626.9,49.3,110.35,31.3
12/1/2017,2664.34,48.93,109.88,246.52,2.4,2700.13,49.59,111.36,32.09
My solution (correct but not optimal):
file = open("SP500.txt", "r")
content = file.readlines()
# List that will hold the range of months we need
data=[]
for line in content:
    # Get a list of values for each line
    values = line.split(',')
    # Return lines with the required dates
    for i in range(6,13):
        month_range = f"{i}/1/2016"
        if month_range == values[0]:
            data.append(values)
    # Return lines with the required dates
    for i in range(1,6):
        month_range = f"{i}/1/2017"
        if month_range == values[0]:
            data.append(values)
sum_total = 0
max_interest = 0
# Loop through the data of our required months
for entry in data:
    # Get the sum total
    sum_total += float(entry[1])
    # Find the highest interest rate in list
    if max_interest < float(entry[5]):
        max_interest = float(entry[5])
mean_SP = sum_total / len(data)
I'm self-learning these concepts and would love to learn a better way to implement this solution. My code seems borderline hard-coded (the exact date in values[0]), and I imagine it would be error-prone for bigger problems, especially the excessive looping, which seems quite exhausting for such a simple problem.
Thanks in advance.
EDIT:
New code (based on Deepak Tripathi's answer):
with open('SP500.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split(",") for line in lines]
date_index, spf_index, long_interest_rate = 0, 1, 5
start_year, end_year = 2016, 2017
start_month, end_month = 6, 5
mean_SP, max_interest = 0, -1000 # Some random negative number
total_entries = 0
for line in lines[1:]:
    date_values = line[date_index].split('/')
    if (int(date_values[2]) == start_year and int(date_values[0]) >= start_month) or (int(date_values[2]) == end_year and int(date_values[0]) <= end_month):
        total_entries += 1
        mean_SP += float(line[spf_index])
        max_interest = max(max_interest, float(line[long_interest_rate]))
mean_SP /= total_entries
print(mean_SP, max_interest)

I think you can optimize it by storing the column indexes in variables and comparing (year, month) pairs instead of date strings:
with open('temp.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split(",") for line in lines]
date_index, spf_index, long_interest_rate = 0, 1, 5
start, end = (2016, 6), (2017, 5)  # inclusive (year, month) bounds
mean_SP, max_interest = 0, -1000 # Some random negative number
count = 0
for line in lines[1:]:
    # dates look like 6/1/2016, i.e. month/day/year
    month, day, year = line[date_index].split('/')
    if start <= (int(year), int(month)) <= end:
        count += 1
        mean_SP += float(line[spf_index])
        max_interest = max(max_interest, float(line[long_interest_rate]))
mean_SP /= count  # divide by the number of rows in the period, not by all rows
print(mean_SP, max_interest)

How to iterate to find days? in smth["smth"]["1.555"]["2022-04-05T08:34:39+02:00"]

I'm trying to iterate over this and can't figure out how. There's a .csv file.
QUESTION: I find LOW_num's data[0] and need to get the TOP_num whose data[0] is earlier than LOW_num's data[0]. What could the formula be? An example:
for line in file:
    data = line.split(sep)
A line looks like this:
2022-04-05T08:34:39+02:00, 1.2024, 1.2024, 1.2024, 1.2024, 1.2185, 1.2059028833000065, 1.2024784243912705, 1.2004400559932131, 1.198116316019428
So data[0] means Column A, data[1] is Column B, data[2] is Column C, (...)
memory["high"] = {}
memory["low"] = {}
for line in file:
    data = line.split(sep)
    if data[5] < data[9]:
        memory["high"][float(data[2])] = str(data[0])
        memory["low"][float(data[3])] = str(data[0])
# those collect data[2] and data[3] only between events when
# the values change from column F > J to F < J in that .csv file
Then, in the same "for line in file:" loop, but in a different if:
LOW_num = min(memory["low"]) # it gets lowest number of all collected data[3] (Column D)
TOP_num = max(memory["high"]) # it gets biggest number of all collected data[2] (Column C)
#so TOP_num is for example: "1.555"
#but that TOP_num got day, month, and year attached to it as well:
#ex: memory["high"]["1.555"]["2022-04-05T08:34:39+02:00"]
TOP_data0 = str(memory["high"][TOP_num])
LOW_data0 = str(memory["low"][LOW_num])
I tried some things but can't get it right, for example:
for i in memory["high"][i][j]:
if memory["high"][i][memory["high"][TOP_num][TOP_data0] < memory["low"][LOW_num]LOW_data0]:
print(memory["high"][i][TOP_num])
The .csv file represents some coin's price data, e.g. ADAUSDT from an exchange,
(time, open, high, low, close, something, something, something, something, something)
I find the lowest price of a given time period (LOW_num), starting from some earlier start price.
Then I must find the highest price between that start point and the LOW_num point.
That highest price is the stop-loss level that had to be set in order to reach the lowest point; in this example it was a short.

Figured it out!
memory["SL"] = {}
for number in memory["high"]:
    if number > LOW_num: # so it's only numbers higher than the lowest, obviously
        x = number
        if memory["high"][x] < LOW_data0: # and among them, only those with a date earlier than LOW_data0
            memory["SL"][x] = str(memory["high"][x]) # and saving it to a new memory set for a later max() or min() on it
Wow, Python can compare these dates! (They are ISO 8601 strings, which sort chronologically as plain text as long as they all use the same UTC offset.)
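The comparison above works on strings. A minimal sketch of the same search with real datetime objects, which stays correct even when rows carry different UTC offsets, might look like this. The function name stop_loss_before_low and the (timestamp, high, low) row layout are assumptions made for illustration, not part of the original post.

```python
from datetime import datetime


def stop_loss_before_low(rows):
    """rows: iterable of (iso_timestamp, high, low) tuples.

    Returns ((low_time, low_price), (top_time, top_price)), where the second
    item is None when no earlier, higher candle exists.
    """
    parsed = [(datetime.fromisoformat(ts), high, low) for ts, high, low in rows]
    # the candle with the lowest "low" price
    low_time, _, low_price = min(parsed, key=lambda r: r[2])
    # candidates: candles earlier than the low whose high is above it
    earlier = [r for r in parsed if r[0] < low_time and r[1] > low_price]
    top = max(earlier, key=lambda r: r[1], default=None)
    return (low_time, low_price), (None if top is None else (top[0], top[1]))


rows = [
    ("2022-04-05T08:00:00+02:00", 1.25, 1.20),
    ("2022-04-05T09:00:00+02:00", 1.30, 1.22),
    ("2022-04-05T10:00:00+02:00", 1.21, 1.10),
    ("2022-04-05T11:00:00+02:00", 1.50, 1.15),
]
low, top = stop_loss_before_low(rows)
print(low[1], top[1])
```

Here the 11:00 candle has the highest price overall, but it is ignored because it comes after the low at 10:00.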

Working with Two Different Input Files --- example: Hourly Data and Daily Data (with different lengths)

I'm working on some code to manipulate hourly and daily data for a year, and I'm a little confused about how to combine data from the two files. I am using the hourly pattern of Data Set B but scaling it with the daily Data Set A. In essence (using the example below), I take the daily average (Data Set A) of 93 cfs and multiply it by 24 hours, which gives 2232. I then sum the hourly cfs values for all 24 hours of each day (Data Set B), which for 1/1/2021 gives 2596. Normally manipulating a rate this way doesn't make sense, but here it doesn't matter because the units cancel out. I then divide these values, 2232/2596 ≈ 0.8598, and apply that factor to the hourly cfs values for all 24 hours of each day (Data Set B) to get a new "scaled" dataset (Data Set C).
My problem is that I have never coded in Python with two different input datasets (I am a complete newbie). I started experimenting with the code, but I can't seem to integrate the two datasets. If anyone can point me toward how to integrate two separate input files, I'd be most appreciative. Beneath the datasets is my attempt at the code (please note the reverse order: it works first with the hourly data (Data Set B) and then with the daily data (Data Set A)). My printout of the final scaling factor (SF) gives only one value, not all 8,760, because I'm not in the loop. But how can I be in the loop of both input files at the same time?
Data Set A (Daily) -- 365 lines of data:
1/1/2021 93 cfs
1/2/2021 0 cfs
1/3/2021 70 cfs
1/4/2021 70 cfs
Data Set B (Hourly) -- 8,760 lines of data:
1/1/2021 0:00 150 cfs
1/1/2021 1:00 0 cfs
1/1/2021 2:00 255 cfs
(where summation of all 24 hrs of 1/1/2021 = 2596 cfs)
etc.
Sorry if this is a ridiculously easy question... I am very new to coding.
Here is the code that I've written so far. What I need is 8,760 lines of SF that I can then multiply by the original Data Set B. The final product, Data Set C, will be date, time, and rescaled hourly data. I actually have to do this for three pumping units in total, giving a matrix of 5 columns by 8,760 rows, but I think I'll be able to figure that part out. My problem now is how to integrate the two data sets. Thank you for reading!
print('Solving the Temperature Model programming problem')
fhand1 = open('Interpolate_CY21_short.txt')
fhand2 = open('WSE_Daily_CY21_short.txt')
#Hourly Interpolated Pardee PowerHouse Data
for line1 in fhand1:
    line1 = line1.rstrip()
    words1 = line1.split()
    #Hourly interpolated data - parsed down (cfs)
    x = float(words1[7])
    if x<100:
        x = 0
    #print(x)
#WSE Daily Average PowerHouse Data
for line2 in fhand2:
    line2 = line2.rstrip()
    words2 = line2.split()
    #Daily cfs average x 24 hrs
    aa = float(words2[2])*24
    #print(a)
SF = x * aa
print(SF)

This is how you would get the data into two lists:
fhand1 = open('Interpolate_CY21_short.txt', 'r')  # hourly data
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')    # daily averages
daily_average = fhand2.readlines()
daily = fhand1.readlines()
# this is what the two lists would look like, roughly
# each line would be a separate string
daily_average = ["1/1/2021 93 cfs","1/2/2021 0 cfs"]
daily = ["1/1/2021 0:00 150 cfs", "1/1/2021 1:00 0 cfs", "1/2/2021 1:00 0 cfs"]
Then, to process the lists, you could use a nested for loop:
for average_line in daily_average:
    average_line = average_line.rstrip()
    average_date, average_count, average_symbol = average_line.split()
    for daily_line in daily:
        daily_line = daily_line.rstrip()
        date, hour, count, symbol = daily_line.split()
        if average_date == date:
            print(f"date={date}, average_count={average_count} count={count}")
Or a dictionary
# populate data into dictionaries
daily_average_data = dict()
for line in daily_average:
    line = line.rstrip()
    day, count, symbol = line.split()
    daily_average_data[day] = (day, count, symbol)
daily_data = dict()
for line in daily:
    line = line.rstrip()
    day, hour, count, symbol = line.split()
    if day not in daily_data:
        daily_data[day] = list()
    daily_data[day].append((day, hour, count, symbol))
# now you can access daily_average_data and daily_data as
# dictionaries instead of files
# process data
result = list()
for date in daily_data.keys():
    print(date)
    print(daily_average_data[date])
    print(daily_data[date])
If the data items corresponded with one another line by line, you could use zip (see https://realpython.com/python-zip-function/).
Here is an example:
for data1, data2 in zip(daily_average, daily):
    print(f"{data1} {data2}")

Similar to what @oasispolo described, the solution is to make a single loop and process both lists in it. I'm personally not fond of the "zip" function. (It's a purely stylistic objection; lots of other people like it and that's fine.)
Here's a solution with syntax that I find more intuitive:
print('Solving the Temperature Model programming problem')
fhand1 = open('Interpolate_CY21_short.txt', 'r')
fhand2 = open('WSE_Daily_CY21_short.txt', 'r')
# Convert each file into a list of lines. You're doing this
# implicitly, but I like to be explicit about it.
lines1 = fhand1.readlines()
lines2 = fhand2.readlines()
if len(lines1) != len(lines2):
    raise ValueError("The two files have different length!")
# Initialize an output array. You could also construct it
# one item at a time, but that can be slow for large arrays.
# It is more efficient to initialize the entire array at
# once if possible.
sf_list = [0]*len(lines1)
for position in range(len(lines1)):
    # range(L) generates numbers 0...L-1
    line1 = lines1[position].rstrip()
    words1 = line1.split()
    x = float(words1[7])
    if x<100:
        x = 0
    line2 = lines2[position].rstrip()
    words2 = line2.split()
    aa = float(words2[2])*24
    sf_list[position] = x * aa
print(sf_list)
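The loop above requires both files to have the same number of lines, which is not the case for a 365-line daily file and an 8,760-line hourly file. A minimal sketch of the scaling described in the question, keyed by date instead of by line position, might look like this. It assumes the line layouts shown in the question's samples ("1/1/2021 93 cfs" and "1/1/2021 0:00 150 cfs"); the real files may have more columns (the question's own code reads the hourly value from words1[7]), and the function name scale_hourly is made up for illustration.

```python
from collections import defaultdict


def scale_hourly(daily_lines, hourly_lines):
    """Return (date, hour, scaled_value) for every hourly line."""
    # daily average per date, e.g. {"1/1/2021": 93.0}
    daily_avg = {}
    for line in daily_lines:
        date, value, _unit = line.split()
        daily_avg[date] = float(value)

    # first pass over the hourly data: keep the rows and sum them per day
    hourly = []
    totals = defaultdict(float)
    for line in hourly_lines:
        date, hour, value, _unit = line.split()
        hourly.append((date, hour, float(value)))
        totals[date] += float(value)

    # second pass: factor = (daily average * 24) / (sum of that day's hours)
    scaled = []
    for date, hour, value in hourly:
        total = totals[date]
        factor = daily_avg[date] * 24 / total if total else 0.0
        scaled.append((date, hour, value * factor))
    return scaled


daily = ["1/1/2021 10 cfs", "1/2/2021 0 cfs"]
hourly = ["1/1/2021 0:00 100 cfs", "1/1/2021 1:00 300 cfs", "1/2/2021 0:00 50 cfs"]
result = scale_hourly(daily, hourly)
for row in result:
    print(row)
```

With these made-up numbers the factor for 1/1/2021 is 10 * 24 / 400 = 0.6, so the two hourly values become roughly 60 and 180, and the day with a daily average of 0 is scaled to 0.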

Trouble printing out the max key/value pair in a dictionary

I'm working on trying to calculate the greatest increase/decrease in a change to profits/losses over time from a CSV.
The data set in csv is as follows (extract only):
Date,Profit/Losses
Jan-2010,867884
Feb-2010,984655
Mar-2010,322013
Apr-2010,-69417
So far, I've imported the CSV file and added the items to a dictionary. I've calculated the total months, the total profit/loss, and the change in profit/loss from month to month, but now I need to find the greatest and smallest change and have the code return both the month and the change figure.
The output when trying to print the greatest increase/decrease returns only the final month on the list and all change values (instead of just the biggest change value and its corresponding month).
Here is the code. I would appreciate any perspective:
budget = {}
total_months = 0
total_pnl = 0
date = 0
pnl = 0
monthly_change = []
previous_pnl = 0
greatest_increase = ["Date",[0]]
greatest_decrease = ["Date",[100000000000000]]
with open(csvpath, 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    header = next(csvreader)
    for row in csvreader:
        date = 0
        pnl = 1
        budget[row[date]] = int(row[pnl])
    for date, pnl in budget.items():
        total_months = total_months + 1
        total_pnl = total_pnl + pnl
        pnlchange = pnl - previous_pnl
        if total_months > 1:
            monthly_change.append(pnlchange)
        previous_pnl = pnl
        if (monthly_change > greatest_increase[1]):
            greatest_increase[1] = monthly_change
            greatest_increase[0] = row[0]
        if (monthly_change < greatest_decrease[1]):
            greatest_decrease[1] = monthly_change
            greatest_decrease[0] = row[0]
print(greatest_increase)
The primary problem is the final part of the code (the if statement). When I print 'greatest_increase' this currently returns the final value in the list rather than the highest value of change.
current output is:
[['Feb-2017', '671099'], [116771, -662642, -391430, 379920, 212354, 510239, -428211, -821271, 693918, 416278, -974163, 860159, -1115009, 1033048, 95318, -308093, 99052, -521393, 605450, 231727, -65187, -702716, 177975, -1065544, 1926159, -917805, 898730, -334262, -246499, -64055, -1529236, 1497596, 304914, -635801, 398319, -183161, -37864, -253689, 403655, 94168, 306877, -83000, 210462, -2196167, 1465222, -956983, 1838447, -468003, -64602, 206242, -242155, -449079, 315198, 241099, 111540, 365942, -219310, -368665, 409837, 151210, -110244, -341938, -1212159, 683246, -70825, 335594, 417334, -272194, -236462, 657432, -211262, -128237, -1750387, 925441, 932089, -311434, 267252, -1876758, 1733696, 198551, -665765, 693229, -734926, 77242, 532869]]
What I am trying to get is just the highest value (along with the relevant month).
Apologies if this isn't clear, I'm still fairly new (3rd week learning!)
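One possible way to get each month together with its change is to store (month, change) pairs and let max() and min() pick by the change value, instead of comparing the whole monthly_change list. The following is a sketch, assuming the rows are in chronological order; the function name greatest_changes is made up, and the four-row sample from the question stands in for the real file.

```python
import csv
import io


def greatest_changes(rows):
    """rows: iterable of (date, pnl) pairs in chronological order."""
    changes = []
    previous = None
    for date, pnl in rows:
        if previous is not None:
            # pair every change with the month it belongs to
            changes.append((date, pnl - previous))
        previous = pnl
    increase = max(changes, key=lambda c: c[1])
    decrease = min(changes, key=lambda c: c[1])
    return increase, decrease


sample = """Date,Profit/Losses
Jan-2010,867884
Feb-2010,984655
Mar-2010,322013
Apr-2010,-69417
"""
reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
rows = [(date, int(pnl)) for date, pnl in reader]
increase, decrease = greatest_changes(rows)
print(increase, decrease)
# prints: ('Feb-2010', 116771) ('Mar-2010', -662642)
```

For the real file, the io.StringIO(sample) part would be replaced by the opened CSV file object.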

Sorting dates in Python

I am saving the day and time period of some scheduled tasks in a txt in this format:
Monday,10:50-11:32
Friday,18:33-18:45
Sunday,17:10-17:31
Sunday,14:10-15:11
Friday,21:10-23:11
I am opening the txt and get the contents in a list.
How can I sort the list to get the days and the time periods in order?
Like this:
Monday,10:50-11:32
Friday,18:33-18:45
Friday,21:10-23:11
Sunday,14:10-15:11
Sunday,17:10-17:31

OK, let's say you only have the day of the week and the timestamps. One alternative is to calculate the number of minutes each item represents (Monday 00:00 = 0 minutes and Sunday 23:59 = the maximum) and sort with that function.
The example below sorts on the first timestamp value. A comment from a fellow SO user pointed out that this does not take the second timestamp (the end time) into account. To include it, we can add a fractional value: the end time in minutes divided by the number of minutes in a day.
((int(h2)*60 + int(m2)) / (24*60)) # end-time minutes divided by the minutes per day gives a fraction below 1
However, the key here is the following code:
weekday[day]*24*60 + int(h1)*60 + int(m1) # gets the total minutes passed, we sort with this!
And of course the sort function with a join (double line break). When you pass a key to sorted() and that key is a function, the sorting is based on the return values of that function (here, the number of minutes).
'\n\n'.join(sorted(list_, key=get_min))
Enough text... let's jump to a full, updated example:
import io
file= """Monday,10:50-11:32
Friday,18:33-18:45
Sunday,17:10-17:31
Sunday,17:10-15:11
Friday,21:10-23:11"""
list_ = [i.strip('\n') for i in io.StringIO(file).readlines() if i != "\n"]
weekday = dict(zip(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"],[0,1,2,3,4,5,6]))
def get_min(time_str):
    day,time = time_str.split(",")
    h1, m1 = time.split('-')[0].split(":")
    h2, m2 = time.split('-')[1].split(":")
    # the parentheses around 24*60 keep the end-time term below 1, so it only breaks ties
    return weekday[day]*24*60 + int(h1)*60 + int(m1) + ((int(h2)*60 + int(m2)) / (24*60))
with open("output.txt", "w") as outfile:
    outfile.write('\n\n'.join(sorted(list_, key=get_min)))
print('\n\n'.join(sorted(list_, key=get_min)))
Creates "output.txt" with:
Monday,10:50-11:32
Friday,18:33-18:45
Friday,21:10-23:11
Sunday,17:10-15:11
Sunday,17:10-17:31
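An alternative to folding everything into one number is to return a tuple from the key function; Python compares tuples element by element, so the day is compared first, then the start time, then the end time. The sketch below uses the data from the question, and the names DAY_INDEX, to_minutes and sort_key are made up for illustration.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
DAY_INDEX = {name: i for i, name in enumerate(WEEKDAYS)}


def to_minutes(hhmm):
    # "18:33" -> 18 * 60 + 33
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def sort_key(entry):
    # "Friday,18:33-18:45" -> (4, 1113, 1125)
    day, period = entry.split(",")
    start, end = period.split("-")
    return (DAY_INDEX[day], to_minutes(start), to_minutes(end))


tasks = [
    "Monday,10:50-11:32",
    "Friday,18:33-18:45",
    "Sunday,17:10-17:31",
    "Sunday,14:10-15:11",
    "Friday,21:10-23:11",
]
ordered = sorted(tasks, key=sort_key)
print("\n".join(ordered))
```

This prints the entries in the order the question asks for: Monday first, then the two Friday entries, then the two Sunday entries by start time.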
