I am working on a Python program that changes the format of an existing CSV.
This is the goal format for the CSV.
This is the original state.
There are 3 obstacles:
remove "-" from MODELCHASS (complete)
add "-" and 'T' to Prod Date (complete)
change the PL Seq column to times, or possibly create a new column with times
Condition: start at 08:00:00
Condition: each line increases by 1 hour
Condition: restart for each date
Condition: restart for each H seq
Steps 1 and 2 I have figured out, but I am lost on step 3.
This is my code so far:
import csv
import pandas as pd

df = pd.read_csv("AMS truck schedule.txt", delimiter=';')
df.to_csv('Demo1.csv')

with open('Demo1.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    order_numbers = []
    csvtimes = []
    sequence = []
    for line in csv_reader:
        order_numbers.append(line['MODELCHASS'])
        csvtimes.append(line['Prod Date'])
        sequence.append(line['PL Seq'])

# replace the dash in the order numbers
on = [sub.replace("-", "") for sub in order_numbers]
print(on[1225])
# rewrite YYYYMMDD as YYYY-MM-DDT
newtimes = [x[:4] + "-" + x[4:6] + "-" + x[6:8] + "T" for x in csvtimes]
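(For step 2, the same reformatting can also be done by parsing the dates; shown here only as a sketch, assuming the 'Prod Date' strings are plain YYYYMMDD digits:)

newtimes = pd.to_datetime(pd.Series(csvtimes), format='%Y%m%d').dt.strftime('%Y-%m-%dT').tolist()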
I am not 100% sure when you want to restart. From what I understand, you restart the hour if:
a) the last digit in the Prod Date column changes
b) the first letter in PL Seq changes
What if we reach 24h and neither (a) nor (b) is true? Do we continue past 24h until (a) or (b) becomes true?
Anyway, you can add more conditions. I don't know if it is the most efficient way, but it works. Before doing it you have to create a column:
df['hours'] = 8, so it's a column with all rows = 8,
and load the data with df = pd.read_csv(filename) (read_csv already returns a DataFrame, so wrapping it in pd.DataFrame is unnecessary).
prev_row = None
for index, row in df.iterrows():
    if prev_row is not None:
        if (row['pl'][0] == prev_row['pl'][0]) and (str(row['date'])[-2:] == str(prev_row['date'])[-2:]):
            row['hours'] = prev_row['hours'] + 1
            print(row)
        else:
            row['hours'] = 8
    df.iloc[index] = row
    prev_row = row
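(A vectorized sketch of the same idea, assuming the date key is the 'Prod Date' column and the grouping key is the first letter of 'PL Seq', both names taken from the question's reader code, and ignoring the 24h wrap-around question above:)

import pandas as pd

df = pd.read_csv('Demo1.csv')
# number the rows 0, 1, 2, ... within each (date, first letter of PL Seq) group
position = df.groupby([df['Prod Date'], df['PL Seq'].str[0]]).cumcount()
df['hours'] = position + 8                            # each group restarts at hour 8
df['time'] = df['hours'].map('{:02d}:00:00'.format)   # e.g. 8 -> '08:00:00'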
I'm trying to calculate the greatest increase and the greatest decrease in profits/losses over time from a CSV.
The data set in the csv is as follows (extract only):
Date,Profit/Losses
Jan-2010,867884
Feb-2010,984655
Mar-2010,322013
Apr-2010,-69417
So far, I've imported the csv file and added the items to a dictionary. I have calculated the total months, the total profit/loss, and the change in profit/loss from month to month, but now I need to find the greatest and smallest change and have the code return both the month and the change figure.
When I try to print the greatest increase/decrease, the output is only the final month on the list plus all of the change values (instead of just the biggest change value and its corresponding month).
Here is the code. I would appreciate any perspective:
import csv

budget = {}
total_months = 0
total_pnl = 0
date = 0
pnl = 0
monthly_change = []
previous_pnl = 0
greatest_increase = ["Date", [0]]
greatest_decrease = ["Date", [100000000000000]]

with open(csvpath, 'r') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    header = next(csvreader)
    for row in csvreader:
        date = 0
        pnl = 1
        budget[row[date]] = int(row[pnl])

for date, pnl in budget.items():
    total_months = total_months + 1
    total_pnl = total_pnl + pnl
    pnlchange = pnl - previous_pnl
    if total_months > 1:
        monthly_change.append(pnlchange)
    previous_pnl = pnl
    if (monthly_change > greatest_increase[1]):
        greatest_increase[1] = monthly_change
        greatest_increase[0] = row[0]
    if (monthly_change < greatest_decrease[1]):
        greatest_decrease[1] = monthly_change
        greatest_decrease[0] = row[0]

print(greatest_increase)
The primary problem is the final part of the code (the if statements). When I print greatest_increase, it currently returns the final month in the list together with every change value, rather than the highest change value.
The current output is:
[['Feb-2017', '671099'], [116771, -662642, -391430, 379920, 212354, 510239, -428211, -821271, 693918, 416278, -974163, 860159, -1115009, 1033048, 95318, -308093, 99052, -521393, 605450, 231727, -65187, -702716, 177975, -1065544, 1926159, -917805, 898730, -334262, -246499, -64055, -1529236, 1497596, 304914, -635801, 398319, -183161, -37864, -253689, 403655, 94168, 306877, -83000, 210462, -2196167, 1465222, -956983, 1838447, -468003, -64602, 206242, -242155, -449079, 315198, 241099, 111540, 365942, -219310, -368665, 409837, 151210, -110244, -341938, -1212159, 683246, -70825, 335594, 417334, -272194, -236462, 657432, -211262, -128237, -1750387, 925441, 932089, -311434, 267252, -1876758, 1733696, 198551, -665765, 693229, -734926, 77242, 532869]]
What I am trying to get is the highest change value (along with the relevant month).
Apologies if this isn't clear, I'm still fairly new (third week learning!).
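(For reference, a minimal sketch of the scalar comparison the loop seems to be aiming for, reusing the budget dict built above; the month/change pairs are tracked as scalars rather than lists:)

greatest_increase = ["Date", 0]   # [month, change] as scalars, not lists
greatest_decrease = ["Date", 0]

previous_pnl = None
for date, pnl in budget.items():
    if previous_pnl is not None:
        pnlchange = pnl - previous_pnl
        if pnlchange > greatest_increase[1]:
            greatest_increase = [date, pnlchange]
        if pnlchange < greatest_decrease[1]:
            greatest_decrease = [date, pnlchange]
    previous_pnl = pnl

print(greatest_increase)
print(greatest_decrease)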
I have the following problem: I have data (a table called 'answers') from a quiz application, containing the questions answered per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The 'answers' table has about 4,000 lines. Is there any structural mistake in my code that makes it so slow, or do processes like this simply take that long?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I filtered the data to students who submitted at least one answer first and set the streak to 0 for the rest (as the streak for 0 answers is 0 in every case):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore, I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt borrows some key ideas from the connected-components problem, a fairly early problem encountered when studying graphs.
First I create a random DataFrame with some user IDs and some dates.
import datetime

import numpy
import pandas

# generate a basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(
        pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
        name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(
        pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
        columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)  # drop is not in-place here, so reassign the result
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order, by user and then by date, so that each row is followed by that user's next answer day.
Create two new columns holding the user id and the date of the previous row (via shift).
If the previous row has the same user and its date plus 1 day equals the current date, the streak continues, so set the result column to False (numerically 0); otherwise a new streak starts, so set it to True (numerically 1).
Cumulatively sum the results, which groups your streaks.
Finally, count how many entries exist per group and take the max for each user.
For 10k users with 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak
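(On newer pandas versions, where Series.max(level=...) has been removed, the same summary can be written with an explicit groupby; same logic, shown as a sketch:)

streaks = df.groupby(by=['user', 'group']).count()['result']
summary = streaks.groupby(level='user').max()
summary.sort_values(ascending=False)  # user with the highest streak first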
I am trying to write a CSV parser so that if the same name appears twice in the name column, the second line with that name is deleted. For example:
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-04-18-192446'],
['CSE_MAIN\\IT-Laptop12', 'DEREGISTERED', '2018-03-28-144236'],
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-03-28-144236']]
The last line should be deleted because it has the same name as the first one.
What I wrote is:
file2 = str(sys.argv[2])
print("The first file is:" + file2)
reader2 = csv.reader(open(file2))
with open("result2.csv", 'wb') as result2:
    wtr2 = csv.writer(result2)
    for r in reader2:
        wtr2.writerow((r[0], r[6], r[9]))

newreader2 = csv.reader(open("result2.csv"))
sortedlist2 = sorted(newreader2, key=lambda col: col[2], reverse=True)
for i in range(len(sortedlist2)):
    for j in range(len(sortedlist2) - 1):
        if (sortedlist2[i][0] == sortedlist2[j+1][0] and sortedlist2[i][1] != sortedlist2[j+1][1]):
            if (sortedlist2[i][1] > sortedlist2[j+1][1]):
                del sortedlist2[i][0-2]
            else:
                del sortedlist2[j+1][0-2]
Thanks.
Try with pandas:

import pandas as pd

df = pd.read_csv('path/name_file.csv', header=None)  # no header row, so columns are numbered
df = df.drop_duplicates([0])  # 0 is the column to compare on
df.to_csv('New_file.csv')  # save to csv

This method deletes all duplicates based on column 0.
If you need to delete a specific row, you can use the drop method.
# Your file after loading with pandas (print(df)):
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236
2 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-03-28-144236
For example, to delete row 2:
df.drop(2, axis=0, inplace=True)  # axis=0 means rows; axis=1 means columns
Output:
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236
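(If the file does have a header row, the same idea works with the column name; a sketch, assuming the first column is called 'name':)

import pandas as pd

df = pd.read_csv('path/name_file.csv')
df = df.drop_duplicates(subset=['name'], keep='first')  # keep the first occurrence of each name
df.to_csv('New_file.csv', index=False)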
Below is a CSV snippet with some dummy header lines, while the actual frame is anchored by beerId:
This work is an unpublished, copyrighted work and contains confidential information.
beer consumption
consumptiondate 7/24/2018
consumptionlab H1
numbeerssuccessful 40
numbeersfailed 0
totalnumbeers 40
consumptioncomplete TRUE
beerId Book
341027 Northern Light
The code df = pd.read_csv(path_csv, header=8) works, but the issue is that the header is not always at row 8; it varies by day. I cannot figure out how to use a lambda, as described in the skiprows help:
skiprows : list-like or integer or callable, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at
the start of the file.
If callable, the callable function will be evaluated against the row
indices, returning True if the row should be skipped and False
otherwise. An example of a valid callable argument would be lambda x:
x in [0, 2].
to find the index row of beerId
I think you need preprocessing first:

path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()

# get the indices of all lines starting with beerId
num = [i for i, l in enumerate(lines) if l.startswith("beerId")]
# if no match was found use 0, else take the first index minus 1
num = 0 if len(num) == 0 else num[0] - 1
print(num)
8
df = pd.read_csv(path_csv, header=num)
print (df)
beerId Book
0 341027 Northern Light
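(A compact variant of the same preprocessing that stops at the first match; a sketch, assuming the beerId line itself should become the header row:)

import pandas as pd

path_csv = 'file.csv'
with open(path_csv) as f:
    # index of the first line starting with 'beerId', or 0 if none is found
    num = next((i for i, l in enumerate(f) if l.startswith('beerId')), 0)

df = pd.read_csv(path_csv, header=num)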
I am playing with the really nice code @piRSquared has provided, which can be seen below.
I have added another condition, if row[col2] == 4000, and this value is only seen once in the additional column I added. As expected, this extra condition makes the function yield only a single row, since the condition is met only once.
My question is: how can the code be modified to yield another row once the move is >= move_size?
The desired output is two rows: one when row['B'] == 4000 (as the code produces now), and another when a move >= move_size is seen in col A. I see these as a trade entry and exit, so it would be nice to have an order id in another dataframe column, df['C'], as per the desired output shown below.
Code from original post:
#starting python community conventions
import numpy as np
import pandas as pd
# n is number of observations
n = 5000
day = pd.to_datetime(['2013-02-06'])
# irregular seconds spanning 28800 seconds (8 hours)
seconds = np.random.rand(n) * 28800 * pd.Timedelta(1, 's')
# start at 8 am
start = pd.offsets.Hour(8)
# irregular timeseries
tidx = day + start + seconds
tidx = tidx.sort_values()
s = pd.Series(np.random.randn(n), tidx, name='A').cumsum()
s.plot()
Generator function with slight modification:
def mover_df(df, col, col2, move_size=10):
    ref = None
    for i, row in df.iterrows():
        # added test condition for the new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                ref = row.loc[col]
Generate data
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
Current output:
A B
2013-02-06 14:30:43.874386317 -50.136432 4000.0
Desired output:
(Values in cols A and B on the second row would be whatever the code generates; I have just added random values to show the format I'm interested in. Col C is the trade id, and for every two rows this would increment by 1.)
A B C
2013-02-06 14:30:43.874386317 -50.136432 4000.0 1
2013-02-06 14:30:43.874386317 -47.136432 6000.0 1
I have been trying to code this for hours (it doesn't help with the kids running around the house now it's the school holidays...) and I would appreciate any help. It would be fantastic to get input from @piRSquared, but I appreciate people are busy.
I don't have too much experience with generators or pandas, but does this work? My data produces different output because of the random seed, so I am not sure.
I changed the generator to handle the alternative case after the first condition, row[col2] == 4000, has matched, so calling the generator twice should give both values:
def mover_df(df, col, col2, move_size=10, found=False):
    ref = None
    for i, row in df.iterrows():
        # added test condition for the new col2 signal column
        if row[col2] == 4000:
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
                found = True  # flag that we found the first row we want
                ref = row.loc[col]
        elif found:  # if we found the first row, find the second meeting the condition
            if ref is None or (abs(ref - row.loc[col]) >= move_size):
                yield row
And then you can use it like this:

data_generator = mover_df(df, 'A', 'B', 3)
moves_df = pd.concat([next(data_generator), next(data_generator)], axis=1).T
I'd edit mover_df like this.
Note: I changed the 4000 condition to % 1000 == 0 to give a few more samples.
def mover_df(df, move_col, look_col, move_size=10):
    ref, seen = None, False
    for i, row in df.iterrows():
        # test condition on the look_col signal column
        look_cond = row[look_col] % 1000 == 0
        if look_cond and not seen:
            yield row
            ref, seen = row.loc[move_col], True
        elif seen:
            move_cond = (abs(ref - row.loc[move_col]) >= move_size)
            if move_cond:
                yield row
                ref, seen = None, False
df = s.to_frame()
df['B'] = range(0,len(df))
moves_df = pd.concat(mover_df(df, 'A','B', 3), axis=1).T
print(moves_df)
A B
2013-02-06 08:00:03.264481639 0.554390 0.0
2013-02-06 08:04:26.609855185 -2.479520 35.0
2013-02-06 09:38:07.962175581 -15.042391 1000.0
2013-02-06 09:40:50.737806497 -18.385956 1026.0
2013-02-06 11:13:03.018013689 -29.074125 2000.0
2013-02-06 11:14:30.980633575 -32.221009 2019.0
2013-02-06 12:49:41.432845325 -35.048040 3000.0
2013-02-06 12:50:28.098114592 -38.881795 3012.0
2013-02-06 14:27:15.008225195 13.437165 4000.0
2013-02-06 14:27:32.790466500 9.513736 4003.0
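(To get the trade id column C from the desired output, one option is to number the yielded rows in pairs after the fact; a sketch, assuming the rows alternate entry, exit, entry, exit as mover_df produces them:)

moves_df = pd.concat(mover_df(df, 'A', 'B', 3), axis=1).T
moves_df['C'] = [i // 2 + 1 for i in range(len(moves_df))]  # 1, 1, 2, 2, 3, 3, ...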
caveat
This will continue to look for an exit until one is found or you reach the end of the dataframe, even if you pass another potential entry point. Meaning, in my example, I look every 1000 rows and enter. I then look for when the move is greater than 10 and exit. If I do not find a move greater than 10 before the next 1000-row marker arrives, I ignore that marker and keep looking for an exit.
The philosophy is that if I'm in a trade, I have to exit it. I don't want to enter another trade before resolving the one I'm still in.