Truncate time series files and extract some descriptive variables - Python

I have two problems and I can't work out the solution in Python. Let me explain the context.
On the one hand, I have a dataset containing one date point per ID (1 ID = 1 patient), like this:
ID      Date point
0001    25/12/2022 09:00
0002    29/12/2022 16:00
0003    30/12/2022 18:00
...     ...
On the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...
The files all have the same structure: the ID (the same as in the dataset) is in the file name, and each file is structured like this (the first column contains the date and the second the value):
25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...
1/ I would like to truncate the text files and keep only the values within 48 hours of the dataset's Date point.
2/ For statistical analysis, I want to compute some values such as the mean or the maximum of these variables and add them to a dataframe like this:
ID      Mean    Maximum
0001
0002
0003
...
I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!
Thank you, everybody.
Managing time series with a dataframe containing date points and computing some statistical values.

You could do something along these lines using pandas (I've not been able to test this fully):
import pandas as pd
from pathlib import Path

# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put it in a pandas DataFrame
df = pd.DataFrame(data)

# convert the "Date point" column to datetime objects (the dates are day-first)
df["Date point"] = pd.to_datetime(df["Date point"], dayfirst=True)

# provide the path to the folder containing the files
folder = Path("/path_to_files")

# an empty dictionary that you'll fill with the required statistical info
newdata = {"ID": [], "Mean": [], "Maximum": []}

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct the file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",            # columns are separated by whitespace
            header=None,           # there's no header information
            parse_dates=[[0, 1]],  # combine the first and second columns into one datetime column
            dayfirst=True,         # the dates are day-first
            infer_datetime_format=True,
        )

        # keep the values within 48 hours after the current date point
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["0_1"] > date) & (subdata["0_1"] <= date + td)

        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata[2].loc[mask].mean())
        newdata["Maximum"].append(subdata[2].loc[mask].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
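Note that in recent pandas releases (2.x) the nested-list form of parse_dates and infer_datetime_format are, to my knowledge, deprecated, so if the read_csv call above complains you can read the two columns as plain text and combine them yourself. A minimal, untested sketch assuming the same whitespace-separated layout:

import pandas as pd

# read the file without date parsing; name the three columns explicitly
subdata = pd.read_csv(inputfile, sep=r"\s+", header=None, names=["date", "time", "value"])

# build a single datetime column from the date and time strings (day-first dates)
subdata["datetime"] = pd.to_datetime(subdata["date"] + " " + subdata["time"], dayfirst=True)

# the mask then uses the "datetime" and "value" columns
mask = (subdata["datetime"] > date) & (subdata["datetime"] <= date + pd.Timedelta(hours=48))
print(subdata.loc[mask, "value"].mean(), subdata.loc[mask, "value"].max())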

Related

Python script to find the number of date columns in a csv file and update the date format to MM-DD-YYYY

I get a file every day with around 15 columns. Some days there are 2 date columns and some days only one. Also, the date format is YYYY-MM-DD on some days and DD-MM-YYYY on others. The task is to convert the 1 or 2 date columns to MM-DD-YYYY. Sample data in the csv file for a few columns:
Execution_date  Extract_date  Requestor_Name  Count
2023-01-15      2023-01-15    John Smith      7
Sometimes we don't get the second column above (Extract_date):
Execution_date  Requestor_Name  Count
17-01-2023      Andrew Mill     3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY.
So the sample output of the above 2 files would be:
Execution_date  Extract_date  Requestor_Name  Count
01-15-2023      01-15-2023    John Smith      7
Execution_date  Requestor_Name  Count
01-17-2023      Andrew Mill     3
I am using pandas and can't figure out how to deal with the second column being missing on some days, and with the changing date format.
I can hardcode the 2 column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This only works when the file has both columns and the values are in DD-MM-YYYY format.
I am looking for guidance on how to dynamically find the number of date columns and the date format so that I can use them in my 2 lines above. If that's not possible, any other solution would also work for me. I could use PowerShell if it can't be done in Python, but I am guessing there will be a lot more avenues in Python than in PowerShell.
The following loads a CSV file into a dataframe, checks each value (that is a str) to see if it matches one of the date formats, and if it does, rearranges the date into the format you're looking for. Other values are untouched.
import pandas as pd
import re

df = pd.read_csv("today.csv")

# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"

def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
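For example, a minimal sketch of that conversion (using the column names from your sample) would be:

# parse the already re-ordered MM-DD-YYYY strings into proper datetime columns
for col in ["Execution_date", "Extract_date"]:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], format="%m-%d-%Y")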
You can use a function that checks all column names containing "date" and use .fillna to try other formats (add all the possible formats).
import pandas as pd

def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df

data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])

final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3

What is the most efficient way to read and augment (copy samples and change some values) large dataset in .csv

Currently, I have managed to solve this, but it is slower than what I need. It takes approximately 1 hour for 500k samples; the entire dataset is ~100M samples, which would require ~200 hours.
Hardware/Software specs: RAM 8GB, Windows 11 64bit, Python 3.8.8
The problem:
I have a dataset in .csv (~13 GB) where each sample has a value and a respective start-end period of a few months. I want to create a dataset where each sample has the same value but refers to each specific month.
For example:
from:
idx | start date | end date | month | year | value
0 | 20/05/2022 | 20/07/2022 | 0 | 0 | X
to:
0 | 20/05/2022 | 20/07/2022 | 5 | 2022 | X
1 | 20/05/2022 | 20/07/2022 | 6 | 2022 | X
2 | 20/05/2022 | 20/07/2022 | 7 | 2022 | X
Ideas: manage to do it in parallel (with something like Dask, but I am not sure how for this task).
My implementation:
Read in chunks with pandas, augment in dictionaries, append to CSV. I use a function that, given a df, calculates for each sample the months from start date to end date and creates a copy of the sample for each month, appending it to a dictionary. It then returns the final dictionary.
The calculations are done in dictionaries as they were found to be way faster than doing them in pandas. I then iterate through the original CSV in chunks, apply the function to each chunk, and append the resulting augmented data to another csv.
The function:
import csv
import pandas as pd

def augment_to_monthly_dict(chunk):
    '''
    Takes a df (or sub-df) and returns an augmented dataset with monthly data
    in dictionary form (for efficiency).
    '''
    dict = {}
    l = 1
    for i in range(len(chunk)):  # iterate through every sample
        # find the month and year period
        mst = int(float(str(chunk.iloc[i].start)[4:6]))  # start month
        mend = int(str(chunk.iloc[i].end)[4:6])          # end month
        yst = int(str(chunk.iloc[i].start)[:4])          # start year
        yend = int(str(chunk.iloc[i].end)[:4])           # end year
        if yend == yst:
            months = [m for m in range(mst, mend + 1)]
            years = [yend for i in range(len(months))]
        elif yend == yst + 1:  # year changes within the sample
            months = [m for m in range(mst, 13)]
            years = [yst for i in range(mst, 13)]
            months = months + [m for m in range(1, mend + 1)]
            years = years + [yend for i in range(1, mend + 1)]
        else:
            continue
        # months is a list of each month in the period of the sample and years is a
        # same-length list of the respective years, e.g. months=[11,12,1,2],
        # years=[2021,2022,2022,2022]
        for j in range(len(months)):  # iterate through the list of months
            # copy the original sample and make it a dictionary
            tmp = pd.DataFrame(chunk.iloc[i]).transpose().to_dict(orient='records')
            # change the month and year values accordingly (they were 0 for initiation)
            tmp[0]['month'] = months[j]
            tmp[0]['year'] = years[j]
            # here you could add more calcs, e.g. drop irrelevant columns, change
            # datatypes etc. to reduce size
            # -------------------------------------
            # append the new row to the augmented data
            dict[l] = tmp[0]
            l += 1
    return dict
Reading the original dataset (.csv, ~13 GB), augmenting it with the function, and appending the result to a new .csv:
chunk_count = 0
for chunk in pd.read_csv('enc_star_logar_ek.csv', delimiter=';', chunksize=10000):
    chunk.index = chunk.reset_index().index
    aug_dict = augment_to_monthly_dict(chunk)  # make the chunk a dictionary to work faster
    chunk_count += 1
    if chunk_count == 1:  # get the column names, open the csv, write the headers and the 1st chunk
        # find the dict's keys (the column names) from the first dict only (not reading all the data)
        for kk in aug_dict.values():
            key_names = [i for i in kk.keys()]
            print(key_names)
            break  # break after the first inner dict
        # open the csv file and write ';'-separated data
        with open('dic_to_csv2.csv', 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writeheader()
            writer.writerows(aug_dict.values())
    else:  # save the rest of the data chunks
        print('added chunk: ', chunk_count)
        with open('dic_to_csv2.csv', 'a', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, delimiter=';', fieldnames=key_names)
            writer.writerows(aug_dict.values())
Pandas' efficiency comes into play when you need to manipulate columns of data; to do that, Pandas reads the input row by row, building up a series of data for each column. That's a lot of extra computation your problem doesn't benefit from, and it actually slows your solution down.
You actually need to manipulate rows, and for that the fastest way is to use the standard csv module; all you need to do is read a row in, write the derived rows out, and repeat:
import csv
import sys
from datetime import datetime

def parse_dt(s):
    return datetime.strptime(s, r"%d/%m/%Y")

def get_dt_range(beg_dt, end_dt):
    """
    Returns a range of (month, year) tuples, from beg_dt up-to-and-including end_dt.
    """
    if end_dt < beg_dt:
        raise ValueError(f"end {end_dt} is before beg {beg_dt}")
    mo, yr = beg_dt.month, beg_dt.year
    dt_range = []
    while True:
        dt_range.append((mo, yr))
        if mo == 12:
            mo = 1
            yr = yr + 1
        else:
            mo += 1
        if (yr, mo) > (end_dt.year, end_dt.month):
            break
    return dt_range

fname = sys.argv[1]

with open(fname, newline="") as f_in, open("output_csv.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # transfer header
    for row in reader:
        beg_dt = parse_dt(row[1])
        end_dt = parse_dt(row[2])
        for mo, yr in get_dt_range(beg_dt, end_dt):
            row[3] = mo
            row[4] = yr
            writer.writerow(row)
And, to compare with Pandas in general, let's examine #abokey's specific Pandas solution. I'm not sure if there is a better Pandas implementation, but this one more or less does the right thing:
import sys
import pandas as pd

fname = sys.argv[1]

df = pd.read_csv(fname)
df["start date"] = pd.to_datetime(df["start date"], format="%d/%m/%Y")
df["end date"] = pd.to_datetime(df["end date"], format="%d/%m/%Y")

df["month"] = df.apply(
    lambda x: pd.date_range(
        start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M"
    ).month.tolist(),
    axis=1,
)
df["year"] = df["start date"].dt.year

out = df.explode("month").reset_index(drop=True)
out.to_csv("output_pd.csv")
Let's start with the basics, though: do the programs actually do the right thing? Given this input:
idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/12/2022,20/01/2023,0,0,X
My program, ./main.py input.csv, produces:
idx,start date,end date,month,year,value
0,20/05/2022,20/05/2022,5,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/12/2022,20/01/2023,12,2022,X
0,20/12/2022,20/01/2023,1,2023,X
I believe that's what you're looking for.
The Pandas solution, ./main_pd.py input.csv, produces:
,idx,start date,end date,month,year,value
0,0,2022-05-20,2022-05-20,5,2022,X
1,0,2022-05-20,2022-07-20,5,2022,X
2,0,2022-05-20,2022-07-20,6,2022,X
3,0,2022-05-20,2022-07-20,7,2022,X
4,0,2022-12-20,2023-01-20,12,2022,X
5,0,2022-12-20,2023-01-20,1,2022,X
Ignoring the added column for the frame index, and the fact that the date format has been changed (I'm pretty sure that can be fixed with some Pandas directive I don't know), it still does the right thing with regard to creating new rows with the appropriate date range.
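(For what it's worth, I believe both can be handled at write time with something like:

# drop the frame index and keep the original day-first date format on output
out.to_csv("output_pd.csv", index=False, date_format="%d/%m/%Y")

though I haven't checked whether that changes the timings below.)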
So, both do the right thing. Now, on to performance. I duplicated your initial sample, just the 1 row, for 1_000_000 and 10_000_000 rows:
import sys

nrows = int(sys.argv[1])

with open(f"input_{nrows}.csv", "w") as f:
    f.write("idx,start date,end date,month,year,value\n")
    for _ in range(nrows):
        f.write("0,20/05/2022,20/07/2022,0,0,X\n")
I'm running a 2020 M1 MacBook Air with the 2TB SSD (which gives very good read/write speeds):
              1M rows (sec, RAM)    10M rows (sec, RAM)
csv module    7.8s, 6MB             78s, 6MB
Pandas        75s, 569MB            750s, 5.8GB
You can see that both programs follow a linear increase in run time as the number of rows grows. The csv module's memory remains constantly negligible because it streams data in and out (holding on to virtually nothing); Pandas's memory rises with the number of rows it has to hold so that it can do the actual date-range computations, again on whole columns. Also, not shown, but for the 10M-row Pandas test, Pandas spent nearly 2 minutes just writing the CSV, longer than the csv-module approach took to complete the entire task.
Now, for all my putting-down of Pandas, its solution is far fewer lines and is probably bug-free from the get-go. I did have a problem writing get_dt_range(), and had to spend about 5 minutes thinking about what it actually needed to do and debugging it.
You can view my setup with the small test harness, and the results, here.
I suggest you use pandas (or even dask) to get the list of months between the two date columns of a huge dataset (e.g. a .csv of ~13 GB). First you need to convert your two columns to datetime using pandas.to_datetime. Then, you can use pandas.date_range to get your list.
Try this:
import pandas as pd
from io import StringIO
s = """start date end date month year value
20/05/2022 20/07/2022 0 0 X
"""
df = pd.read_csv(StringIO(s), sep='\t')
df['start date'] = pd.to_datetime(df['start date'], format = "%d/%m/%Y")
df['end date'] = pd.to_datetime(df['end date'], format = "%d/%m/%Y")
df["month"] = df.apply(lambda x: pd.date_range(start=x["start date"], end=x["end date"] + pd.DateOffset(months=1), freq="M").month.tolist(), axis=1)
df['year'] = df['start date'].dt.year
out = df.explode('month').reset_index(drop=True)
>>> print(out)
start date end date month year value
0 2022-05-20 2022-07-20 5 2022 X
1 2022-05-20 2022-07-20 6 2022 X
2 2022-05-20 2022-07-20 7 2022 X
Note: I tested the code above on a 1-million-sample .csv dataset and it took ~10 min to get the output.
You can read a very large csv file with dask, then process it (same API as pandas), then convert it to a pandas dataframe if you need to.
Dask is perfect when pandas fails due to data size or computation speed. But for data that fits into RAM, pandas can often be faster and easier to use than a Dask DataFrame.
import dask.dataframe as dd

# 1. read the large csv
dff = dd.read_csv('path_to_big_csv_file.csv')  # returns a Dask DataFrame

# if that is still not enough, try reducing IO costs further:
dff = dd.read_csv('largefile.csv', blocksize=25e6)  # blocksize = number of bytes by which to cut up larger files
dff = dd.read_csv('largefile.csv', usecols=["a", "b", "c"])  # read only columns a, b and c

# 2. work with dff; dask has the same API as pandas:
# https://docs.dask.org/en/stable/dataframe-api.html

# 3. then, finally, convert dff to a pandas dataframe if you want
df = dff.compute()  # returns a pandas dataframe
You can also try other alternatives for reading very large csv files efficiently, with high speed and low memory usage: polars, modin, koalas.
All of those packages, just like dask, use an API similar to pandas'.
If you have a very big csv file, pandas read_csv with chunksize usually doesn't succeed, and even if it does, it is a waste of time and energy.
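For instance, a minimal polars sketch (untested, file name assumed):

import polars as pl

# eager read of the whole file (multi-threaded)
df = pl.read_csv("largefile.csv")

# or a lazy scan that only materialises the columns you ask for
df = pl.scan_csv("largefile.csv").select(["a", "b", "c"]).collect()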
There's a Table helper in the convtools library (I must confess, it's a lib of mine). This helper processes csv files as a stream, using a simple csv.reader under the hood:
from datetime import datetime

from convtools import conversion as c
from convtools.contrib.tables import Table

def dt_range_to_months(dt_start, dt_end):
    return tuple(
        (year_month // 12, year_month % 12 + 1)
        for year_month in range(
            dt_start.year * 12 + dt_start.month - 1,
            dt_end.year * 12 + dt_end.month,
        )
    )

(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        year_month=c.call_func(
            dt_range_to_months,
            c.call_func(datetime.strptime, c.col("start date"), "%d/%m/%Y"),
            c.call_func(datetime.strptime, c.col("end date"), "%d/%m/%Y"),
        )
    )
    .explode("year_month")
    .update(
        year=c.col("year_month").item(0),
        month=c.col("year_month").item(1),
    )
    .drop("year_month")
    .into_csv("tmp/out.csv")
)
Input/Output:
~/o/convtools ❯❯❯ head tmp/in.csv
idx,start date,end date,month,year,value
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
0,20/05/2022,20/07/2022,0,0,X
...
~/o/convtools ❯❯❯ head tmp/out.csv
idx,start date,end date,month,year,value
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
0,20/05/2022,20/07/2022,5,2022,X
0,20/05/2022,20/07/2022,6,2022,X
0,20/05/2022,20/07/2022,7,2022,X
...
On my M1 Mac, on a file where each row explodes into three, it processes 100K rows per second. For 100M rows of the same structure it should take ~1000 s (< 17 min). Of course, it depends on how deep the inner by-month cycles are.

Python: iterate through the rows of a csv and calculate date difference if there is a change in a column

I only have basic knowledge of Python, so I'm not even sure if this is possible.
I have a csv that looks like this:
[1]: https://i.stack.imgur.com/8clYM.png
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
import datetime
from datetime import *
data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv",encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: the dates are data points taken. I need to calculate back from the most recent of those until the job title was something else. So, in my example, for employee 5000 that is from 04/07/2021 back to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd
import datetime
# load data
data = pd.read_csv('data.csv')
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)
# find the latest point in time where there is a change in job title
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
Have a map (dict) of employee to (date, title).
For every row, check if you already have an entry for the employee. If you don't just put the information in the map, otherwise compare the date of the row and that of the entry. If the row has a more recent date, replace the entry.
Once you've gone through all the rows, you can just go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your pattern is not correct: the sample data uses a %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year) format. The sample data is not sufficient to say which, but it certainly is not YMD (and note that %M means minutes, not months).
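A minimal sketch of that map approach (untested; it assumes the file and columns from your question, i.e. jts.csv with id, jtitle, date columns and day-first dates):

import csv
from datetime import datetime, date

latest = {}  # employee id -> (most recent date, job title at that date)

with open("jts.csv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for emp_id, jtitle, date_str in reader:
        row_date = datetime.strptime(date_str, "%d/%m/%Y").date()
        # keep only the most recent entry per employee
        if emp_id not in latest or row_date > latest[emp_id][0]:
            latest[emp_id] = (row_date, jtitle)

# how many days have passed since that most recent record
for emp_id, (last_date, jtitle) in latest.items():
    print(emp_id, jtitle, (date.today() - last_date).days)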
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch the last job title resp. date
    # Search for the first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect the results in a list with the datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]

Remove rows in a csv file based on the format of column value

I have a csv file which contains three columns: computer_name, software_code, software_update_date. The file contains computers that I don't need in my final report. I only need the data for computers whose name starts with 40-, 46-, or 98-. Here is the sample file:
computer_name software_code software_update_date
07-0708 436 2019-02-07 0:00
30-0207 35170 2021-01-18 0:00
40-0049 41 2017-06-21 23:00
46-0001 11 2013-11-23 0:00
So I would like to delete the rows for 07-0708 and 30-0207. I tried with pandas, but the generated file is exactly the same, with no error message. I am quite new to Python and still grasping the concepts. I wrote the code below:
import csv
import pandas as pd
fname = 'RAWfile.csv'
df=pd.read_csv(fname,encoding='ISO-8859-1')
#Renaming columns from the report
df.rename(columns = {'computer_name':'PC_NO', 'software_code':'SOFT_CODE', 'software_update_date':'UPDATE_DATE'}, inplace=True)
computers = ['40-','46-','98-']
searchstr = '|'.join(computers)
df[df['PC_NO'].str.contains(searchstr)]
df.to_csv('updatedfile.csv',index=False,quoting=csv.QUOTE_ALL,line_terminator='\n')
UPDATE: There are almost 70,000 rows in the csv file. Corrected the values in computers list to match the question.
You can try this:
# String to be searched in start of string
search = ("40-", "46-", "98-")
# boolean series returned with False at place of NaN
series = df["computer_name"].str.startswith(search, na = False)
# displaying filtered dataframe
df[series]
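To tie it back to your script, a minimal sketch (it assumes the column has already been renamed to PC_NO and that csv/pandas are imported as in your code); note that the filtered result has to be assigned to something before writing it out:

search = ("40-", "46-", "98-")
filtered = df[df["PC_NO"].str.startswith(search, na=False)]
filtered.to_csv("updatedfile.csv", index=False, quoting=csv.QUOTE_ALL, line_terminator='\n')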

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under them. To make things more complicated, the day of the week, date, and billing day are shown above the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple Python script that turns the data into a data frame with the columns DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN for the KW, KVAR, and KVA data after the first five days (which correlates with a new iteration of the for loop). What is weird to me is that when I try to print out the same ranges, I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I cycle through the five days, grab their data in a
        # temporary dataframe, and then append that dataframe onto the output dataframe.
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            # These are the 3 values that stop working after the first column change.
            # I get the values that I expect for the first 5 days.
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set.
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # troubleshooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase the column positions for each iteration of the column loop;
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase the row positions for each iteration of the row loop;
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
Could be pd.append requiring matched row indices for numerical values.
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries

tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2]  # generates NaN: row indices (3, 4) don't match tmp's (0, 1)
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report invalid row indices, though similar column indices would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off-topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a Pandas DataFrame, as the original format is kinda complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like. Or, alternatively, try a multi-index if you stick with Pandas I/O.
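For instance, a toy wide-to-long reshape with pd.melt (made-up column names, just to illustrate the idea):

import pandas as pd

# toy wide frame: one row per time of day, one column per date
wide = pd.DataFrame({
    "TIME": ["00:00", "00:30"],
    "2021-06-01": [1.0, 2.0],
    "2021-06-02": [3.0, 4.0],
})

# melt into long format: one row per (DATE, TIME) pair
long_df = wide.melt(id_vars="TIME", var_name="DATE", value_name="KW")
print(long_df)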
