Sifting through CSV files for time stamps - python

I am fairly new to Python and have a problem I'd like to solve, but I need a little bit of help.
I need to ask a user which directory path they want, which I've already figured out, but....
From there I need to ask the user for a specific date/time range (day-month-year, hours:minutes:seconds) and then filter out which CSV files fall in that range.
Next, my program needs to go into the filtered CSV files and look at the timestamps recorded in them.
From those timestamps I need to determine whether there are any gaps from the end of one CSV file to the start of the next.
If there is a gap, I need to report how long it is.
I've seen a few things, but am having trouble putting it all together!
Any guidance would be appreciated!

Consider using Dask data frames (https://docs.dask.org/en/latest/dataframe.html), which work on top of Pandas data frames.
Without going much deeper into Dask, you need to know that it works in lazy mode, which means it will not do any processing until it is explicitly triggered with the compute method. That makes the coding slightly different from Pandas.
The following example solves the part about reading multiple files and finding gaps. The data files (which you can find here: https://github.com/mchiuminatto/stackoverflow/tree/master/data)
are OHLC data with a frequency of D (one day), so the gap condition is that the difference between any two consecutive dates is more than 1 day.
import dask.dataframe as dd
# read all the csv files in the directory
# how much is loaded into memory is managed by Dask.
df = dd.read_csv('./data/*.csv')
df['date_time'] = dd.to_datetime(df['Time (UTC)'])
df['Time (UTC)'] = dd.to_datetime(df['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['dif'] = df['date_time'] - df['date_time'].shift(1) # calculates gaps
# no data transformation is performed until you execute compute.
df.compute().head(5)
To check one record:
df.loc['2020-01-06 22:00:00'].compute()
Filter the periods with more than one day of difference
_mask = df['dif'] > '1 days' # time unit can be adjusted
df_gap = df[_mask].compute() # now we persist transformations in a Pandas df: df_gap
df_gap.head(5)
df_gap.tail(5)
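For the part of the question about restricting the data to a user-supplied date/time range, a minimal sketch could look like the following. The prompt text, the input date format, and the column name 'Time (UTC)' are assumptions carried over from the example above, not part of the original question.
import pandas as pd
import dask.dataframe as dd

# ask the user for the window; the date format here is just an assumption
start = pd.to_datetime(input('Start (DD-MM-YYYY HH:MM:SS): '), dayfirst=True)
end = pd.to_datetime(input('End (DD-MM-YYYY HH:MM:SS): '), dayfirst=True)

df = dd.read_csv('./data/*.csv')
df['date_time'] = dd.to_datetime(df['Time (UTC)'])
df['Time (UTC)'] = dd.to_datetime(df['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['dif'] = df['date_time'] - df['date_time'].shift(1)  # gap between consecutive rows

# restrict to the requested window, then keep only the rows that follow a gap
df_range = df.loc[start:end]
gaps = df_range[df_range['dif'] > pd.Timedelta('1 days')].compute()
print(gaps)
A statement about the size of each gap can then be built from the 'dif' column of gaps.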

Related

python time series plot problem (discontinuous datetime, plot weird for some files)

Hi, I am trying to plot some time series data, but there are two problems.
Before describing the problems: there are many stations, and the data files I use are one per station.
I mean, the files are station1.csv, station2.csv, ... .
And each CSV file has date, station name, sensor name, elevation, groundwater level, etc.
Discontinuous time series
The original file has a discontinuous time series, as attached below.
2014-10-24,JDsd1,S11,1.49,26.47,36.84,18.19,7682,1021.57
2014-10-25,JDsd1,S11,1.49,26.47,36.84,18.19,7995,1021.79
2014-10-26,JDsd1,S11,1.52,26.44,36.87,18.2,7985,1019.75
2014-10-27,JDsd1,S11,1.53,26.43,36.88,18.2,7979,1020.13
2014-10-28,JDsd1,S11,,,,,,
2014-11-13,JDsd1,S11,1.33,26.63,36.67,18.08,13160,1026.25
2014-11-14,JDsd1,S11,1.24,26.72,36.58,18.11,13013,1027.09
2014-11-15,JDsd1,S11,1.23,26.73,36.57,18.12,12912,1030.27
2014-11-16,JDsd1,S11,1.22,26.74,36.56,18.13,12853,1026.32
I need to make the date range continuous, but it is hard to do.
When I use pd.date_range(start_date (or min), end_date (or max), freq='d'), the result shows ValueError: Length of values (775) does not match length of index (769).
The length of values (775) is the number of dates I need to create, and the length of index (769) is the current number of dates.
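The error comes from assigning a 775-element date range to a frame that only has 769 rows. Instead of assigning the range as a column, one option is to reindex on the full range; a rough sketch, assuming the date column is called Date as in the code further below, with a placeholder file name:
import pandas as pd

pp = pd.read_csv('station1.csv')               # placeholder file name
pp = pp.set_index(pd.to_datetime(pp['Date']))

# build the full daily range and reindex; missing days become NaN rows
full_range = pd.date_range(pp.index.min(), pp.index.max(), freq='D')
pp = pp.reindex(full_range)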
About plot shape
This is an atmospheric data plot from one station's data file.
However, some stations show weird plots of the atmospheric data, as below.
I used the same code, and the data have the same structure.
I cannot see any difference in the data. (I want to upload the data, but it would be too long.)
If you know any solutions or hints, please let me know.
I solved the first problem.
import glob
import pandas as pd

n = 4
f_n = glob.glob('%s%s.csv' % (path_dir, gs['station'][n]))  # get the file for station n
pp = pd.read_csv(f_n[0])                                    # read the file
pp = pp.set_index(pd.to_datetime(pp['Date']))               # change the RangeIndex to a DatetimeIndex
pp = pp.resample('D').first()                               # make the time series continuous
Now I need a solution for the second problem.
If you know a solution or hint, please let me know.
I think you might have a comma as the decimal separator.
I just fixed my problem, which is your second problem, by doing
df = pd.read_csv(file, thousands='.', decimal=',')
Maybe that helps you

What is the most Pythonic way to relate two pandas DataFrames based on a key value?

So, I work at a place where I use A LOT of Python (pandas), and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, weeks after that with a few million rows, and now with 42 million rows. Most of my work is taking a dataframe and, for each row, looking up its "equivalent" in another dataframe and processing the data; sometimes it's just a merge, but more often I need to apply a function to the equivalent data. Back in the days of a few hundred thousand rows it was OK to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I switched to Vaex, which is way faster than pandas in every respect except apply, and after some searching I found that apply is a last resort and should be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I was doing it this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
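One common vectorized alternative to the row-by-row apply above is to aggregate the lookup table once and merge on the key. This is only a rough sketch, reusing the column names from the snippet (cnpj, cnae, cnae_fiscal) and assuming a single merge fits in memory:
import pandas as pd

# collect all secondary codes per key in a single pass over cnaes
secondary = (
    cnaes.groupby("cnpj")["cnae"]
    .agg(list)
    .rename("cnae_secundarios")
    .reset_index()
)

# one merge instead of one filter per row
empresas = empresas.merge(secondary, on="cnpj", how="left")

# prepend the primary code, mirroring what get_secondary returned
empresas["cnae_secundarios"] = [
    [primary] + (rest if isinstance(rest, list) else [])
    for primary, rest in zip(empresas["cnae_fiscal"], empresas["cnae_secundarios"])
]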

Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out the data) and Python, which processes it and prints it back to Excel as a front-end 'gui'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum the ticks that arrive at the same time (same h:m:sec)
check if the DataFrame contains any value from a static array (e.g. large quantities)
write the output to the same Excel file (a different sheet), used as the front-end output 'gui'.
I imported the xlwings library and use it to read data from one sheet, calculate the needed values in Python, and then print the results to another sheet of the same file. I want to have Excel open and visible so it functions as an 'output dashboard'. This function is run in an infinite loop reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick', 'symb']
tickdf = tickdf[['time', 'symb', 'price', 'all-tick']]
# read data and fill a pandas DataFrame with values, then re-order columns
try:
    global ttt     # temporary global pandas DataFrame
    global tttout  # output global pandas DataFrame copy
    # they are global as they can be zeroed by another function
    ttt = ttt.append(tickdf, ignore_index=False)
    # at each loop, newly read ticks are appended as rows to the end of the global ttt
    ttt.drop_duplicates(inplace=True)
    tttout = ttt.copy()
    # to prevent outputting incomplete data, a copy of ttt is used as the DataFrame
    # that gets printed out to the Excel file; I find this an extra-safety step
    tttout = tttout.groupby(['time', 'symb'], as_index=False).agg({'all-tick': 'sum', 'price': 'first'})
    tttout = tttout.set_index('time')
    # sort by time/name and set time as the index
    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    # find matching values by comparing against an array of a dozen values
    tttout = tttout.sort_values(by=['time', 'symb'], ascending=[False, True])
    xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
except Exception:
    pass  # the exception handling was not included in the original snippet
I run this on an i5 @ 4.2 GHz, and this function, together with some other small code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be bottlenecks.
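One way to see which step dominates those 500-600 ms is to time each stage separately. A minimal sketch using only the standard library; the commented placeholders stand for the read, process, and write sections of the loop above:
import time

t0 = time.perf_counter()
# ... read the range from Excel ...
t1 = time.perf_counter()
# ... append / groupby / isin / sort ...
t2 = time.perf_counter()
# ... write the result back to Excel ...
t3 = time.perf_counter()

print(f'read: {t1 - t0:.3f}s  process: {t2 - t1:.3f}s  write: {t3 - t2:.3f}s')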
The code reads 1500 rows, one per listed stock in alphabetical order; each row is the 'last tick' traded on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
i.e. time, stock symbol, price, quantity.
I want to check whether some specific quantities are being traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use Pandas/xlwings and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
It would be faster to use a UDF written with PyXLL, as that would avoid going via COM and an external process. You would have a formula in Excel with its input set to your range of data, and it would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take the mostly static data (e.g. rows and column headers), other functions that take the variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.
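For illustration only, a rough sketch of what such a UDF might look like. The function name and the exact logic are assumptions, and the signature string follows the numpy_array types described in the PyXLL docs linked above; check those docs for the authoritative syntax:
from pyxll import xl_func
import numpy as np

@xl_func("numpy_array<float> quantities, numpy_array<float> targets: float")
def count_target_quantities(quantities, targets):
    # called by Excel whenever the input ranges change;
    # counts how many ticks match one of the target quantities (e.g. 1 or 1,000,000)
    return float(np.isin(quantities, targets).sum())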
@Tony Roberts, thank you.
I have one doubt and one observation.
DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible for a UDF function to be called that often? Would it stay lean? I have little experience with this.
OBSERVATION: PyXLL is for sure extremely powerful, well done, and well maintained, but IMHO, at $25/month it goes beyond the purely free nature of the Python language. I do understand, though, that quality has a price.

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame?
Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a question here on StackOverflow to which I can point people in the future.
Simple Solution
If you just want to get something working quickly, then simple use of dask.dataframe.read_csv with a glob string for the path should suffice:
import dask.dataframe as dd
df = dd.read_csv('2000-*.csv')
Keyword arguments
The dask.dataframe.read_csv function supports most of the pandas.read_csv keyword arguments, so you might want to tweak things a bit.
df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
Set the index
Many operations like groupbys, joins, index lookup, etc. can be more efficient if the target column is the index. For example if the timestamp column is made to be the index then you can quickly look up the values for a particular range easily, or you can join efficiently with another dataframe along time. The savings here can easily be 10x.
The naive way to do this is to use the set_index method
df2 = df.set_index('timestamp')
However if you know that your new index column is sorted then you can make this much faster by passing the sorted=True keyword argument
df2 = df.set_index('timestamp', sorted=True)
Divisions
In the above case we still pass through the data once to find good breakpoints. However, if your data is already nicely segmented (such as one file per day), then you can give these division values to set_index to avoid this initial pass (which can be costly for a large amount of CSV data).
import pandas as pd
divisions = tuple(pd.date_range(start='2000', end='2001', freq='1D'))
df2 = df.set_index('timestamp', sorted=True, divisions=divisions)
This solution correctly and cheaply sets the timestamp column as the index (allowing for efficient computations in the future).
Convert to another format
CSV is a pervasive and convenient format. However it is also very slow. Other formats like Parquet may be of interest to you. They can easily be 10x to 100x faster.
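As a rough sketch of what that one-time conversion might look like (the file and directory names are placeholders, and to_parquet needs pyarrow or fastparquet installed):
import dask.dataframe as dd

df = dd.read_csv('2000-*.csv', parse_dates=['timestamp'])
df = df.set_index('timestamp', sorted=True)

# write once to Parquet, then read the Parquet data in future sessions
df.to_parquet('2000.parquet')
df2 = dd.read_parquet('2000.parquet')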

Aggregate time between alternate rows

I have a dataset that's roughly 200KB in size. I've cleaned up the data and loaded it into an RDD in Spark (using pyspark) so that the header format is the following:
Employee ID | Timestamp (MM/DD/YYYY HH:MM) | Location
This dataset stores employee stamp-in and stamp-out times, and I need to add up the amount of time they've spent at work. Assuming the format of the rows is clean and strictly alternating (so stamp in, stamp out, stamp in, stamp out, etc.), is there a way to aggregate the time spent in Spark?
I've tried filtering on all the "stamp in" values and aggregating the time with the value in the row directly after (so r+1), but this is proving to be very difficult, not to mention expensive. I think this would be straightforward to do in a language like Java or Python, but before switching over, am I missing a solution that can be implemented in Spark?
You can try using the window function lead:
from pyspark.sql import Window
from pyspark.sql.functions import col, lead

# pair every row with the next timestamp for the same employee
window = Window.partitionBy("id").orderBy("timestamp")
newDf = df.withColumn("stampOut", lead("timestamp", 1).over(window)).where(col("stampOut").isNotNull())
finalDf = newDf.select(col("id"), col("stampOut") - col("timestamp"))
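Depending on the Spark version, subtracting two timestamp columns directly may not give a convenient numeric value, and the lead above pairs out-to-in rows as well as in-to-out ones. A hedged variant that works in plain seconds and keeps only the stamp-in rows, assuming the rows strictly alternate, the first row per employee is a stamp-in, and the timestamp column is already a timestamp type:
from pyspark.sql import Window
from pyspark.sql.functions import col, lead, row_number, unix_timestamp

window = Window.partitionBy("id").orderBy("timestamp")

durations = (
    df.withColumn("stampOut", lead("timestamp", 1).over(window))
      .withColumn("rn", row_number().over(window))
      .where(col("stampOut").isNotNull() & (col("rn") % 2 == 1))  # odd rows are stamp-ins
      .withColumn("seconds", unix_timestamp("stampOut") - unix_timestamp("timestamp"))
)

# total time at work per employee, in seconds
totals = durations.groupBy("id").sum("seconds")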
