Time Data does not match format "'%H:%M.%S%f'" - python

I am trying to forecast time series data.
The time series data in my csv file is in the form 0:00.000
Hence, I indexed the time series data column as follows:
df.columns=['Elapsed','I']
df['Elapsed']=pd.to_datetime(df['Elapsed'], format='%H:%M.%S%f')
df['Elapsed']=df['Elapsed'].dt.time
df.set_index('Elapsed', inplace=True)
Later, I split my data into a train section and a test section:
train = df.loc['0:00.000':'0:28.778']
test = df.loc['0:28.779':]
My stack trace is
An extract of my data is:
Can anyone explain how to prevent this error from occurring?

Since the question has now changed, I'll write a new answer.
Your dataframe is indexed by instances of datetime.time, but you're trying to slice it with strings - pandas doesn't want to compare strings with times.
To get your slicing to work, try this:
import datetime

split_from = datetime.datetime.strptime('0:00.000', '%H:%M.%S%f').time()
split_to = datetime.datetime.strptime('0:28.778', '%H:%M.%S%f').time()
train = df[split_from:split_to]
It would also be useful to hold the format in a variable since you're now using it in several places.
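For instance (a minimal sketch; TIME_FORMAT is just an illustrative name):
TIME_FORMAT = '%H:%M.%S%f'
df['Elapsed'] = pd.to_datetime(df['Elapsed'], format=TIME_FORMAT)
split_from = datetime.datetime.strptime('0:00.000', TIME_FORMAT).time()
split_to = datetime.datetime.strptime('0:28.778', TIME_FORMAT).time()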
Or if you have fixed split times, you could instead do
split_from = datetime.time(0, 0, 0)
split_to = datetime.time(0, 28, 7, 780000)  # hour, minute, second, microsecond; seconds must be an int, so 77.8 is not valid
train = df[split_from:split_to]

Without seeing your data, I'm just guessing, but here goes:
I'm guessing your original data in the 'Elapsed' column looks like
'12:34.5678'
'12:35.1234'
In particular, it has quotes on each side of the numbers. Otherwise your line
df['Elapsed']=pd.to_datetime(df['Elapsed'], format="'%H:%M.%S%f'")
would fail.
So the error message is telling you that your slicing times have the wrong format: they are missing quotes on each side. Change it to
train = df.loc["'0:00.000'":"'0:28.778'"]
(likewise for the next line) and hopefully that will sort it out.
If you can extract your source data in a way that avoids having quote characters in the timestamps, you'll probably find things a little simpler.
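If you can't change the export, one option is to strip the quotes in pandas before parsing; a minimal sketch, assuming the 'Elapsed' column really does contain literal single quotes:
df['Elapsed'] = df['Elapsed'].str.strip("'")
df['Elapsed'] = pd.to_datetime(df['Elapsed'], format='%H:%M.%S%f')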

Related

Get data from an Object datatype in a python dataframe

I have a DataFrame with coordinates saved in the following format: {"type":"Point","coordinates":[25.2484759,55.3525189]}.
It is saved as an object dtype. Please help me retrieve the coordinates from this column without iteration.
I am a beginner in coding, but I think running a loop over this and splitting the data would be an unnecessary task. I hope you all can help me.
This is what I thought
float(trip_data["pickuplocation"][0][31:-13]),float(trip_data["pickuplocation"][0][-12:-3])
I want coordinates to be saved as an array.
Sorry if I sound less technical. Please feel free to ask for more details.
If you want to use dataframes with spatial data, you might want to look at the GeoPandas package.
To address the question: assuming the column is a string, you were quite close; you can get the coordinates without a loop using:
coord1 = trip_data["pickuplocation"].str[31:-13].astype(float)
coord2 = trip_data["pickuplocation"].str[-12:-3].astype(float)
You need to tell pandas to treat this object as a string to use string indexing, and then you tell it the series are floats with the astype.
Edit: a more reliable approach is to parse the value instead of slicing it, using ast.literal_eval (which, unlike eval, only evaluates Python literals, though the docs warn it can still crash on very large or deeply nested input):
import ast
coords = trip_data["pickuplocation"].apply(lambda x: ast.literal_eval(x)["coordinates"])
You should then be able to index coords as a list.
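If you then want the coordinates as a single array rather than a Series of lists, you can stack them; a minimal sketch (the sample row is hypothetical, mirroring the question's format, and json.loads would work equally well here since the value is valid JSON):
import ast
import numpy as np
import pandas as pd

trip_data = pd.DataFrame({
    "pickuplocation": ['{"type":"Point","coordinates":[25.2484759,55.3525189]}']
})  # hypothetical sample row
coords = trip_data["pickuplocation"].apply(lambda x: ast.literal_eval(x)["coordinates"])
coord_array = np.array(coords.tolist())  # shape (n, 2): one [x, y] pair per row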

I have a CSV generated from CICFlowMeter and I'm unable to generate a correlation matrix: it either generates an empty data frame or throws the error below.

This is the code I'm using. I also tried converting my columns' dtype from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, which is obviously not possible because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, open the file in Excel or a text editor and do the cleaning there. If it's a lot, and all the irrelevant data sits in specific rows and/or columns, you could instead remove them from your DataFrame before calling df.corr, rather than cleaning the file itself.
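As a sketch of the in-DataFrame route, assuming the offending strings live in non-numeric columns (such as a flow-ID column), you can keep only the numeric columns before correlating:
import pandas as pd

df = pd.read_csv('DDOSping.csv')
numeric_df = df.select_dtypes(include='number')  # drops object/string columns such as flow IDs
pearsoncorr = numeric_df.corr(method='pearson')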

Showing integer columns as categorical and throwing an error in sweetviz compare

If I analyze these two datasets individually, I don't get any error, and I also get the viz of all the integer columns.
But when I try to compare these dataframes, I get the below error.
Cannot convert series 'Web Visit' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
I also tried using FeatureConfig to skip it, but to no avail.
pid_compare = sweetviz.compare([pdf,"234_7551009"],[pdf_2,"215_220941058"])
Maintainer of the lib here; this question was also asked on GitHub, but it will be useful to detail the answer here.
After looking at your data provided in the link above, it looks like the first dataframe (pdf) only contains 0 & 1, so it is classified as boolean and cannot be compared against the second one, which is categorical (that one has 0, 1, 2, 3, as you probably know!).
The system will be able to handle it if you use FeatureConfig to force the first dataframe to be considered CATEGORICAL.
I just tried the following and it seems to work, let me know if it helps!
feature_config = sweetviz.FeatureConfig(force_cat = ["Web Desktop Interaction"])
report = sweetviz.compare(pdf, pdf_2, None, feature_config)

How to extract a data frame from a messy String?

I have a dataset that was revised before its release. However, its accompanying code was not revised and now fails with this error.
The code expects a data frame describing the features of 255 homes, but the item is just a messy string with no consistent delimiter to split on!
I showed the error, the types of the items in the new dataset and the content of the string in this [picture][1].
I'm sure there's a better way, but I use this trick to get dataframes to work with from poorly formatted SO questions.
Print the string (to let print take care of things like return characters, '\n'), then select-all and copy it. Then use:
df = pd.read_clipboard(r"\s\s+")
Sometimes I have to manually adjust the spacing a little bit between a few column names for it to work correctly, but it is unreasonably effective.
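If the messy text is already in a Python variable rather than on the clipboard, the same regex-separator trick works through read_csv; a minimal sketch, where messy_string is a hypothetical stand-in for the question's string:
import io
import pandas as pd

messy_string = "col_a  col_b   col_c\n1  2   3\n4  5   6"  # hypothetical stand-in
df = pd.read_csv(io.StringIO(messy_string), sep=r"\s\s+", engine="python")  # a regex separator needs the python engine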

Quickest Pandas Value Updating Method?

I'm going through over 1 million patent applications and have to fix the dates, in addition to other things that I will work on later. I'm reading the file into a Pandas data frame, then running the following function:
def date_change():
    new_dates = {'m/y': []}
    for i, row in apps.iterrows():
        try:
            d = row['date'].rsplit('/')
            new_dates['m/y'].append('{}/19{}'.format(d[0], d[2]))
        except Exception as e:
            print('{} {}\n{}\n{}'.format(i, e, row, d))
            new_dates['m/y'].append(np.nan)
    # join and drop return new frames, so the results have to be assigned;
    # drop also needs columns= (a bare drop('date') targets row labels)
    return apps.join(pd.DataFrame(new_dates)).drop(columns='date')
Is there a quicker way of executing this? Is Pandas even the correct library to be using with a dataset this large? I've been told PySpark is good for big data, but how much will it improve the speed?
So it seems like you are using strings to represent dates instead of datetime objects.
I'd suggest doing something like
df['date'] = pd.to_datetime(df['date'])
so you don't need to iterate at all, as that function operates on the whole column.
And then, you might want to check the following answer which uses dt.strftime to format your column appropriately.
If you could show input and expected output, I could add the full solution here.
Besides, 1 million rows should typically be manageable for pandas (depending on the number of columns of course)
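As a hedged sketch of what a fully vectorized version of the question's loop could look like (assuming every date has the form month/day/two-digit-year, which is what the rsplit('/') in the question implies; the sample rows are hypothetical):
import pandas as pd

apps = pd.DataFrame({'date': ['3/15/71', '12/1/69']})  # hypothetical sample rows
parts = apps['date'].str.rsplit('/', n=2, expand=True)  # columns 0, 1, 2 = month, day, two-digit year
apps['m/y'] = parts[0] + '/19' + parts[2]  # rows that don't split into three parts come out as NaN
apps = apps.drop(columns='date')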
