Python: numpy loadtxt with integer and datetime

I don't understand how numpy's loadtxt method works. I have read some questions and answers on this site, but it's still not clear to me.
I have a file 'data.txt' which looks like this:
WEIGHT,DAY
75.1,16/10/2018
75.2,17/10/2018
...
My code is:
def parsetime(v):
    print(type(v))
    print(v)
    return np.datetime64(
        datetime.strptime(v, '%d/%m/%Y')
    )

data = np.loadtxt('masse.txt', delimiter=',', usecols=(0, 1), converters={1: parsetime}, skiprows=1)
But it doesn't work correctly, because it passes parsetime a bytes object rather than a string:
<class 'bytes'>
b'16/10/2018'
I just want an np.array with a number in the first column and a date in the second.
I'm a bit lost.
Thanks a lot in advance.
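A minimal sketch of one way around this, using only NumPy: load the numeric column with loadtxt as usual, and load the date column as plain strings and parse it afterwards, which sidesteps the bytes-vs-str converter issue entirely. A single homogeneous np.array cannot mix floats and dates, so two parallel arrays are used; the inline sample stands in for the real file:

```python
import io
from datetime import datetime
import numpy as np

# Inline sample standing in for the real 'masse.txt' file
raw = "WEIGHT,DAY\n75.1,16/10/2018\n75.2,17/10/2018\n"

# The numeric column loads as float directly
weights = np.loadtxt(io.StringIO(raw), delimiter=',', usecols=(0,), skiprows=1)

# Load the date column as plain strings, then parse it separately;
# this way no converter ever sees bytes
day_strs = np.loadtxt(io.StringIO(raw), delimiter=',', usecols=(1,),
                      skiprows=1, dtype=str)
days = np.array([np.datetime64(datetime.strptime(d, '%d/%m/%Y'))
                 for d in day_strs])
```

The two arrays stay index-aligned, so weights[i] and days[i] still describe the same row.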

Related

Error in loading json with os and pandas package [duplicate]

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source) as shown below, to give you a row-based format that is convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON.
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas asks you to pass an index; you have to convert the string to a dict first, which is exactly what the other responses are doing.
The best way is to do json.loads on the string to convert it to a dict, then load that into pandas:
with open('people_wiki_map_index_to_word.json', 'r') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])  # wrap in a list so the scalar values become row 0
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
Since pd.read_json expects a list
{
"biennials": [522004],
"lb915": [116290]
}
for a particular key, it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying the 'typ' argument in pd.read_json (it only accepts 'frame' or 'series'):
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the json files had only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this error:
ValueError: If using all scalar values, you must pass an index
Pandas looks for a list or dictionary in the values, something like:
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try doing this. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
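Putting the typ='series' approach together end to end, a minimal sketch using a temporary file in place of people_wiki_map_index_to_word.json:

```python
import json
import tempfile
import pandas as pd

# A small scalar-valued dict standing in for the real file's contents
data = {"biennials": 522004, "lb915": 116290}

with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    json.dump(data, f)
    path = f.name

# typ='series' reads the scalar key/value pairs without needing an index
ser = pd.read_json(path, typ='series')
df = ser.to_frame('count')  # words become the index, one 'count' column
```

The resulting dataframe has one row per word and a single 'count' column, which matches what the course material expects.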

Datetime storing in hd5 database

I have a list of np.datetime64 data that looks as follows:
times =[2015-03-26T16:02:42.000000Z,
2015-03-26T16:02:45.000000Z,...]
type(times) returns list
type(times[1]) returns obspy.core.utcdatetime.UTCDateTime
Now, I understand that h5py does not support date time data.
I have tried the following:
time_str = [str(s) for s in times]
time_str = [n.encode("ascii", "ignore") for n in time_str]
type(time_str[1]) returns bytes
I am okay with creating the dataset and storing these date time values as a string
However, when attempting to create the dataset, I get the following error:
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=time_str, maxshape=(None), chunks=True, dtype='str')
TypeError: No conversion path for dtype: dtype('<U')
Where am I messing up/ is there an alternative way to store these values as is so I can extract them later?
OK, here we go. I couldn't get some of your code to work together (maybe you left some steps out, or changed variable names?). Also, I could not get the obspy.core.utcdatetime.UTCDateTime objects you have.
So I created an example that does the following:

1. Starts with a list of np.datetime64() objects
2. Converts to a list of np.datetime_as_string() objects in UTC format (see note below)
3. Converts to a np.array with dtype='S30'

Note: I included Step 2 to replicate your data. See the following section for a simpler version.
Code below:
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]
utc_times = [np.datetime_as_string(n, timezone='UTC') for n in times]
utc_str_arr = np.array(utc_times, dtype='S30')
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=utc_str_arr, maxshape=(None,), chunks=True)
You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.
Code below:
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]

# Create empty array with defined size and 'S#' dtype, then populate with a for loop:
utc_str_arr1 = np.empty((len(times),), dtype='S30')
for i, n in enumerate(times):
    utc_str_arr1[i] = np.datetime_as_string(n, timezone='UTC')

# -OR- Create and populate the array with a list comprehension:
utc_str_arr2 = np.array([np.datetime_as_string(n, timezone='UTC').encode('utf-8') for n in times])

with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time1", data=utc_str_arr1, maxshape=(None,), chunks=True)
    f.create_dataset("time2", data=utc_str_arr2, maxshape=(None,), chunks=True)
The final result looks the same with either method (the second method creates 2 identical datasets).
Image from HDFView:
To Read the Data:
Per the request in the Aug-02-2021 comment, here is the code to extract the data from HDF5 and create Pandas timestamp objects (then save them to a dataframe). First the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.
import h5py
import numpy as np
import pandas as pd
with h5py.File('data_ML.hdf5', 'r') as h5f:
    ## returns a h5py dataset object:
    dts_ds = h5f["time"]
    longest_word = len(max(dts_ds, key=len))
    ## returns an array of byte strings representing np.datetime64;
    ## .astype() is used to convert the byte strings to unicode:
    dts_arr = dts_ds[:].astype('U' + str(longest_word))

## create a new array to hold Pandas datetime objects,
## then loop over the first array to convert and populate the new array:
pd_dts_arr = np.empty((dts_arr.shape[0],), dtype=object)
for i, dts in enumerate(dts_arr):
    pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
dts_df = pd.DataFrame(pd_dts_arr)
There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer:
Converting between datetime, Timestamp and datetime64

How do I convert from int64 back to timestamp or datetime?

I am working on a project to look at how much a pitcher's different pitches break each game. I looked here for an earlier error, which fixed that error, but now I get some weird numbers: when I print what I expect to be August 3rd, 2020, I get 1.5964128e+18. Here's how I got there.
hughes2020 = pd.read_csv(r"C:/Users/Stratus/Downloads/Hughes2020Test.csv", parse_dates=['game_date'])
game = hughes2020['game_date'].astype(np.int64)
# Skipping to the relevant part:
elif name[i] == "Curveball":
    if c < curve:
        xcurve[c] = totalx[i]
        ycurve[c] = totaly[i]
        cudate[c] = game[i]
        c += 1
When I print cudate it gives me the large number, and I am wondering how I can change it back.
And if I run it as
game = hughes2020['game_date']  # .astype(np.int64)
# Skipping to the relevant part:
elif name[i] == "Curveball":
    if c < curve:
        xcurve[c] = totalx[i]
        ycurve[c] = totaly[i]
        cudate[c] = game[i]
        c += 1
It gives me a
TypeError: float() argument must be a string or a number, not 'Timestamp'
To convert int to datetime use pd.to_datetime():
df = pd.DataFrame(data=[1.5964128e+18], columns = ['t'])
df['t2'] = pd.to_datetime(df['t'])
t t2
0 1.596413e+18 2020-08-03
However, a better solution is to convert the dates at the time of CSV reading (as @sinanspd correctly pointed out). Use parse_dates and other related options in pd.read_csv(). The function manual is here
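For reference, the round trip in the answer above can be sketched directly: pd.to_datetime() interprets a plain integer as nanoseconds since the epoch by default, which is why 1.5964128e+18 maps back to August 3rd, 2020:

```python
import pandas as pd

# Timestamp -> int64 nanoseconds -> Timestamp
ts = pd.Timestamp('2020-08-03')
as_int = ts.value              # nanoseconds since 1970-01-01
back = pd.to_datetime(as_int)  # default unit is nanoseconds

print(as_int)  # 1596412800000000000
print(back)    # 2020-08-03 00:00:00
```

So an int64 column produced by astype(np.int64) can be turned back into timestamps with pd.to_datetime() without any extra unit argument.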

How to deal with "could not convert string to float" error

This question has been asked several times here and I checked most of them, but couldn't figure out how to deal with it.
I read a CSV file and I try to convert its values to float as following:
testdataframe = pd.read_csv(r'H:\myCSVfile.csv')
testdataset = testdataframe.values
testdataset = testdataset.astype('float32')
I get this error: ValueError: could not convert string to float: '2020-08-05 22:45:00'
Here is testdataset:
array([['2020-08-05 22:45:00', 5.670524],
['2020-08-05 23:00:00', 5.6840434],
['2020-08-05 23:15:00', 5.6911097],
['2020-08-05 23:30:00', 5.6869917],
['2020-08-05 23:45:00', 5.6786237],
['2020-08-06 00:00:00', 5.6710806]], dtype=object)
Thanks in advance for your help.
As @John Gordon correctly mentioned, that is a date/time string.
You should apply astype(float) only to numeric columns. However, if you still want to apply it to everything, here is the logic to ignore the errors:
df=pd.DataFrame({"A":[1.2,'1.2','a'],"B":['2020-10-2 10:00:00','2020-10-2 11:00:00','2020-10-2 12:00:00']})
df.astype(float, errors='ignore')
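Building on that, a sketch of the usual per-column route: convert the numeric column with pd.to_numeric() and the date strings with pd.to_datetime(), instead of casting the whole frame to float:

```python
import pandas as pd

# Same shape of data as in the answer above: one mixed numeric column,
# one column of date/time strings
df = pd.DataFrame({"A": [1.2, '1.2', 'a'],
                   "B": ['2020-10-2 10:00:00', '2020-10-2 11:00:00', '2020-10-2 12:00:00']})

df['A'] = pd.to_numeric(df['A'], errors='coerce')  # unparseable 'a' becomes NaN
df['B'] = pd.to_datetime(df['B'])                  # date strings become timestamps
```

This keeps each column in its natural dtype, so downstream float operations only ever see the numeric column.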

ValueError: could not convert string to float:?

I have a dataset with object type, which was imported as a txt file into Jupyter Notebook. But now I am trying to plot some auto-correlation for an individual column and it is not working.
My first attempt was to convert the object columns to float but I get the error message:
could not convert string to float: ?
How do I fix this?
Okay this is my script:
book = pd.read_csv('Book1.csv', parse_dates=True)
t= str(book.Global_active_power)
t
'0 4.216\n1 5.36\n2 5.374\n3 5.388\n4 3.666\n5 3.52\n6 3.702\n7 3.7\n8 3.668\n9 3.662\n10 4.448\n11 5.412\n12 5.224\n13 5.268\n14 4.054\n15 3.384\n16 3.27\n17 3.43\n18 3.266\n19 3.728\n20 5.894\n21 7.706\n22 7.026\n23 5.174\n24 4.474\n25 3.248\n26 3.236\n27 3.228\n28 3.258\n29 3.178\n ... \n1048545 0.324\n1048546 0.324\n1048547 0.324\n1048548 0.322\n1048549 0.322\n1048550 0.322\n1048551 0.324\n1048552 0.324\n1048553 0.326\n1048554 0.326\n1048555 0.324\n1048556 0.324\n1048557 0.322\n1048558 0.322\n1048559 0.324\n1048560 0.322\n1048561 0.322\n1048562 0.324\n1048563 0.388\n1048564 0.424\n1048565 0.42\n1048566 0.418\n1048567 0.418\n1048568 0.42\n1048569 0.422\n1048570 0.426\n1048571 0.424\n1048572 0.422\n1048573 0.422\n1048574 0.422\nName: Global_active_power, Length: 1048575, dtype: object'
I believe the reason is that I have to format my column to an equal number of decimal places first, and then I can convert to float, but trying to format it like this is not working for me:
print("{:0<4s}".format(book.Global_active_power))
The column contains a ? entry. Clean this up (along with any other extraneous entries) and you should not see this error.
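A sketch of that cleanup, assuming the ? placeholders are the only offenders (the inline CSV stands in for Book1.csv): tell read_csv to treat ? as missing, after which the float conversion goes through:

```python
import io
import pandas as pd

# Inline sample standing in for Book1.csv, with a '?' placeholder
csv = io.StringIO("Global_active_power\n4.216\n?\n5.36\n")

book = pd.read_csv(csv, na_values='?')  # '?' is read in as NaN
power = book['Global_active_power'].astype(float)
```

Alternatively, pd.to_numeric(book['Global_active_power'], errors='coerce') after the fact turns any remaining stray strings into NaN instead of raising.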
