I have a large CSV file containing bus network information.
The stop codes consist of a long number followed by a letter (or two). However, some of them are digits only. When I read the file into pandas, those large numbers appear in scientific notation, like this:
code_o lat_o lon_o code_d
490016444HN 51.56878 0.1811568 490013271R
490013271R 51.57493 0.1781319 490009721A
490009721A 51.57708 0.1769355 490010407C
490010407C 51.57947 0.1775409 490011659G
490011659G 51.5806 0.1831088 490009810M
490009810M 51.57947 0.1848733 490014448S
490014448S 51.57751 0.185111 490001243Y
490001243Y 51.57379 0.1839945 490013654S
490013654S 51.57143 0.184776 490013482E
490013482E 51.57107 0.187039 490015118E
490015118E 51.5724 0.1923417 490011214E
490011214E 51.57362 0.1959939 490006980E
490006980E 51.57433 0.1999537 4.90E+09
4.90E+09 51.57071 0.2087701 490003049E
490003049E 51.5631 0.2146196 490004001A
490004001A 51.56314 0.2165552 490015350F
The columns have dtype object, but I need the codes to be normal numbers so that I can join against other tables.
Since the column is not an int or float column, I cannot convert the whole column at once.
Any suggestions?
I attached the file from dropbox
https://www.dropbox.com/s/jhbxsncd97rq1z4/gtfs_OD_links_L.csv?dl=0
IIUC, try forcing object type for the code_d column on import:
import numpy as np
import pandas as pd
df = pd.read_csv('your_original_file.csv', dtype={'code_d': 'object'})
You can then parse that column, discarding the letter at the end and casting the result to integer type:
# np.int was removed in NumPy 1.24; the built-in int works as the dtype
df['code_d'] = df['code_d'].str[:-1].astype(int)
Keep it simple: df=pd.read_csv('myfile.csv',dtype=str) reads every column in as strings. Or, as @Alberto posted earlier, apply it to just that column: df=pd.read_csv('myfile.csv',dtype={'code_o':str})
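To illustrate the dtype=str approach, here is a minimal sketch on a small inline sample. The rows (including the digits-only code) are made up for illustration, and the num_o column name is hypothetical. It assumes that all you need from each code is its leading run of digits.

```python
from io import StringIO

import pandas as pd

# Made-up sample in the same layout as the question; the last code_o has no letter.
csv_text = """code_o,lat_o,lon_o,code_d
490016444HN,51.56878,0.1811568,490013271R
490006980E,51.57433,0.1999537,490003049
490003049,51.57071,0.2087701,490003049E
"""

# Read both code columns as text so pandas never parses them as numbers.
df = pd.read_csv(StringIO(csv_text), dtype={"code_o": str, "code_d": str})

# Hypothetical helper column: keep only the leading digits, then cast to integer.
df["num_o"] = df["code_o"].str.extract(r"^(\d+)", expand=False).astype("int64")

print(df[["code_o", "num_o"]])
```

Note that a value already stored in the file as 4.90E+09 has lost its digits, so no dtype setting can bring the original code back.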
I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
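As a small self-contained sketch of the typ='series' route, the snippet below reads an inline JSON string (a shortened stand-in for the real file) through StringIO rather than from disk.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the file: a flat JSON object whose values are all scalars.
json_text = '{"biennials": 522004, "lb915": 116290, "shatzky": 127647}'

# typ='series' makes the keys the index and the scalars the values.
ser = pd.read_json(StringIO(json_text), typ="series")

# One column named 'count', one row per word.
frame = ser.to_frame("count")
print(frame)
```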
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which gives you a column-based format.
Or you can enclose the object in [ ] (source), as shown below, to get a row format, which is convenient if you are loading multiple values and planning to use a matrix for your machine learning models.
df = pd.DataFrame([data])
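To make the difference between the two layouts concrete, here is a short sketch using a three-key excerpt of the dictionary. The variable names by_column and by_row are only illustrative.

```python
import pandas as pd

# Three keys taken from the sample in the question.
data = {"biennials": 522004, "lb915": 116290, "shatzky": 127647}

# Column layout: the keys become the index, with a single 'count' column.
by_column = pd.DataFrame({"count": data})

# Row layout: the keys become the columns, with a single row.
by_row = pd.DataFrame([data])

print(by_column.shape)
print(by_row.shape)
```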
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas asks for an index. To load it, you have to convert it to a dict, which is exactly what the other answer is doing.
The best way is to call json.loads on the string to convert it to a dict, and then load that into pandas:
import json
import pandas as pd
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame({'count': jsonData})
{
"biennials": 522004,
"lb915": 116290
}
df = pd.read_json('values.json')
With the default typ='frame', pd.read_json expects a list of values for each key, as in
{
"biennials": [522004],
"lb915": [116290]
}
When every value is a scalar instead, it raises
ValueError: If using all scalar values, you must pass an index
You can resolve this by passing the typ argument to pd.read_json. The documented values are 'frame' and 'series', and 'series' suits a flat key-value file:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas (0.19.0 and later), set the lines parameter to True.
The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors that I encountered, especially when some of the JSON files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
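Here is a minimal sketch of what lines=True does, using an inline string as a stand-in for a file with one JSON object per line. The two records are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Stand-in for a JSON Lines file: one complete JSON object per line (made-up records).
json_lines = '{"name": "Snow", "age": 31}\n{"name": "Stark", "age": 35}\n'

# Each line becomes one row; the keys become the columns.
df = pd.read_json(StringIO(json_lines), lines=True)
print(df)
```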
For example
cat values.json
{
"name": "Snow",
"age": "31"
}
df = pd.read_json('values.json')
Chances are you will end up with this error:
ValueError: If using all scalar values, you must pass an index
Pandas looks for a list or dictionary in each value, something like
cat values.json
{
"name": ["Snow"],
"age": ["31"]
}
So try the following, which also converts the result to HTML with to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
I have a list of np.datetime64 data that looks as follows:
times =[2015-03-26T16:02:42.000000Z,
2015-03-26T16:02:45.000000Z,...]
type(times) returns list
type(times[1]) returns obspy.core.utcdatetime.UTCDateTime
Now, I understand that h5py does not support date time data.
I have tried the following:
time_str = [n.encode("ascii", "ignore") for n in time_str]
time_str = [str(s) for s in time_str]
type(time_str[1]) returns bytes
I am okay with creating the dataset and storing these date-time values as strings.
However, when attempting to create the dataset, I get the following error:
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time", data=time_str,maxshape=(None),chunks=True, dtype='str')
TypeError: No conversion path for dtype: dtype('<U')
Where am I going wrong, or is there an alternative way to store these values as they are so I can extract them later?
OK, here we go. I couldn't get some of your code to work together (maybe you left some steps out or changed variable names?), and I could not recreate the obspy.core.utcdatetime.UTCDateTime objects you have.
So I created an example that does the following:
1. Starts with a list of np.datetime64() objects.
2. Converts them to a list of strings in UTC format with np.datetime_as_string() (see the note below).
3. Converts that list to a np.array with dtype='S30'.
Note: I included Step 2 to replicate your data. See the following section for a simpler version.
Code below:
import h5py
import numpy as np
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]
# Step 2: UTC-formatted strings; Step 3: fixed-width byte strings that h5py can store
utc_times = [np.datetime_as_string(n, timezone='UTC') for n in times]
utc_str_arr = np.array(utc_times, dtype='S30')
with h5py.File('data_ML.hdf5', 'w') as f:
    # maxshape=(None,) is a one-element tuple, which makes the dataset resizable
    f.create_dataset("time", data=utc_str_arr, maxshape=(None,), chunks=True)
You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.
Code below:
times = [np.datetime64('2015-03-26T16:02:42.000000'),
         np.datetime64('2015-03-26T16:02:45.000000'),
         np.datetime64('2015-03-26T16:02:48.000000'),
         np.datetime64('2015-03-26T16:02:55.000000')]
# Create empty array with defined size and 'S#' dtype, then populate with for loop:
utc_str_arr1 = np.empty((len(times),), dtype='S30')
for i, n in enumerate(times):
    utc_str_arr1[i] = np.datetime_as_string(n, timezone='UTC')
# -OR- Create array and populate using list comprehension:
utc_str_arr2 = np.array([np.datetime_as_string(n, timezone='UTC').encode('utf-8') for n in times])
with h5py.File('data_ML.hdf5', 'w') as f:
    f.create_dataset("time1", data=utc_str_arr1, maxshape=(None,), chunks=True)
    f.create_dataset("time2", data=utc_str_arr2, maxshape=(None,), chunks=True)
The final result looks similar with either method (the second example creates 2 identical datasets).
Image from HDFView:
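The conversion itself can be checked without writing an HDF5 file. The sketch below uses only NumPy: it goes from datetime64 values to the fixed-width byte strings used above and back again. The variable names are illustrative, and stripping the trailing 'Z' before parsing is my own choice so that NumPy reads the text as a plain timestamp.

```python
import numpy as np

# Two of the timestamps from the example above, at microsecond precision.
times = np.array(["2015-03-26T16:02:42", "2015-03-26T16:02:45"], dtype="datetime64[us]")

# Unicode text ending in 'Z'; this is the '<U' kind of dtype that h5py rejected in the question.
as_text = np.datetime_as_string(times, timezone="UTC")

# Fixed-width byte strings ('S' dtype), the form stored in the datasets above.
as_bytes = as_text.astype("S")

# Round trip: decode each byte string, drop the trailing 'Z', parse as datetime64.
restored = np.array(
    [s.decode("ascii").rstrip("Z") for s in as_bytes], dtype="datetime64[us]"
)

print(as_bytes.dtype, restored)
```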
To Read the Data:
Per request in Aug-02-2021 comment, here is the code to extract data from HDF5 and create Pandas timestamp objects (then saved to a dataframe). First the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.
import h5py
import numpy as np
import pandas as pd
with h5py.File('data_ML.hdf5', 'r') as h5f:
    ## returns a h5py dataset object:
    dts_ds = h5f["time"]
    longest_word = len(max(dts_ds, key=len))
    ## returns an array of byte strings representing np.datetime64:
    ## .astype() used to convert byte strings to unicode
    dts_arr = dts_ds[:].astype('U'+str(longest_word))
## create a new array to hold Pandas datetime objects
## then loop over first array to convert and populate new array
pd_dts_arr = np.empty((dts_arr.shape[0],), dtype=object)
for i, dts in enumerate(dts_arr):
    pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
dts_df = pd.DataFrame(pd_dts_arr)
There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer:
Converting between datetime, Timestamp and datetime64
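As a possible shortcut for the read step, pd.to_datetime also accepts a whole array, so the explicit loop can be skipped. This sketch uses a hand-written byte-string array in place of the dataset contents, so it does not need the HDF5 file. The 'time' column name is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for what dts_ds[:] returns: byte strings as written by the code above.
raw = np.array([b"2015-03-26T16:02:42.000000Z", b"2015-03-26T16:02:45.000000Z"])

# Byte strings to Unicode strings, then one vectorized parse of the whole array.
text = raw.astype("U")
stamps = pd.to_datetime(text, format="%Y-%m-%dT%H:%M:%S.%fZ")

# Hypothetical column name for the resulting dataframe.
dts_df = pd.DataFrame({"time": stamps})
print(dts_df)
```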
I have some very noisy (astronomy) data in CSV format. Its shape is (815900, 2), with 815k points giving the mass of a disk at a certain time. The fluctuations are pretty noticeable when you look at it close up. For example, here is a snippet of the data, where the first column is time in seconds and the second is mass in kg:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
So it looks like there is a 1.53E+028 data point of noise, and also probably the 2.19E+028 and 2.35E+028 points.
To fix this, I am trying to set a Python script that will read in the csv data, then put some restriction on it so that if the mass is e.g. < 2.35E+028, it will remove the whole row and then create a new csv file with only the "good" data points:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41242600,2.40936E+028
Following the top answer by n8henrie to this old question, I so far have:
import pandas as pd
import csv
# Here are the locations of my csv file of my original data and an EMPTY csv file that will contain my good, noiseless set of data
originaldata = '/Users/myname/anaconda2/originaldata.csv'
gooddata = '/Users/myname/anaconda2/gooddata.csv'
# I use pandas to read in the original data because then I can separate the columns of time as 'T' and mass as 'M'
originaldata = pd.read_csv('originaldata.csv',delimiter=',',header=None,names=['t','m'])
# Numerical values of the mass values
M = originaldata['m'].values
# Now to put a restriction in
for row in M:
    new_row = []
    for column in row:
        if column > 2.35E+028:
            new_row.append(column)
    csv.writer(open(newfile,'a')).writerow(new_row)
print('\n\n')
print('After:')
print(open(newfile).read())
However, when I run this, I get this error:
TypeError: 'numpy.float64' object is not iterable
I know the first column (time) is dtype int64 and the second column (mass) is dtype float64... but as a beginner, I'm still not quite sure what this error means or where I'm going wrong. Any help at all would be appreciated. Thank you very much in advance.
You can select rows by a boolean operation. Example:
import pandas as pd
from io import StringIO
data = StringIO('''\
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
''')
df = pd.read_csv(data,names=['t','m'])
good = df[df.m > 2.35e+28]
out = StringIO()
good.to_csv(out,index=False,header=False)
print(out.getvalue())
Output:
40023700,2.40896e+28
40145700,2.44487e+28
40267700,2.44487e+28
40389700,2.44478e+28
40755400,2.44496e+28
40877200,2.44489e+28
40999000,2.44489e+28
41242600,2.40936e+28
M = originaldata['m'].values returns the column as a one-dimensional NumPy array.
So when you do for row in M:, each row is a single numpy.float64 value, and you can't iterate over it again with for column in row:.
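The point about iterating can be seen in a few lines. This is a minimal sketch with three of the mass values from the question. It also shows that the same threshold works directly on the array as a boolean mask.

```python
import numpy as np

# Three mass values from the question, as the 1-D array that .values returns.
M = np.array([2.40896e28, 1.535e28, 2.44487e28])

# Iterating over a 1-D array yields scalars, not rows.
first = next(iter(M))

# A scalar cannot be looped over again, which is what the TypeError reports.
try:
    iter(first)
    scalar_is_iterable = True
except TypeError:
    scalar_is_iterable = False

# The comparison applies to the whole array at once and gives a boolean mask.
good = M[M > 2.35e28]

print(type(first).__name__, scalar_is_iterable, good)
```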
I have a dataset with object type, which was imported as a txt file into Jupyter Notebook. But now I am trying to plot some auto-correlation for an individual column and it is not working.
My first attempt was to convert the object columns to float but I get the error message:
could not convert string to float: ?
How do I fix this?
Okay this is my script:
book = pd.read_csv('Book1.csv', parse_dates=True)
t= str(book.Global_active_power)
t
'0 4.216\n1 5.36\n2 5.374\n3 5.388\n4 3.666\n5 3.52\n6 3.702\n7 3.7\n8 3.668\n9 3.662\n10 4.448\n11 5.412\n12 5.224\n13 5.268\n14 4.054\n15 3.384\n16 3.27\n17 3.43\n18 3.266\n19 3.728\n20 5.894\n21 7.706\n22 7.026\n23 5.174\n24 4.474\n25 3.248\n26 3.236\n27 3.228\n28 3.258\n29 3.178\n ... \n1048545 0.324\n1048546 0.324\n1048547 0.324\n1048548 0.322\n1048549 0.322\n1048550 0.322\n1048551 0.324\n1048552 0.324\n1048553 0.326\n1048554 0.326\n1048555 0.324\n1048556 0.324\n1048557 0.322\n1048558 0.322\n1048559 0.324\n1048560 0.322\n1048561 0.322\n1048562 0.324\n1048563 0.388\n1048564 0.424\n1048565 0.42\n1048566 0.418\n1048567 0.418\n1048568 0.42\n1048569 0.422\n1048570 0.426\n1048571 0.424\n1048572 0.422\n1048573 0.422\n1048574 0.422\nName: Global_active_power, Length: 1048575, dtype: object'
I believe the reason is that I have to format my column to an equal number of decimal places first and can then convert it to float, but trying to format it like this is not working for me:
print("{:0<4s}".format(book.Global_active_power))
The column contains a ? entry. Clean this up (along with any other extraneous entries) and you should not see this error.
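One way to do that clean-up is sketched below on a tiny inline sample (three made-up rows with the same column name). It assumes the ? marks a missing reading, so it is turned into NaN rather than a number. Both pd.to_numeric(errors='coerce') and the na_values argument of read_csv are standard pandas options.

```python
from io import StringIO

import pandas as pd

# Made-up three-row sample with a '?' placeholder, like the one in the error message.
csv_text = "Global_active_power\n4.216\n?\n5.36\n"

# Option 1: read as-is, then coerce anything non-numeric to NaN and drop it.
book = pd.read_csv(StringIO(csv_text))
power = pd.to_numeric(book["Global_active_power"], errors="coerce")
clean = power.dropna()

# Option 2: tell read_csv up front that '?' means missing, so the column parses as float.
book2 = pd.read_csv(StringIO(csv_text), na_values="?")

print(clean.tolist(), book2["Global_active_power"].dtype)
```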
When running the following code:
for row,hit in hits.iterrows():
    forwardRows = data[data.index.values > row]
I get this error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
If I look into what is being compared here I have these variables:
type(row)
pandas.tslib.Timestamp
row
Timestamp('2015-09-01 09:30:00')
is being compared with:
type(data.index.values[0])
numpy.datetime64
data.index.values[0]
numpy.datetime64('2015-09-01T10:30:00.000000000+0100')
I would like to understand whether this is something that can be easily fixed, or should I upload a subset of my data? Thanks.
Although this isn't a direct answer to your question, I have a feeling that this is what you're looking for: pandas.DataFrame.truncate
You could use it as follows:
for row, hit in hits.iterrows():
    forwardRows = data.truncate(before=row)
Here's a little toy example of how you might use it in general:
import numpy as np
import pandas as pd
# let's create some data to play with
# (freq='M' means month end; pandas 2.2 and later prefer the spelling 'ME')
df = pd.DataFrame(
    index=pd.date_range(start='2016-01-01', end='2016-06-01', freq='M'),
    columns=['x'],
    data=np.random.random(5)
)
# example: truncate rows before Mar 1
df.truncate(before='2016-03-01')
# example: truncate rows after Mar 1
df.truncate(after='2016-03-01')
Using .values drops you into NumPy, where the index entries are numpy.datetime64 objects. Compare against the index itself instead:
for row, hit in hits.iterrows():
    forwardRows = data[data.index > row]
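Here is a small sketch of both suggestions side by side, on a made-up three-row frame with a DatetimeIndex. One difference worth knowing: truncate(before=row) keeps the row at that timestamp, while the strict > comparison drops it.

```python
import pandas as pd

# Made-up data with a sorted DatetimeIndex, one hour apart.
index = pd.to_datetime(["2015-09-01 09:30", "2015-09-01 10:30", "2015-09-01 11:30"])
data = pd.DataFrame({"x": [1, 2, 3]}, index=index)

row = pd.Timestamp("2015-09-01 09:30")

# Comparing the index itself with a Timestamp works and is strict.
forward_strict = data[data.index > row]

# truncate keeps everything from `row` onward, including `row`.
forward_inclusive = data.truncate(before=row)

print(len(forward_strict), len(forward_inclusive))
```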