I need the index to start at 1 rather than 0 when writing a Pandas DataFrame to CSV.
Here's an example:
In [1]: import pandas as pd
In [2]: result = pd.DataFrame({'Count': [83, 19, 20]})
In [3]: result.to_csv('result.csv', index_label='Event_id')
Which produces the following output:
In [4]: !cat result.csv
Event_id,Count
0,83
1,19
2,20
But my desired output is this:
In [5]: !cat result2.csv
Event_id,Count
1,83
2,19
3,20
I realize that this could be done by adding a sequence of integers shifted by 1 as a column to my data frame, but I'm new to Pandas and I'm wondering if a cleaner way exists.
The index is an object, and the default index starts from 0:
>>> result.index
Int64Index([0, 1, 2], dtype=int64)
You can shift this index by 1 with
>>> result.index += 1
>>> result.index
Int64Index([1, 2, 3], dtype=int64)
Just set the index before writing to CSV.
df.index = np.arange(1, len(df) + 1)
And then write it normally.
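A minimal end-to-end sketch of this approach on the question's data (the numpy import is assumed):
import numpy as np
import pandas as pd

result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result) + 1)  # 1, 2, 3 instead of 0, 1, 2
result.to_csv('result.csv', index_label='Event_id')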
source: In Python pandas, start row index from 1 instead of zero without creating additional column
Working example:
import pandas as pdas
dframe = pdas.read_csv(input_file)  # input_file is the path to your CSV file
dframe.index = dframe.index + 1
Another way in one line (note that shift() moves the data down a row, and the sliced-off NaN row can turn integer columns into floats):
df.shift()[1:]
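Applied to the question's frame this would look roughly like the sketch below; be aware that shift() introduces an all-NaN row that the [1:] slice then drops, which can silently turn integer columns into floats:
import pandas as pd

result = pd.DataFrame({'Count': [83, 19, 20]})
shifted = result.shift()[1:]  # data moves down one row; the all-NaN row 0 is sliced off
shifted.to_csv('result.csv', index_label='Event_id')  # index starts at 1, but Count is now float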
In my opinion, best practice is to set the index with a RangeIndex:
import pandas as pd
result = pd.DataFrame(
{'Count': [83, 19, 20]},
index=pd.RangeIndex(start=1, stop=4, name='index')
)
>>> result
Count
index
1 83
2 19
3 20
I prefer this because you can define the range, an optional step, and a name for the index in one line.
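If the length of the frame is not known up front, the same idea works with len(); a small sketch (the Event_id name is taken from the question, the rest follows the answer above):
import pandas as pd

counts = [83, 19, 20]
result = pd.DataFrame(
    {'Count': counts},
    index=pd.RangeIndex(start=1, stop=len(counts) + 1, name='Event_id')
)
result.to_csv('result.csv')  # the index name becomes the first column header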
This worked for me:
df.index = np.arange(1, len(df)+1)
You can use this one:
import pandas as pd
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index += 1
print(result)
or this one, with the help of the numpy library:
import pandas as pd
import numpy as np
result = pd.DataFrame({'Count': [83, 19, 20]})
result.index = np.arange(1, len(result)+1)
print(result)
np.arange creates a numpy array of values within the given interval, here (1, len(result)+1), and that array is then assigned to result.index.
Use this:
df.index = np.arange(1, len(df)+1)
Add ".shift()[1:]" while creating a data frame
data = pd.read_csv(r"C:\Users\user\path\data.csv").shift()[1:]
Building on the original answer, adding my two cents:
If I'm not mistaken, starting from version 0.23 the default index object is of type RangeIndex.
From the official doc:
RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges. Using RangeIndex may in some instances improve computing speed.
For a huge index range this makes sense: only the compact range representation is stored, instead of materializing the whole index at once (saving memory).
Therefore, an example (using Series, but it applies to DataFrame also):
>>> import pandas as pd
>>> countries = ['China', 'India', 'USA']
>>> ds = pd.Series(countries)
>>> type(ds.index)
<class 'pandas.core.indexes.range.RangeIndex'>
>>> ds.index
RangeIndex(start=0, stop=3, step=1)
>>> ds.index += 1
>>> ds.index
RangeIndex(start=1, stop=4, step=1)
>>> ds
1    China
2    India
3      USA
dtype: object
As you can see, incrementing the index object changes its start and stop parameters.
This adds a column that accomplishes what you want:
df.insert(0, "Column Name", np.arange(1, len(df) + 1))
Following on from TomAugspurger's answer, we could use a list comprehension rather than np.arange(), which removes the need to import numpy. You can use the following instead:
df.index = [i+1 for i in range(len(df))]
I'm trying to construct simple DataFrames. Both have a date whereas the first has one additional column:
import pandas as pd
import datetime as dt
import numpy as np
a = pd.DataFrame(np.array([
[dt.datetime(2018, 1, 10), 5.0]]), columns=['date', 'amount'])
print(a)
# date_dt amount
# 2018-01-10 00:00:00 5
b = pd.DataFrame(np.array([
[dt.datetime(2018, 1, 10)]]), columns=['date'])
print(b)
# date_dt
# 2018-01-10
Why are the dates interpreted differently (with and without time)? It gives me problems when I later try to apply merges.
Ok, so here is what happens. I will use the following code:
import pandas as pd
import datetime as dt
import numpy as np
a_val = np.array([[dt.datetime(2018, 1, 10), 5.0]])
a = pd.DataFrame(a_val, columns=['date', 'amount'])
b_val = np.array([[dt.datetime(2018, 1, 10)]])
b = pd.DataFrame(b_val, columns=['date'])
I just split the array contents out from the DataFrame calls themselves. First, let's print the a_val and b_val variables:
print(a_val, b_val)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [[datetime.datetime(2018, 1, 10, 0, 0)]]
So far so good, the objects are datetime.datetime.
Now let's access the values of the dataframe with .values:
print(a.values, b.values)
# output: [[datetime.datetime(2018, 1, 10, 0, 0) 5.0]] [['2018-01-10T00:00:00.000000000']]
Things are messed up here. Let's print the type of the date:
print(type(a.values[0][0]), type(b.values[0][0]))
# output: <class 'datetime.datetime'> <class 'numpy.datetime64'>
Ok, that's the thing: in the second dataframe the underlying array holds nothing but dates, so when pandas builds the frame it converts the column to numpy.datetime64, which has a different formatting. In the first dataframe you have a datetime object plus a float, so pandas leaves the mixed objects as they are.
Short version: if you have a collection of different object types (dates, strings, ints, etc.), use a list, not a numpy array.
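A minimal sketch of that advice applied to the question's frames; passing plain lists to the constructor should keep both date columns on the same datetime64[ns] dtype:
import datetime as dt
import pandas as pd

a = pd.DataFrame([[dt.datetime(2018, 1, 10), 5.0]], columns=['date', 'amount'])
b = pd.DataFrame([[dt.datetime(2018, 1, 10)]], columns=['date'])
print(a.dtypes)  # date should be datetime64[ns], amount float64
print(b.dtypes)  # date should be datetime64[ns]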
Both columns in a end up as object dtype because of the intermediate numpy array (which itself has dtype object). I'd argue that not implicitly interpreting mixed objects is probably good behavior.
a = pd.DataFrame([[dt.datetime(2018, 1, 10), 5.0]], columns=['date', 'amount'])
This seems to be more along the lines of what you want.
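If the frames have already been built through object arrays as in the question, one hedged workaround for the merge problem is to normalize both date columns with pd.to_datetime before merging:
import datetime as dt
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[dt.datetime(2018, 1, 10), 5.0]]), columns=['date', 'amount'])
b = pd.DataFrame(np.array([[dt.datetime(2018, 1, 10)]]), columns=['date'])

# force both merge keys to datetime64[ns] so they actually compare equal
a['date'] = pd.to_datetime(a['date'])
b['date'] = pd.to_datetime(b['date'])
print(a.merge(b, on='date'))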
I have a Numpy array consisting of a list of lists, representing a two-dimensional array with row labels and column names as shown below:
data = array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])
I'd like the resulting DataFrame to have Row1 and Row2 as index values, and Col1, Col2 as header values
I can specify the index as follows:
df = pd.DataFrame(data,index=data[:,0]),
however I am unsure how to best assign column headers.
You need to specify data, index and columns to the DataFrame constructor, as in:
>>> pd.DataFrame(data=data[1:,1:], # values
... index=data[1:,0], # 1st column as index
... columns=data[0,1:]) # 1st row as the column names
edit: as noted in @joris's comment, you may need to change the above to np.int_(data[1:,1:]) to get the correct data type.
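For completeness, a runnable sketch of this answer with the dtype fix folded in (using astype(int) instead of np.int_, which should be equivalent here):
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

df = pd.DataFrame(data=data[1:, 1:],   # values
                  index=data[1:, 0],   # 1st column as index
                  columns=data[0, 1:]) # 1st row as the column names
df = df.astype(int)                    # the mixed source array stored the values as strings
print(df)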
Here is an easy-to-understand solution:
import numpy as np
import pandas as pd
# Creating a 2 dimensional numpy array
>>> data = np.array([[5.8, 2.8], [6.0, 2.2]])
>>> print(data)
[[5.8 2.8]
 [6.  2.2]]
>>> data
array([[5.8, 2.8],
       [6. , 2.2]])
# Creating pandas dataframe from numpy array
>>> dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]})
>>> print(dataset)
Column1 Column2
0 5.8 2.8
1 6.0 2.2
I agree with Joris; it seems like you should be doing this differently, like with numpy record arrays. Modifying "option 2" from this great answer, you could do it like this:
import pandas
import numpy
dtype = [('Col1','int32'), ('Col2','float32'), ('Col3','float32')]
values = numpy.zeros(20, dtype=dtype)
index = ['Row'+str(i) for i in range(1, len(values)+1)]
df = pandas.DataFrame(values, index=index)
This can be done simply by using the from_records method of the pandas DataFrame:
import numpy as np
import pandas as pd
# Creating a numpy array
x = np.arange(1,10,1).reshape(-1,1)
dataframe = pd.DataFrame.from_records(x)
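Applied to the labeled array from the original question, a rough sketch (the .tolist() and astype steps are my additions, since the mixed source array stores everything as strings):
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])
# build from the value block as records, then attach row labels and fix the dtype
df = pd.DataFrame.from_records(data[1:, 1:].tolist(), columns=data[0, 1:])
df.index = data[1:, 0]
df = df.astype(int)
print(df)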
>>> import pandas as pd
>>> import numpy as np
>>> # data is an existing numpy ndarray of shape (480, 193)
>>> data.shape
(480, 193)
>>> type(data)
numpy.ndarray
>>> df = pd.DataFrame(data=data[0:, 0:],
...                   index=[i for i in range(data.shape[0])],
...                   columns=['f' + str(i) for i in range(data.shape[1])])
>>> df.head()
Here is a simple example of creating a pandas dataframe from a numpy array.
import numpy as np
import pandas as pd
# create an array
var1 = np.arange(start=1, stop=21, step=1).reshape(-1)
var2 = np.random.rand(20,1).reshape(-1)
print(var1.shape)
print(var2.shape)
dataset = pd.DataFrame()
dataset['col1'] = var1
dataset['col2'] = var2
dataset.head()
Adding to @behzad.nouri's answer - we can create a helper routine to handle this common scenario:
import pandas as pd

def csvDf(dat, **kwargs):
    from numpy import array
    data = array(dat)
    if data is None or len(data) == 0 or len(data[0]) == 0:
        return None
    else:
        return pd.DataFrame(data[1:, 1:], index=data[1:, 0], columns=data[0, 1:], **kwargs)
Let's try it out:
data = [['','a','b','c'],['row1','row1cola','row1colb','row1colc'],
['row2','row2cola','row2colb','row2colc'],['row3','row3cola','row3colb','row3colc']]
csvDf(data)
In [61]: csvDf(data)
Out[61]:
a b c
row1 row1cola row1colb row1colc
row2 row2cola row2colb row2colc
row3 row3cola row3colb row3colc
I think this is a simple and intuitive method:
data = np.array([[0, 0], [0, 1] , [1, 0] , [1, 1]])
reward = np.array([1,0,1,0])
dataset = pd.DataFrame()
dataset['StateAttributes'] = data.tolist()
dataset['reward'] = reward.tolist()
dataset
returns:
  StateAttributes  reward
0          [0, 0]       1
1          [0, 1]       0
2          [1, 0]       1
3          [1, 1]       0
But there are performance implications detailed here:
How to set the value of a pandas column as list
It's not so short, but maybe it can help you.
Creating Array
import numpy as np
import pandas as pd
data = np.array([['col1', 'col2'], [4.8, 2.8], [7.0, 1.2]])
>>> data
array([['col1', 'col2'],
['4.8', '2.8'],
['7.0', '1.2']], dtype='<U4')
Creating data frame
df = pd.DataFrame(i for i in data).transpose()
df.drop(0, axis=1, inplace=True)
df.columns = data[0]
df
>>> df
col1 col2
0 4.8 7.0
1 2.8 1.2
In Python, I am trying to find the quickest way to hash each value in a pandas data frame.
I know any string can be hashed using:
hash('a string')
But how do I apply this function on each element of a pandas data frame?
This may be a very simple thing to do, but I have just started using python.
Pass the hash function to apply on the str column:
In [37]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df
Out[37]:
a
0 asds
1 asdds
2 asdsadsdas
In [39]:
df['hash'] = df['a'].apply(hash)
df
Out[39]:
a hash
0 asds 4065519673257264805
1 asdds -2144933431774646974
2 asdsadsdas -3091042543719078458
If you want to do this to every element then call applymap:
In [42]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas'],'b':['asewer','werwer','tyutyuty']})
df
Out[42]:
a b
0 asds asewer
1 asdds werwer
2 asdsadsdas tyutyuty
In [43]:
df.applymap(hash)
Out[43]:
a b
0 4065519673257264805 7631381377676870653
1 -2144933431774646974 -6124472830212927118
2 -3091042543719078458 -1784823178011532358
Pandas also has a function to apply a hash function on an array or column:
import pandas as pd
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df["hash"] = pd.util.hash_array(df["a"].to_numpy())
In addition to @EdChum's answer, a heads-up: hash() does not return the same value for a given string across runs or machines, because Python's string hashing is randomized per interpreter session. Depending on your use case, you are better off using
import hashlib
def md5hash(s: str):
return hashlib.md5(s.encode('utf-8')).hexdigest() # or SHA, ...
df['a'].apply(md5hash)
# or
df.applymap(md5hash)
I can't seem to get a simple dtype check working with Pandas' improved Categoricals in v0.15+. Basically I just want something like is_categorical(column) -> True/False.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({
'x': np.linspace(0, 50, 6),
'y': np.linspace(0, 20, 6),
'cat_column': random.sample('abcdef', 6)
})
df['cat_column'] = pd.Categorical(df['cat_column'])
We can see that the dtype for the categorical column is 'category':
df.cat_column.dtype
Out[20]: category
And normally we can do a dtype check by just comparing to the name
of the dtype:
df.x.dtype == 'float64'
Out[21]: True
But this doesn't seem to work when trying to check if the x column
is categorical:
df.x.dtype == 'category'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-94d2608815c4> in <module>()
----> 1 df.x.dtype == 'category'
TypeError: data type "category" not understood
Is there any way to do these types of checks in pandas v0.15+?
Use the name property to do the comparison instead; it should always work because it's just a string:
>>> import numpy as np
>>> arr = np.array([1, 2, 3, 4])
>>> arr.dtype.name
'int64'
>>> import pandas as pd
>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat.dtype.name
'category'
So, to sum up, you can end up with a simple, straightforward function:
def is_categorical(array_like):
return array_like.dtype.name == 'category'
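Applied to a frame like the one in the question, that helper should behave roughly as follows (a sketch with made-up sample data):
import pandas as pd

def is_categorical(array_like):
    return array_like.dtype.name == 'category'

df = pd.DataFrame({'x': [0.0, 10.0, 20.0],
                   'cat_column': pd.Categorical(['a', 'b', 'c'])})
print(is_categorical(df.cat_column))  # True
print(is_categorical(df.x))           # False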
First, the string representation of the dtype is 'category' and not 'categorical', so this works:
In [41]: df.cat_column.dtype == 'category'
Out[41]: True
But indeed, as you noticed, this comparison gives a TypeError for other dtypes, so you would have to wrap it with a try .. except .. block.
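A small sketch of such a wrapper, in case you want to stick with the string comparison (the helper name is made up, not from the original answer):
def is_category_dtype(series):
    # the == 'category' comparison raises TypeError for most non-categorical dtypes
    try:
        return series.dtype == 'category'
    except TypeError:
        return False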
Other ways to check using pandas internals:
In [42]: isinstance(df.cat_column.dtype, pd.api.types.CategoricalDtype)
Out[42]: True
In [43]: pd.api.types.is_categorical_dtype(df.cat_column)
Out[43]: True
For non-categorical columns, those statements will return False instead of raising an error. For example:
In [44]: pd.api.types.is_categorical_dtype(df.x)
Out[44]: False
For much older versions of pandas, replace pd.api.types in the above snippet with pd.core.common.
Just putting this here because pandas.DataFrame.select_dtypes() is what I was actually looking for:
df['column'].name in df.select_dtypes(include='category').columns
Thanks to @Jeff.
In my pandas version (v1.0.3), a shorter version of joris' answer is available.
df = pd.DataFrame({'noncat': [1, 2, 3], 'categ': pd.Categorical(['A', 'B', 'C'])})
print(isinstance(df.noncat.dtype, pd.CategoricalDtype)) # False
print(isinstance(df.categ.dtype, pd.CategoricalDtype)) # True
print(pd.CategoricalDtype.is_dtype(df.noncat)) # False
print(pd.CategoricalDtype.is_dtype(df.categ)) # True
I ran into this thread looking for the exact same functionality, and also found another option, right from the pandas documentation here.
It looks like the canonical way to check if a pandas dataframe column is a categorical Series should be the following:
hasattr(column_to_check, 'cat')
So, as per the example given in the initial question, this would be:
hasattr(df.cat_column, 'cat')  # True
Nowadays you can use:
pandas.api.types.is_categorical_dtype(series)
Docs here: https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_categorical_dtype.html
Available since at least pandas 1.0
Taking a look at @Jeff Tratner's answer: since the condition df.cat_column.dtype == 'category' does not need to be True for a column to be treated as categorical, I propose considering as categorical any of the dtypes in the 'categorical_dtypes' list:
def is_cat(column):
    categorical_dtypes = ['object', 'category', 'bool']
    if column.dtype.name in categorical_dtypes:
        return True
    else:
        return False