From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that setting copy=False would reorder the DataFrame in place (!). It appears, however, that I still get a copy and need to assign it back to the original object. I would like to avoid assigning it back if I can (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Outs:
a b c d e
0 0.234296 0.011235 0.664617 0.983243 0.177639
1 0.378308 0.659315 0.949093 0.872945 0.383024
2 0.976728 0.419274 0.993282 0.668539 0.970228
3 0.322936 0.555642 0.862659 0.134570 0.675897
4 0.167638 0.578831 0.141339 0.232592 0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex( columns=['e', 'd', 'c', 'b', 'a'], copy=False )
The desired output after that line is:
e d c b a
0 0.177639 0.983243 0.664617 0.011235 0.234296
1 0.383024 0.872945 0.949093 0.659315 0.378308
2 0.970228 0.668539 0.993282 0.419274 0.976728
3 0.675897 0.134570 0.862659 0.555642 0.322936
4 0.976057 0.232592 0.141339 0.578831 0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. As such, a copy is always returned because the operation cannot be done in-place (it would require allocating new memory for underlying arrays, etc). This means you have to assign the result back; there's no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
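As a quick check of the point above, here is a minimal sketch on a small made-up frame (the column names and values are only for illustration). It shows that reindex hands back a new object and leaves the original untouched, which is why the result has to be assigned back:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# reindex builds a new frame; the original keeps its column order
reordered = df.reindex(columns=["b", "a"])

print(list(df.columns))         # ['a', 'b']
print(list(reordered.columns))  # ['b', 'a']
print(reordered is df)          # False
```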
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
A bit off topic, but I believe this would rearrange the columns in place
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)
    dataset.insert(i, colname, col)
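A self-contained sketch of the pop/insert idea, using a small made-up frame (the data and the name same_object are assumptions for illustration). The second reference lets you confirm that the very same object was rearranged, rather than a copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(2, 5), columns=list("abcde"))
same_object = df  # second reference, to verify nothing was rebound

desired = ["e", "d", "c", "b", "a"]
for i, name in enumerate(desired):
    col = df.pop(name)       # removes the column from df and returns it
    df.insert(i, name, col)  # puts it back at position i

print(list(df.columns))   # ['e', 'd', 'c', 'b', 'a']
print(same_object is df)  # True
```

Note that this moves one column at a time, so for wide frames it is likely slower than a single reindex plus assignment.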
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe and then get the rest by checking what is in the original but not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing procedure and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Take the row positions modulo 3 and keep the rows where the remainder is not 0 (the rows with remainder 0 are exactly the ones selected by iloc[::3]):
import numpy as np
# for the default RangeIndex
df2 = df[df.index % 3 != 0]
# for any Index
df2 = df[np.arange(len(df)) % 3 != 0]
print (df2)
val
1 b
2 c
4 e
5 f
7 h
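If you want both pieces from one computation, a possible sketch is to build the boolean mask once and use it and its negation. The names n and mask are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
n = 3

# True at positions 0, n, 2n, ... regardless of the index labels
mask = np.arange(len(df)) % n == 0
df1, df2 = df[mask], df[~mask]

print(df1['val'].tolist())  # ['a', 'd', 'g']
print(df2['val'].tolist())  # ['b', 'c', 'e', 'f', 'h']
```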
How can I get the index or column of a DataFrame as a NumPy array or Python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned. Any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
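To tie this back to the original question (index or columns as an array or list), here is a small sketch using the same two-row frame as in the setup of this answer:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])

index_array = df.index.to_numpy()  # row labels as a NumPy array
column_list = df.columns.tolist()  # column labels as a Python list
values = df.to_numpy()             # the whole frame as a 2-D array

print(index_array.tolist())  # ['a', 'b']
print(column_list)           # ['A', 'B']
print(values.shape)          # (2, 2)
```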
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
The existing ExtensionArray backing the Index/Series, or
If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
If you are dealing with a multi-index dataframe, you may be interested in extracting only the column of one name of the multi-index. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
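A short sketch with a made-up two-level index (the level names 'outer' and 'inner' are assumptions for illustration):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1)], names=['outer', 'inner']
)
df = pd.DataFrame({'val': [10, 20, 30]}, index=index)

# pull out a single level of the MultiIndex by name
outer = df.index.get_level_values('outer')

print(list(df.index.names))  # ['outer', 'inner']
print(outer.tolist())        # ['x', 'x', 'y']
```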
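From pandas v0.13 you could also use get_values (note that it was deprecated in pandas 0.25 and removed in 1.0, so this only applies to older versions):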
df.index.get_values()
A more recent way to do this is to use the .to_numpy() function.
If I have a dataframe with a column 'price', I can convert it as follows:
priceArray = df['price'].to_numpy()
You can also pass the data type, such as float or object, through the dtype argument of the function.
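For instance, a minimal sketch with a made-up integer 'price' column converted to floats on the way out:

```python
import pandas as pd

df = pd.DataFrame({'price': [1, 2, 3]})

# dtype is forwarded to the conversion, so the ints become floats
price_array = df['price'].to_numpy(dtype=float)

print(price_array.dtype)     # float64
print(price_array.tolist())  # [1.0, 2.0, 3.0]
```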
I converted the pandas dataframe to list and then used the basic list.index(). Something like this:
dd = list(zone[0]) #Where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value as idx.
Below is a simple way to convert a dataframe column into a NumPy array.
df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])
ytrain_numpy is a NumPy array.
I tried to_numpy(), but it gave me the error below while doing Binary Relevance classification using Linear SVC:
TypeError: no supported conversion for types: (dtype('O'),)
to_numpy() did convert the DataFrame column into a NumPy array, but each inner element was a list, which is why the above error was raised.
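The difference can be seen in a small sketch, assuming a 'label' column whose cells are equal-length lists (the data here is made up). to_numpy() keeps a 1-D object array of lists, while going through tolist() lets NumPy build a proper 2-D numeric array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': [[1, 0], [0, 1], [1, 1]]})

as_objects = df['label'].to_numpy()        # 1-D array whose elements are lists
as_matrix = np.array(df['label'].tolist()) # 2-D numeric array

print(as_objects.dtype, as_objects.shape)  # object (3,)
print(as_matrix.shape)                     # (3, 2)
```

This only works cleanly when every inner list has the same length; ragged lists would not form a rectangular array.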
I'm trying to perform a number of operations on a list of dataframes. I've opted to use a dictionary to help me with this process, but I was wondering if it's possible to reference the originally created dataframe with the changes.
So using the below code as an example, is it possible to call the dfA object with the columns ['a', 'b', 'c'] that were added when it was nested within the dictionary object?
dfA = pd.DataFrame(data=[1], columns=['x'])
dfB = pd.DataFrame(data=[1], columns=['y'])
dfC = pd.DataFrame(data=[1], columns=['z'])
dfdict = {'A': dfA,
          'B': dfB,
          'C': dfC}
df_dummy = pd.DataFrame(data=[[1,2,3]], columns=['a', 'b', 'c'])
for key in dfdict:
    dfdict[str(key)] = pd.concat([dfdict[str(key)], df_dummy], axis=1)
The initial dfA that you created and the dfA DataFrame from your dictionary are two different objects. (You can confirm this by running dfA is dfdict['A'] or id(dfA) == id(dfdict['A']), both of which should return False).
To access the second (newly created) object you need to call it from the dictionary.
dfdict['A']
Or:
dfdict.get('A')
The returned DataFrame will have the new columns you added.
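A trimmed-down sketch of the situation, using only dfA from the question. pd.concat returns a new DataFrame, so the dictionary entry is rebound to it while the name dfA still points at the old one:

```python
import pandas as pd

dfA = pd.DataFrame(data=[1], columns=['x'])
dfdict = {'A': dfA}
df_dummy = pd.DataFrame(data=[[1, 2, 3]], columns=['a', 'b', 'c'])

# the dict value is replaced by the new frame; dfA itself is not modified
dfdict['A'] = pd.concat([dfdict['A'], df_dummy], axis=1)

print(dfA is dfdict['A'])         # False
print(list(dfA.columns))          # ['x']
print(list(dfdict['A'].columns))  # ['x', 'a', 'b', 'c']
```

If you want the original name to see the change, rebind it explicitly, for example dfA = dfdict['A'].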
I'm starting from the pandas DataFrame documentation here: Introduction to data structures
I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.
Note: I'm using Python 2.7.
import datetime as dt
import pandas as pd
import scipy as s
if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()
    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( s.zeros( len(dates)), dates )
    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]
    print valdict
NEVER grow a DataFrame row-wise!
TLDR; (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.
data = []
for row in some_function_that_yields_data():
    data.append(row)
df = pd.DataFrame(data)
pd.DataFrame converts the list of rows (where each row is a scalar value) into a DataFrame. If your function yields DataFrames instead, call pd.concat.
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
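Applied to the kind of calculation in the question (each row computed from the previous one), the list-first approach might look like the sketch below. The start date, the two columns, and the "+1" update rule are assumptions standing in for the real calculation:

```python
import pandas as pd

dates = pd.date_range('2024-01-01', periods=5, freq='D')

rows = []
previous = {'A': 0, 'B': 0}
for _ in dates:
    rows.append(dict(previous))  # store a copy of the current row
    # next row is derived from the previous one
    previous = {key: value + 1 for key, value in previous.items()}

# build the DataFrame once, after the loop
df = pd.DataFrame(rows, index=dates)

print(df['A'].tolist())  # [0, 1, 2, 3, 4]
```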
*** Update for pandas >= 1.4: append is now DEPRECATED! ***
As of pandas 1.4, append has been deprecated, and it was removed entirely in pandas 2.0. Use pd.concat instead. See the release notes.
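If your data arrives as DataFrames rather than rows, the same idea applies: collect them in a list and call pd.concat once. In this sketch yield_frames is a hypothetical stand-in for whatever produces your chunks:

```python
import pandas as pd

def yield_frames():
    # hypothetical data source producing one small frame per step
    for i in range(3):
        yield pd.DataFrame({'A': [i], 'B': [i * 2]})

frames = list(yield_frames())

# a single concat at the end instead of one per iteration
df = pd.concat(frames, ignore_index=True)

print(df['A'].tolist())  # [0, 1, 2]
print(df['B'].tolist())  # [0, 2, 4]
```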
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)
df.dtypes
A object # yuck!
B float64
C object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A int64
B float64
C object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A object # you DON'T want this
B object
C object
dtype: object
Appending still has all the issues as the methods above.
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
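If the number of rows really is known up front and the data is numeric, one alternative is to pre-allocate a NumPy array, fill that, and wrap it in a DataFrame at the end. This is only a sketch under those assumptions, with a made-up "previous row plus one" rule:

```python
import numpy as np
import pandas as pd

n_rows = 5
data = np.zeros((n_rows, 3))  # one float64 block, allocated once

for t in range(1, n_rows):
    data[t] = data[t - 1] + 1  # each row computed from the row before

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

print(df.dtypes.tolist())  # three float64 columns, no object dtype
print(df['A'].tolist())    # [0.0, 1.0, 2.0, 3.0, 4.0]
```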
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
To do this type of calculation on the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
newDF = pd.DataFrame() # creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print(newDF.head()) # again optional
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
If I have to keep appending new data into this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append().
Note: append() is deprecated since version 1.4.0. Use concat() instead.
Initialize empty frame with column names
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df
Add a new record to a frame
my_df.loc[len(my_df)] = [2, 4, 5]
You also might want to pass a dictionary:
my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic
Append another frame to your existing frame
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)
Performance considerations
If you are adding rows inside a loop consider performance issues. For around the first 1000 records "my_df.loc" performance is better, but it gradually becomes slower by increasing the number of records in the loop.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of these two:
fill a temporary dataframe with iloc until its size gets to around 1000, then append it to the original dataframe and empty the temporary dataframe.
This would boost your performance by around 10 times.
Simply:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros([rows, columns]))
Then fill it.
Assume a dataframe with 19 rows
index=range(0,19)
index
columns=['A']
test = pd.DataFrame(index=index, columns=columns)
Keeping Column A as a constant
test['A']=10
Keeping column b as a variable given by a loop
for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])
You can replace the first x in pd.Series([x], index = [x]) with any value
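When the values for the whole column are available at once, the loop can be skipped. A sketch of the same 19-row frame with column b assigned in one step (this is a swapped-in alternative to the loop above, not the same technique):

```python
import pandas as pd

test = pd.DataFrame(index=range(19), columns=['A'])
test['A'] = 10               # constant column
test['b'] = list(range(19))  # whole column assigned at once

print(test['b'].tolist()[:3])  # [0, 1, 2]
```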
This is my way to make a dynamic dataframe from several lists with a loop
x = [1,2,3,4,5,6,7,8]
y = [22,12,34,22,65,24,12,11]
z = ['as','ss','wa', 'ss','er','fd','ga','mf']
names = ['Bob', 'Liz', 'chop']
a loop
def dataF(x,y,z,names):
    res = []
    for t in zip(x,y,z):
        res.append(t)
    return pd.DataFrame(res,columns=names)
Result
dataF(x,y,z,names)
# import pandas library
import pandas as pd
# create a dataframe
my_df = pd.DataFrame({"A": ["shirt"], "B": [1200]})
# show the dataframe
print(my_df)
This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to values in some specific columns. I have ndtypes so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
List comprehension like:
list = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(list)
wouldn't work because the dtypes I originally had are no longer accessible.
Is there a better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as PD
Create some sample data as python dictionary, whose keys are the column names and whose values are the column values as a python list; one key/value pair per column
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
'value':[4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)
To return just the rows in which the category_code is either B or C, two steps conceptually, but can easily be done in a single line:
>>> # step 1: create the index
>>> idx = (D.category_code== 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.loc[idx]
category_code value
2 B 6
3 C 3
6 C 3
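When there are more than a couple of categories to match, chaining | comparisons gets tedious. A sketch of the same filter using Series.isin on the sample data from above:

```python
import pandas as pd

D = pd.DataFrame({'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
                  'value': [4, 2, 6, 3, 8, 4, 3, 9]})

# boolean mask: True where category_code is one of the listed values
filtered = D[D['category_code'].isin(['B', 'C'])]

print(filtered.index.tolist())     # [2, 3, 6]
print(filtered['value'].tolist())  # [6, 3, 3]
```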
Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would just place the index inside the brackets, indicating which dimension you are indexing with a ",", and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> D[idx,:]
In Pandas, you use the data frame's loc indexer (older versions also had ix, which has since been removed), and place only the index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
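If you would rather stay in NumPy and keep the structured dtype, a possible sketch is to build the mask with np.isin on the named field. The tiny array below is made up to mirror the question's category_code column:

```python
import numpy as np

data = np.array(
    [('D', 4), ('A', 2), ('B', 6), ('C', 3), ('D', 8)],
    dtype=[('category_code', 'U1'), ('value', 'i8')],
)

# vectorised membership test on one field, no Python-level loop
mask = np.isin(data['category_code'], ['A', 'B', 'C'])
filtered_data = data[mask]

print(filtered_data.dtype.names)        # ('category_code', 'value')
print(filtered_data['value'].tolist())  # [2, 6, 3]
```

Boolean indexing returns an array with the same dtype, so the columns stay accessible by name.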