Append a column to a dataframe in Pandas - python

I am trying to append a numpy.ndarray to a dataframe with little success.
The dataframe is called user2 and the numpy.ndarray is called CallTime.
I tried:
user2["CallTime"] = CallTime.values
but I get an error message:
Traceback (most recent call last):
File "<ipython-input-53-fa327550a3e0>", line 1, in <module>
user2["CallTime"] = CallTime.values
AttributeError: 'numpy.ndarray' object has no attribute 'values'
Then I tried:
user2["CallTime"] = user2.assign(CallTime = CallTime.values)
but again I get the same error message as above.
I also tried to use the merge command but for some reason it was not recognized by Python although I have imported pandas. In the example below CallTime is a dataframe:
user3 = merge(user2, CallTime)
Error message:
Traceback (most recent call last):
File "<ipython-input-56-0ebf65759df3>", line 1, in <module>
user3 = merge(user2, CallTime)
NameError: name 'merge' is not defined
Any ideas?
Thank you!

A pandas DataFrame is a 2-dimensional data structure, and each column of a DataFrame is a 1-dimensional Series. So if you want to add one column to a DataFrame, it is convenient to first convert it into a Series. np.ndarray is a multi-dimensional data structure. From your code, I believe the shape of your np.ndarray CallTime is nx1 (n rows and 1 column), and it's easy to convert it to a Series. Here is an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 2), columns=['A', 'B'])
This creates a dataframe df with two columns 'A' and 'B', and 5 rows.
CallTime = np.random.rand(5,1)
Assume this is your np.ndarray data CallTime
df['C'] = pd.Series(CallTime[:, 0])
This will add a new column to df. Here CallTime[:, 0] selects the first column of CallTime, so if you want to use a different column from the np.ndarray, change the index.
Please make sure that the number of rows for df and CallTime are equal.
Hope this would be helpful.

I think instead to provide only documentation, I will try to provide a sample:
import numpy as np
import pandas as pd
data = {'A': [2010, 2011, 2012],
        'B': ['Bears', 'Bears', 'Bears'],
        'C': [11, 8, 10],
        'D': [5, 8, 6]}
user2 = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
# create the array that will be appended to the pandas dataframe user2
CallTime = np.array([1, 2, 3])
# convert the ndarray CallTime to a list; if your CallTime is a matrix, then after converting it to a list you can iterate over it, or you can convert it into a dataframe and append just the required column, or simply join the two dataframes
user2.loc[:,'CallTime'] = CallTime.tolist()
print(user2)
I think this one will help. Also check the numpy.ndarray.tolist documentation if you need to find out why we need the list and how to use it, and here is an example of how to create a dataframe from numpy in case you need it: https://stackoverflow.com/a/35245297/2027457

Here is a simple solution.
user2["CallTime"] = CallTime
The problem here is that CallTime is already an array, so you can't use .values on it; .values is what converts a dataframe (or one of its columns) to an array. For example,
df = pd.DataFrame(np.random.rand(10, 2), columns=['A', 'B'])
# The followings are correct
df.values
df['A'].values
df['B'].values
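Since the question also tried DataFrame.assign, a short aside: assign returns a new DataFrame rather than modifying the original in place, which is why user2["CallTime"] = user2.assign(...) misbehaves. A minimal sketch (the small example frames here are made up for illustration):
import numpy as np
import pandas as pd
user2 = pd.DataFrame({'A': [1, 2, 3]})
CallTime = np.array([10, 20, 30])
# assign returns a copy with the new column; bind the result back to keep it
user2 = user2.assign(CallTime=CallTime)
print(user2)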

Related

Can't access DataFrame elements after reading from CSV

I'm creating a matrix and converting it into DataFrame after creation. Since I'm working with lots of data and it takes a while for creation I wanted to store the matrix into a CSV so I can just read it once is created. Here what I'm doing:
transitions = create_matrix(alpha, N)
# convert the matrix to a DataFrame
df = pd.DataFrame(transitions, columns=list(tags), index=list(tags))
df.to_csv(r'D:\U\k\Desktop\prg\F_transition_' + language + '.csv')
df_r = pd.read_csv('transition_en.csv')
The fact is that after reading from CSV I got the error:
in get_loc raise KeyError(key). KeyError: 'O'
It seems this is thrown by those lines of code:
if i == 0:
    tran_pr = df_r.loc['O', tag]
else:
    tran_pr = df_r.loc[st[-1], tag]
I imagine that once the data is stored in a CSV, reading the file back does not give me a DataFrame equivalent to the one I had before. How can I make these lines of code work like they did before?
I tried to set index=False when creating the csv and also skip_blank_lines=True when reading. Nothing changed.
df_r looks like this (screenshot omitted in the original post).
can you try:
import pandas as pd
df = pd.DataFrame([[1, 2], [2, 3]], columns = ['A', 'B'], index = ['C', 'D'])
print(df['A']['C'])
When using loc you need to provide the index first and then the column:
df_r.loc[tag, 'O']
will work.
Don't use index=False when exporting the CSV, as that would leave the index out of the file.
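A hedged aside on the likely root cause (an assumption on my part, not from the answer above): if the CSV keeps the index, reading it back with index_col=0 restores the row labels, so label-based .loc lookups work again:
import pandas as pd
# a tiny stand-in for the transition matrix in the question
df = pd.DataFrame([[0.1, 0.9], [0.4, 0.6]], index=['O', 'NN'], columns=['O', 'NN'])
df.to_csv('transition_en.csv')                        # keep the index in the file (the default)
df_r = pd.read_csv('transition_en.csv', index_col=0)  # restore it on read
print(df_r.loc['O', 'NN'])                            # label-based lookup works again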

Get Non Empty Column in Pandas Dataframe

In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to return back columns
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing it's somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
** Edit
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan
# set row number
row_number = 0
# get dataframe
df.loc[row_number, ~np.isnan(df.values)[row_number]]
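As a side note (a standard pandas idiom, not part of the answer above): since df.iloc[0] is a Series, Series.dropna gives the same result more directly:
df.iloc[0].dropna()        # the non-missing entries of row 0
df.iloc[0].dropna().index  # just the labels of the non-empty columns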

Pandas fillna with method=None (default value) raises an error

I am writing a function to aid DataFrame merges between two tables. The function creates a mapping key in the first DataFrame using variables in the second DataFrame.
My issue arises when I try to include the .fillna(method=) at the end of the function.
# Import libraries
import pandas as pd
# Create data
data_1 = {"col_1": [1, 2, 3, 4, 5], "col_2": [1, None, 3, None, 5]}
data_2 = {"col_1": [1, 2, 3, 4, 5], "col_3": [1, None, 3, None, 5]}
df = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
def merge_on_key(df, df2, join_how="left", fill_na=None):
    # Import libraries
    import pandas as pd
    # Code to create mapping key not required for question
    # Merge the two dataframes
    print(fill_na)
    print(type(fill_na))
    df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method=fill_na)
    return df3
df3 = merge_on_key(df, df2)
output:
>>> None
>>> <class 'NoneType'>
error message:
ValueError: Must specify a fill 'value' or 'method'
My question is: why does fill_na, which is equal to None, not work in fillna(method=fill_na), given that None is the default value of fillna's method parameter?
You have to specify either a 'value' or a 'method'. In your call to fillna you are setting both of them to None; in other words, you are not telling pandas what to fill the missing values with, so it raises an exception.
Based on the docs (link), you could either assign a non-empty value:
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(value=0)
or change the method from None (which means "directly substitute the None values in the dataframe by the given value") to one of {'backfill', 'bfill', 'pad', 'ffill'} (each documented in the docs):
df3 = pd.merge(df, df2, how=join_how, on="col_1").fillna(method='backfill')
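A minimal sketch of how the helper could tolerate its own default (my suggestion, not from the answer above): only call fillna when a method is actually supplied:
def merge_on_key(df, df2, join_how="left", fill_na=None):
    df3 = pd.merge(df, df2, how=join_how, on="col_1")
    if fill_na is not None:  # skip fillna entirely when no method is given
        df3 = df3.fillna(method=fill_na)
    return df3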

pandas- get index as a string [duplicate]

This question already has answers here:
Convert pandas dataframe to NumPy array
(15 answers)
Closed 2 years ago.
How can I get the index or column of a DataFrame as a NumPy array or Python list?
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or
DataFrame.values, but we highly recommend using .array or
.to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
to_numpy() Method
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned. Any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array
will return a new arrays.PandasArray, which is a thin (no-copy)
wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially
useful on its own, but it does provide the same interface as any
extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
1. the existing ExtensionArray backing the Index/Series, or
2. if there is a NumPy array backing the Series, a new ExtensionArray object created as a thin wrapper over the underlying array.
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like Categorical). For example, with PeriodIndex, .values
generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API as soon as they can.
If you are dealing with a multi-index dataframe, you may be interested in extracting only one level of the multi-index by name. You can do this as
df.index.get_level_values('name_sub_index')
and of course name_sub_index must be an element of the FrozenList df.index.names
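A small sketch of what this looks like (the index level names here are made up for illustration):
import pandas as pd
idx = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['letter', 'number'])
df = pd.DataFrame({'X': [10, 20]}, index=idx)
df.index.get_level_values('number')  # the 'number' level as an Index: [1, 2]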
Since pandas v0.13 you can also use get_values:
df.index.get_values()
(Note that get_values has since been deprecated, in pandas 0.25, in favour of to_numpy().)
A more recent way to do this is to use the .to_numpy() function.
If I have a dataframe with a column 'price', I can convert it as follows:
priceArray = df['price'].to_numpy()
You can also pass the data type, such as float or object, as an argument of the function
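For example, a numeric column can be pulled out with an explicit dtype (assuming 'price' holds numbers):
priceArray = df['price'].to_numpy(dtype='float64')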
I converted the pandas dataframe to a list and then used the basic list.index(). Something like this:
dd = list(zone[0]) # where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value in idx.
Below is a simple way to convert a dataframe column into a NumPy array.
df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])
ytrain_numpy is a NumPy array.
I tried with .to_numpy(), but it gave me the below error while doing Binary Relevance classification using Linear SVC:
TypeError: no supported conversion for types: (dtype('O'),)
.to_numpy() was converting the dataframe into a NumPy array, but the inner elements' data type was list, which is why the above error was observed.

What is the correct way of passing parameters to stats.friedmanchisquare based on a DataFrame?

I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to a matrix-form leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
in this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting it to a numpy array beforehand is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not a single list.
Try using the * (star/unpack) operator to unpack the list
Like this
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
stats.friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
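For readers on current pandas and Python 3, a hedged translation of the accepted approach (DataFrame.as_matrix was removed in pandas 1.0, so to_numpy() stands in for it):
import pandas as pd
from scipy import stats
df = pd.read_csv('df.csv', header=0, index_col=0)  # as in the question
arr = df.to_numpy()                                # replaces df.as_matrix()
print(stats.friedmanchisquare(*(arr[i, :] for i in range(arr.shape[0]))))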
