How to assign values of series to column names of dataframe - python

I have Series with values:
0 1_AA
1 2_BB
2 3_CC
3 4_DD
and I want to convert this series to names of dataframe columns. It should look like this:
1_AA 2_BB 3_CC 4_DD
0
Is it possible?

One could just use the columns-argument for DataFrame:
>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'c'])
>>> pd.DataFrame(columns=s)
Empty DataFrame
Columns: [a, b, c]
Index: []
or pass it in directly as list:
>>> pd.DataFrame(columns=['1_AA', '2_BB', '3_CC', '4_DD'])
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []

You could use dict.fromkeys:
>>> import pandas as pd
>>> s = pd.Series(['1_AA', '2_BB', '3_CC', '4_DD'])
>>> pd.DataFrame(dict.fromkeys(s, [0])) # each column containing one zero - [0]
1_AA 2_BB 3_CC 4_DD
0 0 0 0 0
Or collections.OrderedDict, which garantuees that the order of your values is always kept:
>>> from collections import OrderedDict
>>> pd.DataFrame(OrderedDict.fromkeys(s, [0]))
1_AA 2_BB 3_CC 4_DD
0 0 0 0 0
You could also use empty lists as second argument for fromkeys:
>>> pd.DataFrame(dict.fromkeys(s, []))
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []
But that creates an empty dataframe - with the correct columns.

Related

Error when creating a data frame in Python: ValueError [duplicate]

This may be a simple question, but I can not figure out how to do this. Lets say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.
The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2
You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')
You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')
Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
DataFrame can be thought of as a collection of Series hence you can :
Concatenate multiple Series into one data frame (as described here )
Add a Series variable into existing data frame ( example here )
Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" Says you must pass an index.
This does not necessarily mean passing an index makes pandas do what you want it to do
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually automatically generated by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can however be more explicit about it
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier to read for other developers. Pandas has a lot of caveats, don't make other developers have to experts in all of them in order to read your code.
You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
I usually use the following to to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(files.items(), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78
You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})
I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)
import pandas as pd
a=2
b=3
dict = {'A': a, 'B': b}
pd.DataFrame(pd.Series(dict)).T
# *T :transforms the dataframe*
Result:
A B
0 2 3
To figure out the "ValueError" understand DataFrame and "scalar values" is needed.
To create a Dataframe from dict, at least one Array is needed.
IMO, array itself is indexed.
Therefore, if there is an array-like value there is no need to specify index.
e.g. The index of each element in ['a', 's', 'd', 'f'] are 0,1,2,3 separately.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As shows in the output, the index of the DataFrame is 0 and 1.
Coincidently same with the index of the array ['one array is arbitrary length', 'multi arrays should be the same length']
If comment out the 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
Cause scalar value (integer, bool, and string) does not have index
Note that Index(...) must be called with a collection of some kind
Since index used to locate all the rows of DataFrame
index should be an array. e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀
I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448
​```
the input does not have to be a list of records - it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2
This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!
If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although index is not required for a dictionary of lists, the same idea can be expanded to a dictionary of lists:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)
Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3
simplest options ls :
dict = {'A':a,'B':b}
df = pd.DataFrame(dict, index = np.arange(1) )
Another option is to convert the scalars into list on the fly using Dictionary Comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict whose values is a list of 1 element. such as :
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}
Convert Dictionary to Data Frame
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give new name to Column
col_dict_df.columns = ['col1', 'col2']
You could try this:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": d.keys(), "value": d.values()})
Just pass the dict on a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])

Panda Numpy converting data to a column

I have a data result that when I print it looks like
>>>print(result)
[[0]
[1]
[0]
[0]
[1]
[0]]
I guess that's about the same as [ [0][1][0][0][1][0] ] which seems a bit weird [0,1,0,0,1,0] seems a more logical representation but somehow it's not like that.
Though I would like these values to be added as a single column to a Panda dataframe df
I tried several ways to join it to my dataframe:
df = pd.concat(df,result)
df = pd.concat(df,{'result' =result})
df['result'] =pd.aply(result, axis=1)
with no luck. How can I do it?
There is multiple ways for flatten your data:
df = pd.DataFrame(data=np.random.rand(6,2))
result = np.array([0,1,0,0,1,0])[:, None]
print (result)
[[0]
[1]
[0]
[0]
[1]
[0]]
df['result'] = result[:,0]
df['result1'] = result.ravel()
#df['result1'] = np.concatenate(result)
print (df)
0 1 result result1
0 0.098767 0.933861 0 0
1 0.532177 0.610121 1 1
2 0.288742 0.718452 0 0
3 0.520980 0.367746 0 0
4 0.253658 0.011994 1 1
5 0.662878 0.846113 0 0
If you are looking to put that array in flat format pandas dataframe column, following is simplest way:
df["result"] = sum(result, [])
As long as the number of data points in this list is the same as the number of rows of the dataframe this should work:
import pandas as pd
your_data = [[0],[1],[0],[0],[1],[0]]
df = pd.DataFrame() # skip and use your own dataframe with len(df) == len(your_data)
df['result'] = [i[0] for i in your_data]

Changing multiple column names

Let's say I have a data frame with such column names:
['a','b','c','d','e','f','g']
And I would like to change names from 'c' to 'f' (actually add string to the name of column), so the whole data frame column names would look like this:
['a','b','var_c_equal','var_d_equal','var_e_equal','var_f_equal','g']
Well, firstly I made a function that changes column names with the string i want:
df.rename(columns=lambda x: 'or_'+x+'_no', inplace=True)
But now I really want to understand how to implement something like this:
df.loc[:,'c':'f'].rename(columns=lambda x: 'var_'+x+'_equal', inplace=True)
You can a use a list comprehension for that like:
Code:
new_columns = ['var_{}_equal'.format(c) if c in 'cdef' else c for c in columns]
Test Code:
import pandas as pd
df = pd.DataFrame({'a':(1,2), 'b':(1,2), 'c':(1,2), 'd':(1,2)})
print(df)
df.columns = ['var_{}_equal'.format(c) if c in 'cdef' else c
for c in df.columns]
print(df)
Results:
a b c d
0 1 1 1 1
1 2 2 2 2
a b var_c_equal var_d_equal
0 1 1 1 1
1 2 2 2 2
One way is to use a dictionary instead of an anonymous function. Both the below variations assume the columns you need to rename are contiguous.
Contiguous columns by position
d = {k: 'var_'+k+'_equal' for k in df.columns[2:6]}
df = df.rename(columns=d)
Contiguous columns by name
If you need to calculate the numerical indices:
cols = df.columns.get_loc
d = {k: 'var_'+k+'_equal' for k in df.columns[cols('c'):cols('f')+1]}
df = df.rename(columns=d)
Specifically identified columns
If you want to provide the columns explicitly:
d = {k: 'var_'+k+'_equal' for k in 'cdef'}
df = df.rename(columns=d)

Python: return max column from a set of columns

Let's say i have a dataframe with columns A, B, C, D
import pandas as pd
import numpy as np
## create dataframe 100 by 4
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.head(10)
I would like to create a new column, "max_bcd", and this column will say 'b','c','d', indicating that for that particular row, one of those three columns contains the largest value.
Does anyone know how to accomplish that?
Try this idmax with axis=1 will help you to find the max value among columnns:
>>> df.idxmax(axis=1)
0 B
1 C
2 D
dtype: object
import pandas as pd
import numpy as np
cols = ['B', 'C', 'D']
## create dataframe 100 by 4
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
df.head(10)
df.insert(4, 'max_BCD_name', None)
df.insert(5, 'max_BCD_value', None)
df['max_BCD_name'] = df.apply(lambda x: df[cols].idxmax(axis=1)) # column name
df['max_BCD_value'] = df.apply(lambda x: df[cols].max(axis=1)) # value
print(df)
Edit: Just saw your requirement of only B, C and D. Added code for that.
Output:
A B C D max_BCD_name max_BCD_value
0 -0.653010 -1.479903 3.415286 -1.246829 C 3.415286
1 0.343084 1.243901 0.502271 -0.467752 B 1.243901
2 0.099207 1.257792 -0.997121 -1.559208 B 1.257792
3 -0.646787 1.053846 -2.663767 1.022687 B 1.053846

Getting the integer index of a Pandas DataFrame row fulfilling a condition?

I have the following DataFrame:
a b c
b
2 1 2 3
5 4 5 6
As you can see, column b is used as an index. I want to get the ordinal number of the row fulfilling ('b' == 5), which in this case would be 1.
The column being tested can be either an index column (as with b in this case) or a regular column, e.g. I may want to find the index of the row fulfilling ('c' == 6).
Use Index.get_loc instead.
Reusing #unutbu's set up code, you'll achieve the same results.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(5)
1
You could use np.where like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
print(df)
# a b c
# b
# 2 1 2 3
# 5 4 5 6
print(np.where(df.index==5)[0])
# [1]
print(np.where(df['c']==6)[0])
# [1]
The value returned is an array since there could be more than one row with a particular index or value in a column.
With Index.get_loc and general condition:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(df.index[df['b'] == 5][0])
1
The other answers based on Index.get_loc() do not provide a consistent result, because this function will return in integer if the index consists of all unique values, but it will return a boolean mask array if the index does not consist of unique values. A more consistent approach to return a list of integer values every time would be the following, with this example shown for an index with non-unique values:
df = pd.DataFrame([
{"A":1, "B":2}, {"A":2, "B":2},
{"A":3, "B":4}, {"A":1, "B":3}
], index=[1,2,3,1])
If searching based on index value:
[i for i,v in enumerate(df.index == 1) if v]
[0, 3]
If searching based on a column value:
[i for i,v in enumerate(df["B"] == 2) if v]
[0, 1]

Categories

Resources