Error when creating a data frame in Python: ValueError [duplicate] - python

This may be a simple question, but I cannot figure out how to do this. Let's say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.

The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
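The two fixes above can be put side by side in a minimal sketch. It assumes a reasonably recent pandas release; the variable names `df_lists`, `df_index` and `message` are only for illustration.

```python
import pandas as pd

a, b = 2, 3

# All-scalar dict with no index: pandas cannot infer how many rows to build.
try:
    pd.DataFrame({'A': a, 'B': b})
except ValueError as exc:
    message = str(exc)

# Fix 1: wrap each scalar in a one-element list.
df_lists = pd.DataFrame({'A': [a], 'B': [b]})

# Fix 2: keep the scalars and state which rows exist.
df_index = pd.DataFrame({'A': a, 'B': b}, index=[0])

print(message)
print(df_lists.shape, df_index.shape)
```

Both frames end up with one row and the columns A and B; which form reads better is mostly a matter of taste.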

You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2

You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')

You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')

Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
A DataFrame can be thought of as a collection of Series, hence you can:
Concatenate multiple Series into one data frame
Add a Series to an existing data frame

Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually generated automatically when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can, however, be more explicit about it:
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0-based, though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier to read for other developers. Pandas has a lot of caveats; don't make other developers have to be experts in all of them in order to read your code.

You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
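Note that orient='index' produces a tall frame, not the one-row frame from the question. A small sketch of what comes back, and how a transpose recovers the wide layout (the names `tall` and `wide` are illustrative):

```python
import pandas as pd

a, b = 2, 3

# orient='index' turns each dict key into a row label.
tall = pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index')
print(tall.shape)            # two rows, one column

# Transposing gives one row with the keys as column names.
wide = tall.T
print(list(wide.columns))
```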

I usually use the following to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(list(files.items()), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78

You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})

I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)
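The answer does not say what shapes the arrays had, so the following is only a sketch under assumed shapes: one zero-dimensional array and one two-dimensional array. In both cases flatten() returns a one-dimensional copy that the constructor accepts as a column.

```python
import numpy as np
import pandas as pd

array1 = np.array(2)        # assumed: zero-dimensional, behaves like a scalar
array2 = np.array([[3]])    # assumed: two-dimensional with a single element

# flatten() always returns a one-dimensional copy.
flat1 = array1.flatten()
flat2 = array2.flatten()
print(flat1.shape, flat2.shape)

df = pd.DataFrame({'b': flat1, 'a': flat2})
print(df)
```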

import pandas as pd
a=2
b=3
data = {'A': a, 'B': b}
pd.DataFrame(pd.Series(data)).T
# .T transposes the DataFrame, so the dictionary keys become the columns
Result:
A B
0 2 3

To make sense of this ValueError, you need to understand how DataFrame treats scalar values.
To create a DataFrame from a dict, at least one array-like value is needed.
IMO, an array is itself indexed.
Therefore, if there is an array-like value, there is no need to specify an index.
E.g. the elements of ['a', 's', 'd', 'f'] have the indexes 0, 1, 2, 3 respectively.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As the output shows, the index of the DataFrame is 0 and 1,
which coincides with the index of the array ['one array is arbitrary length', 'multi arrays should be the same length'].
If you comment out 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
because scalar values (integer, bool, and string) do not have an index.
Note that Index(...) must be called with a collection of some kind.
Since the index is used to locate all the rows of the DataFrame,
the index should be an array, e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀

I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448

The input does not have to be a list of records; it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2

This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!

If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although index is not required for a dictionary of lists, the same idea can be expanded to a dictionary of lists:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)

Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3

The simplest option is:
import numpy as np
data = {'A':a,'B':b}
df = pd.DataFrame(data, index = np.arange(1))

Another option is to convert the scalars into list on the fly using Dictionary Comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict whose values are one-element lists, such as:
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}

Convert Dictionary to Data Frame
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give new name to Column
col_dict_df.columns = ['col1', 'col2']

You could try this:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')

If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": list(d.keys()), "value": list(d.values())})

Just pass the dict on a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but a carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat( [df1, df2] , axis = 1, ignore_index= True )
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work, I think:
df1["second"] = df2["Second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
# -----------
# sample data
# -----------
df1 = pd.DataFrame(
{
'x': ['a','b'],
'First': [1,3],
})
df1.set_index("x", drop=True, inplace=True)
df2 = pd.DataFrame(
{
'x': ['c','d'],
'Second': [2, 4],
})
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
First Second
a 1 2
b 3 4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the labels along the concatenation axis are kept. With axis=1 that axis is the columns, so ignore_index=True discards the column names and numbers them from 0 to n-1, which is why the headers in your result are 0 and 1. The rows are still aligned on their original index.
You can try
out = pd.concat( [df1.reset_index(drop=True), df2.reset_index(drop=True)] , axis = 1)
print(out)
First Second
0 1 2
1 3 4
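The reset_index approach can be wrapped in a small helper. This is a sketch only: `concat_by_position` is a hypothetical name, and the length check is an added assumption, since frames of different lengths would otherwise be padded with NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

def concat_by_position(left, right):
    """Join two frames side by side by row position (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError("frames must have the same number of rows")
    # Dropping both indexes leaves matching 0..n-1 labels, so alignment
    # on labels is the same as alignment on position.
    return pd.concat(
        [left.reset_index(drop=True), right.reset_index(drop=True)],
        axis=1,
    )

out = concat_by_position(df1, df2)
print(out)
```

Because reset_index(drop=True) only relabels rows and never reorders them, the row order of both inputs is preserved.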

modify a dataframe column with a dictionary

I have a column in my dataframe that has a value that is an identifier. And in turn I have a dictionary that contains for each identifier its meaning.
I would like to replace the identifier of my column with its meaning.
I thought it was as simple as doing the following (the column with the identifier is in position 3 of the dataframe),
df.iloc[:,3] = my_dict[df.iloc[:,3]]
But I get the following error,
TypeError: unhashable type: 'Series'.
The column in question contains integers and the dictionary looks like the following:
my_dict = {1: "One", 2: "Two", 3: "Three"}
How could I make this change to my dataframe?
Thank you very much.
I would recommend applying a function to your column:
def explain_column(x, my_dict):
    if x in my_dict.keys():
        return my_dict[x]
    else:
        return x  # Assuming that you won't change the value if not in the dict
df['my_column'] = df['my_column'].apply(lambda x: explain_column(x, my_dict))
You could use .map()
Example:
import pandas as pd
data = [[1,2,3],
[2,3,4]]
columns = ['a','b', 'c']
df = pd.DataFrame(data, columns=columns)
my_dict = {1: "One", 2: "Two", 3: "Three"}
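Before mapping:
print(df)
a b c
0 1 2 3
1 2 3 4
df['a'] = df['a'].map(my_dict).fillna(df['a'])
df['c'] = df['c'].map(my_dict).fillna(df['c'])
After mapping (note that I only applied it to columns 'a' and 'c'; if you don't want NaN for values that are missing from the dictionary, you need the .fillna()):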
print(df)
a b c
0 One 2 Three
1 Two 3 4
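Since the question addresses the column by position, here is a hedged variant that looks the column name up by position and falls back to the original value with dict.get instead of fillna. The three-column frame is the toy example from above, so position 2 stands in for the asker's position 3.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4]], columns=['a', 'b', 'c'])
my_dict = {1: "One", 2: "Two", 3: "Three"}

# Resolve the positional column to its label ('c' in this toy frame).
col = df.columns[2]

# dict.get(x, x) returns the meaning when the key exists and the
# original identifier otherwise, so no NaN is ever produced.
df[col] = df[col].map(lambda x: my_dict.get(x, x))

print(df[col].tolist())
```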

Get a Dataframe index if it is of a specific dtype

I have been trying to build a preprocessing pipeline, but I am struggling a little to generate a list of the indexes for each column that is an object dtype. I have been able to get the names of each into an array using the following code:
categorical_features = [col for col in input.columns if input[col].dtype == 'object']
Is there an easy way to get the index of these columns, from the original input dataframe into a list, like this one that I built manually?
c = [1,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,28,29,
30,31,38,39,40,41,42,43,44,45,50,51,55,56]
Use df.select_dtypes + df.columns.get_indexer:
categorical_features = df.columns.get_indexer(df.select_dtypes('object').columns)
df.select_dtypes returns a copy of df with only the columns that are of the specified dtype(s) (you can specify multiple, e.g. df.select_dtypes(['object', 'int'])).
df.columns.get_indexer returns the indexes of the specified columns.
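A small sketch of that combination on a toy frame. The astype call is an assumption added here to force the object dtype, so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'], 'B': [1, 2], 'C': ['u', 'v']})
# Force object dtype explicitly so the selection below is predictable.
df = df.astype({'A': object, 'C': object})

# Names of the object columns, then their positions in the original frame.
object_columns = df.select_dtypes('object').columns
positions = df.columns.get_indexer(object_columns)

print(positions.tolist())
```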
I think you need select_dtypes and enumerate:
df = pd.DataFrame({'A' : ['A', 'B', 'C'], 'B' : [1,2,3], 'C' : [1, '2', '3']})
print(df)
A B C
0 A 1 1
1 B 2 2
2 C 3 3
obj_cols = df.select_dtypes('object').columns
idx_cols = [idx for idx, col in enumerate(df.columns) if col in obj_cols]
[0, 2]
enumerate can help with that:
categorical_features_indexes = [i for i, col in enumerate(input.columns) if input[col].dtype == 'object']

Proper way to pass list to value_vars using pandas melt

I'm unsure how to do this with Python and am stuck. The ticker values in the column are in a list format. When trying to pass that list to melt's value_vars I get an error. When I try converting to a tuple it still contains the list brackets. The documentation says "value_vars -- tuple, list, or ndarray, optional" -- not having success w/ list or tuple. Thanks in advance.
My data:
sector ticker
0 Communication Services [ATVI.OQ, GOOGL.OQ, GOOG.OQ, T.N, CTL.N, CHTR....
1 Consumer Discretionary [AAP.N, AMZN.OQ, APTV.N, AZO.N, BBY.N, BKNG.OQ...
rowData = groups.loc[groups['sector'] == 'Communication Services']
print(tuple(rowData['ticker']))
new_df = pd.melt(new_df, id_vars=['date'], value_vars=rowData['ticker'])
The tuple doesn't look right with this output:
(['ATVI.OQ', 'GOOGL.OQ', 'GOOG.OQ', 'T.N', 'CTL.N', 'CHTR'],)
And here is the value_vars error:
TypeError: unhashable type: 'list'
EDIT
Solved using
tup = tuple(rowData['ticker'].explode())
new_df = pd.melt(new_df, id_vars=['date'], value_vars=tup)
Your data are not actually in wide format. I think what you want is just to explode the column:
df = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6]], 'B': ['A', 'B']})
df.explode('A')
Out[21]:
A B
0 1 A
0 2 A
0 3 A
1 4 B
1 5 B
1 6 B
But I'm not 100 per cent sure about what the end goal is. See http://xyproblem.info/.
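For the melt route from the question's edit, a sketch of how the exploded tickers feed value_vars. The `prices` frame and all of its numbers are invented toy data; the original wide frame is not shown in the question.

```python
import pandas as pd

groups = pd.DataFrame({
    'sector': ['Communication Services'],
    'ticker': [['ATVI.OQ', 'T.N']],
})
# Invented wide-format frame: one column per ticker (toy values).
prices = pd.DataFrame({
    'date': ['d1', 'd2'],
    'ATVI.OQ': [1.0, 2.0],
    'T.N': [3.0, 4.0],
    'AMZN.OQ': [5.0, 6.0],
})

row = groups.loc[groups['sector'] == 'Communication Services']
# explode() unpacks the single list cell into one ticker per row,
# giving a flat list of column names instead of a list inside a tuple.
tickers = row['ticker'].explode().tolist()

long_df = pd.melt(prices, id_vars=['date'], value_vars=tickers)
print(long_df)
```

Only the tickers of the selected sector are melted; the AMZN.OQ column is left out because it is not in value_vars.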

Creating Pandas DataFrame from list or dict always returns empty DF

I'm trying to create a pandas dataframe out of a dictionary. The dictionary keys are strings and the values are 1 or more lists. I'm having a strange issue in which pd.DataFrame() command consistently returns an empty dataframe even when I pass it a non-empty object like a list or dict.
My code is similar to the following:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],[2,34,11],"ID3":[8,3,12]}
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
So I want to create a DF that looks like this:
A B C
ID1 1 2 3
ID2 10 11 12
ID2 2 34 11
ID3 8 3 12
When I check the contents of df, I get "Empty DataFrame" and if I iterate over its contents, I get just the column names and none of the data in myDictionary! I have checked the documentation and this should be a straightforward command:
pd.DataFrame(dict, columns)
This doesn't get me the result I'm looking for and I'm baffled why. Anyone have any ideas? Thank you!
What I would recommend doing in this situation is interpreting your list of lists as strings. Later if you need to edit or analyze any of these you can use a parser to interpret the columns.
See below working code that allows you to keep your list of lists in the dataframe.
myDictionary = {"ID1":'[1,2,3]', "ID2":'[10,11,12],[2,34,11]',"ID3":'[8,3,12]'}
df = pd.DataFrame(myDictionary, columns = ["ID1","ID2","ID3"], index = [0])
df.rename(columns ={'ID1' : 'A', 'ID2': 'B', 'ID3': 'C'}, inplace = True)
df.head(3)
By always converting the lists to strings you will be able to combine them much easier, regardless of how many lists there are that need to be combined.
Try the example below to figure out why df is empty:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],"ID3":[8,3,12], 'A':[0, 0, 0]}
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
And what you want is:
myDictionary = {"ID1":[1,2,3], "ID2":[10,11,12],"ID3":[8,3,12]}
df = pd.DataFrame(myDictionary).rename(columns={'ID1':'A', 'ID2':'B', 'ID3':'C'})
You are passing in the names "ID1", "ID2", and "ID3" into pd.DataFrame as the column names and then telling pandas to use columns A, B, C. Since there are no columns A, B, C pandas returns an empty DataFrame. Use the code below to make the DataFrame:
import pandas as pd
myDictionary = {"ID1": [1, 2, 3], "ID2": [10, 11, 12], "ID3": [8, 3, 12]}
df = pd.DataFrame(myDictionary, columns=["ID1", "ID2", "ID3"])
print(df)
Output:
ID1 ID2 ID3
0 1 10 8
1 2 11 3
2 3 12 12
And moreover this:
"ID2":[10,11,12],[2,34,11]
is incorrect, since you are either trying to pass two values for one key in a dictionary, or you forgot to make a key for the values [2,34,11]. As written, the dictionary literal raises a SyntaxError unless you remove that list or give it a key.
Firstly the [2,34,11] list is missing a column name. GIVE IT A NAME!
The reason for your error is that when you use the following command:
df = pd.DataFrame(myDictionary, columns = ["A","B","C"])
It creates a dataframe based on your dictionary. But then you are saying that you only want columns from your dictionary that are labelled 'A', 'B', 'C', which your dictionary doesn't have.
Try instead:
df = pd.DataFrame(myDictionary, columns = ["ID1","ID2","ID3"])
df.rename(columns ={'ID1' : 'A', 'ID2': 'B', 'ID3': 'C'}, inplace = True)
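To see the filtering effect directly, and as one possible way to get the ID labels on the rows as in the desired output, here is a minimal sketch. It uses the three-key dictionary without the stray list; using from_dict with orient="index" is a suggestion added here, not part of the answers above.

```python
import pandas as pd

myDictionary = {"ID1": [1, 2, 3], "ID2": [10, 11, 12], "ID3": [8, 3, 12]}

# columns= selects dict keys; none of "A", "B", "C" exist, so no data is kept.
empty = pd.DataFrame(myDictionary, columns=["A", "B", "C"])
print(empty.empty)

# Row-oriented alternative: keys become row labels and columns= names
# the list positions.
df = pd.DataFrame.from_dict(myDictionary, orient="index", columns=["A", "B", "C"])
print(df)
```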
You cannot build a data frame like your example directly from a dictionary, because two rows share the same label:
ID2 10 11 12
ID2 2 34 11
In a dictionary every key has to be unique, but your desired data frame implies a dictionary like the one below, which is impossible (the second ID2 entry would replace the first):
{"ID2":[10,11,12],"ID2":[2,34,11]}
So my suggestion is to change your dictionary design and then follow any of the many answers about converting a dictionary to a DataFrame.
Here is one possible approach
Dictionary
myDictionary = {"ID1":[1,2,3], "ID2":[[10,11,12],[2,34,11]],"ID3":[8,3,12]}
Get a dictionary d that holds the key-value pairs whose values are nested lists, such that (a) its keys are unique (a suffix is used to ensure this) and (b) its values are the individual sub-lists from the nested list.
To do this, iterate through the dictionary and
check if the value contains a sub-list
if so, add one key:value pair per sub-list to the separate dictionary d
use a suffix to separate identical keys, since the key ID2 can't be repeated in a dictionary
each suffixed key will hold one of the sub-lists from the nested list
also collect, in a variable named nested_keys, the keys of the original dictionary myDictionary whose values are nested lists
d = {}
nested_keys = []
for k,v in myDictionary.items():
    if any(isinstance(i, list) for i in v):
        for m,s in enumerate(v):
            d[k+'_'+str(m+1)] = s
        nested_keys.append(k)
print(d)
{'ID2_1': [10, 11, 12], 'ID2_2': [2, 34, 11]}
(Using the list of keys whose values are nested lists - nested_keys) Get a second dictionary that contains values that are not nested lists - see this SO post for how to do this
myDictionary = {key: myDictionary[key] for key in myDictionary if key not in nested_keys}
print(myDictionary)
{'ID1': [1, 2, 3], 'ID3': [8, 3, 12]}
Combine the 2 dictionaries above into a single dictionary
myDictionary = {**d, **myDictionary}
print(myDictionary)
{'ID2_1': [10, 11, 12], 'ID2_2': [2, 34, 11], 'ID1': [1, 2, 3], 'ID3': [8, 3, 12]}
Convert the combined dictionary into a DataFrame and drop the suffix that was added earlier
df = pd.DataFrame(list(myDictionary.values()), index=myDictionary.keys(),
columns=list('ABC'))
df.reset_index(inplace=True)
df = df.replace(r"_[0-9]", "", regex=True)
df.sort_values(by='index', inplace=True)
print(df)
index A B C
2 ID1 1 2 3
0 ID2 10 11 12
1 ID2 2 34 11
3 ID3 8 3 12
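A shorter route to a similar result, sketched under the same nested-dictionary layout, is to collect row labels and rows in two parallel lists. The names `labels` and `rows` are illustrative, and the check on value[0] assumes every value is a non-empty list.

```python
import pandas as pd

myDictionary = {"ID1": [1, 2, 3], "ID2": [[10, 11, 12], [2, 34, 11]], "ID3": [8, 3, 12]}

labels, rows = [], []
for key, value in myDictionary.items():
    # A value counts as nested when its first element is itself a list.
    sublists = value if isinstance(value[0], list) else [value]
    for sub in sublists:
        labels.append(key)   # repeat the key once per sub-list
        rows.append(sub)

df = pd.DataFrame(rows, index=labels, columns=list("ABC"))
print(df)
```

This keeps the ID labels in the index, with ID2 appearing twice, and avoids the suffix and regex steps.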
