Pandas DataFrame list strange behavior - python

When assigning a new column to one dataframe in the list, it copies it to all other dataframes. Example:
In [219]: a = [pd.DataFrame()]*2
In [220]: a[0]['a'] = [1,2,3]
In [221]: a[1]
Out[221]:
a
0 1
1 2
2 3
Is this a bug? And what can I do to prevent it?
Thanks!

This happens because when you build a list with the syntax
x = [something]*n
you get a list in which every item is THE SAME something. No copies are made; each element references the SAME object:
>>> import pandas as pd
>>> a=pd.DataFrame()
>>> g=[a]*2
>>> g
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> id(g[0])
129264216L
>>> id(g[1])
129264216L
The comments are pointing to some useful examples which you should read through and grok.
To avoid it in your situation, just use another way of instantiating the list:
>>> list(map(lambda x: pd.DataFrame(), range(2)))  # list() is needed on Python 3, where map returns an iterator
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>> [pd.DataFrame() for i in range(2)]
[Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []]
>>>
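The difference between the two ways of building the list can be checked directly with an identity test. This is a minimal sketch; the names shared and separate are illustrative and do not come from the question.

```python
import pandas as pd

# [obj] * n repeats a reference to one object n times.
shared = [pd.DataFrame()] * 2
assert shared[0] is shared[1]          # same object, so edits show up in both

# A list comprehension calls pd.DataFrame() once per element.
separate = [pd.DataFrame() for _ in range(2)]
assert separate[0] is not separate[1]  # two distinct objects

separate[0]['a'] = [1, 2, 3]
print(separate[1].empty)  # True: the second frame is untouched
```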

EDIT: I now see that there is an explanation for this in the replies^
I don't understand what causes this just yet, but you can get around it by defining your dataframes separately before putting them in a list.
In [2]: df1 = pd.DataFrame()
In [3]: df2 = pd.DataFrame()
In [4]: a = [df1, df2]
In [5]: a[0]['a'] = [1,2,3]
In [6]: a[0]
Out[6]:
a
0 1
1 2
2 3
In [7]: a[1]
Out[7]:
Empty DataFrame
Columns: []
Index: []

Related

How do I find which columns in my pandas dataframe contain a list?

I am running a drop duplicates on my dataframe and I cannot run it on columns that contain lists.
I am trying to make a function that finds the columns containing lists before I drop duplicates.
df1 = pd.DataFrame([[1,[1,2,3],1],[2,[2,3],2],[3,[3],[3,4]]], columns = ['a','b','c'])
a b c
0 1 [1, 2, 3] 1
1 2 [2, 3] 2
2 3 [3] [3,4]
cols = find_list_columns(df1)
cols
['b','c']
Use applymap and any (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map):
In [556]: df1.columns[df1.applymap(type).eq(list).any()]
Out[556]: Index(['b', 'c'], dtype='object')
As @Ch3steR recommended:
df1.applymap(lambda x: isinstance(x,list)).any()
You can define a custom function using isinstance with any, then use df.apply to create a boolean mask and apply boolean indexing to df.columns:
def has_list(x):
    return any(isinstance(i, list) for i in x)
mask = df1.apply(has_list)
mask
# a False
# b True
# c True
# dtype: bool
cols = df1.columns[mask].tolist()
# ['b', 'c']
df1.drop(columns = cols)
# a
# 0 1
# 1 2
# 2 3
If your dataframe contains only numbers and lists, you can do it in one line (any other object column, such as strings, would also match):
print([key for key in df1.columns if df1[key].dtype == "object"])
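The find_list_columns function the question asks for could be sketched as below. It inspects the values rather than the dtype, so string columns are not flagged. The body is one possible implementation, not the only one, and the sample frame assumes the last row of column c holds the list [3, 4], as in the printed table.

```python
import pandas as pd

def find_list_columns(df):
    """Return the names of columns that contain at least one list value."""
    # For each column, test every value and reduce with any().
    mask = df.apply(lambda col: col.map(lambda v: isinstance(v, list)).any())
    return df.columns[mask].tolist()

df1 = pd.DataFrame(
    [[1, [1, 2, 3], 1], [2, [2, 3], 2], [3, [3], [3, 4]]],
    columns=['a', 'b', 'c'],
)
cols = find_list_columns(df1)
print(cols)  # ['b', 'c']

# Drop the list columns first, then drop_duplicates works on what is left.
deduped = df1.drop(columns=cols).drop_duplicates()
```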

How to concat item that is in list format in columns in dataframe

I want to concatenate the items of a list stored in a DataFrame column.
I have the DataFrame below; when I print DataFrame.head(), it shows:
A B
1 [1,2,3,4]
2 [5,6,7,8]
Expected result (each list converted to a comma-separated string):
A B
1 1,2,3,4
2 5,6,7,8
You could do:
import pandas as pd
data = [[1, [1,2,3,4]],
[2, [5,6,7,8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = [','.join(map(str, lst)) for lst in df.B]
print(df.head(2))
Output
A B
0 1 1,2,3,4
1 2 5,6,7,8
You can use the map or apply methods for this:
import pandas as pd
data = [[1, [1,2,3,4]],
[2, [5,6,7,8]]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['B'] = df['B'].map(lambda x: ",".join(map(str,x)))
# or
# df['B'] = df['B'].apply(lambda x: ",".join(map(str,x)))
print(df.head(2))
df = pd.DataFrame([['1',[1,2,3,4]],['2',[5,6,7,8]]], columns=list('AB'))
Here is a generic way to convert lists to strings. In your example the list holds ints, but it could hold any type that can be represented as a string: join the elements with ','.join(map(str, a_list)). Then iterate through the rows and apply this to the column that contains the lists you want to join:
for i, row in df.iterrows():
    df.loc[i,'B'] = ','.join(map(str, row['B']))

Panda Numpy converting data to a column

I have a data result that when I print it looks like
>>>print(result)
[[0]
[1]
[0]
[0]
[1]
[0]]
I guess that's about the same as [ [0][1][0][0][1][0] ], which seems a bit weird; [0,1,0,0,1,0] seems a more logical representation, but somehow it's not like that.
I would like to add these values as a single column to a pandas DataFrame df.
I tried several ways to join it to my dataframe:
df = pd.concat(df,result)
df = pd.concat(df,{'result' =result})
df['result'] =pd.aply(result, axis=1)
with no luck. How can I do it?
There are multiple ways to flatten your data:
df = pd.DataFrame(data=np.random.rand(6,2))
result = np.array([0,1,0,0,1,0])[:, None]
print (result)
[[0]
[1]
[0]
[0]
[1]
[0]]
df['result'] = result[:,0]
df['result1'] = result.ravel()
#df['result1'] = np.concatenate(result)
print (df)
0 1 result result1
0 0.098767 0.933861 0 0
1 0.532177 0.610121 1 1
2 0.288742 0.718452 0 0
3 0.520980 0.367746 0 0
4 0.253658 0.011994 1 1
5 0.662878 0.846113 0 0
If result is a plain list of lists (this does not work on a NumPy array), the following is a simple way to flatten it into a dataframe column:
df["result"] = sum(result, [])
As long as the number of data points in this list is the same as the number of rows of the dataframe this should work:
import pandas as pd
your_data = [[0],[1],[0],[0],[1],[0]]
df = pd.DataFrame() # skip and use your own dataframe with len(df) == len(your_data)
df['result'] = [i[0] for i in your_data]
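The printed form in the question is what a two-dimensional array with six rows and one column looks like, which a quick shape check makes visible. A minimal sketch, assuming result is a NumPy array; the frame df with column x is a hypothetical stand-in for the asker's dataframe.

```python
import numpy as np
import pandas as pd

result = np.array([[0], [1], [0], [0], [1], [0]])
print(result.shape)   # (6, 1): six rows, one column

flat = result.ravel() # flatten to one dimension
print(flat.shape)     # (6,)

df = pd.DataFrame({'x': range(6)})  # hypothetical stand-in frame with 6 rows
df['result'] = flat
```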

How to assign values of series to column names of dataframe

I have Series with values:
0 1_AA
1 2_BB
2 3_CC
3 4_DD
and I want to convert this series to names of dataframe columns. It should look like this:
1_AA 2_BB 3_CC 4_DD
0
Is it possible?
You can pass the Series as the columns argument of DataFrame:
>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'c'])
>>> pd.DataFrame(columns=s)
Empty DataFrame
Columns: [a, b, c]
Index: []
or pass it in directly as list:
>>> pd.DataFrame(columns=['1_AA', '2_BB', '3_CC', '4_DD'])
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []
You could use dict.fromkeys:
>>> import pandas as pd
>>> s = pd.Series(['1_AA', '2_BB', '3_CC', '4_DD'])
>>> pd.DataFrame(dict.fromkeys(s, [0])) # each column containing one zero - [0]
1_AA 2_BB 3_CC 4_DD
0 0 0 0 0
Or collections.OrderedDict, which guarantees that the order of your values is always kept (plain dicts also preserve insertion order since Python 3.7):
>>> from collections import OrderedDict
>>> pd.DataFrame(OrderedDict.fromkeys(s, [0]))
1_AA 2_BB 3_CC 4_DD
0 0 0 0 0
You could also use empty lists as second argument for fromkeys:
>>> pd.DataFrame(dict.fromkeys(s, []))
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []
But that creates an empty dataframe - with the correct columns.

Appending to an empty DataFrame in Pandas?

Is it possible to append to an empty data frame that doesn't contain any indices or columns?
I have tried to do this, but keep getting an empty dataframe at the end.
e.g.
import pandas as pd
df = pd.DataFrame()
data = ['some kind of data here' --> I have checked the type already, and it is a dataframe]
df.append(data)
The result looks like this:
Empty DataFrame
Columns: []
Index: []
This should work (on pandas versions before 2.0, which still provide DataFrame.append):
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
Because append does not operate in place, you have to store the result if you want to keep it:
>>> df = pd.DataFrame()
>>> data = pd.DataFrame({"A": range(3)})
>>> df.append(data) # without storing
>>> df
Empty DataFrame
Columns: []
Index: []
>>> df = df.append(data)
>>> df
A
0 0
1 1
2 2
And if you want to add a row, you can use a dictionary:
df = pd.DataFrame()
df = df.append({'name': 'Zed', 'age': 9, 'height': 2}, ignore_index=True)
which gives you:
age height name
0 9 2 Zed
You can concat the data in this way:
InfoDF = pd.DataFrame()
tempDF = pd.DataFrame(rows,columns=['id','min_date'])
InfoDF = pd.concat([InfoDF,tempDF])
The answers are very useful, but pandas.DataFrame.append has been deprecated (as already mentioned by various users) and the answers using pandas.concat are not "Runnable Code Snippets", so I would like to add the following snippet:
import pandas as pd
df = pd.DataFrame(columns =['name','age'])
row_to_append = pd.DataFrame([{'name':"Alice", 'age':"25"},{'name':"Bob", 'age':"32"}])
df = pd.concat([df,row_to_append])
So df is now:
name age
0 Alice 25
1 Bob 32
pandas.DataFrame.append is deprecated since version 1.4.0 (and was removed in pandas 2.0): use concat() instead.
Therefore:
df = pd.DataFrame() # empty dataframe
df2 = pd.DataFrame(...) # some dataframe with data
df = pd.concat([df, df2])
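Adding a single row from a dictionary, as in the append-based answer earlier in this thread, can also be done with concat. This is a sketch rather than an official replacement recipe: the dict is wrapped in a list so that it becomes a one-row frame, and ignore_index=True renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame()  # empty: no index, no columns
row = {'name': 'Zed', 'age': 9, 'height': 2}

# Wrap the dict in a list so it becomes a one-row frame, then concatenate.
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
```

When many rows arrive in a loop, it is usually cheaper to collect them in a plain list and build or concatenate the frame once at the end, since every concat call copies the data.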
