Storing 3-dimensional data in pandas DataFrame - python

I am new to Python and I'm trying to understand how to manipulate data with pandas DataFrames. I searched for similar questions but I don't see any satisfying my exact need. Please point me to the correct post if this is a duplicate.
So I have multiple DataFrames with the exact same shape, columns and index. How do I combine them with labels so I can easily access the data with any column/index/label?
E.g. after the setup below, how do I put df1 and df2 into one DataFrame and label them with the names 'df1' and 'df2', so I can access data in a way like df['A']['df1']['b'], and get number of rows of df?
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'], index=['a', 'b'])
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'], index=['a', 'b'])
>>> df1
A B
a 1 2
b 3 4
>>> df2
A B
a 5 6
b 7 8

I think MultiIndex DataFrame is answer created by concat:
df = pd.concat([df1, df2], keys=('df1','df2'))
print (df)
A B
df1 a 1 2
b 3 4
df2 a 5 6
b 7 8
Then for basic select is possible use xs:
print (df.xs('df1'))
A B
a 1 2
b 3 4
And for select index and columns together use slicers:
idx = pd.IndexSlice
print (df.loc[idx['df1', 'b'], 'A'])
3
Another possible solution is use panels.
But in newer versions of pandas is deprecated.

Using xarray is recommended, as other answers to similar questions have suggested. Since pandas Panels were deprecated in favour of xarray.

Related

How can I use the pandas.insert() without knowing the size of the column I want to insert?

I want to insert another column in a data frame that is simply full of ones, so [1,1,1,1...] by using the insert function, however, I am not sure how to do it. I do not know the number of ones, is there any alternative ways to do it?
You can just pass a scalar to insert:
Example:
df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['A', 'B', 'D'])
df.insert(2, 'C', 1)
output:
A B C D
0 1 2 1 3
1 4 5 1 6
Can you elaborate and share an example?.Are you looking for something like this?
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
series= pd.Series([1 for i in range(1,10)])
df.insert(2,"newcol",series)
print(df)

How to render a column name as a single cell in midst of multilevel columns in pandas?

I'm working on multilevel indexes in columns. I've to send these tables. For sending tables, I'm using df.to_html(). The picture below is where i am now. foo is the index which i've converted to column.
While converting to column, I want it to occupy both cells so it can look nice.This is what i want to achieve.
The code looks like this.
df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],index=['M1','M2','M3'])
df.columns = pd.MultiIndex.from_product([['x', 'y'], ['a', 'b']])
ind = df.index
df.reset_index(drop=True,inplace=True)
df.insert(0,'foo',ind)
With the code you provide, foois not set as the index of the dataframe.
Anyway, you could add this after your current code in order to correct the header of your dataframe before converting it to html:
df = df.rename(axis=1, level=0, mapper={"foo": ""}).rename(
axis=1, level=1, mapper={"": "foo"}
)
df.to_html(index=False)
This way, the html version of your dataframe renders the desired way:
x y
foo a b a b
M1 1 2 3 4
M2 1 2 3 4
M3 1 2 3 4

Best way to add multiple list to existing dataframe [duplicate]

I'm trying to figure out how to add multiple columns to pandas simultaneously with Pandas. I would like to do this in one step rather than multiple repeated steps.
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
df[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs',3] # I thought this would work here...
I would have expected your syntax to work too. The problem arises because when you create new columns with the column-list syntax (df[[new1, new2]] = ...), pandas requires that the right hand side be a DataFrame (note that it doesn't actually matter if the columns of the DataFrame have the same names as the columns you are creating).
Your syntax works fine for assigning scalar values to existing columns, and pandas is also happy to assign scalar values to a new column using the single-column syntax (df[new1] = ...). So the solution is either to convert this into several single-column assignments, or create a suitable DataFrame for the right-hand side.
Here are several approaches that will work:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
Then one of the following:
1) Three assignments in one, using list unpacking:
df['column_new_1'], df['column_new_2'], df['column_new_3'] = [np.nan, 'dogs', 3]
2) DataFrame conveniently expands a single row to match the index, so you can do this:
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
3) Make a temporary data frame with new columns, then combine with the original data frame later:
df = pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
)
], axis=1
)
4) Similar to the previous, but using join instead of concat (may be less efficient):
df = df.join(pd.DataFrame(
[[np.nan, 'dogs', 3]],
index=df.index,
columns=['column_new_1', 'column_new_2', 'column_new_3']
))
5) Using a dict is a more "natural" way to create the new data frame than the previous two, but the new columns will be sorted alphabetically (at least before Python 3.6 or 3.7):
df = df.join(pd.DataFrame(
{
'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3
}, index=df.index
))
6) Use .assign() with multiple column arguments.
I like this variant on #zero's answer a lot, but like the previous one, the new columns will always be sorted alphabetically, at least with early versions of Python:
df = df.assign(column_new_1=np.nan, column_new_2='dogs', column_new_3=3)
7) This is interesting (based on https://stackoverflow.com/a/44951376/3830997), but I don't know when it would be worth the trouble:
new_cols = ['column_new_1', 'column_new_2', 'column_new_3']
new_vals = [np.nan, 'dogs', 3]
df = df.reindex(columns=df.columns.tolist() + new_cols) # add empty cols
df[new_cols] = new_vals # multi-column assignment works for existing cols
8) In the end it's hard to beat three separate assignments:
df['column_new_1'] = np.nan
df['column_new_2'] = 'dogs'
df['column_new_3'] = 3
Note: many of these options have already been covered in other answers: Add multiple columns to DataFrame and set them equal to an existing column, Is it possible to add several columns at once to a pandas DataFrame?, Add multiple empty columns to pandas DataFrame
You could use assign with a dict of column names and values.
In [1069]: df.assign(**{'col_new_1': np.nan, 'col2_new_2': 'dogs', 'col3_new_3': 3})
Out[1069]:
col_1 col_2 col2_new_2 col3_new_3 col_new_1
0 0 4 dogs 3 NaN
1 1 5 dogs 3 NaN
2 2 6 dogs 3 NaN
3 3 7 dogs 3 NaN
My goal when writing Pandas is to write efficient readable code that I can chain. I won't go into why I like chaining so much here, I expound on that in my book, Effective Pandas.
I often want to add new columns in a succinct manner that also allows me to chain. My general rule is that I update or create columns using the .assign method.
To answer your question, I would use the following code:
(df
.assign(column_new_1=np.nan,
column_new_2='dogs',
column_new_3=3
)
)
To go a little further. I often have a dataframe that has new columns that I want to add to my dataframe. Let's assume it looks like say... a dataframe with the three columns you want:
df2 = pd.DataFrame({'column_new_1': np.nan,
'column_new_2': 'dogs',
'column_new_3': 3},
index=df.index
)
In this case I would write the following code:
(df
.assign(**df2)
)
With the use of concat:
In [128]: df
Out[128]:
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
In [129]: pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
Out[129]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN NaN NaN
1 1.0 5.0 NaN NaN NaN
2 2.0 6.0 NaN NaN NaN
3 3.0 7.0 NaN NaN NaN
Not very sure of what you wanted to do with [np.nan, 'dogs',3]. Maybe now set them as default values?
In [142]: df1 = pd.concat([df, pd.DataFrame(columns = [ 'column_new_1', 'column_new_2','column_new_3'])])
In [143]: df1[[ 'column_new_1', 'column_new_2','column_new_3']] = [np.nan, 'dogs', 3]
In [144]: df1
Out[144]:
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0.0 4.0 NaN dogs 3
1 1.0 5.0 NaN dogs 3
2 2.0 6.0 NaN dogs 3
3 3.0 7.0 NaN dogs 3
Dictionary mapping with .assign():
This is the most readable and dynamic way to assign new column(s) with value(s) when working with many of them.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [np.nan, "dogs", 3]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
If you're just trying to initialize the new column values to be empty as you either don't know what the values are going to be or you have many new columns.
import pandas as pd
import numpy as np
new_cols = ["column_new_1", "column_new_2", "column_new_3"]
new_vals = [None for item in new_cols]
# Map new columns as keys and new values as values
col_val_mapping = dict(zip(new_cols, new_vals))
# Unpack new column/new value pairs and assign them to the data frame
df = df.assign(**col_val_mapping)
use of list comprehension, pd.DataFrame and pd.concat
pd.concat(
[
df,
pd.DataFrame(
[[np.nan, 'dogs', 3] for _ in range(df.shape[0])],
df.index, ['column_new_1', 'column_new_2','column_new_3']
)
], axis=1)
if adding a lot of missing columns (a, b, c ,....) with the same value, here 0, i did this:
new_cols = ["a", "b", "c" ]
df[new_cols] = pd.DataFrame([[0] * len(new_cols)], index=df.index)
It's based on the second variant of the accepted answer.
Just want to point out that option2 in #Matthias Fripp's answer
(2) I wouldn't necessarily expect DataFrame to work this way, but it does
df[['column_new_1', 'column_new_2', 'column_new_3']] = pd.DataFrame([[np.nan, 'dogs', 3]], index=df.index)
is already documented in pandas' own documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html#basics
You can pass a list of columns to [] to select columns in that order.
If a column is not contained in the DataFrame, an exception will be raised.
Multiple columns can also be set in this manner.
You may find this useful for applying a transform (in-place) to a subset of the columns.
You can use tuple unpacking:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df['col3'], df['col4'] = 'a', 10
Result:
col1 col2 col3 col4
0 1 3 a 10
1 2 4 a 10
If you just want to add empty new columns, reindex will do the job
df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN NaN NaN
1 1 5 NaN NaN NaN
2 2 6 NaN NaN NaN
3 3 7 NaN NaN NaN
full code example
import numpy as np
import pandas as pd
df = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(df)
print('df',df, sep='\n')
print()
df=df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)
print('''df.reindex(list(df)+['column_new_1', 'column_new_2','column_new_3'], axis=1)''',df, sep='\n')
otherwise go for zeros answer with assign
I am not comfortable using "Index" and so on...could come up as below
df.columns
Index(['A123', 'B123'], dtype='object')
df=pd.concat([df,pd.DataFrame(columns=list('CDE'))])
df.rename(columns={
'C':'C123',
'D':'D123',
'E':'E123'
},inplace=True)
df.columns
Index(['A123', 'B123', 'C123', 'D123', 'E123'], dtype='object')
You could instantiate the values from a dictionary if you wanted different values for each column & you don't mind making a dictionary on the line before.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
>>> df
col_1 col_2
0 0 4
1 1 5
2 2 6
3 3 7
>>> cols = {
'column_new_1':np.nan,
'column_new_2':'dogs',
'column_new_3': 3
}
>>> df[list(cols)] = pd.DataFrame(data={k:[v]*len(df) for k,v in cols.items()})
>>> df
col_1 col_2 column_new_1 column_new_2 column_new_3
0 0 4 NaN dogs 3
1 1 5 NaN dogs 3
2 2 6 NaN dogs 3
3 3 7 NaN dogs 3
Not necessarily better than the accepted answer, but it's another approach not yet listed.
import pandas as pd
df = pd.DataFrame({
'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]
})
df['col_3'], df['col_4'] = [df.col_1]*2
>> df
col_1 col_2 col_3 col_4
0 4 0 0
1 5 1 1
2 6 2 2
3 7 3 3

Combine columns in different rows in dataframe python

I am trying to join columns in different rows in a dataframe.
import pandas as pd
tdf = {'ph1': [1, 2], 'ph2': [3, 4], 'ph3': [5,6], 'ph4': [nan,nan]}
df = pd.DataFrame(data=tdf)
df
Output:
ph1 ph2 ph3 ph4
0 1 3 5 nan
1 2 4 6 nan
I combined ph1, ph2, ph3, ph4 with below code:
for idx, row in df.iterrows():
df = df[[ph1, ph2, ph3, ph4]]
df["ConcatedPhoneNumbers"] = df.loc[0:].apply(lambda x: ', '.join(x), axis=1)
I got
df["ConcatPhoneNumbers"]
ConcatPhoneNumbers
1,3,5,,
2,4,6,,
Now I need to combine these columns using pandas with appropriate function.
My result should be 1,3,5,2,4,6
Also need to remove these extra commas.
I am new Python learner.I did some research and reached till here. Please help me to get the exact approach.
It seems you need stack for remove NaNs, then convert to int, str and list and last join:
tdf = {'ph1': [1, 2], 'ph2': [3, 4], 'ph3': [5,6], 'ph4': [np.nan,np.nan]}
df = pd.DataFrame(data=tdf)
cols = ['ph1', 'ph2', 'ph3', 'ph4']
s = ','.join(df[cols].stack().astype(int).astype(str).values.tolist())
print (s)
1,3,5,2,4,6

Python pandas merge with OR logic

I'm searching and haven't found an answer to this question, can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL merge using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes have the synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 in list1. This would give me the rows from df2 I need but, that still wouldn't put me in position to merge df1 and df2 matching the appropriate rows. I think if there an OR merge option, I can step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
#will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'), df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')]

Categories

Resources