Create a DataFrame from 3 other DataFrames in Python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
import numpy as np
import pandas as pd

dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece things together. I am trying to take all the data in dfdate and put it into a single column in the new df, and the same with dfqty and dfprices (so each 5x3 matrix is essentially flattened into a 15-element vector and placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops, but to no avail, and I don't know how to convert a df to a Series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks

Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
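For anyone who wants to run the whole thing end to end, here is a minimal self-contained sketch of the same unstack-and-concat idea on the question's data. It assumes the three frames share the same columns and index. The names `frames` and `long` are mine, and the upper-casing and integer conversion are optional extras to match the desired output in the question more closely.

```python
import pandas as pd

idx = range(5)
dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6], 'x2': [2, 2, 2, 6, 7], 'y1': [3, 1, 4, 5, 9]}, index=idx)
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8], 'x2': [3, 1, 1, 7, 5], 'y1': [2, 4, 3, 2, 8]}, index=idx)
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4], 'x2': [2, 0, 0, 3, 4], 'y1': [1, 3, 2, 1, 3]}, index=idx)

frames = [dfdate, dfqty, dfprices]

# unstack() turns each frame into a Series keyed by (column name, row label),
# so the three Series line up on the same MultiIndex when concatenated.
long = pd.concat([f.unstack() for f in frames], axis=1, keys=['date', 'qty', 'price'])

# Name the two index levels, then move them into ordinary columns.
long.index.names = ['col', 'row']
long = long.reset_index()

# Split the old column header into its letter and its number.
long['Letter'] = long['col'].str[0].str.upper()
long['Number'] = long['col'].str[1:].astype(int)

long = long[['Letter', 'Number', 'date', 'qty', 'price']]
print(long)
```

If the frames do not share the same index, the concat step will introduce NaN values, so it is worth checking that first.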

Related

How to convert two rows of data into a single row

I want to convert the data below into one row per ID using pandas.
Original data
ID Name m1 m2 m3
1 X 2 6 6
1 Y 1 2 3
2 A 2 4 7
2 y 5 6 7
I want to convert it into the format below using the pandas library
ID Name1 m1 m2 m3 Name2 m1 m2 m3
1 X 2 6 6 Y 1 2 3
2 A 2 4 7 y 5 6 7
Let's assume this is your data:
data = {'ID':[1, 1, 2, 2],
'Name':['X', 'Y', 'A', 'y'],
'm1':[2, 1, 2, 5], 'm2':[6,2,4,6],
'm3':[6, 3, 7, 7] }
df = pd.DataFrame(data)
Step 1: Sort the data by ID:
df = df.sort_values(by=['ID'])
Step 2: drop duplicates and keep the first records
df1 = df.drop_duplicates(subset=['ID'], keep='first')
Step 3: again drop duplicates but keep the last records
df2 = df.drop_duplicates(subset=['ID'], keep='last')
Step 4: finally, merge the two dataframe on the same ID
df = df1.merge(df2, on='ID')
The output looks like this:
ID Name_x m1_x m2_x m3_x Name_y m1_y m2_y m3_y
0 1 X 2 6 6 Y 1 2 3
1 2 A 2 4 7 y 5 6 7
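The `_x` and `_y` suffixes are merge's defaults. As a small sketch, assuming every ID has exactly two rows, the same steps can be run with merge's `suffixes` parameter to get column names closer to the requested Name1/Name2 layout. The suffix strings `_1` and `_2` and the names `first`, `last` and `wide` are my own choices, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['X', 'Y', 'A', 'y'],
                   'm1': [2, 1, 2, 5],
                   'm2': [6, 2, 4, 6],
                   'm3': [6, 3, 7, 7]})

# First and last row of each ID (assumes exactly two rows per ID).
first = df.drop_duplicates(subset=['ID'], keep='first')
last = df.drop_duplicates(subset=['ID'], keep='last')

# suffixes controls how overlapping column names are disambiguated.
wide = first.merge(last, on='ID', suffixes=('_1', '_2'))
print(wide)
```

With more than two rows per ID this would silently drop the middle rows, so a different reshape would be needed in that case.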

Calculate mean of selected columns with multilevel header

I have a dataframe with multilevel headers for the columns like this:
name 1 2 3 4
x y x y x y x y
A 1 4 3 7 2 1 5 2
B 2 2 6 1 4 5 1 7
How can I calculate the mean for 1x, 2x and 3x, but not 4x?
I tried:
df['mean']= df[('1','x'),('2','x'),('3','x')].mean()
This did not work; it raises a KeyError. I would like to get:
name 1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2
B 2 2 6 1 4 5 1 7 4
Is there a way to calculate the mean while keeping the first column header as an integer?
This is only one solution:
import pandas as pd
iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [
[1, 4, 3, 7, 2, 1, 5, 2],
[2, 2, 6, 1, 4, 5, 1, 7]
]
index = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(array, index=["A", "B"], columns=index)
df["mean"] = df.xs("x", level=1, axis=1).loc[:,1:3].mean(axis=1)
print(df)
1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2.0
B 2 2 6 1 4 5 1 7 4.0
Steps:
Select all the "x"-columns with df.xs("x", level=1, axis=1)
Select only columns 1 to 3 with .loc[:,1:3]
Calculate the mean value with .mean(axis=1)
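If the columns you want are not a contiguous range, an alternative is to select them by their full (level 0, level 1) labels. The attempt in the question probably failed because `df[('1','x'),('2','x'),('3','x')]` passes one tuple of tuples as a single key instead of a list of keys, and because it uses the strings '1', '2', '3' where the header holds integers. A minimal sketch, assuming integer first-level headers as built above (the names `wanted` and `row_mean` are mine):

```python
import pandas as pd

columns = pd.MultiIndex.from_product([[1, 2, 3, 4], ["x", "y"]])
df = pd.DataFrame([[1, 4, 3, 7, 2, 1, 5, 2],
                   [2, 2, 6, 1, 4, 5, 1, 7]],
                  index=["A", "B"], columns=columns)

# A list of tuples selects several MultiIndex columns at once.
wanted = [(1, "x"), (2, "x"), (3, "x")]

# axis=1 averages across the selected columns, giving one value per row.
row_mean = df[wanted].mean(axis=1)
print(row_mean)
```

The result can then be assigned back with `df["mean"] = row_mean`, as in the answer above.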

Select rows of pandas dataframe in order of a given list with repetitions and keep the original index

After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first, so that we can set_index after the merging:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
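A different route that avoids the merge, sketched here under the assumption that the values in column A are unique: use A as a temporary index and let `.loc` do the repeated, ordered lookup. The names `known`, `wanted` and `lookup` are mine.

```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 4, 6, 4, 3, 8]

# Drop requested values that do not occur in the frame (here: 8).
known = set(df['A'])
wanted = [v for v in list_of_values if v in known]

# Keep the original index as a column, and look rows up by A.
lookup = df.reset_index().set_index('A')

# .loc with a list returns rows in list order, repeats included.
result = (lookup.loc[wanted]
          .reset_index()
          .set_index('index')
          .rename_axis(None)[['A', 'B']])
print(result)
```

If A contains duplicates, `.loc` returns every matching row for each requested value, which may or may not be what you want.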
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is:
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3

Mapping Tuple Dictionary to Multiple columns of a DataFrame

I've got a PDB DataFrame with residue insertion codes. Here is a simplified example.
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data = d)
Dataframe:
ATOM residue_number insertion
0 1 2
1 2 2
2 3 2
3 4 3
4 5 3
5 6 3
6 7 3 A
7 8 3 A
8 9 3 A
9 10 5
10 11 5
11 12 5
I need to renumber the residues according to a different numbering and insertion scheme. Output from the renumbering script can be formatted into a dictionary of tuples, e.g.
my_dict = {(2,): 1, (3,): 2, (3, 'A') : 3, (5, ) : (4, 'A') }
Is it possible to map this dictionary of tuples onto the two columns residue_number and insertion? The desired output would be:
ATOM residue_number insertion
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
8 9 3
9 10 4 A
10 11 4 A
11 12 4 A
I've been searching and banging my head on this for a few days. I've tried mapping and MultiIndex but can't seem to find a way to map a dictionary of tuples across multiple columns. I feel like I'm thinking about it wrong somehow. Thanks for any advice!
In this case, I think you need to define a function that takes your old residue_number and insertion as input and returns the new ones. To do that, I will work directly from the df, so, to avoid extra coding, I will redefine the keys of your my_dict from (2,) to (2, '') and make every value a tuple as well.
Here is the code:
import pandas as pd
d = {'ATOM' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'residue_number' : [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
'insertion' : ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data = d)
# Our new dict with keys and values as tuples
my_new_dict = {(2,''): (1,''), (3,''): (2,''), (3,'A'): (3,''), (5,''): (4,'A') }
# We need a function that maps a tuple (residue_number, insertion) into your new_residue_number and new_insertion values
def new_residue_number(residue_number, insertion, my_new_dict):
    # keys are tuples
    key = (residue_number, insertion)
    # Return new residue_number and insertion values
    return my_new_dict[key]
# Example to see how this works
print(new_residue_number(2, '', my_new_dict)) # Output (1,'')
print(new_residue_number(5, '', my_new_dict)) # Output (4, 'A')
print(new_residue_number(3, 'A', my_new_dict)) # Output (3,'')
# Now we apply this to our df and save it in the same df in two new columns
df[['new_residue_number','new_insertion']] = df.apply(lambda row: pd.Series(new_residue_number(row['residue_number'], row['insertion'], my_new_dict)), axis=1)
I hope this can solve your problem!
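A row-wise apply that builds a pd.Series per row can get slow on large structures. As a rough sketch of the same lookup without apply, assuming the dictionary has already been normalized so that every key and value is a 2-tuple, the pairs can be collected with zip and turned into a frame in one go. The names `pairs`, `mapped` and `out` are mine.

```python
import pandas as pd

d = {'ATOM': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
     'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
     'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data=d)

# Normalized mapping: (old number, old insertion) -> (new number, new insertion)
my_new_dict = {(2, ''): (1, ''), (3, ''): (2, ''), (3, 'A'): (3, ''), (5, ''): (4, 'A')}

# One dictionary lookup per row; a missing key raises KeyError.
pairs = [my_new_dict[key] for key in zip(df['residue_number'], df['insertion'])]

# Build the two new columns at once, aligned on the original index.
mapped = pd.DataFrame(pairs, columns=['residue_number', 'insertion'], index=df.index)
out = df[['ATOM']].join(mapped)
print(out)
```

Whether a KeyError on an unmapped residue is desirable depends on your data; `my_new_dict.get(key, key)` would leave unmapped rows unchanged instead.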
I think we can create a DataFrame from your dictionary after modifying it so that all values are tuples. Then we can use DataFrame.join. I think this is easier (and recommended) if we convert the blank values of the insertion column to NaN.
import numpy as np
new_df = ( df.assign(insertion = df['insertion'].replace(r'^\s*$',
np.nan,
regex=True)
.mask(df['insertion'].isnull()))
.join(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
for x,y in my_dict.items()},
index = ['new_residue_number','new_insertion']).T,
on = ['residue_number','insertion'])
.fillna('')
.drop(['residue_number','insertion'],axis=1)
.rename(columns = {'new_residue_number':'residue_number',
'new_insertion':'insertion'}))
print(new_df)
ATOM residue_number insertion
0 1 1.0
1 2 1.0
2 3 1.0
3 4 2.0
4 5 2.0
5 6 2.0
6 7 3.0
7 8 3.0
8 9 3.0
9 10 4.0 A
10 11 4.0 A
11 12 4.0 A
Detail
print(pd.DataFrame({x:(y if isinstance(y,tuple) else (y,np.nan))
for x,y in my_dict.items()},
index = ['new_residue_number','new_insertion']).T)
new_residue_number new_insertion
2 NaN 1 NaN
3 NaN 2 NaN
A 3 NaN
5 NaN 4 A
The logic here is a simple merge, but we need to do a lot of work to turn that dictionary into a suitable DataFrame for mapping. I'd consider whether you can store the renumbering output in the form of my final DataFrame s from the start.
#Turn the dict into a mapping
s = pd.DataFrame(my_dict.values())[0].explode().to_frame()
s['idx'] = s.groupby(level=0).cumcount()
s = (s.pivot(columns='idx', values=0)
.rename_axis(None, axis=1)
.rename(columns={0: 'new_res', 1: 'new_ins'}))
s.index = pd.MultiIndex.from_tuples([*my_dict.keys()], names=['residue_number', 'insertion'])
s = s.reset_index().fillna('') # Because you have '' not NaN
# residue_number insertion new_res new_ins
#0 2 1
#1 3 2
#2 3 A 3
#3 5 4 A
The mapping is now a merge. I'll leave all columns in for clarity of the logic, but you can use the commented out code to drop the original columns and rename the new columns.
df = df.merge(s, how='left')
# Your real output with
#df = (df.merge(s, how='left')
# .drop(columns=['residue_number', 'insertion'])
# .rename(columns={'new_res': 'residue_number',
# 'new_ins': 'insertion'}))
ATOM residue_number insertion new_res new_ins
0 1 2 1
1 2 2 1
2 3 2 1
3 4 3 2
4 5 3 2
5 6 3 2
6 7 3 A 3
7 8 3 A 3
8 9 3 A 3
9 10 5 4 A
10 11 5 4 A
11 12 5 4 A
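If the dictionary format cannot be changed at the source, the mapping frame can also be built with a small normalizing helper instead of explode and pivot. This is only a sketch: `as_pair` is a hypothetical helper of mine, and it assumes every key and value is either a scalar, a 1-tuple or a 2-tuple, with '' standing for "no insertion code".

```python
import pandas as pd

d = {'ATOM': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
     'residue_number': [2, 2, 2, 3, 3, 3, 3, 3, 3, 5, 5, 5],
     'insertion': ['', '', '', '', '', '', 'A', 'A', 'A', '', '', '']}
df = pd.DataFrame(data=d)
my_dict = {(2,): 1, (3,): 2, (3, 'A'): 3, (5,): (4, 'A')}

def as_pair(value):
    """Normalize 1 -> (1, ''), (2,) -> (2, ''), (4, 'A') -> (4, 'A')."""
    items = value if isinstance(value, tuple) else (value,)
    return items + ('',) * (2 - len(items))

# One row per mapping entry: old number, old insertion, new number, new insertion.
rows = [as_pair(k) + as_pair(v) for k, v in my_dict.items()]
s = pd.DataFrame(rows, columns=['residue_number', 'insertion', 'new_res', 'new_ins'])

# The mapping is then an ordinary left merge on the two key columns.
merged = df.merge(s, how='left', on=['residue_number', 'insertion'])
print(merged)
```

Rows whose (residue_number, insertion) pair is missing from the dictionary come back with NaN in the new columns, which makes gaps in the renumbering easy to spot.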

using isin across multiple columns

I'm trying to use .isin with the ~ so I can get a list of unique rows back based on multiple columns in 2 data-sets.
So, I have 2 data-sets with 9 rows:
df1 is the bottom and df2 is the top (sorry but I couldn't get it to show both below, it showed 1 then a row of numbers)
Index Serial Count Churn
1 9 5 0
2 8 6 0
3 10 2 1
4 7 4 2
5 7 9 2
6 10 2 2
7 2 9 1
8 9 8 3
9 4 3 5
Index Serial Count Churn
1 10 2 1
2 10 2 1
3 9 3 0
4 8 6 0
5 9 8 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
I would like to get a list of rows from df1 which aren't in df2 based on more than 1 column.
For example if I base my search on the columns Serial and Count I wouldn't get Index 1 and 2 back from df1 as it appears in df2 at Index position 6, the same with Index position 4 in df1 as it appears at Index position 2 in df2. The same would apply to Index position 5 in df1 as it is at Index position 8 in df2.
The churn column doesn't really matter.
I can get it to work based on 1 column, but not on more than 1 column.
df2[~df2.Serial.isin(df1.Serial.values)] kinda does what I want, but only on 1 column. I want it to be based on 2 or more.
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
One solution is to merge with indicators:
df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1], [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]], columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1], [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]], columns=['Serial', 'Count', 'Churn'])
# merge with indicator on
df_temp = df1.merge(df2[['Serial', 'Count']].drop_duplicates(), on=['Serial', 'Count'], how='left', indicator=True)
res = df_temp.loc[df_temp['_merge'] == 'left_only'].drop('_merge', axis=1)
Output
Serial Count Churn
1 9 4 1
5 1 9 1
6 10 3 1
7 6 7 1
8 4 8 1
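Since the question asks specifically for an isin-style filter, here is a sketch of one on the same sample frames. It builds a MultiIndex from the key columns of each frame and uses MultiIndex.isin, so the original index of df1 is kept as is. The names `cols`, `keys1` and `keys2` are mine.

```python
import pandas as pd

df1 = pd.DataFrame([[10, 2, 0], [9, 4, 1], [9, 8, 1], [8, 6, 1], [9, 8, 1],
                    [1, 9, 1], [10, 3, 1], [6, 7, 1], [4, 8, 1]],
                   columns=['Serial', 'Count', 'Churn'])
df2 = pd.DataFrame([[9, 5, 1], [8, 6, 1], [10, 2, 1], [7, 4, 1], [7, 9, 1],
                    [10, 2, 1], [2, 9, 1], [9, 8, 1], [4, 3, 1]],
                   columns=['Serial', 'Count', 'Churn'])

cols = ['Serial', 'Count']

# Each row's key columns become one tuple-like entry of a MultiIndex.
keys1 = pd.MultiIndex.from_frame(df1[cols])
keys2 = pd.MultiIndex.from_frame(df2[cols])

# Boolean mask: True where the (Serial, Count) pair of df1 also occurs in df2.
res = df1[~keys1.isin(keys2)]
print(res)
```

Adding more identifier columns only requires extending `cols`.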
I had a similar issue to solve, and found that the easiest way to deal with it was to create a temporary column consisting of the concatenated identifier columns and to use isin on the values of this temporary column.
A simple function achieving this could be the following:
from functools import reduce
get_temp_col = lambda df, cols: reduce(lambda x, y: x + df[y].astype('str'), cols, "")
def subset_on_x_columns(df1, df2, cols):
    """
    Subsets the input dataframe `df1` based on the missing unique values of input columns
    `cols` of dataframe `df2`.
    :param df1: Pandas dataframe to be subsetted
    :param df2: Pandas dataframe which missing values are going to be
    used to subset `df1` by
    :param cols: List of column names
    """
    df1_temp_col = get_temp_col(df1, cols)
    df2_temp_col = get_temp_col(df2, cols)
    return df1[~df1_temp_col.isin(df2_temp_col.unique())]
Thus, for your case, all that is needed is to execute:
result_df = subset_on_x_columns(df1, df2, ['Serial', 'Count'])
which has the wanted rows:
Index Serial Count Churn
3 9 3 0
6 1 9 1
7 10 3 1
8 6 7 1
9 4 8 0
The nice thing about this solution is that it is naturally scalable in the number of columns to use, i.e. all that is needed is to specify in the input parameter list cols which columns to use as identifiers.
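One caveat worth checking before relying on concatenated string keys: when the parts are glued together with nothing in between, different value pairs can produce the same key. The sketch below uses made-up data (not from the question) and a hypothetical helper `joined_key` to show the collision, and how a separator that cannot occur in the values avoids it.

```python
import pandas as pd

# Made-up rows: (1, 23) and (12, 3) are different pairs.
left = pd.DataFrame({'Serial': [1], 'Count': [23]})
right = pd.DataFrame({'Serial': [12], 'Count': [3]})
cols = ['Serial', 'Count']

def joined_key(df, cols, sep=''):
    """Concatenate the given columns as strings, with sep between parts."""
    parts = [df[c].astype(str) for c in cols]
    key = parts[0]
    for part in parts[1:]:
        key = key + sep + part
    return key

# Without a separator both rows become "123", so isin reports a false match.
print(joined_key(left, cols).isin(joined_key(right, cols)).tolist())

# With a separator the keys are "1|23" and "12|3", which no longer collide.
print(joined_key(left, cols, sep='|').isin(joined_key(right, cols, sep='|')).tolist())
```

For the sample data in the question this does not change the result, but it matters once the identifier columns can have varying numbers of digits.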
