I tried to merge two dataframes:
df1 has columns a, b, c with a row of values 1, 2, 3
df2 has columns a, b, c with a row of values 4, 5, 6
When I use pd.merge(df1, df2), some of the rows disappear from the merged df. Why?
If you just want to stack the rows of both dataframes, use concat instead:
pd.concat([df1, df2])
Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])
pd.concat([df1,df2],ignore_index=True)
You will get a table with all rows from both dataframes. ignore_index=True renumbers the result 0..n-1 instead of repeating each frame's original index.
You can also use:
df1.merge(df2, how='outer')
See the docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
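As for the "why": with no arguments, merge performs an inner join on every column the two frames share, so rows without a match in the other frame are dropped. A minimal sketch with made-up one-row frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
df2 = pd.DataFrame({'a': [4], 'b': [5], 'c': [6]})

# With no 'on' or 'how' arguments, merge joins on all shared columns
# ('a', 'b', 'c') with how='inner', keeping only rows whose values
# match in both frames -- here there are none, so the result is empty.
inner = pd.merge(df1, df2)

# An outer join keeps every row from both frames instead.
outer = pd.merge(df1, df2, how='outer')
```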
Related
Given a dataframe:
df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
                  columns=['A', 'B', 'C', 'D'])
cols_in = list(df)[0:2] + list(df)[4:]
Now:
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in])
Inside the loop, iloc raises an error because cols_in holds column labels, while iloc expects integer positions.
How can this kind of mixed label/position slicing of df be applied inside the append call?
It seems like you want to exclude one column. Note that list(df)[4:] is empty, since there is no fifth column. Depending on which columns you are after, something like this might be what you want:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
                  columns=['A', 'B', 'C', 'D'])
If you want to get the column indices from column names, you can do:
cols = ['A', 'B', 'D']
cols_in = np.nonzero(df.columns.isin(cols))[0]
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in].to_list())
x
Output:
[[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
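If the goal is just the selected columns as nested lists, the per-row loop isn't needed at all; assuming the same example df, label-based selection does it in one step:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10, 11, 12, 13], [14, 15, 16, 17]],
                  columns=['A', 'B', 'C', 'D'])

# Select the wanted columns by label, then convert row-wise to
# nested lists -- no per-row iloc indexing required.
x = df[['A', 'B', 'D']].values.tolist()
```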
import pandas as pd
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})
I would like to merge these two dataframes on 'value'. However, I don't want the result to include every column of df2: I want to keep the 'valuea' column but not the 'valueb' column, as per the squared output in the image.
The code I have tried is
df1.merge(df2, on='value')
Is there a way exclude column with header = valueb using parameters in the merge function?
You cannot exclude columns with a parameter in the merge function.
Try these approaches instead:
pd.merge(df1, df2).drop(columns=['valueb'])
pd.merge(df1, df2.drop(columns=['valueb']))
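A quick check, using the frames defined above, that both approaches give the same result; with no on argument they merge on 'value', the only shared column:

```python
import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})

# Drop the unwanted column after merging...
out1 = pd.merge(df1, df2).drop(columns=['valueb'])
# ...or drop it before merging, so it is never carried along at all.
out2 = pd.merge(df1, df2.drop(columns=['valueb']))
```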
I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import numpy as np
import pandas as pd

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
And I would like to fill column B, but only where column A equals 4. In other words, rows that share a value in column A should end up with the same value in column B.
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but it gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
A B
0 4 a
1 4 a
2 5 NaN
3 6 d
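If a group's known value can also appear after the NaNs rather than before them, the same idea extends with a back-fill inside each group; a sketch on the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 4, 5, 6],
                   'B': ['a', np.nan, np.nan, 'd']})

# Forward-fill then back-fill within each group of A, so a known value
# anywhere in a group propagates in both directions. Groups with no
# known value (here A == 5) stay NaN.
df['B'] = df.groupby('A')['B'].transform(lambda s: s.ffill().bfill())
```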
Consider the follow dataframe with a multi level index:
arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)
All I'm trying to do is add a 'Totals' row to the bottom (12, 15, 18 would be the expected values here). It seems like I need to calculate the totals and then append them to the dataframe, but I just can't get it to work while preserving the multi-level index (which I want to do). Thanks in advance!
This does not preserve your multi-level index, but it does append a new row called "total" that contains column sums:
import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)
# DataFrame.append was removed in pandas 2.0, so build the totals row
# as a one-row frame and concatenate it
pd.concat([df, df.sum().to_frame('total').T])
I figured it out. Thanks for the responses. Those plus a little more education about indices in Python got me to something that worked.
# Create df of totals
df2 = pd.DataFrame(df.sum())
# Transpose df
df2 = df2.T
# Reset index
df2 = df2.reset_index()
# Add additional column so the columns of df2 match the columns of df
df2['Index'] = "zTotal"
# Set indices to match df indices
df2 = df2.set_index(['index', 'Index'])
# Concat df and df2
df3 = pd.concat([df, df2])
# Sort in desired order
df3 = df3.sort_index(ascending=[False,True])
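For reference, the same result can be had a bit more compactly by giving the totals row a two-level index up front; the labels 'total' and the empty second level are arbitrary choices here:

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz']),
          np.array(['one', 'two', 'one'])]
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'],
                  index=arrays)

# Turn the column sums into a one-row frame and give it a two-level
# index so it lines up with df's MultiIndex before concatenating.
totals = df.sum().to_frame().T
totals.index = pd.MultiIndex.from_tuples([('total', '')])
df3 = pd.concat([df, totals])
```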
All, I have multiple dataframes like
df1 = pd.DataFrame(np.array([
        ['a', 1, 2],
        ['b', 3, 4],
        ['c', 5, 6]]),
    columns=['name', 'attr1', 'attr2'])
df2 = pd.DataFrame(np.array([
        ['a', 2, 3],
        ['b', 4, 5],
        ['c', 6, 7]]),
    columns=['name', 'attr1', 'attr2'])
df3 = pd.DataFrame(np.array([
        ['a', 3, 4],
        ['b', 5, 6],
        ['c', 7, 8]]),
    columns=['name', 'attr1', 'attr2'])
Each of these dataframes is generated at a specific time step, say T = [t1, t2, t3].
I would like to plot attr1 or attr2 of the different dataframes as a function of time T, and I would like to do this for 'a', 'b' and 'c' on the same graph.
Plot attr1 vs. time for 'a', 'b' and 'c'.
If I understand correctly, first assign a column T to each of your dataframes, then concatenate the three. Then, you can groupby the name column, iterate through each, and plot T against attr1 or attr2:
import matplotlib.pyplot as plt

dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])
for name, data in dfs.groupby('name'):
    # attr2 is stored as strings (the frames were built from a mixed
    # np.array), so cast to int before plotting
    plt.plot(data['T'], data['attr2'].astype(int), label=name)
plt.xlabel('Time')
plt.ylabel('attr2')
plt.legend()
plt.show()
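An alternative sketch: pivot the combined frame so each name becomes its own column, which lets pandas draw all three lines in one call. Here the frames are rebuilt with integer columns so the values plot numerically:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [1, 3, 5], 'attr2': [2, 4, 6]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [2, 4, 6], 'attr2': [3, 5, 7]})
df3 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [3, 5, 7], 'attr2': [4, 6, 8]})

# Tag each frame with its time step, stack them, then pivot so T is
# the index and each name is a column of attr2 values.
dfs = pd.concat([df1.assign(T=1), df2.assign(T=2), df3.assign(T=3)])
wide = dfs.pivot(index='T', columns='name', values='attr2')
# wide.plot(marker='o')  # draws one line per name in a single call
```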