Pandas: Merge dataframes with repeated indexes

I would like to merge two datasets that share a common index. In my real data, this index is a serial number and it is repeated. The serial number corresponds to a vehicle and it is repeated for every trip taken with that vehicle. So there are different feature values depending on the trip circumstances.
Here's an example:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=["a", "a", "b", "b"],
)
df1
>>
    A   B   C   D
a  A0  B0  C0  D0
a  A1  B1  C1  D1
b  A2  B2  C2  D2
b  A3  B3  C3  D3
df2 = pd.DataFrame(
    {
        "A2": ["A4", "A5", "A6", "A7"],
        "B2": ["B4", "B5", "B6", "B7"],
        "C2": ["C4", "C5", "C6", "C7"],
        "D2": ["D4", "D5", "D6", "D7"],
    },
    index=["a", "b", "b", "b"],
)
df2
>>
   A2  B2  C2  D2
a  A4  B4  C4  D4
b  A5  B5  C5  D5
b  A6  B6  C6  D6
b  A7  B7  C7  D7
I am struggling to see the best way of merging these two datasets. Apart from the index, they don't share any other common information. So I'd like to use as much as I can from both, while avoiding unnecessary repetition.
I attempted:
df1.join(df2)
>>
    A   B   C   D  A2  B2  C2  D2
a  A0  B0  C0  D0  A4  B4  C4  D4
a  A1  B1  C1  D1  A4  B4  C4  D4
b  A2  B2  C2  D2  A5  B5  C5  D5
b  A2  B2  C2  D2  A6  B6  C6  D6
b  A2  B2  C2  D2  A7  B7  C7  D7
b  A3  B3  C3  D3  A5  B5  C5  D5
b  A3  B3  C3  D3  A6  B6  C6  D6
b  A3  B3  C3  D3  A7  B7  C7  D7
but as you can see, every row of df1 gets joined to every matching row of df2, a many-to-many cross product per serial number... This is not wrong as such, but given the size of my datasets (3 GB) it would end up creating far more observations than necessary. So I would like to avoid this if possible.
I also attempted:
pd.concat([df1, df2], axis=1, join="inner")
but as I have repeated indexes of serial numbers it returns an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What's the best way of merging these two datasets with repeated indexes? In other words, what should the output look like in order to preserve the information from both datasets while minimising repetition (which significantly affects data size)?
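One way to avoid the cross product (a hedged sketch added here for illustration, not part of the original thread; the names left, right and combined are illustrative): if the nth trip of a serial number in df1 should line up with the nth trip in df2, number the repeated index values on each side with groupby(level=0).cumcount() and join on the pair (serial, occurrence), which pairs rows one-to-one:
import pandas as pd

# Append an occurrence counter to the index on each side, so the nth
# row for serial "a" in df1 aligns with the nth row for "a" in df2.
left = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
right = df2.set_index(df2.groupby(level=0).cumcount(), append=True)

# An outer join keeps unmatched trips from either side (filled with
# NaN) without the row explosion of a many-to-many join.
combined = left.join(right, how="outer")
This assumes the row order within each serial number is meaningful; if trips should instead match on some trip attribute, that attribute belongs in the join key.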

Related

How to convert each row of a dataframe to new columns using concat in Python

If I have dataframes,
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
I want to use pd.concat to combine these two dataframes as
dfnew = pd.concat(
    [df1.loc[0], df1.loc[1], df1.loc[2], df1.loc[3],
     df2.loc[4], df2.loc[5], df2.loc[6], df2.loc[7]],
    axis=0, sort=False,
)
dfnew = dfnew.to_frame().transpose()
dfnew is a 1-row x 32-column dataframe.
But what if df1 and df2 have many rows, or I want to combine different numbers of rows from df1 and df2 in a loop? What can I do about the .loc[] part of the concat? Or is there another way to do this?
Thanks in advance.
IIUC, you could stack the individual dataframes, concat and reshape:
dfnew = pd.concat([df1.stack(), df2.stack()]).droplevel(0).to_frame().T
output:
A B C D A B C D A B C D A B C D A B C D A B C D A B C D A B C D
0 A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 D4 A5 B5 C5 D5 A6 B6 C6 D6 A7 B7 C7 D7
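The same idea generalises to any number of dataframes and rows without spelling out each .loc, by building the list of stacked frames programmatically (a sketch; frames is an illustrative name for whatever collection you loop over):
frames = [df1, df2]  # illustrative: any number of dataframes, any row counts
dfnew = pd.concat([d.stack() for d in frames]).droplevel(0).to_frame().T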

Python: Update two dataframes with identical columns and a few differing rows

I am joining two data frames that have the same columns, and I want to update the first dataframe. However, my code creates additional columns instead of updating.
My code:
import numpy as np

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                     "A": [np.nan, np.nan, np.nan, np.nan],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "A": ["C1", "C2", "C3"],
                      "B": ["B1", "B2", "B3"]})
result = pd.merge(left, right, on="key", how="left")
Present output:
result =
  key  A_x B_x  A_y  B_y
0  K0  NaN  B0  NaN  NaN
1  K1  NaN  B1   C1   B1
2  K2  NaN  B2   C2   B2
3  K3  NaN  B3   C3   B3
Expected output:
result =
  key   B    A
0  K0  B0  NaN
1  K1  B1   C1
2  K2  B2   C2
3  K3  B3   C3
Use combine_first:
result = left.set_index("key").combine_first(right.set_index("key")).reset_index()
print(result)
Output
  key    A   B
0  K0  NaN  B0
1  K1   C1  B1
2  K2   C2  B2
3  K3   C3  B3
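An alternative, if you specifically want to modify left in place rather than build a new frame (a sketch using the same data as above): DataFrame.update aligns on the index and overwrites matching cells with the non-NaN values from the other frame.
# Key both frames by "key", then overwrite left's cells in place
# with the corresponding non-NaN values from right.
updated = left.set_index("key")
updated.update(right.set_index("key"))
result = updated.reset_index()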

Add rows from one dataframe to another based on missing values in a given column (pandas)

I have been searching a long time for an answer but could not find it. I have two dataframes, one is target, the other backup, which both have the same columns. What I want to do is to look at a given column and add to target all the rows from backup which are not in target. The most straightforward solution for this is:
import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})
backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

merged = target.copy()
for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])
merged.reset_index(drop=True, inplace=True)
giving
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
4   K4   A4  B4
Now, I have tried several things using just pandas, none of which works.
pandas concat
# Does not work: it creates duplicate lines, and even if those are dropped, updated rows that differ will not be dropped -- compare the lines with A vs NaN
pd.concat([target, backup]).drop_duplicates()
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
1   K2    A  B2
3   K4   A4  B4
4   K5   A5  B5
pandas merge
# Does not work because the backup would overwrite data in the target -- see the NaN being replaced
pd.merge(target, backup, how="right")
  key1   A   B
0   K1  A1  B1
1   K2   A  B2
2   K3  A3  B3
3   K4  A4  B4
4   K5  A5  B5
Importantly, it is not a duplicate of this post, since I do not want a new column and, more importantly, the values are not NaN in target, they are simply not there. Furthermore, if I then used what is proposed there for merging the columns, the NaN in the target would be replaced by the value from backup, which is unwanted.
It is not a duplicate of this post, which uses pandas combine_first, because in that case the NaN is filled by the value from the backup, which is wrong:
target.combine_first(backup)
  key1   A   B
0   K1  A1  B1
1   K2  A2  B2
2   K3  A3  B3
3   K5  A4  B5
4   K5  A5  B5
Lastly,
target.join(backup, on=["key1"])
gives me an annoying
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
which I really do not get, since both columns are pure strings, and the proposed solution does not work.
So I would like to ask, what am I missing? How can I do it using some pandas methods? Thanks a lot.
Use concat with backup filtered to the rows whose key1 does not exist in target.key1, via Series.isin with boolean indexing:
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
Maybe you can try this with the subset parameter of df.drop_duplicates()?
pd.concat([target, backup]).drop_duplicates(subset = "key1")
which gives output:
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
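As an aside on the join error above: DataFrame.join matches the on column of the caller against the index of the other frame, so target.join(backup, on=["key1"]) compares the string column key1 with backup's default integer index, which is exactly the object-vs-int64 complaint; backup would need to be keyed by key1 first (plus suffixes for the overlapping columns).
For large frames, an equivalent pattern to the isin answer (a sketch, not from the original answers) is a left anti-join via merge's indicator flag, which makes the "rows in backup whose key is missing from target" logic explicit:
# Flag each backup row by whether its key1 also occurs in target
# (assumes target.key1 is unique, so the merge does not add rows).
flags = backup.merge(target[["key1"]], on="key1", how="left", indicator=True)
only_in_backup = backup[flags["_merge"] == "left_only"]
merged = pd.concat([target, only_in_backup], ignore_index=True)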

How do I print items out of a list horizontally in rows of specific length in Python? [duplicate]

This question already has answers here:
Python: Print in rows
(3 answers)
Closed 2 years ago.
I am very new to Python and have been given the following exercise.
Write a program that prints the whole chess board in the following way:
A1 A2 A3 A4 A5 A6 A7 A8
B1 B2 B3 B4 B5 B6 B7 B8
C1 C2 C3 C4 C5 C6 C7 C8
D1 D2 D3 D4 D5 D6 D7 D8
E1 E2 E3 E4 E5 E6 E7 E8
F1 F2 F3 F4 F5 F6 F7 F8
G1 G2 G3 G4 G5 G6 G7 G8
H1 H2 H3 H4 H5 H6 H7 H8
So far, this is what I've done:
letter_fields = ["A", "B", "C", "D", "E", "F", "G", "H"]
number_fields = ["1", "2", "3", "4", "5", "6", "7", "8"]
for letter in letter_fields:
    for number in number_fields:
        print(letter + number, end="")
So I have now printed everything horizontally, but I'm not sure how to split it into specific rows. I have tried adding empty spaces in the print line, but can't quite get it to align. This is meant to be done without anything too complicated, as this is only the 4th lesson. Any help appreciated!
Given what you may have already learned, you can use a space as the end argument for print, and print a new line after each inner loop finishes:
for letter in letter_fields:
    for number in number_fields:
        print(letter + number, end=" ")
    print()
#blhsing offers a good fix for your current code.
Another solution would be to loop over every letter, build the horizontal row using a list comprehension, then print the joined coordinates on each line using str.join:
for letter in letter_fields:
    row = [letter + number for number in number_fields]
    print(" ".join(row))
Output:
A1 A2 A3 A4 A5 A6 A7 A8
B1 B2 B3 B4 B5 B6 B7 B8
C1 C2 C3 C4 C5 C6 C7 C8
D1 D2 D3 D4 D5 D6 D7 D8
E1 E2 E3 E4 E5 E6 E7 E8
F1 F2 F3 F4 F5 F6 F7 F8
G1 G2 G3 G4 G5 G6 G7 G8
H1 H2 H3 H4 H5 H6 H7 H8
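For reference, once comprehensions are comfortable, the whole board can also be built as a single string with nested joins, producing the same output (a compact variant, not required for the exercise):
# Join the squares of each row with spaces, then join the rows with newlines.
board = "\n".join(
    " ".join(letter + number for number in number_fields)
    for letter in letter_fields
)
print(board)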

How to get the difference between two CSVs by index using Pandas

I need to get the difference between two CSV files, and remove duplicates and NaN fields.
I am trying this, but it adds them together instead of subtracting:
df1 = pd.concat([df,cite_id]).drop_duplicates(keep=False)[['id','website']]
df is the main dataframe; cite_id is the dataframe that has to be subtracted.
You can do this efficiently using isin, after dropping NaNs and duplicates from both frames (note the assignments, otherwise the cleaned results are discarded):
df = df.dropna().drop_duplicates()
cite_id = cite_id.dropna().drop_duplicates()
df = df[~df.id.isin(cite_id.id.values)]
Or you can do an outer merge with indicator=True and keep only the rows that come from df alone:
merged = df.merge(cite_id, how="outer", indicator=True)
result = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
import pandas as pd
df1 = pd.read_csv("1.csv")
df2 = pd.read_csv("2.csv")
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()
df = df2.loc[~df2.id.isin(df1.id)]
You can concatenate the two dataframes into one, and after that remove all duplicates:
df1
   ID   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
cite_id
   ID   B   C   D
4  A2  B4  C4  D4
5  A3  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
pd.concat([df1, cite_id]).drop_duplicates(subset=["ID"], keep=False)
Out:
   ID   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
6  A6  B6  C6  D6
7  A7  B7  C7  D7
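Note that keep=False drops every occurrence of a duplicated ID rather than keeping the first, so this yields the symmetric difference: only rows whose ID appears in exactly one of the two frames survive. If you instead want "df minus cite_id", filter df by which IDs are absent from cite_id, as in the isin approach above.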
