How to get the difference between two CSV files by index using pandas - python

I need to get the difference between two CSV files, dropping duplicates and NaN fields.
I am trying this, but it adds the frames together instead of subtracting:
df1 = pd.concat([df,cite_id]).drop_duplicates(keep=False)[['id','website']]
df is the main dataframe; cite_id is the dataframe that has to be subtracted.

You can do this efficiently using isin. Note that dropna and drop_duplicates return new frames, so assign the results back:
df = df.dropna().drop_duplicates()
cite_id = cite_id.dropna().drop_duplicates()
df[~df.id.isin(cite_id.id.values)]
Or you can merge them with indicator=True and keep only the rows that came from df alone (the original trick of keeping rows with a NaN breaks down because the merged frame is longer than df, so the boolean mask does not align):
merged = pd.merge(df, cite_id, how='outer', indicator=True)
merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
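A minimal, self-contained sketch of the merge approach; the id/website columns and sample values are assumed from the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'website': ['a', 'b', 'c', 'd']})
cite_id = pd.DataFrame({'id': [2, 4], 'website': ['b', 'd']})

# indicator=True adds a _merge column saying where each row came from
merged = pd.merge(df, cite_id, how='outer', indicator=True)
print(merged[merged['_merge'] == 'left_only'].drop(columns='_merge'))
#    id website
# 0   1       a
# 2   3       c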

A complete version of the isin approach:
import pandas as pd

df1 = pd.read_csv("1.csv")
df2 = pd.read_csv("2.csv")

# drop NaN rows and duplicates from both frames
df1 = df1.dropna().drop_duplicates()
df2 = df2.dropna().drop_duplicates()

# keep only the rows of df2 whose id does not appear in df1
df = df2.loc[~df2.id.isin(df1.id)]

You can concatenate the two dataframes into one and then remove all duplicates. With subset=['ID'] and keep=False, drop_duplicates drops every row whose ID appears in both frames, leaving the symmetric difference.
df1
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
cite_id
ID B C D
4 A2 B4 C4 D4
5 A3 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
pd.concat([df1,cite_id]).drop_duplicates(subset=['ID'], keep=False)
Out:
ID B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
6 A6 B6 C6 D6
7 A7 B7 C7 D7

Related

Pandas: Merge dataframes with repeated indexes

I would like to merge two datasets that share a common index. In my real data, this index is a serial number and it is repeated. The serial number corresponds to a vehicle and it is repeated for every trip taken with that vehicle. So there are different feature values depending on the trip circumstances.
Here's an example:
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=["a", "a", "b", "b"],
)
df1
>>
A B C D
a A0 B0 C0 D0
a A1 B1 C1 D1
b A2 B2 C2 D2
b A3 B3 C3 D3
df2 = pd.DataFrame(
{
"A2": ["A4", "A5", "A6", "A7"],
"B2": ["B4", "B5", "B6", "B7"],
"C2": ["C4", "C5", "C6", "C7"],
"D2": ["D4", "D5", "D6", "D7"],
},
index=["a", "b", "b", "b"],
)
df2
>>
A2 B2 C2 D2
a A4 B4 C4 D4
b A5 B5 C5 D5
b A6 B6 C6 D6
b A7 B7 C7 D7
I am struggling to see the best way of merging these two datasets. Apart from the index, they don't share any other common information. I'd like to use as much as I can from both, while avoiding unnecessary repetition.
I attempted:
df1.join(df2)
>>
A B C D A2 B2 C2 D2
a A0 B0 C0 D0 A4 B4 C4 D4
a A1 B1 C1 D1 A4 B4 C4 D4
b A2 B2 C2 D2 A5 B5 C5 D5
b A2 B2 C2 D2 A6 B6 C6 D6
b A2 B2 C2 D2 A7 B7 C7 D7
b A3 B3 C3 D3 A5 B5 C5 D5
b A3 B3 C3 D3 A6 B6 C6 D6
b A3 B3 C3 D3 A7 B7 C7 D7
but as you can see, every row of df2 is joined to each matching row of df1, producing a full cross product per index value. This is not wrong, I think, but considering the size of my datasets (3 GB), it would end up creating more observations than necessary, so I would like to avoid this if possible.
I also attempted:
pd.concat([df1, df2], axis=1, join="inner")
but as I have repeated indexes of serial numbers it returns an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What's the best way of merging these two datasets of repeated indexes? In other words, what the best output should be in order to preserve information from both datasets and minimise repetition (affects data size significantly)?
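One common approach (a sketch, not from the original thread, assuming the k-th trip of each serial number in df1 should line up with the k-th trip in df2) is to number the repeats with groupby().cumcount() and merge on the serial number plus that counter; unmatched trips survive as NaN rows instead of multiplying:
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"]},
    index=["a", "a", "b", "b"],
)
df2 = pd.DataFrame(
    {"A2": ["A4", "A5", "A6", "A7"], "B2": ["B4", "B5", "B6", "B7"]},
    index=["a", "b", "b", "b"],
)

# number repeated index values: a -> 0, 1 and b -> 0, 1 in df1; a -> 0 and b -> 0, 1, 2 in df2
left = df1.assign(trip=df1.groupby(level=0).cumcount())
right = df2.assign(trip=df2.groupby(level=0).cumcount())

# merging on (serial, trip) yields max(group sizes) rows per serial, not a cross product
out = left.reset_index().merge(right.reset_index(), on=["index", "trip"], how="outer")
print(out)
This gives 5 rows for the example instead of the 8 produced by join.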

Check if value of one column exists in another column, put a value in another column in pandas

Say I have a data frame like the following:
A B C D E
a1 b1 c1 d1 e1
a2 a1 c2 d2 e2
a3 a1 a2 d3 e3
a4 a1 a2 a3 e4
I want to create a new column with predefined values when a value is found in other columns.
Something like this:
A B C D E F
a1 b1 c1 d1 e1 NA
a2 a1 c2 d2 e2 in_B
a3 a1 a2 d3 e3 in_B, in_C
a4 a1 a2 a3 e4 in_B, in_C, in_D
The in_B, in_C labels could be any other string of choice. If a value is present in multiple columns, F should hold multiple labels, as in rows 3 and 4 of column F (row 3 has two labels and row 4 has three). So far I have tried the below:
DF.F = np.where(DF.A.isin(DF.B), DF.A, 'in_B')
But it does not give the expected result. Any help?
STEPS:
Stack the dataframe.
Check for duplicate values.
Unstack to get the same structure back.
Use dot to get the required result.
df['new_col'] = df.stack().duplicated().unstack().dot(
    'In ' + df.columns + ',').str.strip(',')
OUTPUT:
A B C D E new_col
0 a1 b1 c1 d1 e1
1 a2 a1 c2 d2 e2 In B
2 a3 a1 a2 d3 e3 In B,In C
3 a4 a1 a2 a3 e4 In B,In C,In D
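A self-contained sketch reproducing the sample frame; note that duplicated() flags any value already seen earlier in row-major stacking order, which matches the sample here:
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': ['b1', 'a1', 'a1', 'a1'],
    'C': ['c1', 'c2', 'a2', 'a2'],
    'D': ['d1', 'd2', 'd3', 'a3'],
    'E': ['e1', 'e2', 'e3', 'e4'],
})

# stack -> one long Series; duplicated() flags repeats; unstack -> boolean frame;
# dot concatenates the column labels wherever the flag is True
flags = df.stack().duplicated().unstack()
df['new_col'] = flags.dot('In ' + df.columns + ',').str.strip(',')
print(df)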

Dataframe slicing with string values

I have a string dataframe that I would like to modify. I need to cut off each row of the dataframe at a value, say A4, and replace the values after A4 with -- or remove them. I would like to create a new dataframe that has values only up to the string "A4". How would I do this?
import pandas as pd

columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']
values = [['A1','A2','A3','A4','A5','A6'], ['A1','A3','A2','A5','A4','A6'], ['A1','A2','A4','A3','A6','A5'], ['A2','A1','A3','A4','A5','A6'], ['A2','A1','A3','A4','A6','A5'], ['A1','A2','A4','A3','A5','A6']]
input = pd.DataFrame(values, columns=columns)

values = [['A1','A2','A3','A4','--','--'], ['A1','A3','A2','A5','A4','--'], ['A1','A2','A4','--','--','--'], ['A2','A1','A3','A4','--','--'], ['A2','A1','A3','A4','--','--'], ['A1','A2','A4','--','--','--']]
output = pd.DataFrame(values, columns=columns)
You can make a small function that walks a row, finds your desired value, and overwrites everything after it:
def myfunc(x, val):
    # scan the row until val is found, then blank out everything after it
    for i in range(len(x)):
        if x.iloc[i] == val:
            break
    x.iloc[(i + 1):] = '--'
    return x
Then apply the function to the dataframe row-wise (axis=1):
input.apply(lambda x: myfunc(x, 'A4'), axis=1)
   c1  c2  c3  c4  c5  c6
0  A1  A2  A3  A4  --  --
1  A1  A3  A2  A5  A4  --
2  A1  A2  A4  --  --  --
3  A2  A1  A3  A4  --  --
4  A2  A1  A3  A4  --  --
5  A1  A2  A4  --  --  --
This assumes every value to be blanked is greater than A4: it simply replaces A5-A9 wherever they occur, so it keeps A3 when it appears after A4 and blanks A5 when it appears before A4, which differs from the apply-based result above:
input.replace('A([5-9])', '--', regex=True)
   c1  c2  c3  c4  c5  c6
0  A1  A2  A3  A4  --  --
1  A1  A3  A2  --  A4  --
2  A1  A2  A4  A3  --  --
3  A2  A1  A3  A4  --  --
4  A2  A1  A3  A4  --  --
5  A1  A2  A4  A3  --  --
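A fully vectorized alternative (a sketch, not from the original answers) that cuts at the first A4 in each row no matter what follows it:
# True from the first 'A4' onward in each row...
after = input.eq('A4').cummax(axis=1)
# ...shifted right by one column so 'A4' itself is kept
mask = after.shift(1, axis=1, fill_value=False)
input.mask(mask, '--')
This reproduces the question's expected output exactly, including rows where A3 or A5 appear after A4.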

Extracted cell forms into a new row with same column name (Reading multiple files)

I want to find the values c2-c5 that correspond to the rows b2-b5 and add them as rows to a new dataframe.
This is a sample data that I am using.
.. 2 3 4 5 6 7 8
0 a b c d e f g
1 a1 b1 c1 d1 e1 f1 g1
2 a2 b2 c2 d2 e2 f2 g2
3 a3 b3 c3 d3 e3 f3 g3
4 a4 b4 c4 d4 e4 f4 g4
5 a5 b5 c5 d5 e5 f5 g5
Code I tried (I had to put the df.loc outside the loop, as the values were getting replaced):
data = []
for file in files:
    df = pd.read_excel(file, header=None)
    df['Year'] = file.split('_')[0]
    df['Final'] = df.iat[1, 1]
    df['Comments'] = df.iat[2, 1]
    data.append(df)
df1 = df.loc[df[3].isin(['b2','b3','b4','b5']),[3,4]].assign(year=file.split('.')[0]).assign(df['Year]....)
I want the result to be like this:
1  2  3  4    5    year
.  .  .  abc  def
.  .  .  b2   c2   2019
.  .  .  b3   c3   2019
.  .  .  b4   c4   2019
.  .  .  b5   c5   2019
.  .  .  b2   c2   2019
.  .  .  b3   c3   2019
.  .  .  b4   c4   2019
.  .  .  b5   c5   2019
What if I have different years and want to add more columns?
data = []
for file in files:
    df = pd.read_excel(file, header=None)
    df['Year'] = file.split('_')[0]
    df = df.loc[df[3].isin(['b2','b3','b4','b5']), [3, 4]]
    data.append(df)
df = pd.concat(data, ignore_index=True)
The idea is to filter the values with Series.isin, add a new column year with DataFrame.assign, append each filtered DataFrame to the list data, and finally use concat:
data = []
for file in files:
    df = pd.read_excel(file, header=None)
    df = df.loc[df[3].isin(['b2','b3','b4','b5']), [3, 4]].assign(year=file.split('.')[0])
    data.append(df)
df = pd.concat(data, ignore_index=True)
Test with sample data:
df = df.loc[df[3].isin(['b2','b3','b4','b5']),[3, 4]].assign(year=2019)
print (df)
3 4 year
2 b2 c2 2019
3 b3 c3 2019
4 b4 c4 2019
5 b5 c5 2019
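For the follow-up about adding more columns per file: a sketch, assuming files is the same list of paths as above and that the Final and Comments cells really sit at the fixed positions the question's iat calls point to:
data = []
for file in files:
    df = pd.read_excel(file, header=None)
    # attach the per-file scalars to every filtered row
    sub = (df.loc[df[3].isin(['b2', 'b3', 'b4', 'b5']), [3, 4]]
             .assign(year=file.split('_')[0],
                     final=df.iat[1, 1],       # assumed position of 'Final'
                     comments=df.iat[2, 1]))   # assumed position of 'Comments'
    data.append(sub)
df = pd.concat(data, ignore_index=True)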

Groupby and Sample pandas

I am trying to sample the data that results from a groupby on multiple columns. If a group has more than two rows, I want to sample two of them; otherwise I want to keep all of its rows.
df:
col1 col2 col3 col4
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
target df:
col1 col2 col3 col4
A1 A2 A3 A4 or A5 or A6
A1 A2 A3 A4 or A5 or A6
B1 B2 B3 B4
B1 B2 B3 B5
C1 C2 C3 C4
I have written A4 or A5 or A6 because sampling may return any of the three.
This is what I have tried so far:
trial = pd.DataFrame(df.groupby(['col1', 'col2','col3'])['col4'].apply(lambda x: x if (len(x) <=2) else x.sample(2)))
However, this way I do not get col1, col2, and col3 back as columns.
I think you need a double reset_index: the first to remove the third level of the MultiIndex, the second to convert the MultiIndex to columns:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index(level=3, drop=True)
           .reset_index())
Or use reset_index and then drop the level_3 column:
trial = (df.groupby(['col1', 'col2', 'col3'])['col4']
           .apply(lambda x: x if (len(x) <= 2) else x.sample(2))
           .reset_index()
           .drop(columns='level_3'))
print (trial)
col1 col2 col3 col4
0 A1 A2 A3 A4
1 A1 A2 A3 A6
2 B1 B2 B3 B4
3 B1 B2 B3 B5
4 C1 C2 C3 C4
There is no need to wrap the result in pd.DataFrame; groupby/apply already returns a pandas object:
trial = df.groupby(['col1', 'col2', 'col3'])['col4'].apply(lambda x: x if (len(x) <= 2) else x.sample(2))
And this adds col1, col2, and col3 back as columns (Series.reset_index cannot run in place when drop=False, so assign the result):
trial = trial.reset_index()
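A shorter alternative (a sketch, not from the original answers): sample min(len(group), 2) rows per group, so small groups survive intact and all original columns are kept without any reset_index gymnastics:
trial = (df.groupby(['col1', 'col2', 'col3'], group_keys=False)
           .apply(lambda g: g.sample(n=min(len(g), 2))))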
