delete a part of pd.DataFrame with Python

I'm iterating over rows in my DataFrame with DataFrame.iterrows(), and if a row meets certain criteria I store it in another DataFrame. Is there a way to delete the rows that appear in both of them, like set.difference(another_set)?
I was asked to provide code. Since I don't know the answer to my question, I worked around the problem and created another DataFrame to which I save the good data, instead of keeping two DataFrames and taking the difference of them.
def test_right_chain(self, temp):
    temp__ = pd.DataFrame()
    temp_ = pd.DataFrame()
    key = temp["nr right"].iloc[0]
    temp_ = temp_.append(temp.iloc[0])
    temp = temp[1:]
    for index, row in temp.iterrows():
        print(row)
        key_ = row['nr right']
        if abs(key_ - key) == 1:
            pass
        elif len(temp_) > 2:
            print(row)
            temp__ = temp__.append(temp_)  # append returns a new frame; assign it back
            temp_ = pd.DataFrame()
        else:
            temp_ = pd.DataFrame()
        temp_ = temp_.append(row)
        key = key_
    return temp__

You can take the intersection of both DataFrames with pd.merge(df1, df2, right_index=True, how='inner'), which keeps the indexes of the matching rows from the left DataFrame (I don't know why, but this is what happens when I use right_index=True), and then retrieve the indexes of those rows. (I used the answer from this question: Compare Python Pandas DataFrames for matching rows)
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
df2 = df1.loc[4:8]  # .ix is deprecated/removed; .loc gives the same label-based slice
df2.reset_index(drop=True, inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True, inplace=True)
# note: newer pandas versions may reject combining on= with right_index=
df3 = pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], right_index=True, how='inner')
Now you need the indexes of the rows that appear in both DataFrames:
indexes = df3.index.values
And then you just need to drop those rows from your DataFrame:
df1 = df1.drop(df1.index[indexes])
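Alternatively (a sketch, not part of the original answer), merging with indicator=True tags each row by its origin, so the "set difference" falls out directly without relying on index behavior:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(10, 4), columns=list('ABCD'))
df2 = df1.loc[4:8]

# indicator=True adds a '_merge' column marking each row as 'left_only',
# 'right_only', or 'both'; keeping 'left_only' rows yields df1 minus df2.
merged = df1.merge(df2, on=['A', 'B', 'C', 'D'], how='left', indicator=True)
df1_minus_df2 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')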

Related

Take rows of ith duplicated index and put in i number of dataframes

This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
import pandas as pd

a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a': a, 'b': b})
df.index = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4]
I want the row of the first duplicated index for every number to be appended to df1, the row of the second duplicated index to be appended to df2, and so on: the first time indices 1, 2, 3, 4, ... n have a duplicate, those rows get appended to dataframe 1; the second time indices 1, 2, 3, 4, ... n have a duplicate, those rows get appended to dataframe 2, and so on.
Any idea how to go about this? I've tried running df[df.duplicated(subset=['index'])] in a for loop to whittle the df down to the very first duplicates, but it doesn't work the way I think it should.
Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index
dup_rank = df.groupby('id').cumcount()
# note: the range must run to max()+1, or the last group of duplicates is skipped
pd.concat([df[dup_rank == i][cols] for i in range(dup_rank.max() + 1)], axis=1)
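For intuition, cumcount numbers each row within its index group, so selecting the rows where it equals i picks out the (i+1)-th occurrence of every index:

# with df.index = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4]:
print(df.groupby(df.index).cumcount().tolist())
# [0, 1, 2, 3, 4, 0, 1, 0, 1, 2, 0]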

how to render distinct columns/rows by comparing two dataframes in pandas?

I have two dataframes; they share most of their columns, but a few distinct columns appear in only one of them. I want to print out those distinct columns and the common columns, so I can get a better idea of which columns changed in the other dataframe. I found some interesting posts on SO but don't know why I get an error. My two dataframes have the following shapes:
df19.shape
(39831, 1952)
df20.shape
(39821, 1962)
here is dummy data:
df1 = pd.DataFrame([[1, 2], [1, 3], [4, 6],[11,13],[10,19],[21,23]], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 4,0,7], [1, 3,9,2], [4, 6,3,8],[8,5,1,6]], columns=['A', 'B','C','D'])
current attempt
I came across this on SO and tried the following:
res=pd.concat([df19, df20]).loc[df19.index.symmetric_difference(df20.index)]
res.shape
(10, 1984)
This gave me the distinct rows, but not the distinct columns. I also tried this one, but it gave me an error:
df19.compare(df20, keep_equal=True, keep_shape=True)
How should I render the distinct rows and columns when comparing two dataframes in pandas? Does anyone know an easy way to do this in python? Any quick thoughts? Thanks
objective
I simply want to render the distinct rows or columns when comparing two dataframes, by column name or by which rows are distinct. For instance, compared to df1, which columns are newly added to df2; similarly, which rows were added to df2, and so on. Any idea?
I would recommend getting the columns by filtering on the column names.
common = [i for i in list(df1) if i in list(df2)]        # columns present in both
temp = df2[common]
distinct = [i for i in list(df2) if i not in list(df1)]  # columns only in df2
temp = df2[distinct]
Thanks to @Shaido, this one worked for me:
import pandas as pd

df1 = pd.read_csv(data1)
df2 = pd.read_csv(data2)

df1_cols = df1.columns
df2_cols = df2.columns
common_cols = df1_cols.intersection(df2_cols)  # columns in both
df2_not_df1 = df2_cols.difference(df1_cols)    # columns only in df2
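The snippets above cover the distinct columns; for the distinct rows, an analogous index-based difference works (a minimal sketch, assuming both frames are meaningfully indexed the same way):

rows_only_in_df2 = df2.loc[df2.index.difference(df1.index)]  # rows added in df2
rows_only_in_df1 = df1.loc[df1.index.difference(df2.index)]  # rows only in df1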

Joining two dataframes on subvalue of the key column

I am currently trying to join/merge two dataframes on the column Key, where in df1 the key is a standalone value such as 5, but in df2 the key can consist of multiple values such as [5, 6, 13].
For example like this:
df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my df are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1    Key2
5       5,6,13
10      10,7
and so on ....
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and join on the single values; however, the problem there was that not all column values were split correctly, and I would need to combine everything back into one cell after joining.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]  # take the first element of each key list
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key': 'sub_key'})  # one row per list element
df3 = df3.join(df1)  # re-attach the original key list via the index
df3.merge(df2)
Edit: First version had a small bug, fixed it.
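For intuition, here is roughly what the intermediate frame looks like with the question's toy data (a sketch; the column names follow the answer above):

df1 = pd.DataFrame({'key': [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({'sub_key': ["5", "10", "6"]})

exploded = df1.explode('key').rename(columns={'key': 'sub_key'})
#   sub_key
# 0       5
# 0       6
# 0      13
# 1      10
# 1       7
# 2       6
# 2       8
result = exploded.join(df1).merge(df2)  # inner merge keeps only rows whose sub_key is in df2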

Merging multiple pandas dataframes by common STRINGS in a column

I have 6 csv files in which one column is a sentence and the second column is an integer.
The sentences are the same across all csv files, but they are out of order from file to file.
I want to merge all the data frames by sentence, so that I have one column of sentences and then each integer column associated with that sentence from each csv file.
I've tried various merge-and-reduce techniques on the common 'sentence' column, but I end up with orders of magnitude more rows than I should have.
For example:
from functools import reduce

data_frames = [df1, df2, df3, df4, df5, df6]
reduce(lambda x, y: pd.merge(x, y, on='sentence', how='inner'), data_frames)
results in a dataframe with 12,502,455 rows!! I only have 4,825 rows in each csv file.
Even using:
pd.merge(df1,df2, on='sentence', how='inner')
results in a dataframe with 5,295 rows.
I know all the sentences are identical across csv files because I uploaded the same csv file of sentences to mTurk to be labeled.
It looks like your code is running correctly. I'm guessing the problem is that your sentences are not distinct: if you have duplicate sentences, running an inner join will multiply them (google "Cartesian product").
Can you post how many duplicate sentences are in each file?
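For example, a quick way to count the duplicates in each file (a sketch, assuming the column is named 'sentence' as above):

num_dupes = df['sentence'].duplicated().sum()  # rows repeating an earlier sentence
print(df['sentence'].value_counts().head())    # the most repeated sentences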
You might have strings that differ only superficially.
Make sure to preprocess them first by lowercasing and stripping whitespace.
Example:
new_dfs = []
for df in dfs:
    df['sentence'] = df['sentence'].apply(lambda x: x.lower().strip())
    new_dfs.append(df)
Then you can simply merge as you mentioned. Make sure the columns are named the same in all frames.
Here's a simple working example:
import pandas as pd
vals1 = [[1, 'doc'], [2, 'bac'], [3, 'mec']]
vals2 = [[22, 'doc'], [12, 'mec'], [67, 'bac']]
vals3 = [[15, 'mec'], [35, 'bac'], [122, 'doc']]
df1 = pd.DataFrame(data=vals1, columns=["x","y"])
df2 = pd.DataFrame(data=vals2, columns=["x","y"])
df3 = pd.DataFrame(data=vals3, columns=["x","y"])
df4 = pd.merge(df1, df2, on='y', how='inner', suffixes=("1","2"))
df4 = pd.merge(df4, df3, on='y', how='inner')
df4.head()
Result:
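   x1    y  x2    x
0   1  doc  22  122
1   2  bac  67   35
2   3  mec  12   15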

How to read the result of pandas merge?

Using pandas merge, the resulting columns are confusing:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 5)))
df2[0] = df1[0] # matching key on the first column.
# Now the weird part.
pd.merge(df1, df2, left_on=0, right_on=0).shape
Out[96]: (5, 9)
pd.merge(df1, df2, left_index=True, right_index=True).shape
Out[102]: (5, 10)
pd.merge(df1, df2, left_on=0, right_on=1).shape
Out[107]: (0, 11)
The number of columns is not fixed, the column labels are also unstable, and worse yet, none of this is documented clearly.
I want to read some columns of the resulting data frame, which has many columns (hundreds). Currently I am using .iloc[], because labeling is too much work. But I am worried that this is error-prone because of the weird merged result.
What is the correct way to read some columns of the merged data frame?
Python: 2.7.13, Pandas: 0.19.2
Merge key
1.1 Merge on key, when the join key is a column (this is the right solution for you, since you say df2[0] = df1[0] # matching key on the first column).
1.2 Merge on index, when the merge key is the index.
The reason you get one more column in the second merge (pd.merge(df1, df2, left_index=True, right_index=True).shape) is that the initial join key now appears twice, as '0_x' and '0_y'.
Regarding column names
Column names do not change during a merge, UNLESS there are columns with the same name in both dataframes. Those columns are renamed as follows; you get:
'initial_column_name'+'_x' (the suffix '_x' is added to the column of the left dataframe (df1))
'initial_column_name'+'_y' (the suffix '_y' is added to the column of the right dataframe (df2) )
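A small sketch of that renaming (toy frames with hypothetical names):

import pandas as pd

left = pd.DataFrame({'k': [1, 2], 'v': [10, 20]})
right = pd.DataFrame({'k': [1, 2], 'v': [30, 40]})

# Both frames have a column 'v', so the merged result carries 'v_x' and 'v_y'.
print(pd.merge(left, right, on='k').columns.tolist())
# ['k', 'v_x', 'v_y']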
To deal with the 3 different cases for the number of columns in the merged result, I ended up checking the number of columns and then converting the column number indexes for use in .iloc[]. Here is the code, for future searchers.
This is still the best way I know to deal with a huge number of columns. I will mark a better answer if one appears.
Utility method to convert column number index:
import numpy as np

def get_merged_column_index(num_col_df, num_col_df1, num_col_df2, col_df1=[], col_df2=[], joinkey_df1=[], joinkey_df2=[]):
    """Transform column indexes in the old source dataframes into column indexes in the merged dataframe, checking for the different pandas merged-result formats.

    :param num_col_df: number of columns in the merged dataframe df.
    :param num_col_df1: number of columns in df1.
    :param num_col_df2: number of columns in df2.
    :param col_df1: (list of int) column positions in df1 to keep (0-based).
    :param col_df2: (list of int) column positions in df2 to keep (0-based).
    :param joinkey_df1: (list of int) column positions (0-based). Not implemented yet.
    :param joinkey_df2: (list of int) column positions (0-based). Not implemented yet.
    :return: (list of int) transformed 0-based column indexes in the merged dataframe.
    """
    col_df1 = np.array(col_df1)
    col_df2 = np.array(col_df2)
    if num_col_df == num_col_df1 + num_col_df2:  # merge kept all old columns
        col_df2 += num_col_df1
    elif num_col_df == num_col_df1 + num_col_df2 + 1:  # merge added a 'key_0' column at the head
        col_df1 += 1
        col_df2 += num_col_df1 + 1
    elif num_col_df <= num_col_df1 + num_col_df2 - 1:  # merge dropped (possibly many) duplicated join-key columns from df2; df1 columns keep their order
        raise ValueError('Format of merged result is too complicated.')
    else:
        raise ValueError('Undefined format of merged result.')
    return np.concatenate((col_df1, col_df2)).astype(int).tolist()
Then:
cols_toextract_df1 = []
cols_toextract_df2 = []
converted_cols = get_merged_column_index(
    num_col_df=df.shape[1], num_col_df1=df1.shape[1], num_col_df2=df2.shape[1],
    col_df1=cols_toextract_df1, col_df2=cols_toextract_df2)  # fixed: was passing cols_toextract_df1 for both
extracted_df = df.iloc[:, converted_cols]
