I have two data frames.
My final goal is to compare a column in both data frames and return those values which donot match with each other
example:
df_1["column_1"]= ["A45", "kl24", "mhg", "tz22" ]
df_2["column_2"]= ["KL24", "tz22", "mhg", "A 45"]
I need a code that comparing two array values in the respective dataframe["column"] and returns those values from df_1 which did not match in df_2(Ex: from our example "A45" and "kl24" will return because there is a space and upper and lower case error)
Can anyone kindly please help me with this !
Use pd.merge(..., how='left', indicator=True) and select the rows with left_only:
>>> df = df_1.merge(df_2, how='outer', left_on='column_1', right_on='column_2', indicator=True)
>>> df
column_1 column_2 _merge
0 A45 NaN left_only
1 kl24 NaN left_only
2 mhg mhg both
3 tz22 tz22 both
4 NaN KL24 right_only
5 NaN A 45 right_only
>>> df[df._merge == 'left_only']['column_1']
0 A45
1 kl24
Name: column_1, dtype: object
You can use df.isin and df.loc to do the trick
df_1["column_1"].loc[~df_1["column_1"].isin(df_2["column_2"])]
column_1
0 A45
1 kl24
I have two datasets with identical column headers and I would like to remove ALL data that is 100% identical, and just have what they do not have exactly in common remaining. How could I go about doing that?
Thank you for your time!
To get everything BUT the intersection of two pandas datasets, try this:
# Everything from the first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from the second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
There is one caveat though, when filtering with boolean masks, your int values might turn into floats. By default, pandas replaces unwanted (False) values with the float version of NAN and converts the entire column to float. You can see this happening in the example below.
To circumvent this, explicitly declare the datatype when creating the dataframe.
Example
import pandas as pd
df1 = pd.read_csv("./csv1.csv") #, dtype='Int64')
print(f"csv1\n{df1}\n")
df2 = pd.read_csv("./csv2.csv") #, dtype='Int64')
print(f"csv2\n{df2}\n")
# Everything from first except what is on second
r1 = df1[~df1.isin(df2)]
# Everything from second except what is on first
r2 = df2[~df2.isin(df1)]
# concatenate and drop NANs
result = pd.concat(
[r1, r2]
).dropna().reset_index(drop=True)
print(f"result\n{result}\n")
Input
csv1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
csv2
A B C
0 1 2 3
1 4 5 6
2 10 11 12
Output
result
A B C
0 7.0 8.0 9.0
1 10.0 11.0 12.0
If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge on DataFrame and Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the values but multiply them by the length, set the columns to the index and set params for left_index and right_index to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT for the situation where you want the index of your constructed df from the series to use the index of the df then you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, much simpler and concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply just returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official document also includes example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape() to return the number of rows from df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed, (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary depending on how you have your real data stored.
If df is a pandas.DataFrame then df['new_col']= Series list_object of length len(df) will add the or Series list_object as a column named 'new_col'. df['new_col']= scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col']= [scalar]*len(df)
So a two-line code serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with Panda and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the quantities of columns based on Index. Columns do not have the same name and may contain strings. (I have in fact 50 columns on each df). I want to consider nan as 0. The result should look:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
indices=0
columna=0
while indices<len(df.index)-1:
while columna<numbercolumns-1:
df3.iloc[indices,columna]=df1.iloc[indices,columna] +df2.iloc[indices,columna]
indices += 1
columna += 1
Thank you.
You can try of concatenating both dataframes, then add based on the index group
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0
I am merging two data frames using pandas.merge. Even after specifying how = left option, I found the number of rows of merged data frame is larger than the original. Why does this happen?
panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)
prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()
temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")
after_len = len(temp_2000)
print(before_len, after_len)
> 12661 13915
This sounds like having more than one rows in right under 'name2' that match the key you have set for the left. Using option 'how='left' with pandas.DataFrame.merge() only means that:
left: use only keys from left frame
However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.
Example:
In [359]: df_1
Out[359]:
A B
0 a AAA
1 b BBA
2 c CCF
and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):
In [360]: df_3
Out[360]:
key value
0 a 1
1 a 2
2 b 3
3 a 4
If I merge these two on left.A, here's what happens:
In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]:
A B key value
0 a AAA a 1.0
1 a AAA a 2.0
2 a AAA a 4.0
3 b BBA b 3.0
4 c CCF NaN NaN
This happened even though I merged with how='left' as you can see above, there were simply more than one rows to merge and as shown here the result pd.DataFrame has in fact more rows than the pd.DataFrame on the left.
I hope this helps!
The problem of doubling of rows after each merge() (of any type, 'both' or 'left') is usually caused by duplicates in any of the keys, so we need to drop them first:
left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
If you do not have any duplication, as indicated in the above answer. You should double-check the names of removed entries. In my case, I discovered that the names of removed entries are inconsistent between the df1 and df2 and I solved the problem by:
df1["col1"] = df2["col2"]