I am trying to perform a vlookup/merge between two dataframes. I get the error
KeyError: "['Player'] not in index"
I've tried to reindex the columns, but it doesn't seem to work:
df1 = df1.reindex(columns=['Player', 'Category'])
My current code is like so:
missingnames = pd.merge(df1, df2[['Player', 'Player Name']], on='Player', how='left')
My dataframes are like below (posted as images in the original; per the answer, df1 has the columns Player and Category, while df2 has Player Name, Height and Weight). The expected output flags which players from df1 exist in df2.
Can anyone help with this?
Thanks.
You can do it like this:
df1['Exists'] = df1['Player'].str.lower().isin(df2['Player Name'].str.lower())
Look closer at your merge arguments and the columns you have in each dataframe:
df1 includes "Player" and "Category"
df2 includes "Player Name", "Height" and "Weight"
Your merge call says that the column "Player" is in df2, but it is not:
missingnames = pd.merge(df1,df2[['Player','Player Name']],on='Player',how = 'left')
===============================^
missingnames needs to be changed to:
missingnames = df1.merge(df2, left_on='Player', right_on='Player Name', how='left')
And then from there you can ascertain whether there are any missing values, as in the sketch below.
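A minimal sketch of that check, assuming the corrected merge above: rows where 'Player Name' came back NaN are the players missing from df2.

missingnames = df1.merge(df2, left_on='Player', right_on='Player Name', how='left')
# unmatched rows have NaN in the columns that came from df2
missing = missingnames[missingnames['Player Name'].isna()]
print(missing['Player'].tolist())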
Use numpy.where:
import numpy as np

df1['Exists'] = np.where(df1['Player'].str.upper().isin(df2['Player Name'].str.upper()), 'Exists', '')
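For illustration, a small self-contained run with made-up values (the real data was posted as images):

import numpy as np
import pandas as pd

# hypothetical sample data matching the column names above
df1 = pd.DataFrame({'Player': ['John Smith', 'Jane Doe'], 'Category': ['A', 'B']})
df2 = pd.DataFrame({'Player Name': ['JOHN SMITH'], 'Height': [180], 'Weight': [80]})

df1['Exists'] = np.where(
    df1['Player'].str.upper().isin(df2['Player Name'].str.upper()),
    'Exists', '')
print(df1)  # Jane Doe gets '' because she has no match in df2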
I'm trying to merge two data frames on a column with an int data type
df3 = df2.merge('df1', how = 'inner', on = 'ID')
But I receive this error:
TypeError: Can only merge Series or DataFrame objects, a <class 'str'> was passed
I do not understand what is causing this, so any help would be appreciated!
As you have written it, the call asks to merge df2 with 'df1'; to the interpreter this looks like merging a dataframe with the literal string 'df1'. Remove the quotation marks and pass just df1 as an object.
You need to pass the df1 variable directly, not as a string:
df3 = df2.merge(df1, how = 'inner', on = 'ID')
Alternatively you can pass both dataframes as parameters:
df3 = pd.merge(df1, df2, how = 'inner', on = 'ID')
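A quick sketch with made-up frames to confirm the fix:

import pandas as pd

# hypothetical data keyed by an int ID
df1 = pd.DataFrame({'ID': [1, 2], 'name': ['a', 'b']})
df2 = pd.DataFrame({'ID': [1, 2], 'value': [10, 20]})

df3 = df2.merge(df1, how='inner', on='ID')  # df1 as an object, not the string 'df1'
print(df3)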
Using an outer join to merge two tables. Let's say
df1 has the columns ['productID', 'Name']
df2 has the columns ['userID', 'productID', 'usage']
I tried to use an outer join with the merge function in pandas.
pd.merge(df1, df2[['userID','productID', 'usage']], on='productID', how = 'outer')
However, the error message I got is
'productID' is both an index level and a column label, which is ambiguous.
I googled this error message and saw an open issue: https://github.com/facebook/prophet/issues/891
Any solution to my problem?
The error means the index name is the same as a column name, productID:
# check it
print(df2.index.name)
The solution is to remove or rename the index name, e.g. with DataFrame.rename_axis:
pd.merge(df1, df2.rename_axis(None)[['userID','productID', 'usage']],
on='productID', how = 'outer')
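A minimal sketch that reproduces the error and the fix (the data is made up):

import pandas as pd

df1 = pd.DataFrame({'productID': [1, 2], 'Name': ['a', 'b']})
df2 = pd.DataFrame({'userID': [10, 11], 'productID': [1, 3], 'usage': [5, 7]})
df2.index.name = 'productID'  # the index name clashes with the column name

# pd.merge(df1, df2, on='productID')  # raises the "ambiguous" error
out = pd.merge(df1, df2.rename_axis(None), on='productID', how='outer')
print(out)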
I imported a dataset into my Python script and computed the correlation. This is the code for the correlation:
data = pd.read_excel('RQ_ID_Grouping.xlsx' , 'Sheet1')
corr = data.corr()
After the correlation the data is a square correlation matrix (shown as an image in the original post). I want to convert it into long format with the columns X, Y and Corr_Value. I am using this code to achieve that, but it doesn't seem to be working:
corr1 = (corr.melt(var_name = 'X' , value_name = 'Y').groupby('X')['Y'].reset_index(name = 'Corr_Value'))
I know there should be something after the groupby part, but I don't know what. If you could help me, I would greatly appreciate it.
Use DataFrame.stack to reshape and drop missing values, convert the MultiIndex to columns with DataFrame.reset_index, and last set the column names:
df = corr.stack().reset_index()
df.columns = ['X','Y','Corr_Value']
Another solution with DataFrame.rename_axis:
df = corr.stack().rename_axis(('X','Y')).reset_index(name='Corr_Value')
And your solution with melt is also possible:
df = (corr.rename_axis('X')
          .reset_index()
          .melt('X', var_name='Y', value_name='Corr_Value')
          .dropna()
          .sort_values(['X','Y'])
          .reset_index(drop=True))
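A self-contained run of the stack approach on assumed toy data (no Excel file needed):

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [3, 1, 2]})
corr = data.corr()

df = corr.stack().rename_axis(('X', 'Y')).reset_index(name='Corr_Value')
print(df.head(3))
#    X  Y  Corr_Value
# 0  a  a         1.0
# 1  a  b         1.0
# 2  a  c        -0.5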
I have two dataframes like this.
df1
MainId,Time,info1,info2
100,2018-07-12 08:05:00,a,b
100,2018-07-12 08:07:00,x,y
101,2018-07-14 16:00:00,c,d
100,2018-07-14 19:30:00,d,e
104,2018-07-14 03:30:00,g,h
and
df2
Id,MainId,startTime,endTime,value
1,100,2018-07-12 08:00:00,2018-07-12 08:10:00,1001
2,150,2018-07-14 10:05:00,2018-07-14 17:05:00,1002
3,101,2018-07-12 0:05:00,2018-07-12 19:05:00,1003
4,100,2018-07-12 08:05:00,2018-07-12 08:15:00,1004
df2 is the main dataframe and df1 is a sub-dataframe. I would like to check the startTime and endTime of df2 against the Time in df1, matching on MainId. If df1.Time falls between df2's startTime and endTime for the same MainId, then I want to bring the info1 and info2 columns of df1 into df2. If there is no match, then I would like to enter just NaN.
I want my output like this
Id,MainId,info1,info2,value
1,100,a,b,1001
1,100,x,y,1001
2,150,nan,nan,1002
3,101,nan,nan,1003
4,100,a,b,1004
4,100,x,y,1004
Here the same Id (Id 1) and MainId appear twice in the output because the matching rows have different info1 and info2, and I want to include those too.
This is what I am doing in pandas:
df2['info1'] = np.where((df2['MainId'] == df1['MainId'])& (df1['Time'].isin([df2['startTime'], df2['endTime']])),df1['info1'], np.nan)
but it is throwing an error
ValueError: Can only compare identically-labeled Series objects
How can I fix this error? Is there a better way?
df1 and df2 have different indexes (you can check this by inspecting df1.index and df2.index). Hence, when you do df2['MainId'] == df1['MainId'], you have two Series objects that are not comparable.
Try using a left join, something like:
df3 = df2.join(df1.set_index('MainId'), on='MainId')
should give you the dataframe you want. You can then use it to execute your comparisons, for example as sketched below.
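Building on that join, a hedged sketch of the full time-window check (column names taken from the question; assumes the time columns are already parsed with pd.to_datetime):

import pandas as pd

# pair every df2 row with the df1 rows that share its MainId
merged = df2.merge(df1, on='MainId', how='left')
in_window = merged['Time'].between(merged['startTime'], merged['endTime'])
matched = merged[in_window]

# df2 rows with no in-window match keep a single NaN row
unmatched = df2[~df2['Id'].isin(matched['Id'])].copy()
unmatched['info1'] = float('nan')
unmatched['info2'] = float('nan')

out = (pd.concat([matched, unmatched], sort=False)
         [['Id', 'MainId', 'info1', 'info2', 'value']]
         .sort_values('Id')
         .reset_index(drop=True))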
Suppose I have the following dataframes in pySpark:
df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])
Now suppose I want to create df3 from joining/merging df1 and df2.
I tried doing
df1.join(df2, df1.name == df2.name, 'outer')
This doesn't quite work because it produces two name columns. I then need to somehow combine them so that missing names in one name column are filled in from the other.
How would I do that? Or is there a better way to create df3 from df1 and df2?
You can use the coalesce function, which returns the first non-null argument.
from pyspark.sql.functions import coalesce
df1 = df1.alias("df1")
df2 = df2.alias("df2")
(df1.join(df2, df1.name == df2.name, 'outer')
.withColumn("name_", coalesce("df1.name", "df2.name"))
.drop("name")
.withColumnRenamed("name_", "name"))
This is a little late, but there is a simpler solution if someone needs it. Just a simple change from the original poster's solution: pass the join key by name instead of as a column expression:
df3 = df1.join(df2, 'name', 'outer')
Joining in this way will prevent the duplication of the name column. https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
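For completeness, checking the simpler join against the frames from the question (row order may vary):

df3 = df1.join(df2, 'name', 'outer')
df3.show()
# +-----+----+------+
# | name| age|weight|
# +-----+----+------+
# |james|  25|  null|
# | john|  50|   150|
# | mike|null|   115|
# +-----+----+------+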