I have two dataframes of different shapes.
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one so that the bigger dataframe gains '(POSITION, col1)', '(POSITION, col2)', '(POSITION, col3)' wherever ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the column values of the original dataframe.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on ='id', how = 'left')
Result:
I want to keep the original dataframe columns as they are.
Assuming your first dataframe (with positions) is called df1 and the second is called df2, you can use pandas.DataFrame.merge (or equivalently pd.merge(...)) on your loaded data:
df = pd.merge(df1,df2,left_on='id', right_on='ANTENNA1')
Then select the columns you need (col1, col2, ..) to get the desired result: df[["col1","col2",..]].
Simple example:
import pandas as pd
# creating dataframes df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge',
                             'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})
# merging df1 and df2 by ID
# i.e. the rows with common ID's get
# merged i.e. {1,2,5,8}
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
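Applied to the antenna case from the question, a left merge keeps every row of the bigger dataframe and only appends the position columns. The following is a minimal sketch under assumptions: the sample values and the 'DATA' column are invented for illustration, and the position columns are flattened to plain 'col1', 'col2', 'col3' names rather than '(POSITION, col1)' tuples.

```python
import pandas as pd

# Hypothetical stand-ins for the question's data; only ANTENNA1, ANTENNA2
# and the ID key come from the question, everything else is assumed.
df_main = pd.DataFrame({
    'ANTENNA1': [0, 1, 1, 2],
    'ANTENNA2': [1, 2, 3, 3],
    'DATA': [10.0, 11.0, 12.0, 13.0],
})
df_sub = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'col1': [0.0, 1.5, 3.0, 4.5],
    'col2': [0.0, 2.5, 5.0, 7.5],
    'col3': [1.0, 1.0, 1.0, 1.0],
})

# how='left' keeps every row of df_main in its original order;
# the redundant key column from df_sub is dropped afterwards.
merged = (
    df_main
    .merge(df_sub, left_on='ANTENNA1', right_on='ID', how='left')
    .drop(columns='ID')
)
print(merged)
```

Because the keys in df_sub are unique here, the merged frame has exactly as many rows as df_main, and the original columns are left unchanged.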
I have two dataframes: one for sales, the other for customers.
I need to create a third dataframe, but using merge doesn't work:
df1 = pd.read_csv('sales.csv', sep=';')
df2 = pd.read_csv('cust.csv', sep=';')
# df1.shape -> (423413, 21)
# df2.shape -> (231286, 12)
Of course, some customers made more than one purchase. But when I use merge, the resulting dataframe has more rows than the two input dataframes combined, and it doesn't matter which method I use: inner, outer, right, or left. It always gives a row count greater than the sum of the two.
data_sum = df1.merge(df2, on='Id_Customer', how='left')
# data_sum.shape -> (745711, 32)
I've been trying merge, join, and concat, and nothing works.
Some customers made more than one purchase, and some customers never purchased anything.
How can I create a new dataframe that takes the customer ID from the sales table, finds the matching record in the customers table, and presents the sales and customer data together?
The row count might be a result of how the DataFrames were processed: sales without a customer ID and vice versa, etc. As suggested in the comments, it is difficult to say without the data, but the example below demonstrates that the merge function works as expected.
import pandas as pd
cust = pd.DataFrame({
'adress': [1, 2, 3, 4, 5],
'id': [0, 1, 2, 3, 4]
})
sales = pd.DataFrame({
'nr': [10, 12, 13, 14, 15, 20, 1],
'id': [0, 1, 2, 3, 4, 0, 10]
})
merge = pd.merge(cust, sales, how='outer')
print(merge)
merge2 = pd.merge(cust, sales, how='inner')
print(merge2)
merge3 = pd.merge(cust, sales, how='left')
print(merge3)
merge4 = pd.merge(cust, sales, how='right')
print(merge4)
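One common cause of a left merge returning more rows than the left table is a key that appears more than once in the right table. Whether that applies to the asker's files cannot be confirmed without the data, so the following is only a sketch with invented values; 'Id_Customer' is the key from the question, while 'amount' and 'name' are hypothetical columns.

```python
import pandas as pd

# Hypothetical data: customer 1 appears twice in the customer table.
sales = pd.DataFrame({'Id_Customer': [1, 1, 2, 3],
                      'amount': [10, 20, 30, 40]})
cust = pd.DataFrame({'Id_Customer': [1, 1, 2, 4],
                     'name': ['Ann', 'Ann (dup)', 'Bob', 'Cy']})

# Each sale of customer 1 matches both customer rows, so rows multiply.
merged = sales.merge(cust, on='Id_Customer', how='left')
print(len(sales), len(merged))

# Count repeated keys in the lookup table.
dupes = cust['Id_Customer'].duplicated().sum()
print(dupes)

# After de-duplicating, validate='many_to_one' makes pandas raise
# if the right-hand keys are still not unique.
cust_unique = cust.drop_duplicates(subset='Id_Customer')
clean = sales.merge(cust_unique, on='Id_Customer', how='left',
                    validate='many_to_one')
print(len(clean))
```

With unique customer keys, the left merge returns exactly one row per sale, including sales whose customer is missing from the customer table.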
I have created a dataframe, and I need to do two operations:
Convert it to a list
Convert the same list back to a dataframe with the original column names.
Issue: I am losing the column names when I first convert to a list, and when I convert back to a dataframe I do not get those column names.
Please help!
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
#convert df to list
a=df.values.tolist()
#convert back to original dataframe
df1 = pd.DataFrame(a)
print(df1)
Current output
I am unable to get the column names back.
You need to pass the column names via df.columns; if the original index is not the default one, pass it too:
df1 = pd.DataFrame(a, columns=df.columns, index=df.index)
If the original DataFrame has the default RangeIndex:
df1 = pd.DataFrame(a, columns=df.columns)
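To illustrate why the index may need to be passed as well, here is a small sketch in which the question's data is given a non-default index. The labels 'a', 'b', 'c' and the variable names rows, no_labels, and restored are assumptions made for this example.

```python
import pandas as pd

# Same data as in the question, but with a hypothetical string index.
df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'], index=['a', 'b', 'c'])

# The list of rows carries neither column names nor index labels.
rows = df.values.tolist()

# Without labels, pandas falls back to 0..n-1 for both axes.
no_labels = pd.DataFrame(rows)

# Passing both restores the original labels.
restored = pd.DataFrame(rows, columns=df.columns, index=df.index)
print(restored)
```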
EDIT:
If you need a similar structure, use DataFrame.to_dict with orient='split', which converts the DataFrame to a dictionary of column names, index, and data:
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
d = df.to_dict(orient='split')
print (d)
{'index': [0, 1, 2],
'columns': ['Name', 'Age'],
'data': [['tom', 10], ['nick', 15], ['juli', 14]]}
To rebuild the original DataFrame, use:
df2 = pd.DataFrame(d['data'], index=d['index'], columns=d['columns'])
print (df2)
   Name  Age
0   tom   10
1  nick   15
2  juli   14
I am a newbie in Python and I need some help.
I have 2 dataframes, loaded from two tables, each containing a list of users with a list of recommended friends.
I would like to achieve the following:
Sort the list of recommended friends in ascending order for each user, in both dataframes.
Match the recommended friends from dataframe2 against dataframe1 for each user, and return only the matched values.
I have tried the code below, but it didn't achieve the desired results.
import pandas as pd
import numpy as np
# load data from csv
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
# convert values to list to sort by recommended friends ID
df1.values.tolist()
df1.sort_values(by=['User','RecommendedFriends'])
df2.values.tolist()
df2.sort_values(by=['User','RecommendedFriends'])
# obtain only matched values from list of recommended friends from df1 and df2.
df3 = df1.merge(df2, how='inner', on='User')
# return dataframe with user, matched recommendedfriends ID
print(df3)
Problem encountered:
The elements in each list are not sorted in ascending order.
When matching the elements through pandas merge with an inner join, it seems that certain elements are not read.
Update: Below are the dataframe headers, which cause an error in the code.
This should be the solution to your problem. You might have to change a few variables but you get the idea: you merge the two dataframes on the users, so you get a dataframe with both lists for each user. You then take the intersection of both lists and store that in a new column.
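import pandas as pd
# Build the frames from dicts; wrapping ragged nested lists in np.array
# can raise an error on recent NumPy versions
df1 = pd.DataFrame({'User': [1, 2],
                    'Recommended friends': [[5, 7, 10, 11], [3, 8, 5, 12]]})
df2 = pd.DataFrame({'User': [1, 2, 3],
                    'Recommended friends': [[5, 7, 9], [4, 7, 10], [15, 7, 9]]})
# The inner merge keeps users present in both frames; the list columns get _x/_y suffixes
df3 = pd.merge(df1, df2, on='User')
# Sorted intersection of the two lists for each user
df3['intersection'] = [sorted(set(a) & set(b)) for a, b in zip(df3['Recommended friends_x'], df3['Recommended friends_y'])]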
Output df3:
User Recommended friends_x Recommended friends_y intersection
0 1 [5, 7, 10, 11] [5, 7, 9] [5, 7]
1 2 [3, 8, 5, 12] [4, 7, 10] []
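If the CSV files instead store one user/friend pair per row, which is an assumption since the actual file layout is not shown, the matching can be done by merging on both columns and sorting afterwards. The data below is invented, and the names common, influence, and matched are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one (User, RecommendedFriends) pair per row.
common = pd.DataFrame({'User': [1, 1, 1, 2, 2],
                       'RecommendedFriends': [7, 5, 10, 3, 8]})
influence = pd.DataFrame({'User': [1, 1, 2, 3],
                          'RecommendedFriends': [5, 7, 4, 15]})

# An inner merge on both columns keeps only pairs present in both frames;
# sort_values returns a new frame, so the whole chain is assigned at once.
matched = (
    common
    .merge(influence, on=['User', 'RecommendedFriends'], how='inner')
    .sort_values(['User', 'RecommendedFriends'])
    .reset_index(drop=True)
)
print(matched)
```

Merging on 'User' alone would pair every friend of a user in one frame with every friend of that user in the other, which is why both columns are used as keys here.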
I do not quite understand what exactly your problem is, but in general sort_values returns a new DataFrame rather than sorting in place, so you have to assign the result back to the variable.
import pandas as pd
import numpy as np
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
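# sort_values returns a new DataFrame, so assign the result back;
# calling values.tolist() first would turn df1 into a plain list, which has no sort_values
df1 = df1.sort_values(by=['User','RecommendedFriends'])
df2 = df2.sort_values(by=['User','RecommendedFriends'])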
df3 = df1.merge(df2, how='inner', on='User')
print(df3)
When there is a DataFrame like the following:
import pandas as pd
df = pd.DataFrame(1, index=[100, 29, 234, 1, 150], columns=['A'])
How can I sort this dataframe by index with each combination of index and column value intact?
Dataframes have a sort_index method which returns a copy by default. Pass inplace=True to operate in place.
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], columns=['A'])
df.sort_index(inplace=True)
print(df.to_string())
Gives me:
     A
1    4
29   2
100  1
150  5
234  3
Slightly more compact:
df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150], columns=['A'])
df = df.sort_index()
print(df)
Note:
sort has been deprecated (and removed in later pandas versions); sort_index replaces it for this scenario
It is preferable not to use inplace, as it is usually harder to read and prevents chaining. See the explanation in this answer:
Pandas: peculiar performance drop for inplace rename after dropna
If the DataFrame index has a name, then you can use sort_values() to sort by that name as well. For example, if the index is named lvl_0, you can sort by this name. This case is common if the dataframe is obtained from a groupby or a pivot_table operation.
df = df.sort_values('lvl_0')
If the index has name(s), you can even sort by both index and a column value. For example, the following sorts by both the index and the column A values:
df = df.sort_values(['lvl_0', 'A'])
If you have a MultiIndex dataframe, you can sort by index level using the level= parameter. For example, to sort by the second level in descending order and the first level in ascending order, use the following code.
df = df.sort_index(level=[1, 0], ascending=[False, True])
If the index levels have names, you can again call sort_values(). For example, the following sorts by index levels 'lvl_1' and 'lvl_2'.
df = df.sort_values(['lvl_1', 'lvl_2'])
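The one-liners above can be tried on a small MultiIndex frame. This is a sketch with invented data; the level names 'lvl_1' and 'lvl_2' follow the answer, while the values and the names by_name and by_level are assumptions. Sorting by index level names through sort_values requires a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical two-level index with named levels.
df = pd.DataFrame(
    {'A': [3, 1, 2, 1]},
    index=pd.MultiIndex.from_tuples(
        [('b', 2), ('a', 1), ('b', 1), ('a', 2)],
        names=['lvl_1', 'lvl_2'],
    ),
)

# Sort by the named index levels, both ascending.
by_name = df.sort_values(['lvl_1', 'lvl_2'])
print(by_name)

# Sort by the second level descending, then the first level ascending.
by_level = df.sort_index(level=[1, 0], ascending=[False, True])
print(by_level)
```

Both calls return new DataFrames and leave df untouched, which keeps them usable in method chains.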