how to map two dataframes with pandas [duplicate] - python

I have two Excel files:
+ File one contains specific data about different customers (like: Sex, Age, Name...) and
+ File two contains different transactions for each customer
I want to create a new column in File2 containing the specific data for each Customer from File1.

file1.csv
customer_id,sex,age,name
af4wf3,m,12,mike
z20ask,f,15,sam
file2.csv
transaction_id,customer_id,amount
12h2j4hk,af4wf3,123.20
12h2j4h1,af4wf3,5.22
12h2j4h2,z20ask,13.20
12h2j4h3,af4wf3,1.20
12h2j4h4,z20ask,2341.12
12h2j4h5,z20ask,235.96
12h2j4h6,af4wf3,999.30
Load and join the dataframes
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# index df1 by customer_id so the join can look each customer up by that key
df1.set_index('customer_id', inplace=True)
df2.set_index('transaction_id', inplace=True)

# for each transaction, pull in that customer's columns from df1
output = df2.join(df1, on='customer_id')
output.to_csv('file2_updated.csv')
file2_updated.csv
transaction_id,customer_id,amount,sex,age,name
12h2j4hk,af4wf3,123.2,m,12,mike
12h2j4h1,af4wf3,5.22,m,12,mike
12h2j4h2,z20ask,13.2,f,15,sam
12h2j4h3,af4wf3,1.2,m,12,mike
12h2j4h4,z20ask,2341.12,f,15,sam
12h2j4h5,z20ask,235.96,f,15,sam
12h2j4h6,af4wf3,999.3,m,12,mike
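
If you only need one attribute rather than the whole set of columns, Series.map is a lightweight alternative to the join; a small sketch, reusing df1 and df2 as indexed above:

# look up each transaction's customer in df1's index and copy over one column
df2['name'] = df2['customer_id'].map(df1['name'])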

The same as @jc416's answer, but using pd.merge (with df1 and df2 as freshly loaded, before setting any index):
df2.merge(df1, on='customer_id')
transaction_id customer_id amount sex age name
0 12h2j4hk af4wf3 123.2 m 12 mike
1 12h2j4h1 af4wf3 5.22 m 12 mike
2 12h2j4h3 af4wf3 1.2 m 12 mike
3 12h2j4h6 af4wf3 999.3 m 12 mike
4 12h2j4h2 z20ask 13.2 f 15 sam
5 12h2j4h4 z20ask 2341.12 f 15 sam
6 12h2j4h5 z20ask 235.96 f 15 sam
You should definitely read Pandas Merging 101.

Related

Add new columns to dataframe based on index from another dataframe in Python [duplicate]

I have two dataframes.
df looks like this:
Title    Description  Book ID
Title A  randomDesc   14563
Title B  randomDesc   22631
Title C  randomDesc   09452
Title D  randomDesc   87243
My second dataframe has different information:
Book ID  Date       Rate  user ID
14563    29/4/2021  8     90
22631    30/9/1990  6.5   87
09452    4/6/2000   4     90
87243    9/11/2017  9.5   30
22631    30/9/1990  9     30
I want to add the Title and Description data to the second dataframe based on the Book ID.
DataFrame.merge() might be a solution:
df2.merge(df1, how="left", on="Book ID")
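
A minimal, self-contained sketch of that merge; the frames are typed in from the tables above (the variable names df1 and df2 are assumptions, and Book ID is kept as a string so the leading zero in 09452 survives):

import pandas as pd

df1 = pd.DataFrame({'Title': ['Title A', 'Title B', 'Title C', 'Title D'],
                    'Description': ['randomDesc'] * 4,
                    'Book ID': ['14563', '22631', '09452', '87243']})
df2 = pd.DataFrame({'Book ID': ['14563', '22631', '09452', '87243', '22631'],
                    'Date': ['29/4/2021', '30/9/1990', '4/6/2000', '9/11/2017', '30/9/1990'],
                    'Rate': [8, 6.5, 4, 9.5, 9],
                    'user ID': [90, 87, 90, 30, 30]})

# left merge keeps every row of df2 and pulls Title/Description in by Book ID
out = df2.merge(df1, how='left', on='Book ID')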

Calculate a count of groupby rows that occur within a rolling window of days in Pandas

I have the following dataframe:
import pandas as pd

# create DF
d = {'Name': ['Jim', 'Jim', 'Jim', 'Jim', 'Jack', 'Jack'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '26/01/2021', '20/02/2021']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df
I would like to add a column (to this same dataframe) calculating how many rows have occurred in the last 28 days, grouped by Name. Does anyone know the most efficient way to do this over 200,000 rows of data, with about 1,000 different Names?
The new column values should be 1, 2, 3, 3, 1, 2. Any help would be much appreciated! Thanks!
Set the index of the dataframe to Date, then group by Name and apply a rolling count over a 28-day window closed on both ends:
df['count'] = (df.set_index('Date')
                 .groupby('Name', sort=False)['Name']
                 .rolling('28d', closed='both')
                 .count()
                 .tolist())
Name Date count
0 Jim 2021-01-08 1.0
1 Jim 2021-01-27 2.0
2 Jim 2021-02-05 3.0
3 Jim 2021-02-10 3.0
4 Jack 2021-01-26 1.0
5 Jack 2021-02-20 2.0
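
One caveat: .tolist() relies on the rolling result coming back in the same row order as df, which holds here because each name's dates arrive pre-sorted. If you'd rather not depend on that, a slower but order-independent sketch (quadratic per group, so fine for modest group sizes) gives the same 1, 2, 3, 3, 1, 2:

def last_28d_count(dates):
    # for each date, count the group's rows falling in [date - 28 days, date]
    return dates.apply(lambda d: ((dates >= d - pd.Timedelta(days=28)) & (dates <= d)).sum())

df['count'] = df.groupby('Name')['Date'].transform(last_28d_count)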

Multiple columns into one [duplicate]

I have a large table with multiple columns as the input, in the format below:
Col-A Col-B Col-C Col-D Col-E Col-F
001 10 01/01/2020 123456 123123 123321
001 20 01/02/2020 123456 123111
002 10 01/03/2020 111000 111123
And I'd like to write code that outputs one line per value for each Col-A, so that instead of the multiple columns Col-D, Col-E, Col-F there is only Col-D:
Col-A Col-B Col-C Col-D
001 10 01/01/2020 123456
001 10 01/01/2020 123123
001 10 01/01/2020 123321
001 20 01/02/2020 123456
001 20 01/02/2020 123111
002 10 01/03/2020 111000
002 10 01/03/2020 111123
Any ideas will be appreciated,
Thanks,
Nurbek
You can use pd.melt:
import pandas as pd

newdf = pd.melt(
    df,
    id_vars=['Col-A', 'Col-B', 'Col-C'],
    value_vars=['Col-D', 'Col-E', 'Col-F']
).dropna()
This will drop 'Col-D', 'Col-E' and 'Col-F', and create two new columns, variable and value. The variable column denotes which column each value came from. To get exactly what you want, drop the variable column and rename the value column to Col-D.
newdf = newdf.drop(['variable'], axis=1)
newdf = newdf.rename(columns={"value":"Col-D"})
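
The whole thing can also be written as a single chained expression; value_name names the output column directly, so the separate rename step disappears:

newdf = (df.melt(id_vars=['Col-A', 'Col-B', 'Col-C'], value_name='Col-D')
           .dropna()
           .drop(columns='variable'))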
What about something like this:
df2 = df[["Col-A", "Col-B", "Col-C", "Col-D"]]
columns = ["Col-E", "Col-F"]  # ..., up to "Col-Z", whichever value columns you have
for col in columns:
    extra = df[["Col-A", "Col-B", "Col-C", col]].rename(columns={col: "Col-D"})
    df2 = pd.concat([df2, extra]).reset_index(drop=True)
You just append each remaining column (renamed to Col-D) onto your original dataframe. Note that DataFrame.append never modified the frame in place and has since been removed from pandas, so pd.concat is used here instead.

When importing csv to PANDAS, how can you only import columns that contain a specified string within them?

I have thousands of CSV files that each contain hundreds of columns and hundreds of thousands of rows. For speed, I want to import into pandas dataframes only the data I need. I can filter out the CSV files I do not need using a separate metadata file, but I am having trouble figuring out how to drop the columns that I do not need during the import (I know how to filter the columns of a dataframe after it has been imported, but as I said, I am trying to avoid importing unnecessary data).
So let's say I have the following csv file:
Date/Time Apple Tart Cherry Pie Blueberry Pie Banana Pudding Tomato Soup
1:00 2 4 7 6 5
2:00 3 5 4 5 8
3:00 1 4 7 4 4
I want to import only the columns whose names include the text "Pie", as well as the "Date/Time" column. Also note that the column names and the number of columns differ across my csv files, so a fixed usecols list has not worked for me, since I do not know the specific column names to enter.
The usecols parameter of pandas read_csv accepts a function to filter for the columns you are interested in:
import pandas as pd
from io import StringIO

data = """Date/Time  Apple Tart  Cherry Pie  Blueberry Pie  Banana Pudding  Tomato Soup
1:00  2  4  7  6  5
2:00  3  5  4  5  8
3:00  1  4  7  4  4"""

df = pd.read_csv(StringIO(data),
                 sep=r'\s{2,}',
                 engine='python',
                 # this is the key part of the code for your use case:
                 # keep only columns that contain "Pie" or "Date/Time";
                 # quite extensible as well, since it accepts any function
                 usecols=lambda x: ("Pie" in x) or ("Date/Time" in x))
df
Date/Time Cherry Pie Blueberry Pie
0 1:00 4 7
1 2:00 5 4
2 3:00 4 7
You can also select columns by name when you use read_csv(), passing a list to usecols, for example:
df = pd.read_csv('fila.csv', usecols=['columnName1', 'columnName3'])
Note that I did not include 'columnName2'.

Best strategy for merging a lot of data frames using pandas

I'm trying to merge many data frames (a few thousand one-column tsv files) into a single csv file using pandas. I'm new to pandas (and Python, for that matter) and could use some input or direction.
My data frames are observational data on a list scraped from the web and do not contain headers. For example:
data frame 1:
bluebird 34
chickadee 168
eagle 10
hawk 67
sparrow 2
data frame 2:
albatross 56
bluebird 78
hawk 3
pelican 19
sparrow 178
What I'm looking to do is simply create a master file with all of the individual observations:
albatross 0 56
bluebird 34 78
chickadee 168 0
eagle 10 0
hawk 67 3
pelican 0 19
sparrow 2 178
I've tried to merge the data frames one at a time using pandas:
import pandas as pd
df1 = pd.read_table("~/home/birds1.tsv", sep='\t')
df2 = pd.read_table("~/home/birds2.tsv", sep='\t')
merged = df1.merge(df1, df2, how="left").fillna("0")
merged.to_csv("merged.csv", index=False)
but I am only getting one column. I don't have a master list of "birds", but I can concatenate all the data and sort on unique names for a dictionary list if this is needed.
What should my strategy be for merging a few thousand files?
Look at the docs for merge: when called on a frame, the first parameter is the 'other' frame, and the second is the variable(s) you want to merge on (I'm not actually sure what happens when you pass a DataFrame there).
But, assuming your bird column is called 'bird', what you probably want is:
In [412]: df1.merge(df2, on='bird', how='outer').fillna(0)
Out[412]:
bird value_x value_y
0 bluebird 34 78
1 chickadee 168 0
2 eagle 10 0
3 hawk 67 3
4 sparrow 2 178
5 albatross 0 56
6 pelican 0 19
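
To extend this to a few thousand frames, one common pattern is to fold merge over a list of dataframes with functools.reduce; a sketch assuming each frame has a 'bird' column and its value column has already been given a unique name (e.g. the source filename), since repeated _x/_y suffixes will otherwise collide:

from functools import reduce

# dfs: a list of frames, each with a 'bird' column plus one uniquely named value column
master = reduce(lambda left, right: left.merge(right, on='bird', how='outer'), dfs)
master = master.fillna(0)

For thousands of files, though, the concat approach in the next answer scales better than a chain of pairwise merges.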
I would think the fastest way is to set the column you want to merge on as the index, create a list of the dataframes, and then pd.concat them. Something like this:
import os
import pandas as pd

directory = os.path.expanduser('~/home')

dfs = []
for filename in os.listdir(directory):
    if filename.endswith('.tsv'):
        # the files have no header row, so name the columns explicitly;
        # using the filename as the value column keeps the columns distinct
        df = pd.read_table(os.path.join(directory, filename), sep='\t',
                           header=None, names=['bird', filename])
        dfs.append(df.set_index('bird'))

# align all frames on the bird index, one column per file
master_df = pd.concat(dfs, axis=1).fillna(0)
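
master_df is then indexed by bird with one column per input file, so producing the single csv you asked for is just:

master_df.to_csv('merged.csv')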
