I have a script that converts CSV files to a fixed column order.
import pandas as pd
df = pd.read_csv("sample1.csv")
final_df = df.reindex(['id', 'name', 'email'], axis=1)
final_df.to_csv("output.csv", index=False)
sample1.csv
|name|email|id|
|--| -- | -- |
output.csv
|id|name|email|
|--| -- | -- |
Now, if the other sample files come in formats like those below, how can I rearrange them into the same format as output.csv?
sample2.csv
|id|first name |email address|
|--| -- | -- |
|1 | Sta |sta#example.com|
|2 |Danny|dany#example.com|
|3 |Elle |elle#example.com|
sample3.csv
|id|initial name |email id|
|--| -- | -- |
|1 | Ricky|ricky#example.com|
|2 |Sham|sham#example.com|
|3 |Mark|#example.com|
sample4.csv
| id |alias|contact|
|-- | -- | -- |
| 1 | Ricky|ricky#example.com|
|2 |Sham|sham#example.com|
|3 |Mark|#example.com|
I want to convert these files and place their data in the columns of the output file. For example, first name, initial name, and alias all refer to name, while email address, email id, and contact all refer to email. The order of columns can vary between the sample files.
The basic illustration for this case is:
switch (headerfields[i])
{
    case "first name":
    case "initial name":
    case "alias":
        name = i;
        break;
}
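In Python terms, the same idea could be expressed as a plain synonym dictionary; here is a minimal sketch (the header names are taken from the samples above):
# Hypothetical synonym map from sample headers to target headers
COLUMN_SYNONYMS = {
    'first name': 'name', 'initial name': 'name', 'alias': 'name',
    'email address': 'email', 'email id': 'email', 'contact': 'email',
}

def normalize_headers(df):
    # rename any known synonym; headers already named 'id', 'name', 'email' pass through
    return df.rename(columns=COLUMN_SYNONYMS)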
Any ideas to do this in Pandas?
Select the target columns, then concatenate them onto the target DataFrame.
dfn = pd.DataFrame(columns=['id', 'name', 'email'])
for df in [df1, df2, df3]:
    # select columns
    cond_list = [
        df.columns == 'id',
        df.columns.str.contains('name|alias', na=False),
        df.columns.str.contains('email|contact', na=False)
    ]
    cols = [df.columns[cond][0] for cond in cond_list]
    print(cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported way
    dfn = pd.concat([dfn, pd.DataFrame(df[cols].values, columns=dfn.columns)])
output:
['id', 'first name', 'email address']
['id', 'initial name', 'email id']
['id', 'alias', 'contact']
dfn:
id name email
0 1 Sta sta#example.com
1 2 Danny dany#example.com
2 3 Elle elle#example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
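Note the repeated 0-2 index values in dfn above; if a continuous index is preferred, one option is to let pd.concat renumber the rows:
dfn = pd.concat([dfn, pd.DataFrame(df[cols].values, columns=dfn.columns)],
                ignore_index=True)  # renumbers the combined rows 0..n-1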
Testing data:
import io

df_str = '''
id "first name" "email address"
1 Sta sta#example.com
2 Danny dany#example.com
3 Elle elle#example.com
'''
df1 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
df_str = '''
id "initial name" "email id"
1 Ricky ricky#example.com
2 Sham sham#example.com
3 Mark #example.com
'''
df2 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
df_str = '''
id alias contact
1 Ricky ricky#example.com
2 Sham sham#example.com
3 Mark #example.com
'''
df3 = pd.read_csv(io.StringIO(df_str.strip()), sep=r'\s+', index_col=False)
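# Add a dummy numeric column and sort the headers so each test frame ends up with a different column order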
df1['1'] = 1
df2['2'] = 2
df3['3'] = 3
df1.sort_index(axis=1, inplace=True)
df2.sort_index(axis=1, inplace=True)
df3.sort_index(axis=1, inplace=True)
Not the cleanest solution, but you could test the first 5 rows of each dataframe for certain strings/numbers and assume that's your target column.
import numpy as np
import pandas as pd
def rename_and_merge_dfs(dfs: list) -> pd.DataFrame:
    new_dfs = []
    for frame in dfs:
        # the first numeric column is assumed to be the id
        id_col = frame.head(5).select_dtypes(np.number).columns[0]
        # the email column is the one whose sampled values each contain exactly one '#'
        email = frame.columns[frame.head(5).replace('[^#]', '', regex=True).eq('#').all()][0]
        # whatever single column remains is treated as the name
        name = list(set(frame.columns) - set([id_col, email]))[0]
        frame = frame.rename(columns={id_col: 'id', email: 'email', name: 'name'})
        new_dfs.append(frame)
    return pd.concat(new_dfs)
final = rename_and_merge_dfs([df3,df2,df1])
print(final)
id name email
1 1 Ricky ricky#example.com
2 2 Sham sham#example.com
3 3 Mark #example.com
0 1 Ricky ricky#example.com
1 2 Sham sham#example.com
2 3 Mark #example.com
0 1 Sta sta#example.com
1 2 Danny dany#example.com
2 3 Elle elle#example.com
This solved my problem.
import pandas as pd
sample1 = pd.read_csv('sample1.csv')
def mapping(df):
    # a single rename call covers all known synonyms; headers that don't match are left untouched
    df.rename(columns={'first name': 'FNAME', 'email address': 'EMAIL',
                       'alias': 'FNAME', 'contact': 'EMAIL',
                       'initial name': 'FNAME', 'email id': 'EMAIL'}, inplace=True)
mapping(sample1)
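Applied to the other samples end to end, a rough sketch could look like this (the output file name 'combined.csv' is just a placeholder):
import pandas as pd

frames = []
for path in ['sample2.csv', 'sample3.csv', 'sample4.csv']:
    df = pd.read_csv(path)
    mapping(df)  # normalizes the headers in place
    frames.append(df)

combined = pd.concat(frames).reindex(['id', 'FNAME', 'EMAIL'], axis=1)
combined.to_csv('combined.csv', index=False)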
Related
In a one-sheet Excel file that I created through SQL, I have 3 columns that represent letter ratings. The values may differ between Rating_1, Rating_2, and Rating_3, but two or more of them can also tie on the same rating.
I am trying to create a new column in my Excel file that takes these 3 letter ratings and pulls the middle rating.
| ranking (1 lowest) | Rating_1 | Rating_2 | Rating_3 | NEW_COLUMN |
| -- | -- | -- | -- | -- |
| 3 | A+ | AA | Aa | middle(rating) |
| 2 | B+ | BB | Bb | middle(rating) |
| 1 | Fa | Fb | Fc | middle(rating) |
There are three scenarios I need to account for:
- if all three ratings differ, pick the rating that is neither the highest nor the lowest
- if all three ratings are the same, pick Rating_1
- if 2 of the ratings are the same but one is different, pick the minimum rating
I created a dataframe:
df = pd.DataFrame(
    {"Rating_1": ["A+", "AA", "Aa"],
     "Rating_2": ["B+", "BB", "Bb"],
     "Rating_3": ["Fa", "Fb", "Fc"]}
)
df["NEW COLUMN"] = {insert logic here}
Or is it easier to create a new DF that filters down the original DF?
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "Rating_1": ["A+", "Cc", "Aa"],
        "Rating_2": ["AA", "Cc", "Aa"],
        "Rating_3": ["BB", "Cc", "Bb"],
    }
)
print(df)
# Output
Rating_1 Rating_2 Rating_3
0 A+ AA BB
1 Cc Cc Cc
2 Aa Aa Bb
Here is one way to do it using Python sets to check conditions:
# First condition
df["Middle_rating"] = df.apply(
    lambda x: sorted([x["Rating_1"], x["Rating_2"], x["Rating_3"]])[1]
    if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 3
    else "",
    axis=1,
)
# Second condition
df["Middle_rating"] = df.apply(
    lambda x: x["Rating_1"]
    if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 1
    else x["Middle_rating"],
    axis=1,
)
# Third condition
ratings = {
    rating: i
    for i, rating in enumerate(["A+", "AA", "Aa", "B+", "BB", "Bb", "C+", "CC", "Cc"])
}  # ratings ordered from best (A+: 0) to worst (Cc: 8)
df["Middle_rating"] = df.apply(
    # key=ratings.get picks the worst-ranked (i.e. minimum) rating rather than the lexicographic max
    lambda x: max(x["Rating_1"], x["Rating_2"], x["Rating_3"], key=ratings.get)
    if len(
        set([ratings[x["Rating_1"]], ratings[x["Rating_2"]], ratings[x["Rating_3"]]])
    )
    == 2
    else x["Middle_rating"],
    axis=1,
)
Then:
print(df)
# Output
Rating_1 Rating_2 Rating_3 Middle_rating
0 A+ AA BB AA
1 Cc Cc Cc Cc
2 Aa Aa Bb Bb
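The three apply passes can also be collapsed into a single function; here is a sketch reusing the ratings dict defined above (it assumes every rating value appears in that dict):
def middle_rating(row):
    vals = [row["Rating_1"], row["Rating_2"], row["Rating_3"]]
    unique = set(vals)
    if len(unique) == 3:
        return sorted(vals, key=ratings.get)[1]   # all differ: take the middle one
    if len(unique) == 1:
        return vals[0]                            # all equal: take Rating_1
    return max(vals, key=ratings.get)             # two equal: take the minimum (worst) rating

df["Middle_rating"] = df.apply(middle_rating, axis=1)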
I have three columns in my data frame:
| CaseID | FirstName | LastName |
| -- | -- | -- |
| 1 | rohit | pandey |
| 2 | | rai |
| 3 | | |
In the output, I am trying to add a fourth column whose values are LastName, FirstName.
I have this Python code:
df_ids['ContactName'] = df_ids[['LastName', 'FirstName']].agg(lambda x: ','.join(x.values), axis=1)
But it joins the blank values too, so I get something like this:
| CaseID | FirstName | LastName | ContactName |
| -- | -- | -- | -- |
| 1 | rohit | pandey | pandey, rohit |
| 2 | | rai | , rai |
| 3 | | | , |
The expected output:
| CaseID | FirstName | LastName | ContactName |
| -- | -- | -- | -- |
| 1 | rohit | pandey | pandey, rohit |
| 2 | | rai | rai |
| 3 | | | |
Someone has added the PySpark tag, so here is a PySpark version (concat_ws skips nulls, which is why the empty strings are converted first):
from pyspark.sql import functions as F
df_ids = df_ids.replace('', None) # Replaces empty strings with nulls
df_ids = df_ids.withColumn('ContactName', F.concat_ws(', ', 'LastName', 'FirstName'))
df_ids = df_ids.fillna('') # Replaces nulls back to empty strings
df_ids.show()
# +------+---------+--------+-------------+
# |CaseID|FirstName|LastName| ContactName|
# +------+---------+--------+-------------+
# | 1| rohit| pandey|pandey, rohit|
# | 2| | rai| rai|
# | 3| | | |
# +------+---------+--------+-------------+
This is the easy way, using apply. apply takes each row one at a time and passes it to the given function.
import pandas as pd
data = [
    [1, 'rohit', 'pandey'],
    [2, '', 'rai'],
    [3, '', ''],
]
df = pd.DataFrame(data, columns=['CaseID', 'FirstName', 'LastName'])
def fixup(row):
    # no LastName: nothing to show at all
    if not row['LastName']:
        return ''
    # only LastName present
    if not row['FirstName']:
        return row['LastName']
    return row['LastName'] + ', ' + row['FirstName']
print(df)
df['Contact1'] = df.apply(fixup, axis=1)
print(df)
Output:
CaseID FirstName LastName
0 1 rohit pandey
1 2 rai
2 3
CaseID FirstName LastName Contact1
0 1 rohit pandey pandey, rohit
1 2 rai rai
2 3
Two (actually 1 and a half) other options, which are very close to your attempt:
df_ids['ContactName'] = (
    df_ids[['LastName', 'FirstName']]
    .agg(lambda row: ', '.join(name for name in row if name), axis=1)
)
or
df_ids['ContactName'] = (
    df_ids[['LastName', 'FirstName']]
    .agg(lambda row: ', '.join(filter(None, row)), axis=1)
)
In both versions the ''s are filtered out:
- via the generator expression: the if name makes sure that '' isn't allowed, because its truth value is False (try print(bool(''))), or
- via the built-in function filter() with its first argument set to None, which keeps only truthy items.
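As a quick check of both filtering mechanisms on a row with an empty FirstName (values taken from the example data):
row = ['rai', '']  # LastName present, FirstName empty
print(', '.join(name for name in row if name))  # rai
print(', '.join(filter(None, row)))             # rai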
I have some code whose prints look pretty weird, and I want to fix it.
The prints:
Matching Score
0 john carry 73.684211
Matching Score
0 alex midlane 80.0
Matching Score
0 alex midlane 50.0
Matching Score
0 robert patt 53.333333
Matching Score
0 robert patt 70.588235
Matching Score
0 david baker 100.0
The format I need:
| Matching | Score |
| ------------ | -----------|
| john carry | 73.684211 |
| alex midlane | 80.0 |
| alex midlane | 50.0 |
| robert patt | 53.333333 |
| robert patt | 70.588235 |
| david baker | 100.0 |
My code:
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz
df = pd.DataFrame({
    "NameTest": ["john carry", "alex midlane", "robert patt", "david baker", np.nan, np.nan, np.nan],
    "Name": ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]
})
NameTests = [name for name in df["NameTest"] if isinstance(name, str)]
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data = {'Matching': [match[0]],
                'Score': [match[1]]}
        df1 = pd.DataFrame(data)
        print(df1)
I have tried many ways but got the same prints. Thank you for any suggestions.
You need an array or a list to keep all the data (I use a list here), because you are creating a new dataframe in each loop iteration:
data = []
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        if match:  # extractOne returns None when nothing clears score_cutoff
            data.append({'Matching': match[0],
                         'Score': match[1]})
df1 = pd.DataFrame(data)
print(df1)
Here is the output:
       Matching       Score
0    john carry   73.684211
1  alex midlane   80.000000
2  alex midlane   50.000000
3   robert patt   53.333333
4   robert patt   70.588235
5   david baker  100.000000
You create a new dataframe in each loop iteration. You can instead store the results in a dict and create the dataframe from that dict after the loop:
data = {'Matching': [], 'Score': []}
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data['Matching'].append(match[0])
        data['Score'].append(match[1])
df1 = pd.DataFrame(data)
I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataset in the following format? Then I would have each user's taste profile and could compare users by a similarity score.
+-------+---------+--------+---------+---------+-----+
|user_id| Action |Adventure|Animation|Children|Drama|
+-------+---------+--------+---------+---------+-----+
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 1 | 0 | 1 | 0 |
+-------+---------+---------+---------+--------+-----+
Below, df is the movies dataframe and dfu is the users dataframe.
1. Split the 'genre' column into lists with pandas.Series.str.split, then use pandas.DataFrame.explode to turn each list element into its own row, replicating index values.
2. pandas.merge the two dataframes on 'movie_id'.
3. Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
4. Shape the final result:
   - .unstack converts the groupby result from long to wide format
   - .fillna replaces NaN with 0
   - .astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd
# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}
# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
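The desired output in the question is binary, while the table above contains counts (user 101 watched two Adventure movies). If a 0/1 taste profile is wanted, one option is to clip the counts:
final = final.clip(upper=1)  # any count >= 1 becomes 1, giving the 0/1 profile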
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes such as:
dfa:
| Name | ID | Amount |
| -- | -- | -- |
| Bob | V434 | 50.00 |
| Jill | B333 | 22.11 |
| Hank | B442 | 11.11 |
dfb:
| Name | ID_First | ID_Second | ID_Third |
| -- | -- | -- | -- |
| Bob | V434 | E333 | B442 |
| Karen | V434 | E333 | B442 |
| Jill | V434 | E333 | B442 |
| Hank | V434 | E333 | B442 |
I want to join dfa to dfb, but the ID in dfa corresponds to only one of the ID columns in dfb.
Is there a way to join dfa to dfb on dfa's ID so that, if it matches any of the ID columns in dfb, the Amount from dfa is carried over?
Required output would just be:
| Name | ID_First | ID_Second | ID_Third | Amount |
| -- | -- | -- | -- | -- |
| Bob | V434 | E333 | B442 | 50.00 |
| Jill | V434 | E333 | B442 | 22.11 |
| Hank | V434 | E333 | B442 | 11.11 |
Basically, join on Name, which exists in both tables; the ID from dfa appears in dfb under only one of ID_First, ID_Second, or ID_Third, so the Amount should be matched for rows with the same Name and the same ID value.
Thanks
You could attempt a merge on all three ID columns, though I am not sure how efficient that would be. This wouldn't work when you have multiple matches across IDs, if such a thing is possible. The following might work:
new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
    df = pd.merge(dfa, dfb, left_on='ID', right_on=col, how='left')
    # DataFrame.append was removed in pandas 2.0; concat the partial results instead
    new_df = df if new_df.empty else pd.concat([new_df, df])
I don't think you can have an 'OR' condition in pd.merge.
Another possibility is described here:
Python Pandas: How to merge based on an "OR" condition?
You can do 3 inner joins with each of your id columns and concatenate them
df1 = pd.DataFrame([['Bob', 'V434', 50.00], ['Jill', 'E333', 22.11], ['Hank', 'B442', 11.11]],
                   columns=['Name', 'ID', 'Amount'])
df2 = pd.DataFrame([['Bob', 'V434', 'E333', 'B442'],
                    ['Karen', 'V434', 'E333', 'B442'],
                    ['Jill', 'V434', 'E333', 'B442'],
                    ['Hank', 'V434', 'E333', 'B442']],
                   columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
print(pd.concat([df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_First', 'Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Second', 'Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Third', 'Name'])])[['Name', 'ID', 'Amount']])
Output:
Name ID Amount
0 Bob V434 50.00
0 Jill E333 22.11
0 Hank B442 11.11
Building on @Ian's answer to get the desired output:
new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
    df = pd.merge(df1, df2, left_on=['ID', 'Name'], right_on=[col, 'Name'], how='inner')
    # as above, pd.concat replaces the removed DataFrame.append
    new_df = df if new_df.empty else pd.concat([new_df, df])
Solution
You can do this with a simple merge statement as follows.
pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')
Note: While merging dfa and dfb, dfa.ID and the ID columns of dfb do not act like primary keys, nor are their values unique. The only thing that matters here is to inner join dfa and dfb on the "Name" column.
Output:
   Name  Amount ID_First ID_Second ID_Third
0   Bob   50.00     V434      E333     B442
1  Jill   22.11     V434      E333     B442
2  Hank   11.11     V434      E333     B442
For Reproducibility
You may load the data and test the solution given above using the following code block:
import numpy as np
import pandas as pd
from io import StringIO
# Example Data
dfa = """
Name | ID | Amount
Bob | V434 | 50.00
Jill | B333 | 22.11
Hank | B442 | 11.11
"""
dfb = """
Name | ID_First | ID_Second | ID_Third
Bob | V434 | E333 | B442
Karen | V434 | E333 | B442
Jill | V434 | E333 | B442
Hank | V434 | E333 | B442
"""
# Load Data and Clean up empty spaces
# in headers and columns
dfa = pd.read_csv(StringIO(dfa), sep='|')
dfb = pd.read_csv(StringIO(dfb), sep='|')
dfa.columns = dfa.columns.str.strip()
dfb.columns = dfb.columns.str.strip()
for col in dfa.columns:
    if col == 'Amount':
        dfa[col] = dfa[col].astype(str).str.strip().astype(float)
    else:
        dfa[col] = dfa[col].str.strip()
for col in dfb.columns:
    dfb[col] = dfb[col].str.strip()
# merge dfa and dfb: Note that dfa.ID and dfb.ID do not act
# like primary keys, neither are their values unique.
# The only thing that matters here is to inner join dfa
# and dfb using the "Name" column.
pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')