I keep running into this use case and I haven't found a good solution. I am asking for a solution in Python, but a solution in R would also be helpful.
I've been getting data that looks something like this:
import pandas as pd
data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '']}
df = pd.DataFrame(data)
Col1
0 Bob
1 101
2 First Street
3
4 Sue
5 102
6 Second Street
7
8 Alex
9 200
10 Third Street
11
The pattern in my real data does repeat like this. Sometimes there is a blank row (or more than 1), and sometimes there are not any blank rows. The important part here is that I need to convert this column into a row.
I want the data to look like this.
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
I have tried playing around with this, but nothing has worked. My thought was to iterate through a few rows at a time, assign the values to the appropriate column, and just build a data frame row by row.
x = len(df['Col1'])
holder = pd.DataFrame()
new_df = pd.DataFrame()
while x < 4:
    temp = df.iloc[:5]
    holder['Name'] = temp['Col1'].iloc[0]
    holder['Address'] = temp['Col1'].iloc[1]
    holder['Street'] = temp['Col1'].iloc[2]
    new_df = pd.concat([new_df, holder])
    df = temp[5:]
    df.reset_index()
    holder = pd.DataFrame()
    x = len(df['Col1'])
new_df.head(10)
In R,
data <- data.frame(
Col1 = c('Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '')
)
k<-which(grepl("Street", data$Col1) == TRUE)
j <- k-1
i <- k-2
data.frame(
  Name = data[i,],
  Address = data[j,],
  Street = data[k,]
)
Name Address Street
1 Bob 101 First Street
2 Sue 102 Second Street
3 Alex 200 Third Street
Or, if the street does not always end with "Street" but the address is always a number, you can also try
j <- which(apply(data, 1, function(x) !is.na(as.numeric(x)) ))
i <- j-1
k <- j+1
Python3
In Python 3, you can convert your DataFrame into an array and then reshape it.
n = df.shape[0]
df2 = pd.DataFrame(
data=df.to_numpy().reshape((n//4, 4), order='C'),
columns=['Name', 'Address', 'Street', 'Empty'])
For your sample data, this produces:
Name Address Street Empty
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
If you like you can remove the last column:
df2 = df2.drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
One-liner code
df2 = pd.DataFrame(data=df.to_numpy().reshape((df.shape[0]//4, 4), order='C' ), columns=['Name', 'Address', 'Street', 'Empty']).drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
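The reshape above relies on exactly one blank row after every record. Since the question says the number of blank rows varies, here is a minimal sketch for that case. It assumes every record still has exactly three non-blank values; the irregular sample data is made up for illustration.

```python
import pandas as pd

# Made-up sample: one blank, no blank, then two blanks between records
df = pd.DataFrame({'Col1': ['Bob', '101', 'First Street', '',
                            'Sue', '102', 'Second Street',
                            'Alex', '200', 'Third Street', '', '']})

# Keep only the non-blank cells, then fold them into rows of three
values = df.loc[df['Col1'].str.strip() != '', 'Col1'].to_numpy()
result = pd.DataFrame(values.reshape(-1, 3),
                      columns=['Name', 'Address', 'Street'])
print(result)
```

If the number of non-blank cells is not a multiple of three, reshape raises a ValueError, which at least flags an incomplete record instead of silently shifting columns.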
In Python, I believe this may help you.
import pandas as pd

data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '']}

# the single list of values stored under 'Col1'
var = list(data.values())[0]
var2 = []
# take the first three items of every block of four (the fourth is the blank)
for aux in range(len(var) // 4):
    var2.append(var[aux*4: aux*4+3])
data = pd.DataFrame(var2, columns=['Name', 'Address', 'Street'])
print(data)
Another R solution. This solution is based on the tidyverse package. The example data frame data is from Park's post (https://stackoverflow.com/a/69833814/7669809).
library(tidyverse)
data2 <- data %>%
mutate(ID = cumsum(Col1 %in% "")) %>%
filter(!Col1 %in% "") %>%
group_by(ID) %>%
mutate(Type = case_when(
row_number() == 1L ~"Name",
row_number() == 2L ~"Address",
row_number() == 3L ~"Street",
TRUE ~NA_character_
)) %>%
pivot_wider(names_from = "Type", values_from = "Col1") %>%
ungroup()
data2
# # A tibble: 3 x 4
# ID Name Address Street
# <int> <chr> <chr> <chr>
# 1 0 Bob 101 First Street
# 2 1 Sue 102 Second Street
# 3 2 Alex 200 Third Street
The values of the DataFrame are reshaped into an array with 4 columns and as many rows as needed (the -1). Slicing then keeps the first 3 columns, the result is converted into a DataFrame, and finally set_axis assigns the column names.
result = pd.DataFrame(df.values.reshape(-1, 4)[:, :-1])\
.set_axis(['Name', 'Address', 'Street'], axis=1)
result
>>>
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
I have a data frame containing one row:
df_1D = pd.DataFrame({'Day1':[5],
'Day2':[6],
'Day3':[7],
'ID':['AB12'],
'Country':['US'],
'Destination_A':['Miami'],
'Destination_B':['New York'],
'Destination_C':['Chicago'],
'First_Agent':['Jim'],
'Second_Agent':['Ron'],
'Third_Agent':['Cynthia']},
)
Day1 Day2 Day3 ID ... Destination_C First_Agent Second_Agent Third_Agent
0 5 6 7 AB12 ... Chicago Jim Ron Cynthia
I'm wondering if there's an easy way to transform it into a dataframe with three rows, as shown here:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
Have you tried to pivot it with .pivot function? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
One option using reshaping, which only requires knowing the final columns:
# define final columns
cols = ['Day', 'ID', 'Destination', 'Country', 'Agent']
# the part below is automatic
# ------
# extract the keywords
pattern = f"({'|'.join(cols)})"
new = df_1D.columns.str.extract(pattern)[0]
# and reshape
out = (df_1D
.set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount()]), axis=1)
.loc[0].unstack(0).ffill()[cols]
)
Output:
Day ID Destination Country Agent
0 5 AB12 Miami US Jim
1 6 AB12 New York US Ron
2 7 AB12 Chicago US Cynthia
An alternative that defines idx and cols separately:
idx = ['ID', 'Country']
cols = ['Day', 'Destination', 'Agent']
df2 = df_1D.set_index(idx)
pattern = f"({'|'.join(cols)})"
new = df2.columns.str.extract(pattern)[0]
out = (df2
.set_axis(pd.MultiIndex.from_arrays([new, new.groupby(new).cumcount().astype(str)],
names=[None, None]),
axis=1)
.stack().reset_index(idx)
)
columns_day = [col for col in df_1D if col.startswith('Day')]
columns_dest = [col for col in df_1D if col.startswith('Destination')]
columns_agent = [col for col in df_1D if 'Agent' in col]
new_df = pd.DataFrame()
new_df['Day'] = df_1D[columns_day].values.tolist()[0]
new_df['ID'] = list(df_1D['ID']) * len(new_df)
new_df['Country'] = list(df_1D['Country']) * len(new_df)
new_df['Destination'] = df_1D[columns_dest].values.tolist()[0]
new_df['Agent'] = df_1D[columns_agent].values.tolist()[0]
Out:
Day ID Country Destination Agent
0 5 AB12 US Miami Jim
1 6 AB12 US New York Ron
2 7 AB12 US Chicago Cynthia
You can use this however many times the destination columns repeat.
One option is with pivot_longer from pyjanitor, where for this case, you pass a list of regexes to names_pattern, and the new column names to names_to:
# pip install pyjanitor
import janitor
import pandas as pd
(df_1D
.pivot_longer(
index=['ID','Country'],
names_to = ['Day','Destination','Agent'],
names_pattern=['Day','Destination','Agent'])
)
ID Country Day Destination Agent
0 AB12 US 5 Miami Jim
1 AB12 US 6 New York Ron
2 AB12 US 7 Chicago Cynthia
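If installing pyjanitor is not an option, something similar can be sketched with pandas' own pd.lreshape. This is only a sketch: it assumes you are willing to list the column groups by hand, and the groups dict below is my own naming, not part of the question.

```python
import pandas as pd

df_1D = pd.DataFrame({'Day1': [5], 'Day2': [6], 'Day3': [7],
                      'ID': ['AB12'], 'Country': ['US'],
                      'Destination_A': ['Miami'],
                      'Destination_B': ['New York'],
                      'Destination_C': ['Chicago'],
                      'First_Agent': ['Jim'],
                      'Second_Agent': ['Ron'],
                      'Third_Agent': ['Cynthia']})

# Hand-written mapping: new column name -> wide columns that feed it.
# Every list must have the same length.
groups = {'Day': ['Day1', 'Day2', 'Day3'],
          'Destination': ['Destination_A', 'Destination_B', 'Destination_C'],
          'Agent': ['First_Agent', 'Second_Agent', 'Third_Agent']}

# Columns not named in groups (ID, Country) are repeated on every row.
# The explicit selection at the end only fixes the column order.
out = pd.lreshape(df_1D, groups)[['Day', 'ID', 'Country', 'Destination', 'Agent']]
print(out)
```

Listing the groups explicitly is more typing than a regex, but it does not depend on the column names following any pattern.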
I don't think this can be handled in a fully automated way; it requires some manual manipulation. This is the shortest code that comes to mind. Feel free to comment:
# assumption: d is the {column: [values]} dict behind df_1D
d = df_1D.to_dict('list')
d1 = {}
for k in ['Day', 'Destination', 'Agent']:
    d1[k] = [d[i][0] for i in d.keys() if k in i]
for k in ['ID', 'Country']:
    d1[k] = d[k] * len(d1['Day'])
d1 = pd.DataFrame(d1)
Output:
Hope this helps.
I have two dataframes in pandas. What I want to achieve is to grab every 'Name' from DF1 and get the corresponding 'City' and 'State' from DF2.
For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.
DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
. . . .
70000 Jim 27 Yes
DF1 has approx 70,000 rows with 3 columns
Second Dataframe, DF2 has approx 320,000 rows.
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
. . . . .
325082 Jim Scranton Pennsylvania
Currently I have two functions, which return the values of 'City' and 'State' using a filter.
def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field

def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field
I am using the apply function to process all the values.
df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)
The result takes a long time to compute in the above way. It roughly takes me around 18 minutes to get back the df['city_list'] and df['State_list'].
Is there a faster way to compute this? Since I am completely new to pandas, I would like to know if there is an efficient way to do it.
I believe you can do a map:
s = df2.groupby('Name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])
Or a left merge after you got s:
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')
I think you can do something like this:
# Dataframe DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])
print("DataFrame DF1")
print(DF1)
# Dataframe DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])
print("DataFrame DF2")
print(DF2)
# You do a merge on 'Name' column and then, you change the name of columns 'City' and 'State'
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)
Output:
DataFrame DF1
Name Age Student
0 Dwight 20 Yes
1 Michael 30 No
2 Pam 55 No
3 Jim 27 Yes
DataFrame DF2
Name City State
0 Dwight Miami Florida
1 Michael Scranton Pennsylvania
2 Pam Austin Texas
3 Jim Scranton Pennsylvania
DataFrame final
Name Age Student city_list State_list
0 Dwight 20 Yes Miami Florida
1 Michael 30 No Scranton Pennsylvania
2 Pam 55 No Austin Texas
3 Jim 27 Yes Scranton Pennsylvania
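Note that pd.merge defaults to an inner join, so names missing from DF2 are dropped, and a name that appears several times in DF2 produces several rows. Here is a sketch that stays closer to the original read_city/read_state behaviour (first match wins, empty string when there is no match). 'Toby' and the second 'Dwight' row are made-up data to show both cases.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Toby'],
                    'Age': [20, 30, 55, 40]})
df2 = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Dwight'],
                    'City': ['Miami', 'Scranton', 'Austin', 'Tampa'],
                    'State': ['Florida', 'Pennsylvania', 'Texas', 'Florida']})

# Keep the first row per name, like .values[0] in the original functions
lookup = (df2.drop_duplicates('Name')
             .rename(columns={'City': 'city_list', 'State': 'State_list'}))

# how='left' keeps every row of df1; unmatched names get NaN, then ''
out = (df1.merge(lookup, on='Name', how='left')
          .fillna({'city_list': '', 'State_list': ''}))
print(out)
```

The speed-up over apply comes from doing one join instead of scanning all of df2 once per name.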
I have a dataset with several columns.
Now what I want is basically to calculate a score based on a particular column ("name"), grouped by the "id" column.
  _id    fName    lName   age
0 ABCD   Andrew   Schulz
1 ABCD   Andreww          23
2 DEFG   John     boy
3 DEFG   Johnn    boy     14
4 CDGH   Bob      TANNA   13
5 ABCD.  Peter    Parker  45
6 DEFGH  Clark    Kent    25
What I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it on the "fName" column, I should be able to reduce this dataframe, based on a score threshold, to:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' that contains pairs of indexes of similar rows. Now I just want to combine the data.
import recordlinkage

# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows so that the result takes data from each of them.
I just wanted to clear up some doubts regarding your question; I couldn't raise them in the comments because of low reputation.
Like here if i run it for col "fName". I should be able to reduce this
dataframe to based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by ID)? This will result in the above listed resultant DataFrame.
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question.
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # id already seen: fill in lName and age if this row has them
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        # first row for this id becomes the base record
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
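The dictionary above keys on the id alone, so it ignores the similarity score. Below is a rough sketch that also applies a name threshold inside each id. It makes several assumptions: difflib.SequenceMatcher from the standard library stands in for Jaro-Winkler (the scores are not the same, so the 0.87 threshold would need re-tuning), the ids are already cleaned, and cluster_labels is a helper name I made up.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

df = pd.DataFrame({
    '_id':   ['ABCD', 'ABCD', 'DEFG', 'DEFG', 'CDGH', 'ABCD', 'DEFG'],
    'fName': ['Andrew', 'Andreww', 'John', 'Johnn', 'Bob', 'Peter', 'Clark'],
    'lName': ['Schulz', None, 'boy', 'boy', 'TANNA', 'Parker', 'Kent'],
    'age':   [np.nan, 23, np.nan, 14, 13, 45, 25],
})

def cluster_labels(names, threshold=0.87):
    """Label each name with the number of the first earlier name it resembles."""
    representatives, labels = [], []
    for name in names:
        for k, rep in enumerate(representatives):
            if SequenceMatcher(None, name, rep).ratio() >= threshold:
                labels.append(k)
                break
        else:
            # no earlier name is similar enough: start a new cluster
            representatives.append(name)
            labels.append(len(representatives) - 1)
    return labels

# Cluster the names separately inside every _id
df['cluster'] = 0
for _, idx in df.groupby('_id').groups.items():
    df.loc[idx, 'cluster'] = cluster_labels(df.loc[idx, 'fName'])

# first() takes the first non-null value per column within each cluster
merged = (df.groupby(['_id', 'cluster'], sort=False)
            .first()
            .reset_index()
            .drop(columns='cluster'))
print(merged)
```

Each name is compared only against the first name of every existing cluster, which keeps the work small but makes the result depend on row order.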
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use bfill (the current spelling of fillna(method='bfill'), which newer pandas versions deprecate) to back fill the data. Then drop duplicates, keeping the first occurrence of each Id. The back fill takes the next value found in the column, which may belong to a different Id, but since the duplicated rows are discarded afterwards, keeping the first occurrence should do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd

data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.bfill().drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
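One caveat with a plain back fill: if a column is empty for every row of some Id, the value leaks in from the next Id. Here is a small sketch of a group-wise alternative, assuming groupby(...).first() (which skips nulls by default) is acceptable. The sample rows are trimmed and altered so that one Id has no Age at all.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['AABBCC', 'Andrew', '', np.nan],
     ['AABBCC', 'Andrew', 'Schulz', 23],
     ['DDEEFF', 'Karl', 'boy', np.nan],    # no Age anywhere for this Id
     ['GGHHHH', 'John', 'TANNA', 13]],
    columns=['Id', 'fName', 'lName', 'Age'])

# Treat empty strings as missing values
df = df.replace('', np.nan)

# First non-null value per column, looked up only inside each Id
per_id = df.groupby('Id', sort=False).first().reset_index()
print(per_id)
```

Here DDEEFF keeps a missing Age, whereas a back fill over the whole column would have borrowed 13 from GGHHHH.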
I have a data frame, df, like this:
data = {'A': ['Jason (121439)', 'Molly (194439)', 'Tina (114439)', 'Jake (127859)', 'Amy (122579)'],
'B': ['Bob (127439)', 'Mark (136489)', 'Tyler (121443)', 'John (126259)', 'Anna(174439)'],
'C': ['Jay (121596)', 'Ben (12589)', 'Toom (123586)', 'Josh (174859)', 'Al(121659)'],
'D': ['Paul (123839)', 'Aaron (124159)', 'Steve (161899)', 'Vince (179839)', 'Ron (128379)']}
df = pd.DataFrame(data)
And I want to create a new data frame with one column holding the name and another column holding the number in parentheses, which would look like this:
data2 = {'Name': ['Jason ', 'Molly ', 'Tina ', 'Jake ', 'Amy '],
'ID#': ['121439', '194439', '114439', '127859', '122579']}
result = pd.DataFrame(data2)
I tried different things, but none of them worked:
1)
List_name=pd.DataFrame()
List_id=pd.DataFrame()
List_both=pd.DataFrame(columns=["Name","ID"])
for i in df.columns:
left=df[i].str.split("(",1).str[0]
right=df[i].str.split("(",1).str[1]
List_name=List_name.append(left)
List_id=List_id.append(right)
List_both=pd.concat([List_name,List_id], axis=1)
List_both
2) applying a function on all cell
Names = lambda x: x.str.split("(",1).str[0]
IDS = Names = lambda x: x.str.split("(",1).str[1]
But I was wondering how to do it in order to store it in a data frame that will look like result...
You can use stack followed by str.extract.
(df.stack()
.str.strip()
.str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
.reset_index(drop=True))
Name ID
0 Jason 121439
1 Bob 127439
2 Jay 121596
3 Paul 123839
4 Molly 194439
5 Mark 136489
6 Ben 12589
7 Aaron 124159
8 Tina 114439
9 Tyler 121443
10 Toom 123586
11 Steve 161899
12 Jake 127859
13 John 126259
14 Josh 174859
15 Vince 179839
16 Amy 122579
17 Anna 174439
18 Al 121659
19 Ron 128379
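To see what the pattern does on a single cell, here is a small sketch using the standard re module. str.extract applies the regex with search semantics, so this should behave the same way per cell. The two test strings come from the question; the parsed list is just for illustration.

```python
import re

# Same pattern as above: lazy name, optional spaces, then the ID in parentheses
pattern = re.compile(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')

parsed = []
for cell in ['Jason (121439)', 'Anna(174439)']:
    match = pattern.search(cell)
    parsed.append((match.group('Name'), match.group('ID')))

print(parsed)
```

The `\s*` before the opening parenthesis is what lets 'Anna(174439)', which has no space, parse the same way as the other cells.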
I have a dataset that contains the following fields:
building guid (abcd-efgh-5678-1234, ..., etc)
street address (1256 Grant St, 500 wall st, etc)
price ($5000, $10000, etc)
Based on this, I want to add two new columns to my DataFrame object in pandas:
street name (wall st)
street number (500)
Until now, I've been able to fetch specific instances of the word wall st as follows:
str_street = 'Wall St'
wall_st = dataset.loc[dataset['street_address'].str.lower().str.endswith(str_street.lower()), :]
wall_st['street_name'] = ???
wall_st['street_address_number'] = ???
How do I go about doing this?
I think you need extract:
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
print (df)
street address
0 500 wall street
1 123 blafoo
df1 = df['street address'].str.extract(r'(?P<number>\d+)(?P<name>.*)', expand=True)
print (df1)
number name
0 500 wall street
1 123 blafoo
Solution with split:
df[['number','name']] = df['street address'].str.split(n=1, expand=True)
print (df)
street address number name
0 500 wall street 500 wall street
1 123 blafoo 123 blafoo
df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})
df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
will result in:
0 1
0 500 wall street
1 123 blafoo
You can then just rename the columns and pd.concat this to your original data frame.
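That last step could look roughly like this. The column names street_number and street_name are my own choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'street address': ['500 wall street', '123 blafoo']})

# Split once on whitespace: first piece is the number, the rest is the name
parts = df['street address'].apply(lambda x: pd.Series(x.split(None, 1)))
parts.columns = ['street_number', 'street_name']

# Attach the two new columns next to the original one
out = pd.concat([df, parts], axis=1)
print(out)
```

The number stays a string here; pd.to_numeric can convert it if you need to sort or compare numerically.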