How to replace the first two letters of a column value using indexing - Python

I have a data frame:
import pandas as pd

d = {'name': ['john', 'tom', 'bob', 'rock', None],
     'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df = pd.DataFrame(data=d)
df['month'] = pd.DatetimeIndex(df['DoB']).month
df['year'] = pd.DatetimeIndex(df['DoB']).year
What I want to do: replace the first two letters in the name column with 'XX' if year == 2014.
My code:
df.loc[ (df.year == 2014) , df.name.str[0:2] ] = 'XX'
First of all, I get this error:
ValueError: cannot index with vector containing NA / NaN values
But even if there were a value instead of None (say 'jimy'), I get the following error: KeyError: "['jo' 'to' 'bo' 'ro' 'ji'] not in index"
I also thought of the replace method, but it only works if you want to replace a given string.
Any suggestions?

You are close. Note that pd.DataFrame.loc uses a column label as the second indexer.
mask = df['year'] == 2014
df.loc[mask, 'name'] = 'XX' + df.loc[mask, 'name'].str[2:]
print(df)
Address DoB name month year
0 NY 01/02/2010 john 1 2010
1 NJ 01/02/2012 tom 1 2012
2 PA 11/22/2014 XXb 11 2014
3 NY 11/22/2014 XXck 11 2014
4 CA 09/25/2016 None 9 2016
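If you prefer not to rebuild the string by concatenation, the same fix can be sketched with Series.str.slice_replace, which swaps a character range for a replacement and leaves the None/NaN entry untouched:
# sketch using the same mask as above: replace characters 0..2 with 'XX'
df.loc[mask, 'name'] = df.loc[mask, 'name'].str.slice_replace(0, 2, 'XX')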


Convert one column to rows and columns

I keep running into this use case and I haven't found a good solution. I am asking for a solution in Python, but a solution in R would also be helpful.
I've been getting data that looks something like this:
import pandas as pd
data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '']}
df = pd.DataFrame(data)
             Col1
0             Bob
1             101
2    First Street
3
4             Sue
5             102
6   Second Street
7
8            Alex
9             200
10   Third Street
11
The pattern in my real data does repeat like this. Sometimes there is a blank row (or more than one), and sometimes there are no blank rows at all. The important part here is that I need to convert this column into rows and columns.
I want the data to look like this.
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
I have tried playing around with this, but nothing has worked. My thought was to iterate through a few rows at a time, assign the values to the appropriate column, and just build a data frame row by row.
x = len(df['Col1'])
holder = pd.DataFrame()
new_df = pd.DataFrame()
while x < 4:
    temp = df.iloc[:5]
    holder['Name'] = temp['Col1'].iloc[0]
    holder['Address'] = temp['Col1'].iloc[1]
    holder['Street'] = temp['Col1'].iloc[2]
    new_df = pd.concat([new_df, holder])
    df = temp[5:]
    df.reset_index()
    holder = pd.DataFrame()
    x = len(df['Col1'])
new_df.head(10)
In R:
data <- data.frame(
  Col1 = c('Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex', '200', 'Third Street', '')
)
k <- which(grepl("Street", data$Col1))
j <- k - 1
i <- k - 2
data.frame(
  Name = data[i,],
  Address = data[j,],
  Street = data[k,]
)
  Name Address        Street
1  Bob     101  First Street
2  Sue     102 Second Street
3 Alex     200  Third Street
Or, if the street entries do not always end with "Street" but the address is always a number, you can also try:
j <- which(apply(data, 1, function(x) !is.na(as.numeric(x))))
i <- j - 1
k <- j + 1
Python3
In Python 3, you can convert your DataFrame into an array and then reshape it.
n = df.shape[0]
df2 = pd.DataFrame(
    data=df.to_numpy().reshape((n//4, 4), order='C'),
    columns=['Name', 'Address', 'Street', 'Empty'])
For your sample data, this produces:
Name Address Street Empty
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
If you like, you can remove the last column:
df2 = df2.drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
One-liner code
df2 = pd.DataFrame(data=df.to_numpy().reshape((df.shape[0]//4, 4), order='C'), columns=['Name', 'Address', 'Street', 'Empty']).drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
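One caveat worth flagging: both reshape versions assume the row count is an exact multiple of 4 (three fields plus one separator per record), and reshape raises a ValueError otherwise. A sketch of an explicit guard, padding with blank rows first (the padding step is an assumption added here, not part of the answer above):
import pandas as pd

# pad with blank rows so len(df) is an exact multiple of 4 before reshaping
pad = (-df.shape[0]) % 4
padded = pd.concat([df, pd.DataFrame({'Col1': [''] * pad})], ignore_index=True)
df2 = (pd.DataFrame(padded.to_numpy().reshape(-1, 4),
                    columns=['Name', 'Address', 'Street', 'Empty'])
         .drop(columns='Empty'))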
In Python, I believe this may help you:
import pandas as pd

data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex', '200', 'Third Street', '']}

var = list(data.values())[0]
var2 = []
for aux in range(int(len(var)/4)):
    var2.append(var[aux*4: aux*4+3])
data = pd.DataFrame(var2, columns=['Name', 'Address', 'Street'])
print(data)
Another R solution. This solution is based on the tidyverse package. The example data frame data is from Park's post (https://stackoverflow.com/a/69833814/7669809).
library(tidyverse)
data2 <- data %>%
  mutate(ID = cumsum(Col1 %in% "")) %>%
  filter(!Col1 %in% "") %>%
  group_by(ID) %>%
  mutate(Type = case_when(
    row_number() == 1L ~ "Name",
    row_number() == 2L ~ "Address",
    row_number() == 3L ~ "Street",
    TRUE ~ NA_character_
  )) %>%
  pivot_wider(names_from = "Type", values_from = "Col1") %>%
  ungroup()
data2
# # A tibble: 3 x 4
# ID Name Address Street
# <int> <chr> <chr> <chr>
# 1 0 Bob 101 First Street
# 2 1 Sue 102 Second Street
# 3 2 Alex 200 Third Street
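For comparison, a Python sketch of the same idea (a translation, not from the original answers): group on a running count of blank rows, which tolerates a variable number of blanks between records, unlike the fixed reshape approaches above:
import pandas as pd

data = {'Col1': ['Bob', '101', 'First Street', '', 'Sue', '102',
                 'Second Street', '', 'Alex', '200', 'Third Street', '']}
df = pd.DataFrame(data)

blank = df['Col1'].eq('')                    # marks the separator rows
# each run of non-blank rows shares one cumsum value, i.e. one record
records = df.loc[~blank, 'Col1'].groupby(blank.cumsum()).agg(list)
out = pd.DataFrame(records.tolist(), columns=['Name', 'Address', 'Street'])
print(out)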
The values of the DataFrame are reshaped to an inferred number of rows by 4 columns; the first 3 columns of the resulting array are taken out by slicing and converted into a DataFrame; finally, the columns of the DataFrame are renamed with set_axis.
result = pd.DataFrame(df.values.reshape(-1, 4)[:, :-1])\
    .set_axis(['Name', 'Address', 'Street'], axis=1)
result
>>>
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street

How to Convert a String Dictionary to a Dictionary and Then Split into Separate Columns

I have a single-column pandas dataframe, shown below, where each dictionary is currently a string. I want to convert each item to a dictionary and then split it out into separate columns.
Current Look
0 {'Owl': '8109284', 'county': '27'}
1 {'Kid': '298049', 'county': '28'}
2 {'Tree': '190849', 'county': '29'}
3 {'Garden': '9801294', 'county': '30'}
4 {'House': '108094', 'county': '31'}
End Look
      Item      Count    County  County Number
0    'Owl'  '8109284'  'county'           '27'
1    'Kid'   '298049'  'county'           '28'
2   'Tree'   '190849'  'county'           '29'
3 'Garden'  '9801294'  'county'           '30'
4  'House'   '108094'  'county'           '31'
Can anyone help resolve this? I've tried a bunch of different things.
You can use literal_eval() to get the dict object out of the string. After that, get the items using the items() method on the dictionary. Lastly, construct the dataframe.
import ast
import pandas as pd
d = [[0, "{'Owl': '8109284', 'county': '27'}"],
     [1, "{'Kid': '298049', 'county': '28'}"],
     [2, "{'Tree': '190849', 'county': '29'}"],
     [3, "{'Garden': '9801294', 'county': '30'}"],
     [4, "{'House': '108094', 'county': '31'}"]]
df = pd.DataFrame(d, columns=['a', 'b'])
pd.DataFrame(df['b'].apply(lambda x: [j for i in ast.literal_eval(x).items() for j in i]).to_list(),
             columns=['Item', 'Count', 'County', 'County Number'])
Item Count County County Number
0 Owl 8109284 county 27
1 Kid 298049 county 28
2 Tree 190849 county 29
3 Garden 9801294 county 30
4 House 108094 county 31
If the format of your dicts is the same (2 keys, 2 values), you can use the apply function. 'data' is your dict column:
df['Item'] = df.data.apply(lambda x: list(eval(x).keys())[0])
df['Count'] = df.data.apply(lambda x: list(eval(x).values())[0])
df['County'] = df.data.apply(lambda x: list(eval(x).keys())[1])
df['County Number'] = df.data.apply(lambda x: list(eval(x).values())[1])
del df['data']
Output
     Item    Count  County County Number
0     Owl  8109284  county            27
1     Kid   298049  county            28
2    Tree   190849  county            29
3  Garden  9801294  county            30
4   House   108094  county            31
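A side note on the two answers above: eval executes whatever code the string contains, so ast.literal_eval is the safer choice when the strings come from outside your control. A minimal illustration:
import ast

s = "{'Owl': '8109284', 'county': '27'}"
print(ast.literal_eval(s))   # {'Owl': '8109284', 'county': '27'}

# literal_eval only accepts Python literals, so code smuggled into a
# string raises ValueError instead of running:
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as e:
    print('rejected:', e)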

pandas string replace TypeError - replace words using pandas df column

I am using Python 3.7 and have a large data frame that includes a column of addresses, like so:
import pandas as pd

c = ['420 rodeo drive', '22 alaska avenue', '919 franklin boulevard', '39 emblem alley', '67 fair way']
df1 = pd.DataFrame(c, columns=["address"])
There is also a dataframe of address abbreviations and their USPS versions. An example:
x = ['avenue', 'drive', 'boulevard', 'alley', 'way']
y = ['ave', 'dr', 'blvd', 'aly', 'way']
df2 = pd.DataFrame(list(zip(x, y)), columns=['common_abbrev', 'usps_abbrev'], dtype="str")
What I would like to do is as follows:
Search df1['address'] for occurrences of words in df2['common_abbrev'] and replace with df2['usps_abbrev'].
Return the transformed df1.
To this end I have tried, in a few ways, what appears to be the canonical strategy of df.str.replace(), like so:
df1["address"] = df1["address"].str.replace(df2["common_abbrev"], df2["usps_abbrev"])
However, I get the following error:
`TypeError: repl must be a string or callable`.
My question is:
Since I've already given the dtype for df2, why am I receiving the error?
How can I produce my desired result of:
address
0 420 rodeo dr
1 22 alaska ave
2 919 franklin blvd
3 39 emblem aly
4 67 fair way
Thanks for your help.
First create a dict using zip. Then use Series.replace:
In [703]: x = ['avenue', 'drive', 'boulevard', 'alley', 'way']
     ...: y = ['ave', 'dr', 'blvd', 'aly', 'way']
In [686]: d = dict(zip(x, y))
In [691]: df1.address = df1.address.replace(d, regex=True)
In [692]: df1
Out[692]:
             address
0       420 rodeo dr
1      22 alaska ave
2  919 franklin blvd
3      39 emblem aly
4        67 fair way
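One caveat with the plain regex replace: each key is treated as a bare pattern, so 'way' would also match inside a longer word such as 'fairway'. A sketch of a safer variant, anchoring each key on word boundaries (reusing the dict d from above):
import re

# wrap each abbreviation in \b word boundaries so only whole words match
d_bounded = {rf'\b{re.escape(k)}\b': v for k, v in d.items()}
df1.address = df1.address.replace(d_bounded, regex=True)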

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns. What I want is to calculate a score based on a particular column ("fName"), grouped on the "_id" column.
    _id    fName   lName   age
0  ABCD    Andrew  Schulz
1  ABCD    Andreww          23
2  DEFG    John    boy
3  DEFG    Johnn   boy      14
4  CDGH    Bob     TANNA    13
5  ABCD.   Peter   Parker   45
6  DEFGH   Clark   Kent     25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score value. For example, if I run it on column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
    _id   fName   lName   age
0  ABCD   Andrew  Schulz   23
2  DEFG   John    boy      14
4  CDGH   Bob     TANNA    13
5  ABCD   Peter   Parker   45
6  DEFG   Clark   Kent     25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the record linkage library for this, and I have ended up with a dataframe of similar index pairs, called 'matches'. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows such that the surviving row takes its data from both of the matched rows.
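One possible way to combine them, as a rough sketch, assuming matches.index is the MultiIndex of matched row-label pairs produced above and that empty strings have been converted to NaN first:
import numpy as np

df = df.replace('', np.nan)
to_drop = set()
for idx1, idx2 in matches.index:
    # keep idx1's values and fill its gaps from its match idx2
    df.loc[idx1] = df.loc[idx1].combine_first(df.loc[idx2])
    to_drop.add(idx2)
df = df.drop(index=list(to_drop))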
Just wanted to clear up some doubts regarding your question; I couldn't ask in the comments due to low reputation.
Like here if I run it for col "fName", I should be able to reduce this
dataframe based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the DataFrame listed below.
    _id   fName   lName   age
0  ABCD   Andrew  Schulz   23
2  DEFG   John    boy      14
4  CDGH   Bob     TANNA    13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()

Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # fill in a missing last name or age from the later duplicate row
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls, then use fillna to 'back fill' the data, and finally drop duplicates keeping the first occurrence of each Id. fillna fills each gap from the next value found in the column, which may belong to a different Id, but since the duplicated rows are discarded afterwards, drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
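A related variant of the same idea, as a sketch: GroupBy.first takes the first non-null value per column within each Id group, which rules out any cross-Id fill entirely:
import numpy as np

# first() skips nulls per column within each Id group,
# so values can never be filled in from a different Id
df_filled = (df.replace('', np.nan)
               .groupby('Id', as_index=False)
               .first())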

Error pd.pivot "MultiIndex.name must be a hashable type"

I'm trying to apply to my pandas DataFrame something similar to R's tidyr::spread. I saw in some places people using pd.pivot, but so far I have had no success.
So in this example I have the following dataframe df:
df = pd.DataFrame({'action_id': [1, 2, 1, 4, 5],
                   'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
                   'address': ['house', 'house', 'park', 'park', 'park'],
                   'date': ['01/01', '02/01', '03/01', '04/01', '05/01']})
What I want is a multi-index pivot table with 'action_id' and 'name' as the index, the 'address' column "spread" into columns, and the 'date' column filling the values.
What I tried to do was:
df.pivot(index=['action_id', 'name'], columns='address', values='date')
And I got the error TypeError: MultiIndex.name must be a hashable type
Does anyone know what am I doing wrong?
You do not need to mention the index in pd.pivot. This will work:
import pandas as pd

df = pd.DataFrame({'action_id': [1, 2, 1, 4, 5],
                   'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
                   'address': ['house', 'house', 'park', 'park', 'park'],
                   'date': ['01/01', '02/01', '03/01', '04/01', '05/01']})
df = pd.concat([df, pd.pivot(data=df, index=None, columns='address', values='date')], axis=1) \
       .reset_index(drop=True).drop(['address', 'date'], axis=1)
print(df)
print(df)
action_id name house park
0 1 jess 01/01 NaN
1 2 alex 02/01 NaN
2 1 jess NaN 03/01
3 4 cath NaN 04/01
4 5 mary NaN 05/01
And to arrive at what you want, you need to do a groupby:
df = df.groupby(['action_id','name']).agg({'house':'first','park':'first'}).reset_index()
print(df)
action_id name house park
0 1 jess 01/01 03/01
1 2 alex 02/01 NaN
2 4 cath NaN 04/01
3 5 mary NaN 05/01
Don't forget to accept the answer if it helped you.
Another option:
df2 = df.set_index(['action_id', 'name', 'address']).date.unstack().reset_index()
df2.columns.name = None
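And a further option, not from the original answers: pd.pivot_table accepts a list of index columns plus an aggfunc, so the pivot and the groupby collapse into one call (a sketch; aggfunc='first' keeps the first date seen per cell):
df2 = (df.pivot_table(index=['action_id', 'name'], columns='address',
                      values='date', aggfunc='first')
         .reset_index())
df2.columns.name = None
print(df2)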
