add specific part of one column values to another column - python

I have the following dataframe
import pandas as pd
data = {'existing_indiv': ['stac.Altered', 'MASO.MHD'], 'queries': ['modify', 'change']}
df = pd.DataFrame(data)
existing_indiv queries
0 stac.Altered modify
1 MASO.MHD change
I want to add the period and the word before the period to the beginning of the values of the queries column
Expected outcome:
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
Any ideas?

You can use .str.extract with the regex ^([^.]+\.) to extract everything up to and including the first period:
df.queries = df.existing_indiv.str.extract(r'^([^.]+\.)', expand=False) + df.queries
df
existing_indiv queries
0 stac.Altered stac.modify
1 MASO.MHD MASO.change
If you prefer .str.split:
df.existing_indiv.str.split('.').str[0] + '.' + df.queries
0 stac.modify
1 MASO.change
dtype: object
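One caveat with the .str.extract approach: if a value in existing_indiv contains no period, the extract yields NaN, and the concatenation becomes NaN as well. Below is a minimal sketch of one way to handle that, assuming such rows should simply keep their original queries value. The third row is a made-up test case that is not in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "existing_indiv": ["stac.Altered", "MASO.MHD", "noperiod"],  # last row is hypothetical
    "queries": ["modify", "change", "keep"],
})

# Prefix up to and including the first period; NaN where there is no period
prefix = df["existing_indiv"].str.extract(r"^([^.]+\.)", expand=False)

# Treat a missing prefix as an empty string so the original value survives
df["queries"] = prefix.fillna("") + df["queries"]
print(df)
```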

Extract values within the quotes signs into two separate columns with python

How can I extract the values within the quote marks into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract patterns but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
# Strip the quotes first, then split each value on ';'
for row in df['postcode'].str.replace("'", "").str.split(';'):
    p1.append(row[0])
    p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2
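Since the question mentions str.extract, here is a hedged sketch of how that route might look. It assumes every value has exactly the form 'X';'Y'; the named groups postcode1 and postcode2 are chosen to match the desired column names.

```python
import pandas as pd

df = pd.DataFrame({"postcode": ["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'"]})

# Each named group becomes a column; [^']* matches the text between a pair of quotes
pattern = r"'(?P<postcode1>[^']*)';'(?P<postcode2>[^']*)'"
result = df["postcode"].str.extract(pattern)
print(result)
```

Rows that do not match the pattern would come back as NaN in both columns, which makes malformed values easy to spot.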

How to split a string without a given delimiter in pandas

dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming you want to split at a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If you want to split right after the first digit in the string, you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
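The same split after the first digit can probably also be written as a single str.extract call, which avoids the temporary dfnewcolumnX column. This is only a sketch under the same assumption (the prefix ends at the first digit); the named groups are used to label the new columns.

```python
import pandas as pd

df = pd.DataFrame({"dfcolumn": [
    "PUEF2CarmenXFc034DpEd",
    "CAF1BalulanFc013Baca",
    "PUEFA2BalulanFc034CamH",
]})

# Group 1: any non-digits plus the first digit; group 2: the rest of the string
parts = df["dfcolumn"].str.extract(r"^(?P<dfnewcolumn1>\D*\d)(?P<dfnewcolumn2>.*)$")
df = df.join(parts)
print(df)
```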
Assume your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length. Then a regex-based split function can be written and applied to each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
    """Split a string into two parts by regex."""
    # letter(s) followed by digit(s)
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list(), which converts the Series of tuples into a plain list; this is required for the assignment to the two new columns to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
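The code comment above mentions handling the no-match case. One possible way to do that, sketched here with a hypothetical helper f_split_safe and a made-up non-matching value "12345", is to return a pair of None values instead of raising:

```python
import re
import pandas as pd

PATTERN = re.compile(r"^([A-Za-z]+\d+)(.*)$")

def f_split_safe(s: str):
    """Return (prefix, rest), or (None, None) when the pattern does not match."""
    m = PATTERN.match(s)
    if m is None:
        return None, None
    return m.group(1), m.group(2)

# "12345" is a hypothetical value with no leading letters, so it does not match
df = pd.DataFrame({"dfcolumn": ["PUEF2CarmenXFc034DpEd", "12345"]})
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split_safe).to_list()
print(df)
```

Whether missing values, an empty string, or an exception is the right response depends on how the bad rows should be treated downstream.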
How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
Series.str.split with a regex pattern and a kwarg to specify that it should only split on the first match.
in the regex pattern, I use a capture group (the parentheses in (\d)) to keep the separating character in the result
to_list() to output the split as a list of lists
DataFrame constructor to build a new DataFrame from that list
string concat of two columns

remove unwanted strings from Pandas column

I have a dataframe :
ID Website
1 www.yah.com/?trk
2 www.gle.com
I want to remove the unwanted part from the website URL by deleting '?trk' or replacing it with ''
My final dataframe will be:
ID Website
1 www.yah.com
2 www.gle.com
How can I do this, given that there might be other suffixes to remove, not only '?trk'?
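If you want to remove only '?trk' and keep the '/', use .str.replace with regex=False, because ? is a regex metacharacter (Series.replace without .str only replaces cells whose whole value equals '?trk'):
df['Website'] = df['Website'].str.replace('?trk', '', regex=False)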
Alternatively, split on '/' and keep the first part:
df['Website'] = df['Website'].str.split('/').str[0]
df
Out[169]:
ID Website
0 1 www.yah.com
1 2 www.gle.com
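For the "other options, not only '?trk'" part of the question, a regex can remove any query string. The following is a sketch that assumes the unwanted part always starts at the first '?' (optionally preceded by a '/'); the third row is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Website": ["www.yah.com/?trk", "www.gle.com", "www.abc.com?ref=x"],  # last row is hypothetical
})

# Remove an optional slash, the '?', and everything after it
df["Website"] = df["Website"].str.replace(r"/?\?.*$", "", regex=True)
print(df)
```

Unlike the split on '/', this keeps any path segments that come before the query string.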

Convert Quarter + Year (Datetime) into String in Pandas

I have some datetime info extracted into columns in Pandas. For example, I got the quarters like this:
df['quarter'] = pd.to_datetime(df['ddate'], format='%Y%m%d', errors='coerce').dt.quarter
I need to take the 'quarter' and 'year' columns and combine them into something like "Q3_2017". I can get this to work fine with a single data point like this:
'Q' + str(df['quarter'].iloc[0]) + '_' + str(df['year'].iloc[0])
But when I try to apply "str()" to a whole column I get bizarre results. For instance:
df['period'] = str(df['quarter'])
Instead of getting the quarter (e.g. "1"), I get something like this:
7222 1\n185579 4\n185580 1\n2129..
What exactly is going on and what's an easy fix?
I found a few previous solutions, but none seem to work specifically with quarters; can only find out how to do this with month or year, for example.
Try:
df['period'] = 'Q' + df['quarter'].astype(str) + '_' + df['year'].astype(str)
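As for what is going on: Python's built-in str() applied to a Series returns a single string, the printed representation of the whole Series (index, values, and dtype line), and that one string is then assigned to every row. .astype(str) converts each element separately. A small sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"quarter": [1, 4], "year": [2017, 2018]})  # sample values are made up

# str() on a Series gives ONE string: the repr of the whole Series
whole = str(df["quarter"])
print(repr(whole))

# astype(str) converts element by element, so concatenation works row-wise
df["period"] = "Q" + df["quarter"].astype(str) + "_" + df["year"].astype(str)
print(df)
```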
With Periods you can access %q for strftime.
import pandas as pd
df = pd.DataFrame({'ddate': pd.date_range('2010-01-01', freq='57D', periods=4)})
df.ddate.dt.to_period('Q').dt.strftime('Q%q_%Y')
0 Q1_2010
1 Q1_2010
2 Q2_2010
3 Q2_2010
Name: ddate, dtype: object
Or just keep the format of to_period (convert to string if you want)
df.ddate.dt.to_period("Q")
0 2010Q1
1 2010Q1
2 2010Q2
3 2010Q2
Name: ddate, dtype: period[Q-DEC]

Parsing a JSON string enclosed with quotation marks from a CSV using Pandas

Similar to this question, but my CSV has a slightly different format. Here is an example:
id,employee,details,createdAt
1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"
2,Sarah,"{"Country":"Australia", "Salary":6000,"Review":"Hardworking"}","2018-09-05"
I think the double quotation mark in the beginning of the JSON column might have caused some errors. Using df = pandas.read_csv('file.csv'), this is the dataframe that I got:
id employee details createdAt Unnamed: 1 Unnamed: 2
1 John {Country":"USA" Salary:5000 Review:null}" 2018-09-01
2 Sarah {Country":"Australia" Salary:6000 Review:"Hardworking"}" 2018-09-05
My desired output:
id employee details createdAt
1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
2 Sarah {"Country":"Australia","Salary":6000,"Review":"Hardworking"} 2018-09-05
I've tried adding quotechar='"' as the parameter and it still doesn't give me the result that I want. Is there a way to tell pandas to ignore the first and the last quotation mark surrounding the json value?
As an alternative approach, you could read the file in manually, parse each row yourself, and use the resulting data to construct the dataframe. This works by splitting each row from both the left and the right to get the non-problematic columns and then taking what remains as the JSON column:
import pandas as pd
data = []
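with open("e1.csv") as f_input:
    for row in f_input:
        row = row.strip()
        # Split off the first two columns from the left
        split = row.split(',', 2)
        # Split off the last column from the right and strip the outer quotes
        rsplit = [cell.strip('"') for cell in split[-1].rsplit(',', 1)]
        data.append(split[0:2] + rsplit)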
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
This would display your data as:
id employee details createdAt
0 1 John {"Country":"USA","Salary":5000,"Review":null} 2018-09-01
1 2 Sarah {"Country":"Australia", "Salary":6000,"Review"... 2018-09-05
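To make the forward and backward split concrete, here is the same logic traced on a single row from the question, as a sketch rather than a general CSV parser. It assumes the first two columns and the last column never contain commas.

```python
import json

row = '1,John,"{"Country":"USA","Salary":5000,"Review":null}","2018-09-01"'

# Forward split: at most two cuts, so the JSON and the date stay together in head[2]
head = row.split(",", 2)

# Backward split: one cut from the right separates the date from the JSON,
# then the surrounding double quotes are stripped from both pieces
tail = [cell.strip('"') for cell in head[-1].rsplit(",", 1)]

fields = head[:2] + tail
details = json.loads(fields[2])  # the details cell is now valid JSON
print(fields)
print(details)
```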
I have reproduced your file
With
df = pd.read_csv('e1.csv', index_col=None )
print (df)
Output
id emp details createdat
0 1 john "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
I think there's a better way by passing a regex to sep=r',"|",|(?<=\d),' and possibly some other combination of parameters. I haven't figured it out totally.
Here is a less than optimal option:
df = pd.read_csv('s083838383.csv', sep='##$%^', engine='python')
header = df.columns[0]
print(df)
Why sep='##$%^'? This is just a garbage separator that never matches anything in the file, so each line is read as a single column. Any separator that never matches the data would do; it is only a means of getting the raw lines into a df object to work with.
df looks like this:
id,employee,details,createdAt
0 1,John,"{"Country":"USA","Salary":5000,"Review...
1 2,Sarah,"{"Country":"Australia", "Salary":6000...
Then you could use str.extract to apply regex and expand the columns:
result = df[header].str.extract(r'(.+),(.+),("\{.+\}"),(.+)',
expand=True).applymap(str.strip)
result.columns = header.strip().split(',')
print(result)
result is:
id employee details createdAt
0 1 John "{"Country":"USA","Salary":5000,"Review":null}" "2018-09-01"
1 2 Sarah "{"Country":"Australia", "Salary":6000,"Review... "2018-09-05"
If you need the starting and ending quotes stripped off of the details string values, you could do:
result['details'] = result['details'].str.strip('"')
If the details values need to be dicts instead of strings, you could do:
from json import loads
result['details'] = result['details'].apply(loads)
