I have a dataframe which contains an id column with the following sample values:
16620625 5686
16310427-5502
16501010 4957
16110430 8679
16990624/4174
16230404.1177
16820221/3388
I want to standardise these to XXXXXXXX-XXXX (i.e. 8 and 4 digits separated by a dash). How can I achieve that using Python?
Here's my code:
df['id']
df.replace(" ", "-")
You can use the DataFrame.replace() function with a regular expression like this:
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
Here's example code with sample data.
import pandas as pd
df = pd.DataFrame({'id': [
'16620625 5686',
'16310427-5502',
'16501010 4957',
'16110430 8679',
'16990624/4174',
'16230404.1177',
'16820221/3388']})
# normalize matching strings with 8-digits + delimiter + 4-digits
df = df.replace(regex=r'^(\d{8})\D(\d{4})$', value=r'\1-\2')
print(df)
Output:
id
0 16620625-5686
1 16310427-5502
2 16501010-4957
3 16110430-8679
4 16990624-4174
5 16230404-1177
6 16820221-3388
If a value does not match the regular expression for the expected format, it is left unchanged.
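To see which values were left unchanged, one option is to check each id against the target format afterwards. This is a minimal sketch, assuming the ids are stored as strings; the sample values and the name bad are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the last two values do not fit the 8 + 4 digit pattern
df = pd.DataFrame({'id': ['16620625 5686', '16310427-5502', '1662062 5686', 'unknown']})

# Restrict the replacement to the id column instead of the whole dataframe
df['id'] = df['id'].str.replace(r'^(\d{8})\D(\d{4})$', r'\1-\2', regex=True)

# Flag rows that still do not look like XXXXXXXX-XXXX
bad = ~df['id'].str.fullmatch(r'\d{8}-\d{4}')
print(df[bad])
```

Rows flagged in bad can then be inspected or fixed by hand.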
inside a for loop:
convert your data frame entry to a string.
traverse this string up to 7th index.
concatenate '-' after 7th index to the string.
concatenate remaining string to the end.
traverse to next data frame entry.
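The steps above can be sketched roughly as follows. Note that the original separator sits at index 8, so the remainder is taken from index 9 onward; otherwise the old separator would be kept next to the new dash. This assumes every entry already has the layout 8 digits + 1 separator + 4 digits, and the helper name normalize_id is made up for illustration.

```python
import pandas as pd

def normalize_id(value):
    """Rebuild an id as first 8 characters + '-' + everything after the separator."""
    s = str(value)
    return s[:8] + '-' + s[9:]

df = pd.DataFrame({'id': ['16620625 5686', '16990624/4174', '16230404.1177']})

# Loop over the entries one by one, as described in the steps
result = []
for entry in df['id']:
    result.append(normalize_id(entry))
df['id'] = result
print(df['id'].tolist())
```

Unlike the regex approach, this does not validate the input, so a malformed entry would be silently mangled.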
Related
I have a dataframe as follows:
df
Name Sequence
abc ghijklmkhf
bhf uyhbnfkkkkkk
dmf hjjfkkd
I want to append the second column's data below the first column's data in a specific format, as follows:
Name Sequence Merged
abc ghijklmkhf >abc
ghijklmkhf
bhf uyhbnfkkkkkk >bhf
uyhbnfkkkkkk
dmf hjjfkkd >dmf
hjjfkkd
I tried df['Name'] = '>' + df['Name'].astype(str) to get the name in the format with > symbol. How do I append the second column data below the value of first column data?
You can use vectorized string concatenation:
df['Merged'] = '>'+df['Name']+'\n'+df['Sequence']
output:
Name Sequence Merged
0 abc ghijklmkhf >abc\nghijklmkhf
1 bhf uyhbnfkkkkkk >bhf\nuyhbnfkkkkkk
2 dmf hjjfkkd >dmf\nhjjfkkd
Checking that there are two lines:
print(df.loc[0, 'Merged'])
>abc
ghijklmkhf
As a complement to mozway's solution, if you want to see the dataframe displayed exactly in the format you mentioned, use the following:
from IPython.display import display, HTML
df["Merged"] = ">"+df["Name"]+"\n"+df["Sequence"]
def pretty_print(df):
    return display(HTML(df.to_html().replace("\\n","<br>")))
pretty_print(df)
I have a lot of datasets that I need to iterate through, searching each for a specific value and returning some values based on the search outcome.
Datasets are stored as dictionary:
key type size Value
df1 DataFrame (89,10) Column names:
df2 DataFrame (89,10) Column names:
df3 DataFrame (89,10) Column names:
Each dataset looks something like this. I am trying to check whether the value in column A, row 1 contains 035 and, if so, return column B.
| A | B | C
02 la 035 nan nan
Target 7 5
Warning 3 6
If I try to search for specific value in it I get an error
TypeError: first argument must be string or compiled pattern
I have tried:
something = []
for key in df:
    text = df[key]
    if re.search('035', text):
        something.append(text['B'])
Something = pd.concat([something], axis=1)
You can use .str.contains() : https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
"A":["02 la 035", "Target", "Warning"],
"B":[0,7,3],
"C":[0, 5, 6]
})
df[df["A"].str.contains("035")] # Returns first row only.
Also works for regex.
df[df["A"].str.contains("[TW]ar")] # Returns last two rows.
EDIT to answer additional details.
The dataframe I set up looks like this:
           A  B  C
0  02 la 035  0  0
1     Target  7  5
2    Warning  3  6
To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:
df[df["A"].str.contains("[TW]ar")]["B"]
This returns a series. Output:
1    7
2    3
Name: B, dtype: int64
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()
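For the dictionary of dataframes in the question, a similar check can be run inside a loop. The TypeError comes from passing a whole DataFrame to re.search, which expects a string. The following is a rough sketch, assuming that "row 1" means the first row and that every dataframe has columns A and B; the names datasets and matches and the sample data are made up.

```python
import pandas as pd

# Hypothetical dictionary of dataframes keyed by name
datasets = {
    "df1": pd.DataFrame({"A": ["02 la 035", "Target", "Warning"], "B": [0, 7, 3]}),
    "df2": pd.DataFrame({"A": ["02 la 040", "Target", "Warning"], "B": [1, 8, 4]}),
    "df3": pd.DataFrame({"A": ["03 la 035", "Target", "Warning"], "B": [2, 9, 5]}),
}

matches = {}
for key, frame in datasets.items():
    # Look only at the first row of column A, as described in the question
    if "035" in str(frame["A"].iloc[0]):
        matches[key] = frame["B"]

# One column of B values per matching dataframe, labelled by dictionary key
result = pd.concat(matches, axis=1)
print(result)
```

pd.concat raises an error when matches is empty, so that case may need a separate check.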
I am new to pandas. I need to find out whether a column name appears at the start or at the end of a string. Suppose I have a string str = "000*999bikes", where bikes is a column name of df. I want to check whether the column name comes first or last in the string.
My code:
str = "000*999bikes"
df =
bikes cars
0 12 23
1 34 67
3 56 90
Here the column name bikes is at the end of the string. How can I check for that with an if condition?
If you just want the number beside the column name, you can use the following code to check for each column.
import pandas as pd
df = pd.DataFrame([[1,2],[2,2]],columns=['bikes','cars'])
strings = ["cars000*999","000*999bikes"]
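for s in strings:
    for col in df.columns:
        # column name at the start: print the character right after it
        if s.startswith(col):
            print(col, s[len(col)])
        # column name at the end: print the character right before it
        if s.endswith(col):
            print(col, s[-len(col) - 1])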
Output:
cars 0
bikes 9
If your strings are in the dataframe then you could do this with pandas str operations instead of loops.
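As a rough sketch of that idea, assuming the strings live in a pandas Series (the Series contents and the name positions are made up for illustration), Series.str.startswith and Series.str.endswith give one boolean per string:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 2]], columns=['bikes', 'cars'])

# Hypothetical Series holding the strings to inspect
strings = pd.Series(["cars000*999", "000*999bikes", "000bikes999"])

# One boolean column per dataframe column and position
positions = pd.DataFrame(index=strings.index)
for col in df.columns:
    positions[col + '_first'] = strings.str.startswith(col)
    positions[col + '_last'] = strings.str.endswith(col)
print(positions)
```

The loop here only runs over the column names; the check over the strings themselves is vectorized.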
I have a dataframe column that contains strings of either 4 or 6 characters. I would like to append the string "00" to the strings of length 4.
I am using this code, but it's giving me a syntax error.
df['col'] = np.where((df['col'].str.len() = 4, df['col'].astype(str) +'00' , df['col']'])
The clean pandas way to do this is to use ljust instead:
import pandas as pd
df = pd.DataFrame()
df['col'] = pd.Series(['aaaa', 'bbbbbb'], dtype='string')
df['col'] = df['col'].str.ljust(6, '0')
print(df)
Output
col
0 aaaa00
1 bbbbbb
From the Python documentation for str.ljust:
Return the string left justified in a string of length width. Padding
is done using the specified fillchar (default is an ASCII space). The
original string is returned if width is less than or equal to len(s).
pandas offers a rich API for working with text; see the pandas user guide on working with text data for more information.
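For reference, the syntax error in the original attempt comes from using = (assignment) where == (comparison) is needed, plus a stray quote and a misplaced parenthesis. A corrected sketch of that np.where approach, using the same sample values as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['aaaa', 'bbbbbb']})

# Append '00' only where the string has length 4; keep other values unchanged
df['col'] = np.where(df['col'].str.len() == 4, df['col'] + '00', df['col'])
print(df['col'].tolist())
```

Unlike ljust, this leaves strings of any length other than 4 untouched, which may or may not be what you want.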
I have a dataset with a column named region. Sample values are
e.g. region_1, region_2, region_3, etc.
I need to replace these values with
e.g. 1, 2, 3, etc.
Is there a specific function for this simple transformation?
Thanks
I believe you need to split on the underscore, select the second value and, if necessary, convert it to an integer:
df.region = df.region.str.split('_').str[1].astype(int)
Or use extract with a regex to pull out the digits:
df.region = df.region.str.extract(r'(\d+)', expand=False).astype(int)
Sample:
df = pd.DataFrame({'region':['region_1','region_2','region_3']})
df.region = df.region.str.extract(r'(\d+)', expand=False).astype(int)
print (df)
region
0 1
1 2
2 3
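If some values might not contain any digits, extract returns a missing value for them and astype(int) raises. One possible way around that, sketched here with a made-up value 'other' and the assumption that a nullable integer column is acceptable:

```python
import pandas as pd

# Hypothetical data: the last value has no digits to extract
df = pd.DataFrame({'region': ['region_1', 'region_2', 'other']})

# extract yields a missing value where the pattern does not match
numbers = df['region'].str.extract(r'(\d+)', expand=False)

# The nullable Int64 dtype keeps the missing value instead of raising
df['region'] = pd.to_numeric(numbers).astype('Int64')
print(df)
```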