I know how to add leading zeros to all values in a pandas column. But my pandas column 'id' contains both numeric values like '83948' and '848439' and alphanumeric values like 'dy348dn' and '494rd7f'. What I want is to add leading zeros only to the numeric values, padding them to a length of 10. How can we do that?
I understand that you want to apply padding only to ids that are completely numeric. In that case, you can call isnumeric() on a string (for example, mystring.isnumeric()) to check whether the string contains only digits. If the condition is satisfied, you can apply your padding rule.
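A minimal sketch of that idea (the frame is assumed from the question's sample values):
import pandas as pd

df = pd.DataFrame({'id': ['83948', '848439', 'dy348dn', '494rd7f']})
# Pad only the fully numeric ids to a width of 10; leave the rest untouched
df['id'] = df['id'].apply(lambda x: x.zfill(10) if x.isnumeric() else x)
print(df)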
You can use a mask with str.isdigit and boolean indexing with str.zfill:
mask = df['col'].str.isdigit()
df.loc[mask, 'col'] = df.loc[mask, 'col'].str.zfill(10)
Output:
col
0 0000083948
1 0000848439
2 dy348dn
3 494rd7f
Used input:
df = pd.DataFrame({'col': ['83948', '848439', 'dy348dn', '494rd7f']})
I have a data frame with some text read in from a txt file; the column names are FEATURE and SENTENCES.
Within the FEATURE col there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the below should suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be cast to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
OR using str.contains, with regex=False so '[NA]' is matched literally rather than as a regex character class:
>>> df[df.FEATURE.str.contains('[NA]', regex=False) == False]
FEATURE
0 this
1 is
2 string
OR
df[df.FEATURE.str[0].ne('[')]
IIUC, use regex=False so the string is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape the special regex characters [] (using a raw string so the backslashes survive):
df[~df.FEATURE.str.contains(r"\[NA\]")]
Another possible problem is surrounding whitespace; then use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
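A quick illustration of why the unescaped pattern misbehaves (a minimal sketch on made-up data):
import pandas as pd

s = pd.Series(['[NA] not a feature', 'No issues here'])
# As a regex, '[NA]' is a character class matching 'N' or 'A' anywhere
print(s.str.contains('[NA]').tolist())               # [True, True]
# Treated literally, it only matches the actual substring '[NA]'
print(s.str.contains('[NA]', regex=False).tolist())  # [True, False]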
df['data'].str.startswith('[NA]') or df['data'].str.contains('[NA]') will both return a boolean (True/False) Series. drop doesn't work with booleans, and in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add == False to get all the rows that DON'T have [NA]:
df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
mask = df['data'].str.contains('[NA]', regex=False) == False
df.loc[mask]
The below simple code should work:
df = df[~df['FEATURE'].str.startswith('[NA]')]
I currently have a column in my dataset that looks like the following:
Identifier
09325445
02242456
00MatBrown
0AntonioK
065824245
The column data type is object.
What I'd like to do is remove the leading zeros only from rows where the value is a string (i.e. contains letters). I want to keep the leading zeros where the rows are integers.
Result I'm looking to achieve:
Identifier
09325445
02242456
MatBrown
AntonioK
065824245
Code I am currently using (that isn't working):
def removeLeadingZeroFromString(row):
    if df['Identifier'] == str:
        return df['Identifier'].str.strip('0')
    else:
        return df['Identifier']

df['Identifier'] = df.apply(lambda row: removeLeadingZeroFromString(row), axis=1)
One approach would be to try to convert Identifier with to_numeric. Test where the converted values are NaN using isna, then use this mask to str.lstrip the leading zeros only where the values could not be converted:
m = pd.to_numeric(df['Identifier'], errors='coerce').isna()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
df:
Identifier
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Alternatively, a less robust approach, which only works when the numeric values are digit-only strings, would be to test where not str.isnumeric:
m = ~df['Identifier'].str.isnumeric()
df.loc[m, 'Identifier'] = df.loc[m, 'Identifier'].str.lstrip('0')
NOTE: This fails easily; to_numeric is the much better approach if looking for all number types.
Sample Frame:
df = pd.DataFrame({
    'Identifier': ['0932544.5', '02242456']
})
Sample Results with isnumeric:
Identifier
0 932544.5 # 0 Stripped
1 02242456
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
    'Identifier': ['09325445', '02242456', '00MatBrown', '0AntonioK', '065824245']
})
Use replace with regex and a positive lookahead:
>>> df['Identifier'].str.replace(r'^0+(?=[a-zA-Z])', '', regex=True)
0 09325445
1 02242456
2 MatBrown
3 AntonioK
4 065824245
Name: Identifier, dtype: object
Regex: replace one or more 0s (0+) at the start of the string (^) if there is a letter ([a-zA-Z]) right after the 0s (the lookahead (?=...)).
I have the following dataframe. I am trying to remove the spaces between the numbers in the value column and then use pd.to_numeric to change the dtype. The current dtype of value is object.
periodFrom value
1 17.11.2020 28 621 240
2 18.11.2020 30 211 234
3 19.11.2020 33 065 243
4 20.11.2020 34 811 330
I have tried multiple variations of this but can't work it out:
df['value'] = df['value'].str.strip()
df['value'] = df['value'].str.replace(',', '').astype(int)
df['value'] = df['value'].astype(str).astype(int)
One option is to apply .str.split() first in order to split on the whitespace (even if any gap is more than one character long), then concatenate with ''.join() while removing that whitespace, and convert to integers with int():
j = 0
for i in df['value'].str.split():
    # .loc avoids chained assignment; df.index[j] picks the j-th row by position
    df.loc[df.index[j], 'value'] = int(''.join(i))
    j += 1
You can do:
df['value'].replace({' ':''}, regex=True)
Or, with the re module (requires import re):
df['value'].apply(lambda x: re.sub(' ', '', str(x)))
Then append .astype(int) to either.
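Putting it together on the question's sample data (a minimal sketch; the values are retyped from the table above):
import pandas as pd

df = pd.DataFrame({
    'periodFrom': ['17.11.2020', '18.11.2020', '19.11.2020', '20.11.2020'],
    'value': ['28 621 240', '30 211 234', '33 065 243', '34 811 330'],
})

# Drop the thousands-separator spaces, then convert to integers
df['value'] = df['value'].replace({' ': ''}, regex=True).astype(int)
print(df['value'].tolist())  # [28621240, 30211234, 33065243, 34811330]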
I have a short script to pivot data. The first column is a 9 digit ID number, often beginning with zeros such as 000123456
Here is the script:
df = pd.read_csv('source')
new_df = df.pivot_table(index = 'id', columns = df.groupby('id').cumcount().add(1), values = ['prog_id', 'prog_type'], aggfunc='first').sort_index(axis=1,level=1)
new_df.columns = [f'{x}_{y}' for x,y in new_df.columns]
new_df.to_csv('destination')
print(new_df)
Although the CSV is being read with an id such as 000123456, the output only contains 123456
Even when setting an explicit dtype, Pandas removes the leading zeros. Is there a workaround for telling Pandas to leave the leading zeros?
Per comment on original post, set dtype as string:
df = pd.read_csv('source', dtype={'id': str})  # np.str was removed from recent NumPy; the builtin str works
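A self-contained way to check this (io.StringIO stands in for the real 'source' file, which is an assumption here):
import io
import pandas as pd

csv = io.StringIO('id,prog_id,prog_type\n000123456,7,A\n')
df = pd.read_csv(csv, dtype={'id': str})
print(df['id'].iloc[0])  # '000123456' -- leading zeros preserved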
You could use pandas' zfill() method right after reading your csv file 'source'. Basically, you would pad the values of your 'id' column with as many zeros as needed, in this particular case making the number 9 digits long (3 zeros + 6 original digits). So we would have:
df = pd.read_csv('source')
# Cast to str first, since read_csv will have parsed 'id' as an integer
df['id'] = df['id'].astype(str).str.zfill(9)
# (...)
I have a column of integers (sample row: 123456789), and some of the values are interspersed with junk alphabetic characters, e.g. 1234y5678. I want to delete the letters appearing in such cells and retain the numbers. How do I go about it using Pandas?
Assume my dataframe is df and the column name is mobile.
Should I use np.where with conditions such as df[df['mobile'].str.contains('a-z')] and use string replace?
If your junk characters are not limited to letters, you should use this:
yourSeries.str.replace('[^0-9]', '', regex=True)
Use pd.Series.str.replace:
import pandas as pd
s = pd.Series(['125109a181', '1361q1j1', '85198m4'])
s.str.replace('[a-zA-Z]', '', regex=True).astype(int)
Output:
0 125109181
1 136111
2 851984
Use the regex character class \D (not a digit):
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
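For example, applied to values like those in the question (a minimal sketch on made-up data):
import pandas as pd

df = pd.DataFrame({'mobile': ['1234y5678', '123456789']})
df['mobile'] = df['mobile'].str.replace(r'\D', '', regex=True).astype('int64')
print(df['mobile'].tolist())  # [12345678, 123456789]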