I need to read a worksheet and format numbers that currently have no punctuation.
For example, this number: 12345678901234567890 should look like this: 1234567-89.0123.4.56.7890
In JS I did something like this for a Google Sheets spreadsheet and it worked, but I need to embed this in a Python script. I'm using pandas to read the spreadsheet, and I'm having trouble porting it to Python.
function reorder() {
  const a = ["12345678901234567890", "12345678901234567890"]; // put all of the strings into an array
  const b = a.map(s => {
    return `${s.slice(0,7)}-${s.slice(7,9)}.${s.slice(9,13)}.${s.slice(13,14)}.${s.slice(14,16)}.${s.slice(16)}`;
  });
  return b;
}
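A near line-for-line Python translation of the JavaScript above, using plain slicing and an f-string in place of the template literal, might look like this (the sample values are the ones from the question):

```python
def reorder(strings):
    # mirror the JS slices: groups of 7-2.4.1.2.4 digits
    return [
        f"{s[0:7]}-{s[7:9]}.{s[9:13]}.{s[13:14]}.{s[14:16]}.{s[16:]}"
        for s in strings
    ]

print(reorder(["12345678901234567890", "12345678901234567890"]))
# → ['1234567-89.0123.4.56.7890', '1234567-89.0123.4.56.7890']
```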
Considering that the dataframe OP is reading from Excel looks like this
df = pd.DataFrame({'string': ['12345678901234567890', '12345678901234567890']})
[Out]:
string
0 12345678901234567890
1 12345678901234567890
There are various ways to achieve OP's goal.
One, for example, is using pandas.Series.apply with a lambda function as follows
df['string'] = df['string'].apply(lambda x: f"{x[0:7]}-{x[7:9]}.{x[9:13]}.{x[13:14]}.{x[14:16]}.{x[16:]}")
[Out]:
string
0 1234567-89.0123.4.56.7890
1 1234567-89.0123.4.56.7890
One can also use pandas.Series.str.replace with a regex that captures each digit group
df['string'] = df['string'].str.replace(r'^(\d{7})(\d{2})(\d{4})(\d)(\d{2})(\d{4})$', r'\1-\2.\3.\4.\5.\6', regex=True)
[Out]:
string
0 1234567-89.0123.4.56.7890
1 1234567-89.0123.4.56.7890
Notes:
One might have to adjust the column name (in this case it is string) and/or the dataframe (in this case it is df).
Related
I converted a CSV file to a pandas DataFrame, but found that all the content is str with a pattern like ="content"
I tried using df.replace to substitute the '=' and '"' characters. The code looks like
df.replace("=", "", inplace=True)
df.replace('"', "", inplace=True)
However, this code runs without error messages, yet nothing is replaced in the DataFrame.
Strangely, it works when I use
df[column] = df[column].str.replace('=', '')
df[column] = df[column].str.replace('"', '')
Is there any way to replace/substitute the equals and double-quote signs using DataFrame methods? And I am curious why the df.replace method doesn't work.
Sorry, I can only provide a pic, since the original data and code are in a notebook with internet and USB access locked.
Thanks for the help
Because .replace('=', '') requires the cell value to be exactly '=', which is obviously not true in your case.
You may instead use it with regex:
df = pd.DataFrame({'a': ['="abc"', '="bcd"'], 'b': ['="uef"', '="hdd"'], 'c':[1,3]})
df.replace([r'^="', r'"$'], '', regex=True, inplace=True)
print(df)
a b c
0 abc uef 1
1 bcd hdd 3
Two regular expressions are used here, with the first taking care of the head and the second the tail.
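The two patterns can equally be combined into a single alternation; a small sketch on the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({'a': ['="abc"', '="bcd"'], 'b': ['="uef"', '="hdd"'], 'c': [1, 3]})
# one regex: strip a leading =" or a trailing "
df.replace(r'^="|"$', '', regex=True, inplace=True)
print(df)
```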
I have a dataframe where one column contains numbers but as string values like "1.0", "52.0" etc.
I want to convert the column to instead contain strings like "PRE_1", "PRE_52".
Example
df = pd.DataFrame([['1.0'], ['52.0']], columns=['Pre'])
df["Pre"] = 'PRE_' + df["Pre"].astype(str)
gives me output of PRE_1.0
I tried:
df["Pre"] = 'PRE_' + df["Pre"].astype(int).astype(str) but got a ValueError.
Do I need to convert it into something else before trying to convert it to an int?
It looks like df["Pre"].astype(float).astype(int).astype(str) might do what I want, but I'm open to cleaner ways of doing it.
I'm pretty new to pandas, so help would be greatly appreciated!
To be able to help properly, sample data would be great. Based on the information you did provide, if the data coming in is a float, you can apply a format to truncate it as below.
df = pd.DataFrame({'pre': [1.0, 52.0]})
df['pre'] = df['pre'].map('PRE_{:.0f}'.format)
print(df)
Apply a function:
import pandas as pd
df = pd.DataFrame([['1.0'],['52.0']],columns=['Pre'])
print(df)
df.Pre = df.Pre.apply(lambda n: f'PRE_{float(n):.0f}')
print(df)
Output:
Pre
0 1.0
1 52.0
Pre
0 PRE_1
1 PRE_52
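If the strings always end in a literal ".0", a hedged alternative (assuming pandas >= 1.4, where Series.str.removesuffix exists) skips the float round-trip entirely:

```python
import pandas as pd

df = pd.DataFrame([['1.0'], ['52.0']], columns=['Pre'])
# drop the trailing ".0" as text, then prepend the prefix
df['Pre'] = 'PRE_' + df['Pre'].str.removesuffix('.0')
print(df['Pre'].tolist())
# → ['PRE_1', 'PRE_52']
```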
I have an example df:
df = pd.DataFrame({'A': ['100,100', '200,200'],
'B': ['200,100,100', '100']})
A B
0 100,100 200,100,100
1 200,200 100
and I want to replace the commas ',' with nothing (basically, remove them). You can probably guess the real-world application, as much data is written with thousands separators; feel free to introduce me to a better method.
Now I read the documentation for df.replace() here and tried several versions of the code - it raises no error, but does not modify my dataframe.
df = df.replace(',','')
df = df.replace({',': ''})
df = df.replace([','],'')
df = df.replace([','],[''])
I can get it working by specifying the column names and using the Series .str.replace() method, but imagine having 20 columns. I can also get it working by specifying columns in df.replace(), but there must be a more convenient way for such an easy task. I could write a custom function, but pandas is such an amazing library that I must be missing something.
This works:
df['A'] = df['A'].str.replace(',','')
Thank you!
df.replace has a regex parameter; set it to True for partial matches.
By default the regex param is False; when False, only exact full matches are replaced.
From Pandas docs:
str: string exactly matching to_replace will be replaced with the value.
df.replace(',', '', regex=True)
A B
0 100100 200100100
1 200200 100
In pd.Series.str.replace, the regex param defaulted to True in older pandas (since pandas 2.0 the default is False).
From docs:
Equivalent to str.replace() or re.sub(), depending on the regex value.
Determines if the passed-in pattern is assumed to be a regular expression:
If True, assumes the passed-in pattern is a regular expression.
If False, treats the pattern as a literal string
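Note that after removing the separators the cells are still strings; if numeric columns are the goal, one possible follow-up (using the example frame from the question) is:

```python
import pandas as pd

df = pd.DataFrame({'A': ['100,100', '200,200'], 'B': ['200,100,100', '100']})
# remove commas everywhere, then convert each column to a numeric dtype
df = df.replace(',', '', regex=True).apply(pd.to_numeric)
print(df.dtypes)
```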
Though your immediate question has probably been answered, I wanted to mention that if you are reading this data in from a CSV file, you can pass the thousands argument with a comma "," so that values like "1,000" are parsed as integers with the comma removed:
import io
import pandas as pd
csv_file = io.StringIO("""
A,B,C
"1,000","2,000","3,000"
1,2,3
"50,000",50,5
""")
df = pd.read_csv(csv_file, thousands=",")
print(df)
A B C
0 1000 2000 3000
1 1 2 3
2 50000 50 5
print(df.dtypes)
A int64
B int64
C int64
dtype: object
I have a csv file with two formatted columns that currently read in as objects:
One contains percentage values, which read in as strings like '0.01%'. The % is always at the end.
The other contains currency values, which read in as strings like '$1234.5'.
I have tried using the split function to remove the % or $ inside the dataframe, then calling float on the result of the split. This prints the correct result but does not assign the value. It also gives an error that float does not have a split function, even though I do the split before the float.
Try this:
import pandas as pd
df = pd.read_csv('data.csv')
"""
The example df looks like this:
col1 col2
0 3.04% $100.25
1 0.15% $1250
2 0.22% $322
3 1.30% $956
4 0.49% $621
"""
df['col1'] = df['col1'].str.split('%', expand=True)[0]
df['col2'] = df['col2'].str.split('$', n=1, expand=True)[1]
df[['col1', 'col2']] = df[['col1', 'col2']].apply(pd.to_numeric)
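An equivalent, arguably simpler sketch strips the symbols instead of splitting (col1/col2 are the column names assumed in the example above):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['3.04%', '0.15%'], 'col2': ['$100.25', '$1250']})
# remove the trailing % and the leading $ as literal characters, then convert
df['col1'] = pd.to_numeric(df['col1'].str.rstrip('%'))
df['col2'] = pd.to_numeric(df['col2'].str.lstrip('$'))
print(df)
```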
You are probably looking for the apply method, for example:
df['first_col'] = df['first_col'].apply(lambda x: float(x.strip('%')))
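A matching line for the currency column might look like this ('second_col' is a hypothetical column name, not one from the question):

```python
import pandas as pd

df = pd.DataFrame({'second_col': ['$100.25', '$1250']})
# strip the leading $ and parse the remainder as a float
df['second_col'] = df['second_col'].apply(lambda x: float(x.strip('$')))
print(df['second_col'].tolist())
# → [100.25, 1250.0]
```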
I'm using Pandas to load an Excel spreadsheet which contains zip code (e.g. 32771). The zip codes are stored as 5 digit strings in spreadsheet. When they are pulled into a DataFrame using the command...
xls = pd.ExcelFile("5-Digit-Zip-Codes.xlsx")
dfz = xls.parse('Zip Codes')
they are converted into numbers. So '00501' becomes 501.
So my questions are, how do I:
a. Load the DataFrame and keep the string type of the zip codes stored in the Excel file?
b. Convert the numbers in the DataFrame into a five digit string e.g. "501" becomes "00501"?
As a workaround, you could convert the ints to 0-padded strings of length 5 using Series.str.zfill:
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
Demo:
import pandas as pd
df = pd.DataFrame({'zipcode':['00501']})
df.to_excel('/tmp/out.xlsx')
xl = pd.ExcelFile('/tmp/out.xlsx')
df = xl.parse('Sheet1')
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)
print(df)
yields
zipcode
0 00501
You can avoid panda's type inference with a custom converter, e.g. if 'zipcode' was the header of the column with zipcodes:
dfz = xls.parse('Zip Codes', converters={'zipcode': lambda x:x})
This is arguably a bug, since the column was originally string encoded; I made an issue here
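In current pandas, a dtype mapping is an even shorter route than a custom converter; read_excel accepts the same dtype argument as read_csv. A sketch of the idea, demonstrated with read_csv so it runs without the Excel file:

```python
import io
import pandas as pd

# dtype=str stops pandas from inferring numbers, so '00501' keeps its zeros
csv_file = io.StringIO("zipcode\n00501\n32771\n")
dfz = pd.read_csv(csv_file, dtype={'zipcode': str})
print(dfz['zipcode'].tolist())
# → ['00501', '32771']
```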
str(my_zip).zfill(5)
or
print("{0:>05s}".format(str(my_zip)))
are two of many, many ways to do this
The previous answers have correctly suggested using zfill(5). However, if your zipcodes are already in float datatype for some reason (I recently encountered data like this), you first need to convert it to int. Then you can use zfill(5).
df = pd.DataFrame({'zipcode':[11.0, 11013.0]})
zipcode
0 11.0
1 11013.0
df['zipcode'] = df['zipcode'].astype(int).astype(str).str.zfill(5)
zipcode
0 00011
1 11013
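One caveat: astype(int) raises on missing values. If the float column may contain NaN, a hedged workaround is to format only the non-missing entries:

```python
import pandas as pd

df = pd.DataFrame({'zipcode': [11.0, None, 11013.0]})
# zero-pad real values; leave NaN as-is (astype(int) would raise here)
df['zipcode'] = df['zipcode'].map(lambda v: f'{int(v):05d}' if pd.notna(v) else v)
print(df['zipcode'].tolist())
```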