How to remove unwanted dots from strings in pandas column? - python

I have data that contains a column like the following:
mouse.pad.v.1.2
key.board.1.0.c30
pen.color.4.32.r
I am removing digits by
df["parts"] = df["parts"].str.replace(r'\d+', '', regex=True)
Once the digits are removed the data looks like the following:
mouse.pad.v..
key.board...c
pen.color...r
What I want to do is replace more than one dot in the column with just one dot. The ideal output should be
mouse.pad.v
key.board.c
pen.color.r
I tried using
df["parts"] = df["parts"].str.replace('..', '.')
but an unescaped '.' in a regex matches any character, and I am not sure how many dots will end up next to each other anyway. Is there a way to automate it?

Try:
df["parts"] = df["parts"].str.replace(r"\.*\d+", "", regex=True)
print(df)
Prints:
parts
0 mouse.pad.v
1 key.board.c
2 pen.color.r
Input dataframe:
parts
0 mouse.pad.v.1.2
1 key.board.1.0.c30
2 pen.color.4.32.r
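Alternatively, keeping the two-step approach from the question, you can first strip the digits and then collapse any run of two or more dots with a second regex; a minimal sketch using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({"parts": ["mouse.pad.v.1.2", "key.board.1.0.c30", "pen.color.4.32.r"]})

# Step 1: remove the digits, as in the question.
df["parts"] = df["parts"].str.replace(r"\d+", "", regex=True)

# Step 2: collapse any run of two or more dots into a single dot,
# then strip a leftover dot at either end.
df["parts"] = (
    df["parts"]
    .str.replace(r"\.{2,}", ".", regex=True)
    .str.strip(".")
)
print(df["parts"].tolist())  # ['mouse.pad.v', 'key.board.c', 'pen.color.r']
```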


Pandas: split a column into columns of different sizes

--Edited-- [SOLVED]
I am using tabula to convert PDF invoices to a pandas DataFrame, but the last column does not come out right.
I want to split the last column, named 'PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'.
I want to split it at each space into new columns ['PVFc/IVA', 'PVAs/Tx', 'Desc%', 'TxInf.', 'IVA%', 'P.Unit.', 'Total Liq.'], and each row should be split at each space as well. Row 2 would become '7,41', '6,30', '65,0', '0,03', '6', '2,24', '22,40'.
I have searched and found how to split, but some rows split into 7 columns and others into only 6, and I get an error.
For more information: every row where 'PVP c/Iva' is NaN, or where 'Esc.' is 'NETT', has no 'PVFc/IVA' value, so the row only has 6 values. For my analysis it would be fine to insert '0,00' as a prefix in those rows so they all have 7 values.
Any solution is welcome; I am starting out with Python and pandas... thanks for your time
I applied parts of the code from #Ahmed Sayed, and I have made progress.
To concatenate the NaN column with the other one, I first replace NaN with an empty string:
dataframe['placeHolderColumn'] = dataframe['placeHolderColumn'].fillna(value='')
After some trial and error, I found that sometimes there is more than one space, so I replaced every run of spaces with a single space, and then replaced each space with '*':
dataframe["newColumn"] = dataframe['newColumn'].str.replace(' ', '*')
Then I created a new column to confirm the number of split elements:
dataframe["count2"] = dataframe['newColumn'].str.count(r'\*')
I get this result
So, as a last step, I apply the split method:
dataframe[['c1','c2','c3','c4','c5','c6']] = dataframe['newColumn'].str.split('*', expand=True)
but I get this error
--FOUND--
I have to pass another column name; I was passing only 6 new columns but I have 7 values:
dataframe[['c1','c2','c3','c4','c5','c6', 'c7']] = dataframe['newColumn'].str.split('*', expand=True)
So the problem here is that the cells do not all have the same number of values in that column. We can address this by counting the values and, wherever one is missing, adding a dummy 00 at the beginning so the split is uniform later.
First, let's create a column with the number of spaces; this tells us how many values are in each row.
df["count"] = df['PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'].str.count(' ')
then, if the count is less than what we are expecting, let's prepend a dummy value at the beginning of each short cell's string
# here we compare the number of spaces to 5, 5 is for the short cells that need a dummy 00 at the beginning
df.loc[df["count"] <= 5, 'placeHolderColumn'] = '00 ' #notice there is a space after the zeros
# fill the rows that did not need padding with an empty string,
# then merge the placeHolderColumn column with the old values column
df['placeHolderColumn'] = df['placeHolderColumn'].fillna('')
df['newColumn'] = df['placeHolderColumn'] + df['PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'].astype(str)
lastly, we can split the column; since every row now has 7 values, we pass 7 column names:
df[['c1','c2','c3','c4','c5','c6','c7']] = df['newColumn'].str.split(' ', expand=True)
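Putting the whole answer together, here is a minimal self-contained sketch, with a hypothetical short column name "values" standing in for the long invoice header:

```python
import pandas as pd

# Hypothetical sample: rows have either 7 or 6 space-separated values.
raw = pd.DataFrame({"values": [
    "7,41 6,30 65,0 0,03 6 2,24 22,40",   # 7 values
    "6,30 65,0 0,03 6 2,24 22,40",        # 6 values (missing 'PVFc/IVA')
]})

# Collapse repeated whitespace so counting spaces is reliable.
raw["values"] = raw["values"].str.replace(r"\s+", " ", regex=True).str.strip()

# Count spaces: 6 spaces -> 7 values, 5 spaces -> 6 values.
count = raw["values"].str.count(" ")

# Prefix a dummy '0,00' to the short rows so every row has 7 values.
raw.loc[count <= 5, "values"] = "0,00 " + raw["values"]

# Now a single split into 7 columns works for every row.
cols = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
raw[cols] = raw["values"].str.split(" ", expand=True)
print(raw[cols])
```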

If statement: string starts with exactly 4 digits in Python/pandas

I have a dataframe column consisting of strings, which are either a date (e.g. "12-10-2020") or a string starting with 4 digits (e.g. "4030 - random name"). I would like to write an if statement to capture the strings which start with 4 digits, similar to this code:
string[0].isdigit()
but instead of isdigit, it should be something like:
is string which starts with 4 digits
I hope I clarified my question and let me know if it is not clear. I am btw working in pandas.
Use str.contains:
df[df["col"].str.contains(r'^[0-9]{4}')]
You can use str.match that is anchored by default to the start of the string:
Example:
df = pd.DataFrame({'col': ['4030 - random name', 'other', '07-02-2022']})
df[df['col'].str.match(r'\d{4}')]
output:
col
0 4030 - random name
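If you need the check inside a plain Python if on a single value, rather than a vectorized mask, re.match is anchored at the start of the string. A small sketch with a hypothetical helper that also enforces "exactly 4" (i.e. not followed by a fifth digit):

```python
import re

def starts_with_four_digits(s: str) -> bool:
    # re.match is anchored at the start; \d{4} requires four digits,
    # and (\D|$) requires that a non-digit (or end of string) follows.
    return re.match(r"\d{4}(\D|$)", s) is not None

print(starts_with_four_digits("4030 - random name"))  # True
print(starts_with_four_digits("12-10-2020"))          # False
print(starts_with_four_digits("403 - short"))         # False
```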

how to separate the whole number and decimal numbers in a separate columns using python csv?

I saw many similar questions like this, but none of them were done on a csv file using Python. Basically I have a column with decimal numbers, and I want to write code that creates 2 new columns: one for just the whole number and the other for the decimals. I turned the column into numeric using the code below.
df['hour_num'] = pd.to_numeric(df['total_time'])
I already have the column 'total_time' and 'hour_num'. I want to know how to get the column 'Whole number' and 'Decimal'
here is the pic to help better understand.
You can convert the numbers to strings, split on '.', convert the result to a DataFrame, and assign it back to the original DataFrame.
df = pd.DataFrame({'col1':[2.123, 3.557, 0.123456]})
df[['whole number', 'decimal']] = df['col1'].astype(str).str.split('.').apply(pd.Series)
df['decimal'] = ('0.' + df['decimal']).astype(float)
df['whole number'] = df['whole number'].astype(int)
Output:
col1 whole number decimal
0 2.123000 2 0.123000
1 3.557000 3 0.557000
2 0.123456 0 0.123456
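Since the column is already numeric, an alternative sketch avoids the string round-trip entirely by truncating toward zero and subtracting (reusing the sample values from the answer above):

```python
import pandas as pd

df = pd.DataFrame({"total_time": [2.123, 3.557, 0.123456]})
df["hour_num"] = pd.to_numeric(df["total_time"])

# astype(int) truncates toward zero for the whole part;
# subtracting it leaves the decimal part.
df["Whole number"] = df["hour_num"].astype(int)
df["Decimal"] = df["hour_num"] - df["Whole number"]
print(df)
```

One caveat: floating-point subtraction can leave tiny rounding errors (e.g. 0.12299999999999997 instead of 0.123), so round the result or keep the string approach if the exact printed digits matter.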

Finding and deleting sub-strings in dataframe column Python

I would like to find all the rows in a column that contain a unique ID as a string which starts with digits and symbols. After they have been identified, I would like to delete the first 9 characters from those rows only. So far I have:
if '.20_P' in df['ID']:
    df['ID'] = df['ID'].str.slice(9)
where I would like it to take this:
df['ID'] =
2.2.2020_P18dhwys
2.1.2020_P18dh234
2.4.2020_P18dh229
P18dh209
P18dh219
2.5.2020_P18dh289
and turn it into this:
df['ID'] =
P18dhwys
P18dh234
P18dh229
P18dh209
P18dh219
P18dh289
Do a conditional row-wise apply to the same column (note the slice [9:], which drops the first 9 characters):
df['ID'] = df.apply(lambda row: row['ID'][9:] if '.20_P' in row['ID'] else row['ID'], axis=1)
You could also use a regular expression to extract your substring.
The regular expression here works as follows: capture a group (()) of one or more (+) digits (\d) or word characters (\w) inside a character class ([]). This may optionally (*, ?) be preceded by a combination of digits and dots ([\d+\.]) with a trailing underscore (_). Note that this is also quite fast, as pandas' vectorized string methods are highly optimized (compared to .apply()). So if you have a lot of data, or do this often, it is something you might want to consider.
import pandas as pd
df = pd.DataFrame({'A': [
'2.2.2020_P18dhwys',
'2.1.2020_P18dh234',
'2.4.2020_P18dh229',
'P18dh209',
'P18dh219',
'2.5.2020_P18dh289',
]})
print(df['A'].str.extract(r'[\d+\.]*_?([\d\w]+)'))
Output:
0
0 P18dhwys
1 P18dh234
2 P18dh229
3 P18dh209
4 P18dh219
5 P18dh289
If you know that the string to remove is a prefix joined with an underscore, you could do
df['ID'] = df['ID'].apply(lambda x: x.split('_')[-1])
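The same idea can be written vectorized with the .str accessor instead of apply; a sketch on a subset of the sample IDs:

```python
import pandas as pd

df = pd.DataFrame({"ID": [
    "2.2.2020_P18dhwys", "2.1.2020_P18dh234", "P18dh209",
]})

# .str.split returns a list per row; .str[-1] takes the last piece,
# which is the whole string when there is no underscore.
df["ID"] = df["ID"].str.split("_").str[-1]
print(df["ID"].tolist())  # ['P18dhwys', 'P18dh234', 'P18dh209']
```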

Python how to remove unwanted commas in a dataframe containing lists as element

I have a dataframe containing lists as elements, which I got from field measurements. I am processing each list in some operation. Surprisingly, some random lists have an additional comma at the end, and this stops the whole process.
df =
index data
0 [1.002,1.001,1,1.005,1.001,1.001,1]
1 [2.002,2.001,2,2.005,2.001,2.001,2,,]
2 [4.002,3.001,2,1.005,2.001,6.001,5]
3 [1.002,1.001,1,1.005,1.001,1.001,9,,]
4 [8.002,1.001,7,1.005,9.001,8.001,12]
My dataframe has 90000 rows. Example rows which give the error are shown at index 1 and 3: these two lists have additional commas at the end. I want to eliminate those additional commas from the lists. How do I do it?
My present code:
for index, row in iv_df.iterrows():
    row['data'] = np.setdiff1d(row['data'], [,])
Present output:
SyntaxError: invalid syntax
Expected output:
df =
index data
0 [1.002,1.001,1,1.005,1.001,1.001,1]
1 [2.002,2.001,2,2.005,2.001,2.001,2]
2 [4.002,3.001,2,1.005,2.001,6.001,5]
3 [1.002,1.001,1,1.005,1.001,1.001,9]
4 [8.002,1.001,7,1.005,9.001,8.001,12]
Any idea on how to achieve it?
df['data'] = df.data.replace(to_replace=r',,', value='', regex=True)
print(df)
For a more concise alternative that handles any number of trailing commas, strip them together with the closing bracket and add the bracket back:
df['data'] = df['data'].str.rstrip(',]') + ']'
