Pandas replace function - insert pad 0 to 3 numbers - python

in Pandas (df), a column with following strings. looking to pad 0 when number within string are <100
Freq
XXX100KHz
XYC200KHz
YYY80KHz
YYY50KHz
to:
Freq
XXX100KHz
XYC200KHz
YYY080KHz
YYY050KHz
following function doesn't work, as \1 then 0 won't work as \10 doesn't exist.
df.replace({'Freq':'^([A-Za-z]+)(\d\d[A-Za-z]*)$'},{'Freq':r'\1**0**\2'},regex=True, inplace=True)

Try:
df["Freq"] = df["Freq"].str.replace(
r"(?<=\D)\d{1,2}(?=KHz)",
lambda g: "{:0>3}".format(g.group()),
regex=True,
)
print(df)
Prints:
Freq
0 XXX100KHz
1 XYC200KHz
2 YYY080KHz
3 YYY050KHz

Related

How to split a string without given delimeter in Panda

dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming your split criteria is by fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If your split criteria is by the first digit in the string, you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
Assuming your prefix consists of a sequence of alphabets followed by a sequence of digits, which both have variable length. Then a regex split function can be constructed and applied on each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
"""Split two part by regex"""
# alphabet(s) followed by digit(s)
o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
# may add exception handling here if there is no match
return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list() to convert tuples into lists, which is required for the new column assignment to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
Hoe about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
Series.str.split with a regex pattern and a kwarg to specify that it should only split on the first match.
in th regex pattern, I use a group (the round braces in (\d)) to capture the separating character
to_list() to output the split as a list of lists
DataFrame constructor to build a new DataFrame from that list
string concat of two columns

Count list length in a column of a DataFrame

This is my Dataframe:
CustomerID InvoiceNo
0 12346.0 [541431, C541433]
1 12347.0 [537626, 542237, 549222, 556201, 562032, 57351]
2 12348.0 [539318, 541998, 548955, 568172]
3 12349.0 [577609]
4 12350.0 [543037]
Desired Output:
CustomerID InvoiceCount
0 12346.0 2
1 12347.0 6
2 12348.0 4
3 12349.0 1
4 12350.0 1
I want to calculate the total number of Invoice a customer(CustomerID) have.
Please help.
See if this works:
df["InvoiceCount"] = df['InvoiceNo'].str.len()
If you have real list then you can do
df['InvoiceCount'] = df['InvoiceNo'].apply(len)
If you have string with list then you would have to convert string to real list before count
df['InvoiceNo'] = df['InvoiceNo'].apply(eval)
But it may not work if number C541433 (with C) is correct and may need
df['InvoiceCount'] = df['InvoiceNo'].apply(lambda x: len(x.split(',')))
or similar to example in #Datanovice comment
df['InvoiceCount'] = df['InvoiceNo'].str.split(',').str.len()
Minimal working example
import pandas as pd
import io
text = '''CustomerID;InvoiceNo
12346.0;[541431, 541433]
12347.0;[537626, 542237, 549222, 556201, 562032, 57351]
12348.0;[539318, 541998, 548955, 568172]
12349.0;[577609]
12350.0;[543037]'''
df = pd.read_csv(io.StringIO(text), sep=';')
print( df['InvoiceNo'].apply(lambda x: len(eval(x))) )
print( df['InvoiceNo'].apply(eval).apply(len) )
print( df['InvoiceNo'].apply(lambda x: len(x.split(','))) )
print( df['InvoiceNo'].str.split(',').str.len() )
df['InvoiceNo'] = df['InvoiceNo'].apply(eval)
print( df['InvoiceNo'].apply(len) )
If thats in a list, you can use the function 'len'
So let's say the list is in the variable values:
values = [537626, 542237, 549222, 556201, 562032, 57351]
then the amount is:
len(values) # 6
this would return 6 in this example

Counting Values In Columns Igonorig AlphaNumeric Values

First post here, I am trying to find out total count of values in an excel file. So after importing the file, I need to run a condition which is count all the values except 0 also where it finds 0 make that blank.
> df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above one but not working properly, It is counting the column name as well, I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are column name and index value which needs to be ignored when counting.
The output Should be like this after counting:
Final 8 9 9 5
If you just nee the count, but change the values in your dataframe, you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
if isinstance(value, float):
return 1
else:
return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, So here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What i did was i changed the starting index so i didn't count the first row which was alphanumeric and then i just compared it.
Thanks for your response.

pandas dataframe contains list

I have currently run the following script which uses Fuzzylogic to replace some common words from the list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe where transformations/changes are undertaken after referring to Dataframe df1. The code is as follows:
df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Drop nan value
df2=pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/ recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0] as solved by #Psidom
def replace_all(eg):
rep = {"[":"",
"]":"",
"u":"",
"}":"",
"'":"",
'"':"",
"frozenset":""}
for i,j in rep.items():
eg = eg.replace(i,j)
return eg
for each in df.columns:
df[each] = df[each].apply(lambda x : replace_all(str(x)))

Pandas DataFrame get substrings from column

I have a column named "KL" with for example:
sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000
Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.
Locations of M and F vary, thus just using the slice [9:13] won't work for the complete column.
If you want to use str.extract, here's how:
>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
M F
0 4209 2057
1 5836 4798
Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
You could use split to achieve this, probably a better way exists:
In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s
Out[147]:
0 sem_0405M4209F2057_1.000
1 sem_A_0103M5836F4798_1.000
dtype: object
In [153]:
m = s.str.split('M').str[1].str.split('F').str[0][:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)
0 4209
1 5836
dtype: object
0 2057
1 4798
dtype: object
You can also use regex:
import re
def get_data(x):
data = re.search( r'M(\d{4})F(\d{4})', x)
if data:
m = data.group(1)
f = data.group(2)
return m, f
df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})
df['data'] = df['a'].apply(lambda x: get_data(x))
>>
a data
0 sem_0405M4209F2057_1.000 (4209, 2057)
1 sem_0405M4239F2027_1.000 (4239, 2027)

Categories

Resources