I have a Pandas series which can be produced by the below code:
Input:
import pandas as pd

l = ['abcd 1942 Lmauu 40% 70cl',
     'something again something 1.5 L',
     'some other stuff 45% 70 CL',
     'not the exact data 3LTR',
     'abcd 100Ltud 6%(8)500ML',
     'cdef 6%(8)500 ml',
     'a packet 24 x 27.5 cl ( PET )']
ser = pd.Series(l)
Problem Statement and expected output:
I am trying to extract the volumes from the series and convert them into a dataframe, with the volume in one column and the unit of measure in the other. The expected output can be reproduced using the code below:
d = {0: {0: '70', 1: '1.5', 2: '70', 3: '3', 4: '500', 5: '500', 6: '27.5'},
     1: {0: 'cl', 1: 'L', 2: 'CL', 3: 'LTR', 4: 'ML', 5: 'ml', 6: 'cl'}}
expected_output = pd.DataFrame(d)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
My attempt
Here is what I have tried. I have come very close to what I want, but not quite: as you can see, I don't get the last volume. I think that is because I included $ in my regex, but without it I could not parse the volume reliably; in the string abcd 1942 Lmauu 40% 70cl, for example, 1942 L would have been returned. Also, I want the unit of measure only in the second column, not in the first as shown in my output, but that is secondary.
print(ser.str.extract(r'((?i)([\d]+?[.])?\d+?[\s+]?(cl$|ml$|ltr$|L$)(?:$))').iloc[:,[0,-1]])
0 2
0 70cl cl
1 1.5 L L
2 70 CL CL
3 3LTR LTR
4 500ML ML
5 500 ml ml
6 NaN NaN
Please suggest what I should do here.
You may use
r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b'
See the regex demo.
Details
(?i) - case insensitive mode on
\b - a word boundary
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits followed with an optional sequence of a dot and one or more digits
\s* - 0+ whitespaces
(cl|ml|ltr|L) - cl, ml, ltr or L (mind the case insensitive matching)
\b - a word boundary
Test:
>>> ser.str.extract(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', expand=True)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
It is better to use named capturing groups, so that the result columns have meaningful names. I also simplified your regex a bit and changed the units of measure to lower case.
So change your code to:
res = ser.str.extract(r'(?i)(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b')
res.Unit = res.Unit.str.lower()
The result is:
Amount Unit
0 70 cl
1 1.5 l
2 70 cl
3 3 ltr
4 500 ml
5 500 ml
6 27.5 cl
Note also that $ in (cl$|ml$|ltr$|L$) is wrong, because in at least one case there is additional text after the unit of measure.
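To see the difference, here is a minimal sketch with plain re on the last input string (my own illustration, not part of the original code):
import re

s = 'a packet 24 x 27.5 cl ( PET )'
# With the $ anchor the unit must end the string, so '( PET )' breaks the match:
print(re.search(r'(?i)(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)$', s))  # None
# With word boundaries instead, the volume is found mid-string:
print(re.search(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', s).groups())  # ('27.5', 'cl')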
Updated: instead of dict data, I changed the input to a DataFrame.
I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.
The criteria of my analysis are based on whether certain words appear in a certain cell. I believe I will be more successful if I know which words are most relevant in terms of values...
Foo data to be used as an example:
import pandas as pd

data = {'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
        'cost': [1000, 1200, 300, 100]}
df_data = pd.DataFrame(data)
            product  cost
0  Dell Notebook I7  1000
1  Dell Notebook I3  1200
2  Logitech mx keys   300
3     Logitech mx 2   100
Basically, the product column shows the product description, and the cost column shows the product cost.
What I want:
I would like to create another dataframe like this:
Desired Output:
  unique_words  total_cost_for_unique_word
1         Dell                        2200
4     Logitech                        2200
5     Notebook                        2200
2           I3                        1200
3           I7                        1000
7           mx                         400
6         keys                         300
0            2                         100
Column unique_words with the list of each word that appears in the column product.
Column total_cost_for_unique_word with the sum of the values of products that contain that word.
I've tried searching for posts here on StackOverflow, and I've also done Google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.
If by any chance it has already been answered, please let me know and I will delete the post.
Thank you all.
You can split, explode, groupby.agg:
df_data = pd.DataFrame(data)
new_df = (df_data
          .assign(unique_words=df_data['product'].str.split())
          .explode('unique_words')
          .groupby('unique_words', as_index=False)
          .agg(**{'total cost': ('cost', 'sum')})
          .sort_values('total cost', ascending=False, ignore_index=True)
          )
Output:
unique_words total cost
0 Dell 2200
1 Notebook 2200
2 I3 1200
3 I7 1000
4 Logitech 400
5 mx 400
6 keys 300
7 2 100
If you first split the product into a list of all words (the default separator is whitespace):
df_data["product"] = df_data["product"].str.split()
You can then explode this (each item in the list becomes a new row), group the words together and sum the costs, then sort and rename the columns to suit your outcome:
(df_data.explode("product")
        .groupby("product", as_index=False)
        .agg("sum")
        .sort_values("cost", ascending=False)
        .rename(columns={"product": "unique_words",
                         "cost": "total_cost_for_unique_word"}))
I am new to Python; kindly help me.
I have two sets of CSV files. I need to compare them and output the differences: changed data, deleted data, and added data. Here's my example:
file 1:
Sn Name Subject Marks
1 Ram Maths 85
2 sita Engilsh 66
3 vishnu science 50
4 balaji social 60
file 2:
Sn Name Subject Marks
1 Ram computer 85    # subject name has changed
2 sita Engilsh 66
3 vishnu science 90  # marks have changed
4 balaji social 60
5 kishor chem 99     # new line added
Output - I need to get something like this:
Changed Items:
1 Ram computer 85
3 vishnu science 90
Added item:
5 kishor chem 99
Deleted item:
.................
I imported csv and did the comparison via a for loop with readlines. I am not getting the desired output; flagging the added and deleted items between file 1 and file 2 (CSV files) is confusing me a lot. Please suggest effective code, folks.
The idea here is to flatten your dataframe with melt to compare each value:
import numpy as np
import pandas as pd

# Load your csv files
df1 = pd.read_csv('file1.csv', ...)
df2 = pd.read_csv('file2.csv', ...)
# Select columns (not mandatory, it depends on your 'Sn' column)
cols = ['Name', 'Subject', 'Marks']
# Flat your dataframes
out1 = df1[cols].melt('Name', var_name='Item', value_name='Old')
out2 = df2[cols].melt('Name', var_name='Item', value_name='New')
out = pd.merge(out1, out2, on=['Name', 'Item'], how='outer')
# Flag the state of each item
condlist = [out['Old'] != out['New'],
            out['Old'].isna(),
            out['New'].isna()]
out['State'] = np.select(condlist, choicelist=['changed', 'added', 'deleted'],
                         default='unchanged')
Output:
>>> out
Name Item Old New State
0 Ram Subject Maths computer changed
1 sita Subject Engilsh Engilsh unchanged
2 vishnu Subject science science unchanged
3 balaji Subject social social unchanged
4 Ram Marks 85 85 unchanged
5 sita Marks 66 66 unchanged
6 vishnu Marks 50 90 changed
7 balaji Marks 60 60 unchanged
8 kishor Subject NaN chem changed
9 kishor Marks NaN 99 changed
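Note that with this condlist order, a NaN in Old also satisfies out['Old'] != out['New'], which is why rows 8 and 9 above are flagged changed rather than added. If you want added and deleted rows reported separately, a small reordering (my suggestion, not part of the answer above) that checks isna() first would do it:
# Check for missing values first so they are not swallowed by 'changed'
condlist = [out['Old'].isna(),   # present only in the new file
            out['New'].isna(),   # present only in the old file
            out['Old'] != out['New']]
out['State'] = np.select(condlist, choicelist=['added', 'deleted', 'changed'],
                         default='unchanged')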
# Compares rows by position: assumes both files keep the same row order
# and that file 2 only appends new rows at the end.
count, flag = 0, 1
for i, j in zip(df1.values, df2.values):
    if sum(i == j) != 4:  # fewer than 4 equal fields -> the row changed
        if flag:
            print("Changed Items:")
            flag = 0
        print(j)
    count += 1
if count != len(df2):  # rows of df2 beyond the zipped range are new
    print("Newly added:")
    print(*df2.iloc[count:, :].values)
I have a CSV like this:
userlabel|country
SZ5GZTD_[56][13631808]|russia
YZ5GZTC-3_[51][13680735]|uk
XZ5GZTA_12-[51][13574893]|usa
testYZ5GZWC_11-[51][13632101]|cuba
I use pandas to read this CSV. I'd like to add a new column ci whose value comes from userlabel, and the following conditions must be met:
convert values to lowercase
start with 'yz' or 'testyz'
The code is like this:
(df['userlabel'].str.lower()).str.extract(r"(test)?([a-z]+).*", expand=True)[1]
When it matches, ci is the number between the first "-" or "_" and the second "-" or "_" in userlabel.
The pseudocode is like this:
ci = (userlabel, r'.*(\_|\-)(\d+)(\_|\-).*', 2)
Finally, the result is like this:
userlabel                      ci  country
SZ5GZTD_[56][13631808]             russia
YZ5GZTC-3_[51][13680735]       3   uk
XZ5GZTA_12-[51][13574893]          usa
testYZ5GZWC_11-[51][13632101]  11  cuba
You can use
import pandas as pd
df = pd.DataFrame({'userlabel':['SZ5GZTD_[56][13631808]','YZ5GZTC-3_[51][13680735]','XZ5GZTA_12-[51][13574893]','testYZ5GZWC_11-[51][13632101]'], 'country':['russia','uk','usa','cuba']})
df['ci'] = df['userlabel'].str.extract(r"(?i)^(?:yz|testyz)[^_-]*[_-](\d+)[-_]", expand=True)
>>> df['ci']
0 NaN
1 3
2 NaN
3 11
Name: ci, dtype: object
# To rearrange columns, add the following line:
df = df[['userlabel', 'ci', 'country']]
>>> df
userlabel ci country
0 SZ5GZTD_[56][13631808] NaN russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] NaN usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
See the regex demo.
Regex details:
(?i) - make the pattern case insensitive (no need to use str.lower())
^ - start of string
(?:yz|testyz) - a non-capturing group matching either yz or testyz
[^_-]* - zero or more chars other than _ and -
[_-] - the first _ or -
(\d+) - Group 1 (Series.str.extract requires a capturing group, since it only returns the captured substring): one or more digits
[-_] - a - or _.
import re

def get_val(s):
    # Capture the digits enclosed by "_" or "-" in strings
    # that start with YZ or testYZ
    l = re.findall(r'^(YZ|testYZ).*[_-](\d+)[_-].*', s)
    return None if len(l) == 0 else l[0][1]

df['ci'] = df['userlabel'].apply(lambda x: get_val(x))
df = df[['userlabel', 'ci', 'country']]
userlabel ci country
0 SZ5GZTD_[56][13631808] None russia
1 YZ5GZTC-3_[51][13680735] 3 uk
2 XZ5GZTA_12-[51][13574893] None usa
3 testYZ5GZWC_11-[51][13632101] 11 cuba
I have a dataset (df) with 2 columns of arbitrary length, and I need to split it up based on the value.
BUS                                                                             CODE
150 H.S.London-lon3 11£150 H.S.London-lon3 16£150 H.S.London-lon3 120           GERI
400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15£400 Airport Luton-ptr5 17   24£JTR
005 Plaza-cata-md6 08£005 Plaza-cata-md6 012£005 Plaza-cata-md6 18              78£TDE
I've been trying to split it to look like this:
bus  directions     zone  time  code  name
150  H.S.London     lon3  11    NaN   GERI
400  Airport Luton  ptr5  12    24    JTR
005  Plaza-cata     md6   08    78    TDE
So far, I have tried to split by patterns, but it isn't working, and I'm out of ideas for how to split it another way.
bus = r'(?P<bus>[\d]+) (?P<direction>[\w\W]+)-(?P<zone>[\w]+)'
code = r'(?P<code>[\S]+)£(?P<name>\d+)'
df.BUS.str.extract(bus).join(df.CODE.str.extract(code))
I was wondering if anyone had a good solution to this.
You can use .str.extract with a regex pattern containing named capturing groups:
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'
df['BUS'].str.extract(bus).join(df['CODE'].str.extract(code))
bus directions zone time code name
0 150 H.S.London lon3 11 NaN GERI
1 400 Airport Luton ptr5 12 24 JTR
2 005 Plaza-cata md6 08 78 TDE
See the regex demo for code pattern here and for bus pattern here.
You could use split:
For your code column:
new_cols = ['code', 'name']
df[new_cols] = df.CODE.str.split(pat='£', expand=True)
I'm sure you can find a way to do this for your first column (a rough sketch follows below), and if you have duplicates, remove them after splitting.
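For the BUS column, here is a sketch of the same split-based idea, under the assumption that every £-separated segment has the shape <bus> <directions>-<zone> <time>, so the first segment is enough:
# Take the first £-separated segment, e.g. '150 H.S.London-lon3 11'
first = df.BUS.str.split('£').str[0]
df['bus'] = first.str.split(n=1).str[0]    # leading number
rest = first.str.split(n=1).str[1]         # e.g. 'H.S.London-lon3 11'
df['time'] = rest.str.rsplit(n=1).str[1]   # trailing number
# Only the last '-' separates the zone, so 'Plaza-cata-md6' keeps its inner '-'
df[['directions', 'zone']] = rest.str.rsplit(n=1).str[0].str.rsplit('-', n=1, expand=True)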
I want to make a loop that will pull a number or range within a dataframe and stop analyzing the string after the word has been found.
For example:
df['size'] = ['sz 10-13 of jordan 12', 'size 10 adidas',
              'size 11 nike air forece 1', 'sz 6-7 jordan 6sz',
              'brand new Sz 11 jordan 5']
I need a function similar to this:
def assignSize(row):
    sizeList = []
    for word in sizeList:
        if word == 'sz' or word == 'size':
            # I do not know what to place here
But I would like my output to be:
df['size'] =['10-13','10','11','6-7']
Basically, I want the script to stop reading the string after finding the first number or first range of numbers. So if there is another 'sz' after the initial size or sz, it should not read it.
Why not just this?
df['size'] = df['size'].apply(lambda x: x.split()[1])
print(df['size'])
Output:
0 10-13
1 10
2 11
3 6-7
Name: size, dtype: object
Edit:
Try this:
import re
df['size'] = ['sz 10-13 of jordan 12', 'size 10 adidas',
              'brand new Sz 13 jordan 5', 'sz 6-7 jordan 6sz']
df['size'] = df['size'].apply(lambda x: '-'.join(re.findall(r'\d+', ' '.join(x.split()[:-1]))))
print(df['size'])
Output:
0 10-13
1 10
2 13
3 6-7
Name: size, dtype: object
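For reference, an alternative sketch (my own, not part of the answer above) that anchors on the sz/size keyword with str.extract, applied to the original strings, so only the first number or range after the keyword is read:
# Assumes every size follows a 'sz'/'size' token (case-insensitive)
df['size'] = df['size'].str.extract(r'(?i)\bsi?ze?\s*(\d+(?:-\d+)?)', expand=False)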