Extract data from a specific format in a pandas DataFrame - Python

I have raw data in CSV format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset. I have 10,000 rows of data. How can I do this using pandas?
Can we do this using regular expressions? I'm not sure how to do it.
Thank you.
Edit 1:
My data looks something like this:
import pandas as pd

df = {
    'product-name': [['Whole Wheat'], ['Milk']],
    'brand-name': [['bb Royal'], ['XYZ']],
    'rating': [['4.1'], ['4.0']]
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)

IIUC, select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or if values are strings:
df = df.replace(r'[\[\]]', '', regex=True)

You can use the explode function:
df = df.apply(pd.Series.explode)
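For reference, a quick check of both approaches on the list-valued frame from Edit 1 (a minimal sketch using df_p as defined above):
import pandas as pd

df_p = pd.DataFrame({
    'product-name': [['Whole Wheat'], ['Milk']],
    'brand-name': [['bb Royal'], ['XYZ']],
    'rating': [['4.1'], ['4.0']]
})

# take the first element of each list cell
print(df_p.apply(lambda x: x.str[0]))
#   product-name brand-name rating
# 0  Whole Wheat   bb Royal    4.1
# 1         Milk        XYZ    4.0

# explode produces the same frame here, since each list holds one element
print(df_p.apply(pd.Series.explode))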

Related

Python Pandas .str.extract method fails when indexing

I'd like to set values on a slice of a DataFrame with .loc using pandas' string extract method .str.extract(); however, it's not working due to indexing errors. This code works perfectly if I swap extract with contains.
Here is a sample frame:
import pandas as pd
df = pd.DataFrame(
    {
        'name': [
            'JUNK-0003426', 'TEST-0003435', 'JUNK-0003432', 'TEST-0003433', 'TEST-0003436',
        ],
        'value': [
            'Junk', 'None', 'Junk', 'None', 'None',
        ]
    }
)
Here is my code:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)")
How can I set the None values to the extracted regex string?
Hmm, the problem seems to be that .str.extract returns a pd.DataFrame; you can .squeeze it to turn it into a Series and it seems to work fine:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").squeeze()
Index alignment takes care of the rest.
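To see why the .squeeze() is needed (a quick check on the sample frame above):
extracted = df["name"].str.extract(r"TEST-\d{3}(\d+)")
print(type(extracted))            # <class 'pandas.core.frame.DataFrame'>
print(type(extracted.squeeze()))  # <class 'pandas.core.series.Series'>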
Instead of trying to get the group, you can replace the rest with the empty string:
df.loc[df['value']=='None', 'value'] = df.loc[df['value']=='None', 'name'].str.replace(r'TEST-\d{3}', '', regex=True)
Here is a way to do it:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").loc[:,0]
Output:
name value
0 JUNK-0003426 Junk
1 TEST-0003435 3435
2 JUNK-0003432 Junk
3 TEST-0003433 3433
4 TEST-0003436 3436

Extract the country code from the full post code - DataFrame Python

My data contains place information as a full post code, for example CZ25145. I would like to create a new column with just the country part, e.g. CZ. How can I do this?
I have this:
import pandas as pd
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
},)
I would like to get it like below:
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832'],
'COUNTRY_LOAD_PLACE' : ['PL', 'CZ', 'DE', 'DE', 'SK']
},)
I tried using .factorize and .groupby but without success.
Use .str and select the first 2 characters:
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str[:2]

Pandas: create multiple dataframes based on groups from another dataframe

I have a pandas dataframe
df = pd.DataFrame({'Name': ['Jhon', 'Andy', 'Jenny', 'Joan', 'Paul', 'Rosa'],
                   'Position': ['Programmer', 'Designer', 'Programmer', 'Designer', 'Analyst', 'Analyst']})
I want to create multiple other dataframes based on the Position, naming each dataframe with the prefix "Job_as_".
Expected output would be
Job_as_Programmer=['Jhon','Jenny']
Job_as_Designer=['Andy','Joan']
Use pandas.DataFrame.groupby with pandas.Series.add_prefix:
df2 = df.groupby("Position")["Name"].apply(list)
df2.add_prefix("Job_as_").to_dict()
Output:
{'Job_as_Analyst': ['Paul', 'Rosa'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Programmer': ['Jhon', 'Jenny']}
You could create a dictionary:
{"Job_as_"+ x : df.loc[df.Position==x, "Name"].to_list() for x in df.Position.unique()}
Output:
{
'Job_as_Programmer': ['Jhon', 'Jenny'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Analyst': ['Paul', 'Rosa']
}
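If you then need the individual lists, pull them out of the dict rather than creating dynamically named variables (a usage sketch; the name jobs is hypothetical):
jobs = {"Job_as_" + x: df.loc[df.Position == x, "Name"].to_list()
        for x in df.Position.unique()}
Job_as_Programmer = jobs["Job_as_Programmer"]  # ['Jhon', 'Jenny']
Job_as_Designer = jobs["Job_as_Designer"]      # ['Andy', 'Joan']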
You could just use groupby as below:
import pandas as pd
df = pd.DataFrame({'Name': ['Jhon', 'Andy', 'Jenny', 'Joan', 'Paul', 'Rosa'],
                   'Position': ['Programmer', 'Designer', 'Programmer', 'Designer', 'Analyst', 'Analyst']})
newDf = df.groupby(["Position", "Name"]).first()
newDf  # print the table
Output:
Position    Name
Analyst     Paul
            Rosa
Designer    Andy
            Joan
Programmer  Jenny
            Jhon
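To turn that grouped view back into lists per Position (a sketch building on newDf above):
lists = newDf.reset_index().groupby('Position')['Name'].apply(list).to_dict()
# {'Analyst': ['Paul', 'Rosa'], 'Designer': ['Andy', 'Joan'], 'Programmer': ['Jenny', 'Jhon']}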

Use a JSON column in a pattern using pandas

I have JSON file data. Given below is a sample of it:
[{
    "Type": "Fruit",
    "Names": "Apple;Orange;Papaya"
}, {
    "Type": "Veggie",
    "Names": "Cucumber;Spinach;Tomato"
}]
I have to read Names and match each item of Names against a column in another df.
I am stuck at converting the value of the Names key into a list that can be used in a pattern. The code I tried is:
df1 = pd.DataFrame(data)
PriList=df1['Names'].str.split(";", n = 1, expand = True)
Pripat = '|'.join(r"\b{}\b".format(x) for x in PriList)
df['Match'] = df['MasterList'].str.findall('('+ Pripat + ')').str.join(', ')
The issue is with Pripat. Its content is
\bApple, Orange\b
If I give the Names in a list like below
Prilist=['Apple','Orange','Papaya']
the code works fine...
Please help.
You'll need to call str.split and then flatten the result using itertools.chain.
First, do
df2 = df1.loc[df1.Type.eq('Fruit')]
Now,
from itertools import chain
prilist = list(chain.from_iterable(df2.Names.str.split(';').values))
There's also stack (which is slower):
prilist = df2.Names.str.split(';', expand=True).stack().tolist()
print(prilist)
['Apple', 'Orange', 'Papaya']
Alternatively, join the values and split once:
df2 = df1.loc[df1.Type.eq('Fruit')]
out_list = ';'.join(df2['Names'].values).split(';')
# print(out_list)
['Apple', 'Orange', 'Papaya']
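Either way, with a flat list the pattern from the question builds correctly (reusing the Pripat logic above):
prilist = ['Apple', 'Orange', 'Papaya']
pripat = '|'.join(r"\b{}\b".format(x) for x in prilist)
print(pripat)
# \bApple\b|\bOrange\b|\bPapaya\b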

Trim each column's values in pandas

I am working with .xls files; after importing the data into a DataFrame with pandas, I need to trim the values. I have a lot of columns. Each value in a column starts with xxx: or yyy:,
for example:
xxx:abc yyy:def \n
xxx:def yyy:ghi \n
xxx:ghi yyy:jkl \n
...
I need to trim the xxx: and yyy: from each column. I researched and tried some solutions, but they didn't work. How can I trim these? I need efficient code. Thanks in advance.
(The unnecessary characters don't have a fixed length; I just know what they look like, similar to stop words. For example:
['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB', ...]
['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB', ...]
i want to new dataset look like:
['Apple', 'iPhone', '2018', '128GB', ...]
['Samsung', 'Note', '2017', '64GB', ...]
So I want to trim the ('Comp:', 'Product:', 'Year:', ...) stop words from each column.)
You can use pd.Series.str.split for this:
import pandas as pd
df = pd.DataFrame([['Comp:Apple', 'Product:iPhone', 'Year:2018', '128GB'],
                   ['Comp:Samsung', 'Product:Note', 'Year:2017', '64GB']],
                  columns=['Comp', 'Product', 'Year', 'Memory'])

for col in ['Comp', 'Product', 'Year']:
    df[col] = df[col].str.split(':').str.get(1)
# Comp Product Year Memory
# 0 Apple iPhone 2018 128GB
# 1 Samsung Note 2017 64GB
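If the prefixes vary or new ones appear, a more generic variant strips everything up to the first colon in every cell (a sketch, assuming colons only ever appear in the prefixes):
# remove any leading "word:" prefix in every string column
df = df.replace(r'^[^:]*:', '', regex=True)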
