Pandas create multiple dataframe based on group from another dataframe - python

I have a pandas dataframe
df=pd.DataFrame({'Name':['Jhon','Andy','Jenny','Joan','Paul','Rosa'],
'Position':['Programmer','Designer','Programmer','Designer','Analyst','Analyst']})
I want to create several other dataframes based on Position, and name each one "Job_as_" followed by the position.
The expected output would be:
Job_as_Programmer=['Jhon','Jenny']
Job_as_Designer=['Andy','Joan']

Use pandas.DataFrame.groupby with pandas.Series.add_prefix:
df2 = df.groupby("Position")["Name"].apply(list)
df2.add_prefix("Job_as_").to_dict()
Output:
{'Job_as_Analyst': ['Paul', 'Rosa'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Programmer': ['Jhon', 'Jenny']}

You could create a dictionary:
{"Job_as_"+ x : df.loc[df.Position==x, "Name"].to_list() for x in df.Position.unique()}
Output
{
'Job_as_Programmer': ['Jhon', 'Jenny'],
'Job_as_Designer': ['Andy', 'Joan'],
'Job_as_Analyst': ['Paul', 'Rosa']
}

You could just use groupby, as below:
import pandas as pd
df=pd.DataFrame({'Name':['Jhon','Andy','Jenny','Joan','Paul','Rosa'],
'Position':['Programmer','Designer','Programmer','Designer','Analyst','Analyst']})
newDf = df.groupby(["Position" , "Name"]).first()
newDf #To Print Table
Output:
Position    Name
Analyst     Paul
            Rosa
Designer    Andy
            Joan
Programmer  Jenny
            Jhon
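The answers above return lists or a grouped table, while the question asks for separate dataframes. A minimal sketch of one way to get them is to keep one sub-DataFrame per position in a dictionary keyed by the "Job_as_" name. The variable name frames is hypothetical and not from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jhon', 'Andy', 'Jenny', 'Joan', 'Paul', 'Rosa'],
    'Position': ['Programmer', 'Designer', 'Programmer', 'Designer', 'Analyst', 'Analyst'],
})

# One sub-DataFrame per position, keyed by "Job_as_<Position>".
# `frames` is a hypothetical name used only for this sketch.
frames = {
    f"Job_as_{position}": group.reset_index(drop=True)
    for position, group in df.groupby("Position")
}

print(frames["Job_as_Programmer"])
```

A dictionary is usually easier to work with than dynamically created variable names, because the keys can be looped over or looked up by string.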

Related

How to transform json format into string column for python dataframe?

I got this dataframe:
Dataframe: df_case_1
Id RecordType
0 1234 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/1234', 'name', 'XYZ'}}
1 4321 {'attributes': {'type': 'RecordType', 'url': '/services/data/v55.0/sobjects/RecordType/4321', 'name', 'ABC'}}
I want to have this dataframe:
Dataframe: df_case_final
Id RecordType
0 1234 'XYZ'
1 4321 'ABC'
At the moment I use this statement, but it assigns the name from position 0 to every row:
df_case_1['RecordType'] = df_case_1.RecordType[0]['Name']
How can I build a statement that gives me the correct name for every Id, as in df_case_final?
Thanks
There are three ways to convert JSON to a pandas DataFrame:
import io
import json
import pandas as pd
# 1. Use json_normalize() to convert JSON to a DataFrame
# (data is assumed to be a JSON string with a 'technologies' key)
parsed = json.loads(data)
df = pd.json_normalize(parsed['technologies'])
# 2. Convert JSON to a DataFrame using read_json()
# (wrapping the string in StringIO avoids the deprecation warning in recent pandas)
df2 = pd.read_json(io.StringIO(jsonStr), orient='index')
# 3. Use pandas.DataFrame.from_dict() to convert JSON to a DataFrame
parsed = json.loads(data)
df2 = pd.DataFrame.from_dict(parsed, orient="index")
After converting the JSON to a DataFrame, take the last column and append it to your original DataFrame.
Split the RecordType column on commas and drop the unnecessary columns:
import pandas as pd
df=pd.read_csv(r"Hansmuff.csv")
# Split the text into four parts; the last part holds the name
df[['1', '2','3','required']]=df['RecordType'].str.split(',', expand=True)
df = df.drop(columns=['RecordType', '1','2','3'])
# Remove the leading space and the closing braces
df['required'] = df['required'].str.strip(' {}')
print(df)
Output:
Id required
0 1234 'XYZ'
1 4321 'ABC'
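The statement in the question indexes position 0 before looking up the name, which is why every row receives the same value. A minimal sketch of a row-wise lookup follows. It assumes each cell of RecordType is already a Python dict with a top-level 'Name' key, as the asker's own statement implies; the sample records are shortened stand-ins, not the real data.

```python
import pandas as pd

# Shortened stand-in records; the real cells would also carry the 'url' entry.
df_case_1 = pd.DataFrame({
    "Id": [1234, 4321],
    "RecordType": [
        {"attributes": {"type": "RecordType"}, "Name": "XYZ"},
        {"attributes": {"type": "RecordType"}, "Name": "ABC"},
    ],
})

# Look up 'Name' in each row's own dict instead of always using row 0.
df_case_final = df_case_1.assign(
    RecordType=df_case_1["RecordType"].map(lambda record: record["Name"])
)

print(df_case_final)
```

If the cells are strings rather than dicts, they would need to be parsed first, which this sketch does not cover.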

How to create a new column conditional on two other columns in python?

I want to create a new column conditional on two other columns in python.
Below is the dataframe:
name    address
apple   hello1234
banana  happy111
apple   str3333
pie     diary5144
I want to create a new column "want", conditional on the columns "name" and "address".
The rules are as follows:
(1) If the value in "name" is apple, the value in "want" should be the first five letters of "address".
(2) If the value in "name" is banana, the value in "want" should be the first four letters of "address".
(3) If the value in "name" is pie, the value in "want" should be the first three letters of "address".
The dataframe I want looks like this:
name    address    want
apple   hello1234  hello
banana  happy111   happ
apple   str3333    str33
pie     diary5144  dia
How can I solve this problem? Thanks!
I hope you are well,
import pandas as pd
# Initialize data of lists.
data = {'Name': ['Apple', 'Banana', 'Apple', 'Pie'],
'Address': ['hello1234', 'happy111', 'str3333', 'diary5144']}
# Create DataFrame
df = pd.DataFrame(data)
# Add an empty column
df['Want'] = ''
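# Use df.loc to write the value; chained df['Want'].iloc[i] = ... may not update df.
# df.loc[i, ...] works here because df has the default 0..n-1 index.
for i in range(len(df)):
    if df['Name'].iloc[i] == "Apple":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:5]
    elif df['Name'].iloc[i] == "Banana":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:4]
    elif df['Name'].iloc[i] == "Pie":
        df.loc[i, 'Want'] = df['Address'].iloc[i][:3]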
# Print the Dataframe
print(df)
I hope it helps,
Have a lovely day
I think a broader way of doing this is by creating a conditional map dict and applying it with lambda functions on your dataset.
Creating the dataset:
import pandas as pd
data = {
'name': ['apple', 'banana', 'apple', 'pie'],
'address': ['hello1234', 'happy111', 'str3333', 'diary5144']
}
df = pd.DataFrame(data)
Defining the conditional dict:
conditionalMap = {
'apple': lambda s: s[:5],
'banana': lambda s: s[:4],
'pie': lambda s: s[:3]
}
Applying the map:
df.loc[:, 'want'] = df.apply(lambda row: conditionalMap[row['name']](row['address']), axis=1)
With the resulting df:
   name    address    want
0  apple   hello1234  hello
1  banana  happy111   happ
2  apple   str3333    str33
3  pie     diary5144  dia
You could do the following:
for string, length in {"apple": 5, "banana": 4, "pie": 3}.items():
    mask = df["name"].eq(string)
    df.loc[mask, "want"] = df.loc[mask, "address"].str[:length]
Iterate over the 3 conditions: string is the string on which the length requirement depends, and the length requirement is stored in length.
Build a mask via df["name"].eq(string) which selects the rows with value string in column name.
Then set column want at those rows to the adequately clipped column address values.
Result for the sample dataframe:
name address want
0 apple hello1234 hello
1 banana happy111 happ
2 apple str3333 str33
3 pie diary5144 dia
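As a variation on the mask loop, the lengths can be looked up once per row and applied in a single pass. This is only a sketch under the assumption that every value in name appears in the mapping; the name prefix_length is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["apple", "banana", "apple", "pie"],
    "address": ["hello1234", "happy111", "str3333", "diary5144"],
})

# Prefix length per name (hypothetical helper dict for this sketch).
# Names missing from this dict are not handled here.
prefix_length = {"apple": 5, "banana": 4, "pie": 3}

# Map each name to its length, then slice each address accordingly.
lengths = df["name"].map(prefix_length)
df["want"] = [address[:n] for address, n in zip(df["address"], lengths)]

print(df)
```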

Extract the country zip code based on the full country code - DataFrame Python

My data contains place information as a full post code, for example CZ25145. I would like to create a new column that holds only the country prefix, here CZ. How can I do this?
I have this:
import pandas as pd
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
},)
I would like to get it like below:
df = pd.DataFrame({
'CODE_LOAD_PLACE' : ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832'],
'COUNTRY_LOAD_PLACE' : ['PL', 'CZ', 'DE', 'DE', 'SK']
},)
I tried using .factorize and .groupby, but without success.
Use .str and select the first 2 characters:
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str[:2]
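Slicing the first 2 characters works when every prefix has exactly two letters. If that cannot be relied on, a hedged alternative is to capture the leading run of letters with str.extract. The pattern below is an assumption about the code format, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'CODE_LOAD_PLACE': ['PL43100', 'CZ25905', 'DE29333', 'DE29384', 'SK92832']
})

# Capture the letters at the start of each code (assumed format: letters, then digits).
# expand=False returns a Series instead of a one-column DataFrame.
df["COUNTRY_LOAD_PLACE"] = df["CODE_LOAD_PLACE"].str.extract(r"^([A-Za-z]+)", expand=False)

print(df)
```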

Extract data from specific format in Pandas DF

I have raw data in CSV format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset. I have 10,000 rows of data. How can I do this using pandas?
Can we do this using regular expressions? Not sure how to do it.
Thank you.
Edit 1:
My data looks something like this:
df = {
'product-name': [
[""'Whole Wheat'""], [""'Milk'""] ],
'brand-name': [
[""'bb Royal'""], [""'XYZ'""] ],
'rating': [
[""'4.1'""], [""'4.0'""] ]
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)
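If I understand correctly, select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or, if the values are strings, remove the square brackets (note the raw string for the regex):
df = df.replace(r'[\[\]]', '', regex=True)
You can also use the explode function: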
df = df.apply(pd.Series.explode)
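If the cells are plain strings such as ["Whole Wheat"], removing only the brackets leaves the quote characters behind. A minimal sketch that strips both follows, assuming every cell is a string in that form; the sample frame and the name cleaned are illustrative only.

```python
import pandas as pd

# Illustrative data, assuming each CSV cell was read as a string like '["Whole Wheat"]'.
df_p = pd.DataFrame({
    "product-name": ['["Whole Wheat"]', '["Milk"]'],
    "brand-name": ['["bb Royal"]', '["XYZ"]'],
    "rating": ['["4.1"]', '["4.0"]'],
})

# Remove square brackets and both kinds of quote characters from every cell.
cleaned = df_p.replace(r'[\[\]"\']', "", regex=True)

# The rating is still text at this point; convert it to a number.
cleaned["rating"] = cleaned["rating"].astype(float)

print(cleaned)
```

The same call applies to all rows at once, so no explicit loop over the 10,000 rows should be needed.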

Replace multiple columns with years for one column

I'm working with worldbank data and I'm trying to create some graphs representing time, but the data I have now looks like this:
As I don't think there is a way to convert it to a datetime, I think the only option is to replace all these year columns with a single column called 'Year' that holds the current column names as values, with the current values moved to a separate column.
Is there a function in Python that does this, or would I have to iterate through the entire dataframe?
Edit to include some code:
df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
'Country Code': ['ABW', 'AFG', 'AGO'],
'1960':[65.66, 32.29, 33.25],
'1961': [66.07, 32.74, 33.57],
'1962': [66.44, 33.18, 33.91],
'1963': [66.79, 33.62, 34.27],
'1964': [66.11, 34.06, 34.65],
'1965': [67.44, 34.49, 35.03]}).set_index('Country Name')
You can transpose the dataframe so that the year values become rows, then store them in a 'Year' column and use it in the plots.
You can try something like this:
import pandas as pd
from matplotlib import pyplot as plt
df1 = pd.DataFrame({'Country Name' : ['Aruba', 'Afghanistan', 'Angola'],
'Country Code' : ['ABW', 'AFG', 'AGO'],
'1960' : [65.66, 32.29, 33.25],
'1961' : [66.07, 32.74, 33.57],
'1962' : [66.44, 33.18, 33.91],
'1963' : [66.79, 33.62, 34.27],
'1964' : [66.11, 34.06, 34.65],
'1965' : [67.44, 34.49, 35.03]})
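# Transpose so that each year becomes a row
df2 = df1.transpose()
df2.columns = df1['Country Name']
# Drop the 'Country Name' and 'Country Code' rows, keeping only the years
df2 = df2.iloc[2:].copy()
df2['Year'] = df2.index.values
# Label each line so that plt.legend() has entries to show
plt.plot(df2['Year'], df2['Aruba'], label='Aruba')
plt.plot(df2['Year'], df2['Afghanistan'], label='Afghanistan')
plt.plot(df2['Year'], df2['Angola'], label='Angola')
plt.legend()
plt.show()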
Output : Plot Output
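The reshaping the question describes, one 'Year' column plus one value column, is what DataFrame.melt does. Here is a minimal sketch using a shortened version of the question's df2 (three of the year columns). The column name 'Value' and the variable long_df are assumptions made for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'Country Name': ['Aruba', 'Afghanistan', 'Angola'],
                    'Country Code': ['ABW', 'AFG', 'AGO'],
                    '1960': [65.66, 32.29, 33.25],
                    '1961': [66.07, 32.74, 33.57],
                    '1962': [66.44, 33.18, 33.91]}).set_index('Country Name')

# Keep the identifying columns and stack every year column into rows.
long_df = df2.reset_index().melt(
    id_vars=['Country Name', 'Country Code'],
    var_name='Year',      # former column names
    value_name='Value',   # hypothetical name for the measurements
)

# The years arrive as strings; convert them to integers for plotting.
long_df['Year'] = long_df['Year'].astype(int)

print(long_df.head())
```

Each country and year pair becomes one row, which is the layout most plotting functions expect for a time axis.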
