Consider the following pandas DataFrame:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#
1329T2
832932  France          30
31728#
I would like to make the following modifications to each row:
If the ID contains a '#' character, the row is left unchanged.
If the ID contains no '#' and country is NaN, country is set to "Other" and other is set to 0.
Finally, only if the money column is NaN and the other column has a value, money and money_add are assigned from the following table, matching other against other_ID:
other_ID  money  money_add
19        4532   723823
50        1213   238232
18        1813   273283
30        1313   83293
0         8932   3920
Example of the resulting table:
ID      country  money  other  money_add
832932  France   12131  19     82932
217#8#
1329T2  Other    8932   0      3920
832932  France   1313   30     83293
31728#
First set both columns with a list where both conditions match, then index by other for the rows without '#' and use DataFrame.update with overwrite=False, so that only missing values in the matched rows are filled:
m1 = df['ID'].str.contains('#')
m2 = df['country'].isna()
df.loc[~m1 & m2, ['country','other']] = ['Other',0]
df1 = df1.set_index(df1['other_ID'])
df = df.set_index(df['other'].mask(m1))
df.update(df1, overwrite=False)
df = df.reset_index(drop=True)
print (df)
       ID country    money  other  money_add
0  832932  France    12131   19.0    82932.0
1  217#8#     NaN      NaN    NaN        NaN
2  1329T2   Other   8932.0    0.0     3920.0
3  832932  France   1313.0   30.0    83293.0
4  31728#     NaN      NaN    NaN        NaN
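For comparison, the same three rules can also be written with boolean masks and Series.map instead of DataFrame.update. This is only a sketch under assumptions: the sample frames are rebuilt by hand from the tables in the question, and the names lookup, has_hash, no_country and fill are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (blank cells taken to be NaN).
df = pd.DataFrame({
    "ID": ["832932", "217#8#", "1329T2", "832932", "31728#"],
    "country": ["France", np.nan, np.nan, "France", np.nan],
    "money": [12131, np.nan, np.nan, np.nan, np.nan],
    "other": [19, np.nan, np.nan, 30, np.nan],
    "money_add": [82932, np.nan, np.nan, np.nan, np.nan],
})

# Lookup table keyed by other_ID (hypothetical name `lookup`).
lookup = pd.DataFrame({
    "other_ID": [19, 50, 18, 30, 0],
    "money": [4532, 1213, 1813, 1313, 8932],
    "money_add": [723823, 238232, 273283, 83293, 3920],
}).set_index("other_ID")

has_hash = df["ID"].str.contains("#")
no_country = df["country"].isna()

# Rule 2: no '#' in ID and missing country.
df.loc[~has_hash & no_country, "country"] = "Other"
df.loc[~has_hash & no_country, "other"] = 0

# Rule 3: no '#', money missing, other present -> look both values up.
fill = ~has_hash & df["money"].isna() & df["other"].notna()
keys = df.loc[fill, "other"].astype(int)
for col in ["money", "money_add"]:
    df.loc[fill, col] = keys.map(lookup[col])

print(df)
```

Rows whose ID contains '#' are excluded by every mask, so they stay untouched, and rows that already have a money value are never overwritten.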
Related
I'm working on an equivalent of R's spread in pandas. My dataframe looks like this:
Name age Language year Period
Nik 18 English 2018 Beginer
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginer
and I'm looking for output like this:
Name  age  Language  Beginer  Intermediate  Advanced
Nik   18   English   2018
John  19   French             2019
Kane  33   Russian                          2017
John  44   Thai      2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I'm getting only the columns Beginer, Intermediate and Advanced, not the entire dataframe.
When I tried reshaping with an index, pandas complained about duplicates in the index.
So I created a new index column, but I'm still not getting the entire dataframe.
If I understood correctly you could do something like this:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
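The dummy-column idea can also be expressed without assigning into res.values, by multiplying the indicator columns by year row-wise. The following is a sketch rather than the answer's own code: the sample frame is rebuilt from the question (with Period spelled "Beginner", as in the output above), and the variable name dummies is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Name": ["Nik", "John", "Kane", "xi"],
    "age": [18, 19, 33, 44],
    "Language": ["English", "French", "Russian", "Thai"],
    "year": [2018, 2019, 2017, 2015],
    "Period": ["Beginner", "Intermediate", "Advanced", "Beginner"],
})

# One 0/1 indicator column per Period value.
dummies = pd.get_dummies(df["Period"]).astype(np.int64)

# Multiply each row of indicators by that row's year (axis=0 aligns on the row index).
res = dummies.mul(df["year"], axis=0)

# Attach the spread columns to the remaining original columns.
output = pd.concat([df.drop(columns=["year", "Period"]), res], axis=1)
print(output)
```

Because nothing is written into the underlying NumPy array, this does not depend on whether .values returns a view or a copy.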
One way to get the equivalent of R's spread function is pd.pivot_table.
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginer Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Grab in a list the columns to be used in pivot table (i.e. Period and year)
Grab all the other columns in your dataframe in a list (using not in)
Use the index_cols as index in the pd.pivot_table() command
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
The new_df will include all the columns of your initial dataframe.
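If every combination of the index columns occurs only once, DataFrame.pivot with a list of index columns gives the same shape without an aggregation function; unlike pivot_table, it raises an error on duplicate entries instead of summing them. A minimal sketch, assuming a pandas version where pivot accepts a list for index (1.1 or later) and using sample data rebuilt from the question; the name wide is hypothetical:

```python
import pandas as pd

# Sample data rebuilt from the question (Period spelled as in the question).
df = pd.DataFrame({
    "Name": ["Nik", "John", "Kane", "xi"],
    "age": [18, 19, 33, 44],
    "Language": ["English", "French", "Russian", "Thai"],
    "year": [2018, 2019, 2017, 2015],
    "Period": ["Beginer", "Intermediate", "Advanced", "Beginer"],
})

non_index_cols = ["Period", "year"]
index_cols = [c for c in df.columns if c not in non_index_cols]

# Every column except Period/year identifies a row; Period values become columns.
wide = df.pivot(index=index_cols, columns="Period", values="year").reset_index()
wide.columns.name = None  # drop the leftover "Period" label on the column axis
print(wide)
```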
Let's say that I have this dataframe with four columns : "Name", "Value", "Ccy" and "Group" :
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it to a CSV file. I would like to group the duplicates in the column "Name" together with the associated values in "Value" and "Ccy". The data in "Value" and "Ccy" should be stored in the row (index) defined by the column "Group", so that the data does not get mixed up.
If a name is in group 0, it is general data, so I would like all rows to be filled with the same value for that name.
So I would like to get this result :
   ID_Value  Country_Value  IBAN_Value  Dan_age  Dan_age_Ccy  Dan_city_Value  Dan_city_Ccy  Dan_sex_Value
1  TAMARA    GER            FR56        18       EUR          Berlin          EUR           M
2  TAMARA    GER            FR56        22                                                  M
3  TAMARA    GER            FR56                              Madrid          DKN
I cannot figure out how to do the first part. With the code below, I do not get what I want, even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me?
Thank you.
You can use the following. See comments in code for each step:
s = df.loc[df['Group'] == '0', 'Name'].tolist() # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True) #this preserves order before pivoting
df = df.pivot(index='Group', columns='Name') #transforms long-to-wide per expected output
for col in df.columns:
    if col[1] in s:
        df[col] = df[col].shift().ffill() #Condition 2
df = df.iloc[1:].replace('',np.nan).dropna(axis=1, how='all').fillna('') #dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()] #column name cleanup
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
You can do this quite easily with only 3 steps:
Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general] * len(result), index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result above does not have the exact column order you want, so you have to specify that manually with:
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]
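The split, pivot and recombine steps can be put together in one self-contained sketch. It is a variation, not the answer's exact code: instead of building a second frame and merging, the group-0 values are broadcast with DataFrame.assign, which avoids hard-coding the number of groups. The names wide and general_cols are assumptions.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["ID", "Country", "IBAN", "Dan_Age", "Dan_city", "Dan_country",
             "Dan_sex", "Dan_Age", "Dan_country", "Dan_sex", "Dan_city", "Dan_country"],
    "Value": ["TAMARA_CO", "GERMANY", "FR56", "18", "Berlin", "GER",
              "M", "22", "FRA", "M", "Madrid", "ESP"],
    "Ccy": ["", "", "", "EUR", "EUR", "USD", "USD", "", "CHF", "", "DKN", ""],
    "Group": ["0", "0", "0", "1", "1", "1", "1", "2", "2", "2", "3", "3"],
})

# Step 1: split general data (group "0") from group-specific data.
general = df[df["Group"] == "0"].set_index("Name")["Value"]
main_df = df[df["Group"] != "0"]

# Step 2: one row per group, one column per (Value/Ccy, Name) pair.
wide = main_df.pivot(index="Group", columns="Name", values=["Value", "Ccy"]).fillna("")
wide.columns = [f"{name}_{kind}" for kind, name in wide.columns]

# Step 3: broadcast each general value to every group row.
general_cols = {f"{name}_Value": value for name, value in general.items()}
wide = wide.assign(**general_cols)
print(wide)
```

The general columns end up on the right-hand side here, so a final column selection is still needed if a specific order matters.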
I am new to pandas and python.
I am trying to group items by one column and list the information from the data frame per group.
My dataframe:
B C D E F
1 Honda USA 2000 Washington New
2 Honda USA 2001 Salt Lake Used
3 Ford Canada 2005 Washington New
4 Toyota USA 2010 Ney York Used
5 Honda USA 2001 Salt Lake Used
6 Honda Canada 2011 Salt Lake Crashed
7 Ford Italy 2014 Rome New
I am trying to group my dataframe by column B and count how often each value of columns C, D, E and F occurs within each group. For example, column B contains Honda 4 times. For that group I want to list USA(3), Canada(1), 2000(1), 2001(2), 2011(1), Washington(1), Salt Lake(3), New(1), Used(2), Crashed(1), and do the same for every group (car make) in column B:
   Car       Country    Year     City           Condition
1  Honda(4)  USA(3)     2000(1)  Washington(1)  New(1)
             Canada(1)  2001(2)  Salt Lake(3)   Used(2)
                        2011(1)                 Crashed(1)
2  Ford(2)   Canada(1)  2005(1)  Washington(1)  New(2)
             Italy(1)   2014(1)  Rome(1)
...
What I've tried so far:
df.groupby(['B'])
Which gives me back <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d559080>
At this point, I am not sure how to proceed to get the desired results after grouping by column B.
Thank you for your suggestions.
Use a custom function that processes each column separately with Series.value_counts and then joins each value to its count; apply it per group with a lambda:
def f(x):
    x = x.value_counts()
    y = x.index.astype(str) + '(' + x.astype(str) + ')'
    return y.reset_index(drop=True)
df1 = df.groupby(['B']).apply(lambda x: x.apply(f)).reset_index(drop=True)
print (df1)
B C D E F
0 Ford(2) Italy(1) 2014(1) Washington(1) New(2)
1 NaN Canada(1) 2005(1) Rome(1) NaN
2 Honda(4) USA(3) 2001(2) Salt Lake(3) Used(2)
3 NaN Canada(1) 2011(1) Washington(1) Crashed(1)
4 NaN NaN 2000(1) NaN New(1)
5 Toyota(1) USA(1) 2010(1) Ney York(1) Used(1)
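The same per-group counting can be sketched with an explicit loop over the groups, which sidesteps version-dependent behaviour of groupby(...).apply regarding the grouping column. The sample frame is rebuilt from the question, and the names counts_as_text, pieces and summary are made up for this illustration.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "B": ["Honda", "Honda", "Ford", "Toyota", "Honda", "Honda", "Ford"],
    "C": ["USA", "USA", "Canada", "USA", "USA", "Canada", "Italy"],
    "D": [2000, 2001, 2005, 2010, 2001, 2011, 2014],
    "E": ["Washington", "Salt Lake", "Washington", "Ney York",
          "Salt Lake", "Salt Lake", "Rome"],
    "F": ["New", "Used", "New", "Used", "Used", "Crashed", "New"],
})

def counts_as_text(col):
    """Turn one column into strings like 'USA(3)', most frequent first."""
    vc = col.value_counts()
    return pd.Series([f"{value}({count})" for value, count in vc.items()], dtype=object)

# Each group yields a small frame; shorter columns are padded with NaN.
pieces = [group.apply(counts_as_text) for _, group in df.groupby("B")]
summary = pd.concat(pieces, ignore_index=True)
print(summary)
```

Each group contributes as many rows as its most varied column has distinct values, so the groups appear as stacked blocks in the result.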
I have a dataframe which looks as follows (some other columns have been dropped):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with existing value of shipping country for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what the most efficient way to do this on a large dataset is. Perhaps using a vectorized groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
    return x.ffill().bfill()
res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
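A self-contained sketch of this idea follows. It assumes the blank cells are empty strings that first have to be turned into NaN (ffill and bfill only act on missing values), and it uses transform so that the result keeps the original row index and can be assigned straight back. The variable name country is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; blanks are assumed to be empty strings.
df = pd.DataFrame({
    "memberID": [264991, 264991, 100, 5000, 5000],
    "shipping_country": ["", "Canada", "USA", "", "UK"],
})

# Empty strings are not treated as missing, so convert them to NaN first.
country = df["shipping_country"].replace("", np.nan)

# Fill forward, then backward, separately within each memberID group.
df["shipping_country"] = (
    country.groupby(df["memberID"]).transform(lambda s: s.ffill().bfill())
)
print(df)
```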
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you. If a memberID group has only empty strings ('') in shipping_country, those are retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first:
df['shipping_country'] = df['shipping_country'].replace('', np.nan)
df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill().groupby(df['memberID']).bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'] = df['shipping_country'].replace('', np.nan)
df['shipping_country'] = df.groupby('memberID')['shipping_country'].ffill().groupby(df['memberID']).bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
Hi I am trying to assign certain values in columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
My tail of dataframe looks like follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
The name of my dataframe is full and I want to change names of Title.
Here is the code I wrote:
# Create a variable rare_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Use loc-based indexing and set the matching row values:
miss = ['Mlle', 'Ms']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'
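The original attempts fail because full[mask] creates a temporary copy, so assigning to its Title attribute never reaches full. As an alternative to several loc assignments, all renames can go into one dictionary passed to Series.replace. This is a sketch on a small made-up frame; only the rare_title list is taken from the question, and the name mapping is an assumption.

```python
import pandas as pd

# Tiny stand-in for the `full` dataframe (made-up rows, Title column only).
full = pd.DataFrame({"Title": ["Mr", "Mlle", "Ms", "Mme", "Dona", "Master", "Dr"]})

# List taken from the question.
rare_title = ["Dona", "Lady", "the Countess", "Capt", "Col", "Don",
              "Dr", "Major", "Rev", "Sir", "Jonkheer"]

# One mapping for every rename; titles not listed are left unchanged.
mapping = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
mapping.update({title: "Rare Title" for title in rare_title})

# Assign the result back to the column so the change is stored in `full`.
full["Title"] = full["Title"].replace(mapping)
print(full)
```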