How to map values to their correct columns? - python

I have a dataframe where most of the values are mapped to the wrong columns.
Here is my dataframe:
df:
Name  Age  Cust_Id
alex  47   1923894833
I need to re-map every value to its correct column.
df_output:
Name  Age  Phn_No
alex  47   1923894833

For fun, here is a hack way to perform the task (I really wouldn't use this in real life):
df.apply(lambda row:
         pd.Series(sorted(row, key=lambda x: (not str(x).isalpha())*(1+(len(str(x))>2))),
                   index=row.index), axis=1)
How it works:
convert the value to a string
if it is all letters -> sort key 0
else, if its length is > 2 -> sort key 2
else -> sort key 1
use these keys to sort the row's values and generate a new Series
The first field will be the all-letters string, the second the short (at most 2-character) string, the third the longer string.
output:
    Name  Age       Phn_No
0   alex   47   1923894833
1   Ross   23  17293883222
2   mike   34   8738272882
3  stefy   39  19298388392
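To see the sort key in isolation, here is a minimal sketch run on the values of the example row:
row = ['1923894833', 'alex', '47']
key = lambda x: (not str(x).isalpha()) * (1 + (len(str(x)) > 2))
print([key(v) for v in row])  # [2, 0, 1]
print(sorted(row, key=key))   # ['alex', '47', '1923894833']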


Filtering data-frame columns using regex, then using .groupby to calculate sum

I have a dataframe which I want to group, filter columns by regex, and then sum.
My code looks like this:
import pandas as pd
df = pd.DataFrame({'ID': [1,1,2,2,3,3],
                   'Invasive': [12,1,1,0,1,0],
                   'invasive': [1,4,5,3,4,6],
                   'Wild': [4,7,1,0,0,0],
                   'wild': [0,0,9,8,3,2],
                   'Crop': [0,0,0,0,0,0],
                   'Crop_2': [2,3,2,2,1,2]})
df.groupby(['ID']).filter(regex='(Invasive)|(invasive)|(Wild)|(wild)').sum()
The error message I get is:
DataFrameGroupBy.filter() missing 1 required positional argument: 'func'
I get the same error message if groupby comes after filter.
Why does this happen? Where do I input the func argument?
EDIT:
My expected output is one column that sums across the filtered columns, grouped by ID. E.g.:
   ID  Output
0   1      29
1   2      27
2   3      16
What you want to do doesn't make sense: groupby.filter filters rows (it keeps or drops whole groups), and is not to be confused with DataFrame.filter, which selects columns.
You likely want to filter the columns, then aggregate:
df.filter(regex='(?i)(Invasive|Wild)').groupby(df['ID']).sum()
NB. I replaced (Invasive)|(invasive)|(Wild)|(wild) with (?i)(Invasive|Wild), which matches 'Invasive' OR 'Wild' independently of case.
Output:
    Invasive  invasive  Wild  wild
ID
1         13         5    11     0
2          1         8     1    17
3          1        10     0     5
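To make the distinction concrete, a minimal sketch using the question's df:
# groupby.filter keeps or drops whole groups of rows, given a predicate func:
df.groupby('ID').filter(lambda g: g['Wild'].sum() > 0)  # drops the ID=3 rows
# DataFrame.filter selects columns, here with a case-insensitive regex:
df.filter(regex='(?i)(Invasive|Wild)')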
Edit: the output that you show needs a further summation per row:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .groupby(df['ID']).sum()
         .sum(axis=1)
         .reset_index(name='Output')
       )
# or with the summation before the groupby:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .sum(axis=1)
         .groupby(df['ID']).sum()
         .reset_index(name='Output')
       )
Output:
   ID  Output
0   1      29
1   2      27
2   3      16

How to remove strings of Column A from strings of Column B

If you have two columns (A = 'Name', B = 'Name_Age'), is there a quick way to remove 'Name' from 'Name_Age' so that you can quickly get 'Age', like a reversed concatenation?
I've thought about string split, but in some cases (when there is no separator to split on) I really need a method to remove the strings of one column from the strings of another.
#example data below:
import pandas as pd
data = {'Name':['Mark','Matt','Michael'], 'Name_Age':['Mark 14','Matt 29','Michael 18']}
df = pd.DataFrame(data)
You can try using pandas' apply function, which lets you define your own function to be applied to every row of the dataframe:
def age_from_name_age(name, name_age):
    return name_age.replace(name, '').strip()

df['Age'] = df.apply(lambda x: age_from_name_age(x['Name'], x['Name_Age']),
                     axis='columns')
age_from_name_age takes two strings (a name and a name_age) and returns just the age. Then, in the apply statement, I define an anonymous lambda function that takes in a row and passes the correct fields to age_from_name_age.
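If you prefer to avoid a row-wise apply, a plain list comprehension over the two columns does the same thing (a minimal sketch of the same logic):
# strip each name prefix from its Name_Age value and keep the remainder
df['Age'] = [name_age.replace(name, '').strip()
             for name, name_age in zip(df['Name'], df['Name_Age'])]
df['Age'] = df['Age'].astype(int)  # optional: make Age numeric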
Using string slicing:
df['Age'] = df.apply(lambda row: row['Name_Age'][len(row['Name']):], axis=1).astype(int)
You can use str.split() with a space separator to split the values, then rename the columns with new names.
1) Using str.split()
>>> df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR
>>> df = df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
>>> df
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR, by converting the split lists into a new dataframe:
>>> pd.DataFrame(df.Name_Age.str.split().tolist(), columns="Name Age".split())
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
2) Another option using str.partition
>>> df['Name_Age'].str.partition(" ", expand=True).rename(columns={0:'Name', 2:'Age'}).drop(1, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
3) Another option using df.assign with a lambda
Use split() with the default separator as follows, assigning the values back as the new column Age.
>>> df.assign(Age = df.Name_Age.apply(lambda x: x.split()[1]))
Name Name_Age Age
0 Mark Mark 14 14
1 Matt Matt 29 29
2 Michael Michael 18 18
OR
>>> df.Name_Age.apply(lambda x: pd.Series(str(x).split())).rename({0:"Name",1:"Age"}, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18

Counting all string values in a given column of a table and grouping based on a third column

I have three columns. The table looks like this:
ID.  names       tag
1.   john.       1
2.   sam         0
3.   sam,robin.  1
4.   robin.      1
ID: type integer
names: type string
tag: type integer (just 0 and 1)
What I want is to find how many times each name is repeated, grouped by tags 0 and 1. This is to be done in Python.
The answer must look like:
        0   1
John   23  12
Robin  32  10
sam     9  30
Using extractall and crosstab:
s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)
pd.crosstab(s[0], s['tag'])
tag    0  1
0
john   0  1
robin  0  2
sam    1  1
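For reference, this is what the intermediate s looks like when the question's table is rebuilt as a DataFrame (a sketch; the column names are taken from the question):
import pandas as pd
df = pd.DataFrame({'ID.': [1, 2, 3, 4],
                   'names': ['john.', 'sam', 'sam,robin.', 'robin.'],
                   'tag': [1, 0, 1, 1]})
s = df.names.str.extractall(r'(\w+)').reset_index(1, drop=True).join(df.tag)
print(s)
#        0  tag
# 0   john    1
# 1    sam    0
# 2    sam    1
# 2  robin    1
# 3  robin    1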
Because of the nature of your names column, some re-processing needs to be done before you can get value counts. In the case of your example dataframe, this could look something like:
my_counts = (df.set_index(['ID.', 'tag'])
               # get rid of periods and split on commas
               .names.str.strip('.').str.split(',')
               .apply(pd.Series)
               .stack()
               .reset_index([0, 1])
               # rename column 0 for consistency, easier reading
               .rename(columns={0: 'names'})
               # get value counts of names per tag:
               .groupby('tag')['names']
               .value_counts()
               .unstack('tag', fill_value=0))
>>> my_counts
tag    0  1
names
john   0  1
robin  0  2
sam    1  1
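On recent pandas versions (0.25+), Series.explode shortens the reshaping step; a sketch of the same idea:
my_counts = (df.assign(names=df.names.str.strip('.').str.split(','))
               .explode('names')                 # one row per name
               .groupby('tag')['names']
               .value_counts()
               .unstack('tag', fill_value=0))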

Pandas way to keep int portion of a column of numbers in a dataframe

I have a dataframe (df) column of floats:
0 59.9179
1 50.3874
2 50.3874
3 55.0089
4 58.423
5 58.8227
6 55.2471
7 57.2266
8 46.4312
9 59.9097
10 57.1417
Is there a way in pandas to keep the integer portion of the number and discard the decimal, so the resulting column would look like:
0 59
1 50
2 50
3 55
4 58
5 58
6 55
7 57
8 46
9 59
10 57
I can see a way to do this for one number:
>>> s = 59.9179
>>> i, d = divmod(s, 1)
>>> i
59.0
but not for a whole column in one go.
Many thanks
You've got two options:
Casting the column type (or even the whole dataframe):
df[column] = df[column].astype(int)
Or using numpy's floor function (equivalent for positive floats, as in your example):
import numpy as np
df[column] = np.floor(df[column])
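Note that the two differ for negative values: astype(int) truncates toward zero, while np.floor rounds down. A quick sketch:
import numpy as np
import pandas as pd

s = pd.Series([59.9179, -2.5])
print(s.astype(int).tolist())  # [59, -2]     -> truncates toward zero
print(np.floor(s).tolist())    # [59.0, -3.0] -> rounds down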
You can use the apply method with a lambda function (note the assignment back to the column, not to the whole dataframe):
df['second_column'] = df.apply(lambda row: int(row['second_column']), axis=1)
Another method is to use astype:
df['second_column'] = df['second_column'].astype(int)
If your column name is col then do the following:
df[col]=df[col].astype('int')
You can use this:
df[your_column] = df[your_column].astype(int)
Or for the whole df:
df = df.astype(int)
To keep the integer part, use numpy's trunc() function on the column:
import numpy as np
df[col_name] = np.trunc(df[col_name])
Note that trunc keeps the float dtype (59.0 rather than 59); cast with astype(int) if you want integers.

Pandas: assigning values to a dataframe column based on pivot table values using an if statement

I'm using the Titanic Kaggle data as a means to explore Pandas. I'm trying to figure out how to use an if statement inside of .ix[] (or otherwise). I have a pivot table I'm using to get a lookup value into my main dataframe. Here's a chunk of the pivot table (entitled 'data'):
                 Survived       Count        % Female Survived  % Male Survived
Sex                female male  female male
Embarked Pclass
C        1             42   17      43   42              97.67            40.48
         2              7    2       7   10             100.00            20.00
         3             15   10      23   43              65.22            23.26
Now I would like to go through each line in the main dataframe to assign its looked up value. No problem looking up the value hardcoded like:
df['Chance of Survival'] = data.ix['C']['% Female Survived'].get(1)  # -> 97.67
However, when trying to insert the dynamic portion with an if statement, things don't work out so great:
df['Chance of Survival'] = data.ix[df.Embarked][('% Female Survived' if df.Sex == 'female') | ('% Male Survived' if df.Sex=='male')].get(df.Pclass)
So the desired output in my main dataframe would look like this:
PersonId  Embarked  Sex     Pclass  Chance of Survival
1         C         female       1               97.67
2         C         male         2               20.00
3         C         male         3               23.26
Thanks in advance! :)
Got it, but posting in case anyone else has a similar problem. Or better yet, in case anyone has a nicer way of doing it. :)
def getValue(line):
    '''Look up the value in pivot table "data" given the contents of the line passed in from df'''
    value = lambda line: '% Male Survived' if line.Sex == 'male' else '% Female Survived'
    result = data.ix[line.Embarked][value(line)].get(line.Pclass)
    return result

df['Chance of Survival'] = df.apply(getValue, axis=1)
So, for anyone who wants to assign values in a column of one dataframe based on values from another: I used .ix[] to drill down to the value, then .apply() to apply a function across each row (axis=1), reading the line's values just as you would a dataframe's (line.element or line['element']).
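Note that .ix has since been removed from pandas; the same lookup can be written with .loc (a sketch of the equivalent, assuming the pivot data is indexed as shown above):
def get_value(line):
    col = '% Male Survived' if line.Sex == 'male' else '% Female Survived'
    # .loc replaces the removed .ix: select the Embarked level, then the
    # percentage column, then look up the Pclass within that slice
    return data.loc[line.Embarked][col].get(line.Pclass)

df['Chance of Survival'] = df.apply(get_value, axis=1)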
As far as I understand your problem, you want to assign values to an existing dataframe, and you are currently using DataFrame.ix.
The method you probably want is DataFrame.loc, which works like this:
df = pd.DataFrame({'foo': [1, 2, 3, 4], 'bar': [1, 2, 3, 4]})
df
   bar  foo
0    1    1
1    2    2
2    3    3
3    4    4
df.loc[1, 'foo'] = 4  # one .loc call; chained df.loc[1]['foo'] = 4 may not write through
df
   bar  foo
0    1    1
1    2    4
2    3    3
3    4    4
If you want to assign to new columns, you just have to create them first, simply:
df['newcolumn'] = np.nan  # requires import numpy as np
Then you can assign to it with the code above.
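Putting the two steps together (a minimal sketch of the pattern; the column name is just an example):
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4], 'bar': [1, 2, 3, 4]})
df['newcolumn'] = np.nan       # create the column first
df.loc[1, 'newcolumn'] = 4.0   # then assign by row label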
