How to get pandas.Series.str.get_dummies() to report NaN?

I have data in a file. It is CSV-like, but multiple values per field are possible. I use get_dummies() to generate an overview of my column: what is in there and how often, just like a histogram with nominal data. I want to see the missing (NaN) values, but my code hides them.
I am using: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html
I can't use: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html (its dummy_na parameter would solve the problem).
Reason: I need the sep parameter.
To illustrate the difference.
import pandas
data = pandas.read_csv("testdata.csv", sep=";")
data["a"].str.get_dummies(",").sum()  # no NaN values
pandas.get_dummies(data["a"], dummy_na=True).sum()  # not separated
Data:
a;b
Test,Tes;
;a
Tes;a
T;b
I would expect:
T 1
Tes 2
Test 1
NaN 1
But the output is:
T 1
Tes 2
Test 1
dtype: int64
or
T 1
Tes 1
Test,Tes 1
NaN 1
dtype: int64
Happy to also use another function! Maybe the .str part is the problem. I have not quite figured out what that does.

First replace the missing values with Series.fillna, then rename the placeholder back to NaN in the resulting index with rename:
import numpy as np

print(data["a"].fillna('Missing').str.get_dummies(",").sum().rename({'Missing': np.nan}))
NaN 1
T 1
Tes 2
Test 1
dtype: int64
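For reference, a complete, runnable sketch of this approach, assuming the sample data above is saved as testdata.csv ('Missing' is an arbitrary placeholder and must not collide with a real category):
import numpy as np
import pandas as pd

data = pd.read_csv("testdata.csv", sep=";")
counts = (
    data["a"]
    .fillna("Missing")             # temporary placeholder for NaN
    .str.get_dummies(",")          # one indicator column per value
    .sum()                         # count occurrences of each value
    .rename({"Missing": np.nan})   # restore NaN in the index
)
print(counts)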

Related

How to get some string of dataframe column?

I have a dataframe like this.
print(df)
      ID  ... Control
0  PDF-1  ...     NaN
1  PDF-3  ...     NaN
2  PDF-4  ...     NaN
I want to get only the number part of the ID column, so the result will be:
1
3
4
How can I get just that part of the strings in this dataframe column?
How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')
Could you please try the following.
df['ID'].replace(regex=True,to_replace=r'([^\d])',value=r'')
One could refer to the documentation for df.replace.
Basically, this uses a regex to remove everything apart from digits in the column named ID: \d denotes digits, and [^\d] matches everything apart from digits.
Another possibility using regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
   0
0  1
1  2
2  3
3  4
4  5
      ID
0  PDF-1
1  PDF-2
2  PDF-3
3  PDF-4
4  PDF-5
Find "PDF-" ,and replace it with nothing
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index:
print(df['ID'].to_string(index=False))
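If you also want actual integers rather than zero-padded strings (an assumption; the question only shows the digits), a minimal sketch combining the extraction with a cast:
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})
# extract the run of digits, then cast the resulting strings to integers
numbers = df['ID'].str.extract(r'(\d+)', expand=False).astype(int)
print(numbers)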

Calculating mean value of item in several columns in pandas

I have a dataframe with values spread over several columns. I want to calculate the mean value of all items from specific columns.
All the solutions I looked up end up giving me either the separate means of each column or the mean of the means of the selected columns.
E.g. my Dataframe looks like this:
Name a b c d
Alice 1 2 3 4
Alice 2 4 2
Alice 3 2
Alice 1 5 2
Ben 3 3 1 3
Ben 4 1 2 3
Ben 1 2 2
And I want the mean of the values in columns b & c for all the "Alice" rows:
When I try:
df[df["Name"]=="Alice"][["b","c"]].mean()
The result is:
b 2.00
c 4.00
dtype: float64
In another post I found a suggestion to try a "double" mean, once for each axis, e.g.:
df[df["Name"]=="Alice"][["b","c"]].mean(axis=1).mean()
But the result was then:
3.00
which is the mean of the means of both columns.
I am expecting a way to calculate:
(2 + 3 + 4 + 5) / 4 = 3.50
Is there a way to do this in Python?
You can use numpy's np.nanmean here: this will simply see your section of the dataframe as an array and calculate the mean over the entire section by default:
>>> np.nanmean(df.loc[df['Name'] == 'Alice', ['b', 'c']])
3.5
Or if you want to group by name, you can first stack the dataframe, like:
>>> df[['Name','b','c']].set_index('Name').stack().reset_index().groupby('Name').agg('mean')
                0
Name
Alice    3.500000
Ben      1.833333
You can groupby to sum all values and count their respective sizes, then divide to get the mean.
This way you get the result for all Names at once.
g = df.groupby('Name')[['b', 'c']]
g.sum().sum(1)/g.count().sum(1)
Name
Alice 3.500000
Ben 1.833333
dtype: float64
PS: In your example, it looks like you have empty strings in some cells. That's not advisable, since your columns will end up with dtype object. Try to have NaNs instead, to take full advantage of vectorized operations.
Assuming all your columns are numeric and the empty cells are NaN, a simple set_index, stack, and direct mean works:
df.set_index('Name')[['b','c']].stack().mean(level=0)
Out[117]:
Name
Alice 3.500000
Ben 1.833333
dtype: float64
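For reference, a self-contained sketch of the sum/count approach. The frame below is constructed to mirror the example; the exact NaN positions are assumptions, chosen only so the means match the outputs above:
import numpy as np
import pandas as pd

# NaN placement is assumed; only the resulting means come from the answers above
df = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Alice', 'Ben', 'Ben', 'Ben'],
    'b':    [2,       4,       5,       3,     1,     2],
    'c':    [3,       np.nan,  np.nan,  1,     2,     2],
})

g = df.groupby('Name')[['b', 'c']]
# total of all b and c values per name, divided by the number of non-NaN cells
print(g.sum().sum(1) / g.count().sum(1))  # Alice 3.5, Ben 1.833333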

Pandas -- Replace dirty strings with int

I am trying to do some machine learning practice, but the ID column of my dataframe is giving me trouble. I have this:
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
I want this:
0 001002
1 001003
2 001005
3 001006
4 001008
My idea is to use a replace function, ID.replace('[LP]', '', inplace=True), but this doesn't actually change the series. Anyone know a good way to convert this column?
You can use replace:
df
Out[656]:
        Val
0  LP001002
1  LP001003
2  LP001005
3  LP001006
4  LP001008
df.Val.replace({'LP':''},regex=True)
Out[657]:
0 001002
1 001003
2 001005
3 001006
4 001008
Name: Val, dtype: object
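If you only want to strip the prefix when it appears at the start of a value (an assumption; the sample values always begin with LP), a minimal variant using str.replace with an anchored regex:
# only remove 'LP' at the beginning of each value
df['Val'] = df['Val'].str.replace(r'^LP', '', regex=True)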
Here's something that will work for the example as given:
import pandas as pd
df = pd.DataFrame({'colname': ['LP001002', 'LP001003']})
# Slice off the 0th and 1st character of the string
df['colname'] = [x[2:] for x in df['colname']]
If this is your index, you can access it through df['my_index'] = df.index and then follow the remaining instructions.
In general, you might consider using something like the LabelEncoder from scikit-learn to convert nonnumeric elements to numeric ones.
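A minimal sketch of that idea, assuming scikit-learn is installed. Note the encoded values are arbitrary integer labels, not the digits embedded in the IDs:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'colname': ['LP001002', 'LP001003', 'LP001005']})
le = LabelEncoder()
# maps each distinct string to an integer label (0, 1, 2, ...)
df['colname_encoded'] = le.fit_transform(df['colname'])
print(df)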

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... possibly up to atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
First use cumcount to number the values within each group; these numbers become the column labels for pivot. Then add any missing columns with reindex_axis and change the column names with add_prefix. Last, reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
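Note that reindex_axis has since been removed from pandas; a sketch of the same idea with the current API, applied to the original long-format frame (the hard-coded range(1, 8) assumes at most seven codes per visit, as above; the helper column name pos is arbitrary):
g = df.groupby('visit_id').cumcount() + 1
out = (df.assign(pos=g)                      # position of each code within its visit
         .pivot(index='visit_id', columns='pos', values='atc_code')
         .reindex(columns=range(1, 8))       # force columns 1..7 even if absent
         .add_prefix('atc_')
         .reset_index())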

is there a method to skip unconvertible rows when casting a pandas series from str to float?

I have a pandas dataframe created from a CSV file. One column of this dataframe contains numeric data that is initially cast as a string. Most entries are numeric-like, but some contain various error codes that are non-numeric. I do not know beforehand what all the error codes might be or how many there are. So, for instance, the dataframe might look like:
[In 1]: df
[Out 1]:
           data OtherAttr
MyIndex
0           1.4       aaa
1        error1       foo
2           2.2       bar
3           0.8       bar
4           xxx       bbb
...
743733  BadData       ccc
743734      7.1       foo
I want to cast df.data as a float and throw out any values that don't convert properly. Is there a built-in functionality for this? Something like:
df.data = df.data.astype(float, skipbad = True)
(Although I know that specifically will not work and I don't see any kwargs within astype that do what I want)
I guess I could write a function using try and then use pandas apply or map, but that seems like an inelegant solution. This must be a fairly common problem, right?
Use the convert_objects method which "attempts to infer better dtype for object columns":
In [11]: df['data'].convert_objects(convert_numeric=True)
Out[11]:
0 1.4
1 NaN
2 2.2
3 0.8
4 NaN
Name: data, dtype: float64
In fact, you can apply this to the entire DataFrame:
In [12]: df.convert_objects(convert_numeric=True)
Out[12]:
           data OtherAttr
MyIndex
0           1.4       aaa
1           NaN       foo
2           2.2       bar
3           0.8       bar
4           NaN       bbb
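For what it's worth, convert_objects was deprecated and later removed; in current pandas versions the same effect comes from pd.to_numeric with errors='coerce', which turns unconvertible entries into NaN:
import pandas as pd

df = pd.DataFrame({'data': ['1.4', 'error1', '2.2', '0.8', 'xxx'],
                   'OtherAttr': ['aaa', 'foo', 'bar', 'bar', 'bbb']})
# unparseable strings become NaN instead of raising an error
df['data'] = pd.to_numeric(df['data'], errors='coerce')
print(df)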
