How to extract the number from a string column of a dataframe? - python

I have a dataframe like this:
print(df)
[ ID ... Control
0 PDF-1 ... NaN
1 PDF-3 ... NaN
2 PDF-4 ... NaN
I want only the number from the ID column, so the result would be:
1
3
4
How can I extract part of each string in a dataframe column?

How about just replacing the common PDF- prefix?
df['ID'].str.replace('PDF-', '')

Could you please try the following:
df['ID'].replace(regex=True, to_replace=r'([^\d])', value=r'')
See the documentation for df.replace.
This uses a regex to remove everything except digits from the column named ID: \d denotes a digit, and [^\d] matches any character that is not a digit.
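As a minimal sketch of the same idea through the .str accessor, assuming a sample frame that reproduces only the ID column from the question: \D is the shorthand for [^\d], and the result stays a column of strings until it is cast.

```python
import pandas as pd

# Sample data modelled on the question; only the ID column is reproduced.
df = pd.DataFrame({'ID': ['PDF-1', 'PDF-3', 'PDF-4']})

# \D matches any non-digit character, the same set as [^\d].
digits = df['ID'].str.replace(r'\D', '', regex=True)

# The result is still a column of strings; cast it if integers are needed.
numbers = digits.astype(int)
print(numbers.tolist())
```

Note that this removes every non-digit, so an ID with digits in more than one place would have them joined together.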

Another possibility using a regex is:
df.ID.str.extract(r'(\d+)')
This avoids changing the original data just to extract the integers.
So for the following simple example:
import pandas as pd
df = pd.DataFrame({'ID':['PDF-1','PDF-2','PDF-3','PDF-4','PDF-5']})
print(df.ID.str.extract(r'(\d+)'))
print(df)
we get the following:
0
0 1
1 2
2 3
3 4
4 5
ID
0 PDF-1
1 PDF-2
2 PDF-3
3 PDF-4
4 PDF-5
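If the extracted digits are needed as an integer column next to the original, something like the following may work. The column name num is a hypothetical choice for illustration, and the sketch assumes every ID contains at least one digit (otherwise the cast to int would fail on the missing value).

```python
import pandas as pd

df = pd.DataFrame({'ID': ['PDF-1', 'PDF-2', 'PDF-3', 'PDF-4', 'PDF-5']})

# expand=False returns a Series for a single capture group instead of a DataFrame.
# 'num' is a hypothetical column name; the ID column is left untouched.
df['num'] = df['ID'].str.extract(r'(\d+)', expand=False).astype(int)
print(df)
```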

Find "PDF-" and replace it with nothing:
df['ID'] = df['ID'].str.replace('PDF-', '')
Then, to print it the way you asked, I'd convert the column to a string with no index:
print(df['ID'].to_string(index=False))

Related

How to fill the values of previous cells based on the value of the other cell?

I have a dataframe like the following,
a | count
2020-03-29|
2020-03-30|
2020-03-31|
2020-04-01|
2020-04-02|
2020-04-03|
2020-04-04| 1
2020-04-05|
2020-04-06|
2020-04-07|
2020-04-08|
2020-04-09|
2020-04-10|
2020-04-11| 2
..
..
.. and so on
The df is structured so that a number appears after every 6 blank cells (days). How can I fill the blank values backwards with the number found at every 7th cell (or multiple of 7)?
Final df should look like the following,
a | count
2020-03-29| 1
2020-03-30| 1
2020-03-31| 1
2020-04-01| 1
2020-04-02| 1
2020-04-03| 1
2020-04-04| 1
2020-04-05| 2
2020-04-06| 2
2020-04-07| 2
2020-04-08| 2
2020-04-09| 2
2020-04-10| 2
2020-04-11| 2
..
..
.. and so on
Try with
df['count'] = df['count'].bfill()
Use the method argument of pandas.DataFrame.fillna(). Note that this argument is deprecated since pandas 2.1 in favor of calling bfill() directly.
backfill / bfill: use next valid observation to fill gap
df['count'] = df['count'].fillna(method='bfill')
You need to backfill. See here for additional information on filling missing data: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
In your case, you would do:
df['count'].bfill()
or,
df['count'].fillna(method='bfill')
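The answers above assume the blank cells are already NaN. If they are empty strings instead (an assumption here, since the question does not say), bfill has nothing to fill. A minimal sketch that converts the blanks first, using sample data rebuilt from the question:

```python
import pandas as pd

# Sample data rebuilt from the question; blanks are assumed to be empty strings.
df = pd.DataFrame({
    'a': pd.date_range('2020-03-29', periods=14, freq='D'),
    'count': [''] * 6 + [1] + [''] * 6 + [2],
})

# to_numeric with errors='coerce' turns the empty strings into NaN,
# bfill then pulls each number backwards over the gap before it.
df['count'] = pd.to_numeric(df['count'], errors='coerce').bfill().astype(int)
print(df)
```

The final astype(int) only works when the column ends on a number; trailing blanks would stay NaN after bfill.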

Pandas -- Replace dirty strings with int

I am trying to do some machine learning practice, but the ID column of my dataframe is giving me trouble. I have this:
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
I want this:
0 001002
1 001003
2 001005
3 001006
4 001008
My idea is to use a replace function, ID.replace('[LP]', '', inplace=True), but this doesn't actually change the series. Does anyone know a good way to convert this column?
You can use replace with regex=True:
df
Out[656]:
Val
0 LP001002
1 LP001003
2 LP001005
3 LP001006
4 LP001008
df.Val.replace({'LP':''},regex=True)
Out[657]:
0 001002
1 001003
2 001005
3 001006
4 001008
Name: Val, dtype: object
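A short sketch of why the attempt in the question has no effect, using a standalone Series named ids for illustration: without regex=True, Series.replace compares whole values, so '[LP]' would only match an entry that is exactly that string.

```python
import pandas as pd

ids = pd.Series(['LP001002', 'LP001003', 'LP001005'])

# Without regex=True, only values equal to '[LP]' are replaced,
# so nothing changes here.
unchanged = ids.replace('[LP]', '')

# With regex=True the pattern is applied inside each string;
# '^LP' removes the prefix only at the start.
cleaned = ids.replace('^LP', '', regex=True)
print(cleaned.tolist())
```

The values remain strings, which is what keeps the leading zeros; casting to int would drop them.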
Here's something that will work for the example as given:
import pandas as pd
df = pd.DataFrame({'colname': ['LP001002', 'LP001003']})
# Slice off the 0th and 1st character of the string
df['colname'] = [x[2:] for x in df['colname']]
If this is your index, you can access it through df['my_index'] = df.index and then follow the remaining instructions.
In general, you might consider using something like the label encoder from scikit learn to convert nonnumeric elements to numeric ones.
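As a pandas-only stand-in for the label encoder idea (this is pd.factorize, not scikit-learn's encoder), a sketch might look like this. The code column name and the repeated ID are assumptions added for illustration.

```python
import pandas as pd

# The third row repeats the first ID to show that equal values share a code.
df = pd.DataFrame({'colname': ['LP001002', 'LP001003', 'LP001002']})

# factorize assigns each distinct value an integer code
# in order of first appearance and also returns the distinct values.
codes, uniques = pd.factorize(df['colname'])
df['code'] = codes  # 'code' is a hypothetical column name
print(df)
```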

I am not getting matching string

This is my data, which contains numbers and strings.
df2 = pd.DataFrame({'A': ['1,008$','4,000$','6,000$','10,00$','8,00$','45€','45€']})
df2 = pd.DataFrame(df2, columns = ['A'])
vv=df2[df2['A'].str.match('$')]
I want an output like this.
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
but I am getting this output:
Out[144]:
Empty DataFrame
Columns: [A]
Index: []
Can anyone please help me?
A somewhat verbose way using NumPy's defchararray module.
I always want to give this some attention.
# Using @cᴏʟᴅsᴘᴇᴇᴅ's suggestion
# Same function as below but shorter namespace path
df2[np.char.find(df2.A.values.astype(str), '$') >= 0]
Old Answer
from numpy.core.defchararray import find
df2[find(df2.A.values.astype(str), '$') >= 0]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
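str.match anchors the match at the beginning of the string, and an unescaped $ is a regex anchor for the end of the string rather than a literal dollar sign, so the pattern '$' matches none of these values.
The fix requires either a modification to your pattern or a different function.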
Option 1
str.match with a modified pattern (so \$ is matched at the end) -
df2[df2.A.str.match(r'.*\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
If you want to be specific about what is matched, you can match only on digits and commas -
df2[df2.A.str.match(r'[\d,]+\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
Note that this does not reject malformed entries in your column: any string made up only of digits and commas and terminated by $ is matched, whatever the order of those characters.
Option 2
str.contains
df2[df2.A.str.contains(r'\$$')]
A
0 1,008$
1 4,000$
2 6,000$
3 10,00$
4 8,00$
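When the goal is only to find a literal dollar sign, the regex can be avoided altogether. The sketch below contrasts the pattern from the question with two plain-string alternatives; the variable names are chosen for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'A': ['1,008$', '4,000$', '6,000$', '10,00$', '8,00$', '45€', '45€']})

# An unescaped '$' is a regex anchor, so this is False for every non-empty string.
as_regex = df2['A'].str.match('$')

# Plain-string alternatives that treat '$' as a literal character.
ends_with_dollar = df2['A'].str.endswith('$')
has_dollar = df2['A'].str.contains('$', regex=False)

dollars = df2[ends_with_dollar]
print(dollars)
```

str.endswith is the closer match to the regex options above, since it also requires the $ to be the last character.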

Python filling string column "forward" and groupby attaching groupby result to dataframe

I have a dataframe generated by:
df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3 ], [101,' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with tes t and test1. I believe one approach would be to replace whitespace with np.nan, but that is difficult since the words contain whitespace as well. I could also group by column 0 and then use the first element of each group to fill forward. Could you provide some code for both alternatives? I have not managed to code either.
Additionally, I would like to add a column that contains the group means that is
the final dataframe should look like this
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks. Please let me know if you need further information.
IIUC, you could use str.strip and then check if the stripped string is empty.
Then, perform the groupby operations: fill the NaNs with ffill and calculate the means with groupby.transform, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].ffill()
df[3] = df.groupby(0)[2].transform('mean')
df
Note: If you must fill the NaN values with the first element of each group, then do this:
df.groupby(0)[1].transform(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we first drop all the NaN values; otherwise we would get a TypeError, because the column holds both floats and strings and a float has no len().
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present(empty strings included)
2 test1
3
Name: 1, dtype: object
No reindexing is needed: the function only runs on the indices where strings are present, and assigning back to df[1] aligns on the index, so the remaining rows stay NaN.
Likewise, reset_index(drop=True) is unnecessary, because filling on the groupby object returns a Series that can be assigned straight back to column 1.
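Putting the steps together, a self-contained sketch might look like the following. It swaps the dropna/apply step for Series.mask, which is a different way to reach the same result, and uses the frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2],
                   [101, ' test1', 3], [101, ' ', 4]])

# Strip surrounding whitespace; NaN entries stay NaN under the .str accessor.
stripped = df[1].str.strip()

# Strings that are empty after stripping become NaN; inner spaces are kept.
df[1] = stripped.mask(stripped == '')

# Forward fill within each group of column 0.
df[1] = df.groupby(0)[1].ffill()

# Attach the per-group mean of column 2 as a new column 3.
df[3] = df.groupby(0)[2].transform('mean')
print(df)
```

Because only the ends of each string are stripped, a value such as 'tes t' keeps its inner space, which is why replacing all whitespace with NaN is not needed.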

Python: how to find values in a column of a pandas dataframe separated by semicolon?

I have a column of a dataframe df like the following:
df.a=
0 2
1 2;4
2 4;2
3 2;4
4 4;2
5 1
I want to find all the rows that contain the value 4.
I am looking for a command like
df[df.a==4]
I would use str.contains (as long as the column is a series of strings):
a = df.loc[df['a'].str.contains('4')]
this returns:
a
1 2;4
2 4;2
3 2;4
4 4;2
EDIT: In the general case, you should use a regular expression with word boundaries so that only a standalone 4 matches (and not, say, 14 or 40):
a = df.loc[df['a'].str.contains(r'\b4\b')]
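An alternative that avoids regular expressions is to split each entry on the semicolon and test for an exact token. In this sketch the last row ('14;2') is an assumption added to show that partial matches are excluded.

```python
import pandas as pd

# Data from the question plus one extra row ('14;2') added for illustration.
df = pd.DataFrame({'a': ['2', '2;4', '4;2', '2;4', '4;2', '1', '14;2']})

# Split each entry into its tokens and keep rows where one token is exactly '4'.
has_four = df['a'].str.split(';').apply(lambda parts: '4' in parts)
result = df[has_four]
print(result)
```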
