Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
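Since the question mentions hundreds of thousands of rows, a fully vectorized variant may be faster than a Python-level apply. This is only a sketch (not benchmarked here) that starts from the original string column and applies the same "subtract 1 when the value starts with '<'" rule:
mask = df['A'].str.startswith('<')            # rows that carry the '<' prefix
values = df['A'].str.lstrip('<').astype(int)  # drop the prefix and convert
df['A'] = values.where(~mask, values - 1)     # subtract 1 only where '<' was present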
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
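Note that the cells are still strings after this; if integer dtype is needed (as the question implies), astype can be chained on:
# strip the '<' and then cast the whole frame to integers
df = df.applymap(lambda x: x.replace('<', '')).astype(int)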
Here are two other ways of doing this which may be helpful going forward.
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
df.A.str.strip('<').astype(int)
Outputs
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
I have a dataframe, df, and a list of strings, cols_needed, which indicate the columns I want to retain in df. The column names in df do not exactly match the strings in cols_needed, so I cannot directly use something like intersection. But the column names do contain the strings in cols_needed. I tried playing around with str.contains but couldn't get it to work. How can I subset df based on cols_needed?
import pandas as pd
df = pd.DataFrame({
    'sim-prod1': [1, 2],
    'sim-prod2': [3, 4],
    'sim-prod3': [5, 6],
    'sim_prod4': [7, 8]
})
cols_needed = ['prod1', 'prod2']
# What I want to obtain:
sim-prod1 sim-prod2
0 1 3
1 2 4
With the regex option of DataFrame.filter:
df.filter(regex='|'.join(cols_needed))
sim-prod1 sim-prod2
0 1 3
1 2 4
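One small caveat beyond the original answer: if the strings in cols_needed could ever contain regex metacharacters, escaping them first keeps the pattern literal:
import re
# escape each needed substring so characters like '.' or '+' are treated literally
df.filter(regex='|'.join(map(re.escape, cols_needed)))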
You can use str.contains with a joined pattern, for example:
df.loc[:,df.columns.str.contains('|'.join(cols_needed))]
Output:
sim-prod1 sim-prod2
0 1 3
1 2 4
A list comprehension could work as well:
columns = [cols for cols in df
           for col in cols_needed
           if col in cols]
['sim-prod1', 'sim-prod2']
In [110]: df.loc[:, columns]
Out[110]:
sim-prod1 sim-prod2
0 1 3
1 2 4
I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of the string and pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
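The result is float64 because NaN forces a float column. If integer values with missing entries are preferred, the nullable Int64 dtype can be chained on (a sketch, assuming a pandas version with nullable integer support); missing codes then show up as <NA>:
# convert the float result to pandas' nullable integer dtype
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce').astype('Int64')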
If you want to get the value as a string of digits, you can use str.extract or str.findall as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
import re
import pandas as pd
def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None
original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a Series/DataFrame with get_trailing_digits as the function, e.g.
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)
I have a set of data like:
0 1
0 type 1 type 2
1 type 3 type 4
How can I transform it to:
0 1
0 1 2
1 3 4
I'd prefer to use the apply or transform function.
>>> df.apply(lambda x: x.str.replace('type ','').astype(int))
0 1
0 1 2
1 3 4
remove the .astype(int) if you don't need to convert to int
You can use DataFrame.replace:
print (df.replace({'type ': ''}, regex=True))
0 1
0 1 2
1 3 4
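If integer dtype is wanted here as well, astype can be chained on (assuming every cell matches the 'type N' pattern):
print(df.replace({'type ': ''}, regex=True).astype(int))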
Option 1
df.stack().str.replace('type ', '').unstack()
Option 2
df.stack().str.split().str[-1].unstack()
Option 3
# pandas version 0.18.1 +
df.stack().str.extract(r'(\d+)', expand=False).unstack()
# pandas version 0.18.0 or prior
df.stack().str.extract(r'(\d+)').unstack()
Timing
Conclusion: jezrael's answer is best; no loops and no stacking.
Code
20,000 rows by 200 columns
df_ = df.copy()
df = pd.concat([df_ for _ in range(10000)], ignore_index=True)
df = pd.concat([df for _ in range(100)], axis=1, ignore_index=True)
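The candidate approaches can then be timed against this larger frame, for example with IPython's %timeit (only a sketch of the setup; results are not reproduced here):
%timeit df.replace({'type ': ''}, regex=True)
%timeit df.stack().str.replace('type ', '').unstack()
%timeit df.apply(lambda x: x.str.replace('type ', ''))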
You can use applymap and a regex (import re):
df = df.applymap(lambda x: re.search(r'(\d+)', x).group(1))
If you want the digits as integers:
df = df.applymap(lambda x: int(re.search(r'(\d+)', x).group(1)))
This will work even if you have other text instead of 'type', but only with whole numbers (e.g. 'type 1.2' will give a wrong result), so you may have to adapt it.
Also note that this code will fail if no number is found at all (e.g. 'type'). You may want to create a function that handles these errors instead of the lambda:
def extract_digit(x):
    try:
        return int(re.search(r'(\d+)', x).group(1))
    except (ValueError, AttributeError):
        # return the existing value unchanged
        return x
df = df.applymap(lambda x: extract_digit(x))
I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
an assign for a totally new column works
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But if I want to build the new column from an existing column, it seems like pandas puts the string representation of the whole existing column into every row of the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
Under the hood, the Series.__add__ method is defined so that adding a string to a Series of strings produces a new Series in which the string is concatenated with each value:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b
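If str.format itself is preferred over concatenation, it can also be applied element-wise with Series.map (only a sketch; the floats keep their full repr unless a format spec such as {:.6f} is used):
# map the bound str.format method over each float in column 0
df = df.assign(c=df[0].map('a{}b'.format))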
So I got a pandas DataFrame with a single column and a lot of data.
I need to access each of the elements, not to change them (with apply()) but to parse them into another function.
When looping through the DataFrame it always stops after the first one.
If I convert it to a list before, then my numbers are all in braces (eg. [12] instead of 12) thus breaking my code.
Does anyone see what I am doing wrong?
import pandas as pd
def go_trough_list(df):
    for number in df:
        print(number)
df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
1
0 2
1 3
2 4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
    print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
Edit: Here is my solution, thanks to jezrael and Nick!
First I added header=None because my data has no header.
Then I changed my function to:
def go_through_list(df):
    new_list = df[0].apply(my_function, parameter=par1)
    return new_list
And it works perfectly! Thank you again guys, problem solved.
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
    print(row['column'])
However, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, it is more correct to use a pandas Series.
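If iterating row by row really is needed, itertuples is usually noticeably faster than iterrows (a sketch; 'column' is the same placeholder column name used above):
for row in df.itertuples(index=False):  # namedtuples are cheaper than the Series built by iterrows
    print(row.column)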
What do you mean by parse it into another function? Perhaps take the value, do something to it, and put the result into another column?
I need to access each of the elements, not to change them (with apply()) but to parse them into another function.
Perhaps this example will help:
import pandas as pd
df = pd.DataFrame([20, 21, 12])
def square(x):
    return x**2

df['new_col'] = df[0].apply(square)  # can use a lambda here nicely
You can convert the column to a list with Series.tolist:
for x in df['Colname'].tolist():
    print(x)
Sample:
import pandas as pd
df = pd.DataFrame({'a': pd.Series([1, 2, 3]),
                   'b': pd.Series([4, 5, 6])})
print(df)
a b
0 1 4
1 2 5
2 3 6
for x in df['a'].tolist():
    print(x)
1
2
3
If you have only one column, use iloc to select the first column:
for x in df.iloc[:,0].tolist():
    print(x)
Sample:
import pandas as pd
df = pd.DataFrame({1: pd.Series([2, 3, 4])})
print(df)
1
0 2
1 3
2 4
for x in df.iloc[:,0].tolist():
    print(x)
2
3
4
This can work too, but it is not a recommended approach, because 1 can be a number or a string and it can raise a KeyError:
for x in df[1].tolist():
    print(x)
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
    print(df.loc[ix]['myColumn'])