I have a set of data like:
0 1
0 type 1 type 2
1 type 3 type 4
How can I transform it to:
0 1
0 1 2
1 3 4
I'd prefer to use the apply or transform function.
>>> df.apply(lambda x: x.str.replace('type ','').astype(int))
0 1
0 1 2
1 3 4
remove the .astype(int) if you don't need to convert to int
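For reference, a minimal sketch that reproduces the example above (assuming the cells are plain strings and the columns are labelled 0 and 1):

import pandas as pd

df = pd.DataFrame([['type 1', 'type 2'],
                   ['type 3', 'type 4']])
df = df.apply(lambda x: x.str.replace('type ', '').astype(int))
print(df)
#    0  1
# 0  1  2
# 1  3  4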
You can use DataFrame.replace:
print (df.replace({'type ': ''}, regex=True))
0 1
0 1 2
1 3 4
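If you also need integers here rather than strings, a sketch chaining the conversion onto the same call:

print (df.replace({'type ': ''}, regex=True).astype(int))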
Option 1
df.stack().str.replace('type ', '').unstack()
Option 2
df.stack().str.split().str[-1].unstack()
Option 3
# pandas version 0.18.1+
df.stack().str.extract(r'(\d+)', expand=False).unstack()
# pandas version 0.18.0 or prior
df.stack().str.extract(r'(\d+)').unstack()
Timing
Conclusion: jezreal's is best; no loops and no stacking.
Setup code for a 20,000-by-200 DataFrame:
df_ = df.copy()
df = pd.concat([df_ for _ in range(10000)], ignore_index=True)
df = pd.concat([df for _ in range(100)], axis=1, ignore_index=True)
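A rough sketch of how the approaches could be timed on that frame with timeit (the function names below are just labels for the options above, not from the original benchmark):

import timeit

def replace_df(d):
    return d.replace({'type ': ''}, regex=True)

def stack_replace(d):
    return d.stack().str.replace('type ', '').unstack()

def stack_split(d):
    return d.stack().str.split().str[-1].unstack()

for func in (replace_df, stack_replace, stack_split):
    # run each approach a few times on the 20,000 x 200 frame
    print(func.__name__, timeit.timeit(lambda: func(df), number=3))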
You can use applymap and a regex (import re):
df = df.applymap(lambda x: re.search(r'(\d+)', x).group(1))
If you want the digits as integers:
df = df.applymap(lambda x: int(re.search(r'(\d+)', x).group(1)))
This will work even if you have other text instead of type, but only with integers (e.g. 'type 1.2' would give 1, not 1.2), so you will have to adapt it for decimals.
Also note that this code is bound to fail if no number is found at all (e.g. just 'type'). You may want to create a function that can handle these errors instead of the lambda:
def extract_digit(x):
    try:
        return int(re.search(r'(\d+)', x).group(1))
    except (ValueError, AttributeError):
        # return the existing value unchanged
        return x

df = df.applymap(extract_digit)
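For instance, on a made-up frame where one cell has no digits at all, the bad cell is passed through unchanged (df_bad is hypothetical):

import pandas as pd

df_bad = pd.DataFrame([['type 1', 'type'], ['type 3', 'type 4']])
print(df_bad.applymap(extract_digit))
#    0     1
# 0  1  type
# 1  3     4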
New to python, so bear with me!
I have a function that applies random values to different columns in my df (depending on the column name; each column has specific random value range). For one specific function, I want to apply it to just half of my column. I attempted working with just my even rows to deal with this, but it seems it's not working. When I print out the first few rows, I can tell the column did not update. Would appreciate any help!
Here's what I've attempted:
for col in df.columns:
    if 'shirt' in col:
        df[col] = df[col].apply(lambda x: np.random.randint(0,5))
    elif 'pants' in col:
        df[col] = df[col].apply(lambda x: np.random.randint(0,10))
    elif 'sweater' in col:
        df[col] = df[col].apply(lambda x: np.random.randint(0,50) if df.index.all() % 2 == 0 else np.NaN)
    else:
        pass
(I had to add .all() because I got "ValueError: The truth value of an array with more than one element is ambiguous".)
Using the index to apply it to half of my dataframe is what I can think of, but this wouldn't work if I wanted to apply that lambda function to just 60% or 80% of my column and leave the rest as nulls.
This is what I get as a result of the code above:
shirt_count pant_count sweater_count
0 14 3 5
1 18 3 7
2 1 3 5
3 7 1 9
4 2 3 2
Would love any help as I've been staring at my screen for hours!
You can apply separate functions to the even and odd rows of the 'sweater' column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'shirt': [0,0,0,0,0], 'pants': [0,0,0,0,0], 'sweater': [0,0,0,0,0]})
for col in df.columns:
    if 'shirt' in col:
        df[col] = df[col].apply(lambda x: np.random.randint(0,5))
    elif 'pants' in col:
        df[col] = df[col].apply(lambda x: np.random.randint(0,10))
    elif 'sweater' in col:
        # random integers for the even rows (use .loc to avoid chained assignment)
        df.loc[::2, col] = df.loc[::2, col].apply(lambda x: np.random.randint(0,50))
        # np.NaN for the odd rows
        df.loc[1::2, col] = np.NaN
    else:
        pass
Output:
>>> df
shirt pants sweater
0 4 9 0.0
1 0 1 NaN
2 2 7 28.0
3 2 7 NaN
4 0 8 30.0
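If you instead want to fill an arbitrary fraction of the column (say 60%) and leave the rest as nulls, one possible sketch is to sample the index first (the 0.6 is just an example value):

# pick a random 60% of the rows to receive a value
idx = df.sample(frac=0.6).index
df['sweater'] = np.nan                                   # start with all nulls
df.loc[idx, 'sweater'] = np.random.randint(0, 50, size=len(idx))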
I have a situation where sometimes, when I read a CSV into a DataFrame, I get an unwanted index-like column named Unnamed: 0.
file.csv
,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
The CSV is read with this:
pd.read_csv('file.csv')
Unnamed: 0 A B C
0 0 1 2 3
1 1 4 5 6
2 2 7 8 9
This is very annoying! Does anyone have an idea on how to get rid of this?
It's the index column; pass index=False to df.to_csv(...) so that the unnamed index column is not written out in the first place. See the to_csv() docs.
Example:
In [37]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
pd.read_csv(io.StringIO(df.to_csv()))
Out[37]:
Unnamed: 0 a b c
0 0 0.109066 -1.112704 -0.545209
1 1 0.447114 1.525341 0.317252
2 2 0.507495 0.137863 0.886283
3 3 1.452867 1.888363 1.168101
4 4 0.901371 -0.704805 0.088335
compare with:
In [38]:
pd.read_csv(io.StringIO(df.to_csv(index=False)))
Out[38]:
a b c
0 0.109066 -1.112704 -0.545209
1 0.447114 1.525341 0.317252
2 0.507495 0.137863 0.886283
3 1.452867 1.888363 1.168101
4 0.901371 -0.704805 0.088335
You could also optionally tell read_csv that the first column is the index column by passing index_col=0:
In [40]:
pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
Out[40]:
a b c
0 0.109066 -1.112704 -0.545209
1 0.447114 1.525341 0.317252
2 0.507495 0.137863 0.886283
3 1.452867 1.888363 1.168101
4 0.901371 -0.704805 0.088335
This is usually caused by your CSV having been saved along with an (unnamed) index (RangeIndex).
(The fix would actually need to be done when saving the DataFrame, but this isn't always an option.)
Workaround: read_csv with index_col=[0] argument
IMO, the simplest solution would be to read the unnamed column as the index. Specify an index_col=[0] argument to pd.read_csv, this reads in the first column as the index. (Note the square brackets).
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
# Save DataFrame to CSV.
df.to_csv('file.csv')
pd.read_csv('file.csv')
Unnamed: 0 a b c
0 0 x x x
1 1 x x x
2 2 x x x
3 3 x x x
4 4 x x x
# Now try this again, with the extra argument.
pd.read_csv('file.csv', index_col=[0])
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
Note
You could have avoided this in the first place by using index=False when writing the CSV, if the output CSV was created in pandas and your DataFrame does not have a meaningful index to begin with:
df.to_csv('file.csv', index=False)
But as mentioned above, this isn't always an option.
Stopgap Solution: Filtering with str.match
If you cannot modify the code to read/write the CSV file, you can just remove the column by filtering with str.match:
df
Unnamed: 0 a b c
0 0 x x x
1 1 x x x
2 2 x x x
3 3 x x x
4 4 x x x
df.columns
# Index(['Unnamed: 0', 'a', 'b', 'c'], dtype='object')
df.columns.str.match('Unnamed')
# array([ True, False, False, False])
df.loc[:, ~df.columns.str.match('Unnamed')]
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
To get rid of all Unnamed columns, you can also use a regex, such as df.drop(df.filter(regex="Unnamed"), axis=1, inplace=True)
Another case where this might happen is if your data was improperly written to the CSV so that each row ends with a comma. This leaves you with an unnamed column Unnamed: x at the end when you read the data back into a DataFrame.
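For example, a file whose rows all end with a trailing comma (hypothetical contents below) produces an extra Unnamed column at the end:

import io
import pandas as pd

data = "A,B,C,\n1,2,3,\n4,5,6,\n"
print(pd.read_csv(io.StringIO(data)))
#    A  B  C  Unnamed: 3
# 0  1  2  3         NaN
# 1  4  5  6         NaN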
You can do either of the following with 'Unnamed' Columns:
Delete unnamed columns
Rename them (if you want to use them)
Method 1: Delete Unnamed Columns
# delete one by one, e.g. the column 'Unnamed: 0', by using its name
df.drop('Unnamed: 0', axis=1, inplace=True)
# delete all Unnamed columns in a single line of code using a regex
df.drop(df.filter(regex="Unnamed"),axis=1, inplace=True)
Method 2: Rename Unnamed Columns
df.rename(columns = {'Unnamed: 0':'Name'}, inplace = True)
If you want to write out with a blank header as in the input file, just choose 'Name' above to be ''.
where the OP's input data 'file.csv' was:
,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
#read file
df = pd.read_csv('file.csv')
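For example, a sketch of the blank-header round trip mentioned above, continuing from the read above:

df = df.rename(columns={'Unnamed: 0': ''})
df.to_csv('file.csv', index=False)   # the header row comes out as ",A,B,C", like the input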
Simply delete that column using: del df['column_name']
Simply do this:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
Alternatively:
df = df.drop(columns=['Unnamed: 0'])
from IPython.display import display
import pandas as pd
import io

# read the file, using the first (unnamed) column as the index
df = pd.read_csv('file.csv', index_col=[0])
# round-trip through an in-memory CSV without the index to drop it completely
df = pd.read_csv(io.StringIO(df.to_csv(index=False)))
display(df.head(5))
A solution that is agnostic to whether the index has been written or not when utilizing df.to_csv() is shown below:
df = pd.read_csv(file_name)
if 'Unnamed: 0' in df.columns:
    df.drop('Unnamed: 0', axis=1, inplace=True)
If an index was not written, then index_col=[0] will utilize the first column as the index which is behavior that one would not want.
In my experience, there are many reasons you might not want to set that column as the index with index_col=[0], as many people suggest above. For example, it might contain jumbled index values, because the data were saved to CSV after being indexed or sorted without df.reset_index(drop=True), which leads to instant confusion.
So if you know the file has this column and you don't want it, as per the original question, the simplest 1-line solutions are:
df = pd.read_csv('file.csv').drop(columns=['Unnamed: 0'])
or
df = pd.read_csv('file.csv',index_col=[0]).reset_index(drop=True)
I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.
df = pd.DataFrame(
    [['a',],['a',1],['a',1],['a',2],['b',2],['b',2],['c',3]],
    columns=['letter','number']
)
letters numbers
0 a 1
1 a 1
2 a 1
3 a 2
4 b 2
5 b 2
6 c 3
I need a third column that is 1 if 'a' and 2 are both present in the row, and 0 otherwise. So it would be [0, 0, 0, 1, 0, 0, 0].
How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
You can use apply with axis=1. Suppose you wanted to call your new column c:
df['c'] = df.apply(
    lambda row: (row['letter'] == 'a') and (row['number'] == 2),
    axis=1
).astype(int)
print(df)
# letter number c
#0 a NaN 0
#1 a 1.0 0
#2 a 1.0 0
#3 a 2.0 1
#4 b 2.0 0
#5 b 2.0 0
#6 c 3.0 0
But apply is slow and should be avoided if possible. In this case, it would be much better to use boolean logic operations, which are vectorized.
df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)
This has the same result as using apply above.
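If you want to verify the speed difference on your own data, a rough sketch with timeit (the absolute numbers will depend on your machine and frame size):

import timeit

t_apply = timeit.timeit(
    lambda: df.apply(lambda row: (row['letter'] == 'a') and (row['number'] == 2), axis=1).astype(int),
    number=100,
)
t_vectorized = timeit.timeit(
    lambda: ((df['letter'] == 'a') & (df['number'] == 2)).astype(int),
    number=100,
)
print(t_apply, t_vectorized)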
You can use pd.Series.where()/np.where(). If you are only interested in the int representation of the boolean values, you can pick the other solution. If you want more freedom over the if/else values, you can use np.where():
import pandas as pd
import numpy as np
# create example
values = ['a', 'b', 'c']
df = pd.DataFrame()
df['letter'] = np.random.choice(values, size=10)
df['number'] = np.random.randint(1,3, size=10)
# condition
df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
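The prose above mentions pd.Series.where() but only demonstrates np.where; a rough Series-side equivalent (a sketch) would be:

cond = (df['letter'] == 'a') & (df['number'] == 2)
# keep 1 where the condition holds, fall back to 0 elsewhere
df['result'] = pd.Series(1, index=df.index).where(cond, 0)
# equivalently: df['result'] = cond.astype(int)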
I have a dataframe, and I want to change only those values of a column where another column fulfills a certain condition. I'm trying to do this with iloc at the moment, and it either does not work or I get that annoying warning:
A value is trying to be set on a copy of a slice from a DataFrame
Example:
import pandas as pd
DF = pd.DataFrame({'A':[1,1,2,1,2,2,1,2,1],'B':['a','a','b','c','x','t','i','x','b']})
Doing one of those
DF['B'].iloc[:][DF['A'] == 1] = 'X'
DF.iloc[:]['B'][DF['A'] == 1] = 'Y'
works, but leads to the warning above.
This one also gives a warning, but does not work:
DF.iloc[:][DF['A'] == 1]['B'] = 'Z'
I'm really confused about how to do boolean indexing using loc, iloc, and ix right, that is, how to provide row index, column index, AND boolean index in the right order and with the correct syntax.
Can someone clear this up for me?
You are chaining your selectors, which leads to the warning. Consolidate the selection into one.
Use loc instead
DF.loc[DF['A'] == 1, 'B'] = 'X'
DF
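With the example DataFrame from the question, this produces:
A B
0 1 X
1 1 X
2 2 b
3 1 X
4 2 x
5 2 t
6 1 X
7 2 x
8 1 X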
Use ix (note that .ix is deprecated in newer pandas versions; prefer .loc there):
import pandas as pd
DF = pd.DataFrame({'A':[1,1,2,1,2,2,1,2,1],'B':['a','a','b','c','x','t','i','x','b']})
DF.ix[DF['A'] == 1, 'B'] = 'X'
print (DF)
A B
0 1 X
1 1 X
2 2 b
3 1 X
4 2 x
5 2 t
6 1 X
7 2 x
8 1 X
Another solution with mask:
DF.B = DF.B.mask(DF['A'] == 1, 'X')
print (DF)
A B
0 1 X
1 1 X
2 2 b
3 1 X
4 2 x
5 2 t
6 1 X
7 2 x
8 1 X
Nice article about SettingWithCopy by Tom Augspurger.
Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Outputs
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3