Remove stringified list of unicode strings in pandas column - python

I have a column in a df that looks like this:
pd.DataFrame(["[u'one_element']", "[u'two_elememts', u'two_elements']", "[u'three_elements', u'three_elements', u'three_elements']"])
0
0 [u'one_element']
1 [u'two_elememts', u'two_elements']
2 [u'three_elements', u'three_elements', u'three_elements']
Those elements are strings:
type(df[0].iloc[2]) == str
The end result should look like:
0
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
I tried with:
df[column] = df[column].map(lambda x: x.lstrip('[u').rstrip(']').replace("u'","").replace("'",""))
But obviously this is slow when you have many rows.
Is there a better way to do it? The df has many columns of different types: strings, integers, floats.
Thanks!

You can use regex and strip i.e
df[0] = df[0].str.strip("[]").str.replace("u'|'",'')
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object

Using the ast module.
import pandas as pd
import ast
df = pd.DataFrame(["[u'one_element']", "[u'two_elememts', u'two_elements']", "[u'three_elements', u'three_elements', u'three_elements']"])
print(df[0].apply(lambda x: ", ".join(ast.literal_eval(x))))
Output:
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object

You don't need map, you can use the str attribute for pandas Series:
(df[0].str.lstrip('[u')
.str.rstrip(']')
.str.replace("u'","")
.str.replace("'","")))
achieves the same result but does not use map
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object

Related

How to replace the binary column (1,0) to (0,1) based on value_counts in python pandas dataframe?

In a Pandas dataframe column I get a value_counts() as
Data['Gender'].value_counts()
Out[98]:
1 169325
0 21289
Name: Gender, dtype: int64
I am not sure how to swap 1 to 0 and 0 to 1, if value_counts() of 1 is > 0, say from the output 1 is 169325 which is greater than 0 which is 21289, if its reverse no swap?
I think using the value_counts is bit of over kill and we just using mean here
Data['Gender'].mean()>0.5
If True
Then we do replace or map
Data['Gender'] = Data['Gender'].map({0:1,1:0})#Data['Gender'].replace({0:1,1:0})

How to convert a string in float with a space - pandas

When I import an Excel file, some numbers in a column are float and some are not. How can I convert all to float? The space in 3 000,00 is causing me problems.
df['column']:
column
0 3 000,00
1 156.00
2 0
I am trying:
df['column'] = df['column'].str.replace(' ','')
but it's not working. I would do after .astype(float), but cannot get there.
Any solutions? 1 is already a float, but 0 is a string.
Just cast them all as a string first:
df['column'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
Example:
>>> df = pd.DataFrame({'column':['3 000,00', 156.00, 0]})
>>> df['column2'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
>>> df
column column2
0 3 000,00 3000.0
1 156 156.0
2 0 0.0
import re
df['column'] = df['column'].apply(lambda x: re.sub("[^0-9.]", "", str(x).replace(',','.'))).astype(float)

pandas DataFrame assign with format

I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
an assign for a totally new column works
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But, if I want to use an existing column into a new column it seems like pandas is adding the whole existing frame into the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
Under the hood, the Series.__add__ method is defined in such a way so that addition of the Series containing strings with a string results in a new Series with the string concatenated with the values in the Series:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b

Iterating over each element in pandas DataFrame

So I got a pandas DataFrame with a single column and a lot of data.
I need to access each of the element, not to change it (with apply()) but to parse it into another function.
When looping through the DataFrame it always stops after the first one.
If I convert it to a list before, then my numbers are all in braces (eg. [12] instead of 12) thus breaking my code.
Does anyone see what I am doing wrong?
import pandas as pd
def go_trough_list(df):
for number in df:
print(number)
df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
1
0 2
1 3
2 4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
edit: Here is my Solution thanks to jezrael and Nick!
First I added headings=None because my data has no header.
Then I changed my function to:
def go_through_list(df)
new_list = df[0].apply(my_function,parameter=par1)
return new_list
And it works perfectly! Thank you again guys, problem solved.
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
print(row['column'])
however, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, it is more correct to use a Pandas Series.
What do you mean by parse it into another function? Perhaps take the value, and do something to it and create it into another column?
I need to access each of the element, not to change it (with apply()) but to parse it into another function.
Perhaps this example will help:
import pandas as pd
df = pd.DataFrame([20, 21, 12])
def square(x):
return x**2
df['new_col'] = df[0].apply(square) # can use a lambda here nicely
You can convert column as Series tolist:
for x in df['Colname'].tolist():
print x
Sample:
import pandas as pd
df = pd.DataFrame({'a': pd.Series( [1, 2, 3]),
'b': pd.Series( [4, 5, 6])})
print df
a b
0 1 4
1 2 5
2 3 6
for x in df['a'].tolist():
print x
1
2
3
If you have only one column, use iloc for selecting first column:
for x in df.iloc[:,0].tolist():
print x
Sample:
import pandas as pd
df = pd.DataFrame({1: pd.Series( [2, 3, 4])})
print df
1
0 2
1 3
2 4
for x in df.iloc[:,0].tolist():
print x
2
3
4
This can work too, but it is not recommended approach, because 1 can be number or string and it can raise Key error:
for x in df[1].tolist():
print x
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
print(df.loc[ix]['myColumn'])

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert the column A from string to integer. In the case of '<2', I'd like to simply take off '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just a example. The actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this which may be helpful on the go-forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Outputs
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3

Categories

Resources