Compute a row in Pandas operating other columns

Compute a row in Pandas operating other columns - python

I have a dataframe like this-
df = pd.DataFrame{"a":[2.22, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] }
I need to compute another column c by the following operation on column a-
c = len(str(a))-len(str(int(a)))-1
Tried different methods but not able to achieve.

If there is different digits after . is possible use Series.str.len with Series.astype:
df = pd.DataFrame({"a":[2.22, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0 4
1 5
2 6
Name: a, dtype: int64
df['c'] = df.a.astype(str).str.len() - df.a.astype(int).astype(str).str.len() - 1
But because float precision is problematic count values with general data (simulate problem):
df = pd.DataFrame({"a":[2.220000000236, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] })
print (df.a.astype(str).str.len())
0 14
1 5
2 6
Name: a, dtype: int64

This solution creates column C with the desired result.
df['c'] = df['a'].astype(str).str.len() - df['a'].astype(int).astype(str).str.len() - 1

Related

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
'Number Type 2':[np.nan,3,4],
'Info':list('abc')})
The Table shows the initial DataFrame with Number Type 1 and NumberType 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
basically, Numbers are collapsed into the Number columns, and the types extracted into the Type column. The information in the Info column is bound to the numbers (f.e. 2 and 3 have the same information b)
What is the best way to do this in Pandas?

Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract('(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract('(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4

Boolean Comparison across multiple dataframes

I have an issue where I want to compare values across multiple dataframes. Here is a snippet example:
data0 = [[1,'01-01'],[2,'01-02']]
data1 = [[11,'02-30'],[12,'02-25']]
data2 = [[8,'02-30'],[22,'02-25']]
data3 = [[7,'02-30'],[5,'02-25']]
df0 = pd.DataFrame(data0,columns=['Data',"date"])
df1 = pd.DataFrame(data1,columns=['Data',"date"])
df2 = pd.DataFrame(data2,columns=['Data',"date"])
df3 = pd.DataFrame(data3,columns=['Data',"date"])
result=(df0['Data']| df1['Data'])>(df2['Data'] | df3['Data'])
What I would like to do as I hope it can be seen is say if a value in df0 rowX or df1 rowX is greater than df2 rowX or df3 rowX return True else it should be false. In the code above 11 in df1 is greater than both 8 and 7 (df2 and 3 respectively) so the result should be True and then for the second row neither 2 or 12 is greater than 22 (df2) so should be False. However, result gives me
False,False
instead of
True,False
any thoughts or help?

Problem
For your data:
>>> df0['Data']
0 1
1 2
Name: Data, dtype: int64
>>> df1['Data']
0 11
1 12
Name: Data, dtype: int64
your a doing a bitwise or with |:
>>> df0['Data']| df1['Data']
0 11
1 14
Name: Data, dtype: int64
>>> df2['Data']| df3['Data']
0 15
1 23
Name: Data, dtype: int64
Do this for the single numbers:
>>> 1 | 11
11
>>> 2 | 12
14
This is not what you want.
Solution
You can use np.maximum for find the biggest values from each series:
>>> np.maximum(df0['Data'], df1['Data']) > np.maximum(df2['Data'], df3['Data'])
0 True
1 False
Name: Data, dtype: bool

Your existing solution does not work because the | operator performs a bitwise OR operation on the elements.
df0.Data | df1.Data
0 11
1 14
Name: Data, dtype: int64
This results in you comparing values that are different to the values in your dataframe columns. In summary, your approach does not compare values as you'd expect.
You can make this easy by finding -
the max per row of df0 and df1, and
the max per row of df2 and df3
Comparing these two columns to retrieve your result -
i = np.max([df0.Data, df1.Data], axis=0)
j = np.max([df2.Data, df3.Data], axis=0)
i > j
array([ True, False], dtype=bool)
This approach happens to be extremely scalable for any number of dataframes.

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.

In [48]: df.drop(df.columns.difference(['a','b']), 1, inplace=True)
Out[48]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS please be aware that the most idiomatic Pandas way to do that was already proposed by #Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]

Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.

there are multiple solution .
df = df[['a','b']] #1
df = df[list('ab')] #2
df = df.loc[:,df.columns.isin(['a','b'])] #3
df = pd.DataFrame(data=df.eval('a,b').T,columns=['a','b']) #4 PS:I do not recommend this method , but still a way to achieve this

Hey what you are looking for is:
df = df[["a","b"]]
You will recive a dataframe which only contains the columns a and b

If you only want to keep more columns than you're dropping put a "~" before the .isin statement to select every column except the ones you want:
df = df.loc[:, ~df.columns.isin(['a','b'])]

If you have more than two columns that you want to drop, let's say 20 or 30, you can use lists as well. Make sure that you also specify the axis value.
drop_list = ["a","b"]
df = df.drop(df.columns.difference(drop_list), axis=1)

Pandas, selecting by column and row

I want to sum up all values that I select based on some function of column and row.
Another way of putting it is that I want to use a function of the row index and column index to determine if a value should be included in a sum along an axis.
Is there an easy way of doing this?

Columns can be selected using the syntax dataframe[<list of columns>]. The index (row) can be used for filtering using the dataframe.index method.
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.2], 'b': [0.2, 0.1]})
odd_a = df['a'][df.index % 2 == 1]
even_b = df['b'][df.index % 2 == 0]
# odd_a:
# 1 0.2
# Name: a, dtype: float64
# even_b:
# 0 0.2
# Name: b, dtype: float64

If df is your dataframe :
In [477]: df
Out[477]:
A s2 B
0 1 5 5
1 2 3 5
2 4 5 5
You can access the odd rows like this :
In [478]: df.loc[1::2]
Out[478]:
A s2 B
1 2 3 5
and the even ones like this:
In [479]: df.loc[::2]
Out[479]:
A s2 B
0 1 5 5
2 4 5 5
To answer your question, getting even rows and column B would be :
In [480]: df.loc[::2,'B']
Out[480]:
0 5
2 5
Name: B, dtype: int64
and odd rows and column A can be done as:
In [481]: df.loc[1::2,'A']
Out[481]:
1 2
Name: A, dtype: int64

I think this should be fairly general if not the cleanest implementation. This should allow applying separate functions for rows and columns depending on conditions (that I defined here in dictionaries).
import numpy as np
import pandas as pd
ran = np.random.randint(0,10,size=(5,5))
df = pd.DataFrame(ran,columns = ["a","b","c","d","e"])
# A dictionary to define what function is passed
d_col = {"high":["a","c","e"], "low":["b","d"]}
d_row = {"high":[1,2,3], "low":[0,4]}
# Generate list of Pandas boolean Series
i_col = [df[i].apply(lambda x: x>5) if i in d_col["high"] else df[i].apply(lambda x: x<5) for i in df.columns]
# Pass the series as a matrix
df = df[pd.concat(i_col,axis=1)]
# Now do this again for rows
i_row = [df.T[i].apply(lambda x: x>5) if i in d_row["high"] else df.T[i].apply(lambda x: x<5) for i in df.T.columns]
# Return back the DataFrame in original shape
df = df.T[pd.concat(i_row,axis=1)].T
# Perform the final operation such as sum on the returned DataFrame
print(df.sum().sum())

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert the column A from string to integer. In the case of '<2', I'd like to simply take off '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just a example. The actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.

You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]

You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3

Here are two other ways of doing this which may be helpful on the go-forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Outputs
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64

>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Compute a row in Pandas operating other columns - python

I have a dataframe like this- df = pd.DataFrame{"a":[2.22, 3.444, 4.3726],"b":[3.44, 5.96, 7.218] } I need to compute another column c by the following operation on column a- c = len(str(a))-len(str(int(a)))-1 Tried different methods but not able to achieve.

This solution creates column C with the desired result. df['c'] = df['a'].astype(str).str.len() - df['a'].astype(int).astype(str).str.len() - 1

Related

Pandas DataFrames: Extract Information and Collapse Columns

Boolean Comparison across multiple dataframes

How to delete all columns in DataFrame except certain ones?

Pandas, selecting by column and row

Python, pandas: how to remove greater than sign

Categories

Resources