Python Pandas find value in dataframe regardless of column

Is there a simple way to check for a value within a DataFrame when it could be in any of several columns? For example, either using iterrows to search each row for the value and find which column it is in, or checking the DataFrame as a whole and getting its position as coordinates, like iat uses.

import pandas as pd
d = {'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8,3,9]}
df = pd.DataFrame(data=d)
df = df.set_index('id')
df
Sample Data
    col2  col3
id
1      3     8
2      4     3
3      5     9
Find 3
df.isin([3]).any()
Output (per column):
col2    True
col3    True
dtype: bool
Want more details? Here you go:
df[df.isin([3])].stack().index.tolist()
Co-ordinates output:
[(1, 'col2'), (2, 'col3')]
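If you need this lookup more than once, here is a minimal sketch of a reusable helper built on numpy.where (the find_value name is just an illustration, not a pandas API):
import numpy as np
import pandas as pd

def find_value(df, value):
    # Boolean mask of matching cells, then map positions back to labels.
    rows, cols = np.where(df.eq(value).to_numpy())
    return list(zip(df.index[rows], df.columns[cols]))

d = {'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8, 3, 9]}
df = pd.DataFrame(data=d).set_index('id')
print(find_value(df, 3))  # [(1, 'col2'), (2, 'col3')]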

You can search for the value in the DataFrame and get a Boolean DataFrame for your search. It gives you all the rows in which var1 appears:
df[df.eq(var1).any(axis=1)]
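A self-contained sketch with the sample data from the question, assuming var1 = 3:
import pandas as pd

d = {'id': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [8, 3, 9]}
df = pd.DataFrame(data=d).set_index('id')

var1 = 3
# Keep only the rows in which any cell equals var1.
print(df[df.eq(var1).any(axis=1)])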

Related

Pandas: Sort by sum of 2 columns

I have a DataFrame:
COL1  COL2
   1     1
   3     1
   1     3
I need to sort by COL1 + COL2.
The key=lambda col: f(col) argument of sort_values(...) lets you sort by a transformed column, but in the case described I need to sort on the basis of two columns. It would be nice if there were a way to pass a key function over two or more columns, but I don't know whether one exists.
So, how can I sort the rows by the sum COL1 + COL2?
Thank you for your time!
Assuming a unique index, you can also conveniently use the key parameter of sort_values to pass a callable to apply to the by column. Here we can add the other column:
df.sort_values(by='COL1', key=df['COL2'].add)
We can even generalize to any number of columns using sort_index:
df.sort_index(key=df.sum(axis=1).get)
Output:
   COL1  COL2
0     1     1
2     1     3
1     3     2
Used input:
data = {"COL1": [1, 3, 1], "COL2": [1, 2, 3]}
df = pd.DataFrame(data)
This does the trick:
data = {"Column 1": [1, 3, 1], "Column 2": [1, 2, 3]}
df = pd.DataFrame(data)
sorted_indices = (df["Column 1"] + df["Column 2"]).sort_values().index
df.loc[sorted_indices, :]
I just created a Series holding the sum of both columns, sorted it, took the sorted indices, and used them to reindex the DataFrame.
(I changed the data a little so you can see the sorting in action; with the data you provided, the sorted output would have been identical to the original.)
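For completeness, a hedged alternative that neither answer uses: add a temporary sum column, sort by it, and drop it again (the _sum column name is only an illustration):
import pandas as pd

df = pd.DataFrame({"COL1": [1, 3, 1], "COL2": [1, 2, 3]})

# Add a throwaway column holding the row sums, sort on it, then remove it.
df_sorted = (df.assign(_sum=df["COL1"] + df["COL2"])
               .sort_values("_sum")
               .drop(columns="_sum"))
print(df_sorted)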

Opposite of factorize function (Map numeric to categorical values)

I am searching for a way to map some numeric columns to categorical features.
All the columns are categorical in nature but are represented as integers. However, I need them to be strings.
e.g.
col1 col2 col3  ->  col1new col2new col3new
   0    1    1  ->      "0"     "1"     "1"
   2    2    3  ->      "2"     "2"     "3"
   1    3    2  ->      "1"     "3"     "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a numpy representation of my data right now, but any pandas solution would also be helpful.
Thanks a lot!
You can use the applymap method. Consider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
  col1 col2 col3
0    0    1    1
1    2    2    3
2    1    3    2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)
You can modify the type of the elements in a column by using the DataFrame.apply function provided by pandas:
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.random.randint(0, 90, size=(5000000, 3)), columns=['col1', 'col2', 'col3'])
In the new dataframe you can define the columns and set their values like this:
updated_frame = pd.DataFrame(np.random.randint(0, 90, size=(5000000, 3)), columns=['col1new', 'col2new', 'col3new'])
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)
You could use the .astype method. If you want to replace all the current columns with string versions, you could do (where df is your dataframe):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})
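A quick sketch to confirm that both routes give string (object) columns, reusing the example frame from above:
import pandas as pd

df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})

# In-place replacement: every column becomes dtype object (Python str).
df_str = df.astype(str)

# Keep the originals and add string copies under "<col>new" names.
df_both = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})

print(df_str.dtypes)   # all object
print(df_both.dtypes)  # original int64 columns plus object copies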

Python - count number of elements that are equal between two columns of two dataframes

I have two dataframes, df1 and df2,
containing the columns col1 and col2 respectively. I would like to count the number of elements in column col1 of df1 that are equal to values in col2 of df2. How can I do that?
You can use Series.isin: df1.col1.isin(df2.col2).sum():
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})
nb_common_elements = df1.col1.isin(df2.col2).sum()
assert nb_common_elements == 3
Be cautious depending on your use case because:
df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df1.col1.isin(df2.col2).sum()
would return 4 and not 2, because every 1 in df1.col1 is present in df2.col2. If that's not the expected behaviour, you can drop duplicates from df1.col1 before testing the intersection size:
df1.col1.drop_duplicates().isin(df2.col2).sum()
Which in this example would return 2.
To better understand why this happens, you can have a look at what .isin returns:
df1['isin df2.col2'] = df1.col1.isin(df2.col2)
Which gives:
   col1  isin df2.col2
0     1           True
1     1           True
2     1           True
3     2          False
4     7           True
Now .sum() adds up the booleans in the isin df2.col2 column (a total of 4 True values).
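If only distinct common values should count, plain Python sets give the same answer as the drop_duplicates approach; a small self-contained sketch:
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})

# Sets keep only distinct values, so the duplicated 1s are not over-counted.
common_values = set(df1['col1']) & set(df2['col2'])
print(len(common_values))  # 2, i.e. {1, 7}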
I assume you're using pandas.
One way is to simply use pd.merge on the two columns and return the length of the result. Since the columns have different names, merge with left_on/right_on:
len(pd.merge(df1, df2, left_on="col1", right_on="col2"))
Pandas does an inner merge by default, so only matching rows are kept.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
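A runnable sketch of this merge approach with the frames from the first answer (note the duplicate caveat applies here too, since an inner merge multiplies matching rows):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})

# The inner merge keeps only the rows whose col1 value also appears in col2.
merged = pd.merge(df1, df2, left_on='col1', right_on='col2')
print(len(merged))  # 3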

Pandas DataFrame filter

My question is about the pandas.DataFrame.filter method. It seems that pandas writes any changes to a copy of the data frame. How can I write to the data frame itself?
In other words:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0]=10
Output:
   col1  col2
0     1     3
1     2     4
Desired output:
   col1  col2
0    10     3
1     2     4
I think you need to extract the column names and then use the loc or iloc functions:
cols = df.filter(regex='col1').columns
df.loc[0, cols]=10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print(df)
   col1  col2
0    10     3
1     2     4
You cannot use the filter function here, because it returns a subset Series/DataFrame whose data may be a view. That's why a SettingWithCopyWarning is possible there (or an error is raised if you set the option).
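A small sketch contrasting the two behaviours on the question's data: writing through filter changes only a copy, while writing through loc changes df itself:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

# Assignment via filter() hits a fresh copy; df stays unchanged.
df.filter(regex='col1').iloc[0] = 10
print(df.loc[0, 'col1'])  # still 1

# Assignment via loc with the extracted column names writes into df.
cols = df.filter(regex='col1').columns
df.loc[0, cols] = 10
print(df.loc[0, 'col1'])  # 10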

How to convert an object of type Pandas to a pandas.DataFrame?

I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
The output is:
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object to a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and also from_dict; they throw errors.
Interestingly, it will not convert to a DataFrame directly, but to a Series. Once it is converted to a Series, use the Series' to_frame method to convert it to a DataFrame:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to keep the column names, use the _asdict() method like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:
         0
Index    a
col1     1
col2   0.1
         0
Index    b
col1     2
col2   0.2
To create a new DataFrame from the itertuples namedtuples, you can use list() or Series too:
import pandas as pd

# source DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x', 'y'], data=None)

for r in df.itertuples():
    # create a new DataFrame from itertuples() via list() ([1:] skips the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c', 'd'])
    # or create a new DataFrame from itertuples() via Series (drop(0) removes the index, T transposes the column to a row):
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or insert the row into an existing DataFrame via .loc ([1:] skips the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]

print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
   c  d
0  2  4

df_new_fromSeries:
   1  2
0  2  4

df_new_fromAppend:
   x  y
0  1  3
1  2  4
To omit the index, use the parameter index=False (but I mostly need the index for the iteration):
for r in df.itertuples(index=False):
    # the [1:] isn't needed then, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:
  Index  col1  col2
0     a     1   0.1
  Index  col1  col2
0     b     2   0.2
Sadly, AFAIU, there's no simple way to keep the column names without explicitly using "protected attributes" such as _fields.
With some tweaks to @Igor's answer,
I ended up with this satisfactory code, which preserves the column names and uses as little pandas code as possible.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above

# Get the list of column names
column_names = df.columns.values.tolist()

filtered_rows = []
for row in df.itertuples(index=False):
    # Some code logic to filter rows
    filtered_rows.append(row)

# Convert the pandas.core.frame.Pandas rows back into a pandas.DataFrame:
# combine the filtered rows into a single dataframe
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1,col2
1,0.1
2,0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
      animal  top_speed
0    cheetah     120.00
1      human      44.72
2  dragonfly      54.00
Since Pandas does not recommend building DataFrames by adding single rows in a for loop, we will iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = []
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > WOW_THAT_IS_FAST:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
      animal  top_speed
0    cheetah      120.0
1  dragonfly       54.0
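As an aside, the same filter needs no loop at all: boolean indexing returns the same rows and keeps the column names. A minimal sketch with the same sample data:
import pandas as pd

data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])

# Vectorized equivalent of the itertuples loop above.
filtered_df = source_df[source_df['top_speed'] > 50]
print(filtered_df)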
