Pandas - Find a column with a specific value in the entire dataframe - python

I have a DataFrame with several columns. One column contains a value that appears only once in the entire dataframe. I want to write a function that returns the name of the column containing that specific value. I can find the column manually with the usual data exploration, but since I have multiple dataframes with the same properties, I need a somewhat generalized function that works on all of them.
The problem is that I don't know beforehand which column I am looking for, since the position of that particular column differs from dataframe to dataframe. The desired columns also have different names in different dataframes, so I cannot use something like df['my_column'] to extract the column.
Thanks

You'll need to iterate columns and look for the value:
def find_col_with_value(df, value):
    for col in df:
        if (df[col] == value).any():
            return col
This will return the name of the first column that contains value. If value does not exist, it will return None.
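For example, a quick sanity check on a toy frame (the column names and values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 9, 6]})
print(find_col_with_value(df, 9))   # -> 'b'
print(find_col_with_value(df, 42))  # -> None (value not present)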

Check the entire DataFrame for the specific value with eq, use any to see whether it appears anywhere in each column, then slice the columns (or slice the DataFrame itself if you want the Series):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.normal(0, 5, (100, 200)),
                  columns=[chr(i + 40) for i in range(200)])
df.loc[5, 'Y'] = 'secret_value' # Secret value in column 'Y'
df.eq('secret_value').any().loc[lambda x: x].index
# or
df.columns[df.eq('secret_value').any()]
Index(['Y'], dtype='object')
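And if you want the matching data itself rather than the name, slice the DataFrame with the same boolean mask (a minimal sketch continuing the example above):
df.loc[:, df.eq('secret_value').any()]             # one-column DataFrame
df.loc[:, df.eq('secret_value').any()].squeeze()   # the same column as a Series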

I have another solution:
names = ds.columns
for i in names:
    for j in ds[i]:
        if j == 'your_value':
            print(i)
            break
Here we collect all the column names, then iterate over the whole dataset until the value is found, printing the name of the matching column. Note that the break only exits the inner loop, so the scan continues with the remaining columns.
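A slightly tidier variant (a sketch, not part of the original answer) returns the column name as soon as the value is found and stops scanning entirely:
def find_column(ds, value):
    for col in ds.columns:
        if value in ds[col].values:   # membership test on the underlying array
            return col
    return None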

Related

Comparing two dataframes and storing results in another dataframe

I have two data frames like this: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: for the first column of dataframe B, I compare the first row to the first row of dataframe A, then the second row of B to the second row of A, etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on how it's easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the numpy array with shape (720, 1). dfB.le(dfA.values) then checks each column of dfB against that single column from dfA (element-wise B <= A, matching the condition in the question); this returns a boolean dataframe with the same shape as dfB. Finally, .any() checks down each column of that boolean dataframe for any True, and the resulting mask selects the columns to keep.
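A small reproducible sketch of the mechanics, with toy shapes instead of 720 rows (the names dfA, dfB and the column labels are illustrative):
import pandas as pd

dfA = pd.DataFrame({'ref': [5, 5, 5]})
dfB = pd.DataFrame({'p': [6, 7, 8], 'q': [9, 3, 9]})
mask = dfB.le(dfA.values).any()   # p: False, q: True (3 <= 5)
print(dfB.loc[:, mask])           # keeps only column 'q'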
How about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan),
             columns=B.columns, index=A.index).dropna(axis=1, how='all')
You can replace the 1 and np.nan in the np.where call with whatever values you wish, including keeping the original values of dataframe B.
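For instance, a sketch of that substitution which keeps B's original values in the cells that satisfy the condition:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), B.to_numpy(), np.nan),
             columns=B.columns, index=A.index).dropna(axis=1, how='all')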

Extract indices of a dataframe based on values (provided as an array) from a different column

I have an array: df1.values = array([1, 2, 3, 4]).
Now I want to get the indices of df2 where df2.x has the values from df1.values. So for instance, if df2.x.values = [1, 3, 4, 2, 5, 6], then I want the return to be 0, 3, 1, 2, which are the (zero-based) index values of df2 where the values from df1 can be found.
I looked everywhere on stackoverflow and was not able to find how to do this.
If I understand your question, this should work:
import pandas as pd
df1 = pd.DataFrame([1,2,3,4],columns=['x'])
df2 = pd.DataFrame([1,3,4,2,5,6],columns=['x'])
df2['old_index']=df2.index.values
df2.set_index('x').loc[df1['x']]['old_index'].values
Basically, we extract the values of the original index of df2 (these are the return values that you want) as a new column, set the x column as a new index using .set_index (assuming you don't have any missing or duplicate values), and get your return values based on the new index.
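As a possible alternative (not part of the original answer), pandas' Index.get_indexer computes the same positions directly, returning -1 for any value that is missing:
import pandas as pd

df1 = pd.DataFrame([1, 2, 3, 4], columns=['x'])
df2 = pd.DataFrame([1, 3, 4, 2, 5, 6], columns=['x'])
positions = pd.Index(df2['x']).get_indexer(df1['x'])
print(df2.index[positions].values)   # -> [0 3 1 2]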

Python: How to use or get the column header in Pandas and use it as input/value

I have the following script that produces an empty DataFrame using Pandas:
import pandas as pd
import datetime
#Creating a list of date headers (used as the columns below).
start=datetime.date(2017,3,27)
end=datetime.date.today() - datetime.timedelta(days=1)
row_dates=[x.strftime('%m/%d/%Y') for x in pd.bdate_range(start,end).tolist()]
identifiers=['A','B','C']
#Creating an empty Dataframe
df=pd.DataFrame(index=identifiers, columns=row_dates)
print(df)
Now, suppose I have a function (let's call it "my_function(index,date)") that requires two inputs: an index, and a date.
That function gives me an outcome and I want to fill the corresponding empty slot of the dataframe with that outcome. But for that, I need to be able to acquire both the index and the date.
For example, let's say I want to fill the first slot of my DataFrame. I need index 'A' and the first date, which is '03/27/2017' in the format above, so I have this:
my_function('A', '03/27/2017')
How can I make that happen for my entire DataFrame? My apologies if any of this sounds confusing.
Added to the end of your code:
for col in df.columns:
    for row in df.iterrows():
        print(row[0], col)
That prints the index label (an identifier such as 'A') and the column name (a date) for every cell. There may be a faster way to do it.
If you want to just apply a function to every cell in your df, you can use .apply, .map, or .applymap, as necessary. https://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html
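Alternatively, staying with the explicit double loop, you can assign each result through .loc (a sketch; my_function is the asker's own function, assumed to be defined elsewhere):
for date in df.columns:
    for ident in df.index:
        df.loc[ident, date] = my_function(ident, date)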

How to print a specific row of a pandas DataFrame?

I have a massive DataFrame, and I'm getting the error:
TypeError: ("Empty 'DataFrame': no numeric data to plot", 'occurred at index 159220')
I've already dropped nulls, and checked dtypes for the DataFrame so I have no guess as to why it's failing on that row.
How do I print out just that row (at index 159220) of the DataFrame?
When you call loc with a scalar value, you get a pd.Series, and that series will have a single dtype. If you want to see the row as it is in the dataframe, pass an array-like indexer to loc instead.
Wrap your index value with an additional pair of square brackets
print(df.loc[[159220]])
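A quick illustration of why the extra brackets matter, on a toy frame (names made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
print(df.loc[0])     # Series; mixed types are coerced to a single dtype (object)
print(df.loc[[0]])   # one-row DataFrame; the original column dtypes are preserved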
To print a specific row, pandas offers a couple of indexers:
loc - label-based: selects by index label or column name
iloc - integer-position based: the i stands for integer, i.e. the row number
ix - a mix of label and integer (deprecated in modern pandas)
How to use them for a specific row:
loc
df.loc[row, column]
For the first row and all columns:
df.loc[0, :]
For the first row and a specific column:
df.loc[0, 'column_name']
iloc
For the first row and all columns:
df.iloc[0, :]
For the first row and some specific columns, e.g. the first three:
df.iloc[0, 0:3]
Use the ix indexer (note that ix has since been deprecated and removed from pandas; prefer loc or iloc):
print(df.ix[159220])
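The modern equivalents would be (assuming 159220 is the index label, and in the second case also the integer position):
print(df.loc[159220])    # by index label
print(df.iloc[159220])   # by integer position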
If you want to display the row at index 159220:
row=159220
#To display in a table format
display(df.loc[row:row])
display(df.iloc[row:row+1])
#To display in print format
display(df.loc[row])
display(df.iloc[row])
Sounds like you're calling df.plot(). That error indicates that you're trying to plot a frame that has no numeric data. The data types shouldn't affect what you print().
Use print(df.iloc[159220])
You can also index the index and use the result to select row(s) using loc:
row = 159220    # an integer position: the lookup below returns a pandas Series
row = [159220]  # a list of positions: the lookup below returns a one-row DataFrame
df.loc[df.index[row]]
This is especially useful if you want to select rows by integer-location and columns by name. For example:
rows = 159220
cols = ['col2', 'col6']
df.loc[df.index[rows], cols]  # <--- OK
df.iloc[rows, cols]           # <--- doesn't work: iloc needs integer positions, not column labels
df[cols].iloc[rows]           # <--- OK, but creates an intermediate copy
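For instance, on a small frame with a non-integer index (the frame and its labels are made up for illustration):
import pandas as pd

df = pd.DataFrame({'col2': [10, 20], 'col6': [30, 40]}, index=['r1', 'r2'])
print(df.loc[df.index[1], ['col2', 'col6']])   # row by position, columns by name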

Append new data to a dataframe

I have a csv file with many columns but for simplicity I am explaining the problem using only 3 columns. The column names are 'user', 'A' and 'B'. I have read the file using the read_csv function in pandas. The data is stored as a data frame.
Now I want to remove some rows from this dataframe based on their values: I want to keep a user's row only if the value in column A is not equal to 'a' and the value in column B is not equal to 'b', and skip the rest.
The problem is that I want to dynamically create a dataframe to which I can append one row at a time, and I do not know the number of rows in advance, so I cannot specify the index when defining the dataframe.
I am using the following code:
import pandas as pd
header = ['user', 'A', 'B']
userdata = pd.read_csv('.../path/to/file.csv', sep='\t', usecols=header)
df = pd.DataFrame(columns=header)
for index, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        data = {'user': row['user'], 'A': row['A'], 'B': row['B']}
        df.append(data, ignore_index=True)
The 'data' dict is being populated properly, but the append has no effect: at the end, df is still empty.
Any help would be appreciated.
Thank you in advance.
Regarding your immediate problem, append() doesn't modify the DataFrame; it returns a new one. So you have to reassign df via:
df = df.append(data, ignore_index=True)
(Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a list-based alternative is sketched below.)
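On pandas 2.0 and later, where DataFrame.append no longer exists, a minimal sketch of the same idea is to collect the row dicts in a plain list and build the frame once at the end:
rows = []
for _, row in userdata.iterrows():
    if row['A'] != 'a' and row['B'] != 'b':
        rows.append({'user': row['user'], 'A': row['A'], 'B': row['B']})
df = pd.DataFrame(rows, columns=['user', 'A', 'B'])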
But a better solution would be to avoid iteration altogether and simply query for the rows you want. For example:
df = userdata.query('A != "a" and B != "b"')
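Or, equivalently, with a plain boolean mask instead of a query string:
df = userdata[(userdata['A'] != 'a') & (userdata['B'] != 'b')]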
