get column name that contains a specific value in pandas - python

I want to get the name of the column that contains a specific value from a whole DataFrame (assume it has more than 100 rows and more than 50 columns) in pandas.
Here is my code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
pos = 2
response = raw_input("input")
placeholder = (df == response).idxmax(axis=1)[0]
print df
print (placeholder)
Tried a lot of things already.
Example:
when the user inputs 2, it should show the answer: A
if the input is 4, the feedback should be B
and if 7, the reply should be C
I tried iloc, but it seems a row index has to be given there.
Please help. Thanks :)

Try this:
for i in df.columns:
    newDf = df.loc[lambda df: df[i] == response]
    if not newDf.empty:
        print(i)

First of all, you should treat the input as an integer. raw_input returns a string, so use input instead (in Python 3 you would write int(input(...))):
response = input("input")
After that you can use any:
df[df==YOUR_VALUE].any()
This will return a boolean Series indexed by the column names, telling you for each column whether it contains the value you are looking for.
In your example:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
response = input("input")
placeholder = df[df==response].any()
For input 4 the output will be:
A False
B True
C False
dtype: bool
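If you want the matching column names themselves rather than a boolean Series, a minimal sketch (same toy df, with the response hardcoded to 4 for illustration):
mask = df.eq(4).any()             # True for each column that contains the value
print(df.columns[mask].tolist())  # ['B']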

Related

How to drop duplicates ignoring one column

I have a DataFrame with multiple columns, and the last column is a timestamp which I want pandas to ignore. I've used drop_duplicates(subset=...) but it does not work: it returns literally the same DataFrame.
This is what the DataFrame looks like:
   id     name  features            timestamp
1  34233  Bob   athletics           04-06-2022
2  23423  John  mathematics         03-06-2022
3  34233  Bob   english_literature  06-06-2022
4  23423  John  mathematics         10-06-2022
...
And these are the data types from df.dtypes:
id           int64
name         object
features     object
timestamp    object
Lastly, this is the piece of code I used:
df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)
The idea is to keep track of changes based on the timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed for John, but I do want to keep both of Bob's rows because his features changed from athletics to english_literature. Does that make sense?
EDIT:
This is the full code:
"""
db_data contains 10 records
new_data contains 12 records but I know only 5 are needed based on the logic I want to implement
"""
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_csv("new_data.csv")
# Checking columns match
# This prints "matching"
if db_data.columns.equals(new_data.columns): print("matching")
df = pd.concat([db_data, new_data], axis=0)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)
# This outputs 22 but should have printed 15
print(len(df))
TEST:
I've done a test, but it has puzzled me even more. I created a separate table in the db, loaded the csv file new_data.csv into it, and then used read_sql to get it back into a DataFrame. Surprisingly, this works. However, I do not want to take this unnecessary extra step, and I am puzzled about why it works. I've checked the data types and they match.
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_sql("SELECT * FROM test", engine)
# Checking columns match
# This still prints "matching"
if db_data.columns.equals(new_data.columns): print("matching")
df = pd.concat([db_data, new_data], axis=0)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)
# This is the right output... in other words, it worked.
print(len(df))
The remove method of a list returns None, so subset=None was passed to drop_duplicates and it fell back to considering all columns, including timestamp. That's why the returned dataframe is identical. You can do as follows:
Create the list of columns for the subset: col_subset = df.columns.tolist()
Remove timestamp: col_subset.remove('timestamp')
Use the col_subset list in the drop_duplicates() function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
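To see the pitfall in isolation, a minimal sketch (using the sample df from above):
col_subset = df.columns.tolist()
print(col_subset.remove("timestamp"))  # None - remove() mutates the list and returns None
print(col_subset)                      # ['id', 'name', 'features'] - timestamp is gone
df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)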
Try this:
consider = [x for x in df.columns if x != "timestamp"]
df.drop_duplicates(subset=consider).reset_index(drop=True)
(You don't need tolist() and keep="first" here)
If I understood you correctly, this code should do it:
df.drop_duplicates(subset='features', keep ='first').reset_index()
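Putting the consider-list approach together on the sample table (a sketch; values transcribed from the question):
import pandas as pd

df = pd.DataFrame({
    'id': [34233, 23423, 34233, 23423],
    'name': ['Bob', 'John', 'Bob', 'John'],
    'features': ['athletics', 'mathematics', 'english_literature', 'mathematics'],
    'timestamp': ['04-06-2022', '03-06-2022', '06-06-2022', '10-06-2022'],
})
consider = [c for c in df.columns if c != 'timestamp']
out = df.drop_duplicates(subset=consider).reset_index(drop=True)
print(len(out))  # 3 - John's second, unchanged row is dropped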

How to cast variable in .query() function to lower case?

I have a df where I want to query while using values from itertuples() from another dataframe:
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == #matching_group.match")
"matching_group" is coming from df.itertuples() and I can't change that data type. The query above works.
But now I need to cast "matching_group.match" to lowercase.
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == #matching_group.match.lower()")
This does not work.
It's hard to create a minimal viable example here.
How can I cast a variable referenced via @ in df.query() to lowercase?
Your code with named tuples works well for me. One possible reason for values not matching is trailing whitespace, which you can remove with strip:
df = pd.DataFrame({ 'column_a': ['test', 'tesT', 'No']})
from collections import namedtuple
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match='TEST')
print (matching_group)
Pandas(Index=0, match='TEST')
df3 = df.query("column_a == #matching_group.match.lower()")
print (df3)
column_a
0 test
df3 = df.query("column_a.str.strip() == #matching_group.match.lower().strip()")
Input Toy Example
df = pd.DataFrame({
    'test': ['abc', 'DEF'],
    'num': [1, 2]
})
val='Abc' # variable to be matched
Input df
test num
0 abc 1
1 DEF 2
Code
df.query('test == @val.lower()')
Output
test num
0 abc 1
Tested on pandas version
pd.__version__  # '1.2.4'

How to convert object to float in Pandas?

I read a csv file into a pandas dataframe and got all column types as objects. I need to convert the second and third columns to float.
I tried using
df["Quantidade"] = pd.to_numeric(df.Quantidade, errors='coerce')
but got NaN.
Here's my dataframe. Do I need to use some regex on the third column to get rid of the "R$ "?
Try this:
import pandas as pd

# sample dataframe
d = {'Quantidade': ['0,20939', '0,0082525', '0,009852', '0,012920', '0,0252'],
     'price': ['R$ 165.000,00', 'R$ 100.000,00', 'R$ 61.500,00', 'R$ 65.900,00', 'R$ 49.375,12']}
df = pd.DataFrame(data=d)
# Second column: swap the decimal comma for a dot, then cast
df["Quantidade"] = df["Quantidade"].str.replace(',', '.', regex=False).astype(float)
# Third column: strip the "R$ " prefix, drop the thousands dots, swap the decimal comma, then cast
df['price'] = (df.price.str.replace(r'\w+\$\s+', '', regex=True)
               .str.replace('.', '', regex=False)
               .str.replace(',', '.', regex=False)
               .astype(float))
Output:
Quantidade price
0 0.209390 165000.00
1 0.008252 100000.00
2 0.009852 61500.00
3 0.012920 65900.00
4 0.025200 49375.12
Try something like this:
df["Quantidade"] = df["Quantidade"].str.replace(',', '.', regex=False).astype(float)
(The astype(float) at the end already performs the cast, so a separate astype call afterwards is redundant.)
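Alternatively, the cleanup can often be avoided at read time: read_csv accepts decimal and thousands parameters (a sketch; 'data.csv' is a placeholder name, and a currency prefix like "R$ " would still need stripping separately):
df = pd.read_csv('data.csv', decimal=',', thousands='.')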

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in it I have the scores collected for a match stored in a list, say x = [1,0,4]. Using pandas I have found where in the data these scores exist, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has exactly these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does element-wise compare and returns a row of True/False. So you need to aggregate those rows with .all(axis=1) which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
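Since the program already prints "found" / "not found", a short sketch tying the two together (same df and x as above):
matches = df[df.eq(x).all(axis=1)].index.tolist()
print(matches if matches else 'not found')  # ['Charlie']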
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the names based on the index of mydf
# (if more than one name matches, it prints all of them; if only one matches, just that one)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this (here data is your DataFrame; change the name accordingly), assuming [1,0,4] holds int values:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the columns are object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
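If you'd rather not hardcode one comparison per column, a sketch that builds the mask straight from the list (assuming name is a regular column, as in the answers above):
match_cols = ['match1', 'match2', 'match3']
x = [1, 0, 4]
mask = (df[match_cols] == x).all(axis=1)  # the list is compared positionally across the columns
print(df.loc[mask, 'name'].tolist())      # ['Charlie']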

Subset string rows that contain a 'flexible' pattern

I have the following df.
import pandas as pd

data = [
    ['DWWWWD'],
    ['DWDW'],
    ['WDWWWWWWWWD'],
    ['DDW'],
    ['WWD'],
]
df = pd.DataFrame(data, columns=['letter_sequence'])
I want to subset the rows that contain the pattern 'D' + '[whichever number of W's]' + 'D'. Examples of rows I want in my output df: DWD, DWWWWWWWWWWWD, WWWWWDWDW...
I came up with the following, but it does not really work for 'whichever number of W's'.
df[df['letter_sequence'].str.contains(
'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]
Desired output new_df:
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
Any alternatives?
Use [W]{1,} (or simply W+) for one or more W; regex=True is the default, so it can be omitted:
df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print (df)
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
You can use the regex: D\w+D.
The code is shown below:
df = df[df['letter_sequence'].str.contains(r'D\w+D')]
Note that \w matches any word character, not just W, so this is slightly looser than D[W]{1,}D.
Please let me know if it helps.
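To see the difference between the two patterns, a tiny sketch (the DXD row is hypothetical, added for illustration):
test = pd.DataFrame({'letter_sequence': ['DWWWWD', 'DWDW', 'DXD']})
print(test[test['letter_sequence'].str.contains('D[W]{1,}D')])  # matches DWWWWD and DWDW
print(test[test['letter_sequence'].str.contains(r'D\w+D')])     # also matches DXD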
