How to cast a variable in the .query() function to lower case? - python

I have a df where I want to query while using values from itertuples() of another dataframe:
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match")
"matching_group" comes from df.itertuples() and I can't change that data type. The query above works.
But now I need to cast "matching_group.match" to lowercase.
matching_group = Pandas(Index=0, match='TEST')
df.query("column_a == @matching_group.match.lower()")
This does not work.
It's hard to create a minimal reproducible example here.
How can I cast a variable used via @ in a df.query() to lowercase?

Your code with named tuples works well for me; one possible reason for values not matching could be trailing whitespace, which you can remove with strip:
df = pd.DataFrame({ 'column_a': ['test', 'tesT', 'No']})
from collections import namedtuple
Pandas = namedtuple('Pandas', 'Index match')
matching_group = Pandas(Index=0, match='TEST')
print (matching_group)
Pandas(Index=0, match='TEST')
df3 = df.query("column_a == #matching_group.match.lower()")
print (df3)
column_a
0 test
df3 = df.query("column_a.str.strip() == #matching_group.match.lower().strip()")

Input Toy Example
df = pd.DataFrame({
    'test': ['abc', 'DEF'],
    'num': [1, 2]
})
val='Abc' # variable to be matched
Input df
test num
0 abc 1
1 DEF 2
Code
df.query('test == @val.lower()')
Output
test num
0 abc 1
Tested on pandas version
pd.__version__ # '1.2.4'
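If the column itself can also contain mixed case (as in the original question), you can lower both sides inside the expression. A minimal sketch, assuming engine='python' is acceptable (the default numexpr engine may reject the .str accessor in some pandas versions):
import pandas as pd

df = pd.DataFrame({'test': ['abc', 'DEF'], 'num': [1, 2]})
val = 'dEf'  # hypothetical mixed-case value to match

# lower both sides of the comparison
print(df.query('test.str.lower() == @val.lower()', engine='python'))
#   test  num
# 1  DEF    2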

Related

Search columns with a list of strings for a specific set of text and, if the text is found, enter a new string of text in a new column

I want to search for names in column col_one, where I have a list of names in the variable list20. When searching, if the value of col_one matches an entry in list20, put that name in a new column named new_col.
Most of the time the name will be at the front, such as ZEN, W, WICE, but some names will have
a symbol or suffix after the name, such as ZEN-R, ZEN-W2, ZEN13P2302A.
my data
import pandas as pd
list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
data = {
    "col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
# The code you provided gives a result that is not right
# or--------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or--------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
      .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
     )
# or----------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The result obtained from the code above is not right.
Expected Output
Try to sort the list first:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
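The sort matters because re alternation returns the first alternative that matches at a given position, so without it a short name like W or WHA can win over WHAUP. The same longest-first trick also works with str.extract; a minimal sketch on a few of the question's values, assuming 'na' as the fallback label:
import pandas as pd

list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
df = pd.DataFrame({"col_one": ["ZEN", "WPH", "WHAUP", "KCE13P2302A", "OOP-R"]})

# longest names first, so WHAUP is tried before WHA and W
pattern = '|'.join(sorted(list20, key=len, reverse=True))
df['new_col'] = df['col_one'].str.extract('(' + pattern + ')')[0].fillna('na')
print(df)
#        col_one new_col
# 0          ZEN     ZEN
# 1          WPH     WPH
# 2        WHAUP   WHAUP
# 3  KCE13P2302A     KCE
# 4        OOP-R     OOP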
Try with str.extract
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
col_one new
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
One way to do this, though less attractive in terms of efficiency, is to use a simple function with a lambda:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit
df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
expected results:
col_one new_col
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
Another approach:
pattern = re.compile(r"|".join(x for x in list20))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)

change the first occurrence in a pandas column based on a certain condition

I would like to change the first digit in the number column to +233, given the first digit is 0; basically I would like all rows in number to be like that of row Paul.
Both columns are string objects.
Expectations:
The first character of the values in the column df["number"] should be replaced with "+233", but only if it equals "0".
df = pd.DataFrame([["ken", "080222333222"],
                   ["ben", "+233948433"],
                   ["Paul", "0800000073"]],
                  columns=['name', 'number'])
Hope I understood your edit; try this:
Notice - I removed the first 0 and replaced it with +233
import pandas as pd
df = pd.DataFrame([["ken", "080222333222"], ["ben", "+233948433"], ["Paul", "0800000073"]], columns=['name', 'number'])
def convert_number(row):
    if row[0] == '0':
        row = "+233" + row[1:]
    return row
df['number'] = df['number'].apply(convert_number)
print(df)
You can use replace directly
df['replace_Col'] = df.number.str.replace(r'^\d', '+233', n=1, regex=True)
which produced
name number replace_Col
0 ken 080222333222 +23380222333222
1 ben +233948433 +233948433
2 Paul 0800000073 +233800000073
The full code to reproduce the above
import pandas as pd
df = pd.DataFrame([['ken', '080222333222'], ['ben', '+233948433'],
['Paul', '0800000073']], columns=['name', 'number'])
df['replace_Col'] = df.number.str.replace(r'^\d', '+233', n=1, regex=True)
print(df)
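If you want to touch only numbers whose first character is literally '0' (as the question asks), a hedged alternative is a boolean mask with str.startswith, which avoids regex entirely:
import pandas as pd

df = pd.DataFrame([['ken', '080222333222'], ['ben', '+233948433'],
                   ['Paul', '0800000073']], columns=['name', 'number'])

# only rows whose number starts with '0' get the prefix swap
mask = df['number'].str.startswith('0')
df.loc[mask, 'number'] = '+233' + df.loc[mask, 'number'].str[1:]
print(df)
#    name           number
# 0   ken  +23380222333222
# 1   ben       +233948433
# 2  Paul    +233800000073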

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas, and I can print "found" or "not found". However, I want my code to print out the name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does element-wise compare and returns a row of True/False. So you need to aggregate those rows with .all(axis=1) which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
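If you start from a plain list x = [1, 0, 4] rather than a Series, the comparison is positional, so it only works when the list order matches the column order. A hedged one-liner under that assumption (reusing the df with name as its index from above):
x = [1, 0, 4]  # order must match match1, match2, match3
print(df[(df == x).all(axis=1)].index.tolist())
# ['Charlie']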
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
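A self-contained version of that idea, reading the scores from the list x instead of hard-coding them (this sketch assumes name is a regular column, not the index):
import pandas as pd
from io import StringIO

csv_data = """name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_data))

x = [1, 0, 4]
print(df.loc[(df.match1 == x[0]) & (df.match2 == x[1]) & (df.match3 == x[2]), 'name'])
# 2    Charlie
# Name: name, dtype: object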
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the index of mydf
# If there is more than one matching name, it will print all of them; if there is only one, it will print just that one
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame (change the name to match your own), and it assumes [1,0,4] are of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the data is of object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])

Drop rows in pandas if they contain "???"

I'm trying to drop rows in pandas that contain "???". It works for every other value except "???", and I do not know what the problem is.
This is my code (I have tried both types):
df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]
error that I'm getting:
re.error: nothing to repeat at position 0
It works for every other value except for "?????".
I have googled it and looked all over this website, but I couldn't find any solution.
The parameter expects a regular expression, hence the error re.error.
You can either escape the ? inside the expression like this:
df = df[~df["text"].str.contains("\?\?\?\?\?")]
Or set regex=False as Vorsprung sugested:
df = df[~df["text"].str.contains("?????",regex=False)]
let's convert this into running code:
import numpy as np
import pandas as pd
data = {'A': ['abc', 'cxx???xx', '???',], 'B': ['add', 'ddb', 'c', ]}
df = pd.DataFrame.from_dict(data)
df
output:
A B
0 abc add
1 cxx???xx ddb
2 ??? c
with this:
df[df['A'].str.contains('???',regex=False)]
output:
A B
1 cxx???xx ddb
2 ??? c
You need to tell contains() that your search string is not a regex.
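To actually drop those rows, as the question asks, negate the mask (this reuses the toy df from above):
df = df[~df['A'].str.contains('???', regex=False, na=False)]
print(df)
#      A    B
# 0  abc  add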

get column name that contains a specific value in pandas

I want to get the column name from the whole dataframe (assume it contains more than 100 rows and more than 50 columns) based on a specific value contained in a specific column, in pandas.
Here is my code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
pos = 2
response = raw_input("input")
placeholder = (df == response).idxmax(axis=1)[0]
print df
print (placeholder)
Tried a lot . . .
Example:
when the user will input 2; it will show answer: A
if the input is 4; feedback will be B
and if 7 then reply will be C
I tried iloc, but it seems the row has to be specified there.
Please Help Dear Guys . . . . .
Thanks . . . :)
Try this
for i in df.columns:
    newDf = df.loc[lambda df: df[i] == response]
    if not newDf.empty:
        print(i)
First of all you should treat the input as an integer. So instead of raw_input, use input (this is Python 2 advice; in Python 3 you would write int(input("input"))):
response = input("input")
After that you can use any:
df[df==YOUR_VALUE].any()
This will return a boolean Series with column names and whether they contain the value you are looking for.
In your example:
df = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
response = input("input")
placeholder = df[df==response].any()
for input 4 the output will be:
A False
B True
C False
dtype: bool
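If you want just the matching column names as strings rather than the full boolean Series, a small follow-up sketch (assuming the same df and response as above):
hits = df[df == response].any()
print(hits[hits].index.tolist())
# ['B']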
