find the field names from a search query - python

I have a WHERE-condition query, and I want to build a DataFrame with the fields used inside that condition.
The question is how to extract those field names from the condition.
I tried things like finding the substring before each operator (like ==, >=, &, /) using rstrip/lstrip, but without success. I believe some string-search method will do it, but I am not getting it.
my where condition is
whereFields = "CITY_NAME == 'city1' & EVENT_GENRE == 'KIDS' & count_EVENT_GENRE >= 1$#$FAV_VENUE_CITY_NAME == 'city1' & EVENT_GENRE == 'FANTASY' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'FESTIVAL' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'WORKSHOP' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'EXHIBITION' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & FAV_GENRE == '|DRAMA|'$#$CITY_NAME == 'city1' & & FAV_GENRE == '|ACTION|ADVENTURE|SCI-FI"
I want the field names involved, i.e. my DataFrame should have all the unique columns.
Any help will be appreciated.

import re
# Split on '&' or the '$#$' record separator, then keep the text before
# the comparison operator ('==' or '>=') in each clause.
res = [re.split(r'==|>=', x)[0].strip() for x in re.split(r'&|\$#\$', whereFields)]
This seems to work. Now you may want only the unique ones, with empty fields dropped:
res = [x for x in set(res) if x]
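From there, building the single-column DataFrame the question asks for is straightforward (a sketch; 'fields' is just an illustrative column name):
import pandas as pd

fields_df = pd.DataFrame({'fields': res})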

In [98]:
pd.DataFrame(data = pd.Series(re.findall(r'\w+ *(?==|<|>)', whereFields)).unique(), columns = ['fields'])
Out[98]:
fields
0 CITY_NAME
1 EVENT_GENRE
2 count_EVENT_GENRE
3 FAV_VENUE_CITY_NAME
4 FAV_GENRE

Related

Loop through a few commands using a function

I have a set of commands that I want to turn into a loop, but my loop constantly gives me an error message.
print(((df1['col1_df'] == 0) & (df1['col2_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col3_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col4_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col5_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col2_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col3_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col4_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col5_df'] == True)).sum())
I want to loop them through a function.
So far I have:
for i in range(2, 5):
    col = "col{}_df".format(i)
    print(((df['col'] == 0) & (df['col'] == 2)).sum())
How can I number the df and let it go through 1, 2, 3, 4 (like df1, df2, df3)?
col is a variable; 'col' is a string. df['col'] looks up the literal column name 'col', not the value of the variable col, so use df[col] instead.
The format call itself, col = "col{}_df".format(i), is fine; the problem is that you never use its result.
Also, range(2, 5) will give you [2, 5), not [2, 5]. It is not inclusive, so you need range(2, 6) to reach col5.
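Putting those fixes together, a minimal corrected sketch (assuming df1 and df2 already exist) iterates over the dataframes themselves instead of trying to build their names as strings:
for df in (df1, df2):  # "numbers" the dataframes by listing them explicitly
    for i in range(2, 6):  # 2..5, since range's end is exclusive
        col = "col{}_df".format(i)
        print(((df['col1_df'] == 0) & (df[col] == True)).sum())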
You can express your entire code in two lines using a comprehension:
print(*( ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6) ), sep="\n")
print(*( ((df2['col1_df'] == 0) & (df2[f'col{i}_df'] == True)).sum() for i in range(2, 6) ), sep="\n")
The expression (((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6)) creates a generator object yielding the 4 sums one by one.
The * unpacks the generator and passes its elements to print as if you had written a comma-separated argument list. The sep="\n" ensures each of these arguments is separated by a newline in the output.

Applying more than Two Filter to DataFrame with & operator

I have a df and I want to apply multiple filters to it.
...
def applyFilter(self):
    ## 1st condition
    if self.col1_lineEdit.text() != "":
        self.filter_col1 = (self.myDataFrame['col1'] == self.col1_lineEdit.text())
    else:
        self.filter_col1 = [True] * len(self.myDataFrame)  # If the line edit is empty, take all values.
    if self.col2_lineEdit.text() != "":
        self.filter_col2 = (self.myDataFrame['col2'] == self.col2_lineEdit.text())
    else:
        self.filter_col2 = [True] * len(self.myDataFrame)
    if self.col3_lineEdit.text() != "":
        self.filter_col3 = (self.myDataFrame['col3'] == self.col3_lineEdit.text())
    else:
        self.filter_col3 = [True] * len(self.myDataFrame)
    ## 2nd condition
    if self.col4_lineEdit.text() != "":
        self.filter_col4 = (self.myDataFrame['col4'] == self.col4_lineEdit.text())
    else:
        self.filter_col4 = [True] * len(self.myDataFrame)
    if self.col5_lineEdit.text() != "":
        self.filter_col5 = (self.myDataFrame['col5'] == self.col5_lineEdit.text())
    else:
        self.filter_col5 = [True] * len(self.myDataFrame)
    if self.col6_lineEdit.text() != "":
        self.filter_col6 = (self.myDataFrame['col6'] == self.col6_lineEdit.text())
    else:
        self.filter_col6 = [True] * len(self.myDataFrame)
...
This is my applyFilter method. After that I used something like this:
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
This works only for the col1 and col2 filters. However, if I change the formula like this (replacing the first & with and):
self.filteredResult = self.myDataFrame[self.filter_col1 and self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
it works for col2 and col3. So I tried to debug the code, and then I got an error like this:
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
TypeError: unsupported operand type(s) for &: 'list' and 'list'
When I searched the problem on the internet, every solution applied two filters, not more. If I apply two filters there is no problem, but I need to apply six. I also searched for and tried some solutions based on the error message, and got nothing. Can you help me, or say anything about the problem?
Edit: I didn't understand the error. If the & operator is not allowed on lists, why does it work for the first and second filters but not the others?
Important edit: I forgot to say this: if I apply a filter on the first or second column, I can then apply the other filters (in this case).
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
As far as I understand, the program compares the first two filters with the & operator, and then the other filters depend on these.
Try adding parentheses around each condition:
self.filteredResult = self.myDataFrame[(self.filter_col1) & (self.filter_col2) & (self.filter_col3) & (self.filter_col4) & (self.filter_col5) & (self.filter_col6)]
To solve this problem, I added a column that has no effect on the results but tricks the filter: an extra column named ALL, filled with the value "ALL". Then I edited my filter as:
self.filteredResult = self.myDataFrame[(self.myDataFrame['ALL'] == "ALL") & self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
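The underlying cause of the error: [True] * len(self.myDataFrame) is a plain Python list, and & is not defined between two lists; it only works when at least one operand is a pandas Series. That is why the position of the empty filters mattered: the chain is evaluated left to right, so it fails exactly when the first two operands are both lists. A sketch of a fix that avoids the extra ALL column is to make the fallback an all-True boolean Series (make_filter is a hypothetical helper name):
import pandas as pd

def make_filter(frame, column, text):
    # Same pattern as above, but the fallback is a boolean Series
    # rather than a list, so chaining with & always works.
    if text != "":
        return frame[column] == text
    return pd.Series(True, index=frame.index)
Each filter would then be built as, e.g., self.filter_col1 = make_filter(self.myDataFrame, 'col1', self.col1_lineEdit.text()).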

How to check if a serial number starts with 0

I'm excluding rows from my df that fulfill certain conditions:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20))]
I would also like to exclude the numbers that start with 0 in the column 'Serial':
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'] == range(0) == 0))]
I tried the above, with no result.
You will probably want to use df['Serial'].str.startswith to check the first character:
df[ ~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'].str.startswith('0')))]
The current expression df['Serial'] == range(0) == 0 is meaningless. Because of Python's comparison chaining, it is equivalent to (df['Serial'] == range(0)) and (range(0) == 0). Clearly, neither of those is related to comparing the first character of the string to '0' (as opposed to the integer 0).
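One caveat: .str.startswith assumes the 'Serial' column holds strings. If it is numeric, any leading zeros are already gone from the stored values, but casting still lets you apply the same check (a sketch):
# Assumption: 'Serial' may not be a string dtype; cast before the check.
mask = df['Serial'].astype(str).str.startswith('0')
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & mask)]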

Dynamically create DF and filter with loop

What is the best practice to loop this?
I want to filter and dynamically create new DFs
df_aus19 = df_mat.loc[(df_temp['Year'] == 2019) & (df_mat['Country'] == 'AUS')]
df_aus20 = df_mat.loc[(df_temp['Year'] == 2020) & (df_mat['Country'] == 'AUS')]
df_aus21 = df_mat.loc[(df_temp['Year'] == 2021) & (df_mat['Country'] == 'AUS')]
df_aus22 = df_mat.loc[(df_temp['Year'] == 2022) & (df_mat['Country'] == 'AUS')]
df_aus23 = df_mat.loc[(df_temp['Year'] == 2023) & (df_mat['Country'] == 'AUS')]
A more Pythonic way to solve your problem than loops and dynamically created global variables is a list comprehension:
years = list(range(2019, 2024))
df_aus_list = [df_mat.loc[(df_temp['Year'] == i) & (df_mat['Country'] == 'AUS')] for i in years]
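If you would rather look results up by year than by list position, a dictionary comprehension is a small variation on the same idea (mirroring the question's mix of df_temp and df_mat as-is):
df_aus = {year: df_mat.loc[(df_temp['Year'] == year) & (df_mat['Country'] == 'AUS')]
          for year in range(2019, 2024)}
You can then write, for example, df_aus[2021] instead of df_aus21.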
If you really need to produce the same result as your code you could dynamically create new global variables like so:
years = list(range(2019, 2024))
for i in years:
    globals()["df_aus" + str(i % 100)] = df_mat.loc[(df_temp['Year'] == i) & (df_mat['Country'] == 'AUS')]
Don't. Just don't. When you find yourself ready to dynamically create new variables, just refrain: the result is hard to obtain and harder to maintain. Use containers like dictionaries or lists instead.
And the common way to get that in pandas is groupby:
dfs = dict()
for (year, country), sub in df.groupby(['Year', 'Country']):
    # if 2020 <= year <= 2023 and country == 'AUS':  # filter, optionally
    dfs[(year, country)] = sub
You can then get the dataframe for AUS and 2021 as dfs[(2021, 'AUS')].

Speed up Pandas: find all columns which fullfill set of conditions

I have data represented using pandas DataFrame, which for example looks as follows:
| id | entity | name | value | location
where id is an integer value, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:
for d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == CA:
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == US:
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == US:
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built in functions instead of the for loop?
Use np.select, to which you pass a list of conditions; based on those conditions you give it choices, and you can specify a default value for when none of the conditions is met.
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
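To see the mechanics on a self-contained toy example (the data below is invented purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'entity':   [10, 5, 0, 7],
    'value':    [500, 1000, 1000, 42],
    'location': ['CA', 'US', 'US', 'UK'],
})
conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US'),
]
df['flag'] = np.select(conditions, ["A", "C", "B"], default="Different case")
print(df['flag'].tolist())  # ['A', 'C', 'B', 'Different case']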
Add () around each condition and combine them with the bitwise and (&) so the masks work with numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == 'CA':
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == 'US':
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == 'US':
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)
Take a look at this related question for more information.
If you truly want to find all rows which satisfy some condition, the usual way to do this with a Pandas dataframe is df.loc and boolean indexing (note the parentheses, which are required because & binds more tightly than the comparisons):
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
