I have a where-condition query, and I want to make a dataframe with the fields used inside that where condition.
The question is how to extract those fields from inside the where condition.
I tried things like finding the string before any operator (like ==, >=, &, /) using rstrip/lstrip, but still no success. I do believe some string-search method will do it, but I am not getting it.
My where condition is:
whereFields = "CITY_NAME == 'city1' & EVENT_GENRE == 'KIDS' & count_EVENT_GENRE >= 1$#$FAV_VENUE_CITY_NAME == 'city1' & EVENT_GENRE == 'FANTASY' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'FESTIVAL' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'WORKSHOP' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & EVENT_GENRE == 'EXHIBITION' & count_EVENT_GENRE >= 1$#$CITY_NAME == 'city1' & FAV_GENRE == '|DRAMA|'$#$CITY_NAME == 'city1' & & FAV_GENRE == '|ACTION|ADVENTURE|SCI-FI"
I want the field names involved; i.e., my dataframe should have all the unique columns.
Any help will be appreciated.
import re

# Split on the clause separators (& and $#$), then keep the text
# before the comparison operator (== or >=) in each clause.
res = [re.split(r'==|>=', x)[0].strip() for x in re.split(r'&|\$#\$', whereFields)]
seems to work. Now you may want the unique ones, with empty entries dropped:
res = [x for x in set(res) if x]
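A quick sanity check (running the two lines above against the whereFields string from the question should give, roughly):

print(sorted(res))
# ['CITY_NAME', 'EVENT_GENRE', 'FAV_GENRE', 'FAV_VENUE_CITY_NAME', 'count_EVENT_GENRE']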
In [98]:
pd.DataFrame(data=pd.Series(re.findall(r'\w+ *(?==|<|>)', whereFields)).unique(), columns=['fields'])
Out[98]:
                fields
0            CITY_NAME
1          EVENT_GENRE
2    count_EVENT_GENRE
3  FAV_VENUE_CITY_NAME
4            FAV_GENRE
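For reference, the same idea as a self-contained snippet (imports spelled out; whereFields as defined in the question):

import re
import pandas as pd

# Word characters immediately before ==, <, or > are the field names;
# .str.strip() drops the trailing space the pattern may capture.
fields = pd.Series(re.findall(r'\w+ *(?==|<|>)', whereFields)).str.strip().unique()
print(pd.DataFrame(data=fields, columns=['fields']))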
I have a loop where I constantly get an error message.
print(((df1['col1_df'] == 0) & (df1['col2_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col3_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col4_df'] == True)).sum())
print(((df1['col1_df'] == 0) & (df1['col5_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col2_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col3_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col4_df'] == True)).sum())
print(((df2['col1_df'] == 0) & (df2['col5_df'] == True)).sum())
I want to loop them through a function.
So far I have:
for i in range(2, 5):
    col = "col{}_df".format(i)
    print(((df['col'] == 0) & (df['col'] == 2)).sum())
How can I number the df and let it go through 1, 2, 3, 4 (like df1, df2, df3)?
col is a variable; 'col' is a string literal. Writing df['col'] looks up a column literally named "col" — it doesn't refer to the variable col. You want df[col].
The string formatting itself, col = "col{}_df".format(i), builds the right name; it just never gets used because of the quoting above.
Also, range(2, 5) gives you [2, 5), not [2, 5] — the end is not inclusive, so col5_df is never reached.
You can express your entire code in two lines using a comprehension:
print(*(((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6)), sep="\n")
print(*(((df2['col1_df'] == 0) & (df2[f'col{i}_df'] == True)).sum() for i in range(2, 6)), sep="\n")
The expression ((df1['col1_df'] == 0) & (df1[f'col{i}_df'] == True)).sum() for i in range(2, 6) creates a generator yielding the 4 sums one by one.
The * unpacks this generator and passes its elements to print as if you had written a comma-separated argument list. sep="\n" ensures the values are separated by newlines in the output.
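If you also want the dataframes themselves to vary (df1, df2, ...), a simple sketch — assuming df1 and df2 already exist — is to put them in a tuple and nest the loops rather than numbering variables:

for df in (df1, df2):        # iterate over the dataframes directly
    for i in range(2, 6):    # the end is exclusive, so this covers col2_df..col5_df
        print(((df['col1_df'] == 0) & (df[f'col{i}_df'] == True)).sum())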
I have a df and I want to apply multiple filters to that df.
...
def applyFilter(self):
    ## 1st Condition
    if self.col1_lineEdit.text() != "":
        self.filter_col1 = (self.myDataFrame['col1'] == self.col1_lineEdit.text())
    else:
        self.filter_col1 = [True] * len(self.myDataFrame)  # If line edit is empty, take all values.
    if self.col2_lineEdit.text() != "":
        self.filter_col2 = (self.myDataFrame['col2'] == self.col2_lineEdit.text())
    else:
        self.filter_col2 = [True] * len(self.myDataFrame)
    if self.col3_lineEdit.text() != "":
        self.filter_col3 = (self.myDataFrame['col3'] == self.col3_lineEdit.text())
    else:
        self.filter_col3 = [True] * len(self.myDataFrame)
    ## 2nd conditions
    if self.col4_lineEdit.text() != "":
        self.filter_col4 = (self.myDataFrame['col4'] == self.col4_lineEdit.text())
    else:
        self.filter_col4 = [True] * len(self.myDataFrame)
    if self.col5_lineEdit.text() != "":
        self.filter_col5 = (self.myDataFrame['col5'] == self.col5_lineEdit.text())
    else:
        self.filter_col5 = [True] * len(self.myDataFrame)
    if self.col6_lineEdit.text() != "":
        self.filter_col6 = (self.myDataFrame['col6'] == self.col6_lineEdit.text())
    else:
        self.filter_col6 = [True] * len(self.myDataFrame)
...
This is my apply-filter method. After that I used something like this:
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
This works only for the col1 and col2 filtering. However, if I change the formula like this (joining the first operand with and instead of &):
self.filteredResult = self.myDataFrame[self.filter_col1 and self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
it works for col2 and col3. So I tried to debug the code, and I got an error like this:
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
TypeError: unsupported operand type(s) for &: 'list' and 'list'
When I searched for the problem on the internet, every solution applied two filters, not more. If I apply two filters there is no problem; however, I need to apply six filters. I also searched for and tried some solutions based on the error message, but got nothing. Can you help me, or say anything about the problem?
Edit: I didn't understand the error. If the & operator is not allowed on lists, why can it be applied between the first and second filters but not the others?
Important edit: I forgot to say this; if I apply a filter on the first or second column, I can apply the other filters (in this case):
self.filteredResult = self.myDataFrame[self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
As far as I understand, the program compares the first two filters with the & operator, and then the other filters depend on these filters.
Try adding parentheses around each condition:
self.filteredResult = self.myDataFrame[(self.filter_col1) & (self.filter_col2) & (self.filter_col3) & (self.filter_col4) & (self.filter_col5) & (self.filter_col6)]
To solve this problem, I had to add a column that had no effect on the results but tricked the filter. So I added to my table an additional column named ALL, filled with the value "ALL". Then I edited my filter as:
self.filteredResult = self.myDataFrame[(self.myDataFrame['ALL'] == "ALL") & self.filter_col1 & self.filter_col2 & self.filter_col3 & self.filter_col4 & self.filter_col5 & self.filter_col6]
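For what it's worth, my reading of the root cause (not stated in the answers above): & between a pandas Series and a plain list is evaluated elementwise, but & between two plain lists is undefined, and the empty-line-edit defaults here are plain lists. A minimal sketch of an alternative, assuming the same structure as the question, is to build the defaults as boolean Series so & is always pandas' elementwise AND:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'a'], 'col2': ['x', 'y', 'x']})

# "Match everything" default as a boolean Series, not a list:
filter_col1 = pd.Series(True, index=df.index)
filter_col2 = df['col2'] == 'x'

print(df[filter_col1 & filter_col2])  # Series & Series works for any number of filters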
I'm excluding rows from my df that fulfill certain conditions:
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20))]
I would like to also exclude the numbers that start with 0 in the column 'Serial'.
df[~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'] == range(0) == 0))]
I tried the above, no result.
You will probably want to use the .str.startswith accessor on the column to check the first character:
df[ ~((df['Wood_type'] == 'pine') & (df['wood_size'] == 20) & (df['Serial'].str.startswith('0')))]
The current expression df['Serial'] == range(0) == 0 is meaningless: by Python's comparison chaining it is equivalent to (df['Serial'] == range(0)) and (range(0) == 0). Clearly, neither of those is related to comparing the first character of the string to '0' (as opposed to the integer 0).
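One caveat (an assumption about the data, not part of the original answer): .str.startswith only works on string values, so if Serial is stored numerically, cast it first. A toy sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Wood_type': ['pine', 'oak', 'pine'],
                   'wood_size': [20, 20, 30],
                   'Serial': ['0123', '4567', '0890']})

mask = ((df['Wood_type'] == 'pine') & (df['wood_size'] == 20)
        & df['Serial'].astype(str).str.startswith('0'))
print(df[~mask])  # keeps the oak row and the pine row with size 30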
What is the best practice to loop this?
I want to filter and dynamically create new DFs
df_aus19 = df_mat.loc[(df_temp['Year'] == 2019) & (df_mat['Country'] == 'AUS')]
df_aus20 = df_mat.loc[(df_temp['Year'] == 2020) & (df_mat['Country'] == 'AUS')]
df_aus21 = df_mat.loc[(df_temp['Year'] == 2021) & (df_mat['Country'] == 'AUS')]
df_aus22 = df_mat.loc[(df_temp['Year'] == 2022) & (df_mat['Country'] == 'AUS')]
df_aus23 = df_mat.loc[(df_temp['Year'] == 2023) & (df_mat['Country'] == 'AUS')]
A more Pythonic way to solve your problem than loops and dynamically created global variables would be a list comprehension:
years = list(range(2019, 2024))
df_aus_list = [df_mat.loc[(df_temp['Year'] == i) & (df_mat['Country'] == 'AUS')] for i in years]
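If you need to look the slices up by year afterwards, a dict keyed by year (same idea, same df_mat/df_temp assumption as above) is often more readable than positional list indices:

df_aus = {year: df_mat.loc[(df_temp['Year'] == year) & (df_mat['Country'] == 'AUS')]
          for year in range(2019, 2024)}
df_aus[2021]  # the 2021 slice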
If you really need to produce the same result as your code you could dynamically create new global variables like so:
years = list(range(2019, 2024))
for i in years:
    globals()["df_aus" + str(i % 100)] = df_mat.loc[(df_temp['Year'] == i) & (df_mat['Country'] == 'AUS')]
Don't. Just don't. When you find yourself ready to dynamically create new variables, just refrain: it is hard to obtain and harder to maintain. Use containers like dictionaries or lists instead.
And the common way to get that in pandas is groupby:
dfs = dict()
for (year, country), sub in df.groupby(['Year', 'Country']):
    # if 2020 <= year <= 2023 and country == 'AUS':  # optional filter
    dfs[(year, country)] = sub
You can then get the dataframe for AUS and 2021 as dfs[(2021, 'AUS')].
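A toy run with made-up data to show the lookup:

import pandas as pd

df = pd.DataFrame({'Year': [2020, 2021, 2021],
                   'Country': ['AUS', 'AUS', 'NZ'],
                   'val': [1, 2, 3]})
dfs = {key: sub for key, sub in df.groupby(['Year', 'Country'])}
print(dfs[(2021, 'AUS')])  # the single row with Year=2021, Country='AUS'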
I have data represented using a pandas DataFrame which, for example, looks as follows:
| id | entity | name | value | location |
where id is an integer, entity is an integer, name is a string, value is an integer, and location is a string (for example US, CA, UK, etc.).
Now, I want to add a new column "flag" to this data frame, where values are assigned as follows:
for _, d in df.iterrows():
    if d.entity == 10 and d.value != 1000 and d.location == "CA":
        d.flag = "A"
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == "US":
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == "US":
        d.flag = "B"
    else:
        print("Different case")
Is there a way to speed this up and use some built in functions instead of the for loop?
Use np.select, to which you pass a list of conditions; based on those conditions you give it choices, and you can specify a default value for when none of the conditions is met.
import numpy as np

conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
choices = ["A", "C", "B"]
df['flag'] = np.select(conditions, choices, default="Different case")
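A quick check with made-up data (hypothetical values, just to illustrate the resulting column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'entity': [10, 5, 0, 10],
                   'value': [500, 1000, 1000, 1000],
                   'location': ['CA', 'US', 'US', 'UK']})
conditions = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US'),
]
df['flag'] = np.select(conditions, ["A", "C", "B"], default="Different case")
print(df['flag'].tolist())  # ['A', 'C', 'B', 'Different case']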
Add parentheses around each comparison and combine them with the bitwise AND (&) so the boolean masks work with numpy.select:
m = [
    (df.entity == 10) & (df.value != 1000) & (df.location == 'CA'),
    (df.entity != 10) & (df.entity != 0) & (df.value == 1000) & (df.location == 'US'),
    (df.entity == 0) & (df.value == 1000) & (df.location == 'US')
]
df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")
You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.
If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:
def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == "CA":
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == "US":
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == "US":
        return "B"
    else:
        return "Different case"
df['flag'] = df.apply(flag_value, axis=1)
If you truly want to find all rows that satisfy some set of conditions, the usual way to do this with a pandas dataframe is df.loc with boolean indexing (note the parentheses — & binds tighter than the comparisons):
only_a_cases = df.loc[(df.entity == 10) & (df.value != 1000) & (df.location == "CA")]
# or:
only_a_cases = df.loc[lambda df: (df.entity == 10) & (df.value != 1000) & (df.location == "CA")]