Searching a list of dataframes for a specific value

Searching a list of dataframes for a specific value - python

I scraped a bunch of tables of financial data using pandas.read_excel. I am trying to search through the list of dataframes and select only the ones that contain a certain value/string. Is it possible to do that? I had thought I could do something like:
search = [x.isin('string') for x in df_list]

You might want this (for each frame):
(df == 'foo').any()
That will return True if 'foo' is anywhere in the frame.

[x for x in df.isin('string').any().sum()]
check if word exist in each column and sum the boolean vales for each columns.
so, it will returm True if it exist at least in one of the columns.

Related

Cannot match two values in two different csvs

I am parsing through two separate csv files with the goal of finding matching customerID's and dates to manipulate balance.
In my for loop, at some point there should be a match as I intentionally put duplicate ID's and dates in my csv. However, when parsing and attempting to match data, the matches aren't working properly even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])
for index, row in transactions.iterrows():
customer_id = row['customerID']
date = formatter.convert_date(row['date'])
minBalance = 0
maxBalance = 0
endingBalance = 0
dict = {
"customerID": customer_id,
"MM/YYYY": date,
"minBalance": minBalance,
"maxBalance": maxBalance,
"endingBalance": endingBalance
}
print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
# Returns False
if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
# This section never runs
print("hello")
else:
print("world")
accounts.loc[index] = dict
accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

Where does the problem come from
Your Problem comes from the comparison you're doing with pandas Series, to make it simple, when you do :
customer_id in accounts['customerID']
You're checking if customer_id is an index of the Series accounts['customerID'], however, you want to check the value of the Series.
And in your if statement, you're using the pd.Series.equals method. Here is an explanation of what does the method do from the documentation
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals is used to compare between DataFrames and Series, which is different from what you're trying to do.
One of many solutions
There are multiple ways to achieve what you're trying to do, the easiest is simply to get the values from the series before doing the comparison :
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this :
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement :
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use the pandas.Series.isin function that given an element as input return a boolean Series showing whether each element in the Series matches the given input, then you will just need to check if the boolean Series contain one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html

It is not clear from the information what does formatter.convert_date function does. but from the example CSVs you added it seems like it should do something like:
def convert_date(mmddyy):
(mm,dd,yy) = mmddyy.split('/')
return mm + '/' + yy
in addition, make sure that data types are also equal
(both date fields are strings and also for customer id)

How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?

I have a bunch of keywords stored in a 620x2 pandas dataframe seen below. I think I need to treat each entry as its own set, where semicolons separate elements. So, we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset in these sets, in any order. Is there any straightforward way I can do this?

Use .loc to find if the keywords appear together.
Do this after you have split the data into 1240 sets. I don't understand whether you want to make new columns or just want to keep the columns as is.
# create a filter for keyword 1
filter_keyword_2 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the conditional data frame. You can use subset_df=df[df['column_name'].str.contains('string')] if you have only one condition.
To the column split or any other processing before you make the filters or run the filters again after processing.

Not sure if this is considered straightforward, but it works. keyword_list is the list of paired keywords you want to search.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(';\s*').apply(set)
df.apply(lambda x : x.apply(lambda y : all([kw in y for kw in keyword_list]))).sum().sum()

comparing two columns of a row in python dataframe

I know that one can compare a whole column of a dataframe and making a list out of all rows that contain a certain value with:
values = parsedData[parsedData['column'] == valueToCompare]
But is there a possibility to make a list out of all rows, by comparing two columns with values like:
values = parsedData[parsedData['column01'] == valueToCompare01 and parsedData['column02'] == valueToCompare02]
Thank you!

It is completely possible, but I have never tried using and in order to mask the dataframe, rather using & would be of interest in this case. Note that, if you want your code to be more clear, use ( ) in each statement:
values = parsedData[(parsedData['column01'] == valueToCompare01) & (parsedData['column02'] == valueToCompare02)]

Can you filter a pandas dataframe based on a sum or count or multiple variables?

I'm trying to filter a Pandas dataframe based on a set of or conditions, but they're all very similar, and I'm wondering if there's a more efficient way to write this.
Specifically, I want to include rows from the dataframe (df) where any of a set of variables is 1:
df.query("Q50r5==1 or Q50r6==1 or Q50r7==1 or Q50r8==1 or Q50r9==1 or Q50r10==1 or Q50r11==1")
This filters correctly to rows where any of these variables is 1.
However, I expect to have a lot more situations where I need to filter my dataframe to something similar, e.g.:
df.query("Q20r1==1 or Q20r2==1 or Q20r3==1")
df.query("Q23r2==1 or Q23r5==1 or Q23r7==1 or Q23r8==1")
The pandas documentation on .query() doesn't specify any more than that you can use and and or like you can elsewhere in Python, so it's possible this is the only way to do this query, but is there some kind of sum or count I could do across these columns within the query? Something like "any(1,Q20r1,Q20r2,Q20r3)" or "sum(Q20r1,Q20r2,Q20r3)>0"?
EDIT: For example, using this small dataframe:
I would want to retrieve ID #s 1,2,4,5,7 and exclude #s 3 and 6, because 3 and 6 do not have any 1's across the columns I'm referring to.

You can use any with axis = 1 to check that at least one value is True in a row.
For example, you can run
df[(df[["Q20r1", "Q20r2", "Q20r3"]] == 1).any(axis = 1)]

Python Join two dataframes on columns meeting a condition

Say I have two dataframes A and B, each containing two columns called x and y.
I want to join these two dataframes but not on rows on which the x and y columns are equal across the two dataframes, but on rows where A's x columns is a substring of B's x column and same for y.
For example
if A[x][1]='mpla' and B[x][1]='mplampla'
I would want that to be captured.
On sql it would be something like:
select *
from A
join B
on A.x<=B.x and A.y<=B.y.
Can something like this be done on python?

You can match a single string at a time against all the strings in one column, like this:
import numpy.core.defchararray as ca
ca.find(B.x.values.astype(str), 'mpla') >= 0
The problem with that is you'll have to loop over all elements of A. But if you can afford that, it should work.
See also: pandas + dataframe - select by partial string

you could try something like
B.x.where(B.x.str.contains(A.x), B.index, axis=index) #this would give you the ones that don't match
B.x.where(B.x.str.match(A.x, as_indexer=True), B.index, axis=index) #this would also give you the one's that don't match. You could see if you can use the "^" operator used for regex to get the ones that match.
You could also maybe try
np.where(B.x.str.contains(A.x), B.index, np.nan)
also you can try:
matchingmask = B[B.x.str.contains(A.x)]
matchingframe = B.ix[matchingmask.index] #or
matchingcolumn = B.ix[matchingmask.index].x #or
matchingindex = B.ix[matchingmask.index].index
All of these assume you have the same index on both frames (I think)
You want to look at the string methods: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
you want to read up on regex and pandas where method: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Searching a list of dataframes for a specific value - python

I scraped a bunch of tables of financial data using pandas.read_excel. I am trying to search through the list of dataframes and select only the ones that contain a certain value/string. Is it possible to do that? I had thought I could do something like: search = [x.isin('string') for x in df_list]

You might want this (for each frame): (df == 'foo').any() That will return True if 'foo' is anywhere in the frame.

[x for x in df.isin('string').any().sum()] check if word exist in each column and sum the boolean vales for each columns. so, it will returm True if it exist at least in one of the columns.

Related

Cannot match two values in two different csvs

How Do I Count The Number of Times a Subset of Words Appear In My Pandas Dataframe?

comparing two columns of a row in python dataframe

Can you filter a pandas dataframe based on a sum or count or multiple variables?

Python Join two dataframes on columns meeting a condition

Categories

Resources