Selection in dataframe with array as column value


I have a dataframe filled with twitter data. The columns are:
row_id : Int
content : String
mentions : [String]
value : Int
So for every tweet I have its row id in the dataframe, the content of the tweet, the mentions used in it (for example: '#foo') as an array of strings, and a value that I calculated based on the content of the tweet.
An example of a row would be:
row_id : 12
content : 'Game of Thrones was awful'
mentions : ['#hbo', '#tv', '#dissapointment', '#whatever']
value: -0.71
So what I need is a way to do the following 3 things:
find all rows that contain the mention '#foo' in the mentions-field
find all rows that ONLY contain the mention '#foo' in the mentions-field
above two but checking for an array of strings instead of checking for only one handle
If anyone could help me with this, or even just point me in the right direction, that'd be great.

Let's call your DataFrame df.
For the first task you use:
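result = df[(pd.DataFrame(df['mentions'].tolist(), index=df.index) == '#foo').any(axis=1)]
Here, pd.DataFrame(df['mentions'].tolist(), index=df.index) creates a new DataFrame in which each row is a tweet and each column holds one mention; shorter lists are padded with None. Passing index=df.index keeps the row labels aligned with df.
Then == '#foo' produces a boolean DataFrame that is True wherever a mention equals '#foo'.
Finally, .any(axis=1) returns a boolean Series whose elements are True if any element in the row is True; indexing df with it keeps those rows.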
I think with this help you can manage to solve the rest for yourself.
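For the remaining two tasks, here is a minimal sketch of one possible approach. It swaps the expanded-DataFrame technique above for plain set operations applied per row. The sample data and the helper names has_any and has_only are made up for illustration, and I am assuming that a tweet with no mentions should not count as "only '#foo'".

```python
import pandas as pd

# Made-up sample data in the shape described in the question.
df = pd.DataFrame({
    "row_id": [12, 13, 14, 15],
    "mentions": [["#hbo", "#tv"], ["#foo"], ["#foo", "#tv"], []],
})


def has_any(mentions, wanted):
    """True if at least one of the wanted mentions is in the list."""
    return not set(wanted).isdisjoint(mentions)


def has_only(mentions, wanted):
    """True if the list is non-empty and contains nothing outside wanted."""
    return len(mentions) > 0 and set(mentions) <= set(wanted)


# Task 1: rows that contain '#foo'.
contains_foo = df[df["mentions"].apply(lambda m: has_any(m, ["#foo"]))]

# Task 2: rows whose only mention is '#foo'.
only_foo = df[df["mentions"].apply(lambda m: has_only(m, ["#foo"]))]

# Task 3: the same two checks against a list of mentions.
wanted = ["#foo", "#tv"]
any_of_wanted = df[df["mentions"].apply(lambda m: has_any(m, wanted))]
only_from_wanted = df[df["mentions"].apply(lambda m: has_only(m, wanted))]
```

Because apply returns a Series with the same index as df, the boolean mask lines up with the rows without any extra index handling.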


Cannot match two values in two different csvs

I am parsing through two separate csv files with the goal of finding matching customerID's and dates to manipulate balance.
In my for loop, at some point there should be a match as I intentionally put duplicate ID's and dates in my csv. However, when parsing and attempting to match data, the matches aren't working properly even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])
for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])
    minBalance = 0
    maxBalance = 0
    endingBalance = 0
    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }
    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False
    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")
    else:
        print("world")
        accounts.loc[index] = dict
accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

Where the problem comes from
The problem comes from how you compare against a pandas Series. When you write:
customer_id in accounts['customerID']
you are checking whether customer_id is one of the index labels of the Series accounts['customerID'], whereas you want to check its values.
In your if statement you use the pd.Series.equals method. The documentation describes it as follows:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals compares a whole Series or DataFrame with another one. Called with a single value such as customer_id, it returns False, which is why that branch never runs.
One of many solutions
There are several ways to do this. The simplest is to take the values of the Series before doing the comparison:
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should look something like this:
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
Use the same check in your if statement:
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use pandas.Series.isin, which takes a list-like of values and returns a boolean Series showing whether each element of the Series matches one of them; you then only need to check whether that boolean Series contains at least one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
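One caveat worth illustrating: checking the two columns separately only tells you that the ID appears somewhere and the date appears somewhere, not that they appear in the same row. Below is a small sketch of a row-wise check using a combined boolean mask. The helper name has_account and the two-row accounts frame are invented for the example.

```python
import pandas as pd

# Invented accounts data: customer 1 has 12/2022, customer 2 has 01/2023.
accounts = pd.DataFrame({
    "customerID": [1, 2],
    "MM/YYYY": ["12/2022", "01/2023"],
})


def has_account(accounts, customer_id, date):
    """True if a single row matches both the customer ID and the date."""
    same_row = (accounts["customerID"] == customer_id) & (accounts["MM/YYYY"] == date)
    return bool(same_row.any())


# `in` on a Series tests the index labels (0 and 1 here), not the values.
label_check = 2 in accounts["customerID"]
value_check = 2 in accounts["customerID"].values

# Separate membership checks pass even though no row holds (1, "01/2023").
separate = (1 in accounts["customerID"].values) and ("01/2023" in accounts["MM/YYYY"].values)
joint = has_account(accounts, 1, "01/2023")
```

Here label_check is False and value_check is True, which reproduces the behaviour described above, while separate is True and joint is False. If the goal is one row per customer and month, the joint check is probably the one you want.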

It is not clear from the question what the formatter.convert_date function does, but from the example CSVs it seems it should do something like:
def convert_date(mmddyy):
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy
In addition, make sure the data types match on both sides of the comparison (both date fields should be strings, and the customer IDs should share one type).

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and searching for values within a row including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value using at() function
    # in 'name' column and i-1 row
    sel = df_selection_names.at[i, 'name']
    # this line finds the row 'sel' in df_everything
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be way off, but I'm feeling pretty confused about everything at this point!

https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
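To get from that filter to the printed sentences the question asks for, something like the following sketch could work. The contents of df_everything (including the placeholder suggestion values) are invented, since the question does not show them. Note that isin compares values only, so the repeated index label 0 in df_selection_names does not get in the way.

```python
import pandas as pd

# Selection frame as printed in the question (both rows carry index label 0).
df_selection_names = pd.DataFrame(
    {"name": ["fatty red meat, like prime rib", "grilled"]}, index=[0, 0]
)

# Invented stand-in for df_everything; the suggestion values are placeholders.
df_everything = pd.DataFrame({
    "name": ["grilled", "salad", "fatty red meat, like prime rib"],
    "suggestion": ["suggestion_a", "suggestion_b", "suggestion_c"],
})

# Keep only the rows whose name is one of the selected names.
matches = df_everything[df_everything["name"].isin(df_selection_names["name"])]

# Build one sentence per matching (name, suggestion) pair.
messages = [
    f"{suggestion} is suggested for {name}"
    for name, suggestion in zip(matches["name"], matches["suggestion"])
]

for message in messages:
    print(message)
```

This avoids the explicit loop over df_selection_names and the per-row .at lookup entirely.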

Create a column in dataframe with name of an existing array (initial 4 letters of array name)

I would like to create a dataframe column named after an array. For example, if the array is called "customer", the column should be called "cust_prop" (the first 4 letters of the array's name plus "_prop"). Is there any way to do this?

Your question is a bit unclear, but presuming you are asking how to turn the string "customer" into "cust_prop", that's easy enough:
Str = "customer"
NewStr = Str[0:4] + "_prop"
You might need some extra checking for shorter strings, but I don't know what behaviour you would want there.
If you mean something else, please post some code examples of what you have tried.

You didn't really describe from where you get an array name, so I'll just assume you have it in a variable:
array_name = 'customer'
To take only the first four characters and use them in the column name:
new_col_name = f'{array_name[0:4]}_prop'
df[new_col_name] = 1
Here I created a new column in the existing dataframe df and set every row to 1. Instead, you can create a Series holding any values you want:
series = pd.Series(name=new_col_name, data=array_customer)
Here I created a Series with the desired name, assuming you have an array_customer variable that holds the array.
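Putting the pieces together, a rough sketch might look like this. Python objects do not know the name of the variable that refers to them, so I am assuming the arrays are kept in a dict keyed by name. The helper prop_column_name and the sample arrays are invented for the example.

```python
import pandas as pd


def prop_column_name(array_name, prefix_len=4, suffix="_prop"):
    """Build a column name from the first prefix_len characters of array_name."""
    return f"{array_name[:prefix_len]}{suffix}"


# Invented arrays, stored under their names so the names are available as strings.
arrays = {"customer": [10, 20, 30], "id": [1, 2, 3]}

df = pd.DataFrame(index=range(3))
for name, values in arrays.items():
    df[prop_column_name(name)] = values

column_names = list(df.columns)
```

Slicing past the end of a short string simply returns the whole string, so "id" becomes "id_prop" here; add your own check if you want different behaviour for short names.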

Can someone help me understand what .index is doing in this code?

I have the following code:
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
I understand what the query is trying to do, but I'm lost as to why you need to add the ".index" portion.
What is .index doing in this particular code?
For context here is what the dataframe looks like:
I looked at the pandas documentation for DataFrame.index:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html
but unfortunately it was too vague for me to make sense of it.

DataFrame.index holds the row labels of your dataframe. With the default index every row gets its own label, even if two rows have the same data in every column. DataFrame.drop takes index labels (a single label or list-like) and drops the rows whose labels match.
So, in the code above:
df[df['Quantity'] == 0] gets the rows that have Quantity == 0,
df[df['Quantity'] == 0].index gets the index labels of all rows that satisfy the predicate,
df.drop(df[df['Quantity'] == 0].index) drops the rows with those labels, that is, the rows for which the predicate was True.
Hope this helps!

I checked the documentation for df.drop(). It says that it drops by index label. This code first selects the rows whose Quantity is 0, but because drop() works with index labels rather than rows, it then takes the labels of those rows with .index and passes them to drop(). That is what .index is for.
https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.DataFrame.drop.html
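A small sketch may make the steps easier to follow. The data is invented, and I have used string labels for the index so that the labels are easy to tell apart from row positions. The last line shows an equivalent filter that does not need .index at all.

```python
import pandas as pd

# Invented data; the index uses letters so labels stand out from positions.
df = pd.DataFrame(
    {"Quantity": [3, 0, 5, 0], "Weight": [8.0, 2.5, 4.0, 1.0]},
    index=["a", "b", "c", "d"],
)

# Step 1: the rows where Quantity is 0 (still a DataFrame).
zero_rows = df[df["Quantity"] == 0]

# Step 2: just the labels of those rows, which is what drop() expects.
labels = zero_rows.index

# Step 3: drop the rows with those labels, then rename the column.
dropped = df.drop(labels).rename(columns={"Weight": "Weight (oz.)"})

# Equivalent result by keeping the rows where the predicate is False.
filtered = df[df["Quantity"] != 0].rename(columns={"Weight": "Weight (oz.)"})
```

Passing zero_rows itself to drop() would not work as intended, because drop() looks for labels, not rows.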

Extract value from single row of pandas DataFrame

I have a dataset in a relational database format (linked by ID's over various .csv files).
I know that each data frame contains only one value of an ID, and I'd like to know the simplest way to extract values from that row.
What I'm doing now:
# the group has only one element
purchase_group = purchase_groups.get_group(user_id)
price = list(purchase_group['Column_name'])[0]
The third line is bothering me as it seems ugly, but I'm not sure what the alternative is. The grouping (I guess) assumes that there might be multiple values and returns a <class 'pandas.core.frame.DataFrame'> object, while I'd like just a row returned.

If you want just the value and not a df/series then call values and index the first element [0] so just:
price = purchase_group['Column_name'].values[0]
will work.

If purchase_group has a single row, then purchase_group = purchase_group.squeeze() turns it into a Series, so you can simply use purchase_group['Column_name'] to get your value.

Late to the party here, but purchase_group['Column_name'].item() is now available and is cleaner than some of the other solutions.

This method is intuitive; for example, to get the first row of values from the dataframe (the first list in a list of lists):
np.array(df)[0]
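For comparison, here are the scalar-returning options from the answers above side by side, as a sketch on invented data (the user_id, quantity, and price values are made up). I have also included .iloc[0], which is not mentioned above but is a common positional alternative.

```python
import pandas as pd

# Invented purchases table; each user_id appears exactly once.
purchases = pd.DataFrame({
    "user_id": [7, 8],
    "quantity": [2, 1],
    "Column_name": [19.99, 5.00],
})
purchase_groups = purchases.groupby("user_id")

# The group has only one row.
purchase_group = purchase_groups.get_group(7)

# First element of the underlying NumPy array.
via_values = purchase_group["Column_name"].values[0]

# First element by position.
via_iloc = purchase_group["Column_name"].iloc[0]

# The single element as a Python scalar.
via_item = purchase_group["Column_name"].item()

# Collapse the one-row frame to a Series, then look up the column.
via_squeeze = purchase_group.squeeze()["Column_name"]
```

One difference to keep in mind: .item() raises a ValueError when the Series does not hold exactly one element, whereas .values[0] and .iloc[0] silently return the first of several.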
