Pandas Groupby Count Partial Strings - python

I want to count how many rows in a column contain a partial string, based on an imported dataframe. In the sample data below, I want to group by Trans_type and then count how many rows contain a matching value.
So I would expect to see:
First, is this possible generically, without passing in a list of each type's expected brands? If not, how could I pass, say, Car a list such as .str.contains['Audi','BMW']?
Thanks for any help!

Try this one:
df.groupby([df["Trans_type"], df["Brand"].str.extract("([a-zA-Z]+)", expand=False)]).count()
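For the generic version asked about, here is a minimal sketch; the sample data below is hypothetical, with only the column names Trans_type and Brand taken from the question:

import pandas as pd

# Hypothetical sample data mirroring the question's column names.
df = pd.DataFrame({
    "Trans_type": ["Car", "Car", "Car", "Truck"],
    "Brand": ["Audi A4", "BMW X5", "Unknown", "Volvo FH"],
})

# Count, per Trans_type, how many Brand values contain any of the partial strings.
brands = ["Audi", "BMW"]
mask = df["Brand"].str.contains("|".join(brands), na=False)
print(df[mask].groupby("Trans_type")["Brand"].count())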

Related

ValueError: must supply a tuple to get_group with multiple grouping keys

Trying to find all URLs with response code 200 using pandas, grouping the dataframe.
Below is my code, which gives the error message shown:
ValueError: must supply a tuple to get_group with multiple grouping keys
url_response_grouped = log_df.groupby(['URL','ResponseCode'])
url_response_grouped.ngroups
url_response_grouped.groups.keys()
url_response_grouped.get_group('URL','200')
Well, now I see: you don't really need to use groupby() with two columns to see all the URLs with ResponseCode 200. You just have to do:
url_response_grouped = log_df.groupby('ResponseCode')
And then:
url_response_grouped.get_group(200)
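As a minimal, self-contained sketch of that single-key approach (the log data here is made up; only the column names come from the question):

import pandas as pd

# Hypothetical log data using the question's column names.
log_df = pd.DataFrame({
    "URL": ["/home", "/about", "/home"],
    "ResponseCode": [200, 404, 200],
})

url_response_grouped = log_df.groupby("ResponseCode")
print(url_response_grouped.get_group(200))  # all rows with ResponseCode == 200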
The following code won't work: with multiple grouping keys, get_group expects a single tuple of actual group values, and 'URL' here is the column name, not one of your URL values.
url_response_grouped.get_group('URL','200')
Although the problem is already solved, I still want to address the tuple issue. If you do want to group by multiple columns and get a specific group, do:
randomfiles.groupby(['name', 'age'])
randomfiles.get_group(('Alice', 13))
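A self-contained sketch of the tuple form, with made-up data standing in for randomfiles:

import pandas as pd

# Hypothetical data so the snippet above is runnable.
randomfiles = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "age": [13, 14, 13],
    "score": [90, 85, 70],
})

grouped = randomfiles.groupby(["name", "age"])
print(grouped.get_group(("Alice", 13)))  # the key must be a single tuple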
Hope it will help someone in the future!

how to access a row based on a condition in a grouped dataframe

I am new to Python and I want to access some rows of an already grouped dataframe (created with groupby).
However, I am unable to select the row I want and would like your help.
The code I used for the groupby is shown below:
language_conversion = house_ads.groupby(['date_served','language_preferred']).agg(
    {'user_id': 'nunique', 'converted': 'sum'})
language_conversion
The result is a dataframe with a (date_served, language_preferred) MultiIndex and user_id / converted columns.
For example, I want to access the number of Spanish-speaking users who received house ads using:
language_conversion[('user_id','Spanish')]
but this gives me KeyError: ('user_id', 'Spanish').
The same happens when I try to create a new column; it gives me the same error.
Thanks for your help
Use this:
language_conversion.loc[(slice(None), 'Arabic'), 'user_id']
You can see the indices (in this case, tuples of length 2) using language_conversion.index
You should use this:
language_conversion.loc[(slice(None), 'Spanish'), 'user_id']
slice(None) here includes all rows in the date index.
If you have one particular date in mind, just replace slice(None) with that specific date.
The error you are getting is because you accessed the columns before the index, which is not the correct order: with .loc on a MultiIndex you select the rows first, then the column. See the pandas indexing documentation to learn more.
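A minimal sketch of the whole pattern, assuming house_ads has the columns used in the question (the data itself is made up):

import pandas as pd

# Hypothetical data with the question's column names.
house_ads = pd.DataFrame({
    "date_served": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "language_preferred": ["Spanish", "English", "Spanish"],
    "user_id": ["u1", "u2", "u3"],
    "converted": [1, 0, 1],
})

language_conversion = house_ads.groupby(
    ["date_served", "language_preferred"]
).agg({"user_id": "nunique", "converted": "sum"})

# Select the rows first (all dates, Spanish only), then the column.
print(language_conversion.loc[(slice(None), "Spanish"), "user_id"])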

pandas cleaning 1+1 values in a column

I have a column that has the following data
column
------
1+1
2+3
4+5
How do I get pandas to sum these values so that the output is 2, 5, 9 instead of the above?
Many thanks
Your column obviously contains strings, so you must somehow evaluate them. Use the pd.eval function, e.g.
frame['column'].apply(pd.eval)
If you are interested in performance, consider an alternative method such as ast.literal_eval. Thanks to user Serge Ballesta for mentioning it.
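A small sketch with the question's data, showing both approaches; the column holds simple arithmetic strings, which both pd.eval and ast.literal_eval can handle:

import ast
import pandas as pd

frame = pd.DataFrame({"column": ["1+1", "2+3", "4+5"]})

# pd.eval evaluates each expression string.
print(frame["column"].apply(pd.eval))            # 2, 5, 9

# ast.literal_eval is a lighter-weight alternative for simple numeric expressions.
print(frame["column"].apply(ast.literal_eval))   # 2, 5, 9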

How to use DataFrame.isin without the constraint of having to match both index and value?

So, I have two files, one with 6 million entries and the other with around 5 million entries. I want to compare the values of a particular column in both dataframes. This is the code that I have used:
print(df1['Col1'].isin(df2['col3']).value_counts())
This is essential for me, as I want to see the number of True (same) and False (different) values. Around 95% of the entries come back as True, but some 5% come back as False. I extracted this data using to_csv and compared the columns with vimdiff, and they are all identical, so why is the code labelling them as False (different)? Is there a better, more foolproof method?
Note: I have checked for whitespace in the columns as well. There is no whitespace.
PS. The pandas isin documentation states that both index and value have to match. Since I have more entries in one file, the index does not match for those entries; how do I remove that constraint?
First, convert the column you pass to isin() to a list.
Then use it to filter your df1 dataframe, because you need to get the value counts on the same column you filtered.
From your example:
print(df1[df1['Col1'].isin(df2['col3'].values.tolist())]['Col1'].value_counts())
Try running that again.
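A minimal, self-contained version of that line, with toy stand-ins for the two files (the names df1, df2, Col1, and col3 come from the question). Note that Series.isin matches on values only, so the differing lengths are not a problem here:

import pandas as pd

# Hypothetical stand-ins for the 6M- and 5M-row files.
df1 = pd.DataFrame({"Col1": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"col3": ["a", "b", "c"]})

print(df1[df1["Col1"].isin(df2["col3"].values.tolist())]["Col1"].value_counts())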

Using a list to call pandas columns

I read in a CSV file
times = pd.read_csv("times.csv",header=0)
times.columns.values
The column names are in a list
titles=('case','num_gen','year')
The titles are much longer and more complex, but for simplicity's sake they are truncated here.
I want to access a column of times using an index into titles.
My attempt is:
times.titles[2][0]
This is to try to get the effect of:
times.year[0]
I need to do this because there are 75 columns that I need to access in a loop, so I cannot type out each column name as in the line above.
Any ideas on how to accomplish this?
I think you need to use .iloc; let's look at the pandas docs on selection by position:
times.iloc[0, 2]  # returns the first row of the third column (indexes are zero-based), i.e. the effect of times.year[0]
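To loop over many columns by name, as the question asks, a sketch along these lines may be closer to the goal (the dataframe here is a made-up stand-in for the CSV):

import pandas as pd

# Hypothetical stand-in for times.csv; the real file has 75 columns.
times = pd.DataFrame({"case": [1, 2], "num_gen": [10, 20], "year": [2020, 2021]})
titles = ["case", "num_gen", "year"]

print(times[titles[2]][0])  # same effect as times.year[0]

# Looping over all columns:
for name in titles:
    print(name, times[name].iloc[0])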
