Create categories based on Partial Values Python - python

Hi I have a data frame as below:
response                    ticket
so service reset performed  123
reboot done                 343
restart performed           223
no value                    444
ticket created              765
I'm trying something like this:
import pandas as pd
df = pd.read_excel (r'C:\Users\Downloads\response.xlsx')
print (df)
count_other = 0
othersvocab = ['Service reset' , 'Reboot' , 'restart']
if df.response = othersvocab
{
count_other = count_other + 1
}
What I'm trying to do is get the count of how many have either of 'othersvocab' and how many don't.
I'm really new to Python, and I'm not sure how to do this.
Expected Output:
other ticketed
3 2
Can you help me figure it out, hopefully with what's happening in your code?

I'm writing this on my lunch break, so it is rushed. I don't like the nested "for other in others" loop I have, and there are better ways to do this with pandas DataFrame methods, but it will have to do.
import pandas as pd
df = pd.DataFrame({"response": ["so service reset performed", "reboot done",
                                "restart performed"],
                   "ticket": [123, 343, 223]})
others = ['service reset', 'reboot', 'restart']
count_other = 0
for row in df["response"].values:
    for other in others:
        if other in row:
            count_other += 1
            break  # count each row at most once, even if several terms match
First, note that for this approach to work, both the response column and the others list have to be lowercase. That's not very hard (look up pandas apply and the string method .lower).
What I have done here is loop first over the values in the response column.
Then, within that loop, I loop over the items in the others list.
Finally, I check whether any of those items appears in the response string.
I hope my rushed response gives a hand.
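Here is a minimal sketch of the same loop idea with the lowercasing applied and both counts produced. The helper name mentions_any is made up for illustration, and the data is the sample from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "response": ["so service reset performed", "reboot done",
                 "restart performed", "no value", "ticket created"],
    "ticket": [123, 343, 223, 444, 765],
})
othersvocab = ["Service reset", "Reboot", "restart"]

# Lowercase the vocabulary once so the comparison is case-insensitive.
others = [term.lower() for term in othersvocab]


def mentions_any(text, terms):
    """Hypothetical helper: True if any term occurs in the lowercased text."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Each row contributes at most 1, no matter how many terms it contains.
count_other = sum(mentions_any(r, others) for r in df["response"])
count_ticketed = len(df) - count_other
```

With the sample data, count_other covers the three rows that mention one of the terms and count_ticketed covers the remaining two.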

Consider below df:
In [744]: df = pd.DataFrame({'response':['so service reset performed', 'reboot done', 'restart performed', 'no value', 'ticket created'], 'ticket':[123, 343, 223, 444, 765]})
In [745]: df
Out[745]:
                     response  ticket
0  so service reset performed     123
1                 reboot done     343
2           restart performed     223
3                    no value     444
4              ticket created     765
Below is your othersvocab:
In [727]: othersvocab = ['Service reset' , 'Reboot' , 'restart']
# Converting all elements to lowercase
In [729]: othersvocab = [i.lower() for i in othersvocab]
Use Series.str.contains:
# Converting response column to lowercase
In [733]: df.response = df.response.str.lower()
In [740]: count_in_vocab = len(df[df.response.str.contains('|'.join(othersvocab))])
In [742]: count_others = len(df) - count_in_vocab
In [752]: res = pd.DataFrame({'other': [count_in_vocab], 'ticketed': [count_others]})
In [753]: res
Out[753]:
   other  ticketed
0      3         2
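A variation on the same idea, as a sketch only: str.contains accepts case=False, so neither the column nor the vocabulary has to be lowercased first. The re.escape call is my addition, in case a vocabulary term ever contains a regex metacharacter:

```python
import re

import pandas as pd

df = pd.DataFrame({
    "response": ["so service reset performed", "reboot done",
                 "restart performed", "no value", "ticket created"],
    "ticket": [123, 343, 223, 444, 765],
})
othersvocab = ["Service reset", "Reboot", "restart"]

# Build one alternation pattern; escape each term so it is matched literally.
pattern = "|".join(re.escape(term) for term in othersvocab)

# case=False makes the match case-insensitive; na=False treats missing
# responses as "no match" instead of propagating NaN into the mask.
is_other = df["response"].str.contains(pattern, case=False, na=False)

res = pd.DataFrame({
    "other": [int(is_other.sum())],
    "ticketed": [int((~is_other).sum())],
})
```

This leaves df.response untouched, which matters if the original casing is needed later.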

Related

How do you choose where to put information based on an index into a pandas DataFrame?

In MATLAB, the loop I create looks like:
header_names = {'InvoiceNo','Customer',...}
for i = 1:length(x)
entry(index+i,:) = [InvoiceNo, Customer,...]
end
% Create a table from the data.
fin_table = cell2table(entry,'VariableNames',header_names);
% Write to the finish file.
writetable(fin_table,finish);
With the table values and headers, I will end up getting something that looks like:
InvoiceNo  Customer
1000       Jimmy
1001       Bob
1002       Max
1003       April
1004       Tom
...        ...
I would like to know how to accomplish this in Python. My main question is how do I create the entry? How do I put a table in a for loop and ask it to print information on the next row for each iteration?
In Python, I currently have the following:
for i in range(len(item)):
entry = pd.DataFrame(
[InvoiceNo, Customer, invoice, due, terms, Location, memo, item[i], item[i], quan[i], rate[i], taxable,
tax_rate, invoice, email, Class],
columns=['InvoiceNo', 'Customer', 'InvoiceDate', 'DueDate', 'Terms', 'Location', 'Memo', 'Item',
'ItemDescription', 'ItemQuantity', 'ItemRate', 'ItemAmount', 'Taxable', 'TaxRate',
'ServiceDate', 'Email', 'Class'])
# Increment the index for entry values to be correct.
index += len(item)
Any help would be awesome!
Although I do not get your question completely, I will try to give you some tools that might be useful:
To read the input values you can use the following (put it inside a 'for' loop that runs once for each row you want to create):
new_InvoiceNo= input("Enter InvoiceNo:\n")
new_Customer= input("Enter Customer:\n")
new_invoice = input("Enter invoice:\n")
...
Then you can either append these values to the main DF as a new row (DataFrame.append was removed in pandas 2.0, so use pd.concat):
to_append = [new_InvoiceNo, new_Customer, new_invoice, ...]
new_row = pd.DataFrame([to_append], columns=df.columns)
df = pd.concat([df, new_row], ignore_index=True)
or you can use the '.loc' indexer:
to_append = [new_InvoiceNo, new_Customer, new_invoice, ...]
df_length = len(df)
df.loc[df_length] = to_append
Try to implement this in your code and report it back here.
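If the values come from existing lists rather than from input(), a common pattern is to collect one dict per row and build the DataFrame once at the end. This is only a sketch: the sample values below are invented, and only a few of the question's columns are shown:

```python
import pandas as pd

# Hypothetical sample values standing in for the question's variables.
invoice_no = 1000
customer = "Jimmy"
items = ["widget", "gadget"]
quantities = [2, 5]
rates = [9.5, 3.0]

rows = []
for item, quantity, rate in zip(items, quantities, rates):
    # One dict per output row; the keys become the column names.
    rows.append({
        "InvoiceNo": invoice_no,
        "Customer": customer,
        "Item": item,
        "ItemQuantity": quantity,
        "ItemRate": rate,
    })

# Build the table once, after the loop, instead of growing it row by row.
entry = pd.DataFrame(rows)
```

Building the frame once avoids the repeated copying that happens when a DataFrame is extended inside the loop.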

How to delete or drop lines into a dataframe using specific values in a column?

I'm using this code, but when I group to show results, I was expecting that No entry and Out of Business didn't appear but they do.
data2 = pd.DataFrame(data)
data2 = data2[(data2['results'] != 'No Entry') | (data2['results'] != 'Out of Business')]
data2.groupby('results').size().sort_values(ascending=False)
results
Pass 417
Pass w/ Conditions 233
Fail 192
No Entry 69
Out of Business 55
Not Ready 28
Thanks in advance.
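Simply use the following code to drop those rows from the DF:
df = df.loc[df['results'] != 'No Entry']
df = df.loc[df['results'] != 'Out of Business']
Your original filter keeps every row because it joins the two != tests with | (or): every value differs from at least one of the two strings, so the condition is always True. Applying the two filters one after the other, as above, is equivalent to joining them with & (and), so rows with either value are removed.
I hope this helps.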
Take care
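For reference, the same filter can be written in one step with isin and ~ (negation). A small sketch, assuming a made-up results column:

```python
import pandas as pd

# Invented sample data with the same kind of values as in the question.
data2 = pd.DataFrame({
    "results": ["Pass", "Fail", "No Entry", "Out of Business", "Pass", "Not Ready"],
})

excluded = ["No Entry", "Out of Business"]

# Keep only the rows whose value is NOT in the excluded list.
filtered = data2[~data2["results"].isin(excluded)]

counts = filtered.groupby("results").size().sort_values(ascending=False)
```

This scales better than chaining != tests when the list of values to drop grows.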

Pandas df conditionals: changing value name if pd.value_counts is less than something

I have this table with models df['model'] and
pd.value_counts(df2['model'].values, sort=True)
returns this:
'''
MONSTER 331
MULTISTRADA 134
HYPERMOTARD 69
SCRAMBLER 63
SUPERSPORT 31
...
900 1
T-MAX 1
FC 1
GTS 1
SCOUT 1
Length: 75, dtype: int64
'''
I want to rename all the values in df2['model'] that have count <5 into 'OTHER'.
Please can anyone help me, how to go about this?
First, get the list of categories you want to change to 'OTHER' with the first line of code. It takes the output of your function and selects the entries that meet the condition you want (in this case, fewer than 5 occurrences).
Then select the rows whose model cell is in that list of categories and change the value to 'OTHER'. Assigning through .loc avoids the chained-assignment warning.
other_classes = data['model'].value_counts()[data['model'].value_counts() < 5].index
data.loc[data['model'].isin(other_classes), 'model'] = 'OTHER'
Hope it helps
I suspect it is not at all elegant or pythonic, but this worked in the end:
df_pooled_other = df_final.assign(freq=df_final.groupby('model name')['model name'].transform('count'))\
.sort_values(by=['freq','model name', 'Age in months_x_x'],ascending=[False,True, True])
df_pooled_other['model name'] = np.where(df_pooled_other['freq'] <= 5, 'Other', df_pooled_other['model name'])
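Another way to sketch this, assuming a toy df2 with a model column (the data is invented): map each row to the frequency of its own value, then use Series.where to keep only the frequent ones.

```python
import pandas as pd

# Toy data: two frequent models and two rare ones.
df2 = pd.DataFrame({
    "model": ["MONSTER"] * 6 + ["MULTISTRADA"] * 5 + ["T-MAX", "GTS"],
})

# For every row, look up how often that row's model occurs overall.
freq = df2["model"].map(df2["model"].value_counts())

# Keep the model name where it occurs at least 5 times, else use 'OTHER'.
df2["model"] = df2["model"].where(freq >= 5, "OTHER")

result = df2["model"].value_counts()
```

The threshold matches the question's wording (counts below 5 become 'OTHER').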

remove duplicate word from pandas column

I have dataframe with information like below stored in one column
>>> Results.Category[:5]
0 issue delivery wrong master account
1 data wrong master account batch
2 order delivery wrong data account
3 issue delivery wrong master account
4 delivery wrong master account batch
Name: Category, dtype: object
Now I want to keep each word only once across the whole Category column.
For example:
In the first row the word "wrong" is present, so I want to remove it from all the remaining rows and keep "wrong" in the first row only.
In the second row the word "data" appears, so I want to remove it from all the remaining rows and keep "data" in the second row only.
I found that duplicates within a row can be removed using the code below, but I need to remove duplicate words across the column. Can anyone please help me here?
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))
It seems you want something like,
out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)
df['FinalCategoryN'] = out
df
                              Category                       FinalCategoryN
0  issue delivery wrong master account  issue delivery wrong master account
1      data wrong master account batch                           data batch
2    order delivery wrong data account                                order
3  issue delivery wrong master account
4  delivery wrong master account batch
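The same loop can be wrapped in a small reusable function. This is a sketch, and the name keep_first_occurrences is made up. Unlike the loop above, it updates the seen set word by word, so a word repeated inside a single row is also kept only once:

```python
import pandas as pd


def keep_first_occurrences(series):
    """Hypothetical helper: keep each word only where it first appears."""
    seen = set()
    out = []
    for text in series:
        kept = []
        for word in text.split():
            if word not in seen:
                seen.add(word)
                kept.append(word)  # preserves the original word order
        out.append(" ".join(kept))
    # Reuse the input index so the result aligns with the source column.
    return pd.Series(out, index=series.index)


df = pd.DataFrame({"Category": [
    "issue delivery wrong master account",
    "data wrong master account batch",
    "order delivery wrong data account",
    "issue delivery wrong master account",
    "delivery wrong master account batch",
]})
df["FinalCategoryN"] = keep_first_occurrences(df["Category"])
```

Rows whose words have all been seen earlier end up as empty strings.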
If you don't care about the ordering, you can use set logic:
u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')
0 account delivery issue master wrong
1 batch data
2 order
3
4
Name: Category, dtype: object
In your case you need to split the column first, then remove the duplicates with drop_duplicates:
df.c.str.split(expand=True).stack().drop_duplicates().\
groupby(level=0).apply(','.join).reindex(df.index)
Out[206]:
0 issue,delivery,wrong,master,account
1 data,batch
2 order
3 NaN
4 NaN
dtype: object
What you want cannot be vectorized, so let us just forget about pandas and use a Python set:
total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))
You get that list: ['wrong issue master delivery account', 'batch data', 'order', '', '']
You can use it to populate a dataframe column:
AFResults['FinalCategoryN'] = result
Use apply with sorted and set and str.join and list.index:
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))

How to add entries in Pandas DataFrame?

Basically I have census data of US that I have read in Pandas from a csv file.
Now I have to write a function that finds counties in a specific manner (not gonna explain that because that's not what the question is about) from the table I have gotten from csv file and return those counties.
MY TRY:
What I did is that I created lists with the names of the columns (that the function has to return), then applied the specific condition in the for loop using if-statement to read the entries of all required columns in their respective list. Now I created a new DataFrame and I want to read the entries from lists into this new DataFrame. I tried the same for loop to accomplish it, but all in vain, tried to make Series out of those lists and tried passing them as a parameter in the DataFrame, still all in vain, made DataFrames out of those lists and tried using append() function to concatenate them, but still all in vain. Any help would be appreciated.
CODE:
#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns = ('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2) and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014']) and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
        #idxl.append(census_df.index[idx])
        #st.append(census_df.iloc[idx]['STNAME'])
        #cty.append(census_df.iloc[idx]['CTYNAME'])
        cty_reg.index[idx2] = census_df.index[idx]
        cty_reg.iloc[idxl2]['STNAME'] = census_df.iloc[idx]['STNAME']
        cty_reg.iloc[idxl2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
        idx2 = idx2 + 1
cty_reg
CENSUS TABLE PIC:
SAMPLE TABLE:
REGION STNAME CTYNAME
0 2 "Wisconsin" "Washington County"
1 2 "Alabama" "Washington County"
2 1 "Texas" "Atauga County"
3 0 "California" "Washington County"
SAMPLE OUTPUT:
STNAME CTYNAME
0 Wisconsin Washington County
1 Alabama Washington County
I apologize for my limited knowledge of US states and counties; I just put random state and county names in the sample table to show what I want to get out of it. Thanks in advance for the help.
Some columns are missing from the source DF posted in the OP. However, reading the code, I don't think the loop is required at all. Three filters are needed: on REGION, POPESTIMATE2015 and CTYNAME. If I have understood the logic in the OP, this should be feasible without the loop.
Option 1 - original answer
print(df.loc[
    (df.REGION.isin([1, 2])) &
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) &
    (df.CTYNAME.str.startswith('Washington')),
    ['REGION', 'STNAME', 'CTYNAME']])
Option 2 - using and with pd.eval
q = pd.eval("(df.REGION.isin([1,2])) and \
(df.POPESTIMATE2015 > df.POPESTIMATE2014) and \
(df.CTYNAME.str.startswith('Washington'))", \
engine='python')
print(df.loc[q, ['REGION', 'STNAME', 'CTYNAME']])
Option 3 - using and with df.query
regions_list = [1,2]
dfq = df.query("(REGION == @regions_list) and \
(POPESTIMATE2015 > POPESTIMATE2014) and \
(CTYNAME.str.startswith('Washington'))", \
engine='python')
print(dfq[['REGION', 'STNAME', 'CTYNAME']])
If I'm reading the logic in your code right, you want to select rows according to the following conditions:
REGION should be 1 or 2
POPESTIMATE2015 > POPESTIMATE2014
CTYNAME needs to start with "Washington"
In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:
df = census_df[
    ((census_df.REGION == 1) | (census_df.REGION == 2)) &
    (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014) &
    (census_df.CTYNAME.str.startswith('Washington'))
]
Assuming you're selecting rows that satisfy some criterion, let's say there is a function select(row) that returns True if the row is selected and False if not. I won't guess what it does, because you specifically said it was not important.
And then you wanted the STNAME and CTYNAME of that row.
So here's what you would do:
your_new_df = census_df[census_df.apply(select, axis=1)]\
.apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)
This is the one liner that will get you what you wanted provided you wrote the select function that will pick the rows.
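To make the loop-free filtering concrete, here is a small runnable sketch on the sample table from the question. The population figures are invented, since the sample does not include those columns:

```python
import pandas as pd

census_df = pd.DataFrame({
    "REGION": [2, 2, 1, 0],
    "STNAME": ["Wisconsin", "Alabama", "Texas", "California"],
    "CTYNAME": ["Washington County", "Washington County",
                "Atauga County", "Washington County"],
    # Hypothetical population estimates, not taken from the question.
    "POPESTIMATE2014": [100, 200, 300, 400],
    "POPESTIMATE2015": [110, 210, 310, 410],
})

# Combine the three conditions into one boolean mask with & (element-wise and).
mask = (
    census_df["REGION"].isin([1, 2])
    & (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    & census_df["CTYNAME"].str.startswith("Washington")
)

# Select the matching rows and only the two requested columns in one step.
cty_reg = census_df.loc[mask, ["STNAME", "CTYNAME"]]
```

The result keeps the original index labels of the matching rows, which is what the loop in the question was trying to copy over by hand.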
